Understanding CLIP embeddings

A quick overview of OpenAI's CLIP embeddings.

Summary


Approach

The paper presents the CLIP training algorithm as short pseudocode, alongside a diagram of training and inference.
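The heart of the training algorithm is a symmetric contrastive loss over a batch of N image-text pairs: embeddings from both encoders are L2-normalized, all pairwise cosine similarities are computed, and a cross-entropy loss is applied along both the image and text axes with the matching pair as the correct class. A minimal numpy sketch of that loss, assuming the encoders have already produced embeddings (the names `image_emb`, `text_emb`, and the fixed `temperature` are illustrative; the paper learns the temperature):

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize each embedding so dot products are cosine similarities
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # [N, N] similarity matrix; entry (i, j) compares image i with text j
    logits = I @ T.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the matching pair sits on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # symmetric loss: image-to-text (rows) and text-to-image (columns)
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

With perfectly aligned, mutually orthogonal embeddings and a low temperature, the loss approaches zero; mismatched batches score higher.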

Proposed Dataset

The paper doesn't seem to give any details on how this dataset was constructed. It only mentions that the authors "searched" for image-text pairs whose text includes an item from a 500K-entry query list, keeping up to 20K image-text pairs per query. It is unclear what "searching" means here: what source was searched, and how the matching was done.
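Since the paper leaves the matching procedure unspecified, here is a naive sketch of one plausible reading: keep a pair if its caption contains a query as a substring, capped at 20K pairs per query. All names here (`build_dataset`, `pairs`, `queries`) are hypothetical, not from the paper:

```python
from collections import defaultdict

def build_dataset(pairs, queries, cap=20_000):
    """pairs: iterable of (image_id, caption); queries: set of query strings.

    Returns a dict mapping each query to at most `cap` matching pairs.
    This is a guess at the paper's procedure, not a documented method.
    """
    kept = defaultdict(list)
    for image_id, caption in pairs:
        lowered = caption.lower()
        for q in queries:
            # naive substring match; the paper may have used tokenized
            # matching, a search index, or something else entirely
            if q in lowered and len(kept[q]) < cap:
                kept[q].append((image_id, caption))
    return kept
```

The per-query cap balances the dataset so that very common queries do not dominate the training distribution.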

The base query list (presumably the 500K-entry list above) consists of all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bigrams that have high pointwise mutual information, as well as the names of all Wikipedia articles above a certain search volume. Finally, all WordNet synsets not already in the query list are added.
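Pointwise mutual information (PMI) scores a bigram by how much more often its two words co-occur than chance would predict: PMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))). A self-contained sketch of scoring bigrams this way (the function name and smoothing-free formulation are mine, not the paper's):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=1):
    """Score each adjacent bigram in `tokens` by pointwise mutual information."""
    uni = Counter(tokens)                       # unigram counts
    bi = Counter(zip(tokens, tokens[1:]))       # adjacent bigram counts
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), c in bi.items():
        if c < min_count:
            continue
        p_joint = c / n_bi
        p1 = uni[w1] / n_uni
        p2 = uni[w2] / n_uni
        # high PMI means the pair co-occurs far more than chance predicts
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))
    return scores
```

On a large corpus, strongly associated pairs such as "new york" score well above incidental adjacencies, which is why PMI is a common filter for promoting multi-word terms into a vocabulary.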