A quick overview of OpenAI's CLIP embeddings
Paper from OpenAI, published at ICML 2021
Trained with a symmetric contrastive loss: within each batch, the matching (image, text) pairs are pulled together while all mismatched pairings are pushed apart (see the sketch below)
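A minimal PyTorch sketch of that loss, following the numpy-style pseudocode in the paper. The fixed temperature here is an assumption (the paper learns it as a parameter), and the image/text encoders that produce the embeddings are left out.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: [batch, dim] embeddings of paired images and captions."""
    # L2-normalise so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; the diagonal holds the true pairs
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2
```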
The paper doesn’t seem to give many details on how this dataset was constructed. It simply mentions that they “searched” for (image, text) pairs whose text includes one of a set of 500,000 queries.
The base query list (this appears to be the 500K-query list) is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information, as well as the names of all Wikipedia articles above a certain search volume. Finally, all WordNet synsets not already in the query list are added. A rough sketch of that recipe follows.
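A rough Python sketch of the recipe, under heavy assumptions: the tokenised corpus, the PMI threshold, and the use of NLTK's WordNet are illustrative choices, and the Wikipedia article-name step is only stubbed out since it needs search-volume data the paper doesn't share.

```python
from collections import Counter
from math import log

from nltk.corpus import wordnet  # assumes nltk.download("wordnet") has been run

def build_query_list(tokens, min_count=100, pmi_threshold=5.0):
    """tokens: a flat list of lower-cased words from the corpus (e.g. English Wikipedia)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    # 1. All words occurring at least `min_count` times
    queries = {w for w, c in unigrams.items() if c >= min_count}

    # 2. Bi-grams with high pointwise mutual information
    for (a, b), c in bigrams.items():
        pmi = log(c * total / (unigrams[a] * unigrams[b]))
        if pmi >= pmi_threshold:
            queries.add(f"{a} {b}")

    # 3. Names of Wikipedia articles above a search-volume cutoff would be
    #    merged in here; omitted because it needs external pageview/search data.

    # 4. All WordNet synsets not already in the query list
    for synset in wordnet.all_synsets():
        queries.add(synset.lemmas()[0].name().replace("_", " "))

    return queries
```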