CLIP
E95184
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
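A minimal sketch of the retrieval idea, assuming image and text embeddings have already been produced by the model's two encoders; the tensor contents and dimensions below are illustrative placeholders, not values from this record:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these come from CLIP's image
# encoder (for the indexed images) and text encoder (for the query).
image_embeddings = torch.randn(1000, 512)  # one row per indexed image
query_embedding = torch.randn(512)         # embedding of the text query

# CLIP compares embeddings by cosine similarity: the dot product
# of L2-normalized vectors in the shared embedding space.
image_embeddings = F.normalize(image_embeddings, dim=-1)
query_embedding = F.normalize(query_embedding, dim=-1)

similarities = image_embeddings @ query_embedding  # shape (1000,)
top_matches = similarities.topk(5).indices         # 5 best images for the query
```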
Aliases (2)
Statements (56)
| Predicate | Object |
|---|---|
| instanceOf | contrastive learning model → multimodal machine learning model → vision-language model |
| architectureComponent | image encoder → text encoder |
| capability | cross-modal retrieval → learning from natural language supervision → open-vocabulary recognition → prompt-based classification → zero-shot transfer to downstream vision tasks |
| developer | OpenAI |
| fullName | Contrastive Language–Image Pre-training |
| imageEncoderType | ResNet → Vision Transformer |
| input | image → natural language text prompt |
| inspired | subsequent vision-language models |
| introducedBy | Aditya Ramesh → Alec Radford → Amanda Askell → Chris Hallacy → Gabriel Goh → Girish Sastry → Gretchen Krueger → Ilya Sutskever → Jack Clark → Jong Wook Kim → Pamela Mishkin → Sandhini Agarwal |
| learningParadigm | contrastive learning → self-supervised learning |
| license | OpenAI model license |
| lossFunction | InfoNCE-style loss → contrastive loss |
| modality | image → text |
| organization | OpenAI |
| output | joint embedding vectors for images and text → similarity scores between images and text |
| pretrainingDataType | image-text pairs |
| property | aligns image and text embeddings in a shared space → does not require task-specific fine-tuning for many tasks → uses cosine similarity in embedding space |
| publicationTitle | Learning Transferable Visual Models From Natural Language Supervision |
| publicationType | arXiv preprint |
| publicationYear | 2021 |
| task | image representation learning → image-text matching → natural language-based image retrieval → text representation learning → zero-shot image classification |
| textEncoderType | Transformer |
| trainingObjective | maximize similarity of matching image-text pairs → minimize similarity of non-matching image-text pairs (see the loss sketch after this table) |
| usedFor | as a backbone in multimodal systems → downstream fine-tuning for vision tasks |
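The trainingObjective and lossFunction statements above describe a symmetric InfoNCE-style loss: within a batch of matched image-text pairs, the similarity of each matching pair is pushed up while the similarities of all mismatched pairings are pushed down. A minimal sketch, assuming L2-normalized batch embeddings and a fixed temperature (the paper learns the temperature during training):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss.

    Row i of image_emb and text_emb correspond to the same
    image-text pair; both tensors have shape (batch, dim).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```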
Referenced by (4)
| Subject (surface form when different) | Predicate |
|---|---|
| CLIP ("Contrastive Language–Image Pre-training") | fullName |
| CLIP ("Learning Transferable Visual Models From Natural Language Supervision") | publicationTitle |
| DALL·E | relatedTo |
| Hugging Face Transformers | supportsModelType |
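Since Hugging Face Transformers supports this model type, zero-shot image classification can be run as below; the checkpoint name is one of OpenAI's published CLIP checkpoints, and the image path and candidate labels are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # any CLIP checkpoint works
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")                  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]  # prompt-based classification

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over
# the candidate prompts yields zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```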