CLIP
E95184
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
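A minimal sketch of the retrieval idea, assuming image and text embeddings have already been produced by the model's two encoders; the tensor contents and dimensions below are illustrative placeholders, not values from this record:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these come from CLIP's image
# encoder (for the indexed images) and text encoder (for the query).
image_embeddings = torch.randn(1000, 512)  # one row per indexed image
query_embedding = torch.randn(512)         # embedding of the text query

# CLIP compares embeddings by cosine similarity: the dot product
# of L2-normalized vectors in the shared embedding space.
image_embeddings = F.normalize(image_embeddings, dim=-1)
query_embedding = F.normalize(query_embedding, dim=-1)

similarities = image_embeddings @ query_embedding  # shape (1000,)
top_matches = similarities.topk(5).indices         # 5 best images for the query
```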
Aliases (2)
Statements (56)
| Predicate | Object |
|---|---|
| instanceOf | contrastive learning model → multimodal machine learning model → vision-language model |
| architectureComponent | image encoder → text encoder |
| capability | cross-modal retrieval → learning from natural language supervision → open-vocabulary recognition → prompt-based classification → zero-shot transfer to downstream vision tasks |
| developer | OpenAI |
| fullName | Contrastive Language–Image Pre-training |
| imageEncoderType | ResNet → Vision Transformer |
| input | image → natural language text prompt |
| inspired | subsequent vision-language models |
| introducedBy | Aditya Ramesh → Alec Radford → Amanda Askell → Chris Hallacy → Gabriel Goh → Girish Sastry → Gretchen Krueger → Ilya Sutskever → Jack Clark → Jong Wook Kim → Pamela Mishkin → Sandhini Agarwal |
| learningParadigm | contrastive learning → self-supervised learning |
| license | OpenAI model license |
| lossFunction | InfoNCE-style loss → contrastive loss |
| modality | image → text |
| organization | OpenAI |
| output | joint embedding vectors for images and text → similarity scores between images and text |
| pretrainingDataType | image-text pairs |
| property | aligns image and text embeddings in a shared space → does not require task-specific fine-tuning for many tasks → uses cosine similarity in embedding space |
| publicationTitle | Learning Transferable Visual Models From Natural Language Supervision |
| publicationType | arXiv preprint |
| publicationYear | 2021 |
| task | image representation learning → image-text matching → natural language-based image retrieval → text representation learning → zero-shot image classification |
| textEncoderType | Transformer |
| trainingObjective | maximize similarity of matching image-text pairs → minimize similarity of non-matching image-text pairs (see the loss sketch after this table) |
| usedFor | as a backbone in multimodal systems → downstream fine-tuning for vision tasks |
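The trainingObjective and lossFunction statements above describe a symmetric InfoNCE-style loss: within a batch of matched image-text pairs, the similarity of each matching pair is pushed up while the similarities of all mismatched pairings are pushed down. A minimal sketch, assuming L2-normalized batch embeddings and a fixed temperature (the paper learns the temperature during training):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss.

    Row i of image_emb and text_emb correspond to the same
    image-text pair; both tensors have shape (batch, dim).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```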
Referenced by (4)
| Subject (surface form when different) | Predicate |
|---|---|
| CLIP ("Contrastive Language–Image Pre-training") | fullName |
| CLIP ("Learning Transferable Visual Models From Natural Language Supervision") | publicationTitle |
| DALL·E | relatedTo |
| Hugging Face Transformers | supportsModelType |
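Since Hugging Face Transformers supports this model type, zero-shot image classification can be run as below; the checkpoint name is one of OpenAI's published CLIP checkpoints, and the image path and candidate labels are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # any CLIP checkpoint works
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")                  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]  # prompt-based classification

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over
# the candidate prompts yields zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```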