CLIP

E95184

CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks such as zero-shot image classification and natural-language image retrieval.
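Zero-shot classification with CLIP works by embedding the image and a set of class prompts into the same space, then picking the prompt with the highest cosine similarity. A minimal pure-Python sketch of that comparison step; the 3-d toy vectors and prompt strings are placeholders standing in for real CLIP embeddings (which are typically 512-d or larger):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the index of the prompt embedding closest to the image embedding."""
    scores = [cosine(image_emb, p) for p in prompt_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings: the image vector points mostly along the "dog" prompt's axis.
image = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
labels = list(prompts)
best = zero_shot_classify(image, list(prompts.values()))
print(labels[best])  # → a photo of a dog
```

The prompt template ("a photo of a …") matters in practice: CLIP's zero-shot accuracy depends on phrasing the class names as natural-language captions.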


Statements (56)
Predicate: Object
instanceOf: contrastive learning model
    multimodal machine learning model
    vision-language model
architectureComponent: image encoder
    text encoder
capability: cross-modal retrieval
    learning from natural language supervision
    open-vocabulary recognition
    prompt-based classification
    zero-shot transfer to downstream vision tasks
developer: OpenAI
fullName: Contrastive Language–Image Pre-training
imageEncoderType: ResNet
    Vision Transformer
input: image
    natural language text prompt
inspired: subsequent vision-language models
introducedBy: Aditya Ramesh
    Alec Radford
    Amanda Askell
    Chris Hallacy
    Gabriel Goh
    Girish Sastry
    Gretchen Krueger
    Ilya Sutskever
    Jack Clark
    Jong Wook Kim
    Pamela Mishkin
    Sandhini Agarwal
learningParadigm: contrastive learning
    self-supervised learning
license: OpenAI model license
lossFunction: InfoNCE-style loss
    contrastive loss
modality: image
    text
organization: OpenAI
output: joint embedding vectors for images and text
    similarity scores between images and text
pretrainingDataType: image-text pairs
property: aligns image and text embeddings in a shared space
    does not require task-specific fine-tuning for many tasks
    uses cosine similarity in embedding space
publicationTitle: Learning Transferable Visual Models From Natural Language Supervision
publicationType: arXiv preprint
publicationYear: 2021
task: image representation learning
    image-text matching
    natural language-based image retrieval
    text representation learning
    zero-shot image classification
textEncoderType: Transformer
trainingObjective: maximize similarity of matching image-text pairs
    minimize similarity of non-matching image-text pairs
usedFor: as a backbone in multimodal systems
    downstream fine-tuning for vision tasks
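The training objective above (maximize similarity of matching image-text pairs, minimize similarity of non-matching pairs) is an InfoNCE-style contrastive loss applied symmetrically in both the image-to-text and text-to-image directions. A minimal pure-Python sketch under simplifying assumptions, not OpenAI's implementation: the similarity matrix is assumed precomputed, and a fixed temperature of 0.07 stands in for CLIP's learned temperature parameter (0.07 is the paper's initialization):

```python
import math

def clip_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss over an N x N cosine-similarity matrix.

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the matching (positive) pairs, everything else is a negative.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        # Cross-entropy with the correct class on the diagonal,
        # computed via a numerically stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))
```

With a well-aligned batch (high diagonal, low off-diagonal similarities) the loss approaches zero; a shuffled batch yields a much larger loss, which is what drives matching image-text embeddings together and non-matching ones apart.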

Referenced by (4)
Subject (surface form when different): Predicate
CLIP ("Contrastive Language–Image Pre-training"): fullName
CLIP ("Learning Transferable Visual Models From Natural Language Supervision"): publicationTitle
DALL·E: relatedTo
Hugging Face Transformers: supportsModelType