ViT

E435871

deep learning model, image recognition model, vision transformer architecture

ViT (Vision Transformer) is a deep learning model architecture that applies the transformer framework to image recognition tasks by treating images as sequences of patches.
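As a rough illustration of "treating images as sequences of patches", the sketch below splits an image array into non-overlapping square patches and flattens each into a vector. The function name `image_to_patch_sequence`, the 224x224 RGB input, and the use of NumPy are assumptions made for this example; the 16x16 patch size matches the typical value listed in the statements for this entity.

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster order (left to right, top to bottom).
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Hypothetical input: a 224x224 RGB image filled with placeholder values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
sequence = image_to_patch_sequence(image)
print(sequence.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `sequence` plays the role of one "word" in the input sequence; a real model would additionally project every row through a learned linear layer before the transformer blocks.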


All labels observed (2)

Label	Occurrences
Vision Transformer	2
ViT (canonical)	1

How this entity was disambiguated

This entity first appeared as the object of triple T4389196; resolving that mention is where its identity was fixed. The disambiguator weighed the candidate entities listed below and picked the one marked "chosen" (or "None of the above", which mints a new entity). This is how homonymy is resolved: the same surface form can point to different entities.

NED1: Entity disambiguation (via context triple), gpt-5-mini-2025-08-07

Target entity: ViT
Context triple: [Hugging Face Transformers, supportsModelType, ViT]

A. PWSFTviT
PWSFTviT is the renowned Łódź Film School in Poland, one of Europe’s leading film and television academies known for training many acclaimed filmmakers.
B. VGG
VGG is a deep convolutional neural network architecture known for its simple, uniform use of small 3×3 filters and great depth, which achieved strong performance in image recognition tasks.
C. ResNet
ResNet is a deep convolutional neural network architecture known for its use of residual connections to enable very deep models and achieve state-of-the-art performance in image recognition tasks.
D. CLIP
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
E. ResNeXt
ResNeXt is a deep convolutional neural network architecture that extends ResNet by using grouped convolutions and a split-transform-merge strategy to improve accuracy and efficiency in image recognition tasks.
F. None of the above. (chosen)
G. Unsure - the case is ambiguous/there is not enough information to decide.

NED2: Entity disambiguation (via description), gpt-5-mini-2025-08-07

Target entity: ViT
Target entity description: ViT (Vision Transformer) is a deep learning model architecture that applies the transformer framework to image recognition tasks by treating images as sequences of patches.

A. PWSFTviT
PWSFTviT is the renowned Łódź Film School in Poland, one of Europe’s leading film and television academies known for training many acclaimed filmmakers.
B. VGG
VGG is a deep convolutional neural network architecture known for its simple, uniform use of small 3×3 filters and great depth, which achieved strong performance in image recognition tasks.
C. ResNet
ResNet is a deep convolutional neural network architecture known for its use of residual connections to enable very deep models and achieve state-of-the-art performance in image recognition tasks.
D. CLIP
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
E. ResNeXt
ResNeXt is a deep convolutional neural network architecture that extends ResNet by using grouped convolutions and a split-transform-merge strategy to improve accuracy and efficiency in image recognition tasks.
F. None of the above. (chosen)

Statements (50)

Predicate	Object
instanceOf	deep learning model; image recognition model; vision transformer architecture
advantage	global receptive field from early layers; scales well with model and data size
basedOn	Transformer architecture
comparedWith	convolutional neural networks
developedAt	Google Brain; Google Research
fullName	Vision Transformer
hasVariant	DeiT; Swin Transformer; ViT-B; ViT-H; ViT-L
implementedIn	PyTorch; TensorFlow
inputRepresentation	image patches
introducedBy	Alexander Kolesnikov; Alexey Dosovitskiy; Dirk Weissenborn; Georg Heigold; Jakob Uszkoreit; Lucas Beyer; Matthias Minderer; Mostafa Dehghani; Neil Houlsby; Sylvain Gelly; Thomas Unterthiner; Xiaohua Zhai
introducedInPaper	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
limitation	data-hungry compared to CNNs
openSourceImplementation	official Google Research repository; timm library
patchSizeTypical	16x16 pixels
performsWellOn	ImageNet; ImageNet-21k; JFT-300M
pretrainingStrategy	self-supervised pretraining (e.g., DINO, MAE, etc.); supervised pretraining on large datasets
publicationYear	2020
requires	large-scale training data
task	image classification; image recognition
treatsImageAs	sequence of patches
uses	MLP blocks; layer normalization; multi-head self-attention; position embeddings; self-attention mechanism
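
The "uses" and "advantage" statements above can be made concrete with a small sketch: patch vectors are linearly embedded, position embeddings are added, and self-attention lets every patch token attend to every other one within a single layer. This is a simplified, single-head NumPy illustration with random weights, not the reference implementation; the sizes, the variable names, and the `self_attention` helper are assumptions, and the class token, layer normalization, MLP blocks, and multiple heads are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): 196 patches of 16 * 16 * 3 values, small model width.
num_patches, patch_dim, model_dim = 196, 768, 64

patches = rng.normal(size=(num_patches, patch_dim))

# Linear patch embedding plus position embeddings (learned in a real model, random here).
w_embed = rng.normal(size=(patch_dim, model_dim)) / np.sqrt(patch_dim)
pos_embed = rng.normal(size=(num_patches, model_dim)) * 0.02
tokens = patches @ w_embed + pos_embed

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
    return weights @ v, weights

w_q, w_k, w_v = (rng.normal(size=(model_dim, model_dim)) / np.sqrt(model_dim) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

The attention weight matrix has one row per patch and one column per patch, and every entry is non-zero, which is one way to read the "global receptive field from early layers" statement: no patch is restricted to a local neighborhood, in contrast to a small convolution kernel.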

How these facts were elicited

Referenced by (3)

Full triples — surface form annotated when it differs from this entity's canonical label.

Hugging Face Transformers → supportsModelType → ViT

CLIP → imageEncoderType → ViT

this entity surface form: Vision Transformer

Transformer → foundationFor → ViT