ViT
E435871
ViT (Vision Transformer) is a deep learning model architecture that applies the transformer framework to image recognition tasks by treating images as sequences of patches.
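As an illustration of "treating images as sequences of patches", the following minimal sketch splits an image into non-overlapping square patches and flattens each one into a vector, giving a token sequence a transformer could consume. The function name `image_to_patches`, the 224x224 input size, and the 16-pixel patch size are assumptions chosen for the example and are not taken from this entry. A full model would also apply a learned linear projection and add position embeddings, both omitted here.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (hypothetical helper for illustration).
    """
    h, w, c = image.shape
    if h % patch_size != 0 or w % patch_size != 0:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Separate the patch-grid axes from the within-patch pixel axes.
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Reorder to (grid_h, grid_w, patch_size, patch_size, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each patch flattened into a single vector.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Assumed example sizes: a 224x224 RGB image cut into 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows corresponds to one patch, so the image becomes a sequence of length 196 with 768 values per element.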
Aliases (1)
Referenced by (2)
| Subject (surface form when different) | Predicate |
|---|---|
| CLIP ("Vision Transformer") → | imageEncoderType |
| Hugging Face Transformers → | supportsModelType |