ViT

E435871 UNEXPLORED

ViT (Vision Transformer) is a deep learning model architecture that applies the transformer framework to image recognition tasks by treating images as sequences of patches.

Aliases (1)

Referenced by (2)
Subject (surface form when different)    Predicate
CLIP ("Vision Transformer")              imageEncoderType
Hugging Face Transformers                supportsModelType
