VisionEncoderDecoderModel
E435885
UNEXPLORED
VisionEncoderDecoderModel is a model class in the Hugging Face Transformers library that pairs a Transformer-based vision encoder with an autoregressive text decoder for image-to-text tasks such as image captioning and visual question answering.
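The encoder-decoder pairing described above can be illustrated with a minimal sketch. It assumes `torch` and `transformers` are installed, and it builds a tiny, randomly initialised ViT encoder and GPT-2 decoder from configs rather than loading a real checkpoint. All sizes (image size, hidden size, vocabulary size) are illustrative assumptions, so the outputs are meaningless beyond their shapes.

```python
import torch
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny illustrative configs (assumed sizes, not taken from any published checkpoint).
encoder_config = ViTConfig(
    image_size=32,
    patch_size=8,
    num_channels=3,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
decoder_config = GPT2Config(
    vocab_size=100,
    n_positions=32,
    n_embd=32,
    n_layer=2,
    n_head=2,
)

# Combine the two configs; this marks the decoder as a decoder with cross-attention.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# One random "image" and a short dummy sequence of decoder token ids.
pixel_values = torch.randn(1, 3, 32, 32)
decoder_input_ids = torch.tensor([[1, 2, 3, 4]])

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# The decoder emits one row of vocabulary logits per input token.
logits_shape = tuple(outputs.logits.shape)
encoder_shape = tuple(outputs.encoder_last_hidden_state.shape)
print(logits_shape, encoder_shape)
```

In practice the model is more commonly initialised from pretrained weights, for example with `VisionEncoderDecoderModel.from_pretrained` or `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`, and text is produced with `model.generate`. The config-based construction here only keeps the sketch small and offline.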
Referenced by (1)
| Subject (surface form when different) | Predicate |
|---|---|
| Hugging Face Transformers | supportsModelType |