Show, Attend and Tell

E899062

Show, Attend and Tell is a neural image captioning model that introduced visual attention mechanisms, dynamically focusing on different parts of an image while generating each word of the caption.
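The core idea can be sketched as a soft-attention step: score each image region against the decoder state, normalize the scores with a softmax, and take the weighted sum of region features as the context vector. This is a minimal illustration using a dot-product score for brevity (the paper itself uses a small MLP alignment model); all names and shapes here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_attention(region_features, hidden_state, W_a):
    """Weight L region feature vectors by relevance to the decoder state.

    region_features: (L, D) array, one D-dim CNN feature per image region.
    hidden_state:    (H,) current decoder (e.g. LSTM) hidden state.
    W_a:             (H, D) learned projection used to score regions.
    """
    # Score each region against the projected hidden state (dot-product
    # form for brevity; the paper scores regions with an MLP instead).
    scores = region_features @ (W_a.T @ hidden_state)   # (L,)
    # Softmax turns scores into attention weights alpha summing to 1.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: expectation of region features under alpha.
    context = alpha @ region_features                   # (D,)
    return context, alpha

# Hypothetical shapes: a 14x14 grid of 512-dim conv features, 1000-dim state.
rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 512))
h = rng.normal(size=(1000,))
W = rng.normal(size=(1000, 512)) * 0.01
ctx, alpha = soft_attention(feats, h, W)
print(ctx.shape, float(alpha.sum()))
```

The hard-attention variant listed below instead samples a single region from `alpha` and is trained with REINFORCE, since the sampling step is not differentiable.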


Statements (48)

Predicate Object
instanceOf attention-based model
deep learning model
neural image captioning model
affiliationContext University of Montreal
approach encoder-decoder architecture
citationImpact highly cited in vision-language research
demonstrates visualization of attention maps over image regions
domain computer vision
natural language processing
evaluationDataset Flickr30k
Flickr8k
MSCOCO
focusesOn different parts of an image during caption generation
goal generate descriptive natural language captions for images
hasAuthor Aaron Courville
Jimmy Ba
Kelvin Xu
Kyunghyun Cho
Richard Zemel
Ruslan Salakhutdinov
Ryan Kiros
Yoshua Bengio
hasFullName Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
improvesOver non-attention image captioning models
influenced transformer-based image captioning models
inspired later attention-based vision-language models
introduced visual attention mechanism for image captioning
language English captions
learningParadigm supervised learning
property dynamically attends to image regions at each word step
publicationType conference paper
publicationYear 2015
publishedIn International Conference on Machine Learning
publishedInShort ICML
task image caption generation
uses CNN features as image encoder
RNN language model
alignment model between image regions and words
hard attention
soft attention
visual attention
usesDecoder LSTM
recurrent neural network
usesEncoder convolutional neural network
usesTrainingMethod REINFORCE for hard attention approximation
backpropagation through time
stochastic gradient descent
usesTrainingObjective maximum likelihood estimation

Referenced by (1)

Full triples — surface form annotated when it differs from this entity's canonical label.