Flickr30k

E899059

benchmark dataset; image dataset; vision-language dataset

Flickr30k is a large-scale image dataset of about 31,000 photographs, each paired with five human-written English captions. It is widely used for training and evaluating image captioning and vision-language models.


All labels observed (1)

Label	Occurrences
Flickr30k (canonical)	1

How this entity was disambiguated

This entity first appeared as the object of triple T11003512, and resolving that mention fixed its identity. The disambiguator weighed the candidate entities listed below and picked the one marked "chosen"; choosing "None of the above" mints a new entity. This is how homonymy is resolved: the same surface form can point to different entities.

NED1: Entity disambiguation (via context triple), gpt-5-mini-2025-08-07

Target entity: Flickr30k
Context triple: [Show and Tell: A Neural Image Caption Generator, usesDataset, Flickr30k]

A. Flickr
Flickr is an online photo and video hosting and sharing platform that became one of the earliest popular social media sites for photographers and casual users alike.
B. Images and Words
Images and Words is a landmark 1992 progressive metal album by Dream Theater, widely credited with bringing the band mainstream recognition and defining their signature sound.
C. Britannica ImageQuest
Britannica ImageQuest is a curated educational image database offering millions of rights-cleared photos and illustrations for teaching and learning.
D. CLIP
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
E. Photolibrary
Photolibrary was a stock photography agency and image licensing company later incorporated into Getty Images’ global visual media portfolio.
F. None of the above. (chosen)
G. Unsure - the case is ambiguous/there is not enough information to decide.

NED2: Entity disambiguation (via description), gpt-5-mini-2025-08-07

Target entity: Flickr30k
Target entity description: Flickr30k is a large-scale image dataset of 31,000 photographs each paired with multiple human-written captions, widely used for training and evaluating image captioning and vision-language models.

A. Flickr
Flickr is an online photo and video hosting and sharing platform that became one of the earliest popular social media sites for photographers and casual users alike.
B. Images and Words
Images and Words is a landmark 1992 progressive metal album by Dream Theater, widely credited with bringing the band mainstream recognition and defining their signature sound.
C. Britannica ImageQuest
Britannica ImageQuest is a curated educational image database offering millions of rights-cleared photos and illustrations for teaching and learning.
D. CLIP
CLIP is an OpenAI model that learns joint representations of images and text, enabling tasks like zero-shot image classification and natural language-based image retrieval.
E. Photolibrary
Photolibrary was a stock photography agency and image licensing company later incorporated into Getty Images’ global visual media portfolio.
F. None of the above. (chosen)
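
The resolution rule described in this section (pick an existing candidate, or mint a new entity when "None of the above" is chosen) can be sketched roughly as follows. This is only an illustration: the function `resolve_mention`, the `mint_id` callback, and the handling of any other option as an error are assumptions, since the page does not describe how the disambiguator is implemented.

```python
from typing import Callable, Dict, Tuple

NONE_OPTION = "F"  # the "None of the above" option in the listings above


def resolve_mention(
    candidates: Dict[str, str],
    choice: str,
    mint_id: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (entity, minted) for one disambiguation decision.

    candidates maps option letters to existing entities.
    mint_id is a hypothetical callback that creates a new entity ID.
    """
    if choice in candidates:
        # An existing entity was picked, so nothing new is created.
        return candidates[choice], False
    if choice == NONE_OPTION:
        # No candidate matched, so a new entity is minted.
        return mint_id(), True
    # Assumption: any other answer (for example "Unsure") is left unresolved.
    raise ValueError(f"unresolved choice: {choice!r}")


# Candidate list copied from the NED listings above.
candidates = {
    "A": "Flickr",
    "B": "Images and Words",
    "C": "Britannica ImageQuest",
    "D": "CLIP",
    "E": "Photolibrary",
}

entity, minted = resolve_mention(candidates, "F", lambda: "E899059")
print(entity, minted)
```

Here the chosen option is F in both runs, so the mention is resolved to a newly minted entity, which this page identifies as E899059.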

Statements (46)

Predicate	Object
instanceOf	benchmark dataset; image dataset; vision-language dataset
containsContentType	activities; everyday scenes; objects; people
domain	computer vision; multimodal AI; natural language processing
hasAnnotationGranularity	image-level descriptions
hasAnnotationType	sentence-level captions
hasApproximateNumberOfImages	31000
hasCaptionsPerImage	5
hasCollectionPlatform	Flickr website
hasDataModality	images; text captions
hasDataSplit	test set; training set; validation set
hasInputFormat	image plus multiple captions
hasLanguage	English
hasLicense	research use
hasNumberOfImages	31000
hasScaleComparedTo	larger than Flickr8k
hasTask	generate caption from image; retrieve caption from image; retrieve image from caption
hasTotalCaptions	155000
hasTypicalImageResolution	variable
imagesSource	Flickr
isSuccessorOf	Flickr8k
isUsedToEvaluate	caption quality; image-text alignment; multimodal representation learning
isWidelyUsedIn	academic research; image captioning benchmarks; vision-language evaluation
usedFor	benchmarking captioning systems; evaluation of models; image captioning; image-text retrieval; multimodal learning; natural language description of images; training models; vision-language modeling
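
The numeric statements above fit together: about 31,000 images with 5 captions each gives the stated total of 155,000 captions. The sketch below checks that arithmetic and shows one possible in-memory shape for the "image plus multiple captions" input format. The class `CaptionedImage`, the helper names, and the sample ID and captions are hypothetical placeholders, not the dataset's actual file format or contents.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Figures taken from the statements table above.
APPROX_IMAGES = 31_000
CAPTIONS_PER_IMAGE = 5
TOTAL_CAPTIONS = 155_000


@dataclass
class CaptionedImage:
    """One record: an image identifier plus its captions (hypothetical layout)."""
    image_id: str
    captions: List[str]


def counts_are_consistent() -> bool:
    # 31,000 images x 5 captions per image should equal 155,000 captions.
    return APPROX_IMAGES * CAPTIONS_PER_IMAGE == TOTAL_CAPTIONS


def image_caption_pairs(sample: CaptionedImage) -> List[Tuple[str, str]]:
    # Flatten one record into (image, caption) pairs, the unit that
    # image-text retrieval and captioning evaluation typically work with.
    return [(sample.image_id, caption) for caption in sample.captions]


# Placeholder record; the ID and caption text are made up for illustration.
sample = CaptionedImage(
    image_id="example_0001",
    captions=[f"placeholder caption {i}" for i in range(1, CAPTIONS_PER_IMAGE + 1)],
)
pairs = image_caption_pairs(sample)
print(counts_are_consistent(), len(pairs))  # prints: True 5
```

Flattening to pairs is only one design choice; a retrieval evaluation in the caption-from-image direction would instead keep all five captions grouped under their image.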

How these facts were elicited

Referenced by (1)

Full triples. The surface form is annotated when it differs from this entity's canonical label.

Show and Tell: A Neural Image Caption Generator → usesDataset → Flickr30k
