Flickr30k
E899059
Flickr30k is an image dataset of about 31,000 photographs sourced from Flickr, each paired with five human-written English captions. It is widely used for training and evaluating image captioning and vision-language models.
Statements (46)
| Predicate | Object |
|---|---|
| instanceOf | benchmark dataset, image dataset, vision-language dataset |
| containsContentType | activities, everyday scenes, objects, people |
| domain | computer vision, multimodal AI, natural language processing |
| hasAnnotationGranularity | image-level descriptions |
| hasAnnotationType | sentence-level captions |
| hasApproximateNumberOfImages | 31000 |
| hasCaptionsPerImage | 5 |
| hasCollectionPlatform | Flickr website |
| hasDataModality | images, text captions |
| hasDataSplit | test set, training set, validation set |
| hasInputFormat | image plus multiple captions |
| hasLanguage | English |
| hasLicense | research use |
| hasNumberOfImages | 31000 |
| hasScaleComparedTo | larger than Flickr8k |
| hasTask | generate caption from image, retrieve caption from image, retrieve image from caption |
| hasTotalCaptions | 155000 |
| hasTypicalImageResolution | variable |
| imagesSource | Flickr |
| isSuccessorOf | Flickr8k |
| isUsedToEvaluate | caption quality, image-text alignment, multimodal representation learning |
| isWidelyUsedIn | academic research, image captioning benchmarks, vision-language evaluation |
| usedFor | benchmarking captioning systems, evaluation of models, image captioning, image-text retrieval, multimodal learning, natural language description of images, training models, vision-language modeling |
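
The statements above describe an input format of one image plus multiple captions, and the caption total follows from the image count times the captions per image. A minimal Python sketch of that structure is given below. The class `CaptionedImage`, its field names, and the helper functions are hypothetical names chosen for illustration; they are not part of any official Flickr30k loader, and the image count is the approximate figure from the table.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Figures taken from the statements table (approximate image count).
APPROX_NUM_IMAGES = 31_000
CAPTIONS_PER_IMAGE = 5


@dataclass(frozen=True)
class CaptionedImage:
    """One record: an image plus its human-written captions (hypothetical layout)."""
    image_id: str                 # hypothetical identifier, e.g. a file name
    captions: Tuple[str, ...]     # sentence-level English captions


def total_captions(num_images: int, captions_per_image: int = CAPTIONS_PER_IMAGE) -> int:
    """Total caption count implied by the image count and captions per image."""
    return num_images * captions_per_image


def retrieval_pairs(records: List[CaptionedImage]) -> List[Tuple[str, str]]:
    """Flatten records into (caption, image_id) pairs, the unit used in
    caption-to-image and image-to-caption retrieval."""
    return [(caption, rec.image_id) for rec in records for caption in rec.captions]
```

For example, `total_captions(APPROX_NUM_IMAGES)` reproduces the `hasTotalCaptions` value of 155000, and a single record with five captions yields five retrieval pairs.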
Referenced by (1)
Full triples — surface form annotated when it differs from this entity's canonical label.