WMT English-French dataset

E899019

The WMT English-French dataset is a large-scale parallel corpus of English–French sentence pairs widely used as a benchmark for training and evaluating machine translation systems.

All labels observed (1)

Label Occurrences
WMT English-French dataset canonical 1

Statements (47)

Predicate Object
instanceOf English–French corpus
bilingual dataset
machine translation benchmark
parallel corpus
accessMethod download from WMT or associated sites
alignmentType parallel sentences
associatedEvent Conference on Machine Translation
benchmarkFor neural machine translation systems
statistical machine translation systems
benchmarkLevel state-of-the-art comparison
benchmarkStatus standard benchmark in MT research
contains sentence pairs
curatedBy Workshop on Machine Translation organizers
dataFormat plain text
tokenized text
dataType text
domain general-domain text
evaluationMetric BLEU
COMET
chrF
evaluationSetting shared task evaluation campaigns
field machine translation
natural language processing
granularity sentence-level alignment
includes development set
test set
training set
languagePair English–French
license research use (varies by component corpus)
modality written language
origin crawled and curated web and text sources
scale large-scale
sourceLanguage English
targetLanguage French
task sentence-level translation
timeSpan updated annually in WMT campaigns
typicalModel sequence-to-sequence models
typicalPreprocessing subword segmentation (e.g. BPE)
tokenization
truecasing
typicalUse supervised learning
usedBy academic researchers
industry MT practitioners
usedFor benchmarking translation quality
machine translation evaluation
machine translation training
usedIn WMT shared tasks

Referenced by (1)

Full triples — surface form annotated when it differs from this entity's canonical label.