WMT English-French dataset

E899019

English–French corpus; bilingual dataset; machine translation benchmark; parallel corpus

The WMT English-French dataset is a large-scale parallel corpus of English–French sentence pairs widely used as a benchmark for training and evaluating machine translation systems.
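As a rough illustration of what "parallel sentence pairs" means in practice, the sketch below pairs up two line-aligned text sources, one English and one French. The line-aligned layout, the function name `read_parallel`, and the toy sentences are assumptions made for this example; they are not taken from the WMT distribution itself.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_parallel(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned sources.

    Assumes line i of the source corresponds to line i of the target.
    """
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter source with None, which signals misalignment.
        if src is None or tgt is None:
            raise ValueError(f"sources differ in length at line {line_no}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # drop pairs where either side is empty
            yield src, tgt


# Hypothetical toy data standing in for two line-aligned plain-text files.
english = ["Hello world.\n", "\n", "How are you?\n"]
french = ["Bonjour le monde.\n", "\n", "Comment allez-vous ?\n"]

pairs = list(read_parallel(english, french))
```

With real files, the two lists could be replaced by open file handles, since the function only iterates over lines.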


All labels observed (1)

Label	Occurrences
WMT English-French dataset (canonical)	1

Statements (47)

Predicate	Object
instanceOf	English–French corpus; bilingual dataset; machine translation benchmark; parallel corpus
accessMethod	download from WMT or associated sites
alignmentType	parallel sentences
associatedEvent	Conference on Machine Translation
benchmarkFor	neural machine translation systems; statistical machine translation systems
benchmarkLevel	state-of-the-art comparison
benchmarkStatus	standard benchmark in MT research
contains	sentence pairs
curatedBy	Workshop on Machine Translation organizers
dataFormat	plain text; tokenized text
dataType	text
domain	general-domain text
evaluationMetric	BLEU; COMET; chrF
evaluationSetting	shared task evaluation campaigns
field	machine translation; natural language processing
granularity	sentence-level alignment
includes	development set; test set; training set
languagePair	English–French
license	research use (varies by component corpus)
modality	written language
origin	crawled and curated web and text sources
scale	large-scale
sourceLanguage	English
targetLanguage	French
task	sentence-level translation
timeSpan	updated annually in WMT campaigns
typicalModel	sequence-to-sequence models
typicalPreprocessing	subword segmentation (e.g. BPE); tokenization; truecasing
typicalUse	supervised learning
usedBy	academic researchers; industry MT practitioners
usedFor	benchmarking translation quality; machine translation evaluation; machine translation training
usedIn	WMT shared tasks
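
BLEU is listed above as one of the evaluation metrics. The following is a minimal, simplified sketch of corpus-level BLEU, included only to show the idea (clipped n-gram precision combined with a brevity penalty). It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its scores are not comparable to those reported by standard tools or in WMT campaigns. The names `ngrams` and `corpus_bleu` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU: one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty applies only when the hypotheses are shorter overall.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# Hypothetical toy example: a system output scored against one reference.
hypotheses = ["le chat est sur le tapis"]
references = ["le chat est sur le tapis rouge"]
score = corpus_bleu(hypotheses, references)
```

For reported results, an established implementation with a fixed tokenization scheme is the safer choice, since BLEU values depend heavily on tokenization and smoothing.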

Referenced by (1)

Full triples — surface form annotated when it differs from this entity's canonical label.

Sequence to Sequence Learning with Neural Networks → demonstratedOn → WMT English-French dataset