WebText dataset

E99319

language model training dataset text corpus

The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.

Aliases (1)

WebText ×49

Statements (49)

Predicate	Object
instanceOf	language model training dataset → text corpus →
access	not fully open to public download →
associatedWith	GPT-2 →
collectionMethod	crawling URLs extracted from Reddit →
comparedWith	Wikipedia-only training corpora →
contains	articles → code snippets → dialogue-like text → news → online books → stories → technical documentation → web documents → web forum discussions →
curatedBy	OpenAI researchers →
curationFocus	high-quality internet text →
dataModality	natural language text →
dataSource	outbound links from Reddit → web pages →
developer	OpenAI →
domain	web text →
excludes	low-quality spam pages → non-text-heavy pages →
goal	capture broad distribution of internet text → improve generalization of language models →
influenced	later web-scale language modeling datasets →
language	English →
license	not publicly released as a full dataset →
organization	OpenAI →
preprocessingStep	deduplication of documents → filtering low-quality pages → tokenization →
publication	Language Models are Unsupervised Multitask Learners →
publicationYear	2019 →
relatedTo	OpenAI GPT models →
releasedBy	OpenAI →
scale	large-scale →
selectionCriterion	filtering for high-quality content → outbound links from Reddit posts with at least 3 karma →
sizeDescription	on the order of billions of tokens → about 40 GB of text → slightly over 8 million documents →
topicCoverage	diverse internet topics →
trainingObjective	next-token prediction →
usedFor	training GPT-2 → training large language models → unsupervised language modeling →
usedIn	evaluation of GPT-2 capabilities → research on zero-shot learning with language models →
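
The collectionMethod, selectionCriterion, and preprocessingStep statements above can be illustrated with a minimal Python sketch. It is not OpenAI's pipeline, which was not released. The `Link` type, the function names, and the exact-match hashing used for deduplication are assumptions made for illustration. The karma threshold of 3 is the figure reported in the GPT-2 paper, not a value stated in this entry.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical record for one outbound link found in a Reddit post."""
    url: str
    karma: int


MIN_KARMA = 3  # threshold reported in the GPT-2 paper; an assumption here


def select_urls(links, min_karma=MIN_KARMA):
    """Keep each URL once, in first-seen order, if its post met the karma bar."""
    seen = set()
    selected = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            selected.append(link.url)
    return selected


def dedupe_documents(docs):
    """Drop documents whose whitespace- and case-normalized text repeats.

    Exact-hash matching is a simplification; the real heuristics are unknown.
    """
    seen = set()
    unique = []
    for text in docs:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Content-quality filtering and tokenization are left out because the entry gives no detail on how they were performed.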

Referenced by (1)

Subject (surface form when different)	Predicate
GPT-2 →	trainingDataSource