WebText dataset

E99319

The WebText dataset is a corpus of roughly 40 GB of text from slightly over 8 million web documents, which OpenAI built by scraping pages linked from Reddit posts with at least 3 karma and used to train GPT-2.

All labels observed (1)

Label Occurrences
WebText dataset canonical 1

Statements (49)

Predicate Object
instanceOf language model training dataset
text corpus
access not fully open to public download
associatedWith GPT-2
collectionMethod crawling URLs extracted from Reddit
comparedWith Wikipedia-only training corpora
contains articles
code snippets
dialogue-like text
news
online books
stories
technical documentation
web documents
web forum discussions
curatedBy OpenAI researchers
curationFocus high-quality internet text
dataModality natural language text
dataSource outbound links from Reddit
web pages
developer OpenAI
domain web text
excludes low-quality spam pages
non-text-heavy pages
Wikipedia pages
goal capture broad distribution of internet text
improve generalization of language models
influenced later web-scale language modeling datasets
language English
license not publicly released as a full dataset
organization OpenAI
preprocessingStep deduplication of documents
filtering low-quality pages
tokenization
publication Language Models are Unsupervised Multitask Learners
publicationYear 2019
relatedTo OpenAI API platform
surface form: OpenAI GPT models
releasedBy OpenAI
scale large-scale
selectionCriterion filtering for high-quality content
outbound Reddit links with at least 3 karma
sizeDescription about 40 GB of text from slightly over 8 million documents
topicCoverage diverse internet topics
trainingObjective next-token prediction
usedFor training GPT-2
training large language models
unsupervised language modeling
usedIn evaluation of GPT-2 capabilities
research on zero-shot learning with language models
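
The selectionCriterion and preprocessingStep statements above can be illustrated with a minimal sketch. It assumes link records that carry a URL and a karma score. The 3-karma threshold and the removal of Wikipedia documents follow the GPT-2 paper; the names `Link`, `select_urls`, and `deduplicate` are hypothetical, and exact-match hashing is only an assumed stand-in for the deduplication step, which the statements do not describe in detail.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold reported in the GPT-2 paper


@dataclass(frozen=True)
class Link:
    """Hypothetical record for one outbound Reddit link."""
    url: str
    karma: int


def select_urls(links, min_karma=MIN_KARMA):
    """Return unique URLs with enough karma, skipping Wikipedia hosts."""
    seen = set()
    selected = []
    for link in links:
        if link.karma < min_karma:
            continue  # selection criterion: karma threshold
        host = urlparse(link.url).netloc.lower()
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            continue  # Wikipedia documents are excluded
        if link.url in seen:
            continue  # the same URL may be posted more than once
        seen.add(link.url)
        selected.append(link.url)
    return selected


def deduplicate(documents):
    """Drop documents whose whitespace-normalized text was already seen.

    Exact-match hashing is an assumption made for illustration only.
    """
    seen = set()
    unique = []
    for text in documents:
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

In this sketch the URL filter runs before any page is fetched, so pages that fail the karma criterion are never downloaded; quality filtering and tokenization would follow deduplication and are not shown.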

Referenced by (1)

Full triples — surface form annotated when it differs from this entity's canonical label.

GPT-2 trainingDataSource WebText dataset