WebText dataset

E99319

The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.

Aliases (1)

Statements (49)
Predicate Object
instanceOf language model training dataset
text corpus
access not fully open to public download
associatedWith GPT-2
collectionMethod crawling URLs extracted from Reddit
comparedWith Wikipedia-only training corpora
contains Wikipedia pages
articles
code snippets
dialogue-like text
news
online books
stories
technical documentation
web documents
web forum discussions
curatedBy OpenAI researchers
curationFocus high-quality internet text
dataModality natural language text
dataSource outbound links from Reddit
web pages
developer OpenAI
domain web text
excludes low-quality spam pages
non-text-heavy pages
goal capture broad distribution of internet text
improve generalization of language models
influenced later web-scale language modeling datasets
language English
license not publicly released as a full dataset
organization OpenAI
preprocessingStep deduplication of documents
filtering low-quality pages
tokenization
publication Language Models are Unsupervised Multitask Learners
publicationYear 2019
relatedTo OpenAI GPT models
releasedBy OpenAI
scale large-scale
selectionCriterion filtering for high-quality content
links from Reddit with high karma
sizeDescription on the order of billions of tokens
topicCoverage diverse internet topics
trainingObjective next-token prediction
usedFor training GPT-2
training large language models
unsupervised language modeling
usedIn evaluation of GPT-2 capabilities
research on zero-shot learning with language models
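The collectionMethod, selectionCriterion, and deduplication statements above can be sketched as a minimal pipeline. This is a hypothetical illustration, not the actual WebText tooling: the post records, field names, and `min_karma` parameter are stand-ins (the GPT-2 paper describes keeping outbound Reddit links that received at least 3 karma, then deduplicating and filtering the fetched pages).

```python
# Hypothetical sketch of a WebText-style URL curation step.
# Assumes simplified post records with "url" and "karma" fields;
# the real pipeline worked from Reddit data and also extracted
# and filtered page text downstream.

def curate_urls(posts, min_karma=3):
    """Keep outbound URLs from posts with at least `min_karma` karma,
    deduplicating while preserving first-seen order."""
    seen = set()
    urls = []
    for post in posts:
        url = post["url"]
        if post["karma"] >= min_karma and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

posts = [
    {"url": "https://example.com/a", "karma": 5},
    {"url": "https://example.com/a", "karma": 7},  # duplicate URL, dropped
    {"url": "https://example.com/b", "karma": 1},  # below karma threshold
    {"url": "https://example.com/c", "karma": 3},
]
print(curate_urls(posts))  # ['https://example.com/a', 'https://example.com/c']
```

Karma filtering here serves as a cheap human-quality signal, which is the role the selectionCriterion statements attribute to it; the quality filtering and deduplication of the fetched documents themselves were separate preprocessing steps.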

Referenced by (1)
Subject (surface form when different)    Predicate
GPT-2    trainingDataSource
