WebText dataset
E99319
The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.
Aliases (1)
- WebText ×49
Statements (49)
| Predicate | Object |
|---|---|
| instanceOf | language model training dataset; text corpus |
| access | not fully open to public download |
| associatedWith | GPT-2 |
| collectionMethod | crawling URLs extracted from Reddit |
| comparedWith | Wikipedia-only training corpora |
| contains | Wikipedia pages; articles; code snippets; dialogue-like text; news; online books; stories; technical documentation; web documents; web forum discussions |
| curatedBy | OpenAI researchers |
| curationFocus | high-quality internet text |
| dataModality | natural language text |
| dataSource | outbound links from Reddit; web pages |
| developer | OpenAI |
| domain | web text |
| excludes | low-quality spam pages; non-text-heavy pages |
| goal | capture broad distribution of internet text; improve generalization of language models |
| influenced | later web-scale language modeling datasets |
| language | English |
| license | not publicly released as a full dataset |
| organization | OpenAI |
| preprocessingStep | deduplication of documents; filtering low-quality pages; tokenization |
| publication | Language Models are Unsupervised Multitask Learners |
| publicationYear | 2019 |
| relatedTo | OpenAI GPT models |
| releasedBy | OpenAI |
| scale | large-scale |
| selectionCriterion | filtering for high-quality content; links from Reddit with high karma |
| sizeDescription | on the order of billions of tokens |
| topicCoverage | diverse internet topics |
| trainingObjective | next-token prediction |
| usedFor | training GPT-2; training large language models; unsupervised language modeling |
| usedIn | evaluation of GPT-2 capabilities; research on zero-shot learning with language models |
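
The collectionMethod, selectionCriterion, and preprocessingStep statements above describe a pipeline: take outbound links from Reddit submissions that received enough karma, download the pages, keep the text-heavy ones, and deduplicate. The sketch below illustrates that flow; it is an illustration rather than OpenAI's actual code. The `submissions` input format, the function names, and the `MIN_TEXT_CHARS` cutoff are assumptions; the karma threshold of 3 comes from the GPT-2 paper (the statement above only says "high karma"); and the original pipeline used dedicated content extractors (Dragnet and Newspaper) rather than the crude tag stripping and exact-hash deduplication shown here.

```python
"""Minimal sketch of a WebText-style collection pipeline (assumptions noted above)."""

import hashlib
import re

import requests

KARMA_THRESHOLD = 3      # GPT-2 paper kept links from submissions with at least 3 karma
MIN_TEXT_CHARS = 500     # assumed cutoff for dropping non-text-heavy pages


def select_urls(submissions):
    """Keep outbound URLs from Reddit submissions with enough karma.

    `submissions` is assumed to be an iterable of dicts with "url" and "score" keys.
    """
    return {
        sub["url"]
        for sub in submissions
        if sub.get("score", 0) >= KARMA_THRESHOLD
        and not sub["url"].startswith("https://www.reddit.com")
    }


def extract_text(html):
    """Crude stand-in for a real content extractor: strip tags, collapse whitespace."""
    text = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(submissions):
    """Fetch, filter, and deduplicate documents; returns a list of plain-text docs."""
    seen_hashes = set()
    corpus = []
    for url in select_urls(submissions):
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        text = extract_text(resp.text)
        if len(text) < MIN_TEXT_CHARS:   # drop non-text-heavy pages
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:        # exact-duplicate removal only
            continue
        seen_hashes.add(digest)
        corpus.append(text)
    return corpus
```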
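
The trainingObjective statement refers to the standard autoregressive language-modeling objective under which GPT-2 was trained on WebText: maximize the log-likelihood of each token given the tokens that precede it,

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1}),$$

where $x_1, \dots, x_T$ are the tokens of a document from the corpus and $\theta$ are the model parameters.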
Referenced by (1)
| Subject (surface form when different) | Predicate |
|---|---|
| GPT-2 | trainingDataSource |