WebText dataset
E99319
The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.
All labels observed (1)
| Label | Occurrences |
|---|---|
| WebText dataset (canonical) | 1 |
How this entity was disambiguated
This entity first appeared as the object of triple T848984 — resolving that mention is where its identity was fixed. The disambiguator weighed these candidate entities and picked the highlighted one (or “None”, minting a new entity). This is how homonymy is resolved: the same surface form can point to different entities.
Target entity: WebText dataset
Context triple: [GPT-2, trainingDataSource, WebText dataset]
- A. Wikisource: Wikisource is a free online digital library of public domain and freely licensed texts that anyone can read and help transcribe.
- B. CREA corpus: The CREA corpus is a large, authoritative reference collection of contemporary Spanish language usage compiled for linguistic and lexicographic research.
- C. torchtext (ecosystem): torchtext is a PyTorch library that provides tools, datasets, and utilities for building and processing text data in natural language processing workflows.
- D. Read: Read is a surname shared by various notable individuals across fields such as politics, arts, and academia.
- E. Project Gutenberg: Project Gutenberg is a pioneering digital library that offers free access to thousands of public-domain ebooks in multiple formats.
- F. None of above. (chosen)
- G. Unsure - the case is ambiguous/there is not enough information to decide.
Target entity: WebText dataset
Target entity description: The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.
- A. Wikisource: Wikisource is a free online digital library of public domain and freely licensed texts that anyone can read and help transcribe.
- B. CREA corpus: The CREA corpus is a large, authoritative reference collection of contemporary Spanish language usage compiled for linguistic and lexicographic research.
- C. torchtext (ecosystem): torchtext is a PyTorch library that provides tools, datasets, and utilities for building and processing text data in natural language processing workflows.
- D. Read: Read is a surname shared by various notable individuals across fields such as politics, arts, and academia.
- E. Project Gutenberg: Project Gutenberg is a pioneering digital library that offers free access to thousands of public-domain ebooks in multiple formats.
- F. None of above. (chosen)
Statements (49)
| Predicate | Object |
|---|---|
| instanceOf | language model training dataset; text corpus |
| access | not fully open to public download |
| associatedWith | GPT-2 |
| collectionMethod | crawling URLs extracted from Reddit |
| comparedWith | Wikipedia-only training corpora |
| contains | Wikipedia pages; articles; code snippets; dialogue-like text; news; online books; stories; technical documentation; web documents; web forum discussions |
| curatedBy | OpenAI researchers |
| curationFocus | high-quality internet text |
| dataModality | natural language text |
| dataSource | outbound links from Reddit; web pages |
| developer | OpenAI |
| domain | web text |
| excludes | low-quality spam pages; non-text-heavy pages |
| goal | capture broad distribution of internet text; improve generalization of language models |
| influenced | later web-scale language modeling datasets |
| language | English |
| license | not publicly released as a full dataset |
| organization | OpenAI |
| preprocessingStep | deduplication of documents; filtering low-quality pages; tokenization |
| publication | Language Models are Unsupervised Multitask Learners |
| publicationYear | 2019 |
| relatedTo | OpenAI API platform (surface form: OpenAI GPT models) |
| releasedBy | OpenAI |
| scale | large-scale |
| selectionCriterion | filtering for high-quality content; links from Reddit with high karma |
| sizeDescription | on the order of billions of tokens |
| topicCoverage | diverse internet topics |
| trainingObjective | next-token prediction |
| usedFor | training GPT-2; training large language models; unsupervised language modeling |
| usedIn | evaluation of GPT-2 capabilities; research on zero-shot learning with language models |
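
The selectionCriterion and preprocessingStep rows describe a filter-then-deduplicate flow. The sketch below illustrates that idea only and is not OpenAI's actual pipeline: the `OutboundLink` record, the `select_urls` helper, and the trailing-slash normalization are all hypothetical. The default threshold of 3 matches the figure reported in the GPT-2 paper, but it is exposed as a parameter because the table says only "high karma". Deduplication here works on URLs as a stand-in for the document-level deduplication listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundLink:
    """Hypothetical record for one outbound link found in a Reddit submission."""
    url: str
    karma: int


# Illustrative default; the table above only states "high karma".
MIN_KARMA = 3


def select_urls(links, min_karma=MIN_KARMA):
    """Keep URLs that meet the karma threshold, dropping repeated URLs.

    Order of first appearance is preserved.
    """
    seen = set()
    selected = []
    for link in links:
        # Crude normalization (an assumption): ignore a trailing slash.
        key = link.url.rstrip("/")
        if link.karma >= min_karma and key not in seen:
            seen.add(key)
            selected.append(link.url)
    return selected
```

A real pipeline would also need text extraction and content-based filtering, which the table mentions but this sketch leaves out.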
How these facts were elicited
The pipeline generated the facts above by prompting gpt-5.1 with this entity's name + description and the instruction below.
You are a knowledge base construction expert. Given a subject entity and a description of it, return factual statements that you know for the subject as a JSON list of dictionaries(triples), where keys must be "subject", "predicate" and "object". The number of facts may be very high, between 25 to 50 or more, for very popular subjects. For less popular subjects, the number of facts can be very low, like 5 or 10.

# Requirements
- If you don't know the subject at all, return an empty list.
- If the subject is not a named entity, return an empty list.
- Include at least one triple where predicate is "instanceOf".
- Do not get too wordy.
- Separate several objects into multiple triples with one object.

Subject: WebText dataset
Description of subject: The WebText dataset is a large-scale corpus of web pages curated by OpenAI to train language models like GPT-2 on diverse, high-quality internet text.
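
The instruction above fixes the response shape: a JSON list of dictionaries with the keys "subject", "predicate" and "object", an empty list for unknown subjects, and at least one "instanceOf" triple otherwise. A minimal sketch of checking a response against those rules might look like the following. The `parse_triples` function is hypothetical and is not part of the pipeline described here.

```python
import json

REQUIRED_KEYS = {"subject", "predicate", "object"}


def parse_triples(raw: str) -> list[dict]:
    """Parse a model response and check it against the prompt's stated rules.

    Hypothetical helper: raises ValueError when the shape does not match.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of triples")
    for item in data:
        # Each triple must be a dict with exactly the three required keys.
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"malformed triple: {item!r}")
    # An empty list is valid (unknown subject); otherwise instanceOf is required.
    if data and not any(t["predicate"] == "instanceOf" for t in data):
        raise ValueError("non-empty response lacks an instanceOf triple")
    return data
```

For example, a response holding the single triple `{"subject": "WebText dataset", "predicate": "instanceOf", "object": "text corpus"}` passes, while a non-empty response without any "instanceOf" triple is rejected.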
Referenced by (1)
Full triples — surface form annotated when it differs from this entity's canonical label.