Disambiguation evidence for the WebText dataset via the surface form "WebText"

As subject (49 triples)

Triples in which this entity appears as the subject under the label "WebText".
| Predicate | Object |
|---|---|
| access | not fully open to public download |
| associatedWith | GPT-2 |
| collectionMethod | crawling URLs extracted from Reddit (see the link-harvesting sketch below) |
| comparedWith | Wikipedia-only training corpora |
| contains | Wikipedia pages |
| contains | articles |
| contains | code snippets |
| contains | dialogue-like text |
| contains | news |
| contains | online books |
| contains | stories |
| contains | technical documentation |
| contains | web documents |
| contains | web forum discussions |
| curatedBy | OpenAI researchers |
| curationFocus | high-quality internet text |
| dataModality | natural language text |
| dataSource | outbound links from Reddit |
| dataSource | web pages |
| developer | OpenAI |
| domain | web text |
| excludes | low-quality spam pages |
| excludes | non-text-heavy pages |
| goal | capture broad distribution of internet text |
| goal | improve generalization of language models |
| influenced | later web-scale language modeling datasets |
| instanceOf | language model training dataset |
| instanceOf | text corpus |
| language | English |
| license | not publicly released as a full dataset |
| organization | OpenAI |
| preprocessingStep | deduplication of documents (see the preprocessing sketch below) |
| preprocessingStep | filtering low-quality pages |
| preprocessingStep | tokenization |
| publication | Language Models are Unsupervised Multitask Learners |
| publicationYear | 2019 |
| relatedTo | OpenAI API platform (surface form: "OpenAI GPT models") |
| releasedBy | OpenAI |
| scale | large-scale |
| selectionCriterion | filtering for high-quality content |
| selectionCriterion | links from Reddit with high karma |
| sizeDescription | on the order of billions of tokens |
| topicCoverage | diverse internet topics |
| trainingObjective | next-token prediction (see the loss sketch below) |
| usedFor | training GPT-2 |
| usedFor | training large language models |
| usedFor | unsupervised language modeling |
| usedIn | evaluation of GPT-2 capabilities |
| usedIn | research on zero-shot learning with language models |
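
The collectionMethod and selectionCriterion rows describe the harvesting step reported for WebText: keep outbound links from Reddit posts that received at least 3 karma. Below is a minimal sketch of that filter; the `submissions` record format and the `harvest_urls` helper are hypothetical stand-ins, not Reddit's actual API.

```python
# Sketch of WebText-style link harvesting: keep outbound URLs from Reddit
# submissions whose score (karma) clears a threshold. The GPT-2 paper uses
# a threshold of 3 karma as a heuristic quality signal.
from urllib.parse import urlparse

KARMA_THRESHOLD = 3  # per the GPT-2 paper's heuristic quality filter

def harvest_urls(submissions):
    """Yield deduplicated outbound URLs from sufficiently upvoted posts.

    `submissions` is an iterable of dicts with hypothetical keys
    'url' and 'karma'.
    """
    seen = set()
    for post in submissions:
        if post["karma"] < KARMA_THRESHOLD:
            continue  # low karma is treated as a weak quality signal
        url = post["url"]
        host = urlparse(url).netloc
        if not host or host.endswith("reddit.com"):
            continue  # keep only outbound links, not Reddit-internal ones
        if url not in seen:
            seen.add(url)
            yield url

# Toy usage:
posts = [
    {"url": "https://example.com/article", "karma": 12},
    {"url": "https://spam.example.net/ad", "karma": 0},
    {"url": "https://www.reddit.com/r/news/comments/abc", "karma": 50},
]
print(list(harvest_urls(posts)))  # ['https://example.com/article']
```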
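
The three preprocessingStep rows name a filter/dedup/tokenize pipeline. The sketch below is illustrative only: the real WebText heuristics are not public, the `MIN_CHARS` length filter is an assumption, and GPT-2's actual tokenizer is a byte-level BPE rather than the whitespace split used here.

```python
# Illustrative pass over raw documents: drop near-empty pages (filtering),
# skip exact duplicates via content hashing (deduplication), and split
# into tokens (tokenization).
import hashlib

MIN_CHARS = 128  # hypothetical threshold for "too short to be useful"

def preprocess(documents):
    """Return token lists for unique, non-trivial documents."""
    seen_hashes = set()
    corpus = []
    for text in documents:
        text = " ".join(text.split())   # normalize whitespace
        if len(text) < MIN_CHARS:
            continue                    # filtering: drop near-empty pages
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                    # deduplication: exact match via hash
        seen_hashes.add(digest)
        corpus.append(text.split())     # tokenization: whitespace stand-in
    return corpus
```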
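
The trainingObjective row, next-token prediction, amounts to the average negative log-probability a model assigns to each token given its prefix. The sketch below works that out with a placeholder uniform model; `predict_proba` is a hypothetical interface, not GPT-2's API, and the token ids are arbitrary.

```python
# Worked sketch of the next-token prediction loss: average negative
# log-probability of each actual next token given its prefix.
import math

def next_token_loss(token_ids, predict_proba):
    """Average NLL of each token given the tokens before it.

    `predict_proba(prefix, token)` returns the model's probability of
    `token` following `prefix` (a hypothetical interface).
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        p = predict_proba(token_ids[:t], token_ids[t])
        total += -math.log(p)
    return total / (len(token_ids) - 1)

# Toy usage: a uniform "model" over a 50257-token vocabulary (GPT-2's
# vocabulary size) stands in for the actual transformer.
VOCAB_SIZE = 50257
uniform = lambda prefix, token: 1.0 / VOCAB_SIZE
print(next_token_loss([464, 3290, 318, 257], uniform))  # ln(50257) ≈ 10.82
```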