Disambiguation evidence for WebText dataset via surface form "WebText"

As subject (49)

Triples where this entity appears as subject under the label "WebText".

Predicate Object
access not fully open to public download
associatedWith GPT-2
collectionMethod crawling URLs extracted from Reddit
comparedWith Wikipedia-only training corpora
contains Wikipedia pages
contains articles
contains code snippets
contains dialogue-like text
contains news
contains online books
contains stories
contains technical documentation
contains web documents
contains web forum discussions
curatedBy OpenAI researchers
curationFocus high-quality internet text
dataModality natural language text
dataSource outbound links from Reddit
dataSource web pages
developer OpenAI
domain web text
excludes low-quality spam pages
excludes non-text-heavy pages
goal capture broad distribution of internet text
goal improve generalization of language models
influenced later web-scale language modeling datasets
instanceOf language model training dataset
instanceOf text corpus
language English
license not publicly released as a full dataset
organization OpenAI
preprocessingStep deduplication of documents
preprocessingStep filtering low-quality pages
preprocessingStep tokenization
publication Language Models are Unsupervised Multitask Learners
publicationYear 2019
relatedTo OpenAI API platform (surface form: "OpenAI GPT models")
releasedBy OpenAI
scale large-scale
selectionCriterion filtering for high-quality content
selectionCriterion links from Reddit with high karma
sizeDescription on the order of billions of tokens
topicCoverage diverse internet topics
trainingObjective next-token prediction
usedFor training GPT-2
usedFor training large language models
usedFor unsupervised language modeling
usedIn evaluation of GPT-2 capabilities
usedIn research on zero-shot learning with language models
