Common Crawl

E102298

Common Crawl is a non-profit organization that regularly crawls the web and maintains a publicly available archive of petabytes of web page data for use in research and large-scale data analysis.


All labels observed (1)

Label Occurrences
Common Crawl canonical 2

Statements (51)

Predicate Object
instanceOf non-profit organization
open data project
web crawl corpus
accessCost free
country United States of America
surface form: United States
coverage billions of URLs
worldwide web
dataAccess publicly available
dataCollectionMethod large-scale web crawling
dataFormat WARC
WAT
WET
dataType HTML content
metadata
outlinks
text extracts
web pages
dataVolume petabytes of data
distributionPlatform Amazon Web Services Open Data
surface form: AWS Open Data Sponsorship Program
Amazon S3
HTTP download
foundedBy Gil Elbaz
hasAPI index and access tools provided by third parties
hasComponent URL index
crawl archives
metadata files
hasLanguage multilingual
headquartersLocation San Francisco, California, United States of America
surface form: San Francisco, California, United States
inception 2007
legalForm 501(c)(3) non-profit organization
license CC-BY for metadata
no known copyright restrictions on raw crawl data to the extent permitted by law
mission to build and maintain an open repository of web crawl data that is accessible to everyone
name Common Crawl self-link
notableFor providing one of the largest publicly accessible web corpora
operatesOn publicly accessible web pages
purpose enable innovation with open web data
support large-scale data analysis
support research
supportedBy donations
sponsorships
updateFrequency regular crawls
typically monthly crawls in recent years
useCase computational social science
data mining
machine learning training data
natural language processing research
search and information retrieval research
web graph analysis
web-scale language modeling
website https://commoncrawl.org/

Referenced by (2)

Full triples — surface form annotated when it differs from this entity's canonical label.

GPT-3 trainingDataSource Common Crawl
Common Crawl name Common Crawl self-link