Common Crawl

E102298

non-profit organization; open data project; web crawl corpus

Common Crawl is a publicly available web archive, maintained by a non-profit organization, that regularly crawls the web and stores petabytes of page data for research and large-scale data analysis.


All labels observed (1)

Label	Occurrences
Common Crawl canonical	2

Statements (51)

Predicate	Object
instanceOf	non-profit organization; open data project; web crawl corpus
accessCost	free
country	United States of America (surface form: United States)
coverage	billions of URLs; worldwide web
dataAccess	publicly available
dataCollectionMethod	large-scale web crawling
dataFormat	WARC; WAT; WET
dataType	HTML content; metadata; outlinks; text extracts; web pages
dataVolume	petabytes of data
distributionPlatform	Amazon Web Services Open Data (surface form: AWS Open Data Sponsorship Program); Amazon S3; HTTP download
foundedBy	Gil Elbaz
hasAPI	index and access tools provided by third parties
hasComponent	URL index; crawl archives; metadata files
hasLanguage	multilingual
headquartersLocation	San Francisco, California, United States of America (surface form: San Francisco, California, United States)
inception	2007
legalForm	501(c)(3) non-profit organization
license	CC-BY for metadata; no known copyright restrictions on raw crawl data to the extent permitted by law
mission	to build and maintain an open repository of web crawl data that is accessible to everyone
name	Common Crawl (self-link)
notableFor	providing one of the largest publicly accessible web corpora
operatesOn	publicly accessible web pages
purpose	enable innovation with open web data; support large-scale data analysis; support research
supportedBy	donations; sponsorships
updateFrequency	regular crawls; typically monthly crawls in recent years
useCase	computational social science; data mining; machine learning training data; natural language processing research; search and information retrieval research; web graph analysis; web-scale language modeling
website	https://commoncrawl.org/
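
The dataFormat statement lists WARC, WAT, and WET. As a rough illustration of what a record in these formats looks like, the sketch below parses a single uncompressed WARC-style record held in memory: a version line, named header fields, a blank line, then a payload of Content-Length bytes. The function name parse_warc_record and the sample record are hypothetical and made up for this example; they are not taken from Common Crawl's files or tooling. Real crawl files are gzip-compressed and contain many records, so this is a minimal sketch under those simplifying assumptions, not a production reader.

```python
import io
from typing import Dict, Tuple


def parse_warc_record(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Parse one uncompressed WARC record into (version, headers, payload).

    Hypothetical helper for illustration only; assumes a single,
    well-formed record with a Content-Length header.
    """
    stream = io.BytesIO(raw)

    # The first line names the format version, e.g. "WARC/1.0".
    version = stream.readline().decode("utf-8").strip()
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")

    # Header fields follow as "Name: value" lines until a blank line.
    headers: Dict[str, str] = {}
    for line in iter(stream.readline, b""):
        text = line.decode("utf-8").strip()
        if not text:
            break
        name, _, value = text.partition(":")
        headers[name.strip()] = value.strip()

    # The payload is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Made-up sample resembling a text-extract ("conversion") record.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"Hello, world"
    b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Type"], payload.decode("utf-8"))
```

For actual crawl archives, an established WARC library such as warcio is likely a better starting point than hand-written parsing, since it handles compression and multi-record files.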

Referenced by (2)

Full triples — surface form annotated when it differs from this entity's canonical label.

GPT-3 → trainingDataSource → Common Crawl

Common Crawl → name → Common Crawl (self-link)