Common Crawl
E102298
Common Crawl is a non-profit organization that maintains a massive, publicly available web archive, built from regular crawls and holding petabytes of web page data, for use in research and large-scale data analysis.
All labels observed (1)
| Label | Occurrences |
|---|---|
| Common Crawl (canonical) | 2 |
Statements (51)
| Predicate | Object |
|---|---|
| instanceOf | non-profit organization; open data project; web crawl corpus |
| accessCost | free |
| country | United States of America (surface form: United States) |
| coverage | billions of URLs; worldwide web |
| dataAccess | publicly available |
| dataCollectionMethod | large-scale web crawling |
| dataFormat | WARC; WAT; WET |
| dataType | HTML content; metadata; outlinks; text extracts; web pages |
| dataVolume | petabytes of data |
| distributionPlatform | Amazon Web Services Open Data (surface form: AWS Open Data Sponsorship Program); Amazon S3; HTTP download |
| foundedBy | Gil Elbaz |
| hasAPI | index and access tools provided by third parties |
| hasComponent | URL index; crawl archives; metadata files |
| hasLanguage | multilingual |
| headquartersLocation | San Francisco, California, United States of America (surface form: San Francisco, California, United States) |
| inception | 2007 |
| legalForm | 501(c)(3) non-profit organization |
| license | CC-BY for metadata; no known copyright restrictions on raw crawl data to the extent permitted by law |
| mission | to build and maintain an open repository of web crawl data that is accessible to everyone |
| name | Common Crawl (self-link) |
| notableFor | providing one of the largest publicly accessible web corpora |
| operatesOn | publicly accessible web pages |
| purpose | enable innovation with open web data; support large-scale data analysis; support research |
| supportedBy | donations; sponsorships |
| updateFrequency | regular crawls; typically monthly crawls in recent years |
| useCase | computational social science; data mining; machine learning training data; natural language processing research; search and information retrieval research; web graph analysis; web-scale language modeling |
| website | https://commoncrawl.org/ |
Referenced by (2)
Full triples — surface form annotated when it differs from this entity's canonical label.