...

How can I use Common Crawl data?
Please refer to the Common Crawl Terms of Use document for a detailed, authoritative description of what is permitted. In general: you may not republish data retrieved from the crawl (unless allowed by fair use), you may not resell access to the service, you may not use the crawl data for any illegal purpose, and you must respect the Terms of Use of the sites we crawl.

Accessing the Data

How much does it cost?


I want to analyze specific URLs from the Common Crawl corpus. How can I do that?
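One common starting point is the public URL index served at index.commoncrawl.org, which maps a URL to the WARC files and byte offsets where its captures are stored. The following is a minimal illustrative sketch, not an official answer from this FAQ: the crawl collection name in INDEX is an assumption, and you should substitute any collection listed on the index site.

# Illustrative sketch: look up capture records for one URL in the public
# Common Crawl URL index (CDX API). The collection name is an assumption;
# pick any crawl listed at https://index.commoncrawl.org/.
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # assumed collection

def lookup(url):
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        # The index returns one JSON object per line, one per capture.
        return [json.loads(line) for line in resp.read().decode().splitlines()]

for record in lookup("commoncrawl.org/"):
    print(record["timestamp"], record["filename"], record["offset"], record["length"])

Each returned record's filename, offset, and length can then be used in an HTTP range request against the public data bucket to retrieve just that WARC record rather than downloading an entire crawl segment.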


Web Crawler

What is the ccBot crawler?
The ccBot crawler is a distributed crawling infrastructure built on the Apache Hadoop project. We use MapReduce to process our crawl database and extract crawl candidates. The candidate list is sorted by host (domain name) and distributed to a set of spider (bot) servers. We do not use Nutch for crawling; instead, we use a custom crawl infrastructure so that we can strictly limit the rate at which we crawl individual web hosts. The resulting crawl data is post-processed (for link extraction and deduplication) and reintegrated into the crawl database.
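To illustrate the per-host rate limiting described above (this is a sketch of the general technique, not the actual ccBot implementation), a single fetch loop might group candidate URLs by host and enforce a minimum delay between successive requests to the same host; in a distributed setup, each spider server would run such a loop over its assigned hosts. The MIN_DELAY value and the crawl() helper are hypothetical.

# Illustrative sketch only -- not ccBot code. Shows candidates sorted by host
# and a minimum per-host delay between requests.
import time
from collections import defaultdict
from itertools import groupby
from urllib.parse import urlparse

MIN_DELAY = 5.0  # assumed minimum seconds between requests to the same host

def crawl(candidates, fetch):
    """Fetch candidate URLs, never hitting one host faster than MIN_DELAY."""
    last_hit = defaultdict(lambda: 0.0)                  # host -> time of last request
    by_host = sorted(candidates, key=lambda u: urlparse(u).netloc)
    for host, urls in groupby(by_host, key=lambda u: urlparse(u).netloc):
        for url in urls:
            wait = MIN_DELAY - (time.monotonic() - last_hit[host])
            if wait > 0:
                time.sleep(wait)                         # throttle this host
            last_hit[host] = time.monotonic()
            fetch(url)

# Example usage: crawl(["http://example.com/a", "http://example.com/b"], print)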

...