What do you intend to do with the crawled content?
Our mission is to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible. We store the crawl data on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2.
How can I use Common Crawl data?
What is the ccBot crawler?
The ccBot crawler is a distributed crawling infrastructure built on the Apache Hadoop project. We use MapReduce to process and extract crawl candidates from our crawl database. This candidate list is sorted by host (domain name) and then distributed to a set of spider (bot) servers. We do not use Nutch for crawling; instead, we use a custom crawl infrastructure that strictly limits the rate at which we crawl individual web hosts. The resulting crawl data is then post-processed (for link extraction and deduplication) and reintegrated into the crawl database.
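As a rough illustration of grouping crawl candidates by host before handing them to spider servers, here is a minimal sketch. It is not the actual ccBot code: the function name `partition_by_host`, the CRC32-based assignment, and the example URLs are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, num_servers):
    """Group candidate URLs by host and assign each host to one server.

    Hypothetical helper: keeping a host on a single server makes it
    possible to enforce a per-host crawl rate in one place.
    """
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    assignments = defaultdict(list)
    for host in sorted(by_host):  # candidate list sorted by host
        # Stable hash (assumption), so a host always maps to the same server.
        server = zlib.crc32(host.encode("utf-8")) % num_servers
        assignments[server].extend(by_host[host])
    return dict(assignments)


candidates = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
]
for server, batch in sorted(partition_by_host(candidates, 4).items()):
    print(server, batch)
```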
How does the bot identify itself?
Our bot will identify itself with the following User-Agent string: CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
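If you want to spot the bot in your access logs, a check along these lines may help. This is a minimal sketch: the function name `is_ccbot` is hypothetical, and matching on the `CCBot/` prefix (rather than the full string) is an assumption so that other version numbers would also match. A User-Agent header can be set by any client, so treat a match as a hint, not proof.

```python
def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header starts with the CCBot product token."""
    return user_agent.strip().lower().startswith("ccbot/")


print(is_ccbot("CCBot/1.0 (+http://www.commoncrawl.org/bot.html)"))
print(is_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```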
How often does the bot access pages?
We aim to build a system that can maintain a fresh crawl of the web, but, for now, our crawling aims are more modest, and we intend not to overtax anyone’s servers.
How can I ask for a slower crawl if the bot is taking up too much bandwidth?
We obey the crawl-delay robots.txt convention, so increasing that number tells ccBot to slow down its rate of crawling.
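For example, a robots.txt entry with a crawl-delay might look like the one embedded below. The sketch uses Python's standard `urllib.robotparser` only to show how such a file is read; the delay value of 10 is an arbitrary example, and the helper name `crawl_delay_for` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the value 10 is illustrative only.
ROBOTS_WITH_DELAY = """\
User-agent: ccbot
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, agent):
    """Return the crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


print(crawl_delay_for(ROBOTS_WITH_DELAY, "ccbot"))  # 10
```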
How can I block this bot?
You can block the crawler through your robots.txt file, which uses the Robots Exclusion Protocol. Our bot’s Exclusion User-Agent string is: ccbot.
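A robots.txt entry that blocks the bot from an entire site could look like the one embedded in this sketch. The check uses Python's standard `urllib.robotparser` as a stand-in for a compliant crawler; `example.com` and the helper name `is_allowed` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that disallows everything for ccbot only.
ROBOTS_BLOCK = """\
User-agent: ccbot
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_allowed(ROBOTS_BLOCK, "CCBot/1.0", "http://example.com/page.html"))  # False
print(is_allowed(ROBOTS_BLOCK, "OtherBot", "http://example.com/page.html"))   # True
```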
How can I ensure this bot can crawl my site effectively?
We are working hard to add features to the crawl system and hope to support the sitemap protocol in the future.
Does the bot support conditional gets/compression?
We do support conditional get requests. We also currently support the gzip encoding format.
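To show what this means for a site owner, here is a minimal sketch of how a server might answer a conditional GET and apply gzip. It is a simplification under stated assumptions: the function name `respond` is hypothetical, headers are a plain dict with exact-case keys, and only `If-Modified-Since` is handled (not `ETag`/`If-None-Match`).

```python
import gzip
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, request_headers, body):
    """Return (status, headers, payload) for a GET request.

    `last_modified` must be a timezone-aware UTC datetime.
    """
    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        # Unchanged since the client's copy: no body needs to be sent.
        return 304, {}, b""

    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        headers["Content-Encoding"] = "gzip"
        body = gzip.compress(body)
    return 200, headers, body


modified = datetime(2022, 1, 1, tzinfo=timezone.utc)
status, _, _ = respond(
    modified, {"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"}, b"<html></html>"
)
print(status)  # 304
```

A 304 response saves bandwidth on both sides, which is why conditional requests matter for a crawler that revisits pages.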
Why is the bot crawling pages I don’t have links to?
The bot may have found your pages by following links from other sites.
What is the IP range of the bot?
22.214.171.124 through 126.96.36.199
Does the bot support nofollow?
Currently, we do honor the nofollow attribute as it applies to links embedded on your site. Note that the nofollow attribute value is not meant for blocking access to content or preventing content from being indexed by search engines. Instead, the nofollow attribute is primarily used by site authors to prevent search engines such as Google from having the source page’s PageRank impact the PageRank of linked targets. If we ever did ignore nofollow in the future, we would do so only for the purposes of link discovery and would never create any association between the discovered link and the source document.
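Honoring nofollow during link extraction can be sketched as follows. This is an illustration using Python's standard `html.parser`, not the crawler's actual code; the class name `FollowableLinks` is hypothetical.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


extractor = FollowableLinks()
extractor.feed('<a href="/a">A</a> <a rel="nofollow" href="/b">B</a>')
print(extractor.links)  # ['/a']
```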
What parts of robots.txt does the bot support?
We support Disallow as well as Disallow / Allow combinations. We also support the crawl-delay directive. We plan to support the sitemap directive in a future release.
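A Disallow / Allow combination might look like the robots.txt content embedded below. The sketch checks it with Python's standard `urllib.robotparser`, which applies the first matching rule; how ccBot resolves overlapping rules is not stated here, so the Allow line is placed first as a cautious assumption. The paths and `example.com` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Example: block /private/ except for one subdirectory.
ROBOTS_COMBINED = """\
User-agent: ccbot
Allow: /private/press/
Disallow: /private/
"""

combined = RobotFileParser()
combined.parse(ROBOTS_COMBINED.splitlines())

for path in ("/private/press/release.html", "/private/notes.html", "/index.html"):
    print(path, combined.can_fetch("ccbot", "http://example.com" + path))
```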
What robots meta tags does the bot support?
We support the NOFOLLOW meta-tag.