About the Common Crawl Data Set
Overview
The Common Crawl data set contains approximately 6 billion web documents stored on a publicly accessible, scalable computer cluster. This page describes the content and storage of the data set.
File Locations
The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:
http://aws.amazon.com/datasets/41740
The data set is divided into three major subsets:
Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
Current Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
The two archived crawl data sets are stored in folders organized by the year, month, day, and hour the content was crawled. For example:
s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
The contents of this file were crawled starting on January 6, 2010 at 10 AM.
The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives. Crawl data is stored in a "segments" subfolder, then in a folder that starts with the UNIX timestamp of crawl start time. For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
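The segment name (1341690169105 above) is the crawl start time in milliseconds since the UNIX epoch, so it can be converted to a readable date. A minimal sketch:

```python
from datetime import datetime, timezone

def segment_start_time(segment_id: str) -> datetime:
    """Segment folder names are crawl start times in milliseconds
    since the UNIX epoch; convert one to a UTC datetime."""
    return datetime.fromtimestamp(int(segment_id) / 1000, tz=timezone.utc)

print(segment_start_time("1341690169105"))  # a date in July 2012
```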
The "crawl-intermediate", "merge-output", and "stats-output" subfolders are used for internal data processing, and, while publicly available, are currently not documented and are not considered part of the corpus.
File Types
The current crawl data set includes three different types of files: ARC raw content, Text Only, and Metadata.
The archived crawl data sets contain only ARC raw content files.
ARC Files - Raw Content
ARC files contain the full HTTP response and payload for all pages crawled. The ARC file format was designed by the Internet Archive. You can read more about this file format here:
http://archive.org/web/researcher/ArcFileFormat.php
ARC files are a series of concatenated GZIP documents. The first compressed member is an ARC file header, which usually looks like this:
filedesc://1341817173109_4.arc.gz 0.0.0.0 20120709065933 text/plain 73
1 0 CommonCrawl
URL IP-address Archive-date Content-type Archive-length
This file header lists the fields that are used in the record header of subsequent records: URL, IP Address, Archive Date, Content Type, Archive Length
The rest of the individually compressed members consist of an ARC record header, followed by the full HTTP response:
http://www.srlchem.com/products/ 74.55.84.98 20120518232759 text/html 28556
HTTP/1.1 200 OK
Server:nginx
Date:Fri, 18 May 2012 23:28:04 GMT
Content-Type:text/html
...
The ARC files reside in timestamp-based folders in the archived crawls, and in the segment folders in the current crawl. The files are named "*.arc.gz". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
The Metadata files, described below, contain offsets into the ARC files. You can use these offsets as a cross-reference between content metadata and the actual content.
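Because each record is an independently compressed GZIP member, a single record can be decompressed in isolation once its offset and compressed size are known (both appear in the Metadata files). A minimal sketch, using a synthetic record rather than real crawl data:

```python
import gzip

def read_arc_record(data: bytes, offset: int, size: int):
    """Decompress one ARC record (a standalone GZIP member) and split
    the one-line record header from the HTTP response that follows."""
    raw = gzip.decompress(data[offset:offset + size])
    header_line, _, http_response = raw.partition(b"\n")
    url, ip, date, mime, length = header_line.decode().split(" ")
    return {"url": url, "ip": ip, "archive_date": date,
            "content_type": mime, "length": int(length)}, http_response

# Synthetic record for illustration (not real crawl data):
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
header = b"http://example.com/ 93.184.216.34 20120518232759 text/html %d\n" % len(payload)
member = gzip.compress(header + payload)

record, response = read_arc_record(member, 0, len(member))
print(record["url"])  # http://example.com/
```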
Text Files - Text Only
Common Crawl also produces a series of text only files. These files take content returned as HTML or RSS and parse out just the text content - making it easier for researchers to perform text-based analysis.
Text Only files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are both Text. The key in these files is the URL, and the value is the actual text content. From HTML pages, the text content includes the page title, the page meta description content, and all text content from the HTML body. They are located in the segment directories, with a file name of "textData-nnnnn". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
The numbers at the end of the filename are sequentially assigned to Text Only files within the segment - they have no specific meaning.
Further, Text Only files are translated from their native character sets into UTF-8. All Text Only content (in all languages) can be read using the UTF-8 character set.
Currently, we are only producing Text Only files from HTML and RSS/Atom content. The Text Only files are on average 20% of the size of the raw content.
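Reading SequenceFiles normally requires Hadoop, but the file header itself is simple to inspect directly. The sketch below parses only the magic bytes, version, and key/value class names; it assumes class names shorter than 128 bytes (so the Hadoop VInt length fits in a single byte) and ignores the compression flags and sync marker that follow.

```python
def sequencefile_header(buf: bytes):
    """Parse the leading portion of a Hadoop SequenceFile header:
    the b'SEQ' magic, a one-byte version, then the key and value
    class names (VInt length + UTF-8 bytes each)."""
    if buf[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = buf[3]
    pos = 4

    def read_string(pos):
        n = buf[pos]  # VInt: a single byte for lengths under 128
        return buf[pos + 1:pos + 1 + n].decode("utf-8"), pos + 1 + n

    key_class, pos = read_string(pos)
    value_class, pos = read_string(pos)
    return version, key_class, value_class

# Synthetic header matching what the Text Only files use:
cls = b"org.apache.hadoop.io.Text"
header = b"SEQ\x06" + bytes([len(cls)]) + cls + bytes([len(cls)]) + cls
print(sequencefile_header(header))  # version 6, Text keys and values
```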
Metadata
In addition to content files, Common Crawl produces a series of Metadata files that provide useful information about the crawled content. For each URL, the Metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.
Most importantly for some users, the Metadata files contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags). Users can scan the metadata files to pick up extracted links rather than extracting the links themselves.
Records in the Metadata files are in the same order and have the same file numbers as the Text Only content.
Metadata files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are also both Text. The key in these files is the URL, and the value is a JSON structure of fields and subfields - the full structure is defined below. Just like the Text Only files, the Metadata files are located in the segment directories, with a file name of "metadata-nnnnn". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/metadata-00112
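Since the metadata value is plain JSON, extracting the outbound links of a page takes only a few lines. A sketch, using a hypothetical record shaped like the schema documented below:

```python
import json

def extract_links(metadata_json: str):
    """Return the href of every link the parser found in the document."""
    record = json.loads(metadata_json)
    content = record.get("content", {})
    return [link["href"] for link in content.get("links", []) if "href" in link]

# Hypothetical metadata value, shaped like the documented schema:
value = json.dumps({
    "attempt_time": 1341826131693,
    "disposition": "SUCCESS",
    "http_result": 200,
    "parsed_as": "html",
    "content": {
        "type": "html-doc",
        "title": "Example",
        "links": [
            {"type": "a", "href": "http://example.com/a", "text": "A"},
            {"type": "img", "href": "http://example.com/logo.png", "text": ""},
        ],
    },
})
print(extract_links(value))
```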
Metadata Schema
The Metadata file JSON contains the following attributes.
All attributes are considered optional, and are only present when they apply.
General Attributes
The following attributes apply to all types of content:
Attribute Name | Attribute Description | Available |
---|---|---|
attempt_time | The time (in UNIX time format) that the crawl of this page was attempted. | always |
disposition | SUCCESS if the crawler received a successful HTTP response; FAILURE if not. | always |
failure_reason | A code representing why the crawl of this page failed. | on failure |
failure_detail | A message, if available, on why the crawl of this page failed. | on failure |
server_ip | The IP address of the server that returned the response. | on success |
http_result | The HTTP result code. | on success |
http_headers | A JSON object containing all returned HTTP headers as key/value pairs. | on success |
redirect_from | A JSON object representing the source URL and document from which the crawler was redirected to this URL. | on redirect |
content_len | The value of the Content-Length HTTP header. | on success |
mime_type | The value of the Content-Type HTTP header (stripped of the charset). | on success |
download_size | The actual size of the downloaded content. | on success |
content_is_gzip | Optional attribute that specifies that source content was gzip'd. | if payload is gzip'd |
gunzip_content_len | If the content was gzip'd, this is the decompressed length of the incoming content. | if payload is gzip'd |
md5 | The md5 hash of the downloaded content. | on success |
text_simhash | The 64-bit simhash of the text (UCS-2) content if document was a valid text type. | if payload is text |
charset_detected | The character set Common Crawl detected for the downloaded content. | on success |
charset_detector | How the character set was determined: 0 - from the HTTP headers; 1 - from Meta tags; 2 - the ICU detector was used; 3 - the Mozilla detector was used; 10 - no match, ISO-8859-1 assumed. | on success |
parsed_as | html - Downloaded content was parsed as HTML. feed - Downloaded content was parsed as an RSS/Atom feed. | on success |
content | If the HTTP response code was 20x, and the downloaded content was parsed as HTML or a feed, a JSON object representing the document metadata. | HTTP Code = 20x |
archiveInfo | A JSON object with information about where the content for this retrieved URL can be found. | on success |
archiveInfo > arcSourceSegmentId | The segment that contains the ARC file in which the content for this record is stored. i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success |
archiveInfo > arcFileDate | The date prefix of the ARC file in which the content for this record is stored. i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success |
archiveInfo > arcFileParition | The partition ID of the ARC file in which the content for this record is stored. i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success |
archiveInfo > arcFileOffset | The byte offset at which the ARC file record is stored. | on success |
archiveInfo > compressedSize | The compressed size of the ARC file record associated with this URL. | on success |
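The archiveInfo fields above can be combined to locate the raw content in an ARC file. A minimal sketch (the field values are made up for illustration; note that the field name "arcFileParition" is spelled as documented):

```python
def arc_location(archive_info: dict):
    """Build the S3 path of the ARC file holding a record, plus the
    byte range of its compressed GZIP member within that file."""
    path = ("s3://aws-publicdatasets/common-crawl/parse-output/segment/"
            "{arcSourceSegmentId}/{arcFileDate}_{arcFileParition}.arc.gz"
            .format(**archive_info))
    start = archive_info["arcFileOffset"]
    end = start + archive_info["compressedSize"] - 1  # inclusive HTTP Range
    return path, (start, end)

# Hypothetical values for illustration:
info = {"arcSourceSegmentId": 1341690169105, "arcFileDate": 1341826131693,
        "arcFileParition": 45, "arcFileOffset": 1024, "compressedSize": 28556}
path, byte_range = arc_location(info)
print(path)
print(byte_range)  # (1024, 29579)
```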
HTML Content Attributes
The "content" JSON object of an HTML document can contain the following fields:
Attribute Name | Attribute Description |
---|---|
content > type | Always "html-doc". |
content > title | The value of the HTML "title" tag. |
content > meta_tags | A JSON array of objects representing each "meta" tag found by the parser. Note: If the "meta" tag uses a "property" attribute instead of a "name" attribute, "property" is used as the key. |
content > links | A JSON array of objects representing each link found by the parser. |
content > links > type | The HTML tag type that the link was found in. Examples: a, area, frame, iframe, script, img, link, etc. |
content > links > href | The URL associated with the tag, usually from the "href" attribute. |
content > links > text | The text displayed for the link. Usually the value of the link element. |
content > links > * | Every other attribute of the original link tag is provided as a property/value pair. |
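Because a "meta" tag may be keyed by either "name" or "property" (as noted above), a small normalization helper is useful when scanning meta_tags. A sketch with made-up tag objects:

```python
def meta_tags_to_dict(meta_tags: list) -> dict:
    """Collapse the meta_tags array into a name -> value mapping,
    accepting either a "name" or a "property" key per tag."""
    result = {}
    for tag in meta_tags:
        key = tag.get("name") or tag.get("property")
        if key is not None:
            result[key] = tag.get("value")
    return result

# Hypothetical tag objects for illustration:
tags = [
    {"name": "description", "value": "A page about ARC files."},
    {"property": "og:title", "value": "ARC Files"},
]
print(meta_tags_to_dict(tags))
```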
RSS Content Attributes
The "content" JSON object of an RSS feed document can contain the following fields:
Attribute Name | Attribute Value |
---|---|
content > type | Always "rss-feed". |
content > title | The value of the feed "title" element. |
content > link | The value of the feed "link" element. |
content > description | The value of the feed "description" element. |
content > updated | The later of either the "lastBuildDate" or the "pubDate" elements. |
content > generator | The value of the feed "generator" element. |
content > ttl | The value of the feed "ttl" element. |
content > categories | A JSON array of category names associated with the feed. |
content > items | A JSON array of objects representing each feed item. |
content > items > title | The value of the item "title" element. |
content > items > description | The value of the item "description" element. |
content > items > link | The value of the item "link" element. |
content > items > author | The value of the item "author" element. |
content > items > comments | The value of the item "comments" element. A URL where users can comment on the feed item. |
content > items > published | The value of the item "pubDate" element. |
content > items > guid | The value of the item "GUID" element. A unique identifier for the feed item. |
content > items > categories | A JSON array of category names associated with the item. |
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item. |
Atom Content Attributes
The "content" JSON object of an Atom feed document can contain the following fields:
Attribute Name | Attribute Value |
---|---|
content > type | Always "atom-feed". |
content > title | The value of the feed "title" element, stripped of any HTML. |
content > link | A JSON object representing the feed rel=alternate "link" element. |
content > description | The value of the feed "description" element. |
content > updated | The value of the feed "updated" element. |
content > generator | The value of the feed "generator" element. |
content > authors | A JSON array of authors associated with the feed. |
content > categories | A JSON array of category names associated with the feed. |
content > items | A JSON array of objects representing each feed item. |
content > items > title | The value of the item "title" element, stripped of any HTML. |
content > items > description | The value of the item "description" element. |
content > items > link | A JSON object or array of objects representing the item rel=alternate "link" elements. |
content > items > self | A JSON object or array of objects representing the item rel=self "link" elements. |
content > items > replies | A JSON object or array of objects representing the item rel=replies "link" elements. |
content > items > authors | A JSON array of objects representing the item authors. |
content > items > published | The value of the item "published" element. |
content > items > updated | The value of the item "updated" element. |
content > items > categories | A JSON array of category names associated with the item. |
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item. |
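Several feed fields above are "a JSON object or array of objects" depending on how many elements the feed contained, so consumers should normalize before iterating. A sketch, using a made-up Atom feed item:

```python
def as_list(value):
    """Normalize a field documented as "a JSON object or array of
    objects" into a list (empty if the field is absent)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Hypothetical Atom feed item for illustration:
item = {
    "title": "My Latest Blog Entries",
    "link": {"type": "text/html", "href": "http://example.com/post", "rel": "alternate"},
}
hrefs = [link["href"] for link in as_list(item.get("link"))]
print(hrefs)  # ['http://example.com/post']
```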