About the Data Set

Overview

The Common Crawl data set contains approximately 6 billion web documents stored on a publicly accessible, scalable computer cluster.  This page describes the content and storage of the data set.

File Locations

The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:

  http://aws.amazon.com/datasets/41740

The data set is divided into three major subsets:

  Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  Current Crawl - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012

The two archived crawl data sets are stored in folders organized by the year, month, day, and hour the content was crawled.  For example:

  s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz

The contents of this file were crawled starting on January 6th, 2010 at 10 AM.
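
As an illustration of the folder layout above, the following minimal Python sketch recovers the crawl hour from an archived crawl path.  The function name and the regular expression are assumptions made for this example, and the time zone of the folder names is not stated here, so the result is left as a naive datetime.

```python
import re
from datetime import datetime

# Pattern for archived-crawl keys: .../crawl-NNN/YYYY/MM/DD/HH/<file>.arc.gz
# (an assumption based on the example path in this document)
_ARCHIVED = re.compile(r"/crawl-\d+/(\d{4})/(\d{2})/(\d{2})/(\d{2})/[^/]+\.arc\.gz$")

def archived_crawl_hour(path):
    """Return the crawl hour encoded in the folder names, or None if the path does not match."""
    match = _ARCHIVED.search(path)
    if match is None:
        return None
    year, month, day, hour = (int(part) for part in match.groups())
    return datetime(year, month, day, hour)

example = "s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz"
print(archived_crawl_hour(example))  # 2010-01-06 10:00:00
```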

The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives.  Crawl data is stored in a "segment" subfolder, then in a folder whose name starts with the UNIX timestamp (in milliseconds) of the crawl start time.  For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
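
The segment folder name can be turned back into a date.  The sketch below assumes the 13-digit folder name is an epoch timestamp in milliseconds and that it should be read as UTC; the function name is made up for this example.

```python
from datetime import datetime, timezone

def segment_start_time(path):
    """Read the folder name after "segment" as epoch milliseconds (an assumption)."""
    parts = path.split("/")
    segment_id = parts[parts.index("segment") + 1]
    # Integer division drops the millisecond part and avoids float rounding.
    return datetime.fromtimestamp(int(segment_id) // 1000, tz=timezone.utc)

example = "s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz"
print(segment_start_time(example))  # 2012-07-07 19:42:49+00:00
```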

The "crawl-intermediate", "merge-output", and "stats-output" subfolders are used for internal data processing, and, while publicly available, are currently not documented and are not considered part of the corpus.

File Types

The current crawl data set includes three different types of files:  ARC raw content, Text Only, and Metadata.

The archived crawl data sets contain only ARC raw content files.

ARC Files - Raw Content

ARC files contain the full HTTP response and payload for all pages crawled. The ARC file format was designed by the Internet Archive.  You can read more about this file format here:

  http://archive.org/web/researcher/ArcFileFormat.php

ARC files are a series of concatenated GZIP documents.  The first compressed member is an ARC file header, which usually looks like this:

filedesc://1341817173109_4.arc.gz 0.0.0.0 20120709065933 text/plain 73
1 0 CommonCrawl
URL IP-address Archive-date Content-type Archive-length

This file header lists the fields that are used in the record header of subsequent records:  URL, IP Address, Archive Date, Content Type, and Archive Length.

The rest of the individually compressed members consist of an ARC record header, followed by the full HTTP response:

http://www.srlchem.com/products/ 74.55.84.98 20120518232759 text/html 28556
HTTP/1.1 200 OK
Server:nginx
Date:Fri, 18 May 2012 23:28:04 GMT
Content-Type:text/html
...
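
A record header line like the one above can be split into its five fields.  This is a minimal sketch: the names ArcHeader and parse_arc_header are made up for the example, and it assumes the content type contains no spaces and the archive date uses the YYYYMMDDhhmmss layout seen in the sample.

```python
from collections import namedtuple
from datetime import datetime

ArcHeader = namedtuple("ArcHeader", "url ip_address archive_date content_type archive_length")

def parse_arc_header(line):
    """Split one ARC record header line into its five space-separated fields."""
    # Split from the right so the last four fields are always IP, date, type, and length.
    url, ip_address, date, content_type, length = line.strip().rsplit(" ", 4)
    archive_date = datetime.strptime(date, "%Y%m%d%H%M%S")
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))

header = parse_arc_header("http://www.srlchem.com/products/ 74.55.84.98 20120518232759 text/html 28556")
print(header.url, header.archive_length)  # http://www.srlchem.com/products/ 28556
```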


The ARC files reside in timestamp-based folders in the archived crawls, and in the segment folders in the current crawl.  These files are named "*.arc.gz".  For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz

The Metadata files, described below, contain offsets into the ARC files.  You can use these offsets as a cross-reference between content metadata and the actual content.
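
Because each record is an individually compressed GZIP member, a record can be decompressed on its own once its byte range is known.  The sketch below assumes that the offset points at the first byte of the member and that the compressed size covers exactly that member; it uses synthetic in-memory data rather than a real ARC file, and the function name is made up for the example.

```python
import gzip

def read_arc_record(arc_bytes, offset, compressed_size):
    """Decompress the single gzip member that starts at `offset`."""
    member = arc_bytes[offset:offset + compressed_size]
    return gzip.decompress(member)

# Build a tiny two-member stand-in for an ARC file (synthetic data, not real crawl content).
first = gzip.compress(b"filedesc://example.arc.gz 0.0.0.0 20120709065933 text/plain 0\n")
second = gzip.compress(b"http://example.com/ 192.0.2.1 20120518232759 text/html 5\nhello")
arc_bytes = first + second

record = read_arc_record(arc_bytes, offset=len(first), compressed_size=len(second))
print(record.split(b"\n", 1)[0].decode("ascii"))
```

With a real file, the same slice could be fetched with a ranged read instead of loading the whole ARC file into memory.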


Text Files - Text Only

Common Crawl also produces a series of text only files.  These files take content returned as HTML or RSS and parse out just the text content - making it easier for researchers to perform text-based analysis.

Text Only files are saved as Hadoop SequenceFiles using GZIP compression.  The key and value data types are both Text.  The key in these files is the URL, and the value is the actual text content.  From HTML pages, the text content includes the page title, the page meta description content, and all text content from the HTML body.  They are located in the segment directories, with a file name of "textData-nnnnn".  For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112

The numbers at the end of the filename are sequentially assigned to Text Only files within the segment - they have no specific meaning.

Further, Text Only files are translated from their native character sets into UTF-8.  All Text Only content (in all languages) can be read using the UTF-8 character set.

Currently, we are only producing Text Only files from HTML and RSS/Atom content.  The Text Only files are on average 20% of the size of the raw content.


Metadata

In addition to content files, Common Crawl produces a series of Metadata files that provide useful information about the crawled content.  For each URL, the Metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.

Most importantly for some users, the Metadata files contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags).  Users can scan the metadata files to pick up extracted links rather than extracting the links themselves.

Records in the Metadata files are in the same order and have the same file numbers as the Text Only content.

Metadata files are saved as Hadoop SequenceFiles using GZIP compression.  The key and value data types are also both Text.  The key in these files is the URL, and the value is a JSON structure of fields and subfields - the full structure is defined below.  Just like the Text Only files, the Metadata files are located in the segment directories, with a file name of "metadata-nnnnn".  For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/metadata-00112

Metadata Schema

The Metadata file JSON contains the following attributes.

All attributes are considered optional, and are only present when they apply.


General Attributes

The following attributes apply to all types of content:

Attribute Name | Attribute Description | Available
attempt_time | The time (in UNIX time format) that the crawl of this page was attempted. | always
disposition | SUCCESS if the crawler received a successful HTTP response; FAILURE if not. | always
failure_reason | A code representing why the crawl of this page failed. | on failure
failure_detail | A message, if available, on why the crawl of this page failed. | on failure
server_ip | The IP address of the server that returned the response. | on success
http_result | The HTTP result code. | on success
http_headers | A JSON object containing all returned HTTP headers as key/value pairs. | on success
redirect_from |  | if URL was redirected
content_len | The value of the Content-Length HTTP header. | on success
mime_type | The value of the Content-Type HTTP header (stripped of the charset). | on success
download_size | The actual size of the downloaded content. | on success
content_is_gzip | Optional attribute that specifies that source content was gzip'd. | if payload is gzip'd
gunzip_content_len | If the content was gzip'd, this is the decompressed length of the incoming content. | if payload is gzip'd
md5 | The md5 hash of the downloaded content. | on success
text_simhash | The 64-bit simhash of the text (UCS-2) content if document was a valid text type. | if payload is text
charset_detected | The character set Common Crawl detected for the downloaded content. | on success
charset_detector | How the character set was derived: 0 - from an HTTP header; 1 - from an HTML "meta" tag; 2 - from the ICU detector; 3 - from the Mozilla detector; 10 - could not be determined, ISO-8859-1 is assumed. | on success
parsed_as | html - Downloaded content was parsed as HTML; feed - Downloaded content was parsed as an RSS/Atom feed. | on success
content | If the HTTP response code was 20x, and if the downloaded content was parsed as HTML or as Feed, a JSON object representing the document's metadata. | HTTP Code = 20x
archiveInfo | A JSON object with information about where the content for this retrieved URL can be found. | on success
archiveInfo > arcSourceSegmentId | The segment that contains the ARC file in which the content for this record is stored, i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success
archiveInfo > arcFileDate | The date prefix of the ARC file in which the content for this record is stored, i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success
archiveInfo > arcFileParition | The partition ID of the ARC file in which the content for this record is stored, i.e. ../parse-output/segment/[arcSourceSegmentId]/[arcFileDate]_[arcFileParition].arc.gz | on success
archiveInfo > arcFileOffset | The byte offset at which the ARC file record is stored. | on success
archiveInfo > compressedSize | The compressed size of the ARC file record associated with this URL. | on success
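
The archiveInfo attributes can be combined into the location of the raw record.  The following sketch builds the ARC file path from the path template given for arcSourceSegmentId; the sample JSON value, the offset and size numbers in it, and the function name are invented for illustration, and whether the real values are JSON numbers or strings is not stated here.

```python
import json

# Base folder taken from the example paths in this document.
BASE = "s3://aws-publicdatasets/common-crawl/parse-output/segment"

def arc_location(metadata_json):
    """Return (ARC file path, byte offset, compressed size) for one metadata value."""
    info = json.loads(metadata_json)["archiveInfo"]
    path = "{}/{}/{}_{}.arc.gz".format(
        BASE,
        info["arcSourceSegmentId"],
        info["arcFileDate"],
        info["arcFileParition"],  # spelled as in the attribute list
    )
    return path, info["arcFileOffset"], info["compressedSize"]

# Hypothetical metadata value; the offset and size are made-up numbers.
sample = json.dumps({
    "disposition": "SUCCESS",
    "archiveInfo": {
        "arcSourceSegmentId": 1341690169105,
        "arcFileDate": 1341826131693,
        "arcFileParition": 45,
        "arcFileOffset": 1024,
        "compressedSize": 2048,
    },
})
print(arc_location(sample)[0])
```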


HTML Content Attributes

The "content" JSON object of an HTML document can contain the following fields:

Attribute Name | Attribute Description
content > type | Always "html-doc".
content > title | The value of the HTML "title" tag.
content > meta_tags | A JSON array of objects representing each "meta" tag found by the parser.  Note:  If the "meta" tag uses a "property" attribute instead of a "name" attribute, "property" is used as the key.
content > links | A JSON array of objects representing each link found by the parser.
content > links > type | The HTML tag type that the link was found in.  Examples:  a, area, frame, iframe, script, img, link, etc.
content > links > href | The URL associated with the tag, usually from the "href" attribute.
content > links > text | The text displayed for the link.  Usually the value of the link element.
content > links > * | Every attribute of the link tag is provided.
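
To show how the extracted links might be consumed, here is a minimal sketch that pulls the "href" of every "a" link out of one metadata value.  The sample JSON is hypothetical and only uses attribute names listed above; the function name is made up for the example.

```python
import json

def anchor_hrefs(metadata_json):
    """Return the href of every link of type "a" in an HTML metadata value."""
    record = json.loads(metadata_json)
    content = record.get("content") or {}  # all attributes are optional
    if content.get("type") != "html-doc":
        return []
    return [
        link["href"]
        for link in content.get("links", [])
        if link.get("type") == "a" and "href" in link
    ]

# Hypothetical metadata value for illustration only.
sample = json.dumps({
    "disposition": "SUCCESS",
    "content": {
        "type": "html-doc",
        "title": "Products",
        "links": [
            {"type": "a", "href": "http://example.com/about", "text": "About"},
            {"type": "img", "href": "http://example.com/logo.png"},
        ],
    },
})
print(anchor_hrefs(sample))  # ['http://example.com/about']
```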


RSS Content Attributes

The "content" JSON object of an RSS feed document can contain the following fields:

Attribute Name | Attribute Value
content > type | Always "rss-feed".
content > title | The value of the feed "title" element.
content > link | The value of the feed "link" element.
content > description | The value of the feed "description" element.
content > updated | The later of either the "lastBuildDate" or the "pubDate" elements.
content > generator | The value of the feed "generator" element.
content > ttl | The value of the feed "ttl" element.
content > categories | A JSON array of category names associated with the feed.
content > items | A JSON array of objects representing each feed item.
content > items > title | The value of the item "title" element.
content > items > description | The value of the item "description" element.
content > items > link | The value of the item "link" element.
content > items > author | The value of the item "author" element.
content > items > comments | The value of the item "comments" element.  A URL where users can comment on the feed item.
content > items > published | The value of the item "pubDate" element.
content > items > guid | The value of the item "GUID" element.  A unique identifier for the feed item.
content > items > categories | A JSON array of category names associated with the item.
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item.
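
The "updated" attribute is described as the later of two feed dates.  The sketch below shows one way such a comparison could be made, assuming both dates use the RFC 822 style that RSS feeds normally carry; it is not necessarily how the crawler computes the value, and the function name is made up for the example.

```python
from email.utils import parsedate_to_datetime

def later_of(last_build_date, pub_date):
    """Return whichever of the two date strings is later, ignoring missing values."""
    candidates = [value for value in (last_build_date, pub_date) if value]
    if not candidates:
        return None
    # Compare by parsed time, but return the original string.
    return max(candidates, key=parsedate_to_datetime)

print(later_of("Mon, 09 Jul 2012 06:59:33 GMT", "Sun, 08 Jul 2012 10:00:00 GMT"))
# Mon, 09 Jul 2012 06:59:33 GMT
```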


Atom Content Attributes

The "content" JSON object of an Atom feed document can contain the following fields:

Attribute Name | Attribute Value
content > type | Always "atom-feed".
content > title | The value of the feed "title" element, stripped of any HTML.
content > link | A JSON object representing the feed rel=alternate "link" element.
content > description | The value of the feed "description" element.
content > updated | The value of the feed "updated" element.
content > generator | The value of the feed "generator" element.
content > authors | A JSON array of authors associated with the feed.
content > categories | A JSON array of category names associated with the feed.
content > items | A JSON array of objects representing each feed item.
content > items > title | The value of the item "title" element, stripped of any HTML.
content > items > description | The value of the item "description" element.
content > items > link | A JSON object or array of objects representing the item rel=alternate "link" elements.
content > items > self | A JSON object or array of objects representing the item rel=self "link" elements.
content > items > replies | A JSON object or array of objects representing the item rel=replies "link" elements.
content > items > authors | A JSON array of objects representing the item authors.
content > items > published | The value of the item "published" element.
content > items > updated | The value of the item "updated" element.
content > items > categories | A JSON array of category names associated with the item.
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item.
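
Several Atom attributes are described as "a JSON object or array of objects", so a reader has to handle both shapes.  The following sketch normalizes such a value to a list before use.  The helper names are made up, and the "href" key inside each link object is an assumption, since the fields of the link objects are not listed here.

```python
def as_list(value):
    """Wrap a single JSON object in a list; pass lists through; treat a missing value as empty."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def item_link_hrefs(item):
    """Collect the "href" (assumed key) of each rel=alternate link of one feed item."""
    return [link["href"] for link in as_list(item.get("link")) if "href" in link]

# Hypothetical items: one with a single link object, one with an array, one with none.
single = {"title": "Post A", "link": {"href": "http://example.com/a"}}
several = {"title": "Post B", "link": [{"href": "http://example.com/b"}, {"href": "http://example.com/b2"}]}
missing = {"title": "Post C"}

for item in (single, several, missing):
    print(item["title"], item_link_hrefs(item))
```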