About the Common Crawl Data Set

  • number of years
  • size
  • link back to website

    Overview

    The Common Crawl data set contains

    File Locations

    The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:

      http://aws.amazon.com/datasets/41740

    The data set is broken down into three subsets:

      Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
      Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
      Crawl #3 - Current Crawl - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012

    The two archived crawl data sets are stored in folders organized by the year, month, date, and hour the content was crawled.  For example:

      s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz

    The contents of this file were crawled starting on January 6th, 2010 at 10 AM.
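
    As a rough illustration of the folder layout described above, the following Python sketch pulls the crawl name and crawl hour out of an archived-crawl path. The function name parse_archived_path and the regular expression are illustrative assumptions, not part of any Common Crawl tooling.

```python
from datetime import datetime
import re

# Layout taken from the example above: .../<crawl>/<YYYY>/<MM>/<DD>/<HH>/<file>.arc.gz
ARCHIVED_PATH = re.compile(
    r"/common-crawl/(?P<crawl>crawl-\d+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/"
    r"(?P<name>[^/]+\.arc\.gz)$"
)

def parse_archived_path(path):
    """Return (crawl, crawl_hour, file_name) for an archived-crawl ARC path."""
    match = ARCHIVED_PATH.search(path)
    if match is None:
        raise ValueError("not an archived crawl path: %r" % path)
    crawl_hour = datetime(
        int(match["year"]), int(match["month"]), int(match["day"]), int(match["hour"])
    )
    return match["crawl"], crawl_hour, match["name"]

crawl, crawl_hour, name = parse_archived_path(
    "s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz"
)
print(crawl, crawl_hour.isoformat(), name)
```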

    The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives.  Crawl data is stored in a "segment" subfolder, then in a folder whose name starts with the UNIX timestamp of the crawl start time.  For example:

      s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/
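
    The segment folder name can be turned back into a date with a few lines of Python. This is only a sketch: it assumes the 13-digit value is a UNIX timestamp in milliseconds, which the text above does not state explicitly, and segment_start_time is a hypothetical helper name.

```python
from datetime import datetime, timezone

def segment_start_time(segment_path):
    """Interpret the segment folder name as a UNIX timestamp in milliseconds (assumption)."""
    segment_id = segment_path.rstrip("/").rsplit("/", 1)[-1]
    return datetime.fromtimestamp(int(segment_id) / 1000, tz=timezone.utc)

start = segment_start_time(
    "s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/"
)
print(start.isoformat())
```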

    The "crawl-intermediate", "merge-output", and "stats-output" subfolders are used for internal data processing, and, while publicly available, are currently not documented and are not considered part of the corpus.

    File Types

    The current crawl data set includes three different types of files:  ARC raw content, Text Only, and Metadata.

    The archived crawl data sets contain only ARC raw content files.

    ARC Files - Raw Content

    ARC files contain the full HTTP response & payload for all pages crawled.

    ARC files are located in the segment directories and are named "*.arc.gz".  For example:

      s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz

     The ARC file format was designed by the Internet Archive in 1996.  You can read more about this file format here:

      http://archive.org/web/researcher/ArcFileFormat.php
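
    To give a feel for what reading an ARC file involves, here is a minimal Python sketch. It assumes the version-1 record layout from the specification linked above (a one-line header of space-separated fields ending in the payload length, followed by the payload), and it runs against a small in-memory sample rather than a real crawl file. The name iter_arc_records is hypothetical, and the sketch does not handle malformed headers such as URLs containing spaces.

```python
import gzip
import io

def iter_arc_records(stream):
    """Yield (header_fields, payload) pairs from a decompressed ARC byte stream.

    Assumes version-1 record headers: "URL IP-address archive-date content-type length".
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank line separating records
        fields = line.decode("latin-1").rstrip("\r\n").split(" ")
        length = int(fields[-1])  # the last header field is the payload length in bytes
        yield fields, stream.read(length)

# Hypothetical in-memory sample standing in for a downloaded *.arc.gz file.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
header = ("http://example.com/ 192.0.2.1 20100106100000 text/html %d\n" % len(body)).encode("ascii")
sample = gzip.compress(header + body + b"\n")

with gzip.open(io.BytesIO(sample), "rb") as arc:
    records = list(iter_arc_records(arc))

for fields, payload in records:
    print(fields[0], len(payload))
```

    The payload of each record is the full HTTP response, so the HTTP headers still need to be split from the body before the page content can be used.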

    Text Files - Text Only

    Common Crawl also produces a series of text only files.  These files take content returned as HTML or RSS and parse out just the text content - making it easier for researchers to perform text-based analysis.

    Text Only files are saved as Hadoop SequenceFiles using GZIP compression.  The key and value data types are both Text.  The key in these files is the URL, and the value is the actual text content.  From HTML pages, the text content includes the page title, the page meta description content, and all text content from the HTML body.  They are located in the segment directories, with a file name of "textData-nnnnn".  For example:

      s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112

    The numbers at the end of the filename are sequentially assigned to Text Only files within the segment - they have no specific meaning.

    Further, Text Only files are translated from their native character sets into UTF-8.  All Text Only content (in all languages) can be read using the UTF-8 character set.

    Currently, we are only producing Text Only files from HTML and RSS/Atom content.  The Text Only files are on average 20% of the size of the raw content.

    Metadata

    In addition to content files, Common Crawl produces a series of metadata files that provide useful information about the crawled content.  For each URL, the

    Records in the Metadata files are in the same order and have the same file numbers as the Text Only content.

    Metadata files are saved as Hadoop SequenceFiles using GZIP compression.  The key and value data types are also both Text.  The key in these files is the URL, and the value is a JSON structure of fields and subfields - the full structure is defined below.  Just like the Text Only files, the Metadata files are located in the segment directories, with a file name of "metadata-nnnnn".  For example:

      s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/metadata-00112

    Currently, we are only producing Metadata from HTML and RSS/Atom content.  
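
    Because Metadata files share their file numbers with the Text Only files, the matching Metadata path can be derived from a Text Only path. The helper below is a small sketch of that mapping; metadata_path_for is a hypothetical name, and it relies only on the "textData-nnnnn" and "metadata-nnnnn" naming described above.

```python
def metadata_path_for(text_path):
    """Map a Text Only file path to the Metadata path with the same file number."""
    folder, _, name = text_path.rpartition("/")
    prefix, _, number = name.partition("-")
    if prefix != "textData" or not number.isdigit():
        raise ValueError("not a Text Only file path: %r" % text_path)
    return "%s/metadata-%s" % (folder, number)

print(metadata_path_for(
    "s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112"
))
```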
     

    Metadata Schema

    The Metadata file JSON contains the following attributes.

    All attributes are considered optional, and are only present when they apply.

    General Attributes

    The following attributes apply to all types of content:

    Attribute Name | Attribute Description | Available
    attempt_time | The time (in UNIX time format) that the crawl of this page was attempted. | always
    disposition | SUCCESS if the crawler received a successful HTTP response; FAILURE if not. | always
    failure_reason | A code representing why the crawl of this page failed. | on failure
    failure_detail | A message, if available, on why the crawl of this page failed. | on failure
    server_ip | The IP address of the server that returned the response. | on success
    http_result | The HTTP result code. | on success
    http_headers | A JSON object containing all returned HTTP headers as key/value pairs. | on success
    redirect_from |  | if URL was redirected
    content_len | The value of the Content-Length HTTP header. | on success
    mime_type | The value of the Content-Type HTTP header (stripped of the charset). | on success
    download_size | The actual size of the downloaded content. | on success
    content_is_gzip | Optional attribute that specifies that source content was gzip'd. | if payload is gzip'd
    gunzip_content_len | If the content was gzip'd, this is the decompressed length of the incoming content. | if payload is gzip'd
    md5 | The md5 hash of the downloaded content. | on success
    text_simhash | The 64-bit simhash of the text (UCS-2) content if document was a valid text type. | if payload is text
    charset_detected | The character set Common Crawl detected for the downloaded content. | on success
    charset_detector | 0 - The character set was derived from an HTTP header.  1 - The character set was derived from an HTML "meta" tag.  2 - The character set was derived from the ICU detector.  3 - The character set was derived from the Mozilla detector.  10 - The character set could not be determined; ISO-8859-1 is assumed. | on success
    parsed_as | html - Downloaded content was parsed as HTML.  feed - Downloaded content was parsed as an RSS/Atom feed. | on success
    content | If the HTTP response code was 20x, and if the downloaded content was parsed as HTML or as Feed, a JSON object representing the document's metadata. | HTTP Code = 20x
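
    The general attributes can be read with any JSON parser. The Python sketch below summarizes one Metadata value using only attribute names from the table above. The sample record is made up for illustration, summarize is a hypothetical helper, and treating charset_detector as a JSON number is an assumption (it may be stored as a string).

```python
import json

# Meaning of each charset_detector code, as listed in the table above.
CHARSET_DETECTORS = {
    0: "HTTP header",
    1: "HTML meta tag",
    2: "ICU detector",
    3: "Mozilla detector",
    10: "undetermined (ISO-8859-1 assumed)",
}

def summarize(url, metadata_json):
    """Return a one-line summary of a Metadata record; every attribute is optional."""
    meta = json.loads(metadata_json)
    if meta.get("disposition") != "SUCCESS":
        return "%s: failed (%s)" % (url, meta.get("failure_reason", "unknown"))
    detector = CHARSET_DETECTORS.get(meta.get("charset_detector"), "unknown")
    return "%s: HTTP %s, %s, charset %s via %s" % (
        url,
        meta.get("http_result"),
        meta.get("mime_type"),
        meta.get("charset_detected"),
        detector,
    )

# Hypothetical record; the values are made up for illustration.
sample = json.dumps({
    "attempt_time": 1341826131693,
    "disposition": "SUCCESS",
    "http_result": 200,
    "mime_type": "text/html",
    "charset_detected": "UTF-8",
    "charset_detector": 1,
})
print(summarize("http://example.com/", sample))
```

    Using dict.get throughout reflects the rule that all attributes are optional and only present when they apply.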


    HTML Content Attributes

    The "content" JSON object of an HTML document can contain the following fields:

    Attribute Name | Attribute Description
    content > type | Always "html-doc".
    content > title | The value of the HTML "title" tag.
    content > meta_tags | A JSON array of objects representing each "meta" tag found by the parser.  Note:  If the "meta" tag uses a "property" attribute instead of a "name" attribute, "property" is used as the key.
    content > links | A JSON array of objects representing each link found by the parser.
    content > links > type | The HTML tag type that the link was found in.  Examples:  a, area, frame, iframe, script, img, link, etc.
    content > links > href | The URL associated with the tag, usually from the "href" attribute.
    content > links > text | The text displayed for the link.  Usually the value of the link element.
    content > links > * | Every attribute of the link tag is provided.
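
    As an example of working with the "links" array, the Python sketch below collects the URLs of all links found in "a" tags. The helper name anchor_hrefs and the sample record are hypothetical; only the field names come from the table above.

```python
import json

def anchor_hrefs(metadata_json):
    """Return the href of every link whose tag type is "a" in an HTML Metadata record."""
    content = json.loads(metadata_json).get("content") or {}
    if content.get("type") != "html-doc":
        return []
    return [
        link["href"]
        for link in content.get("links", [])
        if link.get("type") == "a" and "href" in link
    ]

# Hypothetical record; the values are made up for illustration.
sample = json.dumps({
    "disposition": "SUCCESS",
    "content": {
        "type": "html-doc",
        "title": "Example",
        "links": [
            {"type": "a", "href": "http://example.com/about", "text": "About"},
            {"type": "img", "href": "http://example.com/logo.png"},
        ],
    },
})
print(anchor_hrefs(sample))
```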


    RSS Content Attributes

    The "content" JSON object of an RSS feed document can contain the following fields:

    Attribute Name | Attribute Value
    content > type | Always "rss-feed".
    content > title | The value of the feed "title" element.
    content > link | The value of the feed "link" element.
    content > description | The value of the feed "description" element.
    content > updated | The later of either the "lastBuildDate" or the "pubDate" elements.
    content > generator | The value of the feed "generator" element.
    content > ttl | The value of the feed "ttl" element.
    content > categories | A JSON array of category names associated with the feed.
    content > items | A JSON array of objects representing each feed item.
    content > items > title | The value of the item "title" element.
    content > items > description | The value of the item "description" element.
    content > items > link | The value of the item "link" element.
    content > items > author | The value of the item "author" element.
    content > items > comments | The value of the item "comments" element.  A URL where users can comment on the feed item.
    content > items > published | The value of the item "pubDate" element.
    content > items > guid | The value of the item "GUID" element.  A unique identifier for the feed item.
    content > items > categories | A JSON array of category names associated with the item.
    content > items > content | A JSON object or array of objects containing any links embedded in the body of the item.

     

    Atom Content Attributes

    The "content" JSON object of an Atom feed document can contain the following fields:

    Attribute Name | Attribute Value
    content > type | Always "atom-feed".
    content > title | The value of the feed "title" element, stripped of any HTML.
    content > link | A JSON object representing the feed rel=alternate "link" element.
    content > description | The value of the feed "description" element.
    content > updated | The value of the feed "updated" element.
    content > generator | The value of the feed "generator" element.
    content > authors | A JSON array of authors associated with the feed.
    content > categories | A JSON array of category names associated with the feed.
    content > items | A JSON array of objects representing each feed item.
    content > items > title | The value of the item "title" element, stripped of any HTML.
    content > items > description | The value of the item "description" element.
    content > items > link | A JSON object or array of objects representing the item rel=alternate "link" elements.
    content > items > self | A JSON object or array of objects representing the item rel=self "link" elements.
    content > items > replies | A JSON object or array of objects representing the item rel=replies "link" elements.
    content > items > authors | A JSON array of objects representing the item authors.
    content > items > published | The value of the item "published" element.
    content > items > updated | The value of the item "updated" element.
    content > items > categories | A JSON array of category names associated with the item.
    content > items > content | A JSON object or array of objects containing any links embedded in the body of the item.
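
    Several Atom fields hold either a single JSON object or an array of objects, so code that reads them usually needs to normalize both shapes. The Python sketch below shows one way to do that. The helper names as_list and item_links are hypothetical, and the "href" key inside the sample link objects is an assumption, since the structure of a link object is not documented here.

```python
def as_list(value):
    """Normalize a field that may be absent, a single JSON object, or an array of objects."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def item_links(content):
    """Collect the rel=alternate link objects of every item in an Atom "content" object."""
    links = []
    for item in content.get("items", []):
        links.extend(as_list(item.get("link")))
    return links

# Hypothetical "content" object; the "href" key inside each link is an assumption.
sample = {
    "type": "atom-feed",
    "items": [
        {"title": "First", "link": {"href": "http://example.com/1"}},
        {"title": "Second", "link": [{"href": "http://example.com/2"},
                                     {"href": "http://example.com/2?alt"}]},
        {"title": "Third"},
    ],
}
print(len(item_links(sample)))
```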