About the Common Crawl Data Set

number of years
size
link back to website

File Locations

The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set.

http://aws.amazon.com/datasets/41740

The directory structure is as follows:

Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/
Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/
Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/

General

.

ARC Files - Raw Content

ARC files contain the full HTTP response & payload for all pages crawled.

ARC files are located in the segment directories. For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz

Text Files - Text Only

Common Crawl also produces a series of Text Only files. These files take content returned as HTML or RSS and extract just the text content, making it easier for researchers to perform text-based analysis.

Text Only files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are both Text. The key in these files is the URL, and the value is the actual text content. From HTML pages, the text content includes the page title, the page meta description content, and all text content from the HTML body. They are located in the segment directories, with a file name of "textData-nnnnn". For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112

The numbers at the end of the filename are sequentially assigned to Text Only files within the segment; they have no specific meaning.

Further, Text Only files are translated from their native character sets into UTF-8. All Text Only content (in all languages) can be read using the UTF-8 character set.

Currently, we are only producing Text Only files from HTML and RSS/Atom content. The Text Only files are on average 20% of the size of the raw content.

Metadata

In addition to content files, Common Crawl produces a series of metadata files that provide useful information about the crawled content.

Records in the Metadata files are in the same order and have the same file numbers as the Text Only content.

Metadata files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are also both Text. The key in these files is the URL, and the value is a JSON structure of fields and subfields; the full structure is defined below. Just like the Text Only files, the Metadata files are located in the segment directories, with a file name of "metadata-nnnnn". For example:

  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/metadata-00112

Currently, we are only producing Metadata from HTML and RSS/Atom content.
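
Because Text Only and Metadata files within a segment share file numbers, the path of one can be derived from the other. The following is a minimal Python sketch of that naming convention. The helper names segment_file and metadata_for_text are hypothetical and not part of any Common Crawl tooling, and the five-digit zero padding is inferred from the example paths above.

```python
SEGMENT_ROOT = "s3://aws-publicdatasets/common-crawl/parse-output/segment"


def segment_file(segment_id: str, kind: str, number: int) -> str:
    """Build the S3 path of a Text Only ("textData") or Metadata ("metadata") file."""
    if kind not in ("textData", "metadata"):
        raise ValueError(f"unknown file kind: {kind!r}")
    # Assumption: file numbers are zero-padded to five digits, e.g. textData-00112.
    return f"{SEGMENT_ROOT}/{segment_id}/{kind}-{number:05d}"


def metadata_for_text(text_path: str) -> str:
    """Return the Metadata file path that pairs with a Text Only file path."""
    directory, _, name = text_path.rpartition("/")
    prefix, _, number = name.partition("-")
    if prefix != "textData":
        raise ValueError(f"not a Text Only file: {text_path!r}")
    # The Metadata file carries the same file number as the Text Only file.
    return f"{directory}/metadata-{number}"


if __name__ == "__main__":
    text_path = segment_file("1341690169105", "textData", 112)
    print(text_path)
    print(metadata_for_text(text_path))
```

Since records appear in the same order in both files, a job that reads the two paths side by side can pair each page's text with its metadata without a join on the URL.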

Metadata Schema

Each Metadata value is a JSON structure with the following attributes:

Attribute Name	Attribute Description	Available
attempt_time	The time (in UNIX time format) that the crawl of this page was attempted.	always
disposition	SUCCESS if the crawler received a successful HTTP response; FAILURE if not.	always
failure_reason	A code representing why the crawl of this page failed.	on failure
failure_detail	A message, if available, on why the crawl of this page failed.	on failure
server_ip	The IP address of the server that returned the response.	on success
http_result	The HTTP result code.	on success
http_headers	A JSON object containing all returned HTTP headers as key/value pairs.	on success
redirect_from		if URL was redirected
content_len	The value of the Content-Length HTTP header.	on success
mime_type	The value of the Content-Type HTTP header (stripped of the charset).	on success
download_size	The actual size of the downloaded content.	on success
content_is_gzip	Optional attribute that specifies that source content was gzip'd.	if payload is gzip'd
gunzip_content_len	If the content was gzip'd, this is the decompressed length of the incoming content.	if payload is gzip'd
md5	The MD5 hash of the downloaded content.	on success
text_simhash	The 64-bit simhash of the text (UCS-2) content if the document was a valid text type.	if payload is text
charset_detected	The character set Common Crawl detected for the downloaded content.	on success
charset_detector	0 - The character set was derived from an HTTP header. 1 - The character set was derived from an HTML "meta" tag. 2 - The character set was derived from the ICU detector. 3 - The character set was derived from the Mozilla detector. 10 - The character set could not be determined. ISO-8859-1 is assumed.	on success
parsed_as	html - Downloaded content was parsed as HTML. feed - Downloaded content was parsed as an RSS/Atom feed.	on success
content	If the HTTP response code was 20x and the downloaded content was parsed as HTML or as a feed, a JSON object representing the document's metadata.	HTTP Code = 20x
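
As an illustration of how these attributes might be consumed, the Python sketch below parses one Metadata value and reduces it to a few fields. The function name summarize and the sample record are hypothetical, made up for this example. The sketch also assumes that http_result and charset_detector are stored as JSON numbers, which the table does not state.

```python
import json

# Meaning of each charset_detector code, as listed in the table above.
CHARSET_DETECTORS = {
    0: "HTTP header",
    1: "HTML meta tag",
    2: "ICU detector",
    3: "Mozilla detector",
    10: "undetermined (ISO-8859-1 assumed)",
}


def summarize(url: str, value: str) -> dict:
    """Reduce one Metadata record (URL key, JSON value) to a few fields."""
    record = json.loads(value)
    summary = {"url": url, "ok": record.get("disposition") == "SUCCESS"}
    if summary["ok"]:
        # These attributes are only available on success.
        summary["status"] = record.get("http_result")
        summary["mime_type"] = record.get("mime_type")
        code = record.get("charset_detector")
        summary["charset_source"] = CHARSET_DETECTORS.get(code, "unknown")
    else:
        # failure_reason is only available on failure.
        summary["failure"] = record.get("failure_reason")
    return summary


# Hypothetical record, made up for illustration only.
SAMPLE = json.dumps({
    "attempt_time": 1341826131,
    "disposition": "SUCCESS",
    "http_result": 200,
    "mime_type": "text/html",
    "charset_detected": "UTF-8",
    "charset_detector": 1,
    "parsed_as": "html",
})

if __name__ == "__main__":
    print(summarize("http://example.com/", SAMPLE))
```

Using .get() rather than direct indexing keeps the sketch tolerant of attributes that are absent, which matters because most attributes are only present under the conditions listed in the Available column.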

The details of the content JSON object if the document was HTML:

Attribute Name	Attribute Description
content > type	Always "html-doc".
content > title	The value of the HTML "title" tag.
content > meta_tags	A JSON array of objects representing each "meta" tag found by the parser. Note: If the "meta" tag uses a "property" attribute instead of a "name" attribute, "property" is used as the key.
content > links	A JSON array of objects representing each link found by the parser.
content > links > type	The HTML tag type that the link was found in. Examples: a, area, frame, iframe, script, img, link, etc.
content > links > href	The URL associated with the tag, usually from the "href" attribute.
content > links > text	The text displayed for the link. Usually the value of the link element.
content > links > *	Every attribute of the link tag is provided.
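
A common use of the links array is to pull out the URLs found in one kind of tag. The Python sketch below shows one way to do that, assuming the content object has already been parsed from JSON into a dict. The function name links_of_type and the sample object are hypothetical and made up for illustration.

```python
def links_of_type(content: dict, tag: str) -> list:
    """Return the href of every link found in the given HTML tag type."""
    if content.get("type") != "html-doc":
        return []
    return [
        link["href"]
        for link in content.get("links", [])
        if link.get("type") == tag and "href" in link
    ]


# Hypothetical content object, made up for illustration only.
SAMPLE_CONTENT = {
    "type": "html-doc",
    "title": "Example page",
    "links": [
        {"type": "a", "href": "http://example.com/about", "text": "About"},
        {"type": "img", "href": "http://example.com/logo.png"},
        {"type": "a", "href": "http://example.com/contact", "text": "Contact"},
    ],
}

if __name__ == "__main__":
    print(links_of_type(SAMPLE_CONTENT, "a"))
```

The check for "href" guards against link objects where the URL came from some other attribute and the key is missing.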

The details of the content JSON object if the document was an RSS feed:

Attribute Name	Attribute Value
content > type	Always "rss-feed".
content > title	The value of the feed "title" element.
content > link	The value of the feed "link" element.
content > description	The value of the feed "description" element.
content > updated	The later of either the "lastBuildDate" or the "pubDate" elements.
content > generator	The value of the feed "generator" element.
content > ttl	The value of the feed "ttl" element.
content > categories	A JSON array of category names associated with the feed.
content > items	A JSON array of objects representing each feed item.
content > items > title	The value of the item "title" element.
content > items > description	The value of the item "description" element.
content > items > link	The value of the item "link" element.
content > items > author	The value of the item "author" element.
content > items > comments	The value of the item "comments" element. A URL where users can comment on the feed item.
content > items > published	The value of the item "pubDate" element.
content > items > guid	The value of the item "guid" element. A unique identifier for the feed item.
content > items > categories	A JSON array of category names associated with the item.
content > items > content	A JSON object or array of objects containing any links embedded in the body of the item.

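
Some feed fields, such as content > items > content, may hold either a single JSON object or an array of objects, so it helps to normalize them before iterating. The Python sketch below shows one way to do that for an RSS feed's content object. The function names and the sample feed are hypothetical. The "href" key inside each embedded link object is an assumption carried over from the HTML links structure; the actual keys may differ.

```python
def as_list(value):
    """Normalize a field that may be missing, a single object, or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def embedded_links(content: dict) -> list:
    """Collect (item title, href) pairs for links embedded in each feed item."""
    pairs = []
    for item in as_list(content.get("items")):
        for link in as_list(item.get("content")):
            # "href" is an assumed key name, not confirmed by the schema.
            if "href" in link:
                pairs.append((item.get("title"), link["href"]))
    return pairs


# Hypothetical feed content object, made up for illustration only.
SAMPLE_FEED = {
    "type": "rss-feed",
    "title": "Example feed",
    "items": [
        {"title": "First", "content": {"href": "http://example.com/1"}},
        {
            "title": "Second",
            "content": [
                {"href": "http://example.com/2a"},
                {"href": "http://example.com/2b"},
            ],
        },
        {"title": "Third"},
    ],
}

if __name__ == "__main__":
    for title, href in embedded_links(SAMPLE_FEED):
        print(title, href)
```

The same as_list helper could be applied to the Atom item fields that are also described as "a JSON object or array of objects".
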
The details of the content JSON object if the document was an Atom feed:

Attribute Name	Attribute Value
content > type	Always "atom-feed".
content > title	The value of the feed "title" element, stripped of any HTML.
content > link	A JSON object representing the feed rel=alternate "link" element.
content > description	The value of the feed "description" element.
content > updated	The value of the feed "updated" element.
content > generator	The value of the feed "generator" element.
content > authors	A JSON array of authors associated with the feed.
content > categories	A JSON array of category names associated with the feed.
content > items	A JSON array of objects representing each feed item.
content > items > title	The value of the item "title" element, stripped of any HTML.
content > items > description	The value of the item "description" element.
content > items > link	A JSON object or array of objects representing the item rel=alternate "link" elements.
content > items > self	A JSON object or array of objects representing the item rel=self "link" elements.
content > items > replies	A JSON object or array of objects representing the item rel=replies "link" elements.
content > items > authors	A JSON array of objects representing the item authors.
content > items > published	The value of the item "published" element.
content > items > updated	The value of the item "updated" element.
content > items > categories	A JSON array of category names associated with the item.
content > items > content	A JSON object or array of objects containing any links embedded in the body of the item.
