- Created by Chris Stephens, last modified on Jul 16, 2012
About the Common Crawl Data Set
- number of years
- size
- link back to website
File Locations
The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set.
http://aws.amazon.com/datasets/41740
The directory structure is as follows:
Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/
Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/
Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/
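As a minimal sketch, the locations above can be split into the bucket name and key prefix that most S3 clients ask for. The helper name split_s3_uri and the CRAWL_LOCATIONS mapping are illustrative only and are not part of any Common Crawl tooling.

```python
from urllib.parse import urlparse

# Crawl locations as listed in the directory structure above.
CRAWL_LOCATIONS = {
    1: "s3://aws-publicdatasets/common-crawl/crawl-001/",
    2: "s3://aws-publicdatasets/common-crawl/crawl-002/",
    3: "s3://aws-publicdatasets/common-crawl/parse-output/",
}


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// URI: %r" % uri)
    # netloc holds the bucket; path holds the key with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(CRAWL_LOCATIONS[3])
print(bucket, prefix)
```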
General
.
ARC Files - Raw Content
ARC files contain the full HTTP response and payload for all pages crawled. They are located in the segment directories, with file names ending in ".arc.gz". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
Text Files - Text Only
Common Crawl also produces a series of text only files. These files take content returned as HTML or RSS and parse out just the text content - making it easier for researchers to perform text-based analysis.
Text Only files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are both Text. The key in these files is the URL, and the value is the actual text content. From HTML pages, the text content includes the page title, the page meta description content, and all text content from the HTML body. They are located in the segment directories, with a file name of "textData-nnnnn". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
The numbers at the end of the filename are sequentially assigned to Text Only files within the segment; they have no specific meaning.
Further, Text Only files are translated from their native character sets into UTF-8. All Text Only content (in all languages) can be read using the UTF-8 character set.
Currently, we are only producing Text Only files from HTML and RSS/Atom content. The Text Only files are on average 20% of the size of the raw content.
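To check that a downloaded file really is a SequenceFile with Text keys and values, the first few header fields can be inspected. The sketch below assumes the header layout documented for Hadoop SequenceFile version 6 (magic bytes, version, key class, value class, two compression flags, codec class) and only handles class names shorter than 128 bytes. The function names and the sample bytes are made up for illustration; for real work, a Hadoop SequenceFile reader is the safer choice.

```python
import io


def _read_short_string(stream):
    """Read a Hadoop-serialized string whose length prefix fits in one byte."""
    length = stream.read(1)[0]
    if length > 127:
        # Longer strings use a multi-byte length prefix, which this sketch skips.
        raise ValueError("multi-byte length prefix not supported")
    return stream.read(length).decode("utf-8")


def parse_sequencefile_header(data):
    """Return the leading header fields of a SequenceFile as a dict."""
    stream = io.BytesIO(data)
    if stream.read(3) != b"SEQ":
        raise ValueError("missing SEQ magic bytes")
    version = stream.read(1)[0]
    key_class = _read_short_string(stream)
    value_class = _read_short_string(stream)
    compressed = stream.read(1) == b"\x01"
    block_compressed = stream.read(1) == b"\x01"
    # The codec class name is only present when compression is enabled.
    codec = _read_short_string(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


def _short_string(text):
    """Serialize a short string the way the reader above expects it."""
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Synthetic header bytes for illustration only, not taken from a real file.
sample_header = (
    b"SEQ\x06"
    + _short_string("org.apache.hadoop.io.Text")
    + _short_string("org.apache.hadoop.io.Text")
    + b"\x01\x00"
    + _short_string("org.apache.hadoop.io.compress.GzipCodec")
)
header = parse_sequencefile_header(sample_header)
print(header["key_class"], header["value_class"], header["codec"])
```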
Metadata
In addition to content files, Common Crawl produces a series of metadata files that provide useful information about the crawled content for each URL.
Records in the Metadata files are in the same order and have the same file numbers as the Text Only content.
Metadata files are saved as Hadoop SequenceFiles using GZIP compression. The key and value data types are also both Text. The key in these files is the URL, and the value is a JSON structure of fields and subfields; the full structure is defined below. Just like the Text Only files, the Metadata files are located in the segment directories, with a file name of "metadata-nnnnn". For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/metadata-00112
Currently, we are only producing Metadata from HTML and RSS/Atom content.
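Because Metadata files share file numbers with the Text Only files, the location of one can be derived from the other. The following is a small sketch of that mapping; the function name metadata_path_for is hypothetical.

```python
import re

# Matches a segment directory followed by a "textData-nnnnn" file name.
_TEXT_FILE = re.compile(r"^(?P<segment>.*/)textData-(?P<number>\d+)$")


def metadata_path_for(text_data_path):
    """Return the metadata file path that pairs with a Text Only file path."""
    match = _TEXT_FILE.match(text_data_path)
    if match is None:
        raise ValueError("not a textData path: %r" % text_data_path)
    return "%smetadata-%s" % (match.group("segment"), match.group("number"))


text_path = (
    "s3://aws-publicdatasets/common-crawl/parse-output/"
    "segment/1341690169105/textData-00112"
)
print(metadata_path_for(text_path))
```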
Metadata Schema
Each record's JSON value can contain the following attributes:
Attribute Name | Attribute Description | Available |
---|---|---|
attempt_time | The time (in UNIX time format) that the crawl of this page was attempted. | always |
disposition | SUCCESS if the crawler received a successful HTTP response; FAILURE if not. | always |
failure_reason | A code representing why the crawl of this page failed. | on failure |
failure_detail | A message, if available, on why the crawl of this page failed. | on failure |
server_ip | The IP address of the server that returned the response. | on success |
http_result | The HTTP result code. | on success |
http_headers | A JSON object containing all returned HTTP headers as key/value pairs. | on success |
redirect_from | The URL from which the crawler was redirected to this URL. | if URL was redirected |
content_len | The value of the Content-Length HTTP header. | on success |
mime_type | The value of the Content-Type HTTP header (stripped of the charset). | on success |
download_size | The actual size of the downloaded content. | on success |
content_is_gzip | Optional attribute that specifies that source content was gzip'd. | if payload is gzip'd |
gunzip_content_len | If the content was gzip'd, this is the decompressed length of the incoming content. | if payload is gzip'd |
md5 | The md5 hash of the downloaded content. | on success |
text_simhash | The 64-bit simhash of the text (UCS-2) content if document was a valid text type. | if payload is text |
charset_detected | The character set Common Crawl detected for the downloaded content. | on success |
charset_detector | 0 - The character set was derived from an HTTP header. | on success |
parsed_as | html - Downloaded content was parsed as HTML. feed - Downloaded content was parsed as an RSS/Atom feed. | on success |
content | If the HTTP response code was 20x, and if the downloaded content was parsed as HTML or as an RSS/Atom feed, a JSON object describing the parsed content. | HTTP Code = 20x |
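As a rough illustration of how these attributes might be used, the sketch below parses one metadata value and branches on disposition before touching the attributes that are only available on success. The sample records are invented for this example, and the JSON value types (for instance, http_result as a number) are assumptions rather than something this page specifies.

```python
import json


def summarize(url, value):
    """Build a one-line summary from a metadata record's JSON value."""
    record = json.loads(value)
    if record["disposition"] != "SUCCESS":
        # failure_reason is only present when the crawl failed.
        return "%s failed: %s" % (url, record.get("failure_reason", "unknown"))
    return "%s -> HTTP %s, %s, parsed as %s" % (
        url,
        record.get("http_result"),
        record.get("mime_type"),
        record.get("parsed_as"),
    )


# Hypothetical records, not taken from the data set.
ok_value = json.dumps({
    "disposition": "SUCCESS",
    "http_result": 200,
    "mime_type": "text/html",
    "parsed_as": "html",
})
failed_value = json.dumps({
    "disposition": "FAILURE",
    "failure_reason": "TimeOut",
})

print(summarize("http://example.com/", ok_value))
print(summarize("http://example.org/", failed_value))
```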
When the content was parsed as HTML, the content attribute has the following structure:
Attribute Name | Attribute Description |
---|---|
content > type | Always "html-doc". |
content > title | The value of the HTML "title" tag. |
content > meta_tags | A JSON array of objects representing each "meta" tag found by the parser. Note: If the "meta" tag uses a "property" attribute instead of a "name" attribute, "property" is used as the key. |
content > links | A JSON array of objects representing each link found by the parser. |
content > links > type | The HTML tag type that the link was found in. Examples: a, area, frame, iframe, script, img, link, etc. |
content > links > href | The URL associated with the tag, usually from the "href" attribute. |
content > links > text | The text displayed for the link. Usually the value of the link element. |
content > links > * | Every attribute of the link tag is provided. |
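One way to work with the links array is to group link targets by the tag they were found in. This is only a sketch: the sample content object is invented, and the function name links_by_type is hypothetical.

```python
from collections import defaultdict


def links_by_type(content):
    """Group link href values by the HTML tag type they were found in."""
    grouped = defaultdict(list)
    for link in content.get("links", []):
        grouped[link.get("type", "unknown")].append(link.get("href"))
    return dict(grouped)


# Hypothetical "content" object for an HTML page.
sample_content = {
    "type": "html-doc",
    "title": "Example page",
    "links": [
        {"type": "a", "href": "http://example.com/about", "text": "About"},
        {"type": "img", "href": "http://example.com/logo.png"},
        {"type": "a", "href": "http://example.com/contact", "text": "Contact"},
    ],
}

for tag, hrefs in links_by_type(sample_content).items():
    print(tag, hrefs)
```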
When the content was parsed as an RSS feed, the content attribute has the following structure:
Attribute Name | Attribute Description |
---|---|
content > type | Always "rss-feed". |
content > title | The value of the feed "title" element. |
content > link | The value of the feed "link" element. |
content > description | The value of the feed "description" element. |
content > updated | The later of either the "lastBuildDate" or the "pubDate" elements. |
content > generator | The value of the feed "generator" element. |
content > ttl | The value of the feed "ttl" element. |
content > categories | A JSON array of category names associated with the feed. |
content > items | A JSON array of objects representing each feed item. |
content > items > title | The value of the item "title" element. |
content > items > description | The value of the item "description" element. |
content > items > link | The value of the item "link" element. |
content > items > author | The value of the item "author" element. |
content > items > comments | The value of the item "comments" element. A URL where users can comment on the feed item. |
content > items > published | The value of the item "pubDate" element. |
content > items > guid | The value of the item "guid" element. A unique identifier for the feed item. |
content > items > categories | A JSON array of category names associated with the item. |
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item. |
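A short sketch of walking the items array of an RSS content object follows. Missing attributes are returned as None, since this page does not say which item attributes are always present. The sample feed and the function name item_rows are invented for illustration.

```python
def item_rows(content):
    """Return (title, link, published) tuples for each feed item."""
    rows = []
    for item in content.get("items", []):
        rows.append((item.get("title"), item.get("link"), item.get("published")))
    return rows


# Hypothetical "content" object for an RSS feed.
sample_feed = {
    "type": "rss-feed",
    "title": "Example feed",
    "items": [
        {
            "title": "First post",
            "link": "http://example.com/1",
            "published": "Mon, 16 Jul 2012 10:00:00 GMT",
        },
        {"title": "Second post", "link": "http://example.com/2"},
    ],
}

for row in item_rows(sample_feed):
    print(row)
```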
When the content was parsed as an Atom feed, the content attribute has the following structure:
Attribute Name | Attribute Description |
---|---|
content > type | Always "atom-feed". |
content > title | The value of the feed "title" element, stripped of any HTML. |
content > link | A JSON object representing the feed rel=alternate "link" element. |
content > description | The value of the feed "description" element. |
content > updated | The value of the feed "updated" element. |
content > generator | The value of the feed "generator" element. |
content > authors | A JSON array of authors associated with the feed. |
content > categories | A JSON array of category names associated with the feed. |
content > items | A JSON array of objects representing each feed item. |
content > items > title | The value of the item "title" element, stripped of any HTML. |
content > items > description | The value of the item "description" element. |
content > items > link | A JSON object or array of objects representing the item rel=alternate "link" elements. |
content > items > self | A JSON object or array of objects representing the item rel=self "link" elements. |
content > items > replies | A JSON object or array of objects representing the item rel=replies "link" elements. |
content > items > authors | A JSON array of objects representing the item authors. |
content > items > published | The value of the item "published" element. |
content > items > updated | The value of the item "updated" element. |
content > items > categories | A JSON array of category names associated with the item. |
content > items > content | A JSON object or array of objects containing any links embedded in the body of the item. |
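Several Atom attributes are described as "a JSON object or array of objects", so code reading them has to accept both shapes. The sketch below normalizes such a value to a list first. The helper names are hypothetical, and the "href" key inside each link object is an assumption borrowed from the HTML links table; this page does not spell out the keys of Atom link objects.

```python
def as_list(value):
    """Normalize a value that may be missing, a single object, or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def item_link_hrefs(item):
    """Collect href values from an item's rel=alternate link attribute."""
    # "href" is an assumed key name, see the note above.
    return [link.get("href") for link in as_list(item.get("link"))]


# Hypothetical Atom items showing both shapes.
single = {"title": "One link", "link": {"href": "http://example.com/a"}}
several = {
    "title": "Two links",
    "link": [{"href": "http://example.com/b"}, {"href": "http://example.com/c"}],
}

print(item_link_hrefs(single))
print(item_link_hrefs(several))
```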