...

The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set.

  http://aws.amazon.com/datasets/41740

The directory structure is as follows:

  Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/
  Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/
  Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/
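As a rough illustration of how these locations might be used from code, the sketch below splits each S3 URI from the list above into a bucket name and a key prefix. The names `CRAWL_LOCATIONS`, `split_s3_uri`, and `list_some_keys` are hypothetical and chosen only for this example. The optional listing helper assumes that `boto3` is installed and that the bucket allows anonymous reads; neither is confirmed by this page, so treat that part as an assumption.

```python
from urllib.parse import urlparse

# Crawl locations copied from the directory structure listed above.
CRAWL_LOCATIONS = {
    1: "s3://aws-publicdatasets/common-crawl/crawl-001/",
    2: "s3://aws-publicdatasets/common-crawl/crawl-002/",
    3: "s3://aws-publicdatasets/common-crawl/parse-output/",
}


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    # The URI path starts with "/", while S3 key prefixes do not.
    return parsed.netloc, parsed.path.lstrip("/")


def list_some_keys(crawl_number, max_keys=10):
    """Return up to max_keys object keys under a crawl prefix.

    Assumption: boto3 is installed and the bucket permits unsigned
    (anonymous) requests. Adjust the client setup if credentials or
    other access settings are required.
    """
    import boto3  # imported lazily so the rest of the sketch has no dependency
    from botocore import UNSIGNED
    from botocore.client import Config

    bucket, prefix = split_s3_uri(CRAWL_LOCATIONS[crawl_number])
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    return [item["Key"] for item in response.get("Contents", [])]


if __name__ == "__main__":
    for number, uri in sorted(CRAWL_LOCATIONS.items()):
        bucket, prefix = split_s3_uri(uri)
        print("Crawl #%d: bucket=%s prefix=%s" % (number, bucket, prefix))
```

Running the script only prints the bucket and prefix for each crawl and makes no network calls; `list_some_keys` has to be called explicitly to query S3.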

...