Quick Start - Amazon AMI
- Dave Lester
- Chris Stephens
Overview
The Common Crawl Foundation has created an Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce.
Launching an Instance
The Common Crawl Quick Start AMI can be found here:
http://aws.amazon.com/amis/common-crawl-quick-start
Our Amazon AMI ID is "ami-07339a6e". Currently it is available only in the "US East (Virginia)" region.
Note: The Common Crawl AMI can be run using the "Default" Security Group. No custom Security Group and no custom firewall rules are required.
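If you prefer to launch from the command line instead of the AWS Console, a minimal sketch using the classic EC2 API tools follows; it assumes the tools are installed and configured, and the instance type and key pair name are placeholders of our choosing:

# Hypothetical command-line launch with the EC2 API tools; replace
# 'your-keypair' with your own key pair name:
ec2-run-instances ami-07339a6e -t m1.large -k your-keypair --region us-east-1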
Using the Instance
After launching and connecting to your new EC2 instance, follow these steps to run your first sample job against the Common Crawl corpus:
1. Find your Amazon Access Credentials (Access Key ID & Secret Access Key) and save them as two lines in this file, the Access Key ID first and the Secret Access Key second:
/home/ec2-user/.awssecret
For example:
JLASKHJFLKDHJLFKSJDF
DFHSDJHhhoiaGKHDFa6sd42rwuhfapgfuAGSDAjh
Change the permissions of this file so that it is readable and writable only by 'ec2-user':
chmod 600 /home/ec2-user/.awssecret
Now you can use Tim Kay's AWS Command Line tool. Try this:
aws ls -1 aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/metadata-
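Putting step 1 together, here is a minimal sketch; the two credential values are placeholders, so substitute your own keys:

# Write the two credential lines, lock down permissions, and verify that the
# aws tool can list part of the public dataset:
printf '%s\n%s\n' 'YOUR_ACCESS_KEY_ID' 'YOUR_SECRET_ACCESS_KEY' > /home/ec2-user/.awssecret
chmod 600 /home/ec2-user/.awssecret
aws ls -1 aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/metadata-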
2. Move to the 'commoncrawl-examples' directory. Make sure it is up-to-date:
cd commoncrawl-examples
git pull
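If the 'commoncrawl-examples' directory is missing from your instance for any reason, you can fetch it yourself; this sketch assumes the repository's standard GitHub location:

# Clone the examples repository (assumed GitHub location), then enter it:
git clone https://github.com/commoncrawl/commoncrawl-examples.git
cd commoncrawl-examples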
3. Compile the latest example code:
ant
4. Run an example! Decide whether you want to run an example on the small local Hadoop instance or on Amazon Elastic MapReduce.
Run this command to see your options:
bin/ccRunExample
then go ahead and run an example:
bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount
then look at the code:
nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
Note: You need to have your own Amazon S3 bucket to run Amazon Elastic MapReduce jobs.
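For example, an Elastic MapReduce run might look like the sketch below. The mode name and argument order are assumptions here, and 'my-output-bucket' is a placeholder for your own S3 bucket, so check the usage output from step 4 for the exact form:

# Hypothetical EMR invocation; verify the mode name and argument order
# against the usage output of bin/ccRunExample:
bin/ccRunExample AmazonEMR ExampleMetadataDomainPageCount my-output-bucket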