Quick Start - Amazon AMI
- Dave Lester
- Chris Stephens
Overview
The Common Crawl Foundation has created an Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce.
Launching an Instance
The Common Crawl Quick Start AMI can be found here:
http://aws.amazon.com/amis/common-crawl-quick-start
Our Amazon AMI ID is "ami-07339a6e". Currently it is available only in the "US East (Virginia)" region.
Note: The Common Crawl AMI can be run using the "Default" Security Group. No custom Security Group and no custom firewall rules are required.
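If you prefer to launch from the command line instead of the AWS Console, a minimal sketch using the classic EC2 API tools follows; it assumes the tools are installed and configured, and the instance type and key pair name are placeholders of our choosing:

# Hypothetical command-line launch with the EC2 API tools; replace
# 'your-keypair' with your own key pair name:
ec2-run-instances ami-07339a6e -t m1.large -k your-keypair --region us-east-1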
Using the Instance
After launching and connecting to your new EC2 instance, follow these steps to run your first sample job against the Common Crawl corpus:
1. Find your Amazon Access Credentials (Access Key ID & Secret Access Key) and save them as two lines in this file, the Access Key ID first and the Secret Access Key second:
/home/ec2-user/.awssecret
For example:
JLASKHJFLKDHJLFKSJDF
DFHSDJHhhoiaGKHDFa6sd42rwuhfapgfuAGSDAjh
Change the permissions of this file so that it is readable and writable only by 'ec2-user':
chmod 600 /home/ec2-user/.awssecret
Now you can use Tim Kay's AWS Command Line tool. Try this:
aws ls -1 aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/metadata-
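Putting step 1 together, here is a minimal sketch; the two credential values are placeholders, so substitute your own keys:

# Write the two credential lines, lock down permissions, and verify that the
# aws tool can list part of the public dataset:
printf '%s\n%s\n' 'YOUR_ACCESS_KEY_ID' 'YOUR_SECRET_ACCESS_KEY' > /home/ec2-user/.awssecret
chmod 600 /home/ec2-user/.awssecret
aws ls -1 aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/metadata-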
2. Move to the 'commoncrawl-examples' directory. Make sure it is up-to-date:
cd commoncrawl-examples
git pull
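If the 'commoncrawl-examples' directory is missing from your instance for any reason, you can fetch it yourself; this sketch assumes the repository's standard GitHub location:

# Clone the examples repository (assumed GitHub location), then enter it:
git clone https://github.com/commoncrawl/commoncrawl-examples.git
cd commoncrawl-examples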
3. Compile the latest example code:
ant
4. Run an example! Decide whether you want to run an example on the small local Hadoop instance or on Amazon Elastic MapReduce.
Run this command to see your options:
bin/ccRunExample
then go ahead and run an example:
bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount
then look at the code:
nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
Note: You need to have your own Amazon S3 bucket to run Amazon Elastic MapReduce jobs.
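For example, an Elastic MapReduce run might look like the sketch below. The mode name and argument order are assumptions here, and 'my-output-bucket' is a placeholder for your own S3 bucket, so check the usage output from step 4 for the exact form:

# Hypothetical EMR invocation; verify the mode name and argument order
# against the usage output of bin/ccRunExample:
bin/ccRunExample AmazonEMR ExampleMetadataDomainPageCount my-output-bucket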