Additional Resources

Amazon Web Services

Common Crawl data is stored as a public data set on Amazon Web Services (AWS), making it free to access using your AWS credentials and Elastic MapReduce. If you don’t already have an account with Amazon Web Services, you'll need to create one to get started.

  https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Once you’ve registered, you'll need to generate an Access Key ID and Secret Access Key:

  https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key

The Access Key ID and Secret Access Key verify your identity when you access data on Amazon’s cloud. They can also be used to authorize actions that cost money, so be sure to keep them in a safe place.
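
One common way to keep the keys out of source code is to read them from environment variables at run time. The following is only a minimal sketch under stated assumptions: the class name AwsCredentialsFromEnv and its helper methods are hypothetical and not part of any Common Crawl or AWS library, and it assumes the keys have been exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, the variable names the AWS command-line tools and SDKs conventionally read.

```java
import java.util.Map;

// Hypothetical helper, for illustration only: loads the two AWS keys from
// environment variables so they never need to be written into source files.
public class AwsCredentialsFromEnv {

    /** Simple holder for the two values generated on the AWS access-key page. */
    public static final class Credentials {
        public final String accessKeyId;
        public final String secretAccessKey;

        Credentials(String accessKeyId, String secretAccessKey) {
            this.accessKeyId = accessKeyId;
            this.secretAccessKey = secretAccessKey;
        }
    }

    /** Reads both keys from the given environment map, failing fast if one is missing. */
    public static Credentials load(Map<String, String> env) {
        return new Credentials(
                require(env, "AWS_ACCESS_KEY_ID"),
                require(env, "AWS_SECRET_ACCESS_KEY"));
    }

    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value.trim();
    }

    /** Hides all but the last few characters so a secret can be logged safely. */
    public static String mask(String secret) {
        int visible = Math.min(4, secret.length() / 2);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < secret.length() - visible; i++) {
            sb.append('*');
        }
        sb.append(secret.substring(secret.length() - visible));
        return sb.toString();
    }

    public static void main(String[] args) {
        try {
            Credentials c = load(System.getenv());
            System.out.println("Access Key ID:     " + c.accessKeyId);
            System.out.println("Secret Access Key: " + mask(c.secretAccessKey));
        } catch (IllegalStateException e) {
            // Report the problem without a stack trace; nothing secret is printed.
            System.err.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a Map, rather than calling System.getenv() inside load, keeps the helper easy to exercise with made-up values, and the secret is only ever printed in masked form.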

Local Development Environment (Java/Hadoop/Eclipse)

Yahoo! provides an excellent tutorial showing how to set up a local MapReduce development environment:

  http://developer.yahoo.com/hadoop/tutorial/module3.html