The QuickStart guide leads you through the process of building example MapReduce jobs on your local machine and running them in the cloud with Common Crawl data. You’ll learn how to: pull down the example code, compile it, upload the resulting JAR to Amazon S3, and run an Elastic MapReduce job using the newly created JAR. This guide is based on Steve Salevan’s blog post “MapReduce for the Masses”.
 
To get started, your machine must meet basic requirements for this QuickStart guide:
 
QuickStart Requirements

  • The following software installed: git, Eclipse IDE, Java 1.6+
  • An Amazon Web Services Account

Step 1: Check out example code

Having met the QuickStart requirements, we’re now ready to begin by pulling the Common Crawl examples MapReduce code onto your local machine. To do this, run the following command in your terminal window:

git clone git://github.com/commoncrawl/commoncrawl-examples.git

Step 2: Configure Eclipse and compile the "commoncrawl-examples" JAR

The next step is to add the project to your IDE and point it at the Common Crawl examples build file. This guide uses Eclipse, but other IDEs can be used.

First, open Eclipse. From the File menu, choose “New” and then “Project”. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. Click Browse, locate the folder containing the code you just checked out (if you didn’t change directories before running the clone, it will be under your home directory), and select the “build.xml” file. Eclipse will detect the available build targets. Tick the “Link to the buildfile in the file system” box so that edits you make to the buildfile in Eclipse are also picked up by git.

We now need to tell Eclipse how to build our JAR. Right-click the base project folder (named “Common Crawl Examples” by default) and select “Properties” from the menu that appears. Navigate to the “Builders” section in the left-hand panel of the Properties window, then click “New”. Select “Ant Builder” from the dialog that appears, then click OK.


To configure the new Ant builder, three pieces of information need to be specified: where the buildfile is located, where the root directory of the project is, and which Ant build target to execute.  To set the buildfile, click the “Browse File System” button under the “Buildfile:” field and select the build.xml file you located earlier.  To set the root directory, click the “Browse File System” button under the “Base Directory:” field and select the folder into which you checked out the code.  To specify the target, enter “dist” (without the quotes) into the “Arguments” field.  Click OK and close the Properties window.

 
Finally, right-click the base project folder and select “Build Project”. Ant will assemble a JAR that is ready for use in Elastic MapReduce.
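
Alternatively, if you have Apache Ant installed and prefer the terminal, you can run the same “dist” target directly from the directory you checked out (this is an optional shortcut, not a requirement of this guide):

cd commoncrawl-examples
ant dist

Either way, the finished JAR is written to the dist/lib directory of the project.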


Step 3: Upload the Common Crawl examples JAR to Amazon S3

When you run an Elastic MapReduce job, the JAR file you created must be available on Amazon’s servers. We’ll use Amazon S3 for storage: first we create a “bucket”, which is effectively a top-level folder for your files in the cloud, and then we upload the JAR file into it. The same bucket will also hold the output of our MapReduce job. Start by visiting the S3 Console: https://console.aws.amazon.com/s3/home

 
Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, click the “Upload” button, and select the JAR you just built. On your local machine it should be at:

 

<directory where you checked out code>/dist/lib/commoncrawl-examples-1.0.0.jar
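
If you would rather script the upload than click through the console, the AWS SDK for Java can perform the same step. The sketch below is an optional illustration only: the bucket name, credentials, and file path are placeholders to replace with your own values, and the class name is made up for this example.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class UploadExampleJar {
    public static void main(String[] args) {
        // Placeholder credentials; substitute your own AWS keys.
        BasicAWSCredentials credentials =
            new BasicAWSCredentials("[AWS Access Key ID]", "[AWS Secret Access Key]");
        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // Upload the JAR built in Step 2 into your bucket, using the same key
        // that the Elastic MapReduce job flow will reference in Step 4.
        s3.putObject("<your bucket name>",
                     "commoncrawl-examples-1.0.0.jar",
                     new File("dist/lib/commoncrawl-examples-1.0.0.jar"));
    }
}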

Step 4: Create an Elastic MapReduce job based on your new JAR

With the JAR uploaded to S3, the remaining step is to point Elastic MapReduce to it. Start by visiting the Elastic MapReduce console: https://console.aws.amazon.com/elasticmapreduce/home, and clicking the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.
 
The next screen of the wizard asks which JAR to use and what command-line arguments to pass to it. Enter the following JAR location:
 

s3n://<your bucket name>/commoncrawl-examples-1.0.0.jar

 
Then add the following arguments:

org.commoncrawl.examples.ExampleTextWordCount -Dfs.s3n.awsAccessKeyId=[AWS Access Key ID] -Dfs.s3n.awsSecretAccessKey=[AWS Secret Access Key] s3n://<your bucket name>/emr/output/ExampleTextWordCount

By passing these arguments to the JAR we uploaded, we’re telling Hadoop to: 

  1. Run the main() method of our ExampleTextWordCount class (org.commoncrawl.examples.ExampleTextWordCount).
  2. Log into Amazon S3 with your AWS credentials.
  3. Count all the words in a small sample of crawled content.  (The input file name is set in the "run()" method of the Java code; a simplified word-count sketch follows this list.)
  4. Output the results as a series of tab-separated files into your Amazon S3 bucket, in a directory called "emr/output/ExampleTextWordCount".
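
To make the word-counting step more concrete, here is a simplified sketch of a generic Hadoop word-count mapper and reducer. This is an illustration only, not the actual org.commoncrawl.examples.ExampleTextWordCount source; the real class also takes care of reading the crawled content out of S3.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: split each line of input text into tokens and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Hadoop’s default TextOutputFormat writes each reducer record as the word and its count separated by a tab, which is why the job’s output files are tab-separated.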

 
Don’t worry about the remaining fields for now; just accept the default values.  If you’re offered the option to enable debugging, we recommend turning it on so you can see your job in action.  Once you’ve clicked through the remaining screens, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.


Step 5: Watch the show

Now just wait and watch as your job runs through the Hadoop flow; you can look for errors using the Debug button.  Your job should complete within about 10 minutes.
 
The Common Crawl examples MapReduce job is a simple word count, and the output files it generates are sorted alphabetically, with the frequency listed beside each word.  Note that since the sampled content contains many numeric tokens, these sort to the top of each file; don’t be alarmed!  You can retrieve the results from the S3 console.  If you download these files, they will likely be too large to open in a normal text editor; consider using less on the command line to inspect the results of the job.


Step 6: Now Go Experiment

Now that you have successfully assembled a custom JAR and run an Elastic MapReduce job, it’s time to start experimenting and test the limits of what you already understand. Try using a different ARC file as the input to analyze.  If you'd like to learn more about the organization and format of the public dataset, see About the Data Set.

The output that the MapReduce job creates is rather raw, and you may want to use it in other applications.  You can load this sort of data into a database, or create a new Hadoop OutputFormat that exports XML, which you can then render into HTML with an XSLT stylesheet.  The possibilities are pretty much endless.
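
As a starting point for that kind of post-processing, here is a small sketch that parses the tab-separated output lines into (word, count) pairs. The file name is a placeholder; point it at one of the part files you downloaded from your bucket's emr/output/ExampleTextWordCount directory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadWordCounts {
    public static void main(String[] args) throws IOException {
        // Placeholder path: one of the output part files downloaded from your bucket.
        BufferedReader reader = new BufferedReader(new FileReader("part-00000"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line has the form "<word><TAB><count>".
                String[] fields = line.split("\t", 2);
                String word = fields[0];
                long count = Long.parseLong(fields[1].trim());
                // From here you could insert (word, count) into a database,
                // emit XML, or feed any other application.
                System.out.println(word + " -> " + count);
            }
        } finally {
            reader.close();
        }
    }
}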