Helpful Guides and Links

If you're new to data mining and related technologies, you may want to review some introductory materials. This page collects general resources related to Common Crawl and data mining. If you have suggestions for additional resources, please leave a comment on this page.

Processing Common Crawl Data on Hadoop on AWS
A video presentation by Jaideep Dhok on Common Crawl and analyzing data with Amazon Web Services.

12 Steps to running your Ruby code across 5 billion web pages
Written by Common Crawl Advisory Board member Pete Warden, this guide walks you through a simple example of how to run your own code across Common Crawl's archived pages.

MapReduce for the Masses: Zero to Hadoop in 5 Minutes with Common Crawl
Steve Salevan's video tutorial shows you how to harness the power of MapReduce data analysis against the Common Crawl dataset.

Sample Wordcount Streaming Job Using PHP
A blog tutorial about running PHP on Elastic MapReduce to analyze Common Crawl data.

What is Apache Hadoop?
Edd Dumbill explains the basics of Hadoop, MapReduce, HDFS, and related technologies in an introductory guide.

Apache Hadoop - Petabytes and Terawatts
A LinkedIn Tech Talk by Jakob Homan that gives an overview of Hadoop and its ecosystem.

EC2 for Poets
Dave Winer's tutorial aims to make cloud computing and Amazon Web Services less mysterious for nontechnical people.

AWS in Education Grants
The AWS in Education grant program lets educators, academic researchers, and students apply for free usage credits for the on-demand infrastructure of the Amazon Web Services cloud, which they can use to teach advanced courses, tackle research endeavors, and explore new projects.

Common Crawl Mailing List
The Common Crawl mailing list is for announcements and for users to ask questions about using the Common Crawl corpus.