Inspiration and Ideas

A growing number of projects use Common Crawl data, and the community has shared a list of possible project ideas; this page highlights both as a launching point for your own work. If you have a project or idea that you would like added to the list, please share it with us on the discussion board.

Inspiration:

Web Data Commons

The Web Data Commons project has extracted all Microformat, Microdata and RDFa data from the Common Crawl corpus, and packaged the extracted data for each format separately for download. This makes it possible to analyze a subset of the corpus in very specific formats, such as hCard, hCalendar, hRecipe, and Geo Microformats.
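To give a feel for what working with these formats involves, here is a minimal sketch that counts Microformat root elements in a single HTML page. It is not the Web Data Commons extraction pipeline; the names `ROOT_CLASSES`, `MicroformatCounter` and `count_microformats` are made up for illustration, and the sketch assumes the classic root class names (`vcard` for hCard, `vevent` for hCalendar, `hrecipe` for hRecipe, `geo` for Geo).

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed mapping from classic Microformat root class names to format names.
ROOT_CLASSES = {
    "vcard": "hCard",
    "vevent": "hCalendar",
    "hrecipe": "hRecipe",
    "geo": "Geo",
}


class MicroformatCounter(HTMLParser):
    """Counts elements whose class attribute contains a known root class."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                for cls in value.split():
                    fmt = ROOT_CLASSES.get(cls)
                    if fmt:
                        self.counts[fmt] += 1


def count_microformats(html):
    """Return a dict of format name -> number of root elements in the page."""
    parser = MicroformatCounter()
    parser.feed(html)
    return dict(parser.counts)
```

Calling `count_microformats(page_html)` on each page and summing the results would give a rough per-format tally for a subset of the corpus.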


Study of Web Pages that Reference Facebook

Matthew Berk conducted a study of ~1.3 billion URLs crawled by Common Crawl in 2012, looking at pages that reference Facebook URLs. He found that 22% of Web pages contain Facebook URLs, and 8% of Web pages implement Open Graph tags.
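The two signals measured in the study can be checked per page along the following lines. This is only a sketch under stated assumptions, not the study's method: it treats any `href` or `src` pointing at facebook.com (or a subdomain) as a Facebook reference, and any `<meta property="og:...">` element as an Open Graph tag. `FacebookSignals` and `page_signals` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FacebookSignals(HTMLParser):
    """Records whether a page links to Facebook and whether it uses Open Graph tags."""

    def __init__(self):
        super().__init__()
        self.has_facebook_url = False
        self.has_open_graph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: a Facebook reference is an href/src whose host is
        # facebook.com or one of its subdomains.
        for name in ("href", "src"):
            host = urlparse(attrs.get(name) or "").netloc.lower()
            if host == "facebook.com" or host.endswith(".facebook.com"):
                self.has_facebook_url = True
        # Assumption: an Open Graph tag is a <meta> whose property starts with "og:".
        if tag == "meta" and (attrs.get("property") or "").lower().startswith("og:"):
            self.has_open_graph = True


def page_signals(html):
    """Return (has_facebook_url, has_open_graph) for one HTML document."""
    parser = FacebookSignals()
    parser.feed(html)
    return parser.has_facebook_url, parser.has_open_graph
```

Running `page_signals` over every page and dividing the two counts by the total number of pages would yield percentages comparable in spirit to those reported above.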


Analyzing and Classifying Privacy Policies

Safe Shepherd is working on a project to analyze and classify every privacy policy on the Internet, in order to find out what information a given website collects about you and what they choose to do with that information.


Link Reverse

A web application that shows which pages link to a given URL.
Web app (limited to the domain mit.edu and to the first two valid segments of Common Crawl): http://linkrev.herokuapp.com/
Source code for the web app: https://github.com/namin/linkrev
Source code for the Common Crawl experiments (using the Spark framework): https://github.com/namin/spark/tree/namin/namin/src/main/scala/net/namin/commoncrawl
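The core idea, inverting outgoing links into incoming links, can be sketched in a few lines. The project itself uses Spark; the plain-Python version below is only an illustration, and `build_reverse_index` and its input shape are assumptions rather than anything taken from the linkrev code.

```python
from collections import defaultdict


def build_reverse_index(pages):
    """Invert a link graph.

    pages: iterable of (source_url, [target_url, ...]) pairs.
    Returns a dict mapping each target URL to a sorted list of the
    distinct pages that link to it.
    """
    inbound = defaultdict(set)
    for source, targets in pages:
        for target in targets:
            inbound[target].add(source)
    return {target: sorted(sources) for target, sources in inbound.items()}
```

Looking up a URL in the returned dict answers the question the web app poses: which pages link here? On the full corpus the same grouping step would be distributed, which is presumably why the original experiments use a cluster framework.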

 

Online Sentiment Towards Congressional Bills

The United States Congress debates, and eventually votes, on legislation (bills) to be turned into law. This project looks at how the internet responds to these bills - correlating Common Crawl and congressional data allows us to look at the conversation surrounding individual pieces of legislation. For a particular bill, it is interesting to see how many times it is mentioned across the internet, which websites talk about it the most, which sites are most influential, and what language is commonly associated with it. The work limits its analysis to bills considered by the 112th United States Congress, but our methods can easily be generalized to other time periods, legislation, and countries.

Program Files:
1) BillCounter: counts the number of pages on which the bill, in any of its forms, is mentioned
2) DomainAnalysis: records the domains of pages that mention a bill, in any of its forms, and outputs the 50 domains that mention the bill the most (with the number of pages on each that mention it)
3) AssociationAnalysis: outputs the top 50 words found across all pages that mention a bill in any of its forms, excluding a set of 100 very common words
Code on GitHub: https://github.com/awavering/CC-Bill-Tracker
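The three analyses listed above can be sketched together in one pass over the pages. This is a simplified, single-machine illustration and not the code in the repository: `mentions`, `analyze`, and the short `STOPWORDS` set (a stand-in for the project's 100 very common words) are all hypothetical, and a bill "form" is assumed to be a plain string matched case-insensitively.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Stand-in for the project's list of 100 very common words (assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}


def mentions(text, bill_forms):
    """True if the text contains any form of the bill's name (case-insensitive)."""
    lowered = text.lower()
    return any(form.lower() in lowered for form in bill_forms)


def analyze(pages, bill_forms, top_n=50):
    """pages: iterable of (url, text) pairs.

    Returns (page_count, top_domains, top_words):
      page_count  - number of pages mentioning the bill (BillCounter)
      top_domains - most frequent domains among those pages (DomainAnalysis)
      top_words   - most frequent non-stopwords on those pages (AssociationAnalysis)
    """
    page_count = 0
    domains = Counter()
    words = Counter()
    for url, text in pages:
        if not mentions(text, bill_forms):
            continue
        page_count += 1
        domains[urlparse(url).netloc.lower()] += 1
        words.update(
            w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS
        )
    return page_count, domains.most_common(top_n), words.most_common(top_n)
```

A call such as `analyze(pages, ["H.R. 1234", "Example Act"])` (placeholder bill names) returns the page count, the top 50 domains, and the top 50 associated words for that bill.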

 

Is Money The Root Of All Evil?

This project looks at mentions of money across the crawl and how the 7 deadly sins appear alongside them. The pages are ranked in terms of their "moneyness" and are then assessed for their sin quotient. Does money bring out the worst in us?
Code on GitHub: https://github.com/joyita/IsMoneyTheRootOfAllEvil
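One way to make "moneyness" and "sin quotient" concrete is shown below. The definitions here are assumptions made for illustration and may differ from the project's: moneyness is taken to be the fraction of a page's words that are money terms, and the sin quotient the fraction that name one of the seven deadly sins. `MONEY_TERMS`, `moneyness`, `sin_quotient` and `rank_pages` are hypothetical names.

```python
import re

# Hypothetical vocabulary of money-related words (assumption).
MONEY_TERMS = {"money", "cash", "dollar", "dollars", "wealth"}
# The seven deadly sins, matched as single words.
SINS = {"lust", "gluttony", "greed", "sloth", "wrath", "envy", "pride"}


def tokens(text):
    """Lowercase alphabetic words in the text."""
    return re.findall(r"[a-z]+", text.lower())


def moneyness(text):
    """Fraction of words that are money terms (0.0 for empty text)."""
    toks = tokens(text)
    return sum(t in MONEY_TERMS for t in toks) / len(toks) if toks else 0.0


def sin_quotient(text):
    """Fraction of words that name a deadly sin (0.0 for empty text)."""
    toks = tokens(text)
    return sum(t in SINS for t in toks) / len(toks) if toks else 0.0


def rank_pages(pages):
    """pages: iterable of (url, text) pairs.

    Returns (url, moneyness, sin_quotient) rows, highest moneyness first.
    """
    scored = [(url, moneyness(text), sin_quotient(text)) for url, text in pages]
    return sorted(scored, key=lambda row: row[1], reverse=True)
```

Comparing the sin quotient of the top-ranked pages with that of the bottom-ranked ones is one simple way to probe the project's question.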

Ideas:

Analyzing How Jobs Factor Into The Economy

By identifying web pages that look like job listings and extracting information like date, location, title, and required skills, you could identify which companies and industries are growing, how certain areas of the job market are doing in comparison to others, and which skills are highly valued.

Extracting Data about Digital Library Collections

Jason Ronallo suggested that structured data on library web pages could be extracted from the Common Crawl corpus to fetch images and create virtual aggregations of library and archival content.

Analyzing Social Impact and Sentiment About Politicians

The Common Crawl corpus has information related to politics, including political speeches, the full text of bills, and news articles that mention politicians. By identifying web pages that relate to politics, you could find out which words are associated with individual politicians.

Bollywood

Bollywood is the second-biggest movie industry after Hollywood, and it has a far-reaching social and economic impact. It would be interesting to analyze movie-watching trends in the south and north of India using Common Crawl data.