Data Set Size & Statistics - 2012
Total # of Web Documents: 3.8 billion
Total Uncompressed Content Size: 100 TB+
# of Domains: 61 million
# of PDFs: 92.2 million
# of Word Docs: 6.6 million
# of Excel Docs: 1.3 million
The current crawl data is publicly accessible on Amazon S3 at: s3://aws-publicdatasets/common-crawl/parse-output/
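As a rough illustration of how the location above might be used programmatically, the sketch below splits the S3 URI into a bucket name and key prefix, then lists a few object keys. It assumes the third-party boto3 library is installed and that the bucket permits anonymous (unsigned) listing, neither of which is stated in the text. The helper names split_s3_uri and list_keys are hypothetical, chosen only for this example.

```python
from urllib.parse import urlparse

# Location quoted in the text above.
CRAWL_URI = "s3://aws-publicdatasets/common-crawl/parse-output/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(uri, max_keys=10):
    """List up to max_keys object keys under the URI.

    Assumption: the bucket allows anonymous listing, so requests
    are sent unsigned. boto3 is imported here so that the URI
    helper above works even when boto3 is not installed.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=max_keys)
    return [obj["Key"] for obj in resp.get("Contents", [])]


if __name__ == "__main__":
    print(split_s3_uri(CRAWL_URI))
```

Calling list_keys(CRAWL_URI) would issue a network request, and the result depends on whether the bucket is still available and publicly listable.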