Total # of Web Documents: 3.8 billion
Total Uncompressed Content Size: 100 TB+
# of Domains: 61 million
# of PDFs: 92.2 million
# of Word Docs: 6.6 million
# of Excel Docs: 1.3 million

 

The current crawl data is available on S3 as a public data set: s3://aws-publicdatasets/common-crawl/parse-output/
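
As a rough illustration of how the S3 location above breaks down, the following Python sketch splits the URI into a bucket name and key prefix, then builds a URL for the standard S3 REST listing call (ListObjectsV2). The helper names `split_s3_uri` and `listing_url` are hypothetical, and the sketch assumes the bucket permits anonymous listing through the generic `s3.amazonaws.com` endpoint, which this page does not confirm. It only constructs the URL and makes no network request.

```python
from urllib.parse import urlsplit, urlencode

# Location quoted on this page.
CRAWL_URI = "s3://aws-publicdatasets/common-crawl/parse-output/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix). Hypothetical helper."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the host part; the key prefix is the path without its leading slash.
    return parts.netloc, parts.path.lstrip("/")


def listing_url(uri, max_keys=10):
    """Build an S3 REST ListObjectsV2 URL for the given prefix.

    Assumption: the bucket allows unauthenticated listing via the
    virtual-hosted endpoint <bucket>.s3.amazonaws.com.
    """
    bucket, prefix = split_s3_uri(uri)
    query = urlencode({"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


if __name__ == "__main__":
    print(split_s3_uri(CRAWL_URI))
    print(listing_url(CRAWL_URI))
```

If anonymous listing is permitted, fetching the printed URL would return an XML document naming up to `max_keys` objects under the prefix. Otherwise an authenticated S3 client would be needed.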