CommonCrawl is a non-profit organization that builds a comprehensive web-scale crawl using Hadoop. It crawls broadly and frequently across all top-level domains, prioritizing pages by rank and freshness. The resulting data is uploaded to Amazon S3 and made widely accessible to enable innovation. CommonCrawl uses a modest Hadoop cluster to crawl over 100 million URLs per day and processes over 800 million documents during post-processing. The goal is to reduce the cost of "mapping and reducing the internet" and so spur new opportunities.
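To make the S3-hosted data concrete, here is a minimal sketch of listing the publicly readable crawl archives with boto3. The bucket name (commoncrawl), the crawl-data/ prefix, and the use of anonymous access are assumptions about the current public layout, not details taken from the description above.

```python
# Minimal sketch: enumerate CommonCrawl archives in the public S3 bucket.
# Assumptions: data lives in the "commoncrawl" bucket under a "crawl-data/"
# prefix; both are illustrative and may differ from the layout described here.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) requests are sufficient because the bucket is public.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the top-level crawl snapshots under the assumed prefix.
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/",
    Delimiter="/",
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```

The listed paths could then be downloaded directly or fed to Hadoop jobs that read from S3, in the spirit of the post-processing pipeline described above.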