Common Crawl is a non-profit organization, founded in 2007 to democratize access to web data at scale, that makes web crawl data freely available. Each crawl captures billions of web pages and totals over 150 terabytes. The data is released without restriction and hosted on Amazon S3 as a public dataset. It has been used for natural language processing, machine learning, analytics, and more; researchers have extracted tables, links, phone numbers, and parallel text from the corpus.