Apache Nutch is an open source web crawler that runs on Hadoop and can crawl websites at scale. It integrates directly with Solr, so crawled content can be sent to Solr for indexing. HDFS provides a scalable storage layer that both Nutch and Solr can read from and write to directly. Because the crawl data and the index can both live on HDFS, Solr indexes can be built with Hadoop's MapReduce framework. Morphlines let you define ETL pipelines that extract, transform, and load content from various sources into Solr running on HDFS.
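
To make the ETL idea concrete, here is a minimal sketch of a pipeline in plain Python: each record passes through an ordered chain of small commands, and any command may drop it. This is not the Morphlines API or its configuration format. The names `strip_tags`, `drop_empty`, `to_solr_doc`, and `run_pipeline`, the record fields, and the sample URLs are all hypothetical and chosen only for illustration. The load step just collects documents in a list rather than sending them to Solr.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# A record is a flat mapping of field names to string values (an assumption
# made for this sketch, not a description of any real record format).
Record = Dict[str, str]
# A command takes a record and returns a new record, or None to drop it.
Command = Callable[[Record], Optional[Record]]


def strip_tags(record: Record) -> Optional[Record]:
    """Transform: remove HTML tags from the raw body and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", record.get("raw", ""))
    out = dict(record)
    out["text"] = " ".join(text.split())
    return out


def drop_empty(record: Record) -> Optional[Record]:
    """Filter: discard records that have no text left after cleaning."""
    return record if record.get("text") else None


def to_solr_doc(record: Record) -> Optional[Record]:
    """Map: keep only the fields a (hypothetical) index schema expects."""
    return {"id": record["url"], "text": record["text"]}


def run_pipeline(records: Iterable[Record], commands: List[Command]) -> List[Record]:
    """Run every record through the command chain and collect the survivors."""
    docs: List[Record] = []
    for record in records:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:  # a command dropped the record
                break
        if current is not None:
            docs.append(current)
    return docs


# Extract step stand-in: two hard-coded pages instead of real crawl output.
pages = [
    {"url": "http://example.com/a", "raw": "<p>Hello <b>world</b></p>"},
    {"url": "http://example.com/b", "raw": "<br/>"},
]
docs = run_pipeline(pages, [strip_tags, drop_empty, to_solr_doc])
```

In this sketch the second page is dropped because nothing remains once its markup is removed, so only the first page reaches the load step. Keeping each command small and side-effect free makes the chain easy to reorder or extend, which is the main appeal of describing ETL as a pipeline.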