This document describes using Apache Nutch to crawl websites and store the gathered data in HDFS. Pig Latin scripts then structure the raw data, extracting fields such as company name, location, funding amount, and date. The structured data is visualized with bar charts of total investment by year and by sector, and additional views break investments down by ZIP code and county to surface geographic trends. Because the underlying data is crowd-sourced, it contains many errors, so careful validation is needed to keep them from skewing the results.
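The document's actual Pig Latin scripts are not shown, so the sketch below illustrates the same pipeline stage in plain Python: parse raw tab-separated crawl records into structured fields, drop rows that fail validation (the error checking the text emphasizes for crowd-sourced data), and aggregate funding totals by year and by sector. The record layout and all field names here are assumptions for illustration, not taken from the original.

```python
from collections import defaultdict

# Hypothetical raw records: company \t sector \t zip \t funding \t date
RAW_RECORDS = [
    "Acme Robotics\tHardware\t94107\t1500000\t2012-06-01",
    "DataWeave\tAnalytics\t10001\t750000\t2013-02-15",
    "BadRow\tAnalytics\tnot-a-zip\tN/A\t2013-02-30",  # crowd-sourced error
]

def parse(record):
    """Split a raw record into fields; return None if validation fails."""
    parts = record.split("\t")
    if len(parts) != 5:
        return None
    company, sector, zip_code, funding, date = parts
    if not (zip_code.isdigit() and len(zip_code) == 5):
        return None  # malformed ZIP code
    try:
        amount = int(funding)
    except ValueError:
        return None  # non-numeric funding amount
    year = date[:4]
    if not year.isdigit():
        return None
    return {"company": company, "sector": sector,
            "zip": zip_code, "amount": amount, "year": year}

def totals_by(rows, key):
    """GROUP ... GENERATE SUM-style aggregation over one field."""
    out = defaultdict(int)
    for r in rows:
        out[r[key]] += r["amount"]
    return dict(out)

# Keep only rows that survive validation, then aggregate.
rows = [r for r in (parse(rec) for rec in RAW_RECORDS) if r is not None]
by_year = totals_by(rows, "year")
by_sector = totals_by(rows, "sector")
```

In the actual pipeline this filtering and grouping would be expressed in Pig Latin (e.g. `FILTER`, `GROUP`, and `FOREACH ... GENERATE SUM(...)`) running over HDFS data, and the resulting totals would feed the bar charts described above.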