1
– Someday Soon (Flickr)
Mining the web with Hadoop
Steve Watt, Emerging Technologies @ HP
2
– timsnell (Flickr)
3
Gathering Data
Data Marketplaces
4
5
6
Gathering Data
Apache Nutch
(Web Crawler)
7
Pascal Terjan (Flickr)
8
9
10
Using Apache Nutch
Identify optimal seed URLs for a seed list and crawl to a depth of 2
For example:
http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .
Crawl data is stored in sequence files in the segments directory on HDFS
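The seed list above follows a single pattern in which only the `c` query value changes. A minimal Python sketch of generating such a list is shown here; the function names and the one-URL-per-line file layout are assumptions for illustration, and the letters are passed in by the caller because the slide only shows a through d.

```python
# URL pattern taken from the slide; only the "c" query value varies.
SEED_PATTERN = "http://www.crunchbase.com/companies?c={letter}&q=private_held"


def build_seed_urls(letters):
    """Return one seed URL per index letter, in the order given."""
    return [SEED_PATTERN.format(letter=letter) for letter in letters]


def write_seed_list(path, urls):
    """Write one URL per line (assumed plain-text seed list layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


# The slide lists a through d; any further letters are up to the caller.
for url in build_seed_urls("abcd"):
    print(url)
```

Keeping the letters as a parameter makes it easy to start with a small crawl and widen it once the output looks right.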
11
ALSO
12
Making the data STRUCTURED
Retrieving HTML
Prelim filtering on URL
Company POJO, then \t out
13
The Result? Tab Delimited Structured Data…
Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc.
Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels
Note: I dropped the ZipCode because it didn’t occur consistently
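The step from a company object to a tab-delimited line can be sketched in Python as follows. This is an illustration rather than the original Java POJO: the class name `FundingRound` and its methods are hypothetical, and only the column order comes from the slide. A missing trailing Investors field, as in the Masher seed round, is padded with an empty string.

```python
from dataclasses import astuple, dataclass


@dataclass
class FundingRound:
    """One funding record, fields in the same order as the slide's columns."""
    company: str
    city: str
    state: str
    country: str
    sector: str
    round_name: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""

    def to_tsv(self) -> str:
        """Serialize the record as a single tab-delimited line."""
        return "\t".join(str(value) for value in astuple(self))

    @classmethod
    def from_tsv(cls, line: str) -> "FundingRound":
        """Parse a tab-delimited line, padding a missing Investors column."""
        parts = line.rstrip("\n").split("\t")
        parts += [""] * (11 - len(parts))
        (company, city, state, country, sector,
         round_name, day, month, year, amount, investors) = parts
        return cls(company, city, state, country, sector, round_name,
                   int(day), int(month), int(year), int(amount), investors)


row = FundingRound("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                   "Angel", 14, 9, 2010, 350000, "Stage One Capital")
print(row.to_tsv())
```

Converting the numeric columns to integers at parse time surfaces malformed rows early, since `int()` raises on anything that is not a number.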
14
Time to Analyze/Visualize the data…
Step 1: Select the right visual encoding for your questions
Let’s start by asking questions and seeing what we can learn from some simple bar charts…
*Total Tech Investments By Year
*Investment Funding By Sector
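The numbers behind bar charts like these are simple grouped sums over the tab-delimited records. A minimal Python sketch, assuming the column order from the structured-data slide (Sector at index 4, Year at index 8, Amount at index 9) and using only the five sample rows shown there; the helper name `total_by` is hypothetical.

```python
from collections import defaultdict

SECTOR, YEAR, AMOUNT = 4, 8, 9  # column positions in the tab-delimited rows

# The five sample rows from the structured-data slide.
SAMPLE_ROWS = [
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital",
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury",
    "MassRelevance\tAustin\tTX\tUSA\tEnterprise\tA\t20\t12\t2010\t2200000\tFloodgate, AV, etc.",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tSeed\t0\t2\t2009\t175000",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tAngel\t11\t8\t2009\t300000\tTech Coast Angels",
]


def total_by(lines, key_index):
    """Sum the Amount column grouped by the column at key_index."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split("\t")
        totals[fields[key_index]] += int(fields[AMOUNT])
    return dict(totals)


print(total_by(SAMPLE_ROWS, YEAR))
print(total_by(SAMPLE_ROWS, SECTOR))
```

The resulting dictionaries map directly onto bar labels and bar heights in whichever charting tool is used.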
18
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
20
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Francisco
21
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
22
HP Confidential
Geospatial Encoding of Data
Steve’s Not so Excellent Adventure
23
• Let’s try a choropleth encoding of the distribution of investment income by county
• Wait, what is GeoJSON?
• OK, each GeoJSON county is mapped to some code
• Each county code has a value that corresponds to a palette color
• So what are these codes? FIPS codes? But Google returns 3- and 5-digit codes?!?
• I found a 5-digit code list, and it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
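The 3- versus 5-digit confusion has a simple explanation: a county FIPS code is a 2-digit state code followed by a 3-digit county code, so the 3-digit values are only unique within a state. A short Python sketch of combining and splitting the two parts (the function names are hypothetical):

```python
def county_fips(state_code, county_code):
    """Combine a state code and a county code into a 5-character FIPS string."""
    state, county = int(state_code), int(county_code)
    if not (0 < state < 100 and 0 < county < 1000):
        raise ValueError(f"code out of range: {state_code!r}, {county_code!r}")
    # Zero-pad both parts so leading zeros survive (California is state 06).
    return f"{state:02d}{county:03d}"


def split_fips(code):
    """Split a 5-character FIPS string into (state, county) strings."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit FIPS code: {code!r}")
    return code[:2], code[2:]


print(county_fips(6, 85))     # Santa Clara County, California
print(county_fips(48, 453))   # Travis County, Texas
```

Treating the code as a string from the start matters, because the first two characters carry meaning of their own.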
Generating Investment Income By County
24
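-- FIPSCode must be a chararray: an int drops the leading zero (see slide 26)
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City:chararray, State:chararray, FIPSCode:chararray);
Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City:chararray, State:chararray, Amount:long);
AmtGroup = GROUP Amt BY (City, State);
-- FLATTEN turns the (City, State) group tuple back into two join columns
SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
Final = FOREACH JoinGroup GENERATE FIPS::FIPSCode, SumGroup::Amount;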
RESULT: 51234 5000000
16234 1234000 (...)
ALWAYS, ALWAYS check your output…
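For checking the output on a small sample, the same group, sum and join logic can be mirrored in a few lines of Python. This is a sketch for spot checks, not a replacement for the Pig job; the function names are hypothetical and the sample FIPS rows are illustrative inputs rather than data from the deck.

```python
from collections import defaultdict


def sum_by_city(amount_rows):
    """Total the amounts per (city, state), like GROUP followed by SUM."""
    totals = defaultdict(int)
    for city, state, amount in amount_rows:
        totals[(city, state)] += amount
    return dict(totals)


def join_fips(totals, fips_rows):
    """Inner join on (city, state): one output row per matching FIPS row."""
    joined = []
    for city, state, code in fips_rows:
        if (city, state) in totals:
            joined.append((code, totals[(city, state)]))
    return joined


amounts = [("Austin", "TX", 350000), ("Austin", "TX", 1200000),
           ("Calabasas", "CA", 175000)]
fips = [("Austin", "TX", "48453"), ("Calabasas", "CA", "06037")]

for code, total in join_fips(sum_by_city(amounts), fips):
    print(code, total)
```

Because the join emits one row per matching FIPS row, a city listed under two counties appears twice in the output, which is one way duplicate records can creep in.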
But wait, why are there duplicate records?
25
Apparently some cities can actually belong to two counties… I guess I’ll pick one.
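A small Python sketch of finding the cities that map to more than one county code and then picking one per city. Choosing the lowest code is an arbitrary but repeatable rule assumed here; the deck does not say which county was picked. The sample rows are illustrative.

```python
from collections import defaultdict


def find_ambiguous_cities(fips_rows):
    """Return {(city, state): [codes]} for cities listed under several codes."""
    codes = defaultdict(set)
    for city, state, code in fips_rows:
        codes[(city, state)].add(code)
    return {key: sorted(found) for key, found in codes.items() if len(found) > 1}


def pick_one_county(fips_rows):
    """Keep a single code per (city, state): the lowest one (assumed policy)."""
    chosen = {}
    for city, state, code in fips_rows:
        key = (city, state)
        if key not in chosen or code < chosen[key]:
            chosen[key] = code
    return chosen


rows = [("Austin", "TX", "48491"), ("Austin", "TX", "48453"),
        ("Calabasas", "CA", "06037")]
print(find_ambiguous_cities(rows))
print(pick_one_county(rows))
```

Listing the ambiguous cities before collapsing them shows how much of the total investment is affected by the choice.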
Yay, no duplicates. Let’s visualize this!
26
• Wait, what happened to California?
• Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I added them back. Voilà! We have California.
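The leading-zero problem is easy to reproduce and to repair outside Pig as well. A minimal Python sketch, assuming every code is meant to be a 5-digit county FIPS code; `restore_fips` is a hypothetical helper name.

```python
def restore_fips(value):
    """Return a 5-character FIPS string, re-adding zeros lost to int storage."""
    text = str(value).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {value!r}")
    return text.zfill(5)


# Parsing as an integer is what loses the zero in the first place.
damaged = int("06085")
print(damaged)
print(restore_fips(damaged))
```

Padding works only because the intended width is known; the safer fix is to never let the code become a number at all.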
On Error Checking…
27
• Crowdsourced data has LOADS of errors in it, and they actually influence your results. You need a good system that helps identify those errors.
• Santa Clara, Ca
• Santa, Clara
• Santa, Clara CA
• Track (count) input and output records. Examine the results. Something fishy?
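One way to track counts while catching entries like the Santa Clara variants above is a small validation pass. The Python sketch below assumes each record has already been split on the comma into a city and a state field, and uses a deliberately simple rule: the state must be exactly two letters after trimming and upper-casing. The function names and the rule itself are illustrative assumptions.

```python
import re
from collections import Counter

STATE_RE = re.compile(r"[A-Z]{2}")


def clean_location(city, state):
    """Return a normalized (city, state) pair, or None if it looks wrong."""
    city = " ".join(city.split())       # collapse stray whitespace
    state = state.strip().upper()       # "Ca" becomes "CA"
    if not city or not STATE_RE.fullmatch(state):
        return None
    return city, state


def clean_all(records):
    """Clean every record and count inputs, outputs and rejects."""
    counts = Counter()
    kept, rejected = [], []
    for city, state in records:
        counts["input"] += 1
        cleaned = clean_location(city, state)
        if cleaned is None:
            counts["rejected"] += 1
            rejected.append((city, state))
        else:
            counts["output"] += 1
            kept.append(cleaned)
    return kept, rejected, counts


records = [("Santa Clara", "Ca"), ("Santa", "Clara"), ("Santa", "Clara CA")]
kept, rejected, counts = clean_all(records)
print(kept, rejected, dict(counts))
```

If input does not equal output plus rejected, records are being lost or duplicated somewhere, and the rejected list is the first place to look for patterns worth a dedicated fix.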
28
29
Questions?
Steve Watt swatt@hp.com
@wattsteve
emergingafrican.com


Editor's Notes

  • #7: Give a Nutch example