Building Satori: Web Data Extraction On Hadoop
Nikolai Avteniev
Sr. Staff Software Engineer
LinkedIn
LinkedIn NYC
Building Opportunity from the Empire State Building
The Team
Nikita Lytkin
Staff Software Engineer
Pi-Chuan Chang
Sr. Software Engineer
David Astle
Sr. Software Engineer
Nikolai Avteniev
Sr. Staff Software Engineer
Eran Leshem
Sr. Staff Software Engineer
THE ECONOMIC GRAPH
Connecting talent with opportunity
at massive scale
The BIG Idea
What we thought we needed
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Focused our Vision
Questions we wanted to answer
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
Identity Team
Virtually All Member Value Relies On Identity Data
Susan Kaplan
Sr. Marketing Manager at Weblo
SEARCH: Research & Contact
AD TARGETING: Market Products & Services
PYMK: Build Your Network
RECRUITER: Recruit & Hire
FEED: Get Daily News
NETWORK: Keep in Touch
RECOMMENDATIONS: Get a Job/Gig
WVMP: Establish Yourself as Expert
Identity Use Case
A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps and get credit for certificates, patents, publications…
Kafka/Samza Team
Not a perfect fit
• Avg. HTML document is 6K, 37% < 10K [1]
• Samza can handle 1.2M messages per second on a single node [2]
• Data retention is limited to between 7 and 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
1. HTML Document Transfer Size. http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
The Project: Satori
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data and match it to member profiles.
Web Data Extraction HOW TO:
Web Data Extraction System
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
What is a Wrapper?
Industrial Web Data Extraction
Induce wrappers based on data [4]
Build wrappers that are robust [5]
Cluster similar pages by URL [6]
The web is huge and there are interesting things in the long tail [7]
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
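To make "cluster similar pages by URL" concrete, here is a minimal Python sketch. It is not the structural clustering algorithm of [6]; it only assumes that pages sharing a host and a path shape (with numeric segments generalized) were probably rendered from the same template. The `{num}` placeholder and both function names are illustrative choices, not part of Satori.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_template(url: str) -> str:
    """Reduce a URL to host + path shape, replacing numeric segments with {num}."""
    parts = urlsplit(url)
    segments = [seg for seg in parts.path.split("/") if seg]
    generalized = ["{num}" if seg.isdigit() else seg for seg in segments]
    return parts.netloc.lower() + "/" + "/".join(generalized)


def cluster_by_url(urls):
    """Group URLs whose templates are identical (hypothetical helper)."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_template(url)].append(url)
    return dict(clusters)
```

One wrapper could then be written or induced per cluster rather than per page.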
Picking a Crawler
The Contestants
HERITRIX powers archive.org
NUTCH powers Common Crawl
BUbiNG, part of LAW
Scrapy, used within LinkedIn
8. Web crawling, C Olston, M Najork - Foundations and Trends in Information Retrieval, 2010
9. An Introduction to Heritrix: An Open Source Archival Quality Web Crawler, A Dan, K Michele - 2004
10. BUbiNG: massive crawling for the masses, P Boldi, A Marino, M Santini, S Vigna, 2014
11. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs, R Khare, D Cutting, K Sitaker, A Rifkin - 2004 - CN-TR-04-04, November
And the winner is …
Satori
Crawl Flow
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
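As an illustration of the last two bullets, the sketch below uses Python's standard `urllib.robotparser` to decide whether a URL may be fetched and how long to wait between requests. Satori gets this behavior from Nutch; this is not Nutch code. The function name `fetch_policy` and the user agent string in the usage note are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_CRAWL_DELAY = 5.0  # seconds, the default stated on the slide


def fetch_policy(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, delay_seconds) for one URL given a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when the site declares no Crawl-delay.
    declared = parser.crawl_delay(user_agent)
    delay = float(declared) if declared is not None else DEFAULT_CRAWL_DELAY
    return allowed, delay
```

For example, `fetch_policy(body, "example-crawler", "https://example.com/page")` falls back to the 5-second default whenever the site does not declare its own delay.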
Extract Flow
• Output into target schema
• Apply XPath wrappers
• A wrapper is a hierarchical mapping of schema fields to XPath expressions
• Wrappers are indexed by data domain and data source
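A wrapper of the kind described above might look like the following Python sketch. The registry layout, the `_each` key for repeated nested records, the field names, and the `example-journal` source are all hypothetical; the deck does not show Satori's actual wrapper format. The standard library's `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so a real extractor would likely use a tolerant HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: (data domain, data source) -> schema field -> XPath.
WRAPPERS = {
    ("publication", "example-journal"): {
        "title": ".//h1[@class='title']",
        "authors": {  # nested field: one record per matching node
            "_each": ".//li[@class='author']",
            "name": "./span[@class='name']",
            "affiliation": "./span[@class='org']",
        },
    },
}


def apply_wrapper(node, wrapper):
    """Evaluate each field's XPath relative to node and build a record."""
    record = {}
    for field, rule in wrapper.items():
        if field == "_each":
            continue  # selector for the repeated parent, not a schema field
        if isinstance(rule, dict):
            record[field] = [apply_wrapper(child, rule)
                             for child in node.findall(rule["_each"])]
        else:
            match = node.find(rule)
            record[field] = ("".join(match.itertext()).strip()
                             if match is not None else None)
    return record


def extract(domain, source, page):
    """Look up the wrapper by domain and source, then apply it to the page."""
    return apply_wrapper(ET.fromstring(page), WRAPPERS[(domain, source)])
```

Missing fields come back as `None` rather than raising, which keeps one malformed page from failing a whole batch.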
Crawl rate is bound by the number of sites and the site crawl delay
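The bound can be put in numbers with a small Python sketch. It assumes one request at a time per site and ignores fetch latency, so it is an upper limit rather than a measured rate; the function names are illustrative.

```python
def max_pages_per_hour(num_sites: int, crawl_delay_s: float = 5.0) -> float:
    """Upper bound on fetches per hour when each site allows one request per delay."""
    return num_sites * 3600.0 / crawl_delay_s


def hours_to_fetch(pages_per_site: int, crawl_delay_s: float = 5.0) -> float:
    """Minimum time to fetch a given number of pages from a single site."""
    return pages_per_site * crawl_delay_s / 3600.0
```

With the default 5-second delay a single site yields at most 720 pages per hour, so adding cluster capacity does not help when the crawl covers few sites; only adding sites does.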
Bootstrap From Bulk Sources
Common Crawl: a great source
https://commoncrawl.org/
Gobblin: a great ingestion framework
https://github.com/linkedin/gobblin
XPath extractors can be challenging on sites with rich data
It is easy to exceed the Hadoop quota
Match[in]
Matching authors and publications to members to power profile edit experiences
Overview
Start Simple
Match using global identifiers, email or full name.
The data might not be clean after extraction.
Start with a small set of data and get it to the users quickly.
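A rule cascade in that spirit could be sketched in Python as follows. The record layout and the choice of ORCID as the global identifier are assumptions for illustration; the deck does not say which identifiers Satori uses. Because extracted data may not be clean, names are normalized before comparison, and a rule only counts when it selects exactly one member.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z ]", " ", text.lower()).split())


def match_member(author: dict, members: list):
    """Try global id, then email, then full name; accept only unique hits."""
    rules = [
        ("global_id", lambda r: r.get("orcid") or None),  # ORCID is an assumption
        ("email", lambda r: (r.get("email") or "").strip().lower() or None),
        ("full_name", lambda r: normalize_name(r.get("name") or "") or None),
    ]
    for rule_name, key in rules:
        wanted = key(author)
        if wanted is None:
            continue  # the extracted record lacks this attribute
        hits = [m["id"] for m in members if key(m) == wanted]
        if len(hits) == 1:
            return hits[0], rule_name
    return None, None
```

Returning the rule name alongside the member id makes it easy to track how many matches each rule contributes.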
Keep It Simple
Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
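For readers unfamiliar with LSH, the sketch below shows one common variant, MinHash with banding, using only the Python standard library. The deck does not say which LSH scheme or parameters Satori uses, so the character shingles, the 16 bands of 2 rows, and all function names are assumptions. Records that share a bucket in any band become candidate pairs; everything else is never compared.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(items: set, num_hashes: int) -> tuple:
    """MinHash signature: the minimum of each seeded hash over the set."""
    return tuple(
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big")
            for item in items)
        for seed in range(num_hashes)
    )


def candidate_pairs(records: dict, bands: int = 16, rows: int = 2) -> set:
    """Return id pairs whose signatures agree on at least one full band."""
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, c) for i, a in enumerate(ordered) for c in ordered[i + 1:])
    return pairs
```

More rows per band make collisions stricter and fewer bands reduce recall, so the two parameters trade candidate volume against missed matches.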
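Current Status
[Chart: Extractor Objects. Categories: Publications, Companies. Series: Total, Processed. Data labels as extracted: 5.3, 2.3, 3.9, 0.6]
[Chart: Crawler Objects. Categories: Publication, Company. Series: Unfetched, Fetched, Gone. Data labels as extracted: 56, 2, 5.6, 2.5, 1.2, 0.1]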
Target a data source which has data that will be easy to fetch, extract and match.
Add tracking to the entire flow.
Do it all offline if you can.
Get the product to the customers early to validate the process and value proposition.
Most important of all, write it all down and share it with everyone.

©2014 LinkedIn Corporation. All Rights Reserved.

DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn
