DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn

Building Satori: Web Data Extraction on Hadoop
Nikolai Avteniev
Sr. Staff Software Engineer
LinkedIn

Building Opportunity from the Empire State Building
LinkedIn NYC

The Team
Nikita Lytkin
Staff Software Engineer
Pi-Chuan Chang
Sr. Software Engineer
David Astle
Sr. Software Engineer
Nikolai Avteniev
Eran Leshem

Connecting talent with opportunity at massive scale

What we thought we needed
The BIG Idea
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy.
"The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.

Questions we wanted to answer
Focused our Vision
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages near line?
Where would we store this data?
How would we correct mistakes in the flow?

Virtually All Member Value Relies On Identity Data
Susan Kaplan
Sr. Marketing Manager at Weblo
• SEARCH: Research & Contact
• AD TARGETING: Market Products & Services
• PYMK: Build Your Network
• RECRUITER: Recruit & Hire
• FEED: Get Daily News
• NETWORK: Keep in Touch
• RECOMMENDATIONS: Get a Job/Gig
• WVMP: Establish Yourself as Expert

Identity Use Case
A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps and get credit for certificates, patents, publications…

• Avg. HTML document is 6K; 37% are < 10K [1]
• Samza can handle 1.2M messages per second per node [2]
• Data is retained for a limited time, between 7 and 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
Not a perfect fit
1. HTML Document Transfer Size. http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node

Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori

• Enterprise vs. social web use cases
• Web sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.

Induce wrappers based on data [4]
Build wrappers that are robust [5]
Cluster similar pages by URL [6]
The web is huge and there are interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.

Heritrix powers archive.org
Nutch powers Common Crawl
BUbiNG is part of LAW
Scrapy is used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: massive crawling for the masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.

• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
Crawl Flow
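
The two politeness rules on this slide (respect robots.txt, default crawl delay of 5 seconds) can be illustrated with a small Python sketch. It is not the Nutch configuration Satori used; it relies only on the standard library's `urllib.robotparser`, and the agent name `satori-sketch` and the helper `politeness` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_CRAWL_DELAY = 5.0  # seconds; the default named on the slide


def politeness(robots_txt: str, agent: str, url: str):
    """Return (allowed, delay_seconds) for one URL under one robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Use the site's own Crawl-delay when it declares one, else the default.
    declared = parser.crawl_delay(agent)
    delay = float(declared) if declared is not None else DEFAULT_CRAWL_DELAY
    return parser.can_fetch(agent, url), delay


# Hypothetical robots.txt bodies, for illustration only.
STRICT = "User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n"
OPEN = "User-agent: *\nDisallow:\n"

blocked = politeness(STRICT, "satori-sketch", "http://example.com/private/x")
allowed = politeness(OPEN, "satori-sketch", "http://example.com/papers")
```

A crawler would sleep for the returned delay between two requests to the same site and skip any URL that comes back as not allowed.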

• Output into target schema
• Apply XPath wrappers
• Wrappers are a hierarchical mapping of schema field to XPath expression
• Wrappers are indexed by data domain and data source
Extract Flow
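
A wrapper of the kind described above might look like the following Python sketch. Everything named here is an assumption made for illustration: the `WRAPPERS` registry, the `_each` key for repeated records, the field names, and the `example.org` source. It uses the limited XPath subset in `xml.etree.ElementTree` and therefore assumes well-formed markup; real pages would need a tolerant HTML parser and a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: wrappers indexed by (data domain, data source).
WRAPPERS = {
    ("publication", "example.org"): {
        "title": ".//h1[@class='title']",
        "authors": {  # nested level of the hierarchy: one record per match
            "_each": ".//li[@class='author']",
            "name": "./span[@class='name']",
            "affiliation": "./span[@class='org']",
        },
    },
}


def apply_wrapper(node, wrapper):
    """Map each schema field to the text found by its XPath expression."""
    record = {}
    for field, rule in wrapper.items():
        if isinstance(rule, dict):
            child_rules = {k: v for k, v in rule.items() if k != "_each"}
            record[field] = [apply_wrapper(item, child_rules)
                             for item in node.findall(rule["_each"])]
        else:
            hit = node.find(rule)
            record[field] = hit.text.strip() if hit is not None and hit.text else None
    return record


def extract(domain, source, page):
    """Look up the wrapper for (domain, source) and apply it to one page."""
    return apply_wrapper(ET.fromstring(page), WRAPPERS[(domain, source)])


PAGE = (
    "<html><body><h1 class='title'>Web Extraction</h1><ul>"
    "<li class='author'><span class='name'>A. Author</span>"
    "<span class='org'>Example U</span></li>"
    "</ul></body></html>"
)
record = extract("publication", "example.org", PAGE)
```

Keeping the wrapper as plain data, separate from the code that applies it, is what allows one extraction job to serve many sources.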

Crawl rate is bounded by the number of sites and each site's crawl delay
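
As a rough worked example of this bound, the Python sketch below assumes one outstanding request per site and ignores fetch latency, so it gives an upper limit rather than a measured rate. The function name and the 1,000-site figure are illustrative only.

```python
def max_fetches(num_sites: int, crawl_delay_s: int, window_s: int) -> int:
    """Upper bound on fetches in a time window: one request per site per crawl delay."""
    return num_sites * (window_s // crawl_delay_s)


DAY_S = 24 * 60 * 60

# With the 5-second default delay, a single site yields at most 17,280 pages per day.
one_site_per_day = max_fetches(1, 5, DAY_S)

# More pages per day requires more sites, not a faster crawler.
thousand_sites_per_day = max_fetches(1000, 5, DAY_S)
```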

Common Crawl: a great source
https://commoncrawl.org/
Gobblin: a great ingestion framework
https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources

XPath extractors can be challenging on sites with rich data

It is easy to exceed the Hadoop quota

Matching authors and publications to members to power profile edit experiences

Match using global identifiers, email or full name.
The data might not be clean after extraction.
Start with a small set of data and get it to the users quickly.
Start Simple
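
A first-pass matcher along these lines can be sketched in Python as below. The rule order (global identifier, then email, then full name), the field names, and the decision to reject ambiguous hits are assumptions for illustration, not the production logic. The light normalization reflects the note that extracted data might not be clean.

```python
def normalize(text):
    """Lowercase and collapse whitespace so small extraction noise still matches."""
    return " ".join(text.lower().split()) if text else ""


def match_member(record, members):
    """Try each rule in order; accept only when exactly one member matches."""
    for key in ("global_id", "email", "full_name"):  # hypothetical field names
        value = normalize(record.get(key))
        if not value:
            continue
        hits = [m["member_id"] for m in members if normalize(m.get(key)) == value]
        if len(hits) == 1:
            return hits[0], key
    return None


# Hypothetical member records, for illustration only.
MEMBERS = [
    {"member_id": 1, "email": "a@example.com", "full_name": "Pat Lee"},
    {"member_id": 2, "email": "b@example.com", "full_name": "Pat Lee"},
    {"member_id": 3, "global_id": "X-1", "full_name": "Sam Roe"},
]

by_email = match_member({"email": " B@Example.com "}, MEMBERS)
ambiguous = match_member({"full_name": "pat  lee"}, MEMBERS)
by_id = match_member({"global_id": "x-1", "full_name": "Someone Else"}, MEMBERS)
```

Returning the rule that fired alongside the member id makes it easy to track how many matches each rule contributes.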

Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
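
The slides do not say which LSH family was used. As one possible illustration of narrowing candidates, the Python sketch below uses MinHash signatures over character shingles with banding, so only records that share a bucket are compared by the matcher. The shingle size, the band and row counts, and all function names are assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Character k-grams of the lowercased, whitespace-collapsed text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(items, num_hashes):
    """One minimum per seeded hash function; similar sets agree on many positions."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(docs, bands=16, rows=2):
    """Return id pairs whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, c) for i, a in enumerate(ordered) for c in ordered[i + 1:])
    return pairs


# Hypothetical records: one member name and two extracted author strings.
DOCS = {"m1": "Nikolai Avteniev", "p1": "nikolai  avteniev", "p2": "zzzz qqqq"}
pairs = candidate_pairs(DOCS)
```

More bands with fewer rows each admit more candidate pairs; fewer bands with more rows make the filter stricter.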

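Current Status
[Chart: Extractor Objects, Total vs. Processed, for Publications and Companies; data labels in source order: 5.3, 2.3, 3.9, 0.6]
[Chart: Crawler Objects, Unfetched vs. Fetched vs. Gone, for Publication and Company; data labels in source order: 56, 2, 5.6, 2.5, 1.2, 0.1]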

Target a data source whose data will be easy to fetch, extract and match.

Add tracking to the entire flow

Get the product to the customers early to validate the process and value proposition.

Most important of all, write it all down and share it with everyone.

