Understanding Voice of Members via Text Mining: How LinkedIn Built a Text Analytics Platform at Scale
Chi-Yi Kuan
Weidong Zhang
Tiger Zhang
Who are we?
Chi-Yi Kuan (www.linkedin.com/in/chiyikuan)
•  Director, Analytics at LinkedIn
•  Big data evangelist and practitioner
Weidong Zhang (www.linkedin.com/in/weidongzhang1)
•  Manager, Analytics Platform & Apps at LinkedIn
•  Builds big data and analytics products
Tiger Zhang (www.linkedin.com/in/tigerzhang)
•  Sr. Staff, Analytics at LinkedIn
•  Text mining scientist and big data enthusiast
Strata + Hadoop World, 12/8/2016
Members: 467M
Companies: 7M
Jobs: 6M
Skills: 3B endorsements
Schools: 27k
Knowledge: 200k daily posts
467M 2B Billions
LinkedIn Big Data
467+ million members = a lot of data
Voices: drive actionable intelligence from member voices…
What’s trending
Products: Home Page, Mobile, Inbox
Sentiments
Value Props: Hire, Market, Sell
Relevance filtering: Identify content that is relevant to the LinkedIn brand and products/services
Classification: Structure unstructured textual data into well-defined categories
Topic mining: Find the most significant topics and stories in a certain time window
…creating impact across business metrics
•  Developed game-changing solutions to drive Voice of Member impact
•  Improved analytics efficiency with unstructured data by 20X
•  Drove end-to-end technological integration on big data and embedding NLP solutions
•  Piloting operational solutions to scale advanced analytics impact for the broader organization
LinkedIn Hadoop Ecosystem
HDFS
Map-Reduce Tez Spark
Pig Hive Scalding
YARN
AZKABAN
Design Principles for Voices Platform
Scalability (Process Platform): Kafka, Hadoop, Spark, Gobblin
Availability (Data Systems): Elasticsearch, NoSQL, Phoenix
Easy to Use (Application Framework): Elasticsearch, Highcharts
E2E Voices Platform Architecture
Data Processing at Scale – with Generic ETL
Smart IDs – for Viral Mentions with Threading
High Availability – through Heterogeneous Data
Machine-learning-based analytics engine to surface insights to everyday business users
•  Customized feeds
•  Central navigation
•  Trending insights
•  Social analytics & topic mining
•  Deep dives
•  Sentiment solutions
Text mining is a crowded space
Our solution targets unique use cases for LinkedIn
Member info
•  Identity
•  Behavior
•  Social
Social data
Customer feedback
•  Customer service
•  Group updates
•  Network updates
Survey results
What’s trending
Products: PYMK, Group, Home Page, Mobile, Inbox, Identity, Network
Sentiments
Value Propositions: Hire, Market, Sell
Relevance solution
Topic mining
Text classification
1) Focusing on relevant data
Relevant:
▪ Product insights, launches, and events
▪ Horizontal themes
▪ PR and marketing campaigns
▪ Brand and value
▪ LinkedIn’s strategy, financial performance, international, etc.
Non-relevant:
▪ Status updates, e.g. "I posted something on LinkedIn"
▪ Social mentions, e.g. "Please connect with me on LinkedIn" or "Follow me on LinkedIn"
▪ Self-promoting materials, e.g. "share on LinkedIn"
▪ Spam
Keyword-based approach
[Chart: relevance prediction power of the Rules, Whitelist, and Blacklist approaches. Values shown: 56%, 10%, 60%, 6%, 19%, 35%.]
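A keyword-based relevance filter of the kind this slide names can be sketched as below. The phrase lists and the function name `keyword_relevance` are hypothetical illustrations (the deck does not publish the real lists); the blacklist phrases are borrowed from the non-relevant examples earlier in the deck.

```python
# Hypothetical keyword lists; the real whitelist/blacklist are not in the deck.
BLACKLIST = (
    "connect with me on linkedin",
    "follow me on linkedin",
    "share on linkedin",
)
WHITELIST = ("linkedin launch", "linkedin earnings", "linkedin app")


def keyword_relevance(text):
    """Return True (relevant), False (non-relevant), or None (no keyword hit)."""
    lowered = text.lower()
    # Blacklist phrases are checked first so social mentions are dropped early.
    if any(phrase in lowered for phrase in BLACKLIST):
        return False
    if any(phrase in lowered for phrase in WHITELIST):
        return True
    # Undecided: keyword lists only cover part of the incoming text.
    return None
```

The `None` branch is the weak point of any keyword approach: text that matches neither list stays unclassified, which is one motivation for a learned classifier.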
2) Leveraging text classification engine
Generic text classification framework
▪  Feature generation
▪  Feature selection
▪  Machine learning algorithms:
–  Naïve Bayes (NB)
–  Logistic Regression (LR)
–  Support Vector Machines (SVM) (LibLinear)
▪  Cross-validation and evaluation
Applications
▪  LinkedIn relevance
▪  Sentiment analysis
▪  Product categorization
▪  Value proposition classification
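To make the framework steps concrete, here is a minimal sketch of feature generation plus one of the listed algorithms (Naïve Bayes) applied to the relevance task. It is a plain-Python illustration, not the Voices implementation: the training sentences, labels, and class name are assumptions, and feature selection, LR/SVM via LibLinear, and cross-validation are left out.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Feature generation: lowercase unigram tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(features(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in features(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, loosely based on the examples in this deck.
texts = [
    "LinkedIn launches a new mobile app",
    "LinkedIn reports strong quarterly earnings",
    "The new LinkedIn inbox redesign is great",
    "Please connect with me on LinkedIn",
    "Follow me on LinkedIn",
    "I posted something on LinkedIn",
]
labels = ["relevant"] * 3 + ["non_relevant"] * 3
model = NaiveBayes().fit(texts, labels)
```

The same `fit`/`predict` shape would serve the other applications listed on the slide (sentiment, product categorization, value propositions) by swapping the label set.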
Machine learning approach increases overall relevance by 40%
[Chart: relevance prediction power of Rules, Whitelist, Blacklist, and SVM. Values shown: 56%, 6%, 19%, 40%, 100%, 35%.]
SVM: great gain in balancing precision and recall
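For readers less familiar with the two metrics, the sketch below shows how precision, recall, and their harmonic mean (F1) are computed for a relevance classifier. The function name and label strings are illustrative assumptions, and the deck does not say which combined metric the team actually used.

```python
def precision_recall_f1(predicted, actual, positive="relevant"):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # true positives
    fp = sum(p == positive and a != positive for p, a in pairs)  # false positives
    fn = sum(p != positive and a == positive for p, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A strict blacklist tends to raise precision at the cost of recall, while a broad whitelist does the opposite; F1 is one common way to express the balance between them.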
3) Enabling topic mining
Part-of-speech (POS) tagging (Stanford CoreNLP), e.g. "This is great."
POS pattern matching
Topic pruning
-  Stemming
-  Removing stop words
-  Merging synonyms
-  Clustering (optional)
Topic ranking: TF-IDF weighting and DF ranking
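The pruning and ranking steps can be sketched in plain Python as below. This is a simplified stand-in under stated assumptions: a stop-word filter replaces POS tagging and pattern matching (the deck uses Stanford CoreNLP for that), the suffix stripper is a crude substitute for a real stemmer, synonym merging and clustering are omitted, and the stop-word list, function names, and scoring formula (total term frequency times `ln(N/df) + 1`) are illustrative choices rather than the team's.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "on", "to", "this", "with", "in"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def candidate_topics(text):
    """Topic pruning: lowercase, drop stop words, stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def rank_topics(documents, top_n=3):
    """Rank topics by total term frequency weighted by inverse document frequency."""
    doc_topics = [candidate_topics(d) for d in documents]
    term_freq = Counter(t for topics in doc_topics for t in topics)
    doc_freq = Counter(t for topics in doc_topics for t in set(topics))
    n_docs = len(documents)
    scores = {
        t: term_freq[t] * (math.log(n_docs / doc_freq[t]) + 1.0)
        for t in term_freq
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


docs = [
    "Loving the new messaging redesign",
    "The messaging redesign is confusing",
    "Recruiters are messaging me about jobs",
]
top_two = [topic for topic, _ in rank_topics(docs, top_n=2)]
```

Restricting `documents` to posts from one time window, and comparing rankings across windows, is one way such a ranking could feed a trending view.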
Trending Insights – identify organic trending topics
Didi and Kuaidi merger
Product release
LinkedIn’s customer support has evolved into an
intelligence platform…
Scaling to have a broader impact across LinkedIn
▪  GCO cases
▪  Issue resolution
▪  Support focused
▪  Internal data (GCO, surveys, site feedback)
▪  App review
▪  LI.com
▪  Social data
▪  Product insight
▪  Member insight
▪  Launch tracking
▪  Social sentiment
▪  Brand tracking
▪  Viral mentions
Reactive Support, Multi-channel Feedback, Intelligent Insights, Predictive Anticipation
This is what the future could look like
1. From the first time we pick up an isolated comment…
2. Machine determines if there is significant reach…
3. …and whether it is a trending topic…
4. …breaks down into sentiment and drivers…
5. …geographic locations…
6. (For LI data) deep dive into MLC segmentation…
7. …and audience segmentation…
8. …generates automatic reporting, alerts and escalations…
9. …and closes the feedback loop with support and PR solutions
The best customer experience starts with understanding the Voices of members!
Thank You!
Engineering blogs for Voices
Part I. Voices: a Text Analytics Platform for Understanding Member Feedback
Part II. Technical Details for Topic Mining
References
1.  LibLinear: a library for large linear classification, available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/
2.  LingPipe: a Java-based toolkit for processing text using computational linguistics, available at http://alias-i.com/lingpipe/
3.  NLTK: a leading platform for building Python programs to work with human language data, available at http://www.nltk.org/
4.  Stanford CoreNLP: an open source project led by the Stanford NLP group, available at http://nlp.stanford.edu/software/
