Scaling Big Data Mining Infrastructure: 
The Twitter Experience 
Jimmy Lin and Dmitriy Ryaboy 
Presented by 
Ohud Saud 
The paper is available from KDD and appeared in SIGKDD Explorations, December 2012 
http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
Overview 
• Jimmy Lin and Dmitriy Ryaboy worked at Twitter. 
• The paper presents insights about Big Data mining infrastructures, 
• drawn from the experience of doing analytics at Twitter. 
• Given the current state of data mining tools, 
performing analytics is not a straightforward process. 
• Much time is consumed by preparatory work that precedes the application 
of data mining methods.
Motivation 
• Big data is not easy to clean and manage. 
• Twitter holds one of the largest collections of unstructured data, yet it is a 
big target of research. 
• Twitter has experienced tremendous growth over the 
past few years in terms of size, complexity, number of 
users, and variety of use cases. 
• Some well-known algorithms simply do not work 
in the Twitter environment.
The problem 
• First: schemas play an important role in helping data scientists 
understand petabyte-scale data stores, but they’re insufficient to 
provide an overall “big picture” of the data available to generate 
insights. 
• Second: a major challenge in building data analytics platforms 
stems from the heterogeneity of the various components that 
must be integrated together into production workflows.
• It was estimated that 2007 was the 
first year in which it was not possible 
to store all the data that we are 
producing.
Approach Block Diagram 
Figure: the Java regular expression used at Twitter in 2010 to parse Scribe 
log messages.
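The regular expression on this slide survives only as a figure reference, so its actual form is not reproduced here. As a rough illustration of why regex-based log parsing tends to be brittle, the sketch below parses a hypothetical space-delimited log line. The line format, the field names, and the pattern itself are assumptions made for illustration, not Twitter's actual format.

```python
import re

# Hypothetical log format (an assumption, not Twitter's actual format):
#   <timestamp> <user_id> <event_name> <free-form detail>
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<user_id>\d+)\s+"
    r"(?P<event>[a-z_]+)\s+"
    r"(?P<detail>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["user_id"] = int(record["user_id"])
    return record


good = parse_log_line("2010-06-01T12:00:00 42 page_view /home")
# A small upstream change (here, a camel-cased event name) makes the
# pattern stop matching, so the record is dropped rather than parsed.
bad = parse_log_line("2010-06-01T12:00:00 42 pageView /home")
```

In this sketch the second line returns `None` instead of a record, which is the kind of silent failure that makes hand-written patterns hard to maintain as producers change their output.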
Approach Block Diagram 
Figure: a JSON log message and an Apache Thrift message.
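The two message formats on this slide differ mainly in how much structure they enforce. The sketch below is one way to picture that difference: a JSON message accepts any keys a producer chooses, whereas a typed record rejects messages that do not fit its declared fields. The `LogEvent` dataclass is only a stand-in for a Thrift struct (no Thrift library is used), and all field names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Stand-in for a Thrift struct; the field names are hypothetical."""
    user_id: int
    event_name: str
    timestamp_ms: int


def decode_event(raw):
    """Decode a JSON message and fail loudly if it does not fit the schema."""
    fields = json.loads(raw)
    event = LogEvent(**fields)  # raises TypeError on missing or unexpected keys
    if not isinstance(event.user_id, int):
        raise TypeError("user_id must be an integer")
    return event


ok = decode_event(
    '{"user_id": 42, "event_name": "page_view", "timestamp_ms": 1275393600000}'
)

try:
    # Same information, but a producer spelled the key differently.
    decode_event(
        '{"userId": 42, "event_name": "page_view", "timestamp_ms": 1275393600000}'
    )
    mismatch_detected = False
except TypeError:
    mismatch_detected = True
```

With plain JSON, the `userId` variant would be accepted and the inconsistency would surface only later, in whatever analysis code reads the logs. A declared schema moves that failure to the point where the message is decoded.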
Lessons Learned 
• Data science in practice is not so glamorous; it is about getting the 
plumbing right. 
• Storing logs in a traditional RDBMS such as MySQL simply 
does not scale. 
• JSON is not a good alternative (it does not scale). 
• The Hadoop infrastructure is bad at iterative algorithms. 
• Naming things is hard (especially when different codebases are combined).
Recommendations 
• Log directly into DB. 
• Use Scribe to collect and aggregate log data 
before writing it to HDFS for use in Hadoop. 
• Thrift provides a great balance between 
flexibility and structure. 
• Use stochastic gradient descent with 
some form of online learning. 
• Ensemble learning is a good option (Pig). 
• XRX tool is powerful.
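To make the stochastic gradient descent recommendation concrete, here is a minimal sketch of online learning: a logistic regression model that updates its weights one example at a time, so it never needs the repeated full passes over the data that the slides say Hadoop handles badly. It is plain Python, not the authors' implementation; the class name, learning rate, and synthetic data stream are all assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the score to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(score)

    def update(self, features, label):
        """Take one gradient step on the log loss for a single example."""
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


# Synthetic stream (made up for illustration): the label is 1 when x0 > x1.
rng = random.Random(0)
model = OnlineLogisticRegression(num_features=2)
for _ in range(5000):
    x = [rng.random(), rng.random()]
    model.update(x, 1 if x[0] > x[1] else 0)
```

Because each update touches only the current example, the same loop can consume records as they arrive. Several such models trained on different partitions of the data could then be combined into an ensemble, in the spirit of the ensemble learning bullet above.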
Insights Gained 
• 80% of the work with data is data cleaning, not 
mining. 
• Successful big data mining in practice involves much more 
than what most academics would consider data 
mining. 
• Life “in the trenches” is occupied by much 
preparatory work that precedes the application of 
data mining algorithms. 
• Open-source tools are improving continuously and may 
be the solution you are looking for.
Discussion 
Thoughts?

