Sudhir Tonse (@stonse)
Danny Yuan (@g9yuayon)
Big Data Pipeline and Analytics Platform
Using NetflixOSS and Other Open Source Software
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries
Data Is the most important asset
at Netflix
If all the data is easily available to all
teams, it can be leveraged in new and
exciting ways
~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per
second at peak time
3GB per second at peak
time
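As a rough sanity check on these figures, the sketch below derives the average event rate implied by ~100 billion events per day and the average message size implied by the two peak figures. It assumes that 3GB means 3 × 10^9 bytes and that both peaks occur at the same moment; neither assumption is stated on the slide, and the class and method names are made up for the example.

```java
/** Back-of-the-envelope arithmetic on the scale figures (illustrative only). */
final class PipelineScale {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average events per second for a given daily event count. */
    static double averageEventsPerSecond(double eventsPerDay) {
        return eventsPerDay / SECONDS_PER_DAY;
    }

    /** Average bytes per message given throughput in bytes/s and messages/s. */
    static double averageMessageBytes(double bytesPerSecond, double messagesPerSecond) {
        return bytesPerSecond / messagesPerSecond;
    }

    public static void main(String[] args) {
        double avgRate = averageEventsPerSecond(100e9);   // roughly 1.16 million events/s
        double avgSize = averageMessageBytes(3e9, 3.2e6); // roughly 940 bytes per message
        System.out.printf("average rate: %.0f events/s%n", avgRate);
        System.out.printf("average size at peak: %.0f bytes/message%n", avgSize);
    }
}
```

Under these assumptions the peak rate of 3.2M messages per second is a little under three times the daily average, and a typical message is on the order of 1 KB.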
Dashboard
Type of Events
• User Interface Events
• Search Event (‘Matrix’ using PS3 …)
• Star Rating Event (HoC : 5 stars, Xbox, US, …)
• Infrastructural Events
• RPC Call (API -> Billing Service, ‘/bill/..’, 200, …)
• Log Errors (NPE, “Movie is null”, …, …)
• Other Events …
Making Sense of Billions of Events
http://netflix.github.io
+
A Humble Beginning
Evolution … Scale!
[Diagram: a growing number of applications]
We Want to Process
App Data in Hadoop
Our Hadoop Ecosystem
@NetflixOSS Big Data Tools
Hadoop as a Service
Pig Scripting on Steroids
Pig Married to Clojure
“Map-Reduce for Clojure”
S3MPER
S3mper is a library that provides an
additional layer of consistency
checking on top of Amazon's S3 index through
use of a consistent, secondary index.
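The idea can be sketched as follows: every write is recorded in a consistent secondary index, and a listing returned by the object store is trusted only once it contains everything the index expects. This is a simplified illustration, not S3mper's actual API; the class and method names are hypothetical, and in-memory collections stand in for both S3 and the index.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of listing validation against a consistent secondary index. */
final class ConsistentListingCheck {
    // Stand-in for the consistent secondary index: every key recorded at write time.
    private final Set<String> index = new TreeSet<>();

    /** Records a key in the index when the object is written. */
    void recordWrite(String key) {
        index.add(key);
    }

    /** Returns the keys under the prefix that the index knows about but the listing lacks. */
    Set<String> missingFrom(String prefix, Collection<String> listing) {
        Set<String> missing = new TreeSet<>();
        for (String key : index) {
            if (key.startsWith(prefix) && !listing.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    /** A listing is considered consistent when nothing recorded in the index is missing. */
    boolean isConsistent(String prefix, Collection<String> listing) {
        return missingFrom(prefix, listing).isEmpty();
    }

    public static void main(String[] args) {
        ConsistentListingCheck check = new ConsistentListingCheck();
        check.recordWrite("logs/part-0");
        check.recordWrite("logs/part-1");
        // Simulate an eventually consistent listing that has not yet surfaced part-1.
        System.out.println(check.missingFrom("logs/", Arrays.asList("logs/part-0")));
    }
}
```

A caller might retry the listing or fail the job when the check reports missing keys; which recovery policy to use is a separate design choice.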
Efficient ETL with Cassandra
Cassandra
Offline Analysis
Evolution … Speed!
We Want to Aggregate, Index, and
Query Data in Real Time
Interactive Exploration
Let’s walk through some use cases
client activity event
*
/name = “movieStarts”
Pipeline Challenges
• App owners: send and forget
• Data scientists: validation, ETL, batch
processing
• DevOps: stream processing, targeted search
Message Routing
We Want to Consume Data
Selectively in Different Ways
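One way to picture selective consumption is a router that pairs each sink with a filter, so that a consumer receives only the events it asked for. The sketch below is an illustration under assumptions: `MessageRouter`, `addRoute`, and the map-based event are hypothetical names, and the `name = "movieStarts"` filter simply echoes the client activity example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical filter-based router: each route is a predicate plus a sink. */
final class MessageRouter {
    private static final class Route {
        final Predicate<Map<String, String>> filter;
        final Consumer<Map<String, String>> sink;

        Route(Predicate<Map<String, String>> filter, Consumer<Map<String, String>> sink) {
            this.filter = filter;
            this.sink = sink;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    void addRoute(Predicate<Map<String, String>> filter, Consumer<Map<String, String>> sink) {
        routes.add(new Route(filter, sink));
    }

    /** Delivers the event to every sink whose filter matches; returns the match count. */
    int route(Map<String, String> event) {
        int delivered = 0;
        for (Route r : routes) {
            if (r.filter.test(event)) {
                r.sink.accept(event);
                delivered++;
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        MessageRouter router = new MessageRouter();
        List<Map<String, String>> batchSink = new ArrayList<>();    // receives everything
        List<Map<String, String>> realTimeSink = new ArrayList<>(); // receives playback starts only
        router.addRoute(e -> true, batchSink::add);
        router.addRoute(e -> "movieStarts".equals(e.get("name")), realTimeSink::add);

        Map<String, String> event = new HashMap<>();
        event.put("name", "movieStarts");
        router.route(event);
        System.out.println(batchSink.size() + " " + realTimeSink.size());
    }
}
```

Keeping the filter separate from the sink means a new consumer can be added without touching the producers, which matches the "send and forget" expectation of app owners.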
• Message broker
• High-throughput
• Persistent and replicated
There Is More
Intelligent Alerts
Guided Debugging in the Right Context
• Ad-hoc query with different dimensions
• Quick aggregations and Top-N queries
• Time series with flexible filters
• Quick access to raw data using boolean
queries
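To make the aggregation and Top-N requirement concrete, here is a small in-memory sketch that filters events on one dimension, counts them by another, and returns the N largest groups. It only illustrates the shape of such a query; a real analytics store answers it over indexed data at far larger scale. The `TopNQuery` class, the dimension names, and the sample values are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory Top-N: filter on one dimension, group and count by another. */
final class TopNQuery {
    static List<Map.Entry<String, Long>> topN(List<Map<String, String>> events,
                                              String filterDim, String filterValue,
                                              String groupDim, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> e : events) {
            if (filterValue.equals(e.get(filterDim)) && e.containsKey(groupDim)) {
                counts.merge(e.get(groupDim), 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken by key so the result is deterministic.
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    private static Map<String, String> event(String name, String country) {
        Map<String, String> e = new HashMap<>();
        e.put("name", name);
        e.put("country", country);
        return e;
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = new ArrayList<>();
        events.add(event("playFailed", "US"));
        events.add(event("playFailed", "BR"));
        events.add(event("playFailed", "US"));
        events.add(event("movieStarts", "US"));
        // Top 2 countries by number of failed playback attempts.
        System.out.println(topN(events, "name", "playFailed", "country", 2));
    }
}
```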
What We Need
Druid
• Rapid exploration of high dimensional data
• Fast ingestion and querying
• Time series
Elasticsearch
• Real-time indexing of event streams
• Killer feature: boolean search
• Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It’s Not All About Counters and Time Series
RequestId    Parent Id   Node Id   Service Name   Status
4965-4a74    0           123       Edge Service   200
4965-4a74    123         456       Gateway        200
4965-4a74    456         789       Service A      200
4965-4a74e   456         abc       Service B      200
Status:200
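The rows above carry enough information to rebuild the call tree for a request: each node names its parent, and a parent ID of 0 marks the root. The sketch below groups spans by parent ID and renders the tree by indentation. The `Span` and `TraceTree` classes and their field names are assumptions for illustration; the request ID column is left out, so the sketch assumes all spans belong to one request and that the parent links contain no cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical span record mirroring the table's columns (request ID omitted). */
final class Span {
    final String parentId;
    final String nodeId;
    final String service;
    final int status;

    Span(String parentId, String nodeId, String service, int status) {
        this.parentId = parentId;
        this.nodeId = nodeId;
        this.service = service;
        this.status = status;
    }
}

final class TraceTree {
    /** Renders the call tree, one service per line, indented two spaces per level. */
    static String render(List<Span> spans) {
        Map<String, List<Span>> childrenByParent = new HashMap<>();
        for (Span s : spans) {
            childrenByParent.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }
        StringBuilder out = new StringBuilder();
        append(out, childrenByParent, "0", 0); // parent ID "0" marks the root
        return out.toString();
    }

    private static void append(StringBuilder out, Map<String, List<Span>> byParent,
                               String parentId, int depth) {
        for (Span s : byParent.getOrDefault(parentId, Collections.<Span>emptyList())) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(s.service).append(" [").append(s.status).append("]\n");
            append(out, byParent, s.nodeId, depth + 1);
        }
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span("0", "123", "Edge Service", 200),
                new Span("123", "456", "Gateway", 200),
                new Span("456", "789", "Service A", 200),
                new Span("456", "abc", "Service B", 200));
        System.out.print(render(spans));
    }
}
```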
Distributed Tracing
A System that Supports All These
A Data Pipeline To Glue
Them All
Make It Simple
Message Producing
• Simple and Uniform API
• messageBus.publish(event)
Consumption Is Simple Too
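// Assumes RxJava 1.x (rx.Subscriber), where onError and onCompleted must also be implemented.
consumer.observe().subscribe(new Subscriber<Ackable<IncomingMessage>>() {
    @Override
    public void onNext(Ackable<IncomingMessage> ackable) {
        // Handle the typed payload, then acknowledge the message.
        process(ackable.getEntity(MyEventType.class));
        ackable.ack();
    }
    @Override
    public void onError(Throwable e) { /* handle a failed stream */ }
    @Override
    public void onCompleted() { /* stream ended */ }
});
// Consumption can be paused and resumed on demand.
consumer.pause();
consumer.resume();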
RxJava
• Functional reactive programming model
• Powerful streaming API
• Separation of logic and threading model
Design Decisions
• Top Priority: app stability and throughput
• Asynchronous operations
• Aggressive buffering
• Drops messages if necessary
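These priorities suggest a publish path that never blocks the calling application: events go into a bounded in-memory buffer, a background thread drains it, and events are dropped (and counted) when the buffer is full. The following is a minimal sketch of that pattern, not the actual client library; `BufferedPublisher`, its capacity parameter, and the `Consumer` used as the downstream sink are all hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical non-blocking publisher with a bounded buffer that drops on overflow. */
final class BufferedPublisher<T> implements AutoCloseable {
    private final BlockingQueue<T> buffer;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread drainer;

    BufferedPublisher(int capacity, Consumer<T> sink) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        this.drainer = new Thread(() -> {
            try {
                while (true) {
                    sink.accept(buffer.take()); // blocks only the background thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop draining on close()
            }
        }, "publisher-drainer");
        this.drainer.setDaemon(true);
        this.drainer.start();
    }

    /** Never blocks the caller: returns false and counts a drop if the buffer is full. */
    boolean publish(T event) {
        boolean accepted = buffer.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Number of events rejected so far because the buffer was full. */
    long droppedCount() {
        return dropped.get();
    }

    @Override
    public void close() {
        drainer.interrupt();
    }
}
```

The trade-off is explicit: a slow or unavailable downstream costs some data, but it never costs the application its latency or stability. Exposing the drop count lets operators see when the buffer is undersized.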
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
• Write and forward with auto-reattached EBS
(Amazon’s Elastic Block Store)
• Disk-backed queue: big-queue
• Customized scaling down
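A disk-backed queue lets a forwarder keep accepting events while the downstream is unavailable and replay them later, even after a process restart. The sketch below is a deliberately simple, length-prefixed append-only file. It is not the big-queue library's API; `FileBackedQueue` and its methods are hypothetical. It also does not persist the read position, so a reader that reopens the file starts again from the first record.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

/** Hypothetical append-only, length-prefixed queue stored in a single file. */
final class FileBackedQueue implements AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0; // kept in memory only; a reopened queue replays from the start

    FileBackedQueue(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    /** Appends one record as a 4-byte length followed by the payload bytes. */
    synchronized void enqueue(byte[] payload) throws IOException {
        file.seek(file.length());
        file.writeInt(payload.length);
        file.write(payload);
    }

    /** Returns the next unread record, or null when the reader has caught up. */
    synchronized byte[] dequeue() throws IOException {
        if (readPos >= file.length()) {
            return null;
        }
        file.seek(readPos);
        int length = file.readInt();
        byte[] payload = new byte[length];
        file.readFully(payload);
        readPos = file.getFilePointer();
        return payload;
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }
}
```

A production queue would also need to persist the read position, reclaim consumed space, and guard against partially written records; those concerns are left out here to keep the core idea visible.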
There’s More to Do
• Contribute to @NetflixOSS
• Join us :-)
Summary
http://netflix.github.io
+
You can build your own web-scale data
pipeline using open source components
Thank You!
Sudhir Tonse
http://www.linkedin.com/in/sudhirtonse
Twitter: @stonse
Danny Yuan
http://www.linkedin.com/pub/danny-yuan/4/374/862
Twitter: @g9yuayon


Editor's Notes

  • #16: Something happened. Our traffic turned into a hockey stick, and the number of applications exploded. So, log traffic also exploded. Simple log scraping wouldn’t cut it any more.
  • #33: For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly. Some data is only useful in a small time window after all. Sometimes we want to perform lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we won’t be able to iterate fast enough. Either way, we need to get query results back in seconds.
  • #38: Here is one example: we process more than 150 thousand events per second about user activities. What if we’d like to know, by geography, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds.... But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find out about failed attempts on our Silverlight players that run on PCs and Macs?
  • #43, #44: Note this is different from alerting based on monitoring metrics. Monitoring metrics are great and versatile, but they don’t help us catch unexpected errors. When we build an application, we instrument our code diligently, yet it’s very likely we miss some critical instrumentation points. There’s one thing that we always catch, though: logged errors and unhandled exceptions. The alert provides a precise entry point and the right context for people to drill down into the right problems.