Building highly reliable data
pipelines @ Datadog
Quentin FRANCOIS
Team Lead, Data Engineering
Data Engineers Meetup Paris
26 February 2019
Reliability is the probability that a system will
produce correct outputs up to some given time t.
Source: E.J. McCluskey & S. Mitra (2004). "Fault Tolerance" in Computer Science Handbook, 2nd ed., ed. A.B. Tucker. CRC Press.
Highly reliable data pipelines
1. Architecture
2. Monitoring
3. Failure handling
Historical metric queries
Time series data
• metric: system.load.1
• timestamp: 1526382440
• value: 0.92
• tags: host:i-xyz,env:dev,...
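As a minimal sketch, the four fields above could be modelled as a small record type. The names `Point` and `parse_tags` are hypothetical, and the sketch assumes that tags are comma-separated `key:value` pairs, which is what the sample on the slide suggests.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Point:
    """One time series point, mirroring the four fields on the slide."""
    metric: str
    timestamp: int          # Unix epoch, in seconds
    value: float
    tags: Dict[str, str]    # parsed from "key:value" pairs


def parse_tags(raw: str) -> Dict[str, str]:
    """Split a comma-separated "key:value" tag string into a dict."""
    tags = {}
    for item in raw.split(","):
        key, _, val = item.partition(":")
        tags[key.strip()] = val.strip()
    return tags


# Values taken from the slide; only the two tags that are spelled out are used.
point = Point(
    metric="system.load.1",
    timestamp=1526382440,
    value=0.92,
    tags=parse_tags("host:i-xyz,env:dev"),
)
```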
[Figure: historical metric queries at 1 point/second and at 1 point/day.]
Rollups pipeline
[Figure: high resolution data (1pt/sec) and low resolution data (1pt/min, 1pt/hour, 1pt/day), stored in AWS S3.]
• Runs once a day.
• Dozens of TBs of input data.
• Trillions of points processed.
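To make the idea of a rollup concrete, here is a small in-memory sketch that turns high resolution points into lower resolution ones. It assumes the aggregate is a plain average per time bucket; the slides do not say which aggregation functions the real pipeline uses, and the function name `rollup` is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def rollup(points: Iterable[Tuple[int, float]], bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) points into fixed-size time buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# 120 one-second points become 2 one-minute points.
raw = [(t, float(t // 60)) for t in range(120)]
per_minute = rollup(raw, bucket_seconds=60)
```

The same function applied with `bucket_seconds=3600` or `86400` would give the 1pt/hour and 1pt/day resolutions.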
1. Architecture
Our big data platform architecture
[Diagram: four layers plus monitoring]
• USERS: Web, Scheduler, CLI.
• WORKERS: Luigi, Spark.
• CLUSTERS: EMR (several clusters side by side).
• DATA: S3.
• Datadog monitoring.
Many ephemeral clusters
• New cluster for every pipeline.
• Dozens of clusters at a time.
• Median lifetime of ~3 hours.
Total isolation
We know what is happening and why.
Pick the best hardware for each job
• For CPU-bound jobs: c3.
• For memory-bound jobs: r3.
Scale up/down clusters
• If we are behind.
• Scale as we grow.
• No more waiting on loaded clusters.
Safer upgrades of EMR/Hadoop/Spark
[Figure: one cluster on version 5.13 next to four clusters on version 5.12.]
Spot-instance clusters
• Pro: ridiculous savings (up to 80% off the on-demand price).
• Con: nodes can die at any time.
How can we build highly reliable data pipelines
with instances killed randomly all the time?
No long running jobs
• The longer the job, the more work you lose on average.
• The longer the job, the longer it takes to recover.
[Figure: timelines of Pipeline A and Pipeline B, time axis in hours. Without failures the axis runs from 0 to 9. With a job failure the axis is marked at 0, 7, 9, 10 and 16.]
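The cost of a failure can be worked out with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: a single failure, an instant restart, and pieces of equal length. The 3-hour piece size is an assumption borrowed from the "~3 hours" guideline in this talk, and `finish_time` is a hypothetical helper.

```python
import math


def finish_time(total_hours: float, piece_hours: float, failure_at: float) -> float:
    """Hour at which a pipeline finishes if it fails once at `failure_at`.

    The pipeline runs pieces of `piece_hours` back to back. Completed pieces
    are kept; only the piece that was running when the failure hit is redone.
    """
    if failure_at >= total_hours:
        return total_hours  # the failure happens after the pipeline is done
    # Start of the piece that was running at failure time.
    piece_start = math.floor(failure_at / piece_hours) * piece_hours
    remaining = total_hours - piece_start
    return failure_at + remaining


# One 9-hour job versus three 3-hour jobs, both failing at hour 7.
single_job = finish_time(total_hours=9, piece_hours=9, failure_at=7)
small_jobs = finish_time(total_hours=9, piece_hours=3, failure_at=7)
```

Under these assumptions the single job restarts from scratch and ends at hour 16, while the pipeline made of small jobs only redoes its last piece and ends at hour 10.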
Break down jobs into smaller pieces
Vertically - persist intermediate data between transformations.
Horizontally - partition the input data.
Example: rollups pipeline
• Input data: raw time series data.
• Output data: aggregated time series data (custom file format).
The pipeline has two steps:
1. Aggregate high resolution data.
2. Store the aggregated data in our custom file format.
Vertical split
• Checkpoint data between step 1 and step 2: aggregated time series data (Parquet format).
Horizontal split
• The input, checkpoint and output data are partitioned into shards A, B, C and D.
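The two splits can be sketched as a set of small, individually retryable jobs. This is a simplified illustration, not the pipeline from the talk: it uses the local filesystem instead of S3, JSON instead of Parquet, a plain text file instead of the custom format, and all function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean

SHARDS = ["A", "B", "C", "D"]  # horizontal split, names taken from the slide


def aggregate(shard: str, raw: dict, checkpoint_dir: Path) -> Path:
    """Step 1: aggregate one shard and persist the result as a checkpoint."""
    out = checkpoint_dir / f"{shard}.json"
    if not out.exists():  # skip work that survived a previous run
        out.write_text(json.dumps({"shard": shard, "avg": mean(raw[shard])}))
    return out


def store(shard: str, checkpoint: Path, output_dir: Path) -> Path:
    """Step 2: convert one checkpoint into the final (here: text) format."""
    out = output_dir / f"{shard}.out"
    if not out.exists():
        data = json.loads(checkpoint.read_text())
        out.write_text(f"{data['shard']},{data['avg']}")
    return out


def run_pipeline(raw: dict, root: Path) -> list:
    checkpoint_dir, output_dir = root / "checkpoint", root / "output"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Each (shard, step) pair is a small job that can be retried on its own.
    return [store(s, aggregate(s, raw, checkpoint_dir), output_dir) for s in SHARDS]


raw_data = {"A": [1.0, 3.0], "B": [2.0], "C": [4.0, 6.0], "D": [5.0]}
root = Path(tempfile.mkdtemp())
outputs = run_pipeline(raw_data, root)
```

Running `run_pipeline` a second time on the same `root` redoes nothing, because every piece finds its output already in place. A production job would also need to write to a temporary name and rename, so that a half-written file is never mistaken for a finished piece.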
[Figure: breaking down jobs into smaller pieces is a balance between fault tolerance and performance.]
Lessons
• Many clusters for better isolation.
• Break down jobs into smaller pieces (no longer than ~3 hours).
• Trade-off between performance and fault tolerance.
2. Monitoring
Reliability is the probability that a system will
produce correct outputs up to some given time t.
Monitoring data pipelines
1. Is the data pipeline going to finish before the deadline?
We actively monitor 3 types of metrics:
• Data lag metrics.
• Cluster health metrics.
• Job health metrics.
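One simple way to answer the deadline question is to extrapolate from progress so far. The sketch below assumes progress is linear in time, which is a simplification, and the `Progress` type and both functions are hypothetical; the talk itself relies on dashboards and monitors rather than this formula.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    elapsed_hours: float      # time since the pipeline started
    done_fraction: float      # share of the input already processed, in (0, 1]


def projected_duration(p: Progress) -> float:
    """Linear extrapolation of the total run time from progress so far."""
    return p.elapsed_hours / p.done_fraction


def will_miss_deadline(p: Progress, deadline_hours: float) -> bool:
    """True if the projected total run time exceeds the deadline."""
    return projected_duration(p) > deadline_hours


on_track = Progress(elapsed_hours=2.0, done_fraction=0.5)    # projects to 4 h
behind = Progress(elapsed_hours=3.0, done_fraction=0.25)     # projects to 12 h
```

With a 6-hour deadline, `will_miss_deadline(behind, 6.0)` is true early enough to react, for example by adding capacity.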
2. Is the data produced correct?
• Add custom counters throughout the pipelines.
○ Count records.
○ Count duplicates.
○ Count records that can't join.
• Ad hoc checks on the output data.
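The three counters listed above can be sketched in plain Python. This is an in-memory illustration with hypothetical names; in a Spark job the same counts would more likely be kept in accumulators and reported as metrics.

```python
from collections import Counter


def join_with_counters(records, dimension):
    """Join records to a lookup table while counting what happens on the way."""
    counters = Counter()
    seen = set()
    joined = []
    for key, value in records:
        counters["records"] += 1
        if key in seen:
            counters["duplicates"] += 1   # same key seen before: drop it
            continue
        seen.add(key)
        if key not in dimension:
            counters["unjoined"] += 1     # no matching row to join with
            continue
        joined.append((key, value, dimension[key]))
    return joined, counters


records = [("a", 1), ("b", 2), ("a", 1), ("z", 9)]
hosts = {"a": "host-1", "b": "host-2"}
joined, counters = join_with_counters(records, hosts)
```

A sudden jump in `duplicates` or `unjoined` between two daily runs is the kind of signal that is cheap to collect and easy to alert on.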
Lessons
• Monitoring = will we finish before t? + is the data correct?
• Measure, measure and measure!
• Alert on meaningful and actionable metrics.
3. Failure handling
Data pipelines will break
• Hardware failures.
• Bad code changes.
• Upstream delays.
• Increasing volume of data.
1. Recover fast
We want to fix the issues ASAP.
2. Degrade gracefully
We want to limit the customer-facing impact.
Recover fast
• No long-running jobs.
• Switch from spot to on-demand clusters.
• Increase cluster size.
• Easy ways to rerun jobs (not always trivial!).
Example: rerun the rollups pipeline
s3://bucket/
    2018-01
    2018-02
    2018-03
    2018-04
    2018-05
s3://bucket/2018-05/
    as-of_2018-05-01
    as-of_2018-05-02
    ...
    as-of_2018-05-21
    as-of_2018-05-22
    as-of_2018-05-22_run-2
[Figure: an "Active location" marker points at one of the as-of directories.]
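The directory layout above suggests that every run writes to its own location and that a marker selects which one is active. The sketch below is one possible reading, not the mechanism from the talk: it uses the local filesystem instead of S3, and the `ACTIVE` pointer file and all function names are hypothetical.

```python
import tempfile
from pathlib import Path


def next_run_dir(month_dir: Path, as_of: str) -> Path:
    """Pick a fresh output directory: as-of_<date>, then _run-2, _run-3, ..."""
    candidate = month_dir / f"as-of_{as_of}"
    run = 1
    while candidate.exists():
        run += 1
        candidate = month_dir / f"as-of_{as_of}_run-{run}"
    return candidate


def run_and_activate(month_dir: Path, as_of: str, job) -> Path:
    """Write a run to its own directory, then switch the active pointer."""
    out = next_run_dir(month_dir, as_of)
    out.mkdir(parents=True)
    job(out)  # if this raises, the pointer still names the previous run
    (month_dir / "ACTIVE").write_text(out.name)
    return out


def write_rollups(out: Path) -> None:
    """Stand-in for the real job: write one placeholder output file."""
    (out / "part-0").write_text("rollups")


month = Path(tempfile.mkdtemp()) / "2018-05"
first = run_and_activate(month, "2018-05-22", write_rollups)
rerun = run_and_activate(month, "2018-05-22", write_rollups)
active = (month / "ACTIVE").read_text()
```

Because a rerun never overwrites an earlier run, a bad rerun can be undone by pointing the marker back at the previous directory.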
Degrade gracefully
• Isolate issues to a limited number of customers thanks to horizontal sharding.
• Keep the functionalities operational at the cost of performance/accuracy.
[Figure: shards A, B, C and D.]
Lessons
• Think about potential issues ahead of time.
• Have knobs ready to recover fast.
• Have knobs ready to limit the customer-facing impact.
Conclusion
• Know your time constraints.
• Break down jobs into small survivable pieces.
• Monitor cluster metrics, job metrics and data lags.
• Think about failures ahead of time and get prepared.
Thanks!
We’re hiring!
qf@datadoghq.com
https://jobs.datadoghq.com