Building highly reliable data pipelines @ Datadog, by Quentin François

Building highly reliable data pipelines @ Datadog
Quentin FRANCOIS
Team Lead, Data Engineering
Data Engineers Meetup Paris
26 February 2019

Reliability is the probability that a system will produce correct outputs up to some given time t.
Source: E.J. McCluskey & S. Mitra (2004). "Fault Tolerance" in Computer Science Handbook, 2nd ed., ed. A.B. Tucker. CRC Press.

Highly reliable data pipelines
1. Architecture
2. Monitoring
3. Failure handling

Historical metric queries
Time series data
metric: system.load.1
timestamp: 1526382440
value: 0.92
tags: host:i-xyz,env:dev,...

[Figure: 1 point/second]
[Figure: 1 point/day]

Rollups pipeline
High resolution data: 1 pt/sec
Low resolution data: 1 pt/min, 1 pt/hour, 1 pt/day
Stored in AWS S3
• Runs once a day.
• Dozens of TBs of input data.
• Trillions of points processed.
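
To make the rollup idea concrete, here is a minimal sketch that downsamples 1 pt/sec points into 1 pt/min buckets. The slides do not say which aggregation the pipeline applies, so the average used here is an assumption, as are the function name `rollup` and the tuple layout of a point.

```python
from collections import defaultdict
from statistics import mean


def rollup(points, bucket_seconds):
    """Group points into fixed-size time buckets and average each bucket.

    points: iterable of (metric, tags, timestamp, value) tuples.
    Returns a dict keyed by (metric, tags, bucket_start).
    The choice of the average is an assumption, not taken from the slides.
    """
    buckets = defaultdict(list)
    for metric, tags, timestamp, value in points:
        # Align the timestamp on the start of its bucket.
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[(metric, tags, bucket_start)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


tags = "host:i-xyz,env:dev"
points = [
    ("system.load.1", tags, 1526382440, 0.92),
    ("system.load.1", tags, 1526382441, 1.08),
    ("system.load.1", tags, 1526382480, 0.50),  # falls into the next minute
]
per_minute = rollup(points, 60)
```

Calling `rollup(points, 3600)` or `rollup(points, 86400)` on the same input would give the hourly and daily resolutions.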

1. Architecture

Our big data platform architecture
USERS: Web, Scheduler, CLI
WORKERS: Luigi, Spark
CLUSTERS: EMR (several clusters)
DATA: S3
Datadog monitoring

Many ephemeral clusters
• New cluster for every pipeline.
• Dozens of clusters at a time.
• Median lifetime of ~3 hours.

Total isolation
We know what is happening and why.

Pick the best hardware for each job
For CPU-bound jobs: c3
For memory-bound jobs: r3

Scale up/down clusters
• If we are behind.
• Scale as we grow.
• No more waiting on loaded clusters.

Safer upgrades of EMR/Hadoop/Spark
[Diagram: one cluster running version 5.13 next to four clusters still running 5.12]

Spot-instance clusters
+ Ridiculous savings (up to 80% off the on-demand price)
- Nodes can die at any time

How can we build highly reliable data pipelines
with instances killed randomly all the time?

No long-running jobs
• The longer the job, the more work you lose on average.
• The longer the job, the longer it takes to recover.

[Figure: timeline of Pipeline A and Pipeline B over a time axis in hours, marked from 0 to 9]

[Figure: the same timeline after a job failure, with marks at 0, 7, 9, 10 and 16 hours]

Break down jobs into smaller pieces
Vertically - persist intermediate data between transformations.
Horizontally - partition the input data.
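
One way to read the Pipeline A / Pipeline B timeline is sketched below. It assumes Pipeline A is a single 9-hour job, Pipeline B is the same work split into three 3-hour jobs, and one failure occurs at hour 7. The three-way split and the function name `finish_time` are assumptions; the slide text only gives the axis marks 0, 7, 9, 10 and 16.

```python
def finish_time(piece_hours, failure_at):
    """Hour at which a sequential pipeline finishes when a single failure
    kills whichever piece is running at `failure_at`.

    The killed piece loses all its work and is rerun from its start,
    beginning at the time of the failure.
    """
    clock = 0
    failed = False
    for duration in piece_hours:
        end = clock + duration
        if not failed and clock <= failure_at < end:
            # The piece running at failure time restarts from scratch.
            clock = failure_at + duration
            failed = True
        else:
            clock = end
    return clock


pipeline_a = [9]        # one long job (assumed)
pipeline_b = [3, 3, 3]  # same work in three pieces (assumed)

finish_a = finish_time(pipeline_a, failure_at=7)
finish_b = finish_time(pipeline_b, failure_at=7)
```

Under these assumptions the long job finishes at hour 16 and the split pipeline at hour 10, which matches the marks on the time axis.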

Example: rollups pipeline
Input data: raw time series data
Output data: aggregated time series data (custom file format)
1. Aggregate high resolution data.
2. Store the aggregated data in our custom file format.

Example: vertical split
Input data: raw time series data
1. Aggregate high resolution data.
Checkpoint data: aggregated time series data (Parquet format)
2. Store the aggregated data in our custom file format.
Output data: aggregated time series data

Example: horizontal split
Input data: raw time series data, partitioned into A, B, C and D
1. Aggregate high resolution data.
Checkpoint data: aggregated time series data (Parquet format), one per partition A, B, C, D
2. Store the aggregated data in our custom file format.
Output data: aggregated time series data, one per partition A, B, C, D
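
The two kinds of split can be sketched together: each partition runs the two steps on its own, and each step is skipped when its result already exists, so a rerun only redoes the missing pieces. This is a minimal illustration rather than the pipeline's actual code. It uses local JSON files in place of Parquet checkpoints and S3, and every name in it (`aggregate`, `run_partition`, `run_pipeline`, the file names) is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def aggregate(points):
    """Step 1 stand-in: sum the values of each metric."""
    totals = {}
    for metric, value in points:
        totals[metric] = totals.get(metric, 0.0) + value
    return totals


def run_partition(name, points, workdir, executed):
    """Run both steps for one partition, skipping steps already done."""
    checkpoint = workdir / f"{name}.checkpoint.json"  # vertical split
    output = workdir / f"{name}.output.json"
    if not checkpoint.exists():
        checkpoint.write_text(json.dumps(aggregate(points)))
        executed.append((name, "aggregate"))
    if not output.exists():
        totals = json.loads(checkpoint.read_text())
        output.write_text(json.dumps({"rollups": totals}))
        executed.append((name, "store"))


def run_pipeline(partitions, workdir):
    """Process every partition independently (horizontal split)."""
    executed = []
    for name in sorted(partitions):
        run_partition(name, partitions[name], workdir, executed)
    return executed


partitions = {
    name: [("system.load.1", 0.5), ("system.load.1", 1.5)] for name in "ABCD"
}
workdir = Path(tempfile.mkdtemp())
first_run = run_pipeline(partitions, workdir)   # runs all 8 steps
(workdir / "C.output.json").unlink()            # simulate losing one piece
second_run = run_pipeline(partitions, workdir)  # redoes only C's step 2
```

A production version would also need atomic writes, so that a job killed halfway through a write does not leave a partial file that looks complete.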

Break down jobs into smaller pieces
Trade-off: fault tolerance vs. performance

Lessons
• Many clusters for better isolation.
• Break down jobs into smaller pieces (no longer than ~3 hours).
• Trade-off between performance and fault tolerance.

2. Monitoring

Monitoring data pipelines
1. Is the data pipeline going to finish before the deadline?
We actively monitor 3 types of metrics:
• Data lag metrics.
• Cluster health metrics.
• Job health metrics.
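
As an illustration of the deadline question, the sketch below computes a data lag and a naive finish-time projection. The slides do not describe how these metrics are computed, so both definitions are assumptions: lag is taken as the time elapsed since the newest processed data, and the projection assumes the remaining pieces take as long on average as the completed ones. The function names are hypothetical.

```python
from datetime import datetime


def data_lag(now, latest_processed):
    """Time elapsed since the newest data the pipeline has processed."""
    return now - latest_processed


def projected_finish(started_at, now, pieces_done, pieces_total):
    """Linear projection of the finish time from the progress so far.

    Returns None when no piece has finished yet (nothing to extrapolate).
    """
    if pieces_done == 0:
        return None
    elapsed = now - started_at
    return started_at + elapsed * (pieces_total / pieces_done)


started_at = datetime(2019, 2, 26, 0, 0)
now = datetime(2019, 2, 26, 3, 0)
deadline = datetime(2019, 2, 26, 8, 0)

lag = data_lag(now, latest_processed=datetime(2019, 2, 25, 23, 0))
finish = projected_finish(started_at, now, pieces_done=2, pieces_total=4)
on_track = finish is not None and finish <= deadline
```

An alert on `on_track` turning false is actionable: it leaves time to add nodes or switch cluster type before the deadline passes.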

2. Is the data produced correct?
● Add custom counters throughout the pipelines.
○ Count records.
○ Count duplicates.
○ Count records that can’t join.
● Ad hoc checks on the output data.
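
The three counters could look like the following in a single join step. This is a hedged sketch: the record layout, the `host_owner` lookup table and the function name are hypothetical, and a Spark job would typically use accumulators rather than a local `Counter`.

```python
from collections import Counter


def join_with_counters(points, host_owner):
    """Join each point with the owner of its host, counting as we go."""
    counters = Counter()
    seen = set()
    joined = []
    for point in points:
        counters["records"] += 1
        key = (point["metric"], point["host"], point["timestamp"])
        if key in seen:
            counters["duplicates"] += 1
            continue
        seen.add(key)
        owner = host_owner.get(point["host"])
        if owner is None:
            # Record that cannot be joined: count it instead of dropping it silently.
            counters["unjoined"] += 1
            continue
        joined.append({**point, "owner": owner})
    return joined, counters


points = [
    {"metric": "system.load.1", "host": "i-xyz", "timestamp": 1526382440, "value": 0.92},
    {"metric": "system.load.1", "host": "i-xyz", "timestamp": 1526382440, "value": 0.92},
    {"metric": "system.load.1", "host": "i-unknown", "timestamp": 1526382441, "value": 0.50},
]
joined, counters = join_with_counters(points, host_owner={"i-xyz": "team-a"})
```

Comparing these counts from one daily run to the next is a cheap way to notice a bad code change or a broken upstream input.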

Lessons
• Monitoring = will we finish before t? + is the data correct?
• Measure, measure and measure!
• Alert on meaningful and actionable metrics.

3. Failure handling

Data pipelines will break
• Hardware failures
• Bad code changes
• Upstream delays
• Increasing volume of data

Data pipelines will break
1. Recover fast
We want to fix the issues ASAP.
2. Degrade gracefully
We want to limit the customer-facing impact.

Recover fast
• No long-running jobs.
• Switch from spot to on-demand clusters.
• Increase cluster size.
• Easy ways to rerun jobs (not always trivial!).

Example: rerun the rollups pipeline
s3://bucket/
    2018-01
    2018-02
    2018-03
    2018-04
    2018-05

s3://bucket/2018-05/
    as-of_2018-05-01
    as-of_2018-05-02
    ...
    as-of_2018-05-21

s3://bucket/2018-05/
    as-of_2018-05-01
    as-of_2018-05-02
    ...
    as-of_2018-05-21
Active location

s3://bucket/2018-05/
    as-of_2018-05-01
    as-of_2018-05-02
    ...
    as-of_2018-05-21
    as-of_2018-05-22
Active location

s3://bucket/2018-05/
    as-of_2018-05-01
    as-of_2018-05-02
    ...
    as-of_2018-05-21
    as-of_2018-05-22
    as-of_2018-05-22_run-2
Active location
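
The directory listings suggest that a rerun writes to a fresh location and that an "active location" marker decides which run readers see. The sketch below follows that reading. The slides do not show how the marker is stored or moved, so the in-memory `state` dictionary and the function names are assumptions; only the `as-of_<date>` and `_run-<n>` naming comes from the slides.

```python
def next_run_dir(existing_dirs, day):
    """Pick a directory name that no previous run of `day` has used."""
    base = f"as-of_{day}"
    if base not in existing_dirs:
        return base
    attempt = 2
    while f"{base}_run-{attempt}" in existing_dirs:
        attempt += 1
    return f"{base}_run-{attempt}"


def run_and_publish(state, day, job):
    """Run `job` into a fresh directory; move the active location on success only."""
    target = next_run_dir(state["dirs"], day)
    state["dirs"].append(target)
    if job(target):
        state["active"] = target
    return target


# Hypothetical scenario: the first run for 2018-05-22 produced bad data,
# so readers are still pointed at the previous day's output.
state = {
    "dirs": ["as-of_2018-05-21", "as-of_2018-05-22"],
    "active": "as-of_2018-05-21",
}
rerun_dir = run_and_publish(state, "2018-05-22", job=lambda path: True)
```

Because earlier outputs are never overwritten, a failed rerun leaves the active location untouched and readers keep seeing the last good data.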

Degrade gracefully
• Isolate issues to a limited number of customers thanks to horizontal sharding.
• Keep the functionality operational at the cost of performance/accuracy.
[Diagram: shards A, B, C and D, each processed independently]

Lessons
• Think about potential issues ahead of time.
• Have knobs ready to recover fast.
• Have knobs ready to limit the customer-facing impact.

Conclusion
Building highly reliable data pipelines
• Know your time constraints.
• Break down jobs into small survivable pieces.
• Monitor cluster metrics, job metrics and data lags.
• Think about failures ahead of time and get prepared.

Thanks!
We’re hiring!
qf@datadoghq.com
https://jobs.datadoghq.com
