Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai

Conquering Hadoop & Spark with Operational Intelligence
Akshay Rai
Senior Software Engineer, LinkedIn
#Exp2SAIS

About Me
• Sr. Software Engineer in the Data Platform team at LinkedIn
• Engineering lead for Dr. Elephant
• Building an operational intelligence platform for Hadoop & Spark

OUR VISION
Create economic opportunity for every member of the global workforce

OUR MISSION
Connect the world’s professionals to make them more productive and successful

Today’s Talk
• Everyday problems with Hadoop & Spark
• Approach & its complexity
• Application Metrics Architecture
• Operational Intelligence Vision
• Examples & Use-cases

Everyday problems with Hadoop & Spark
Debug issues like slow jobs
Generate metric reports for jobs
Setup alerts and monitor flows
Cluster snapshot with slice & dice
Capacity Planning
Generate Cost to Serve Reports
Track user behavior
Debug & address global issues
Personas: Hadoop/Spark Users, Platform Developers, Operational Experts, Engineering Leads

Bird’s-Eye View
Capture Metrics → Detect Anomalies → Identify Root Cause

Complexity of the approach: Capture Metrics
• Collect application metrics in near real time
• Collect metrics from multiple engines like MR and Spark
• Integrate app metrics with data lineage and metadata

Detect Anomalies
• Knowledge of various modeling techniques
• In-depth knowledge of Hadoop and Spark Metrics
• Event based anomaly detection and alerting

Identify Root Cause
• Correlation with other crucial metrics
• Integration with events that happen at LinkedIn
– E.g., deployments, issues, commits
• Discover trends in metrics; dimensional analysis

Application Metrics Pipeline

ThirdEye
GitHub: https://github.com/linkedin/pinot/tree/master/thirdeye

Capture Metrics: Application Metrics Architecture

Application Metrics Architecture: Emission
• Hadoop Metrics
– Counters: parse job history files & emit to Kafka using Flume
– JVM metrics: launch a Java agent & emit metrics to Kafka
• Spark Metrics
– Status API V1 metrics: REST API of the Spark History Server
– JVM metrics: “Spark Metrics System”
• Derived Metrics
– Resource & time metrics: parse RM logs using Flume & emit to Kafka


Application Metrics Architecture: Processing
• Logic shared by Batch & Speed Layer
– Single source of truth
– Easy to add new metrics & tests
– Simpler to maintain
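One way to picture "logic shared by Batch & Speed Layer" is a single metric-derivation function that both layers call. The sketch below is only an illustration under that assumption: `derive_metrics`, `batch_layer`, `speed_layer` and the field names are hypothetical and do not come from the talk.

```python
from typing import Dict, Iterable, Iterator, List


def derive_metrics(record: Dict[str, float]) -> Dict[str, float]:
    """Single definition of a derived metric (hypothetical field names)."""
    read = record["shuffle_read_bytes"]
    write = record["shuffle_write_bytes"]
    return {"total_shuffle_bytes": read + write}


def batch_layer(records: List[Dict[str, float]]) -> List[Dict[str, float]]:
    # Batch: recompute over a complete, bounded data set.
    return [derive_metrics(r) for r in records]


def speed_layer(stream: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    # Speed: apply the very same function to each event as it arrives.
    for r in stream:
        yield derive_metrics(r)
```

Because both layers go through `derive_metrics`, a new metric or a new unit test only has to be written once.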


Application Metrics Architecture: Storage
• Discover patterns and trends in the data
• Gain insight into data through fast, consistent, interactive access
• Supports selection, aggregation, filtering, group by, order by, distinct queries
Query Pattern
SELECT job_name, sum(metric_value), <other dimensions>
FROM AppSummary
WHERE counter_group_name="SPARK_EXECUTOR_METRICS"
AND <other clauses>
AND daypartition="2018-05-15";
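To make the query pattern concrete, here is a small in-memory equivalent: filter rows by counter group and day partition, then sum the metric value per job. It assumes the query is meant to group by `job_name`; the function name and the sample rows used to exercise it are hypothetical, and it says nothing about how the datastore executes the query.

```python
from collections import defaultdict
from typing import Dict, Iterable


def sum_metric_by_job(rows: Iterable[dict], counter_group: str, day: str) -> Dict[str, float]:
    """Sum metric_value per job_name for one counter group and day partition."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        if row["counter_group_name"] == counter_group and row["daypartition"] == day:
            totals[row["job_name"]] += row["metric_value"]
    return dict(totals)
```

Rows from other counter groups or other days are skipped, which mirrors the WHERE clauses of the pattern.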

Pinot
• Real-time distributed OLAP datastore; open sourced by LinkedIn
• Ingests data from offline & online data sources
• Supports a SQL-like query language
• In-house expertise; well integrated with LinkedIn’s infrastructure

Schema
• Application, Task, Stage and Job Level Tables
• Support addition of arbitrary number of metrics
– Dimensions followed by ONE metric per row; columnar compression!
– Schema immune to growing metrics

app_id                      status     queue    start_time     finish_time    grid     …  Metric Name          Metric Value
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  TOTAL_SHUFFLE_READ   84464656363
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  TOTAL_SHUFFLE_WRITE  104464656363
…                           …          …        …              …              …        …  …                    …
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  RESOURCE_USAGE       3504.98
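The "one metric per row" layout can be sketched as a small transformation: the dimensions are repeated on every row, and each metric becomes its own (name, value) pair. `to_metric_rows` and the key names `metric_name` / `metric_value` are hypothetical; the sample values in the usage note are taken from the table above.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def to_metric_rows(dimensions: Dict[str, object], metrics: Dict[str, Number]) -> List[dict]:
    """Emit one row per metric, each carrying the full set of dimensions."""
    return [
        {**dimensions, "metric_name": name, "metric_value": value}
        for name, value in metrics.items()
    ]
```

For example, calling it with `{"app_id": "job_1508278384745_15795287", "status": "SUCCEEDED"}` and three metrics yields three rows. Adding a fourth metric adds a row rather than a column, which is why the schema does not change as metrics grow.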

Operational Intelligence Vision
COHERENT OI EXPERIENCE
Curated Dashboards, Investigate Anomalies, Root Cause Analysis, Reporting
Events & Metadata, Hadoop & Spark Metrics, Anomaly Alerts

Revisit our daily problems
Debug issues like slow jobs
Generate metric reports for jobs
Setup alerts and monitor flows
Cluster snapshot with slice & dice
Capacity Planning
Generate Cost to Serve Reports
Track user behavior
Debug & address global issues
Personas: Hadoop/Spark Users, Platform Developers, Operational Experts, Engineering Leads

Examples & Use-Cases: Curated Dashboard

Examples & Use-Cases: Debugging a slow job
Hadoop/Spark Users: I want to know why my job ran slowly.
Chart: Duration vs. Delay

Examples & Use-Cases: Debugging a slow job
Chart: Duration vs. Input Records
Root Cause: Job is slow because of a huge influx in the input data

Examples & Use-Cases: Debugging a slow job
Debug why a job ran slowly

Debugging a slow job
Charts: Delay Contribution; Total Job Duration
50% slower due to delay in AM container allocation

Finding the Culprit
Platform Developers: What caused the delay in Application Master allocation?
Chart: AM Container Delay for Flow X

Finding the Culprit
Chart: AM Container Delay vs. Queue Resource Usage
Conclusion: Looks like the queue was operating at its peak load

Finding the Culprit
Chart: AM Container Delay vs. Executors Launched (NUM_EXECUTORS, grid-name, queue-name)
Conclusion: Somebody has launched a job with 1000s of executors!

Found the Culprit
Dimensional Analysis on NUM_EXECUTORS

Anomaly Detection
Ref: ThirdEye
[Screenshot: hadoop_numExecutors_queue_up_hours, #45338883; counter value (NUM_EXECUTORS, grid-name, queue-name)]
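As a rough illustration of flagging a spike in a counter such as NUM_EXECUTORS, the sketch below compares the current value with the median of recent history. This is a deliberately simple stand-in, not ThirdEye's detection logic; the function name, the `ratio` threshold and the sample numbers in the usage note are made up for the example.

```python
from statistics import median
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float, ratio: float = 3.0) -> bool:
    """Flag `current` when it exceeds `ratio` times the median of `history`."""
    baseline = median(history)
    # A non-positive baseline gives no meaningful ratio, so nothing is flagged.
    return baseline > 0 and current > ratio * baseline
```

With a made-up history of `[40, 50, 45, 55]` executors per hour, a value of 1000 is flagged while 60 is not. A production detector would also need to account for seasonality and per-dimension baselines (grid, queue).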

Reporting
Charts: Spark Job Distribution Among Queues; Top Spark Users (Last 2 Weeks)

Offenders
Charts: Total Spark Resource Usage per Queue; Total Spark Resource Usage per User

Future Work
• Spark real-time metrics
• Improve Anomaly Detection & RCA
• Job Classification & improved Auto-Tuning
• Higher Level Metrics


Metrics Emission: Hadoop Metrics

Metrics Emission: Profiler
• Run a Java agent on the containers
• Collect JVM metrics periodically
• Collecting too frequently puts pressure on Kafka

Metrics Emission: Hadoop Metrics
• Configure mapred-site.xml to store job history files in HDFS
– mapreduce.jobhistory.intermediate-done-dir
– mapreduce.jobhistory.done-dir
• Metrics are collected after the job is complete
• Implement a custom HDFS source with a file marker

Storage Layer: Hive
• Why Hive?
– Run complex queries
– Join with external data
– Longer Retention
• Why Presto?
– Run quick interactive analytic queries

Metrics Processing: Metrics Library: Resource Usage
Container Size
• yarn.scheduler.minimum-allocation-mb = 1 GB
• yarn.scheduler.increment-allocation-mb = 1 GB
• yarn.scheduler.maximum-allocation-mb = 8 GB
• Container sizes: 1 GB, 2 GB, 3 GB, …, 8 GB
Container Memory Distribution
• spark.executor.memory = 2 GB
• Overhead: spark.yarn.executor.memoryOverhead OR MAX (spark.executor.memory * 0.10, 384 MB)

Metrics Processing: Metrics Library: Resource Usage
$$\text{Spark Resource Usage (GB-Hours)} = \sum_{k \,\in\, \text{containers}} \text{ContainerSize}_k \times \text{UpTime}_k$$
Where,
• $\text{ContainerSize}_k$ (GB) = $\text{ExecutorMemory}_k + \text{OverheadMemory}_k$, rounded up (CEIL) to a multiple of yarn.scheduler.increment-allocation-mb
• $\text{ExecutorMemory}_k$ = spark.executor.memory, bounded by yarn.scheduler.(minimum/maximum)-allocation-mb
• $\text{OverheadMemory}_k$ = spark.yarn.executor.memoryOverhead OR MAX ($\text{ExecutorMemory}_k$ * 0.10, 384 MB)
• $\text{UpTime}_k$ = wall-clock time for which container $k$ was up and running
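The resource-usage definition can be worked through in a few lines. This is a sketch of the formula as stated on the slide, not the metrics library itself: the function names are hypothetical, the default allocation values are the 1 GB / 1 GB / 8 GB settings shown earlier, memory is handled in MB, and up time is assumed to be given in hours.

```python
import math
from typing import Iterable, Optional, Tuple

MB_PER_GB = 1024


def container_size_mb(
    executor_memory_mb: float,
    overhead_mb: Optional[float] = None,
    min_alloc_mb: int = 1024,
    max_alloc_mb: int = 8192,
    increment_mb: int = 1024,
) -> int:
    """Executor memory plus overhead, rounded up to a multiple of the increment."""
    # ExecutorMemory: spark.executor.memory bounded by the min/max allocation.
    executor = min(max(executor_memory_mb, min_alloc_mb), max_alloc_mb)
    # OverheadMemory: explicit setting if given, else MAX(10% of executor, 384 MB).
    overhead = overhead_mb if overhead_mb is not None else max(executor * 0.10, 384)
    return math.ceil((executor + overhead) / increment_mb) * increment_mb


def resource_usage_gb_hours(
    containers: Iterable[Tuple[float, Optional[float], float]]
) -> float:
    """Sum ContainerSize (GB) * UpTime (hours) over (memory_mb, overhead_mb, hours) tuples."""
    return sum(
        container_size_mb(mem, ovh) / MB_PER_GB * hours
        for mem, ovh, hours in containers
    )
```

With spark.executor.memory = 2 GB and no explicit overhead, the overhead is 384 MB, so the request of 2432 MB rounds up to a 3 GB container. Two such containers running for 2 hours and 0.5 hours add up to 7.5 GB-Hours.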

Metrics Emission: Spark Metrics
• Spark Metrics System
– Configurable metrics system based on Dropwizard
– Emit metrics to a variety of configurable sinks
– This is what most commercial products surface
• Pros:
– Emit metrics in real time to a configurable sink
– Easy to maintain (part of the Spark code base)
• Cons:
– Limited metrics; no Status API V1 metrics
– Derived metrics like Resource Usage cannot be computed

Metrics Emission: Spark Metrics
• Spark Application Tracking Pipeline
– Query Spark History Server REST APIs & dump data to HDFS
• Pros:
– Collect all the metrics including Status API V1
• Cons:
– Not real-time; delayed by almost an hour
– Extra load on Spark History Server
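A sketch of what the tracking pipeline's per-application step might look like: build the History Server URL for an application's executors, then roll the returned executor summaries up into application-level metrics. The fetch itself is left out so the example runs offline. The `/api/v1/applications/[app-id]/executors` endpoint and the `totalShuffleRead` / `totalShuffleWrite` fields follow Spark's monitoring REST API as I understand it and should be checked against your Spark version; the function names and the host and application id in the usage note are hypothetical.

```python
import json
from typing import Dict, Union


def executors_url(history_server: str, app_id: str) -> str:
    """URL of the executor summaries for one application."""
    return f"{history_server.rstrip('/')}/api/v1/applications/{app_id}/executors"


def total_shuffle_metrics(app_id: str, executors_json: str) -> Dict[str, Union[str, int]]:
    """Aggregate per-executor shuffle counters into application-level totals."""
    executors = json.loads(executors_json)
    return {
        "app_id": app_id,
        # Missing fields are treated as 0 rather than failing the whole record.
        "TOTAL_SHUFFLE_READ": sum(e.get("totalShuffleRead", 0) for e in executors),
        "TOTAL_SHUFFLE_WRITE": sum(e.get("totalShuffleWrite", 0) for e in executors),
    }
```

For instance, `executors_url("http://shs.example.com:18080/", "application_1508278384745_0001")` gives the URL to poll. Polling it for every finished application is the source of the "extra load on Spark History Server" listed under the cons.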

Lessons learnt
• Metric Quality & Trust
• Instrumentation of components
• Integration with existing Infrastructure
• Pre-canned solutions
