Conquering Hadoop & Spark with Operational Intelligence
Akshay Rai
Senior Software Engineer
LinkedIn
#Exp2SAIS
About Me
• Sr. Software Engineer in the Data Platform team at LinkedIn
• Engineering lead for Dr. Elephant
• Building an operational intelligence platform for Hadoop & Spark
OUR VISION
Create economic opportunity for every member of the global workforce
OUR MISSION
Connect the world’s professionals to make them more productive and successful
Today’s Talk
• Everyday problems with Hadoop & Spark
• Approach & its complexity
• Application Metrics Architecture
• Operational Intelligence Vision
• Examples & Use-cases
Everyday problems with Hadoop & Spark
Debug issues like slow jobs
Generate metric reports for jobs
Set up alerts and monitor flows
Cluster snapshot with slice & dice
Capacity Planning
Generate Cost to Serve Reports
Track user behavior
Debug & address global issues
Hadoop/Spark Users, Platform Developers, Operational Experts, Engineering Leads
Bird’s-Eye View
Capture Metrics, Detect Anomalies, Identify Root Cause
Complexity of the approach: Capture Metrics
• Collect application metrics in near real time
• Collect metrics from multiple engines like MR and Spark
• Integrate app metrics with data lineage and metadata
Complexity of the approach: Detect Anomalies
• Knowledge of various modeling techniques
• In-depth knowledge of Hadoop and Spark Metrics
• Event based anomaly detection and alerting
Complexity of the approach: Identify Root Cause
• Correlation with other crucial metrics
• Integration with events that happen at LinkedIn
– E.g., Deployments, Issues, Commits, etc.
• Discover trends in metrics; dimensional analysis
Application Metrics Pipeline
Capture Metrics, Detect Anomalies, Identify Root Cause
Capture Metrics, Detect Anomalies, Identify Root Cause
ThirdEye
GitHub: https://github.com/linkedin/pinot/tree/master/thirdeye
Capture Metrics: Application Metrics Architecture
Application Metrics Architecture: Emission
• Hadoop Metrics
– Counters: Parse Job History files & emit to Kafka using Flume
– JVM Metrics: Launch a java-agent & emit metrics to Kafka
• Spark Metrics
– Status API V1 metrics: REST API of Spark History Server
– JVM Metrics: “Spark Metrics System”
• Derived Metrics
– Resource & Time metrics: Parse RM logs using Flume & emit to Kafka
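The Status API V1 path can be illustrated with a short Python sketch. It assumes a reachable Spark History Server at a hypothetical address and uses the `/api/v1/applications/[app-id]/executors` endpoint together with the `totalShuffleRead` and `totalShuffleWrite` fields of its executor summaries; applications run in YARN cluster mode may additionally need an attempt id in the path. The output metric names mirror those on the schema slide, and the helper names are illustrative, not part of the pipeline described in the talk.

```python
import json
from urllib.request import urlopen


def executors_url(history_server: str, app_id: str) -> str:
    """Build the Status API v1 URL that lists executors for one application."""
    return f"{history_server.rstrip('/')}/api/v1/applications/{app_id}/executors"


def fetch_executors(history_server: str, app_id: str) -> list:
    """Fetch the executor summaries as a list of dicts (needs network access)."""
    with urlopen(executors_url(history_server, app_id)) as resp:
        return json.load(resp)


def total_shuffle(executors: list) -> dict:
    """Sum shuffle read/write bytes across all executor summaries."""
    return {
        "TOTAL_SHUFFLE_READ": sum(e.get("totalShuffleRead", 0) for e in executors),
        "TOTAL_SHUFFLE_WRITE": sum(e.get("totalShuffleWrite", 0) for e in executors),
    }


if __name__ == "__main__":
    # Hypothetical host and application id, for illustration only.
    summaries = fetch_executors("http://history-server:18080", "application_123_0001")
    print(total_shuffle(summaries))
```

Polling this endpoint for every application adds load to the History Server, which is the trade-off the backup slides on Spark metrics emission point out.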
Application Metrics Architecture: Processing
• Logic shared by Batch & Speed Layer
– Single source of truth
– Easy to add new metrics & tests
– Simpler to maintain
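One way to read "logic shared by Batch & Speed Layer" is that the metric derivation lives in a single function that both layers call. The sketch below is a minimal illustration under that assumption; the event fields (`app_id`, `start_time`, `finish_time`) follow the schema slide, while `DURATION_MS` and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def derive_metrics(event: Dict) -> Dict:
    """Single source of truth: derive metrics from one application event."""
    return {
        "app_id": event["app_id"],
        "DURATION_MS": event["finish_time"] - event["start_time"],
    }


def batch_layer(events: Iterable[Dict]) -> List[Dict]:
    """Batch path: process a complete set of events at once."""
    return [derive_metrics(e) for e in events]


def speed_layer(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Speed path: process events one at a time as they arrive."""
    for event in stream:
        yield derive_metrics(event)
```

Because both paths go through `derive_metrics`, a new metric or a new unit test is added in one place and both layers stay consistent.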
Application Metrics Architecture: Storage
• Discover patterns and trends in the data
• Gain insight into data through fast, consistent, interactive access
• Supports selection, aggregation, filtering, group by, order by, distinct queries
Query Pattern
    SELECT job_name, sum(metric_value), <other dimensions>
    FROM AppSummary
    WHERE counter_group_name = 'SPARK_EXECUTOR_METRICS'
    AND <other clauses>
    AND daypartition = '2018-05-15';
Application Metrics Architecture: Storage
Pinot
• Realtime distributed OLAP datastore; open sourced by LinkedIn
• Ingests data from offline & online data sources
• Supports a SQL-like query language
• In-house expertise; well integrated with LinkedIn’s infrastructure
Application Metrics Architecture: Storage
Schema
• Application, Task, Stage and Job Level Tables
• Support addition of arbitrary number of metrics
– Dimensions followed by ONE metric per row; columnar compression!
– Schema immune to growing metrics

app_id                      status     queue    start_time     finish_time    grid     …  Metric Name          Metric Value
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  TOTAL_SHUFFLE_READ   84464656363
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  TOTAL_SHUFFLE_WRITE  104464656363
…                           …          …        …              …              …        …  …                    …
job_1508278384745_15795287  SUCCEEDED  default  1526737272000  1526837272000  default  …  RESOURCE_USAGE       3504.98
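The "dimensions followed by ONE metric per row" layout can be sketched in Python as follows. This is an in-memory illustration only, not how Pinot stores or queries data; the dimension subset and the helper names `to_rows` and `sum_by` are assumptions made for the example.

```python
from collections import defaultdict

# A subset of the dimensions from the slide, kept short for the example.
DIMENSIONS = ("app_id", "status", "queue")


def to_rows(app: dict, metrics: dict) -> list:
    """Emit one row per metric: the shared dimensions, then metric name and value."""
    dims = {d: app[d] for d in DIMENSIONS}
    return [{**dims, "metric_name": name, "metric_value": value}
            for name, value in metrics.items()]


def sum_by(rows: list, dimension: str, metric_name: str) -> dict:
    """Rough equivalent of SELECT dimension, sum(metric_value) ... GROUP BY dimension."""
    totals = defaultdict(float)
    for row in rows:
        if row["metric_name"] == metric_name:
            totals[row[dimension]] += row["metric_value"]
    return dict(totals)


app = {"app_id": "job_1508278384745_15795287", "status": "SUCCEEDED", "queue": "default"}
rows = to_rows(app, {
    "TOTAL_SHUFFLE_READ": 84464656363,
    "TOTAL_SHUFFLE_WRITE": 104464656363,
    "RESOURCE_USAGE": 3504.98,
})
```

Adding a new metric only adds rows with a new `metric_name`; the set of columns stays the same, which is what makes the schema immune to a growing list of metrics.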
Operational Intelligence Vision
COHERENT OI EXPERIENCE
Curated Dashboards, Investigate Anomalies, Root Cause Analysis, Reporting
Events & Metadata, Hadoop & Spark Metrics, Anomaly Alerts
Revisit our daily problems
Debug issues like slow jobs
Generate metric reports for jobs
Set up alerts and monitor flows
Cluster snapshot with slice & dice
Capacity Planning
Generate Cost to Serve Reports
Track user behavior
Debug & address global issues
Hadoop/Spark Users, Platform Developers, Operational Experts, Engineering Leads
Examples & Use-Cases: Curated Dashboard
Examples & Use-Cases: Debugging a slow job
Hadoop/Spark Users: I want to know why my job ran slowly.
[Chart: Duration vs. Delay]
Examples & Use-Cases: Debugging a slow job
[Chart: Duration vs. Input Records]
Root Cause: The job is slow because of a huge influx of input data
Examples & Use-Cases: Debugging a slow job
Debug why a job ran slowly
Examples & Use-Cases: Debugging a slow job
[Chart: Delay Contribution; Total Job Duration]
50% slower due to delay in AM container allocation
Examples & Use-Cases: Finding the Culprit
Platform Developers: What caused the delay in Application Master allocation?
[Chart: AM Container Delay for Flow X]
Examples & Use-Cases: Finding the Culprit
[Chart: AM Container Delay vs. Queue Resource Usage]
Conclusion: Looks like the queue was operating at its peak load
Examples & Use-Cases: Finding the Culprit
[Chart: AM Container Delay vs. Executors Launched (NUM_EXECUTORS, grid-name, queue-name)]
Conclusion: Somebody has launched a job with 1000s of executors!
Examples & Use-Cases: Found the Culprit
[Chart: Dimensional Analysis on NUM_EXECUTORS]
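Dimensional analysis on NUM_EXECUTORS amounts to grouping the metric by one dimension at a time and ranking the contributors. The Python sketch below illustrates that idea only; the row fields (`user`, `queue`), the sample values, and the function name are hypothetical, and it does not reproduce the tooling shown in the talk.

```python
from collections import Counter


def top_contributors(rows, dimension, metric="NUM_EXECUTORS", n=3):
    """Rank values of `dimension` by their share of the total `metric`.

    Returns a list of (dimension_value, total, share_of_grand_total).
    """
    totals = Counter()
    for row in rows:
        totals[row[dimension]] += row[metric]
    grand_total = sum(totals.values())
    if not grand_total:
        return []
    return [(value, total, total / grand_total)
            for value, total in totals.most_common(n)]


# Made-up rows: one user launching far more executors than the others.
rows = [
    {"user": "a", "queue": "q1", "NUM_EXECUTORS": 3000},
    {"user": "b", "queue": "q1", "NUM_EXECUTORS": 500},
    {"user": "c", "queue": "q2", "NUM_EXECUTORS": 500},
]
for dim in ("user", "queue"):
    print(dim, top_contributors(rows, dim, n=1))
```

Repeating the ranking for each dimension narrows the spike down to the dimension value that contributes most of it.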
Examples & Use-Cases: Anomaly Detection
Ref: ThirdEye
[Screenshot: hadoop_numExecutors_queue_up_hours; counter value (NUM_EXECUTORS, grid-name, queue-name); #45338883]
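To make the idea of flagging an anomalous counter value concrete, here is a deliberately simple baseline detector in Python: it compares the current value against the mean and standard deviation of recent history. This is a generic z-score check chosen for illustration, not ThirdEye's detection method; the threshold and the sample values are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` when it deviates from the trailing history by more
    than `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical hourly NUM_EXECUTORS totals for one (grid-name, queue-name) pair.
history = [100, 110, 90, 105, 95]
print(is_anomalous(history, 5000))
print(is_anomalous(history, 104))
```

A production detector would also need to account for seasonality and trend, which is part of why the talk lists knowledge of modeling techniques among the complexities.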
Examples & Use-Cases: Reporting
[Charts: Spark Job Distribution Among Queue; Top Spark Users (Last 2 Weeks)]
Examples & Use-Cases: Offenders
[Charts: Total Spark Resource Usage per Queue; Total Spark Resource Usage per User]
Future Work
• Spark real-time metrics
• Improve Anomaly Detection & RCA
• Job Classification & improved Auto-Tuning
• Higher Level Metrics
Thank you
Additional Backup Slides
Metrics Emission: Hadoop Metrics
Metrics Emission: Profiler
• Run Java-agent on Containers
• Collect JVM Metrics periodically
• Collecting too frequently puts pressure on Kafka
Metrics Emission: Hadoop Metrics
• Configure mapred-site.xml to store job history files in HDFS
– mapreduce.jobhistory.intermediate-done-dir
– mapreduce.jobhistory.done-dir
• Metrics collected after job is complete
• Implement Custom HDFS Source with a file marker
Storage Layer: Hive
• Why Hive?
– Run complex queries
– Join with external data
– Longer Retention
• Why Presto?
– Run quick interactive analytic queries
Metrics Processing: Metrics Library: Resource Usage
Container Size
• spark.executor.memory = 2 GB
• Overhead: spark.yarn.executor.memoryOverhead OR MAX (spark.executor.memory * 0.10, 384 MB)
• yarn.scheduler.minimum-allocation-mb = 1 GB
• yarn.scheduler.increment-allocation-mb = 1 GB
• yarn.scheduler.maximum-allocation-mb = 8 GB
[Figure: Container Memory Distribution; container sizes 1 GB, 2 GB, 3 GB, …, 8 GB]
Metrics Processing: Metrics Library: Resource Usage
$$\text{Spark Resource Usage (GB-Hours)} = \sum_{k=1}^{N} \text{ContainerSize}_k \times \text{UpTime}_k$$
Where $k$ indexes the $N$ containers of the application, and
• $\text{ContainerSize}_k$ (GB) = $\text{CEIL}_{\text{yarn.scheduler.increment-allocation-mb}}(\text{ExecutorMemory}_k + \text{OverheadMemory}_k)$, i.e. the sum rounded up to a multiple of the allocation increment
• $\text{ExecutorMemory}_k$ = spark.executor.memory bounded by yarn.scheduler.(minimum/maximum)-allocation-mb
• $\text{OverheadMemory}_k$ = spark.yarn.executor.memoryOverhead OR MAX ($\text{ExecutorMemory}_k$ * 0.10, 384 MB)
• $\text{UpTime}_k$ = Wall clock time for which the container was up and running
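As a worked sketch of the resource usage formula, the Python below computes a container size from the settings on the Container Size slide (1 GB minimum, 1 GB increment, 8 GB maximum) and then sums size times uptime. The function names are illustrative, memory is handled in MB and converted to GB at the end, and uptime is assumed to be given in hours.

```python
import math

MB_PER_GB = 1024


def overhead_mb(executor_mb, configured_overhead_mb=None):
    """spark.yarn.executor.memoryOverhead if set, else MAX(10% of executor memory, 384 MB)."""
    if configured_overhead_mb is not None:
        return configured_overhead_mb
    return max(executor_mb * 0.10, 384)


def container_size_mb(executor_mb, min_mb=1024, max_mb=8192,
                      increment_mb=1024, configured_overhead_mb=None):
    """Bound executor memory by the min/max allocation, add the overhead,
    then round up to a multiple of the allocation increment."""
    bounded = min(max(executor_mb, min_mb), max_mb)
    requested = bounded + overhead_mb(bounded, configured_overhead_mb)
    return math.ceil(requested / increment_mb) * increment_mb


def resource_usage_gb_hours(containers):
    """containers: iterable of (container_size_mb, uptime_hours) pairs."""
    return sum(size_mb / MB_PER_GB * hours for size_mb, hours in containers)


# spark.executor.memory = 2 GB: 2048 MB + 384 MB overhead = 2432 MB, rounded up to 3072 MB.
size = container_size_mb(2048)
print(size, resource_usage_gb_hours([(size, 2.0), (size, 0.5)]))
```

With the slide's settings, a 2 GB executor therefore occupies a 3 GB container, so two such containers that were up for 2 hours and 0.5 hours account for 7.5 GB-Hours.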
Metrics Emission: Spark Metrics
• Spark Metrics System
– Configurable metrics system based on Dropwizard
– Emit metrics to a variety of configurable sinks
– This is what most commercial products surface
• Pros:
– Emit metrics in real-time to a configurable Sink
– Easy to maintain (Part of Spark code-base)
• Cons:
– Limited metrics; No Status API V1 metrics
– Derived Metrics like Resource Usage cannot be computed
Metrics Emission: Spark Metrics
• Spark Application Tracking Pipeline
– Query Spark History Server REST APIs & dump data to HDFS
• Pros:
– Collect all the metrics including Status API V1
• Cons:
– Not real-time; delayed by almost an hour
– Extra load on Spark History Server
Data Locality
Lessons learnt
• Metric Quality & Trust
• Instrumentation of components
• Integration with existing Infrastructure
• Pre-canned solutions
