Managing Apache Spark Workload
and Automatic Optimizing
Lantao Jin,
Software Engineer, Data Platform Engineering (eBay)
Who We Are
2
● Data Platform Engineering team at eBay
● We build an automated data platform and self-serve site with minimal touch points
● Focus on Spark optimization and self-serve platform building
What We Do
3
● Build a one-stop platform for Spark/Hadoop
● Manage the entire Spark/Hadoop workload
● Open APIs and self-serve tools for users
● Performance tuning for the Spark engine and jobs
Why Manage Spark Workload
4
● Root-cause analysis for complex job failures
● Extreme performance tuning and optimization
● Maximum resource utilization
● Compute showback and capacity planning in a global view
Agenda
5
❖ Mission & Gaps & Challenges
❖ Architecture & Design
❖ JPM Analysis Service
❖ Success Cases
❖ Summary
Challenges
6
● Over 20 production clusters
● Over 500PB of data
● Over 5PB (compressed) of incremental data per day
● Over 80,000 jobs per day
● Metadata of jobs/data is not clear
● Many job types: Pig, Hive, Cascading, Spark, MapReduce
● Jobs are not developed to a common standard
● More than 20 teams to communicate with and hundreds of batch users
● Job onboarding is out of control
Mission
7
Improve
Development
Experience
Increase
Resource
Efficiency
Gaps
8
● Development Experience
○ Distributed logging service for failure diagnostics
○ Job/task-level metrics are hard for developers to understand
○ Application healthiness visibility
○ Tedious communication to resolve any workload issue
● Resource Efficiency
○ Huge manual effort to analyze cluster/queue high load
○ Blind to “bad” jobs
Objectives
9
For Developers | For Operators | For Managers
❏ Application-specific diagnostics and performance recommendations
❏ Highlight applications that need attention
❏ Identify bottlenecks and resource usage
❏ Reduce performance incidents in production
❏ Easy communication back to developers for detailed performance insights
❏ Shorten time to production
❏ Resource usage insight and guidance
❏ Increase cluster ROI
JPM Architecture
10
Job Processing
11
JPM job/runtime processor (bolt)
Profile listener
12
● Collect/dump extra metrics for compatibility purposes
○ Real memory usage
○ RPC count
○ Input/Output
* With this version of the Spark profiler, we also modify Spark
Core to expose memory-related metrics.
(Diagram: the Spark Driver’s DAGScheduler posts Events to the ListenerBus; the JPM profiler registers CatalogEventListener, ExecutionPlanListener, and ExecutorMetricsListener, and exposes the collected metrics through the HDFS Rest API.)
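As a rough illustration of what an ExecutorMetricsListener-style collector does, the sketch below aggregates per-executor peak memory from Spark event-log JSON lines. The event and field names are modeled on Spark's event-log format; the helper itself is hypothetical, not JPM code:

```python
import json
from collections import defaultdict

def peak_memory_per_executor(event_log_lines):
    """Track the max 'Peak Execution Memory' seen per executor across TaskEnd events."""
    peaks = defaultdict(int)
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # ignore job/stage-level events
        executor_id = event["Task Info"]["Executor ID"]
        metrics = event.get("Task Metrics") or {}
        peaks[executor_id] = max(peaks[executor_id],
                                 metrics.get("Peak Execution Memory", 0))
    return dict(peaks)

sample = [
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "1"},
                "Task Metrics": {"Peak Execution Memory": 512 * 1024 * 1024}}),
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "1"},
                "Task Metrics": {"Peak Execution Memory": 256 * 1024 * 1024}}),
    json.dumps({"Event": "SparkListenerJobEnd", "Job ID": 0}),
]
print(peak_memory_per_executor(sample))  # {'1': 536870912}
```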
13
JPM Analysis Service
JPM service backend
14–22
(JPM Analysis Service — diagram slides)
Success Cases
23
❖ Reduce High RPC Jobs
❖ Reduce Account Usage
❖ Repeatedly failed jobs
❖ Optimize job path with data lineage
❖ Historical based optimization
❖ Running job issue detection
24
Reduce High RPC Jobs
● Background: Jobs with high RPC counts
● Solution: JPM alerts on high-RPC jobs with advice:
○ add a reducer for map-only jobs (hint)
○ change mapper join to reducer join (pipeline optimization)
● Sample: RPC calls for one job dropped from 43M to 46K.
(Charts: Cluster RPC Queue Time, Job Resource Usage Trend. Component: Metrics Engine)
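A minimal sketch of the alerting rule described above, assuming hypothetical job-metric fields (`rpc_count`, `reducers`, `join_strategy`) supplied by the Metrics Engine; the threshold is illustrative:

```python
def high_rpc_advice(job, rpc_threshold=10_000_000):
    """Return tuning advice for jobs whose RPC count exceeds the threshold."""
    advice = []
    if job["rpc_count"] <= rpc_threshold:
        return advice  # nothing to flag
    if job.get("reducers", 0) == 0:
        advice.append("add a reducer for this map-only job (hint)")
    if job.get("join_strategy") == "map":
        advice.append("change mapper join to reducer join (pipeline optimization)")
    return advice

job = {"name": "daily_agg", "rpc_count": 43_000_000,
       "reducers": 0, "join_strategy": "map"}
for tip in high_rpc_advice(job):
    print(tip)
```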
25
Reduce Account Usage
*HCU (Hadoop Compute Unit): 1 HCU equals 1 GB of memory used for 1
second, or 0.5 GB used for 2 seconds.
● Background: Spark jobs may request much more
memory than they actually need.
● Solution: JPM highlights resource-wasting jobs
with advice:
○ recommend the advisory memory configuration
○ combine SQLs that share the same table scan
● Sample: usage for the account b_seo_eng
decreased from 500MB to 30MB, saving around 1.5%
of cluster capacity.
(Components: Metrics Engine, Resource Analyzer, Catalog Analyzer)
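The HCU arithmetic and the advisory-memory idea can be sketched as follows; `advisory_executor_memory` and its 20% headroom are illustrative assumptions, not JPM's actual formula:

```python
def hcu(memory_gb, seconds):
    """Hadoop Compute Unit: 1 HCU = 1 GB of memory held for 1 second."""
    return memory_gb * seconds

def advisory_executor_memory(peak_used_gb, headroom=1.2):
    """Suggest a memory setting: observed peak plus ~20% headroom (assumed policy)."""
    return round(peak_used_gb * headroom, 1)

# An executor requesting 8 GB for an hour costs 28,800 HCU; if its real peak
# was 3 GB, an advisory setting of ~3.6 GB cuts the bill by more than half.
print(hcu(8, 3600), hcu(advisory_executor_memory(3.0), 3600))
```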
26
Repeatedly Failed Jobs
● Background: Repeatedly failed jobs usually mean
there are many optimization opportunities in them.
● Solution: On the JPM Spotlight page, repeatedly
failed jobs are grouped by
○ failure exception | user | diagnosis
○ JPM limits the resources of high-failure-rate
jobs, stops 0%-success jobs when they exceed a
threshold, and alerts the users (configurable).
● Sample: The stopped jobs save around 1.4% of cluster
usage per week.
(Components: Metrics Engine, Resource Analyzer, Log Diagnoser)
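A sketch of the grouping key and stop rule above; the record fields and the failure threshold are hypothetical stand-ins for the real Log Diagnoser output:

```python
from collections import defaultdict

def spotlight_groups(failed_jobs):
    """Group repeatedly failed jobs by (failure exception, user, diagnosis)."""
    groups = defaultdict(list)
    for job in failed_jobs:
        groups[(job["exception"], job["user"], job["diagnosis"])].append(job["job_id"])
    return dict(groups)

def should_stop(failed_runs, total_runs, max_failures=5):
    """Stop a job that has never succeeded once it exceeds the failure threshold."""
    return failed_runs == total_runs and failed_runs > max_failures

failures = [
    {"job_id": "j1", "exception": "OutOfMemoryError", "user": "b_seo", "diagnosis": "heap"},
    {"job_id": "j2", "exception": "OutOfMemoryError", "user": "b_seo", "diagnosis": "heap"},
]
print(spotlight_groups(failures))
```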
27
Optimize job path with data lineage
● Background: Over 80K apps run per day in our YARN
clusters. Some of them are not developed to a common
standard, and metadata is often unclear.
● Solution: JPM builds the data lineage by analysing
jobs, analysing the audit log, extracting the Hive metastore,
and combining the OIV output. The following actions benefit
from the lineage:
○ SQL combination
○ Hotspot detection and optimization
○ Retiring useless data/jobs
(Components: Catalog Analyzer, Data Lineage, Auditlog, OIV)
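Conceptually, lineage can be derived by joining each job's reads and writes into a dataset → job → dataset graph; the record shape below is a simplified stand-in for the audit-log/OIV data:

```python
def lineage_edges(audit_records):
    """Build dataset -> job -> dataset edges from simplified audit-log records."""
    edges = []
    for rec in audit_records:
        edges += [(src, rec["job"]) for src in rec["reads"]]
        edges += [(rec["job"], dst) for dst in rec["writes"]]
    return edges

def has_downstream(dataset, edges):
    """A dataset that nothing reads is a candidate for retirement."""
    return any(src == dataset for src, _ in edges)

records = [
    {"job": "seo_daily", "reads": ["/sys/edw/dw_lstg_item/orc"], "writes": ["/tmp/seo_out"]},
]
edges = lineage_edges(records)
print(has_downstream("/tmp/seo_out", edges))  # no job reads the output
```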
28
● Sample 1: SQL combination / hotspot detection
○ The SEO team has many batch jobs that
scan the same big table without an
intermediate table; the only difference
in their outputs is the grouping
condition.
● Sample 2: Retiring useless data/jobs
○ There are many jobs without downstream
jobs whose data has not been accessed
for over 6 months.
Table/Folder                                    Savings
/sys/edw/dw_lstg_item/orc
/sys/edw/dw_lstg_item/orc_partitioned           Apollo (1.3%)
/sys/edw/dw_lstg_item_cold/orc
/sys/edw/dw_lstg_item_cold/orc_partitioned      Ares (0.4%)
/sys/edw/dw_checkout_trans/orc                  Ares (0.15%)
29
Historical-based optimization
● Background: This is an old topic but always useful.
What we care about here are the workload and
environment differences between running
instances.
● Solution: Besides showing trends, JPM can:
○ analyze the entire workload across multiple
queue levels and the cluster environment.
○ tell us the impact of queue and environment size.
○ tell us changes in configuration.
○ give advice on job scheduling strategy (WIP).
(Components: HBO, Configuration Diff, Resource Analyzer, Metrics Engine)
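The configuration-diff check can be sketched as a symmetric comparison of two runs' Spark configs; the helper name and return shape are illustrative:

```python
def config_diff(previous, current):
    """Report configs that were added, removed, or changed between two runs."""
    diff = {}
    for key in previous.keys() | current.keys():
        old, new = previous.get(key), current.get(key)
        if old != new:
            diff[key] = (old, new)  # None marks an added/removed key
    return diff

yesterday = {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "200"}
today = {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"}
print(config_diff(yesterday, today))  # {'spark.executor.memory': ('8g', '4g')}
```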
30
● Sample 1: Job slowness due to resource crisis
○ “My job is running 30 minutes slower than yesterday, what happened?”
31
● Sample 2: Job failed due to an unexpected configuration change
○ “My job failed today, but I did nothing. Any changes from the platform side?”
32
Running job issue detection
● Background: Slowness in critical jobs should be detected while they run, not after a missed SLA. Hanging jobs should be
distinguished from (long-)running jobs. Job/data development teams need aggregated job/cluster metrics ASAP.
● Solution: JPM lets users query a suspected slow running job on demand and get the report
on the fly. Currently JPM can identify five kinds of cases:
○ Queue resource overload
○ Slowdown by preemption
○ Shuffle data skew
○ Disk-failure related
○ Known Spark/Hadoop bug detection
(Components: Resource Analyzer, Metrics Engine, Running Job Checker, Stack Analyzer)
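One of the five checks, shuffle-data skew, can be approximated by comparing the largest task's shuffle read against the median; the 5x factor is an illustrative threshold, not JPM's tuned value:

```python
from statistics import median

def is_shuffle_skewed(shuffle_read_bytes, factor=5.0):
    """Flag shuffle-data skew when the largest task reads far more than the median task."""
    med = median(shuffle_read_bytes)
    return med > 0 and max(shuffle_read_bytes) > factor * med

# Ten uniform tasks plus one straggler reading 10x the median -> skewed.
print(is_shuffle_skewed([100] * 10 + [1000]))
```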
33
● Sample 1:
34
● Sample 2: Known Spark/Hadoop bug auto-detection (WIP)
○ Job hangs due to the known bug “[HDFS-10223]
peerFromSocketAndKey performs SASL exchange
before setting connection timeouts”
JPM snapshots the thread dump of the executor
on which a slow task is running and analyzes it
automatically.
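A sketch of matching a known-bug signature against an executor thread dump; the marker frame comes from HDFS-10223, while the dump parsing itself is a simplified assumption about jstack-style output:

```python
def suspicious_threads(thread_dump, marker="peerFromSocketAndKey"):
    """Return names of threads whose stack contains the known-bug marker frame."""
    hits = []
    for block in thread_dump.strip().split("\n\n"):
        if marker in block:
            header = block.splitlines()[0]
            # jstack headers look like: "Executor task launch worker-3" #42 ...
            hits.append(header.split('"')[1] if '"' in header else header)
    return hits

dump = '''"Executor task launch worker-3" #42 daemon
   java.lang.Thread.State: RUNNABLE
\tat org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(...)

"dispatcher-event-loop-0" #18 daemon
   java.lang.Thread.State: WAITING'''
print(suspicious_threads(dump))  # ['Executor task launch worker-3']
```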
35
Some of these issues may cause a hung task/job before Spark 2.4:
https://issues.apache.org/jira/browse/SPARK-22172
https://issues.apache.org/jira/browse/SPARK-22074
https://issues.apache.org/jira/browse/SPARK-18971
https://issues.apache.org/jira/browse/SPARK-21928
https://issues.apache.org/jira/browse/SPARK-22083
https://issues.apache.org/jira/browse/SPARK-14958
https://issues.apache.org/jira/browse/SPARK-20079
https://issues.apache.org/jira/browse/SPARK-13931
https://issues.apache.org/jira/browse/SPARK-19617
https://issues.apache.org/jira/browse/SPARK-23365
https://issues.apache.org/jira/browse/SPARK-21834
https://issues.apache.org/jira/browse/SPARK-19631
https://issues.apache.org/jira/browse/SPARK-21656
JPM frontend and UI
36
(Diagram: the Frontend Layer consists of a Restful API layer and Portal UI, which read Job/Metrics/Analysis/Suggestion data; a Health Monitor, which reads ES/Storm/Metrics/Status; and a Configuration Manager and deploy component, which reads/writes the conf for each cluster.)
Restful API Extension
37
❖ Based on Antlr4
Part of SearchQuery.g4
grammar SearchQuery;
//[@site="[SITEPARAM]"]{@jobId,@jobDefId,@jobName,@currentState,@user,@queue,@startTime,@endTime,@jobType}&pageSize=[PAGESIZEPARAM]&startTime=[STARTTIMEPARAM]&endTime=[ENDTIMEPARAM]
//JobProcessTimeStampService[@site="SITEPARAM"]<@site>{max(currentTimeStamp)}
query : clzName (filter)? (sort)? (selector)? (aggregator)? (search_max_size)?;
//query : value;
clzName : KEY;
filter : '['filter_list']' | '[]';
filter_list : (filter_item','filter_list) | filter_item;
filter_item : filter_item_equal | filter_item_range | filter_item_time_range | filter_item_compare | filter_item_terms;
selector : '{'selector_list'}' | '{}';
selector_list : (selector_item','selector_list) | selector_item;
selector_item : '@'KEY;
aggregator : '<'aggregator_list'>' | '<>';
aggregator_list : (aggregator_item','aggregator_list) | aggregator_item;
aggregator_item : aggregator_term_item | aggregator_top_item | aggregator_stat_item | aggregator_stat_list |aggregator_nested_item | aggregator_histo_item;
Example
38
api/elastic/search?query=spark_app_entity[@site="apollophx"]&@hcu<&size=100
spark_app_entity/_search
{
  "query": {
    "term": {
      "site": {
        "value": "apollophx"
      }
    }
  },
  "sort": [
    {
      "hcu": {
        "order": "desc"
      }
    }
  ],
  "size": 100
}
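A hypothetical Python helper showing how the DSL pieces above (`[@site="..."]`, the `@hcu<` descending sort, `size=100`) could map onto the Elasticsearch body; this is not the actual JPM translator:

```python
def build_es_query(site, sort_field=None, descending=True, size=100):
    """Map the DSL pieces onto an Elasticsearch query body."""
    body = {"query": {"term": {"site": {"value": site}}}, "size": size}
    if sort_field:
        body["sort"] = [{sort_field: {"order": "desc" if descending else "asc"}}]
    return body

# Equivalent of: spark_app_entity[@site="apollophx"]&@hcu<&size=100
print(build_es_query("apollophx", sort_field="hcu"))
```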
39
Job Spotlight
(Slides 39–43: Job Spotlight UI screenshots)
Summary
44
Improve
Development
Experience
Increase
Resource
Efficiency
Efficiency & Near-real-time
Self-serve & Automatic
Dev & Ops Friendly
Entire views
Job-level resource analysis
One-stop management
Open Source Plan
45
In the near future
JPM vs similar product (open source v2.0.6)
46
Dimension      Dr                                                     JPM
Scope          Only cares about isolated applications                 Has all user-related info and cluster resource status
Diagnostics    Based on metrics                                       Aggregates failed-job logs to diagnose
Scalability    Uses a thread pool, hard to scale horizontally         Uses distributed streaming
Maintenance    One instance per cluster                               Designed for cross-cluster use within one instance
Volume         MySQL backend cannot store all task-level entities     Elasticsearch handles larger volumes
Veracity       Samples to avoid OOM, sacrificing veracity of results  Precisely handles every task
Availability   Single point of failure                                No single point of failure
Variety        Analysis based only on metrics                         Plus environment and cluster status
Realtime       Weak, since it depends on SHS                          Real-time
Relationship   N/A                                                    Uses the data pipeline to detect issues
Histories      N/A                                                    History-based analysis
Runtime        N/A                                                    Running-job analysis
Q & A
47
Thank you!