Managing Apache Spark Workload
and Automatic Optimizing
Lantao Jin,
Software Engineer, Data Platform Engineering (eBay)
Who We Are
2
● Data Platform Engineering team at eBay
● We build an automated data platform and self-serve site with minimal touch points
● Focus on Spark optimization and self-serve platform building
What We Do
3
● Build a one-stop platform for Spark/Hadoop
● Manage the entire Spark/Hadoop workload
● Open APIs and self-serve tools for users
● Performance tuning for the Spark engine and jobs
Why Manage Spark Workload
4
● Root-cause analysis for complex job failures
● Extreme performance tuning and optimization
● Maximum resource utilization
● Compute showback and capacity planning in a global view
Agenda
5
❖ Mission & Gaps & Challenges
❖ Architecture & Design
❖ JPM Analysis Service
❖ Success Cases
❖ Summary
Challenges
6
● Over 20 production clusters
● Over 500PB of data
● Over 5PB (compressed) of incremental data per day
● Over 80,000 jobs per day
● Metadata of jobs/data is not clear
● Many job types: Pig, Hive, Cascading, Spark, MapReduce
● Jobs are not developed to a common standard
● More than 20 teams to communicate with and hundreds of batch users
● Job onboarding is out of control
Mission
7
Improve
Development
Experience
Increase
Resource
Efficiency
Gaps
8
● Development Experience
○ Distributed logging service for failure diagnostics
○ Job/task-level metrics are hard for developers to understand
○ Application healthiness visibility
○ Tedious communication to resolve any workload issue
● Resource Efficiency
○ Huge manual effort to analyze cluster/queue high load
○ Blind to “bad” jobs
Objectives
9
For Developers | For Operators | For Managers
❏ Application-specific diagnostics and performance recommendations
❏ Highlight applications that need attention
❏ Identify bottlenecks and resource usage
❏ Reduce performance incidents in production
❏ Easy communication back to developers for detailed performance insights
❏ Shorten time to production
❏ Resource usage insight and guidance
❏ Increase cluster ROI
JPM Architecture
10
Job Processing
11
JPM job/runtime processor (bolt)
Profile listener
12
● Collect/dump extra metrics for compatibility purposes
○ Real memory usage
○ RPC count
○ Input/Output
* With this version of the Spark profiler, we also modify Spark
Core to expose memory-related metrics.
(Diagram: the Spark Driver’s DAGScheduler posts Events to the ListenerBus; the JPM profiler registers CatalogEventListener, ExecutionPlanListener, and ExecutorMetricsListener, and exposes the collected metrics through the HDFS Rest API.)
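As a rough illustration of what an ExecutorMetricsListener-style collector does, the sketch below aggregates per-executor peak memory from Spark event-log JSON lines. The event and field names are modeled on Spark's event-log format; the helper itself is hypothetical, not JPM code:

```python
import json
from collections import defaultdict

def peak_memory_per_executor(event_log_lines):
    """Track the max 'Peak Execution Memory' seen per executor across TaskEnd events."""
    peaks = defaultdict(int)
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # ignore job/stage-level events
        executor_id = event["Task Info"]["Executor ID"]
        metrics = event.get("Task Metrics") or {}
        peaks[executor_id] = max(peaks[executor_id],
                                 metrics.get("Peak Execution Memory", 0))
    return dict(peaks)

sample = [
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "1"},
                "Task Metrics": {"Peak Execution Memory": 512 * 1024 * 1024}}),
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Info": {"Executor ID": "1"},
                "Task Metrics": {"Peak Execution Memory": 256 * 1024 * 1024}}),
    json.dumps({"Event": "SparkListenerJobEnd", "Job ID": 0}),
]
print(peak_memory_per_executor(sample))  # {'1': 536870912}
```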
13
JPM Analysis Service
JPM service backend
14–22
(JPM Analysis Service — diagram slides)
Success Cases
23
❖ Reduce High RPC Jobs
❖ Reduce Account Usage
❖ Repeatedly failed jobs
❖ Optimize job path with data lineage
❖ Historical based optimization
❖ Running job issue detection
24
Reduce High RPC Jobs
● Background: Jobs with high RPC counts
● Solution: JPM alerts on high-RPC jobs with advice:
○ add a reducer for map-only jobs (hint)
○ change mapper join to reducer join (pipeline optimization)
● Sample: RPC calls for one job dropped from 43M to 46K.
(Charts: Cluster RPC Queue Time, Job Resource Usage Trend. Component: Metrics Engine)
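A minimal sketch of the alerting rule described above, assuming hypothetical job-metric fields (`rpc_count`, `reducers`, `join_strategy`) supplied by the Metrics Engine; the threshold is illustrative:

```python
def high_rpc_advice(job, rpc_threshold=10_000_000):
    """Return tuning advice for jobs whose RPC count exceeds the threshold."""
    advice = []
    if job["rpc_count"] <= rpc_threshold:
        return advice  # nothing to flag
    if job.get("reducers", 0) == 0:
        advice.append("add a reducer for this map-only job (hint)")
    if job.get("join_strategy") == "map":
        advice.append("change mapper join to reducer join (pipeline optimization)")
    return advice

job = {"name": "daily_agg", "rpc_count": 43_000_000,
       "reducers": 0, "join_strategy": "map"}
for tip in high_rpc_advice(job):
    print(tip)
```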
25
Reduce Account Usage
*HCU (Hadoop Compute Unit): 1 HCU equals 1 GB of memory used for 1
second, or 0.5 GB used for 2 seconds.
● Background: Spark jobs may request much more
memory than they actually need.
● Solution: JPM highlights resource-wasting jobs
with advice:
○ recommend the advisory memory configuration
○ combine SQLs that share the same table scan
● Sample: usage for the account b_seo_eng
decreased from 500MB to 30MB, saving around 1.5%
of cluster capacity.
(Components: Metrics Engine, Resource Analyzer, Catalog Analyzer)
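The HCU arithmetic and the advisory-memory idea can be sketched as follows; `advisory_executor_memory` and its 20% headroom are illustrative assumptions, not JPM's actual formula:

```python
def hcu(memory_gb, seconds):
    """Hadoop Compute Unit: 1 HCU = 1 GB of memory held for 1 second."""
    return memory_gb * seconds

def advisory_executor_memory(peak_used_gb, headroom=1.2):
    """Suggest a memory setting: observed peak plus ~20% headroom (assumed policy)."""
    return round(peak_used_gb * headroom, 1)

# An executor requesting 8 GB for an hour costs 28,800 HCU; if its real peak
# was 3 GB, an advisory setting of ~3.6 GB cuts the bill by more than half.
print(hcu(8, 3600), hcu(advisory_executor_memory(3.0), 3600))
```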
26
Repeatedly Failed Jobs
● Background: Repeatedly failed jobs usually mean
there are many optimization opportunities in them.
● Solution: On the JPM Spotlight page, repeatedly
failed jobs are grouped by
○ failure exception | user | diagnosis
○ JPM limits the resources of high-failure-rate
jobs, stops 0%-success jobs when they exceed a
threshold, and alerts the users (configurable).
● Sample: The stopped jobs save around 1.4% of cluster
usage per week.
(Components: Metrics Engine, Resource Analyzer, Log Diagnoser)
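A sketch of the grouping key and stop rule above; the record fields and the failure threshold are hypothetical stand-ins for the real Log Diagnoser output:

```python
from collections import defaultdict

def spotlight_groups(failed_jobs):
    """Group repeatedly failed jobs by (failure exception, user, diagnosis)."""
    groups = defaultdict(list)
    for job in failed_jobs:
        groups[(job["exception"], job["user"], job["diagnosis"])].append(job["job_id"])
    return dict(groups)

def should_stop(failed_runs, total_runs, max_failures=5):
    """Stop a job that has never succeeded once it exceeds the failure threshold."""
    return failed_runs == total_runs and failed_runs > max_failures

failures = [
    {"job_id": "j1", "exception": "OutOfMemoryError", "user": "b_seo", "diagnosis": "heap"},
    {"job_id": "j2", "exception": "OutOfMemoryError", "user": "b_seo", "diagnosis": "heap"},
]
print(spotlight_groups(failures))
```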
27
Optimize job path with data lineage
● Background: Over 80K apps run per day in our YARN
clusters. Some of them are not developed to a common
standard, and metadata is often unclear.
● Solution: JPM builds the data lineage by analysing
jobs, analysing the audit log, extracting the Hive metastore,
and combining the OIV output. The following actions benefit
from the lineage:
○ SQL combination
○ Hotspot detection and optimization
○ Retiring useless data/jobs
(Components: Catalog Analyzer, Data Lineage, Auditlog, OIV)
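Conceptually, lineage can be derived by joining each job's reads and writes into a dataset → job → dataset graph; the record shape below is a simplified stand-in for the audit-log/OIV data:

```python
def lineage_edges(audit_records):
    """Build dataset -> job -> dataset edges from simplified audit-log records."""
    edges = []
    for rec in audit_records:
        edges += [(src, rec["job"]) for src in rec["reads"]]
        edges += [(rec["job"], dst) for dst in rec["writes"]]
    return edges

def has_downstream(dataset, edges):
    """A dataset that nothing reads is a candidate for retirement."""
    return any(src == dataset for src, _ in edges)

records = [
    {"job": "seo_daily", "reads": ["/sys/edw/dw_lstg_item/orc"], "writes": ["/tmp/seo_out"]},
]
edges = lineage_edges(records)
print(has_downstream("/tmp/seo_out", edges))  # no job reads the output
```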
28
● Sample 1: SQL combination / hotspot detection
○ The SEO team has many batch jobs that
scan the same big table without an
intermediate table; the only difference
in their outputs is the grouping
condition.
● Sample 2: Retiring useless data/jobs
○ There are many jobs without downstream
jobs whose data has not been accessed
for over 6 months.
Table/Folder                                    Savings
/sys/edw/dw_lstg_item/orc
/sys/edw/dw_lstg_item/orc_partitioned           Apollo (1.3%)
/sys/edw/dw_lstg_item_cold/orc
/sys/edw/dw_lstg_item_cold/orc_partitioned      Ares (0.4%)
/sys/edw/dw_checkout_trans/orc                  Ares (0.15%)
29
Historical-based optimization
● Background: This is an old topic but always useful.
What we care about here are the workload and
environment differences between running
instances.
● Solution: Besides showing trends, JPM can:
○ analyze the entire workload across multiple
queue levels and the cluster environment.
○ tell us the impact of queue and environment size.
○ tell us changes in configuration.
○ give advice on job scheduling strategy (WIP).
(Components: HBO, Configuration Diff, Resource Analyzer, Metrics Engine)
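The configuration-diff check can be sketched as a symmetric comparison of two runs' Spark configs; the helper name and return shape are illustrative:

```python
def config_diff(previous, current):
    """Report configs that were added, removed, or changed between two runs."""
    diff = {}
    for key in previous.keys() | current.keys():
        old, new = previous.get(key), current.get(key)
        if old != new:
            diff[key] = (old, new)  # None marks an added/removed key
    return diff

yesterday = {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "200"}
today = {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"}
print(config_diff(yesterday, today))  # {'spark.executor.memory': ('8g', '4g')}
```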
30
● Sample 1: Job slowness due to resource crisis
○ “My job is running 30 minutes slower than yesterday, what happened?”
31
● Sample 2: Job failed due to an unexpected configuration change
○ “My job failed today, but I did nothing. Any changes from the platform side?”
32
Running job issue detection
● Background: Slowness in critical jobs should be detected while they run, not after a missed SLA. Hanging jobs should be
distinguished from (long-)running jobs. Job/data development teams need aggregated job/cluster metrics ASAP.
● Solution: JPM lets users query a suspected slow running job on demand and get the report
on the fly. Currently JPM can identify five kinds of cases:
○ Queue resource overload
○ Slowdown by preemption
○ Shuffle data skew
○ Disk-failure related
○ Known Spark/Hadoop bug detection
(Components: Resource Analyzer, Metrics Engine, Running Job Checker, Stack Analyzer)
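One of the five checks, shuffle-data skew, can be approximated by comparing the largest task's shuffle read against the median; the 5x factor is an illustrative threshold, not JPM's tuned value:

```python
from statistics import median

def is_shuffle_skewed(shuffle_read_bytes, factor=5.0):
    """Flag shuffle-data skew when the largest task reads far more than the median task."""
    med = median(shuffle_read_bytes)
    return med > 0 and max(shuffle_read_bytes) > factor * med

# Ten uniform tasks plus one straggler reading 10x the median -> skewed.
print(is_shuffle_skewed([100] * 10 + [1000]))
```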
33
● Sample 1:
34
● Sample 2: Known Spark/Hadoop bug auto-detection (WIP)
○ Job hangs due to the known bug “[HDFS-10223]
peerFromSocketAndKey performs SASL exchange
before setting connection timeouts”
JPM snapshots the thread dump of the executor
on which a slow task is running and analyzes it
automatically.
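A sketch of matching a known-bug signature against an executor thread dump; the marker frame comes from HDFS-10223, while the dump parsing itself is a simplified assumption about jstack-style output:

```python
def suspicious_threads(thread_dump, marker="peerFromSocketAndKey"):
    """Return names of threads whose stack contains the known-bug marker frame."""
    hits = []
    for block in thread_dump.strip().split("\n\n"):
        if marker in block:
            header = block.splitlines()[0]
            # jstack headers look like: "Executor task launch worker-3" #42 ...
            hits.append(header.split('"')[1] if '"' in header else header)
    return hits

dump = '''"Executor task launch worker-3" #42 daemon
   java.lang.Thread.State: RUNNABLE
\tat org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(...)

"dispatcher-event-loop-0" #18 daemon
   java.lang.Thread.State: WAITING'''
print(suspicious_threads(dump))  # ['Executor task launch worker-3']
```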
35
Some of these issues may cause a hung task/job before Spark 2.4:
https://issues.apache.org/jira/browse/SPARK-22172
https://issues.apache.org/jira/browse/SPARK-22074
https://issues.apache.org/jira/browse/SPARK-18971
https://issues.apache.org/jira/browse/SPARK-21928
https://issues.apache.org/jira/browse/SPARK-22083
https://issues.apache.org/jira/browse/SPARK-14958
https://issues.apache.org/jira/browse/SPARK-20079
https://issues.apache.org/jira/browse/SPARK-13931
https://issues.apache.org/jira/browse/SPARK-19617
https://issues.apache.org/jira/browse/SPARK-23365
https://issues.apache.org/jira/browse/SPARK-21834
https://issues.apache.org/jira/browse/SPARK-19631
https://issues.apache.org/jira/browse/SPARK-21656
JPM frontend and UI
36
(Diagram: the Frontend Layer consists of a Restful API layer and Portal UI, which read Job/Metrics/Analysis/Suggestion data; a Health Monitor, which reads ES/Storm/Metrics/Status; and a Configuration Manager and deploy component, which reads/writes the conf for each cluster.)
Restful API Extension
37
❖ Based on Antlr4
Part of SearchQuery.g4
grammar SearchQuery;
//[@site="[SITEPARAM]"]{@jobId,@jobDefId,@jobName,@currentState,@user,@queue,@startTime,@endTime,@jobType}&pageSize=[PAGESIZEPARAM]&startTime=[STARTTIMEPARAM]&endTime=[ENDTIMEPARAM]
//JobProcessTimeStampService[@site="SITEPARAM"]<@site>{max(currentTimeStamp)}
query : clzName (filter)? (sort)? (selector)? (aggregator)? (search_max_size)?;
//query : value;
clzName : KEY;
filter : '['filter_list']' | '[]';
filter_list : (filter_item','filter_list) | filter_item;
filter_item : filter_item_equal | filter_item_range | filter_item_time_range | filter_item_compare | filter_item_terms;
selector : '{'selector_list'}' | '{}';
selector_list : (selector_item','selector_list) | selector_item;
selector_item : '@'KEY;
aggregator : '<'aggregator_list'>' | '<>';
aggregator_list : (aggregator_item','aggregator_list) | aggregator_item;
aggregator_item : aggregator_term_item | aggregator_top_item | aggregator_stat_item | aggregator_stat_list |aggregator_nested_item | aggregator_histo_item;
Example
38
api/elastic/search?query=spark_app_entity[@site="apollophx"]&@hcu<&size=100
spark_app_entity/_search
{
  "query": {
    "term": {
      "site": {
        "value": "apollophx"
      }
    }
  },
  "sort": [
    {
      "hcu": {
        "order": "desc"
      }
    }
  ],
  "size": 100
}
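A hypothetical Python helper showing how the DSL pieces above (`[@site="..."]`, the `@hcu<` descending sort, `size=100`) could map onto the Elasticsearch body; this is not the actual JPM translator:

```python
def build_es_query(site, sort_field=None, descending=True, size=100):
    """Map the DSL pieces onto an Elasticsearch query body."""
    body = {"query": {"term": {"site": {"value": site}}}, "size": size}
    if sort_field:
        body["sort"] = [{sort_field: {"order": "desc" if descending else "asc"}}]
    return body

# Equivalent of: spark_app_entity[@site="apollophx"]&@hcu<&size=100
print(build_es_query("apollophx", sort_field="hcu"))
```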
39
Job Spotlight
(Slides 39–43: Job Spotlight UI screenshots)
Summary
44
Improve
Development
Experience
Increase
Resource
Efficiency
Efficiency & Near-real-time
Self-serve & Automatic
Dev & Ops Friendly
Entire views
Job-level resource analysis
One-stop management
Open Source Plan
45
In the near future
JPM vs similar product (open source v2.0.6)
46
Dimension      Dr                                                     JPM
Scope          Only cares about isolated applications                 Has all user-related info and cluster resource status
Diagnostics    Based on metrics                                       Aggregates failed-job logs to diagnose
Scalability    Uses a thread pool, hard to scale horizontally         Uses distributed streaming
Maintenance    One instance per cluster                               Designed for cross-cluster use within one instance
Volume         MySQL backend cannot store all task-level entities     Elasticsearch handles larger volumes
Veracity       Samples to avoid OOM, sacrificing veracity of results  Precisely handles every task
Availability   Single point of failure                                No single point of failure
Variety        Analysis based only on metrics                         Plus environment and cluster status
Realtime       Weak, since it depends on SHS                          Real-time
Relationship   N/A                                                    Uses the data pipeline to detect issues
Histories      N/A                                                    History-based analysis
Runtime        N/A                                                    Running-job analysis
Q & A
47
Thank you!