Monitoring Spark Applications
Tzach Zohar @ Kenshoo, March/2016
Who am I
System Architect @ Kenshoo
Java backend for 10 years
Working with Scala + Spark for 2 years
https://www.linkedin.com/in/tzachzohar
Who’s Kenshoo
10-year-old, Tel-Aviv-based startup
Industry Leader in Digital Marketing
500+ employees
Heavy data shop
http://kenshoo.com/
And who’re you?
Agenda
Why Monitor
Spark UI
Spark REST API
Spark Metric Sinks
Applicative Metrics
The Importance of Being Earnest
Why Monitor
Failures
Performance
Know your data
Correctness of output
Monitoring Distributed Systems
No single log file
No single User Interface
Often - no single framework (e.g. Spark + YARN + HDFS…)
Spark UI
Spark UI
See http://spark.apache.org/docs/latest/monitoring.html#web-interfaces
The first go-to tool for understanding what’s what
Created per SparkContext
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Use the “DAG Visualization” in Job Details to:
Understand flow
Detect caching opportunities
Spark UI
Jobs -> Stages -> Tasks
Detect unbalanced stages
Detect GC issues
Spark UI
Jobs -> Stages -> Tasks -> “Event Timeline”
Detect stragglers
Detect repartitioning opportunities
Spark UI Disadvantages
“Ad-Hoc”, no history*
Human readable, but not machine readable
Data points, not data trends
Spark UI Disadvantages
UI can quickly become hard to use…
Spark REST API
Spark’s REST API
See http://spark.apache.org/docs/latest/monitoring.html#rest-api
Programmatic access to UI’s data (jobs, stages, tasks, executors, storage…)
Useful for aggregations over similar jobs
Spark’s REST API
Example: calculate total shuffle statistics:
// imports assumed: json4s for JSON parsing, scala.io.Source for the HTTP call
import scala.io.Source.fromURL
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object SparkAppStats {
  // only the fields we aggregate; json4s ignores the rest of the stage JSON
  case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)
  implicit val formats = DefaultFormats
  val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"

  def main(args: Array[String]) {
    val json = fromURL(url).mkString
    val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
    println("stages count: " + stages.size)
    println("shuffleWriteBytes: " + stages.map(_.shuffleWriteBytes).sum)
    println("memoryBytesSpilled: " + stages.map(_.memoryBytesSpilled).sum)
    println("diskBytesSpilled: " + stages.map(_.diskBytesSpilled).sum)
  }
}
Example: calculate total shuffle statistics:
Example output:
stages count: 1435
shuffleWriteBytes: 8488622429
memoryBytesSpilled: 120107947855
diskBytesSpilled: 1505616236
Spark’s REST API
Spark’s REST API
Example: calculate total time per job name:
// same imports as before, plus java.util.Date for the timestamp fields
import java.util.Date

val url = "http://<host>:4040/api/v1/applications/<app-name>/jobs"

case class SparkJob(jobId: Int, name: String, submissionTime: Date, completionTime: Option[Date], stageIds: List[Int]) {
  def getDurationMillis: Option[Long] = completionTime.map(_.getTime - submissionTime.getTime)
}

def main(args: Array[String]) {
  val json = fromURL(url).mkString
  parse(json)
    .extract[List[SparkJob]]
    .filter(j => j.getDurationMillis.isDefined) // only completed jobs
    .groupBy(_.name)
    .mapValues(list => (list.map(_.getDurationMillis.get).sum, list.size))
    .foreach { case (name, (time, count)) => println(s"TIME: $time\tAVG: ${time / count}\tNAME: $name") }
}
Spark’s REST API
Example: calculate total time per job name:
Example output:
TIME: 182570 AVG: 16597 NAME: count at MyAggregationService.scala:132
TIME: 230973 AVG: 1297 NAME: parquet at MyRepository.scala:99
TIME: 120393 AVG: 2188 NAME: collect at MyCollector.scala:30
TIME: 5645 AVG: 627 NAME: collect at MyCollector.scala:103
But that’s still ad-hoc, right?
Spark Metric Sinks
Metrics
See http://spark.apache.org/docs/latest/monitoring.html#metrics
Spark uses the popular dropwizard.metrics library (renamed from codahale.metrics and yammer.metrics)
An easy Java API for creating and updating metrics stored in memory, e.g.:
// Gauge for executor thread pool's actively executing task counts
metricRegistry.register(name("threadpool", "activeTasks"), new Gauge[Int] {
override def getValue: Int = threadPool.getActiveCount()
})
Metrics
What is metered? We couldn’t find any detailed documentation of this
This trick flushes most of them out: search the Spark sources for “metricRegistry.register”
Where do these metrics go?
Spark Metric Sinks
A “Sink” is an interface for viewing these metrics, at given intervals or ad-hoc
Available sinks: Console, CSV, SLF4J, Servlet, JMX, Graphite, Ganglia*
We use the Graphite Sink to send all metrics to Graphite
$SPARK_HOME/metrics.properties:
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<your graphite hostname>
*.sink.graphite.port=2003
*.sink.graphite.period=30
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=<token>.<app-name>.<host-name>
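If the file lives somewhere else, Spark’s standard spark.metrics.conf property can point to it. A minimal sketch of setting it programmatically (app name and path are illustrative):

// Sketch: point Spark at a custom metrics config file (spark.metrics.conf is a standard Spark property)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                                      // illustrative
  .set("spark.metrics.conf", "/path/to/metrics.properties")  // illustrative path
val sc = new SparkContext(conf)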
… and it’s in Graphite (+ Grafana)
Graphite Sink
Very useful for trend analysis
WARNING: Not suitable for short-running applications (will pollute Graphite with new metrics for each application)
Requires some Graphite tricks to get clear readings (wildcards, sums, derivatives, etc.)
Applicative Metrics
The Missing Piece
Spark meters its internals pretty thoroughly, but what about your internals?
Applicative metrics are a great tool for knowing your data and verifying output correctness
We use Dropwizard Metrics + Graphite for this too (everywhere)
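For illustration only (not from the slides), a minimal sketch of wiring such applicative metrics to Graphite with Dropwizard Metrics, assuming the com.codahale.metrics 3.x API; the object name, metric names and prefix are all hypothetical:

// Sketch: applicative counters reported to Graphite via Dropwizard Metrics (assumed 3.x API)
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object MyPipelineMetrics {                               // hypothetical
  val registry = new MetricRegistry()
  val inputRecords = registry.counter("input.records")   // bumped from driver-side code
  val parsingFailures = registry.counter("parsing.failures")

  // report every registered metric to Graphite every 30 seconds
  def startReporting(host: String, port: Int, prefix: String): GraphiteReporter = {
    val graphite = new Graphite(new InetSocketAddress(host, port))
    val reporter = GraphiteReporter.forRegistry(registry).prefixedWith(prefix).build(graphite)
    reporter.start(30, TimeUnit.SECONDS)
    reporter
  }
}

Counters are then bumped wherever records are read, written or rejected, e.g. inputRecords.inc(batch.size).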
Counting RDD Elements
rdd.count() might be costly (another action)
Spark Accumulators are a good alternative
Trick: send accumulator results to Graphite, using “Counter-backed Accumulators”
// imports assumed: Spark's (pre-2.0) Accumulator API and the Yammer/Dropwizard Metrics Counter
import scala.reflect.ClassTag
import org.apache.spark.Accumulator
import org.apache.spark.rdd.RDD
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Counter, MetricName}

/**
 * Call the returned callback after acting on the returned RDD to get the counter updated
 */
def countSilently[V: ClassTag](rdd: RDD[V], metricName: String, clazz: Class[_]): (RDD[V], Unit => Unit) = {
  val counter: Counter = Metrics.newCounter(new MetricName(clazz, metricName))
  val accumulator: Accumulator[Long] = rdd.sparkContext.accumulator(0, metricName)
  val countedRdd = rdd.map(v => { accumulator += 1; v })            // count elements as they flow through
  val callback: Unit => Unit = u => counter.inc(accumulator.value)  // flush the final count to the metric
  (countedRdd, callback)
}
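A hedged usage sketch of the above (rdd, MyService and outputPath are illustrative); the callback must be called only after an action has materialized the RDD, otherwise the accumulator is still zero:

// Usage sketch (names illustrative): wrap the RDD, run an action, then flush the count
val (countedRdd, updateCounter) = countSilently(rdd, "outputRecords", classOf[MyService])
countedRdd.saveAsTextFile(outputPath)  // any action that actually materializes the RDD
updateCounter(())                      // accumulator is now final - push its value to the Counter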
Counting RDD Elements
We Measure...
Input records
Output records
Parsing failures
Average job time
Data “freshness” histogram
Much much more...
WARNING: it’s addictive...
Monitoring Spark Applications
Conclusions
Spark provides a wide variety of monitoring options
Each one should be used when appropriate - no single one is sufficient on its own
Metrics + Graphite + Grafana can give you visibility into any numeric time series
Questions?
Thank you