Monitoring Spark Applications
Tzach Zohar @ Kenshoo, March/2016
Who am I
System Architect @ Kenshoo
Java backend for 10 years
Working with Scala + Spark for 2 years
https://www.linkedin.com/in/tzachzohar
Who’s Kenshoo
10-year-old, Tel-Aviv-based startup
Industry Leader in Digital Marketing
500+ employees
Heavy data shop
http://kenshoo.com/
And who’re you?
Agenda
Why Monitor
Spark UI
Spark REST API
Spark Metric Sinks
Applicative Metrics
The Importance of Being Earnest
Why Monitor
Failures
Performance
Know your data
Correctness of output
Monitoring Distributed Systems
No single log file
No single User Interface
Often - no single framework (e.g. Spark + YARN + HDFS…)
Spark UI
Spark UI
See http://spark.apache.org/docs/latest/monitoring.html#web-interfaces
The first go-to tool for understanding what’s what
Created per SparkContext
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Use the “DAG Visualization” in Job Details to:
Understand flow
Detect caching opportunities
Spark UI
Jobs -> Stages -> Tasks
Detect unbalanced stages
Detect GC issues
Spark UI
Jobs -> Stages -> Tasks -> “Event Timeline”
Detect stragglers
Detect repartitioning opportunities
Spark UI Disadvantages
“Ad-Hoc”, no history*
Human readable, but not machine readable
Data points, not data trends
Spark UI Disadvantages
UI can quickly become hard to use…
Spark REST API
Spark’s REST API
See http://spark.apache.org/docs/latest/monitoring.html#rest-api
Programmatic access to UI’s data (jobs, stages, tasks, executors, storage…)
Useful for aggregations over similar jobs
Spark’s REST API
Example: calculate total shuffle statistics:
// imports assumed: json4s for JSON parsing, scala.io.Source for the HTTP call
import scala.io.Source.fromURL
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object SparkAppStats {
  // only the fields we aggregate; json4s ignores the rest of the stage JSON
  case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)
  implicit val formats = DefaultFormats
  val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"

  def main(args: Array[String]) {
    val json = fromURL(url).mkString
    val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
    println("stages count: " + stages.size)
    println("shuffleWriteBytes: " + stages.map(_.shuffleWriteBytes).sum)
    println("memoryBytesSpilled: " + stages.map(_.memoryBytesSpilled).sum)
    println("diskBytesSpilled: " + stages.map(_.diskBytesSpilled).sum)
  }
}
Example: calculate total shuffle statistics:
Example output:
stages count: 1435
shuffleWriteBytes: 8488622429
memoryBytesSpilled: 120107947855
diskBytesSpilled: 1505616236
Spark’s REST API
Spark’s REST API
Example: calculate total time per job name:
// same imports as before, plus java.util.Date for the timestamp fields
import java.util.Date

val url = "http://<host>:4040/api/v1/applications/<app-name>/jobs"

case class SparkJob(jobId: Int, name: String, submissionTime: Date, completionTime: Option[Date], stageIds: List[Int]) {
  def getDurationMillis: Option[Long] = completionTime.map(_.getTime - submissionTime.getTime)
}

def main(args: Array[String]) {
  val json = fromURL(url).mkString
  parse(json)
    .extract[List[SparkJob]]
    .filter(j => j.getDurationMillis.isDefined) // only completed jobs
    .groupBy(_.name)
    .mapValues(list => (list.map(_.getDurationMillis.get).sum, list.size))
    .foreach { case (name, (time, count)) => println(s"TIME: $time\tAVG: ${time / count}\tNAME: $name") }
}
Spark’s REST API
Example: calculate total time per job name:
Example output:
TIME: 182570 AVG: 16597 NAME: count at MyAggregationService.scala:132
TIME: 230973 AVG: 1297 NAME: parquet at MyRepository.scala:99
TIME: 120393 AVG: 2188 NAME: collect at MyCollector.scala:30
TIME: 5645 AVG: 627 NAME: collect at MyCollector.scala:103
But that’s still ad-hoc, right?
Spark Metric Sinks
Metrics
See http://spark.apache.org/docs/latest/monitoring.html#metrics
Spark uses the popular dropwizard.metrics library (renamed from codahale.metrics and yammer.metrics)
An easy Java API for creating and updating metrics stored in memory, e.g.:
// Gauge for executor thread pool's actively executing task counts
metricRegistry.register(name("threadpool", "activeTasks"), new Gauge[Int] {
override def getValue: Int = threadPool.getActiveCount()
})
Metrics
What is metered? We couldn’t find any detailed documentation of this
This trick flushes most of them out: search the Spark sources for “metricRegistry.register”
Where do these metrics go?
Spark Metric Sinks
A “Sink” is an interface for viewing these metrics, at given intervals or ad-hoc
Available sinks: Console, CSV, SLF4J, Servlet, JMX, Graphite, Ganglia*
We use the Graphite Sink to send all metrics to Graphite
$SPARK_HOME/metrics.properties:
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<your graphite hostname>
*.sink.graphite.port=2003
*.sink.graphite.period=30
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=<token>.<app-name>.<host-name>
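If the file lives somewhere else, Spark’s standard spark.metrics.conf property can point to it. A minimal sketch of setting it programmatically (app name and path are illustrative):

// Sketch: point Spark at a custom metrics config file (spark.metrics.conf is a standard Spark property)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                                      // illustrative
  .set("spark.metrics.conf", "/path/to/metrics.properties")  // illustrative path
val sc = new SparkContext(conf)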
… and it’s in Graphite (+ Grafana)
Graphite Sink
Very useful for trend analysis
WARNING: Not suitable for short-running applications (will pollute Graphite with new metrics for each application)
Requires some Graphite tricks to get clear readings (wildcards, sums, derivatives, etc.)
Applicative Metrics
The Missing Piece
Spark meters its internals pretty thoroughly, but what about your internals?
Applicative metrics are a great tool for knowing your data and verifying output correctness
We use Dropwizard Metrics + Graphite for this too (everywhere)
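For illustration only (not from the slides), a minimal sketch of wiring such applicative metrics to Graphite with Dropwizard Metrics, assuming the com.codahale.metrics 3.x API; the object name, metric names and prefix are all hypothetical:

// Sketch: applicative counters reported to Graphite via Dropwizard Metrics (assumed 3.x API)
import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object MyPipelineMetrics {                               // hypothetical
  val registry = new MetricRegistry()
  val inputRecords = registry.counter("input.records")   // bumped from driver-side code
  val parsingFailures = registry.counter("parsing.failures")

  // report every registered metric to Graphite every 30 seconds
  def startReporting(host: String, port: Int, prefix: String): GraphiteReporter = {
    val graphite = new Graphite(new InetSocketAddress(host, port))
    val reporter = GraphiteReporter.forRegistry(registry).prefixedWith(prefix).build(graphite)
    reporter.start(30, TimeUnit.SECONDS)
    reporter
  }
}

Counters are then bumped wherever records are read, written or rejected, e.g. inputRecords.inc(batch.size).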
Counting RDD Elements
rdd.count() might be costly (another action)
Spark Accumulators are a good alternative
Trick: send accumulator results to Graphite, using “Counter-backed Accumulators”
// imports assumed: Spark's (pre-2.0) Accumulator API and the Yammer/Dropwizard Metrics Counter
import scala.reflect.ClassTag
import org.apache.spark.Accumulator
import org.apache.spark.rdd.RDD
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Counter, MetricName}

/**
 * Call the returned callback after acting on the returned RDD to get the counter updated
 */
def countSilently[V: ClassTag](rdd: RDD[V], metricName: String, clazz: Class[_]): (RDD[V], Unit => Unit) = {
  val counter: Counter = Metrics.newCounter(new MetricName(clazz, metricName))
  val accumulator: Accumulator[Long] = rdd.sparkContext.accumulator(0, metricName)
  val countedRdd = rdd.map(v => { accumulator += 1; v })            // count elements as they flow through
  val callback: Unit => Unit = u => counter.inc(accumulator.value)  // flush the final count to the metric
  (countedRdd, callback)
}
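A hedged usage sketch of the above (rdd, MyService and outputPath are illustrative); the callback must be called only after an action has materialized the RDD, otherwise the accumulator is still zero:

// Usage sketch (names illustrative): wrap the RDD, run an action, then flush the count
val (countedRdd, updateCounter) = countSilently(rdd, "outputRecords", classOf[MyService])
countedRdd.saveAsTextFile(outputPath)  // any action that actually materializes the RDD
updateCounter(())                      // accumulator is now final - push its value to the Counter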
Counting RDD Elements
We Measure...
Input records
Output records
Parsing failures
Average job time
Data “freshness” histogram
Much much more...
WARNING: it’s addictive...
Monitoring Spark Applications
Conclusions
Spark provides a wide variety of monitoring options
Each one should be used when appropriate - no single one is sufficient on its own
Metrics + Graphite + Grafana can give you visibility into any numeric time series
Questions?
Thank you