Testing Spark: Best Practices
Anupama Shetty
Senior SDET, Analytics, Ooyala Inc
Neil Marshall
SDET, Analytics, Ooyala Inc
Spark Summit 2014
Agenda - Anu
1. Application Overview
● Batch mode
● Streaming mode with Kafka
2. Test Overview
● Test environment setup
● Unit testing Spark applications
● Integration testing Spark applications
3. Best Practices
● Code coverage support with Scoverage and SCCT
● Auto build trigger using a Jenkins hook via GitHub
Agenda - Neil
4. Performance testing of Spark
● Architecture & technology overview
● Performance testing setup & run
● Result analysis
● Best practices
Company Overview
● Founded in 2007
● 300+ employees worldwide
● Global footprint of 200M unique users in 130 countries
● Ooyala works with the most successful broadcast and media
companies in the world
● Reach, measure, monetize video business
● Cross-device video analytics and monetization products and
services
Application Overview
● Analytics ETL pipeline service
● Receives 5B+ player-generated events, such as plays and
displays, on a daily basis.
● Computed metrics include player conversion rate, video
conversion rate and engagement metrics.
● Third-party services used:
○ Spark 1.0 to process player-generated big data
○ Kafka 0.9.0 with ZooKeeper as our message queue
○ CDH5 HDFS as our intermediate storage file system
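The conversion-rate metrics named above can be illustrated with a small sketch. The slides do not give Ooyala's exact formula; plays / displays is the common definition of player conversion rate and is assumed here, and the `PlayerMetrics` name is illustrative:

```scala
// Illustrative sketch: "player conversion rate" is assumed to be
// plays / displays; the deck does not spell out the exact formula.
case class PlayerMetrics(displays: Long, plays: Long) {
  def playerConversionRate: Double =
    if (displays == 0) 0.0 else plays.toDouble / displays
}
```

For example, 25 plays against 100 displays would give a player conversion rate of 0.25.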
Spark-based Log Processor details
● Supports two input data formats
○ JSON
○ Thrift
● Batch Mode Support
○ Uses Spark Context
○ Consumes input data via a text file
● Streaming Mode Support
○ Uses Spark Streaming context
○ Consumes data via a Kafka stream
Test pipeline setup
● Player simulation is done using Watir (a Ruby gem based on
Selenium).
● Kafka (with ZooKeeper) is set up as a local virtual machine using
Vagrant; the VMs can be monitored using VirtualBox.
● The Spark cluster runs in local mode.
Unit test setup - Spark in Batch mode
● Spark cluster setup for testing
○ Build your spark application jar using `sbt assembly`
○ Create a config with spark.jar set to the application jar and spark.master set to "local"
■ var config = ConfigFactory parseString """spark.jar = "target/scala-2.10/SparkLogProcessor.jar", spark.master = "local" """
○ Store the local spark directory path for spark context creation
■ val sparkDir = <path to local spark directory> + "spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar"
● Creating the spark context
○ var sc: SparkContext = new SparkContext("local", getClass.getSimpleName, sparkDir, List(config.getString("spark.jar")))
Test Setup for batch mode using Spark Context
Before block
After block
The ScalaTest framework's "FunSpec" style is
used with "ShouldMatchers" (for
assertions) and "BeforeAndAfter"
(for setup/teardown).
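The before/after blocks on this slide are screenshots in the original deck. A minimal sketch of what such a suite might look like, assuming ScalaTest and Spark 0.9-era APIs on the test classpath (the suite and test names are illustrative, not from the deck):

```scala
import org.apache.spark.SparkContext
import org.scalatest.{BeforeAndAfter, FunSpec}
import org.scalatest.matchers.ShouldMatchers

class LogProcessorBatchSpec extends FunSpec with ShouldMatchers with BeforeAndAfter {
  var sc: SparkContext = _

  before {
    // Fresh local context per test so state cannot leak between tests.
    sc = new SparkContext("local", getClass.getSimpleName)
  }

  after {
    // Always stop the context so the next test can create its own;
    // clearing spark.driver.port lets a new context bind in the same JVM.
    sc.stop()
    System.clearProperty("spark.driver.port")
  }

  describe("log processor in batch mode") {
    it("counts events read into an RDD") {
      val events = sc.parallelize(Seq("play", "display", "play"))
      events.count() should be(3)
    }
  }
}
```

The after block is the part that matters most: without stopping the context, a second suite creating its own SparkContext will fail.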
Kafka setup for spark streaming
● Bring up the Kafka virtual
machine using the Vagrantfile
with the following command:
`vagrant up kafkavm`
● Configure Kafka
○ Create topic
■ `bin/kafka-create-topic.sh --zookeeper "localhost:2181" --topic "thrift_pings"`
○ Consume messages using
■ `bin/kafka-console-consumer.sh --zookeeper "localhost:2181" --topic "thrift_pings" --group
"testThrift" &>/tmp/thrift-consumer-msgs.log &`
Testing streaming mode with
Spark Streaming Context
Test ‘After’ block and assertion block for spark streaming
mode
After Block Test Assertion
Testing best practices - Code Coverage
● Track code coverage with Scoverage and/or SCCT
● Enable fork = true to avoid spark exceptions caused by spark context conflicts.
● SCCT configurations
○ ScctPlugin.instrumentSettings
○ parallelExecution in ScctTest := false
○ fork in ScctTest := true
○ Command to run it - `sbt “scct:test”`
● Scoverage configurations
○ ScoverageSbtPlugin.instrumentSettings
○ ScoverageSbtPlugin.ScoverageKeys.excludedPackages in
ScoverageSbtPlugin.scoverage := ".*benchmark.*;.*util.*"
○ parallelExecution in ScoverageSbtPlugin.scoverageTest := false
○ fork in ScoverageSbtPlugin.scoverageTest := true
○ Command to run it - `sbt “scoverage:test”`
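For reference, the Scoverage settings above might be wired into a project build along these lines. This is a sketch against the 2014-era sbt-scoverage plugin named on the slide; the exact keys depend on the plugin version, and the project name is illustrative:

```scala
// project/Build.scala fragment (illustrative; mirrors the slide's bullets)
import sbt._
import Keys._

object ProjectBuild extends Build {
  lazy val root = Project("spark-log-processor", file("."))
    .settings(ScoverageSbtPlugin.instrumentSettings: _*)
    .settings(
      ScoverageSbtPlugin.ScoverageKeys.excludedPackages in
        ScoverageSbtPlugin.scoverage := ".*benchmark.*;.*util.*",
      // Spark contexts conflict across suites; run coverage tests
      // serially and in a forked JVM.
      parallelExecution in ScoverageSbtPlugin.scoverageTest := false,
      fork in ScoverageSbtPlugin.scoverageTest := true
    )
}
```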
Testing best practices - Jenkins auto test build
trigger
● Requires enabling the 'github-webhook' on the GitHub repo settings
page; admin access to the repo is required.
● The Jenkins job should be configured with the corresponding GitHub
repo via the "GitHub Project" field.
● Test the Jenkins hook by triggering a test run from the GitHub repo.
● The "GitHub pull request builder" plugin can be used when configuring
the Jenkins job to auto-publish test results on GitHub pull requests
after every test run. It also lets you rerun failed tests via a
GitHub pull request.
What is performance testing?
● A practice striving to build performance into the
implementation, design and architecture of a
system.
● Determine how a system performs in terms of
responsiveness and stability under a particular
workload.
● Can serve to investigate, measure, validate or verify
other quality attributes of a system, such as
scalability, reliability and resource usage.
What is Gatling?
● A stress/load testing tool
Why was Gatling selected over other
perf test tools such as JMeter?
● Powerful scripting using Scala
● Akka + Netty
● Run multiple scenarios in one simulation
● Scenarios = code + DSL
● Graphical reports with clear & concise
graphs
How does Gatling work with Spark?
● It accesses Spark through web applications / services
Develop & set up a simple perf test example
The perf test will run against spark-jobserver for
word counts.
What is spark-jobserver?
● Provides a RESTful interface for submitting and
managing Apache Spark jobs, jars and job
contexts
● Scala 2.10 + CDH5/Hadoop 2.2 + Spark 0.9.0
● For more depth on the jobserver, see Evan Chan
& Kelvin Chu's Spark Query Service
presentation.
Steps to set up & run spark-jobserver
● Clone spark-jobserver from GitHub
● Install SBT and type `sbt` in the spark-jobserver repo
● From the SBT shell, simply type `re-start`
$ git clone https://github.com/ooyala/spark-jobserver
$ sbt
> re-start
Steps to package & upload a jar to the
jobserver
● Package the test jar of the word count
example
● Upload the jar to the jobserver
$ sbt job-server-tests/package
$ curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar
localhost:8090/jars/test
Run a request against the jobserver
$ curl -d "input.string = a b c a b see" 'http://localhost:8090/jobs?
appName=test&classPath=spark.jobserver.WordCountExample&sync=true'
{
"status": "OK",
"result": {
"a": 2,
"b": 2,
"c": 1,
"see": 1
}
}
Source code of Word Count Example
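The word-count source on this slide is a screenshot in the original deck. The core logic behind the jobserver's WordCountExample can be sketched in plain Scala; the real job runs this over an RDD, so this simplified, single-machine version is for illustration only:

```scala
object WordCount {
  // Split on whitespace and count occurrences of each word,
  // mirroring what the Spark job does over an RDD.
  def count(input: String): Map[String, Int] =
    input.split("\\s+")
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }
}
```

`WordCount.count("a b c a b see")` produces the same counts as the JSON result returned by the curl request above.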
Script Gatling for the Word Count Example
A scenario defines the steps that Gatling performs
during a run:
Script Gatling for the Word Count Example
The setup combines users and scenarios (as workflows) plus
assertions into a performance test simulation
● Inject 10 users over 10 seconds into the scenarios, in 2
cycles
● Ensure successful requests stay above 80%
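The Gatling script slides are screenshots in the original deck. A sketch of what the simulation described above might look like in the Gatling 2 DSL of the time (the class name and request name are illustrative):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class WordCountSimulation extends Simulation {
  val httpConf = http.baseURL("http://localhost:8090")

  // One scenario: POST the word-count job to the jobserver, twice per user.
  val scn = scenario("WordCount job").repeat(2) {
    exec(
      http("word count request")
        .post("/jobs?appName=test&classPath=spark.jobserver.WordCountExample&sync=true")
        .body(StringBody("input.string = a b c a b see"))
        .check(status.is(200))
    )
  }

  // Inject 10 users over 10 seconds; require > 80% successful requests.
  setUp(scn.inject(rampUsers(10) over (10 seconds)))
    .protocols(httpConf)
    .assertions(global.successfulRequests.percent.greaterThan(80))
}
```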
Test Results in Terminal Window
Gatling Graph - Indicator
Gatling Graph - Active Sessions
Best Practices on Performance Tests
● Run performance tests on Jenkins
● Set up baselines for each performance
test, with different scenarios & users
Any Questions?
References
Contact Info:
Anupama Shetty: anupama@ooyala.com
Neil Marshall: nmarshall@ooyala.com
References:
http://www.slideshare.net/AnuShetty/spark-summit2014-techtalk-testing-spark