Testing Your Apache Spark Apps
... or How to Farm Reputation on Stack Overflow
STL Big Data - Innovation, Data Engineering, Analytics Group
May 5, 2021
About Me
Kit Menke is the newest organizer of the STL Big
Data IDEA meetup and the Practice Director for
Data Engineering at 1904labs.
We’re hiring!
https://1904labs.com/your-careers/
Agenda
● Testing Theory
● Testing is Hard
● Why Test?
● An Example Spark App Testing Setup
● Stack Overflow
Testing Theory
Types of Software Tests
● Unit
○ Verifies the smallest testable parts of an application.
○ Purpose: verifies methods and/or the smallest testable unit.
● Integration
○ Verifies the interactions and connectivity between the modules of the application.
○ Purpose: ensure different components work together.
● System
○ Validates the complete and fully integrated software product.
○ Purpose: evaluate the end-to-end system.
● Regression Tests
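Testing Pyramid
[Figure: pyramid with System Tests at the top, Integration Tests in the middle, and Unit Tests at the base. Isolation axis: isolated (unit tests) to fully integrated (system tests). Speed axis: tests run faster (unit tests) to slower (system tests).]
Most of your tests should be unit tests!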
Performance Tests
Volume Tests
● Use expected production throughput to measure how the estimated volumes of transactions and users affect transaction times.
Rendezvous Tests
● Test the application’s performance while it is subjected to concurrency issues under production load and volume.
Stress Tests
● Subject the application to unrealistically high numbers of users accessing the system at the same time in order to find the system's breaking point.
Soak Tests
● Test for an extended period at predicted business volumes to determine whether system performance degrades during continuous usage.
Data Validation
Basic checks
● How much data are you getting into your data pipeline?
● How much data is coming out of your pipeline?
● Does the schema look right?
Detailed checks
● Are the data types correct?
● Are the values valid?
○ Distinct values?
○ Ranges?
○ Correct distribution of values? Example: an unexpectedly large number of null values
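The checks above could be sketched in Scala against a Spark DataFrame roughly as follows. This is a minimal illustration, not code from the talk: the object name, helper names, and any column names you pass in are hypothetical. Only standard DataFrame calls (`count`, `filter`, `select`, `distinct`, `schema`) are used.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Hypothetical helpers; the names are illustrative only.
object DataValidationSketch {

  // Basic check: how many rows went into a pipeline step and how many came out?
  def rowCounts(input: DataFrame, output: DataFrame): (Long, Long) =
    (input.count(), output.count())

  // Schema / data type check: compares column names and types in order.
  // Nullability and column metadata are deliberately ignored here.
  def schemaMatches(df: DataFrame, expected: Seq[(String, DataType)]): Boolean =
    df.schema.fields.map(f => (f.name, f.dataType)).toSeq == expected

  // Distribution check: fraction of rows where the column is null.
  def nullFraction(df: DataFrame, column: String): Double = {
    val total = df.count()
    if (total == 0) 0.0
    else df.filter(col(column).isNull).count().toDouble / total
  }

  // Range check: number of rows outside [min, max].
  // Null values are not counted, because comparisons with null are not true.
  def outOfRangeCount(df: DataFrame, column: String, min: Int, max: Int): Long =
    df.filter(col(column) < min || col(column) > max).count()

  // Distinct-values check: number of distinct values in the column.
  def distinctCount(df: DataFrame, column: String): Long =
    df.select(col(column)).distinct().count()
}
```

Each helper returns a plain value, so a test or a monitoring job can compare it against whatever threshold the team agrees on.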
Testing is Hard
Things That Go Wrong
● General
○ People often disagree on what each type of test is.
○ Unreasonable metrics, such as code coverage.
○ Focus on manual testing instead of automated CI pipelines
● Unit testing
○ Tests that only check for the absence of errors, not functionality.
○ Testing the wrong thing - a symptom of this is mocking everything
● Integration and system tests
○ Can be brittle - prone to breaking and requiring constant updates
○ Can be difficult to debug - where is the issue?
● Performance tests
○ Tests aren’t repeatable - the size and shape of your data matter!
○ Results should be comparable over time
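To make the "absence of errors" pitfall concrete, here is a small sketch in plain Scala. The function `normalizeState` and both test methods are hypothetical examples, not code from the talk.

```scala
// Hypothetical logic under test and two ways of "testing" it.
object WeakVsStrongTest {

  // Trims whitespace and upper-cases a US state code.
  def normalizeState(raw: String): String = raw.trim.toUpperCase

  // Weak: passes as long as nothing throws, even if the result is wrong.
  def weakTest(): Unit = {
    normalizeState(" mo ")
  }

  // Stronger: pins down the expected output, so a behavior change fails the test.
  def strongTest(): Unit = {
    assert(normalizeState(" mo ") == "MO")
    assert(normalizeState("IL") == "IL")
  }

  def main(args: Array[String]): Unit = {
    weakTest()
    strongTest()
  }
}
```

If `normalizeState` were changed to return its input untouched, `weakTest` would still pass while `strongTest` would fail.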
Discussion: why test?
Why: Error Signal Collapse
[Figure labels: Static Analyses, Unit Tests, Integration Tests, System Tests, Performance Tests, Other Tests, Mutation Testing]
Why test?
1. Prevent bugs from getting into production
2. Allow developers to make changes more confidently/quickly
Advice for Testing Spark Apps
● Align with your team on what tests should look like
● Testing Spark Apps requires your full attention
○ Often many dependencies on other data stores (ex: HDFS, Hive, HBase, databases)
○ Test your logic, not Spark or the dependencies
○ Use pull requests (PRs) to catch missing or bad tests during review
● Start small
○ Bottom (of the pyramid) up - unit tests first
○ Focus on tests that provide the most value
● First priority: run unit tests and build the app in a CI pipeline, automatically on every PR
● Bug in production? Reproduce it in a unit test first.
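One way to read "test your logic, not Spark" together with "reproduce it in a unit test first" is sketched below in plain Scala. Everything here is an assumption made for illustration: the `AmountParser` object, the `toCents` function, and the imagined production bug (thousands separators breaking amount parsing) are not from the talk.

```scala
import scala.util.Try

// Hypothetical business logic, kept free of Spark types so it can be tested
// without a SparkSession. A Spark job could call it from a map or a UDF.
object AmountParser {
  // Parses a dollar amount such as "1,234.50" into cents.
  // Returns None for null, blank, or malformed input.
  def toCents(raw: String): Option[Long] =
    Option(raw)
      .map(_.trim.replace(",", "").replace("$", ""))
      .filter(_.nonEmpty)
      .flatMap(s => Try(BigDecimal(s)).toOption)
      .map(d => (d * 100).setScale(0, BigDecimal.RoundingMode.HALF_UP).toLong)
}

object AmountParserRegressionTest {
  def main(args: Array[String]): Unit = {
    // Imagined production bug: values with thousands separators failed to parse.
    // The failing input is captured as a test before (and after) the fix.
    assert(AmountParser.toCents("1,234.50").contains(123450L))

    // Neighbouring cases worth pinning down at the same time.
    assert(AmountParser.toCents("$7").contains(700L))
    assert(AmountParser.toCents("  ").isEmpty)
    assert(AmountParser.toCents(null).isEmpty)
    assert(AmountParser.toCents("n/a").isEmpty)
  }
}
```

Because the function takes and returns ordinary Scala values, the test runs in milliseconds and needs no cluster, data store, or mock.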
Spark App Testing
Spark App Testing Stack (MVP)
● Project Management
○ Maven (the familiar tool for those of us coming from Java)
○ Alternatives: sbt
● Spark
○ Version 3.0.1; choose the same version as your cluster
● Unit testing
○ ScalaTest, ScalaMock
○ Alternatives: JUnit, TestNG
● CI Pipeline
○ Jenkins
○ Alternatives: GitHub Actions, AWS CodeBuild
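For readers who pick the sbt alternative rather than Maven, the MVP stack might be declared roughly like this. It is an illustrative sketch only: the project name is made up, and every version number except Spark 3.0.1 is an assumption that should be checked against your own cluster and toolchain.

```scala
// build.sbt (illustrative sketch; versions other than Spark 3.0.1 are assumptions)
ThisBuild / scalaVersion := "2.12.10" // Spark 3.0.x is built for Scala 2.12

lazy val root = (project in file("."))
  .settings(
    name := "spark-app-testing-sketch", // hypothetical project name
    libraryDependencies ++= Seq(
      // Provided: the cluster supplies Spark at runtime; still on the test classpath
      "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided,
      // Test-only dependencies for unit testing and mocking
      "org.scalatest" %% "scalatest" % "3.2.9" % Test,
      "org.scalamock" %% "scalamock" % "5.1.0" % Test
    ),
    // Run tests in a separate JVM so a local SparkSession does not share
    // state with the sbt process itself
    Test / fork := true
  )
```

With a file like this, `sbt test` is the single command a CI pipeline needs to run on every pull request.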
Spark App Testing Stack (expanded)
● Integration testing
○ ScalaTest + Testcontainers
○ Alternatives: ScalaMock
● System testing
○ Scripts
○ Alternatives: Java projects
● Performance testing
○ Re-use system test scripts… just with a lot more data
○ Some way to save results (can just be logs!)
● Helpers
○ spark-testing-base - base classes to set up and tear down a local Spark context
○ Testcontainers - use Docker containers inside ScalaTest
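To show what "set up and tear down a local Spark context" involves, here is a hand-rolled sketch using only Spark and ScalaTest. It is not the spark-testing-base API; the trait `LocalSparkSession` and the suite `WordLengthSpec` are hypothetical names, and a helper library would normally replace the trait.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical mixin: starts one local SparkSession per suite and stops it afterwards.
trait LocalSparkSession extends BeforeAndAfterAll { self: Suite =>
  @transient protected var spark: SparkSession = _

  override protected def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")                  // run inside the test JVM with 2 threads
      .appName("local-test")
      .config("spark.ui.enabled", "false") // no web UI needed for tests
      .getOrCreate()
  }

  override protected def afterAll(): Unit = {
    try if (spark != null) spark.stop()
    finally super.afterAll()
  }
}

// Example suite using the mixin.
class WordLengthSpec extends AnyFunSuite with LocalSparkSession {
  test("length of each word is computed") {
    val session = spark // a val gives the stable identifier the import needs
    import session.implicits._
    val result = Seq("spark", "test").toDS().map(_.length).collect().sorted
    assert(result.sameElements(Array(4, 5)))
  }
}
```

Sharing one session across all tests in a suite keeps startup cost down, which matters once a project has more than a handful of Spark tests.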
Demo Example Project
And now for something completely
different...
Stack Overflow
"Stack Overflow is a question and answer site for professional and enthusiast programmers.
It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help,
we're working together to build a library of detailed answers to every question about
programming."
https://stackoverflow.com/tour
● Gain reputation by asking and answering questions
● Stack Overflow spawned a network of other Q&A sites
○ Examples: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade
Stack Overflow
How do you find the answers to your questions? Usually by typing things into Google and
ending up at Stack Overflow...
● How do you ask a good question?
○ Write a title that summarizes the specific problem
○ Introduce the problem before you post any code
○ Help others reproduce the problem
○ Respond to feedback
[Figure labels: Unclear, Asking for an opinion, Too large]
Data Questions
● Spark (data?) questions are HARD to ask
○ What is your input?
○ What code do you have?
○ What is the expected output?
○ What the heck are you trying to do?
Cultivate your Spark Skills (and reputation!)
● Use your local development environment to help answer questions!
● Once you’ve found a question you think you can answer, create a unit test
● Tip #1: Use parallelize to create test data
● Tip #2: Use printSchema and show to check your work
● Iterate quickly by re-running your test!
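The two tips might look like this in a local scratch program. This is a sketch under assumptions: the `Sale` case class, the sample rows, and the aggregation stand in for whatever a given question is about. `parallelize`, `toDF`, `printSchema`, and `show` are standard Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StackOverflowScratch {
  // Hypothetical record type standing in for the asker's data.
  case class Sale(store: String, amount: Double)

  // Builds a tiny input with parallelize and applies the logic being asked about.
  def buildTotals(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Tip #1: parallelize a small in-memory collection to create test data.
    val sales = spark.sparkContext
      .parallelize(Seq(Sale("STL", 10.0), Sale("STL", 5.5), Sale("KC", 7.0)))
      .toDF()

    // The transformation under discussion: total amount per store.
    sales.groupBy("store").sum("amount").withColumnRenamed("sum(amount)", "total")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("so-scratch").getOrCreate()
    val totals = buildTotals(spark)

    // Tip #2: inspect column names and types, then the rows themselves.
    totals.printSchema()
    totals.show()

    spark.stop()
  }
}
```

Keeping the input small and in the code itself also gives you something you can paste straight into an answer.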
Good questions are like unit tests
● Get at core functionality
● Small, self-contained units
● Easy for someone else to understand
● Helps others!
Get out there, write more tests, and give back to your community.
Links
Example Spark project with unit tests https://github.com/kitmenke/spark-hello-world
ScalaTest https://www.scalatest.org/
spark-testing-base https://github.com/holdenk/spark-testing-base
Testcontainers https://www.testcontainers.org/
Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html
Monitoring https://www.ibm.com/garage/method/practices/manage/golden-signals
STL IDEA Meetup Talk Ideas https://docs.google.com/document/d/1x19Kh7OATI1zbCzrvomIf7ffG5OqbBGwt8MFYLahjgE/edit
