Extending Apache Spark APIs Without Going Near Spark Source or a Compiler
Anna Holschuh, Target
#DevSAIS19
What This Talk is About
• Scala programming constructs
• Functional programming paradigms
• Tips for organizing code in production systems
Who am I
• Lead Data Engineer at Target since 2016
• Deep love of all things Target
• Primary career focus has been building backend
systems with a personal passion for Machine Learning
problems
• Started working in Spark in 2015
Agenda
• Motivation
• Scala’s “Enrich My Library” Pattern
• An Example
• Other Uses
Motivation
Let’s go through an example…
• We have a system of Authors, Articles, and
Comments on those Articles
• As the example shows, Spark and Scala lend themselves
well to functional programming paradigms
• What happens when the system grows in
size/complexity and it becomes necessary
to inject more custom code into the mix?
• Can we keep things concise, readable, and
efficient using the same functional style of
code development?
Motivation
Functional Programming Refresher
• Declarative style of writing code (vs.
Imperative)
• Favors composition with functions
• Avoids shared state, mutability, and side
effects.
Motivation
A Validation Framework was born…
• Tasked with building an on-demand
computation system consuming various
data sources
• There were many ways for this data to go
wrong
• Needed a way to fail fast, and predictably, when a
certain bar for quality was not met
Motivation
A Validation Framework was born…
• Desired ability to “sprinkle” .validate() calls
throughout our existing Spark ETL code
This is possible with Scala’s “Enrich My Library” pattern.
“Enrich My Library”
A Scala programming pattern…
• Allows us to augment existing APIs
• Analogous features exist in other languages
• Also known as “Pimp My Library” for
Googling purposes
• Syntactic sugar that uses implicit classes
to guide the compiler
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
“Enrich My Library”
What are implicits?
Scala’s “implicit” keyword lets the compiler make
connections at compile time that would otherwise require
explicitly calling a function or passing in a value. Scala
supports implicit values, parameters, functions, and
classes.
What is an implicit class?
Implicit classes were formally introduced in Scala 2.10,
although the same effect could be achieved in earlier
versions through other constructs. They allow adding
methods to classes one normally wouldn’t be able to
modify.
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
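A minimal sketch of the mechanism (the class and method here are invented for illustration, not taken from the talk):

object StringEnrichment {
  // Wraps a String so that .shout appears to be defined on String itself.
  implicit class RichString(val underlying: String) extends AnyVal {
    def shout: String = underlying.toUpperCase + "!"
  }
}

// Usage (e.g. in a REPL): once the implicit class is in scope, the
// compiler rewrites "hello".shout to new RichString("hello").shout.
import StringEnrichment._
println("hello".shout) // HELLO!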
An Example
Back to our example… How do we go from THIS
(slide shows the existing Spark ETL code)
…to THIS
(slide shows the same code with .validate() calls woven in)
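Roughly, the before and after look like the sketch below (loadArticles, titlesCheck, and the Article type are placeholders fleshed out in the steps that follow):

// Before: a plain functional ETL pipeline
// (assumes spark.implicits._ in scope for the groupByKey encoders)
val articles: Dataset[Article] = loadArticles()
val byAuthor = articles.groupByKey(_.authorId).count()

// After: the same pipeline with a validation sprinkled in
val byAuthorChecked = articles
  .validate(titlesCheck) // fails fast if the quality bar is not met
  .groupByKey(_.authorId)
  .count()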
An Example
Step 1: Build a Validation class to
work with
• Abstract class parameterized with type T
representing the object type that we plan to
validate
• Contains metadata relevant to running a
validation
• Has an abstract .execute() method to be filled
in by concrete subclasses
• Contains a concrete implementation
.performValidation() that calls on the abstract
execute method
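A sketch of what that base class might look like, using the method names from the slide and inventing the rest:

// Abstract base: T is the type of object being validated.
abstract class Validation[T](val name: String) {

  // Abstract: filled in by concrete subclasses; returns true when the data passes.
  def execute(data: T): Boolean

  // Concrete: drives the abstract execute method and fails fast,
  // predictably, when the quality bar is not met.
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation '$name' failed")
}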
An Example
Step 2: Add an implicit class to allow
the decoration of existing types with
new methods
• The class can be named anything
• It must be defined inside another trait, class, or
object (a package object also works)
• It may take exactly one non-implicit constructor
parameter, which defines the class being augmented
• Extra arguments can be passed through an implicit
parameter list
• .validate() delegates back to the validation object
being passed into the method and uses the
object being decorated to carry out the
validation.
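A sketch of such an implicit class (the names here are hypothetical; the talk’s actual code is not in the transcript):

object ValidationSyntax {
  // Decorates any T with a .validate() method.
  implicit class ValidationOps[T](val underlying: T) extends AnyVal {
    // Delegates to the supplied validation, runs it against the decorated
    // object, and returns that object so calls chain inside a pipeline.
    def validate(validation: Validation[T]): T = {
      validation.performValidation(underlying)
      underlying
    }
  }
}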
An Example
Step 3: Define a validation
• Our validation extends a Validation typed with
Dataset[Article]
• It fills in the abstract method .execute() which
defines what the validation is checking for
• This means that any time the compiler finds a
Dataset[Article] type, we can call .validate() on
it with this validation supplied because of our
implicit class
• Roughly 20 lines of concise, isolated code sit nicely
separated from the core ETL job
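A sketch of a concrete validation; the Article fields and the rule being checked are invented for illustration:

import org.apache.spark.sql.Dataset

case class Article(id: Long, authorId: Long, title: String, body: String)

// Fails when any article has a blank title.
class ArticlesHaveTitles extends Validation[Dataset[Article]]("articles-have-titles") {
  override def execute(articles: Dataset[Article]): Boolean =
    articles.filter(a => a.title.trim.isEmpty).count() == 0
}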
An Example
Step 4: Instantiate your validation and pull it into scope
• This is what triggers the compiler to link
Datasets of Articles to the .validate()
method through the defined implicit class
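Continuing the sketch (spark is an assumed SparkSession, and the parquet path is a placeholder):

import ValidationSyntax._
import spark.implicits._

val titlesCheck = new ArticlesHaveTitles

val articles: Dataset[Article] =
  spark.read.parquet("path/to/articles").as[Article]

// Compiles because ValidationOps[Dataset[Article]] is now in scope.
val checked = articles.validate(titlesCheck)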
An Example
Step 5: Don’t forget Unit Tests
• It is straightforward to write a concise, isolated
unit test for each validation
• ScalaTest’s FunSpec is used to achieve BDD-style tests
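A sketch of such a test (FunSpec lived at org.scalatest.FunSpec in ScalaTest versions of that era; in 3.1+ it is org.scalatest.funspec.AnyFunSpec):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSpec

class ArticlesHaveTitlesSpec extends FunSpec {

  // A local SparkSession for building small test Datasets.
  private lazy val spark =
    SparkSession.builder().master("local[1]").appName("test").getOrCreate()
  import spark.implicits._

  describe("ArticlesHaveTitles") {
    it("passes when every article has a title") {
      val ds = Seq(Article(1L, 1L, "A title", "body")).toDS()
      assert(new ArticlesHaveTitles().execute(ds))
    }

    it("fails when a title is blank") {
      val ds = Seq(Article(2L, 1L, "  ", "body")).toDS()
      assert(!new ArticlesHaveTitles().execute(ds))
    }
  }
}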
An Example
Step 6: And we’re done!
• We have been able to develop concise,
isolated, testable code that can fit
seamlessly into existing Spark jobs
• Data is messy, and we have the ability to
address this problem in an elegant way
• “Enrich my library” has allowed us to
extend Spark APIs so we can stay true to
functional programming paradigms
Other Uses
Code organization and readability
• Move long blocks of related ETL code into
implicit class function definitions to help
organize code
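For instance, a long inline block of cleanup logic might become a named extension method (names invented here, reusing the Article type from the example):

object ArticleEtlOps {
  implicit class ArticleEtl(val articles: Dataset[Article]) extends AnyVal {
    // One named, testable step instead of an anonymous block in the job.
    def withNonEmptyBodies: Dataset[Article] =
      articles.filter(a => a.body.trim.nonEmpty)
  }
}

// The main job then reads as a chain of named steps:
//   loadArticles().withNonEmptyBodies.validate(titlesCheck)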
Other Uses
Support other common functionality used
in production systems
✓ Validations
• Metrics Collection
• Logging
• Checkpointing
• Notifications
• …
Disclaimer
These are powerful programming constructs that
can greatly increase productivity and enable the
buildout of concise and elegant framework code.
Overuse can lead to cryptic and esoteric systems
that can cause engineers great pain and suffering.
Find the right balance!
Takeaways
• The “Enrich My Library” programming pattern
enables concise, clean, and readable code
• It enabled us to create a framework that supports
rapid development of new validations with a
relatively small amount of code
• The resulting code is isolated, testable, and easy to
understand
Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems ranging from supply chain
logistics to smart stores to personalization and so on
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities
QUESTIONS
annamaria.holschuh@target.com