Extending Apache Spark APIs Without Going Near Spark Source or a Compiler
Anna Holschuh, Target
#DevSAIS19
What This Talk is About
• Scala programming constructs
• Functional programming paradigms
• Tips for organizing code in production systems
Who am I
• Lead Data Engineer at Target since 2016
• Deep love of all things Target
• Primary career focus has been building backend
systems with a personal passion for Machine Learning
problems
• Started working in Spark in 2015
Agenda
• Motivation
• Scala’s “Enrich My Library” Pattern
• An Example
• Other Uses
Motivation
Let’s go through an example…
• We have a system of Authors, Articles, and
Comments on those Articles
• As the example shows, Spark and Scala lend themselves
well to functional programming paradigms
• What happens when the system grows in
size/complexity and it becomes necessary
to inject more custom code into the mix?
• Can we keep things concise, readable, and
efficient using the same functional style of
code development?
Motivation
Functional Programming Refresher
• Declarative style of writing code (vs.
Imperative)
• Favors composition with functions
• Avoids shared state, mutability, and side
effects.
Motivation
A Validation Framework was born…
• Tasked with building an on-demand
computation system consuming various
data sources
• There were many ways for this data to go
wrong
• Needed a way to fail fast, and predictably, when a
certain bar for quality was not met
Motivation
A Validation Framework was born…
• Desired ability to “sprinkle” .validate() calls
throughout our existing Spark ETL code
This is possible with Scala’s “Enrich My Library” pattern.
“Enrich My Library”
A Scala programming pattern…
• Allows us to augment existing APIs
• Analogous features exist in other languages
• Also known as “Pimp My Library” for
Googling purposes
• Syntactic sugar that uses implicit classes
to guide the compiler
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
“Enrich My Library”
What are implicits?
Scala’s “implicit” keyword lets the compiler make
connections at compile time that would otherwise require
explicitly calling a function or passing in a value. Scala
supports implicit values, parameters, functions, and
classes.
What is an implicit class?
Implicit classes were formally introduced in Scala 2.10,
although the same effect could be achieved in earlier
versions through other constructs. They allow adding
methods to classes one normally wouldn’t be able to
modify.
Reference: https://docs.scala-lang.org/overviews/core/implicit-classes.html
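A minimal sketch of the mechanism (the class and method here are invented for illustration, not taken from the talk):

object StringEnrichment {
  // Wraps a String so that .shout appears to be defined on String itself.
  implicit class RichString(val underlying: String) extends AnyVal {
    def shout: String = underlying.toUpperCase + "!"
  }
}

// Usage (e.g. in a REPL): once the implicit class is in scope, the
// compiler rewrites "hello".shout to new RichString("hello").shout.
import StringEnrichment._
println("hello".shout) // HELLO!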
An Example
Back to our example… How do we go from THIS
(slide shows the existing Spark ETL code)
…to THIS
(slide shows the same code with .validate() calls woven in)
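Roughly, the before and after look like the sketch below (loadArticles, titlesCheck, and the Article type are placeholders fleshed out in the steps that follow):

// Before: a plain functional ETL pipeline
// (assumes spark.implicits._ in scope for the groupByKey encoders)
val articles: Dataset[Article] = loadArticles()
val byAuthor = articles.groupByKey(_.authorId).count()

// After: the same pipeline with a validation sprinkled in
val byAuthorChecked = articles
  .validate(titlesCheck) // fails fast if the quality bar is not met
  .groupByKey(_.authorId)
  .count()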
An Example
Step 1: Build a Validation class to
work with
• Abstract class parameterized with type T
representing the object type that we plan to
validate
• Contains metadata relevant to running a
validation
• Has an abstract .execute() method to be filled
in by concrete subclasses
• Contains a concrete implementation
.performValidation() that calls on the abstract
execute method
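A sketch of what that base class might look like, using the method names from the slide and inventing the rest:

// Abstract base: T is the type of object being validated.
abstract class Validation[T](val name: String) {

  // Abstract: filled in by concrete subclasses; returns true when the data passes.
  def execute(data: T): Boolean

  // Concrete: drives the abstract execute method and fails fast,
  // predictably, when the quality bar is not met.
  def performValidation(data: T): Unit =
    if (!execute(data))
      throw new IllegalStateException(s"Validation '$name' failed")
}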
An Example
Step 2: Add an implicit class to allow
the decoration of existing types with
new methods
• The class can be named anything
• It must be defined inside another trait, class, or
object (a package object also works)
• It may take exactly one non-implicit constructor
parameter, which defines the class being augmented
• Extra arguments can be passed through an implicit
parameter list
• .validate() delegates back to the validation object
being passed into the method and uses the
object being decorated to carry out the
validation.
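A sketch of such an implicit class (the names here are hypothetical; the talk’s actual code is not in the transcript):

object ValidationSyntax {
  // Decorates any T with a .validate() method.
  implicit class ValidationOps[T](val underlying: T) extends AnyVal {
    // Delegates to the supplied validation, runs it against the decorated
    // object, and returns that object so calls chain inside a pipeline.
    def validate(validation: Validation[T]): T = {
      validation.performValidation(underlying)
      underlying
    }
  }
}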
An Example
Step 3: Define a validation
• Our validation extends a Validation typed with
Dataset[Article]
• It fills in the abstract method .execute() which
defines what the validation is checking for
• This means that any time the compiler finds a
Dataset[Article] type, we can call .validate() on
it with this validation supplied because of our
implicit class
• Roughly 20 lines of concise, isolated code sit nicely
separated from the core ETL job
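A sketch of a concrete validation; the Article fields and the rule being checked are invented for illustration:

import org.apache.spark.sql.Dataset

case class Article(id: Long, authorId: Long, title: String, body: String)

// Fails when any article has a blank title.
class ArticlesHaveTitles extends Validation[Dataset[Article]]("articles-have-titles") {
  override def execute(articles: Dataset[Article]): Boolean =
    articles.filter(a => a.title.trim.isEmpty).count() == 0
}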
An Example
Step 4: Instantiate your validation and pull it into scope
• This is what triggers the compiler to link
Datasets of Articles to the .validate()
method through the defined implicit class
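Continuing the sketch (spark is an assumed SparkSession, and the parquet path is a placeholder):

import ValidationSyntax._
import spark.implicits._

val titlesCheck = new ArticlesHaveTitles

val articles: Dataset[Article] =
  spark.read.parquet("path/to/articles").as[Article]

// Compiles because ValidationOps[Dataset[Article]] is now in scope.
val checked = articles.validate(titlesCheck)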
An Example
Step 5: Don’t forget Unit Tests
• It is straightforward to write a concise, isolated
unit test for each validation
• ScalaTest’s FunSpec is used to achieve BDD-style tests
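A sketch of such a test (FunSpec lived at org.scalatest.FunSpec in ScalaTest versions of that era; in 3.1+ it is org.scalatest.funspec.AnyFunSpec):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSpec

class ArticlesHaveTitlesSpec extends FunSpec {

  // A local SparkSession for building small test Datasets.
  private lazy val spark =
    SparkSession.builder().master("local[1]").appName("test").getOrCreate()
  import spark.implicits._

  describe("ArticlesHaveTitles") {
    it("passes when every article has a title") {
      val ds = Seq(Article(1L, 1L, "A title", "body")).toDS()
      assert(new ArticlesHaveTitles().execute(ds))
    }

    it("fails when a title is blank") {
      val ds = Seq(Article(2L, 1L, "  ", "body")).toDS()
      assert(!new ArticlesHaveTitles().execute(ds))
    }
  }
}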
An Example
Step 6: And we’re done!
• We have been able to develop concise,
isolated, testable code that can fit
seamlessly into existing Spark jobs
• Data is messy, and we have the ability to
address this problem in an elegant way
• “Enrich my library” has allowed us to
extend Spark APIs so we can stay true to
functional programming paradigms
Other Uses
Code organization and readability
• Move long blocks of related ETL code into
implicit class function definitions to help
organize code
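For instance, a long inline block of cleanup logic might become a named extension method (names invented here, reusing the Article type from the example):

object ArticleEtlOps {
  implicit class ArticleEtl(val articles: Dataset[Article]) extends AnyVal {
    // One named, testable step instead of an anonymous block in the job.
    def withNonEmptyBodies: Dataset[Article] =
      articles.filter(a => a.body.trim.nonEmpty)
  }
}

// The main job then reads as a chain of named steps:
//   loadArticles().withNonEmptyBodies.validate(titlesCheck)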
Other Uses
Support other common functionality used
in production systems
✓ Validations
• Metrics Collection
• Logging
• Checkpointing
• Notifications
• …
Disclaimer
These are powerful programming constructs that
can greatly increase productivity and enable the
buildout of concise and elegant framework code.
Overuse can lead to cryptic and esoteric systems
that can cause engineers great pain and suffering.
Find the right balance!
Takeaways
• The “Enrich My Library” programming pattern
enables concise, clean, and readable code
• It enabled us to create a framework that supports
rapid development of new validations with a
relatively small amount of code
• The resulting code is isolated, testable, and easy to
understand
Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems ranging from supply chain
logistics to smart stores to personalization and so on
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities
QUESTIONS
annamaria.holschuh@target.com