© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Spark on Scala – Reference Architecture
Adrian Tanase – Adobe Romania, Analytics
Agenda
§ Building data processing apps with Scala and Spark
§ Our reference architecture
  § Goals
  § Abstractions
  § Techniques
§ Tips and tricks
What is Spark?
§ General engine for large-scale data processing with APIs in Java, Scala and Python
§ Batch, Streaming, Interactive
§ Multiple Spark apps running concurrently in the same cluster
Our Requirements for Spark Apps
§ Build many data processing applications, mostly ETL and analytics
§ Batch and streaming ingestion and processing
§ Stateless and stateful aggregations
§ Consume data from Kafka, persist to HBase, HDFS and Kafka
§ Interact (in real time) with external services (S3, REST over HTTP)
§ Deployed on Mesos/Docker across AWS and Azure
Real Life With Spark
§ Generic data processing (analytics, SQL, graph, ML)
§ BUT not generic distributed computing
§ Lacks API support for things like
  § Lifecycle events around starting / creating executors
    § e.g. instantiate a DB connection pool on a remote executor
  § Sending shared data to all executors and refreshing it at certain intervals
    § e.g. shared config that updates dynamically and stays in sync across all nodes
  § Async processing of events
    § e.g. non-blocking HTTP calls on the hot path
  § Control flow in case of bad things happening on remote nodes
    § e.g. pause processing or controlled shutdown if one node can’t reach an external service
Our Reference Architecture
§ Basic template for building Spark/Scala apps in our team
§ Take advantage of Spark’s strong points, work around its limitations
§ Decouple Spark APIs and business logic
§ Leverage strong points in Scala (blend FP and OOP)
§ Design goals – all apps should be:
  § Scalable (horizontally)
  § Reliable (at-least-once processing, no data loss)
  § Maintainable (easy to understand, change, remove code)
  § Testable (easy to write unit and integration tests)
  § Easy to configure (at deploy time)
  § Portable (to other processing frameworks like Akka or Kafka Streams)
The Sample App
§ Ingest – first component in the stack
§ Use case – basic ETL
  § load from persistent queue (Kafka)
  § unpack and validate protobuf elements
  § reach out to external config service
    § e.g. is customer active?
  § add minimal metadata (lookups to customer DB)
  § persist to data store (HBase)
  § emit for downstream processing (Kafka)
Abstractions Used
[Diagram: components of the reference architecture]
§ Main entry point
§ Application
§ Services, e.g. Validation (internal), Configuration (http)
§ Stateful Resources: Repository, Message Producer
§ Spark APIs
§ Config
§ Domain model
Main Entrypoint
§ Load / parse configuration
§ Instantiate SparkContext, DB connections, etc
§ Start data processing (the application) by providing concrete instances for all dependencies
object IngestMain {
  def main(args: Array[String]): Unit = {
    // Load and parse the deploy-time configuration
    val config = IngestConfig.loadConfig
    val streamContext = new StreamingContext(...)
    // Wire concrete services and repositories into the application
    val ingestApp = getIngestApp(config)
    val ingressStream = KafkaConnectionUtils.getDStream(...)
    // Define the processing graph, then start the streaming job
    ingestApp.process(ingressStream)
    streamContext.start()
    streamContext.awaitTermination()
  }
}
The Application
§ Assembles services / repos into actual data processing app
§ Facilitates integration testing by not relying on actual Kafka queues, HBase connections, etc
§ Only place in the code that "speaks" Spark (DStream, RDD, transform APIs, etc)
§ Change this file to port app to another streaming framework
trait IngestApp {
  // Dependencies are abstract members; the entry point (or a test) supplies them
  def ingestService: IngestService
  def eventRepo: ExecutorSingleton[EventRepository]

  def process(dstream: DStream[Array[Byte]]): Unit = {
    val rawEvents = dstream.mapPartitions { partition =>
      partition.flatMap(ingestService.toRawEvents(...))
    }
    processEvents(rawEvents)
  }
}
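A minimal sketch of how this shape helps integration testing, assuming a plain Seq stands in for the DStream so that it runs without Spark. All names here (MiniIngestApp, MiniIngestService, InMemoryEventRepo) are hypothetical simplifications, not the original code.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the slide's types.
final case class RawEvent(payload: String)

trait EventRepository {
  def save(ev: RawEvent): Unit
}

trait MiniIngestService {
  // Pure function: bytes in, zero or more events out (one per non-empty line).
  def toRawEvents(bytes: Array[Byte]): Seq[RawEvent] =
    new String(bytes, "UTF-8").split('\n').toSeq.filter(_.nonEmpty).map(RawEvent(_))
}

// The application declares its dependencies as abstract members.
trait MiniIngestApp {
  def ingestService: MiniIngestService
  def eventRepo: EventRepository

  // Seq[Array[Byte]] plays the role of DStream[Array[Byte]] here.
  def process(input: Seq[Array[Byte]]): Unit =
    input.flatMap(ingestService.toRawEvents).foreach(eventRepo.save)
}

// In-memory repository, standing in for HBase in a test.
final class InMemoryEventRepo extends EventRepository {
  val saved: mutable.Buffer[RawEvent] = mutable.ArrayBuffer.empty[RawEvent]
  def save(ev: RawEvent): Unit = { saved += ev; () }
}

object MiniIngestAppDemo {
  def run(): Seq[RawEvent] = {
    val repo = new InMemoryEventRepo
    // Wiring happens at the edge: a test supplies stubs, main supplies real resources.
    val app = new MiniIngestApp {
      val ingestService: MiniIngestService = new MiniIngestService {}
      val eventRepo: EventRepository = repo
    }
    app.process(Seq("a\nb".getBytes("UTF-8"), "c".getBytes("UTF-8")))
    repo.saved.toSeq
  }
}
```

Because the app only sees the abstract members, swapping the in-memory repository for a real one does not touch the processing code.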
The Application (2)
§ Deals with Spark complexities so that the business services don’t have to
§ Caching, progress checkpointing, controlling side effects
§ Shipping code and stateful objects (e.g. DB connection) to executors
def processEvents(events: DStream[RawEvent]): Unit = {
  val validEvents = events.transform { rdd =>
    // update and broadcast global config
    rdd.flatMap { event =>
      ingestService.toValidEvent(event, ...)
    }
  }
  validEvents.cache()
  // foreachRDDOrStop is not a stock DStream method; presumably a custom extension
  validEvents.foreachRDDOrStop { rdd =>
    rdd.foreachPartition { partition =>
      // Resolve the repository on the executor, once per partition
      val repo = eventRepo.get
      partition.foreach { ev => ingestService.saveEvent(ev, repo) }
    }
  }
}
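The "update and broadcast global config" step is elided on the slide. In Spark Streaming the function passed to transform is evaluated on the driver for every batch, which makes it a natural place for a time-based refresh. Below is a minimal sketch of the refresh logic alone, with no Spark dependency. RefreshingValue, its parameters and the injected clock are hypothetical; a real app would re-broadcast the reloaded value rather than just return it.

```scala
// Hypothetical helper: caches a value and reloads it once it is older than ttlMillis.
// `load` and `now` are injected so the behaviour can be checked deterministically.
final class RefreshingValue[T](load: () => T, ttlMillis: Long, now: () => Long) {
  private var cached: Option[(Long, T)] = None

  def get: T = synchronized {
    val current = now()
    cached match {
      case Some((loadedAt, value)) if current - loadedAt < ttlMillis =>
        value // still fresh, reuse the cached copy
      case _ =>
        val fresh = load()
        cached = Some((current, fresh))
        fresh
    }
  }
}

object RefreshingValueDemo {
  def run(): List[Int] = {
    var clock = 0L
    var loads = 0
    // The "config" is just a counter of how many times it was loaded.
    val config = new RefreshingValue[Int](() => { loads += 1; loads }, ttlMillis = 10L, now = () => clock)
    val first = config.get  // first access triggers a load
    clock = 5L
    val second = config.get // within the TTL, served from cache
    clock = 10L
    val third = config.get  // TTL elapsed, reloaded
    List(first, second, third)
  }
}
```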
Services
§ Represent the majority of business logic
§ Stateless and generally implemented as Scala traits
§ Collection of pure functions grouped logically
§ Process immutable data structures, side effects are contained
§ All resources provided at invoke time, avoiding DI altogether
§ Avoids serialization issues of stateful resources (e.g. DB connection), concerns which are pushed to the outer application layers
§ Actual materialization of trait can be deferred
§ E.g. object, service class, mix-in another class
§ Allows for a very modular architecture
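As a rough illustration of "all resources provided at invoke time", the sketch below defines a stateless service trait whose function receives the repository as an argument and returns a new immutable value. The names (CustomerConfigRepo, ValidationService and the two event types) are invented for the example.

```scala
// Hypothetical domain types for illustration.
final case class RawEvent(customerId: Int, payload: String)
final case class ValidEvent(customerId: Int, payload: String)

// Read-only repository; the service never owns an instance of it.
trait CustomerConfigRepo {
  def isActive(customerId: Int): Boolean
}

// Stateless service: a group of functions with no fields.
trait ValidationService {
  // The repository arrives as a parameter, so nothing stateful is
  // captured (or serialized) together with the service itself.
  def toValidEvent(ev: RawEvent, repo: CustomerConfigRepo): Option[ValidEvent] =
    if (ev.payload.nonEmpty && repo.isActive(ev.customerId))
      Some(ValidEvent(ev.customerId, ev.payload.trim))
    else None
}

// Materialization is deferred: here an object, but the trait could be mixed into a class.
object ValidationService extends ValidationService

object ValidationServiceDemo {
  // Test double: only customer 1 is active.
  private val repo = new CustomerConfigRepo {
    def isActive(customerId: Int): Boolean = customerId == 1
  }

  def run(): List[Option[ValidEvent]] = List(
    ValidationService.toValidEvent(RawEvent(1, " click "), repo),
    ValidationService.toValidEvent(RawEvent(2, "click"), repo)
  )
}
```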
Example – Ingest Service
§ Deserialization, validation
§ Check configs (calls config service)
§ Annotate with customer metadata (loads partner DB)
§ Persist to HBase via Repository
trait IngestService {
  def toRawEvents(bytes: Array[Byte]): Seq[RawEvent]
  def toValidEvent(ev: RawEvent, configRepo: ConfigRepository): Option[ValidEvent]
  def saveEvent(ev: ValidEvent, repo: EventRepository): Unit Or Throwable
}
Repositories and Other Stateful Objects
§ Repo - simple abstraction for modeling KV data stores, config DBs, etc
§ Read-write or read-only
§ Simple interface makes it easy to mock in testing (e.g. HashMaps)
§ or swap out implementation (HBase, Cassandra, etc)
§ Handled differently from simple services because
§ Generally relies on stateful objects (e.g. DB connection pool)
§ Needs extra set-up and tear-down lifecycle
§ Each executor needs its own repo; how do you create it there?
  https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
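The slides use ExecutorSingleton without defining it. One possible shape, sketched here as an assumption rather than the team's actual implementation, is a serializable wrapper that ships only a factory function and builds the resource lazily in whichever JVM first calls get. Repository, InMemoryRepository and this version of ExecutorSingleton are illustrative names.

```scala
import scala.collection.mutable

// Minimal key-value repository abstraction (hypothetical interface).
trait Repository[K, V] {
  def get(key: K): Option[V]
  def put(key: K, value: V): Unit
}

// HashMap-backed implementation, handy as a test double.
final class InMemoryRepository[K, V] extends Repository[K, V] {
  private val store = mutable.HashMap.empty[K, V]
  def get(key: K): Option[V] = store.get(key)
  def put(key: K, value: V): Unit = store.update(key, value)
}

// Serializable holder: only the factory travels with the closure.
// The instance itself is transient and rebuilt lazily after deserialization.
final class ExecutorSingleton[T](factory: () => T) extends Serializable {
  @transient lazy val get: T = factory()
}

object ExecutorSingletonDemo {
  var created = 0 // counts how many times the factory ran

  def run(): (Int, Option[String], Boolean) = {
    val holder = new ExecutorSingleton[Repository[Int, String]](() => {
      created += 1
      new InMemoryRepository[Int, String]
    })
    val before = created // nothing has been built yet
    holder.get.put(1, "event")
    val sameInstance = holder.get eq holder.get
    (before, holder.get.get(1), sameInstance)
  }
}
```

Note that each deserialized copy of such a holder would build its own instance, so a strict one-per-JVM guarantee would need an extra object-level registry, which this sketch leaves out.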
The Domain Model
§ Immutable entities via case classes
§ Serializable, equals and hash code, pattern matching out of the box
§ Controlled creation via smart constructors (factory + validation)
§ Enforce invariants during creation and transformation
§ No more defensive checks everywhere
§ Domain objects are guaranteed to be valid
§ Leverages the type system and compiler
  http://www.cakesolutions.net/teamblogs/enforcing-invariants-in-scala-datatypes
Example – DataSource
§ Validations done during creation & transformation phases
§ Immutable object; can’t change after that!
sealed trait DataSource {
  def id: Int
}

case object GlobalDataSource extends DataSource {
  val id = 0
}

// sealed + abstract: cannot be instantiated or subclassed outside this file,
// and no apply/copy is generated, so creation goes through DataSource.apply
sealed abstract case class ExternalDataSource(id: Int) extends DataSource

object DataSource {
  def apply(id: Int): Option[DataSource] = id match {
    case invalid if invalid < 0 => None
    case GlobalDataSource.id => Some(GlobalDataSource)
    // An abstract class needs an anonymous subclass to be instantiated
    case anyDsId => Some(new ExternalDataSource(anyDsId) {})
  }
}
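A short usage sketch for the smart constructor. It restates the slide's definitions so that it compiles on its own, and it instantiates the abstract case class through an anonymous subclass inside the factory. DataSourceDemo and describe are hypothetical helpers added for the example.

```scala
sealed trait DataSource { def id: Int }
case object GlobalDataSource extends DataSource { val id = 0 }
sealed abstract case class ExternalDataSource(id: Int) extends DataSource

object DataSource {
  // Smart constructor: the only way to obtain a DataSource from a raw Int.
  def apply(id: Int): Option[DataSource] = id match {
    case invalid if invalid < 0 => None
    case GlobalDataSource.id    => Some(GlobalDataSource)
    case anyDsId                => Some(new ExternalDataSource(anyDsId) {})
  }
}

object DataSourceDemo {
  // Sealed hierarchy: the compiler can check this match for exhaustiveness.
  def describe(ds: DataSource): String = ds match {
    case GlobalDataSource       => "global"
    case ExternalDataSource(id) => s"external #$id"
  }

  // Invalid input never becomes a DataSource, so describe needs no defensive checks.
  def run(): List[String] =
    List(-1, 0, 42).map(raw => DataSource(raw).map(describe).getOrElse("invalid"))
}
```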
Other Tips and Tricks
§ Typesafe config + ficus for powerful, zero-boilerplate app config
  https://github.com/iheartradio/ficus
§ Option / Try / Either for error handling
  http://longcao.org/2015/07/09/functional-error-accumulation-in-scala
§ Unit/integration testing for Spark apps
  https://github.com/holdenk/spark-testing-base
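A small sketch of the Option / Try / Either style using only the standard library. ErrorHandlingDemo, parsePort and the error messages are hypothetical, chosen just to show the chaining.

```scala
import scala.util.Try

object ErrorHandlingDemo {
  // Try captures the exception thrown by toInt; toOption drops it, and
  // toRight attaches a descriptive error for the failure branch.
  def parseInt(raw: String): Either[String, Int] =
    Try(raw.trim.toInt).toOption.toRight(s"not a number: '$raw'")

  def validatePort(port: Int): Either[String, Int] =
    if (port > 0 && port <= 65535) Right(port) else Left(s"port out of range: $port")

  // Either is right-biased (Scala 2.12+), so the steps chain in a
  // for-comprehension and the first Left short-circuits the rest.
  def parsePort(raw: String): Either[String, Int] =
    for {
      n    <- parseInt(raw)
      port <- validatePort(n)
    } yield port
}
```

Callers pattern match on the result instead of catching exceptions, which keeps the failure cases visible in the function signatures.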
Conclusion – Reaching Our Design Goals
§ Scalable
§ Maintainable
§ Testable
§ Easily Configurable
§ Portable
§ Only the app “speaks” Spark
§ Business logic and domain model can be swapped out easily
§ Config is a statically typed class hierarchy
§ Free to parse via typesafe-config / ficus
§ Clear concerns at app level
§ Modular code
§ Pure functions
§ Immutable data structures
§ Pure functions are easy to unit test
§ The App interface makes integration tests easy
Use FP in the small, OOP in the large!
Let’s Keep in Touch!
§ Adrian Tanase
atanase@adobe.com
§ We’re hiring!
http://bit.ly/adro-careers