Spark and Scala reference architecture

© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Spark on Scala – Reference Architecture
Adrian Tanase – Adobe Romania, Analytics

Agenda
§ Building data processing apps with Scala and Spark
§ Our reference architecture
§ Goals
§ Abstractions
§ Techniques
§ Tips and tricks

What is Spark?
§ General engine for large-scale data processing with APIs in Java, Scala and Python
§ Batch, Streaming, Interactive
§ Multiple Spark apps running concurrently in the same cluster

Our Requirements for Spark Apps
§ Build many data processing applications, mostly ETL and analytics
§ Batch and streaming ingestion and processing
§ Stateless and stateful aggregations
§ Consume data from Kafka, persist to HBase, HDFS and Kafka
§ Interact (in real time) with external services (S3, REST via HTTP)
§ Deployed on Mesos/Docker across AWS and Azure

Real Life With Spark
§ Generic data-processing (analytics, SQL, Graph, ML)
§ BUT not generic distributed computing
§ Lacks API support for things like
  § Lifecycle events around starting / creating executors
    § e.g. instantiate a DB connection pool on a remote executor
  § Sending shared data to all executors and refreshing it at certain intervals
    § e.g. shared config that updates dynamically and stays in sync across all nodes
  § Async processing of events
    § e.g. non-blocking HTTP calls on the hot path
  § Control flow in case of bad things happening on remote nodes
    § e.g. pause processing or do a controlled shutdown if one node can’t reach an external service
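
One way to approximate the "shared data refreshed at intervals" item without dedicated Spark support is a small time-based cache that each executor JVM holds locally and reloads on demand. The sketch below is a minimal illustration in plain Scala, not the deck's implementation: `RefreshingValue`, its `load` function and the injectable `clock` are hypothetical names, and the approach only bounds staleness per node rather than keeping nodes strictly in sync.

```scala
// Hypothetical helper: caches a value and reloads it once it is older than ttlMillis.
final class RefreshingValue[T](load: () => T, ttlMillis: Long,
                               clock: () => Long = () => System.currentTimeMillis()) {
  private var value: Option[T] = None
  private var loadedAt: Long = 0L

  // synchronized so concurrent tasks in one JVM trigger a single reload
  def get: T = synchronized {
    val now = clock()
    value match {
      case Some(v) if now - loadedAt < ttlMillis => v
      case _ =>
        val fresh = load()
        value = Some(fresh)
        loadedAt = now
        fresh
    }
  }
}

object RefreshingValueDemo {
  def main(args: Array[String]): Unit = {
    var now = 0L   // fake clock so the demo is deterministic
    var loads = 0  // counts how often the loader ran
    val config = new RefreshingValue[Int](() => { loads += 1; loads }, ttlMillis = 1000L, clock = () => now)
    println(config.get) // first access triggers a load
    now = 500L
    println(config.get) // still within the TTL, cached value is reused
    now = 1500L
    println(config.get) // TTL expired, value is reloaded
  }
}
```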

Our Reference Architecture
§ Basic template for building Spark/Scala apps in our team
§ Take advantage of Spark strong points, work around limitations
§ Decouple Spark APIs and business logic
§ Leverage strong points in Scala (blend FP and OOP)
§ Design goals – all apps should be:
  § Scalable (horizontally)
  § Reliable (at-least-once processing, no data loss)
  § Maintainable (easy to understand, change, remove code)
  § Testable (easy to write unit and integration tests)
  § Easy to configure (at deploy time)
  § Portable (to other processing frameworks like Akka or Kafka Streams)

The Sample App
§ Ingest – first component in the stack
§ Use case – basic ETL
  § load from persistent queue (Kafka)
  § unpack and validate protobuf elements
  § reach out to external config service
    § e.g. is customer active?
  § add minimal metadata (lookups to customer DB)
  § persist to data store (HBase)
  § emit for downstream processing (Kafka)

Abstractions Used
§ Main entry point
§ Application
§ Services, e.g. Validation (internal), Configuration (http)
§ Stateful Resources: Repository, Message Producer
§ Spark APIs
§ Config
§ Domain model

Main Entrypoint
§ Load / parse configuration
§ Instantiate SparkContext, DB connections, etc.
§ Start data processing (the application) by providing concrete instances for all dependencies
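object IngestMain {
  def main(args: Array[String]): Unit = {
    // Load and parse the deploy-time configuration
    val config = IngestConfig.loadConfig
    val streamContext = new StreamingContext(...)
    // Wire concrete dependencies into the application
    val ingestApp = getIngestApp(config)
    val ingressStream = KafkaConnectionUtils.getDStream(...)
    // Define the processing graph, then start the streaming job
    ingestApp.process(ingressStream)
    streamContext.start()
    streamContext.awaitTermination()
  }
}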

The Application
§ Assembles services / repos into the actual data processing app
§ Facilitates integration testing by not relying on actual Kafka queues, HBase connections, etc.
§ Only place in the code that "speaks" Spark (DStream, RDD, transform APIs, etc.)
§ Change this file to port the app to another streaming framework
trait IngestApp {
  // Dependencies are abstract; Main (or a test) supplies concrete instances
  def ingestService: IngestService
  def eventRepo: ExecutorSingleton[EventRepository]

  def process(dstream: DStream[Array[Byte]]): Unit = {
    val rawEvents = dstream.mapPartitions { partition =>
      partition.flatMap(ingestService.toRawEvents(...))
    }
    processEvents(rawEvents)
  }
}
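
Because the application only declares its dependencies as abstract members, a test can supply in-memory stand-ins. The sketch below illustrates that wiring idea without Spark: it swaps `DStream` for a plain `Iterator` so it runs on its own, and `MiniIngestApp`, `decode` and `store` are hypothetical names rather than the deck's API.

```scala
import scala.collection.mutable.ArrayBuffer

// Spark-free stand-in for the application trait: dependencies are abstract members.
trait MiniIngestApp {
  def decode(bytes: Array[Byte]): Seq[String] // plays the role of ingestService.toRawEvents
  def store(event: String): Unit              // plays the role of the event repository

  // Same shape as process(): flatten every input record into events, then persist them.
  def process(input: Iterator[Array[Byte]]): Unit =
    input.flatMap(decode).foreach(store)
}

object MiniIngestAppDemo {
  def main(args: Array[String]): Unit = {
    val saved = ArrayBuffer.empty[String]
    // Test wiring: no Kafka, no HBase, just in-memory implementations.
    val app = new MiniIngestApp {
      def decode(bytes: Array[Byte]): Seq[String] =
        new String(bytes, "UTF-8").split(',').toSeq.filter(_.nonEmpty)
      def store(event: String): Unit = saved += event
    }
    app.process(Iterator("a,b".getBytes("UTF-8"), "c".getBytes("UTF-8")))
    println(saved.mkString(" "))
  }
}
```

In a real integration test the same idea would apply to the Spark-facing trait, with a test stream in place of the Kafka source.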

The Application (2)
§ Deals with Spark complexities so that the business services don’t have to
  § Caching, progress checkpointing, controlling side effects
  § Shipping code and stateful objects (e.g. DB connection) to executors
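def processEvents(events: DStream[RawEvent]): Unit = {
  val validEvents = events.transform { rdd =>
    // update and broadcast global config
    rdd.flatMap { event =>
      ingestService.toValidEvent(event, ...)
    }
  }
  // cache so the validated stream is not recomputed when reused
  validEvents.cache()
  // foreachRDDOrStop is not a built-in DStream method; presumably a project-specific variant of foreachRDD
  validEvents.foreachRDDOrStop { rdd =>
    rdd.foreachPartition { partition =>
      // obtain the executor-local repository once per partition
      val repo = eventRepo.get
      partition.foreach { ev => ingestService.saveEvent(ev, repo) }
    }
  }
}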
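
`ExecutorSingleton` is not a Spark class, and its definition does not appear in the deck. One common way to build such a wrapper is to serialize only a factory function and create the resource lazily in each JVM. The following is a minimal sketch under that assumption: `FakeConnection` is a hypothetical stand-in for a non-serializable resource, and a Java serialization round trip simulates shipping the wrapper to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Assumed shape of the wrapper: only the factory travels over the wire.
class ExecutorSingleton[T](factory: () => T) extends Serializable {
  // @transient keeps the instance out of the serialized form;
  // lazy defers creation until the first get in each JVM.
  @transient private lazy val instance: T = factory()
  def get: T = instance
}

// Hypothetical non-serializable resource, e.g. a DB connection.
class FakeConnection { val id: String = java.util.UUID.randomUUID().toString }

object ExecutorSingletonDemo {
  // Serialize and deserialize, roughly what happens when a closure is shipped.
  def roundTrip[A <: java.io.Serializable](a: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val onDriver = new ExecutorSingleton(() => new FakeConnection)
    val onExecutor = roundTrip(onDriver) // FakeConnection itself is never serialized
    val first = onExecutor.get
    println(first eq onExecutor.get)     // repeated access returns the same instance
    println(first eq onDriver.get)       // the driver-side copy builds its own instance
  }
}
```

The factory closure must itself be serializable, so it should not capture the resource or anything else that cannot be shipped.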

Services
§ Represent the majority of business logic
§ Stateless and generally implemented as Scala traits
  § Collection of pure functions grouped logically
  § Process immutable data structures; side effects are contained
§ All resources provided at invoke time, avoiding DI altogether
  § Avoids serialization issues of stateful resources (e.g. DB connection), concerns which are pushed to the outer application layers
§ Actual materialization of the trait can be deferred
  § e.g. object, service class, mix-in into another class
§ Allows for a very modular architecture
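
A minimal sketch of this style, using hypothetical names (`ValidationService`, `CustomerLookup`, `Event`, `IngestPipeline`) that are not from the deck: the trait holds only functions without state of their own, the stateful dependency is passed in at call time, and the trait is materialized either as a standalone object or by mixing it into another class.

```scala
// Hypothetical read-only dependency, supplied by the caller on every invocation.
trait CustomerLookup { def isActive(customerId: Int): Boolean }

final case class Event(customerId: Int, payload: String)

// Stateless service: a logical group of functions, no fields, nothing to serialize.
trait ValidationService {
  def nonEmpty(ev: Event): Option[Event] =
    if (ev.payload.trim.nonEmpty) Some(ev) else None

  // The resource arrives as an argument instead of being injected into the service.
  def activeOnly(ev: Event, customers: CustomerLookup): Option[Event] =
    if (customers.isActive(ev.customerId)) Some(ev) else None

  def validate(ev: Event, customers: CustomerLookup): Option[Event] =
    nonEmpty(ev).flatMap(activeOnly(_, customers))
}

// Materialization 1: a standalone object.
object ValidationService extends ValidationService

// Materialization 2: mixed into another class.
class IngestPipeline(customers: CustomerLookup) extends ValidationService {
  def run(events: Seq[Event]): Seq[Event] = events.flatMap(validate(_, customers))
}

object ServicesDemo {
  def main(args: Array[String]): Unit = {
    val lookup = new CustomerLookup { def isActive(customerId: Int): Boolean = customerId % 2 == 0 }
    println(ValidationService.validate(Event(2, "click"), lookup))
    println(new IngestPipeline(lookup).run(Seq(Event(1, "a"), Event(2, "b"), Event(4, " "))))
  }
}
```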

Example – Ingest Service
§ Deserialization, validation
§ Check configs (calls config service)
§ Annotate with customer metadata (loads partner DB)
§ Persist to HBase via Repository
trait IngestService {
  // Deserialize a raw payload into zero or more events
  def toRawEvents(bytes: Array[Byte]): Seq[RawEvent]

  // Validate an event using the config repository; None means it is dropped
  def toValidEvent(ev: RawEvent, configRepo: ConfigRepository): Option[ValidEvent]

  // Persist via the repository; failures come back as values
  // (`Or` is presumably org.scalactic.Or, which is not part of the standard library)
  def saveEvent(ev: ValidEvent, repo: EventRepository): Unit Or Throwable
}

Repositories and Other Stateful Objects
§ Repo: simple abstraction for modeling KV data stores, config DBs, etc.
  § Read-write or read-only
  § Simple interface makes it easy to mock in testing (e.g. HashMaps) or swap out the implementation (HBase, Cassandra, etc.)
§ Handled differently from simple services because
  § Generally relies on stateful objects (e.g. DB connection pool)
  § Needs extra set-up and tear-down lifecycle
  § Each executor needs its own repo; how do you create it there?
    https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
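
The "easy to mock" point can be illustrated with a small key-value interface and a map-backed implementation for tests. This is a sketch with hypothetical names (`ReadOnlyRepository`, `Repository`, `InMemoryRepository`); the deck does not show its actual repository interface, and a real HBase or Cassandra implementation would add connection set-up and tear-down behind the same trait.

```scala
import scala.collection.mutable

// Hypothetical minimal key-value abstraction.
trait ReadOnlyRepository[K, V] { def get(key: K): Option[V] }

trait Repository[K, V] extends ReadOnlyRepository[K, V] {
  def put(key: K, value: V): Unit
  def close(): Unit // tear-down hook for implementations that hold connections
}

// Test double backed by a HashMap; nothing to open or close.
class InMemoryRepository[K, V] extends Repository[K, V] {
  private val data = mutable.HashMap.empty[K, V]
  def get(key: K): Option[V] = data.get(key)
  def put(key: K, value: V): Unit = data.update(key, value)
  def close(): Unit = ()
}

object RepositoryDemo {
  // Code under test depends only on the trait, so the backing store can be swapped.
  def countSeen(ids: Seq[Int], repo: Repository[Int, Int]): Unit =
    ids.foreach(id => repo.put(id, repo.get(id).getOrElse(0) + 1))

  def main(args: Array[String]): Unit = {
    val repo = new InMemoryRepository[Int, Int]
    countSeen(Seq(1, 2, 1), repo)
    println(repo.get(1)) // id 1 was seen twice
    println(repo.get(3)) // id 3 was never stored
    repo.close()
  }
}
```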

The Domain Model
§ Immutable entities via case classes
  § Serializable, equals and hash code, pattern matching out of the box
§ Controlled creation via smart constructors (factory + validation)
  § Enforce invariants during creation and transformation
§ No more defensive checks everywhere
  § Domain objects are guaranteed to be valid
  § Leverages the type system and compiler
http://www.cakesolutions.net/teamblogs/enforcing-invariants-in-scala-datatypes

Example – DataSource
§ Validations done during creation & transformation phases
§ Immutable object; can’t change after that!
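sealed trait DataSource {
  def id: Int
}

case object GlobalDataSource extends DataSource {
  val id = 0
}

// sealed + abstract: no generated apply or copy, so instances can only be created in this file
sealed abstract case class ExternalDataSource(id: Int) extends DataSource

object DataSource {
  // Smart constructor: the only way to obtain a DataSource
  def apply(id: Int): Option[DataSource] = id match {
    case invalid if invalid < 0 => None
    case GlobalDataSource.id => Some(GlobalDataSource)
    // an abstract class must be instantiated through an anonymous subclass
    case anyDsId => Some(new ExternalDataSource(anyDsId) {})
  }
}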
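
A short usage sketch of the factory, with the definitions repeated so the snippet compiles on its own. It assumes the `new ExternalDataSource(...) {}` form inside the companion, since an abstract case class gets no generated `apply`; `describe` and `DataSourceDemo` are hypothetical additions.

```scala
sealed trait DataSource { def id: Int }
case object GlobalDataSource extends DataSource { val id = 0 }
sealed abstract case class ExternalDataSource(id: Int) extends DataSource

object DataSource {
  def apply(id: Int): Option[DataSource] = id match {
    case invalid if invalid < 0 => None
    case GlobalDataSource.id    => Some(GlobalDataSource)
    case anyDsId                => Some(new ExternalDataSource(anyDsId) {})
  }
}

object DataSourceDemo {
  // Pattern matching still works: unapply is generated even for an abstract case class.
  def describe(ds: DataSource): String = ds match {
    case GlobalDataSource       => "global"
    case ExternalDataSource(id) => s"external #$id"
  }

  def main(args: Array[String]): Unit = {
    println(DataSource(-5))               // negative ids are rejected by the factory
    println(DataSource(0).map(describe))  // id 0 maps to the global data source
    println(DataSource(42).map(describe)) // any other non-negative id is external
    // ExternalDataSource(-5) would not compile here: there is no public apply to bypass the check
  }
}
```

Callers therefore only ever hold a `DataSource` that passed validation, which is what removes the need for defensive checks downstream.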

Other Tips and Tricks
§ Typesafe config + ficus for powerful, zero-boilerplate app config
  https://github.com/iheartradio/ficus
§ Option / Try / Either for error handling
  http://longcao.org/2015/07/09/functional-error-accumulation-in-scala
§ Unit / integration testing for Spark apps
  https://github.com/holdenk/spark-testing-base
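
As a small illustration of the `Option` / `Try` / `Either` point, the sketch below parses a record with the standard library only (Scala 2.12 or later is assumed, for right-biased `Either` and `Try.toEither`). `RawRecord`, `ParsedRecord` and the field rules are hypothetical. Note that a for-comprehension over `Either` stops at the first failure; accumulating every error needs a different structure.

```scala
import scala.util.Try

final case class RawRecord(customerId: String, amount: String)
final case class ParsedRecord(customerId: Int, amount: BigDecimal)

object RecordParser {
  // Try turns an exception into a value; toEither lets us attach a readable message.
  def parseId(s: String): Either[String, Int] =
    Try(s.trim.toInt).toEither.left.map(_ => s"invalid customer id: '$s'")

  def parseAmount(s: String): Either[String, BigDecimal] =
    Try(BigDecimal(s.trim)).toEither.left.map(_ => s"invalid amount: '$s'")
      .filterOrElse(_ >= 0, s"negative amount: '$s'")

  // Either is right-biased: the first Left short-circuits the comprehension.
  def parse(raw: RawRecord): Either[String, ParsedRecord] =
    for {
      id     <- parseId(raw.customerId)
      amount <- parseAmount(raw.amount)
    } yield ParsedRecord(id, amount)
}

object RecordParserDemo {
  def main(args: Array[String]): Unit = {
    println(RecordParser.parse(RawRecord("42", "9.99"))) // both fields valid
    println(RecordParser.parse(RawRecord("x", "9.99")))  // id fails to parse
    println(RecordParser.parse(RawRecord("42", "-1")))   // amount violates the rule
  }
}
```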

Conclusion – Reaching Our Design Goals
§ Scalable
§ Maintainable
  § Clear concerns at app level
  § Modular code
  § Pure functions
  § Immutable data structures
§ Testable
  § Pure functions are easy to unit test
  § The App interface makes integration tests easy
§ Easy to configure
  § Config is a statically typed class hierarchy
  § Free to parse via typesafe-config / ficus
§ Portable
  § Only the app “speaks” Spark
  § Business logic and domain model can be swapped out easily
Use FP in the small, OOP in the large!

Let’s Keep in Touch!
§ Adrian Tanase
  atanase@adobe.com
§ We’re hiring!
  http://bit.ly/adro-careers