Under-the-hood:
Creating Your Own Spark Data Sources
Speaker: Jayesh Thakrar @ Conversant
Slide 1
Why Build Your Own Data Source?
Slide 2
Don't need one for
• text, CSV, ORC, Parquet, JSON files
• JDBC sources
• sources already available from a project / vendor, e.g.
• Cassandra, Kafka, MongoDB, etc.
• Teradata, Greenplum (2018), etc.
Need one to
• use special features, e.g.
• Kafka transactions
• exploit RDBMS-specific partitioning features
• ...or simply because you can, and want to
Conversant Use Cases (mid-2017)
• Greenplum as data source (read/write)
• Incompatibility between Spark, Kafka and
Kafka connector versions
Agenda
1. Introduction to Spark data sources
2. Walk-through sample code
3. Practical considerations
Slide 3
Introduction To
Spark Data Source
Slide 4
Using Data Sources
Slide 5
• Built-in
spark.read.csv("path")
spark.read.orc("path")
spark.read.parquet("path")
• Third-party/custom
spark.read.format("class-name").load() // custom data source
spark.read.format("...").option("...", "...").load() // with options
spark.read.format("...").schema("..").load() // with schema
Spark Data Source
[Diagram: Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data]
Slide 6
Data Source V2 API
[Diagram: the same stack as the previous slide - Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data - with the data source now provided through the V2 API]
Shiny, New V2 API
since Spark 2.3 (Feb 2018)
SPARK-15689
No relations or scans involved
Slide 7
V2 API: Data Source Types
Data Source Type: Output
Batch: Dataset
Microbatch (successor to DStreams): Structured stream = a stream of bounded Datasets
Continuous: Continuous stream = a continuous stream of Row(s)
Slide 8
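To make the three types above concrete, here is a small sketch that uses only built-in pieces (the "rate" test source and the console sink) to show batch, microbatch and continuous processing; a custom V2 source would plug into the same format(...) calls. The option values and trigger intervals are arbitrary.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SourceTypesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("source-types").master("local[*]").getOrCreate()

    // Batch: spark.read...load() produces one bounded Dataset/DataFrame,
    // e.g. spark.read.format("parquet").load("<path>")

    // "rate" is a built-in streaming test source emitting (timestamp, value) rows.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Microbatch: the stream is processed as a series of small, bounded Datasets.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    // Continuous (experimental since 2.3): rows flow through long-running tasks;
    // to try it, swap the trigger above for Trigger.Continuous("1 second").

    query.awaitTermination()
  }
}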
V2 DATA SOURCE API
• Well-documented design
V2 API: design doc, feature SPARK-15689
Continuous streaming: design doc, feature SPARK-20928
• Implemented as Java Interfaces (not classes)
• Similar interfaces across all data source types
• Microbatch and Continuous not hardened yet...
e.g. SPARK-23886, SPARK-23887, SPARK-22911
Slide 9
Code Walkthrough
Slide 10
Details
• Project: https://github.com/JThakrar/sparkconn
• Requirements:
• Scala 2.11
• SBT
Slide 11
Reading From Data Source
val data = spark.read.format("DataSource").option("key", "value").schema("..").load()
Step: Action
spark: the SparkSession
read: returns a DataFrameReader, which orchestrates the data source
format: lazy lookup of the data source class, by short name or by fully-qualified class name
option: zero, one or more key-value options for the data source
schema: optional, user-provided schema
load: loads the data as a DataFrame (remember that DataFrames are lazily evaluated)
Slide 12
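As a runnable illustration of the chain above; the class name, option key and schema below are placeholders rather than the sparkconn project's actual names.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadChainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-chain").master("local[*]").getOrCreate()

    // "com.example.MyDataSource" stands in for your source's fully-qualified
    // class name (or its registered short name).
    val df = spark.read
      .format("com.example.MyDataSource")
      .option("key", "value") // delivered to the source as DataSourceOptions
      .schema(StructType(Seq(StructField("string_value", StringType)))) // only honored if the source supports user schemas
      .load() // lazy: nothing is read yet

    df.show() // an action finally triggers the read
    spark.stop()
  }
}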
Reading: Interfaces to Implement
Slide 13
[Diagram: on the driver, spark.read.format(...) resolves the DataSourceRegister, ReadSupport / ReadSupportWithSchema and the DataSourceReader; the DataSourceReader creates one DataReaderFactory per partition, and each executor task / partition turns its DataReaderFactory into a DataReader.]
Need to implement a minimum of 5 interfaces and 3 classes.
Interface Definitions And Dependency
Slide 14
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
Example V2 API-Based Data Source
Slide 15
• Very simple data source that generates rows with a fixed schema of a single
string column (column name = "string_value")
• Completely self-contained (i.e. no external connection)
• Number of partitions in the dataset is user-configurable (default = 5)
• All partitions contain the same number of rows (strings)
• Number of rows per partition is user-configurable (default = 5)
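A hedged usage sketch of the example source described above; the class name and option keys are placeholders, so check the sparkconn project README for the real ones.

import org.apache.spark.sql.SparkSession

object ExampleSourceRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example-source").master("local[*]").getOrCreate()

    val df = spark.read
      .format("com.example.SimpleStringDataSource") // placeholder class name
      .option("partitions", "2")                    // placeholder option key
      .option("rowsperpartition", "3")              // placeholder option key
      .load()

    df.printSchema()                  // a single column: string_value (string)
    println(df.rdd.getNumPartitions)  // expect 2
    println(df.count())               // 2 partitions x 3 rows per partition = 6 rows
    spark.stop()
  }
}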
Read Interface: DataSourceRegister
Slide 16
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
and / or
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your data source.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader (discussed later). It accepts the
options/parameters and the optional schema from the Spark application.
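One packaging note (an assumption about the build, not something stated on the slide): Spark resolves short names by loading every DataSourceRegister implementation through Java's ServiceLoader, so the data source jar normally ships a provider-configuration file; com.example.SimpleStringDataSource below is a placeholder class name.

File: src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
Content (a single line naming the implementing class):
com.example.SimpleStringDataSource

With that file on the classpath, spark.read.format("<short name>") finds the class via its shortName(); passing the fully-qualified class name to format(...) works even without the service file.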
Read Interface: DataSourceReader
Slide 17
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of the data
• determining the number of partitions and creating that many
reader factories (next slide)
Read Interface: DataReaderFactory
Slide 18
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" the driver passes to each executor. It
instantiates the reader (next slide) that performs the actual data fetch.
Read Interface: DataReader
Slide 19
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data from the source (at the task level)
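Pulling the four read-side interfaces together, here is a minimal, self-contained sketch against the Spark 2.3.1 signatures. The class names, short name and option keys are illustrative only and are not the actual sparkconn code.

package com.example

import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point: short-name registration plus ReadSupport, which hands Spark a reader.
class SimpleStringDataSource extends DataSourceV2 with DataSourceRegister with ReadSupport {
  override def shortName(): String = "simple-string-source"
  override def createReader(options: DataSourceOptions): DataSourceReader = {
    val partitions = options.get("partitions").orElse("5").toInt
    val rowsPerPartition = options.get("rowsperpartition").orElse("5").toInt
    new SimpleStringReader(partitions, rowsPerPartition)
  }
}

// Driver side: declares the schema and creates one reader factory per partition.
class SimpleStringReader(partitions: Int, rowsPerPartition: Int) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("string_value", StringType, nullable = false)))
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    (0 until partitions)
      .map(p => new SimpleStringReaderFactory(p, rowsPerPartition): DataReaderFactory[Row])
      .asJava
}

// Serialized by the driver and shipped to executors; creates the per-task reader.
class SimpleStringReaderFactory(partition: Int, rowsPerPartition: Int)
  extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] =
    new SimpleStringDataReader(partition, rowsPerPartition)
}

// Executor side: next()/get()/close() produce the rows of one partition.
class SimpleStringDataReader(partition: Int, rowsPerPartition: Int) extends DataReader[Row] {
  private var index = 0
  override def next(): Boolean = index < rowsPerPartition
  override def get(): Row = {
    index += 1
    Row(s"partition: $partition, row: $index")
  }
  override def close(): Unit = {}
}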
Summary
Slide 20
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your connector.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader below. It accepts the
options/parameters and the optional schema from the Spark application.
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of data
• determining the number of partitions and creating that many
reader factories below.
The DataSourceRegister, DataSourceReader and DataReaderFactory
are instantiated at the driver. The driver then serializes
DataReaderFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" passed by the driver to each executor. It
instantiates the readers below and controls data fetch.
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data
Because you are implementing interfaces, YOU determine your classes' constructor parameters and initialization
Write Interfaces to Implement
Slide 21
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.WriteSupport
DataSourceRegister is the entry point for your connector.
WriteSupport is then responsible for instantiating the object
implementing DataSourceWriter below. It accepts the
options/parameters and the schema.
org.apache.spark.sql.sources.v2.writer.DataSourceWriter
This interface requires implementations for:
• committing data write
• aborting data write
• creating writer factories
The DataSourceRegister, DataSourceWriter and DataWriterFactory
are instantiated at the driver. The driver then serializes
DataWriterFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.writer.DataWriterFactory
This is the "handle" passed by the driver to each executor. It
instantiates the writers below and controls the data write.
org.apache.spark.sql.sources.v2.writer.DataWriter
This does the actual work of writing and committing/aborting the
data
org.apache.spark.sql.sources.v2.writer.WriterCommitMessage
This is a "commit" message that each DataWriter passes back to the
DataSourceWriter.
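For symmetry, a minimal write-side sketch against the Spark 2.3.1 interfaces (these signatures changed in later releases). All names are placeholders, and the "writer" just logs rows instead of writing to a real system.

package com.example

import java.util.Optional

import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriter, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.types.StructType

// Entry point for writes: hands Spark a DataSourceWriter for this job.
class SimpleLogSink extends DataSourceV2 with DataSourceRegister with WriteSupport {
  override def shortName(): String = "simple-log-sink"
  override def createWriter(
      jobId: String,
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new SimpleLogWriter(jobId))
}

// Driver side: creates the writer factory and receives every task's commit message.
class SimpleLogWriter(jobId: String) extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] = new SimpleLogWriterFactory
  override def commit(messages: Array[WriterCommitMessage]): Unit =
    println(s"Job $jobId committed ${messages.length} task(s)")
  override def abort(messages: Array[WriterCommitMessage]): Unit =
    println(s"Job $jobId aborted")
}

// Serialized to executors; creates one DataWriter per task attempt.
class SimpleLogWriterFactory extends DataWriterFactory[Row] {
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new SimpleLogDataWriter(partitionId)
}

// Executor side: writes rows (here to the task's stdout) and reports a commit message.
class SimpleLogDataWriter(partitionId: Int) extends DataWriter[Row] {
  private var count = 0L
  override def write(record: Row): Unit = { count += 1; println(record) }
  override def commit(): WriterCommitMessage = SimpleCommitMessage(partitionId, count)
  override def abort(): Unit = {}
}

// The per-task "commit" message sent back to the driver-side DataSourceWriter.
case class SimpleCommitMessage(partitionId: Int, rowCount: Long) extends WriterCommitMessage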
Practical Considerations
Slide 22
Know Your Data Source
• Configuration
• Partitions
• Data schema
• Parallelism approach
• Batch and/or streaming
• Restart / recovery
Slide 23
V2 API IS STILL EVOLVING
• SPARK-22386 - Data Source V2 Improvements
• SPARK-23507 - Migrate existing data sources
• SPARK-24073 - DataReaderFactory Renamed in 2.4
• SPARK-24252, SPARK-25006 - DataSourceV2: Add catalog support
• So why use V2? It is future-ready, and the alternatives to it need significantly more time and effort!
See https://www.youtube.com/watch?v=O9kpduk5D48
Slide 24
About...
Conversant
• Digital marketing unit of Epsilon under Alliance Data Systems (ADS)
• (Significant) player in internet advertising.
We see about 80% of internet ad bids in the US
• Secret sauce = anonymous cross-device profiles driving personalized messaging
Me (Jayesh Thakrar)
• Sr. Software Engineer (jthakrar@conversantmedia.com)
• https://www.linkedin.com/in/jayeshthakrar/
Slide 25
Questions?
Slide 26