Under-the-hood:
Creating Your Own Spark Data Sources
Speaker: Jayesh Thakrar @ Conversant
Slide 1
Why Build Your Own Data Source?
Slide 2
Don't need one for
• text, CSV, ORC, Parquet, JSON files
• JDBC sources
• sources already available from a project / vendor, e.g.
• Cassandra, Kafka, MongoDB, etc.
• Teradata, Greenplum (2018), etc.
Need one to
• use special features, e.g.
• Kafka transactions
• exploit RDBMS-specific partitioning features
• ...or simply because you can, and want to
Conversant Use Cases (mid-2017)
• Greenplum as data source (read/write)
• Incompatibility between Spark, Kafka and
Kafka connector versions
Agenda
1. Introduction to Spark data sources
2. Walk-through sample code
3. Practical considerations
Slide 3
Introduction To
Spark Data Source
Slide 4
Using Data Sources
Slide 5
• Built-in
spark.read.csv("path")
spark.read.orc("path")
spark.read.parquet("path")
• Third-party/custom
spark.read.format("class-name").load() // custom data source
spark.read.format("...").option("...", "...").load() // with options
spark.read.format("...").schema("..").load() // with schema
Spark Data Source
[Diagram: Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data]
Slide 6
Data Source V2 API
[Diagram: the same stack as the previous slide - Spark Application → Spark API → Spark Backend (runtime, Spark libraries, etc.) → Spark Data Source → Data - with the data source now provided through the V2 API]
Shiny, New V2 API
since Spark 2.3 (Feb 2018)
SPARK-15689
No relations or scans involved
Slide 7
V2 API: Data Source Types
Data Source Type: Output
Batch: Dataset
Microbatch (successor to DStreams): Structured stream = a stream of bounded Datasets
Continuous: Continuous stream = a continuous stream of Row(s)
Slide 8
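To make the three types above concrete, here is a small sketch that uses only built-in pieces (the "rate" test source and the console sink) to show batch, microbatch and continuous processing; a custom V2 source would plug into the same format(...) calls. The option values and trigger intervals are arbitrary.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SourceTypesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("source-types").master("local[*]").getOrCreate()

    // Batch: spark.read...load() produces one bounded Dataset/DataFrame,
    // e.g. spark.read.format("parquet").load("<path>")

    // "rate" is a built-in streaming test source emitting (timestamp, value) rows.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Microbatch: the stream is processed as a series of small, bounded Datasets.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    // Continuous (experimental since 2.3): rows flow through long-running tasks;
    // to try it, swap the trigger above for Trigger.Continuous("1 second").

    query.awaitTermination()
  }
}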
V2 DATA SOURCE API
• Well-documented design
V2 API: design doc, feature SPARK-15689
Continuous streaming: design doc, feature SPARK-20928
• Implemented as Java Interfaces (not classes)
• Similar interfaces across all data source types
• Microbatch and Continuous not hardened yet...
e.g. SPARK-23886, SPARK-23887, SPARK-22911
Slide 9
Code Walkthrough
Slide 10
Details
• Project: https://github.com/JThakrar/sparkconn
• Requirements:
• Scala 2.11
• SBT
Slide 11
Reading From Data Source
val data = spark.read.format("DataSource").option("key", "value").schema("..").load()
Step: Action
spark: the SparkSession
read: returns a DataFrameReader, which orchestrates the data source
format: lazy lookup of the data source class, by short name or by fully-qualified class name
option: zero, one or more key-value options for the data source
schema: optional, user-provided schema
load: loads the data as a DataFrame (remember that DataFrames are lazily evaluated)
Slide 12
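As a runnable illustration of the chain above; the class name, option key and schema below are placeholders rather than the sparkconn project's actual names.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadChainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-chain").master("local[*]").getOrCreate()

    // "com.example.MyDataSource" stands in for your source's fully-qualified
    // class name (or its registered short name).
    val df = spark.read
      .format("com.example.MyDataSource")
      .option("key", "value") // delivered to the source as DataSourceOptions
      .schema(StructType(Seq(StructField("string_value", StringType)))) // only honored if the source supports user schemas
      .load() // lazy: nothing is read yet

    df.show() // an action finally triggers the read
    spark.stop()
  }
}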
Reading: Interfaces to Implement
Slide 13
[Diagram: on the driver, spark.read.format(...) resolves the DataSourceRegister, ReadSupport / ReadSupportWithSchema and the DataSourceReader; the DataSourceReader creates one DataReaderFactory per partition, and each executor task / partition turns its DataReaderFactory into a DataReader.]
Need to implement a minimum of 5 interfaces and 3 classes.
Interface Definitions And Dependency
Slide 14
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
Example V2 API-Based Data Source
Slide 15
• Very simple data source that generates rows with a fixed schema of a single
string column (column name = "string_value")
• Completely self-contained (i.e. no external connection)
• Number of partitions in the dataset is user-configurable (default = 5)
• All partitions contain the same number of rows (strings)
• Number of rows per partition is user-configurable (default = 5)
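A hedged usage sketch of the example source described above; the class name and option keys are placeholders, so check the sparkconn project README for the real ones.

import org.apache.spark.sql.SparkSession

object ExampleSourceRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example-source").master("local[*]").getOrCreate()

    val df = spark.read
      .format("com.example.SimpleStringDataSource") // placeholder class name
      .option("partitions", "2")                    // placeholder option key
      .option("rowsperpartition", "3")              // placeholder option key
      .load()

    df.printSchema()                  // a single column: string_value (string)
    println(df.rdd.getNumPartitions)  // expect 2
    println(df.count())               // 2 partitions x 3 rows per partition = 6 rows
    spark.stop()
  }
}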
Read Interface: DataSourceRegister
Slide 16
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
and / or
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your data source.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader (discussed later). It accepts the
options/parameters and the optional schema from the Spark application.
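One packaging note (an assumption about the build, not something stated on the slide): Spark resolves short names by loading every DataSourceRegister implementation through Java's ServiceLoader, so the data source jar normally ships a provider-configuration file; com.example.SimpleStringDataSource below is a placeholder class name.

File: src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
Content (a single line naming the implementing class):
com.example.SimpleStringDataSource

With that file on the classpath, spark.read.format("<short name>") finds the class via its shortName(); passing the fully-qualified class name to format(...) works even without the service file.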
Read Interface: DataSourceReader
Slide 17
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of the data
• determining the number of partitions and creating that many
reader factories (next slide)
Read Interface: DataReaderFactory
Slide 18
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" the driver passes to each executor. It
instantiates the reader (next slide) that performs the actual data fetch.
Read Interface: DataReader
Slide 19
Interface Purpose
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data from the source (at the task level)
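Pulling the four read-side interfaces together, here is a minimal, self-contained sketch against the Spark 2.3.1 signatures. The class names, short name and option keys are illustrative only and are not the actual sparkconn code.

package com.example

import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point: short-name registration plus ReadSupport, which hands Spark a reader.
class SimpleStringDataSource extends DataSourceV2 with DataSourceRegister with ReadSupport {
  override def shortName(): String = "simple-string-source"
  override def createReader(options: DataSourceOptions): DataSourceReader = {
    val partitions = options.get("partitions").orElse("5").toInt
    val rowsPerPartition = options.get("rowsperpartition").orElse("5").toInt
    new SimpleStringReader(partitions, rowsPerPartition)
  }
}

// Driver side: declares the schema and creates one reader factory per partition.
class SimpleStringReader(partitions: Int, rowsPerPartition: Int) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("string_value", StringType, nullable = false)))
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    (0 until partitions)
      .map(p => new SimpleStringReaderFactory(p, rowsPerPartition): DataReaderFactory[Row])
      .asJava
}

// Serialized by the driver and shipped to executors; creates the per-task reader.
class SimpleStringReaderFactory(partition: Int, rowsPerPartition: Int)
  extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] =
    new SimpleStringDataReader(partition, rowsPerPartition)
}

// Executor side: next()/get()/close() produce the rows of one partition.
class SimpleStringDataReader(partition: Int, rowsPerPartition: Int) extends DataReader[Row] {
  private var index = 0
  override def next(): Boolean = index < rowsPerPartition
  override def get(): Row = {
    index += 1
    Row(s"partition: $partition, row: $index")
  }
  override def close(): Unit = {}
}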
Summary
Slide 20
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.ReadSupport
org.apache.spark.sql.sources.v2.ReadSupportWithSchema
DataSourceRegister is the entry point for your connector.
ReadSupport is then responsible for instantiating the object
implementing DataSourceReader below. It accepts the
options/parameters and the optional schema from the Spark application.
org.apache.spark.sql.sources.v2.reader.DataSourceReader
This interface requires implementations for:
• determining the schema of data
• determining the number of partitions and creating that many
reader factories below.
The DataSourceRegister, DataSourceReader and DataReaderFactory
are instantiated at the driver. The driver then serializes
DataReaderFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.reader.DataReaderFactory
This is the "handle" passed by the driver to each executor. It
instantiates the readers below and controls data fetch.
org.apache.spark.sql.sources.v2.reader.DataReader This does the actual work of fetching data
Because you are implementing interfaces, YOU determine your classes' constructor parameters and initialization
Write Interfaces to Implement
Slide 21
Interface Purpose
org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.sources.v2.WriteSupport
DataSourceRegister is the entry point for your connector.
WriteSupport is then responsible for instantiating the object
implementing DataSourceWriter below. It accepts the
options/parameters and the schema.
org.apache.spark.sql.sources.v2.writer.DataSourceWriter
This interface requires implementations for:
• committing data write
• aborting data write
• creating writer factories
The DataSourceRegister, DataSourceWriter and DataWriterFactory
are instantiated at the driver. The driver then serializes
DataWriterFactory and sends it to each of the executors.
org.apache.spark.sql.sources.v2.writer.DataWriterFactory
This is the "handle" passed by the driver to each executor. It
instantiates the writers below and controls the data write.
org.apache.spark.sql.sources.v2.writer.DataWriter
This does the actual work of writing and committing/aborting the
data
org.apache.spark.sql.sources.v2.writer.WriterCommitMessage
This is a "commit" message that each DataWriter passes back to the
DataSourceWriter.
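For symmetry, a minimal write-side sketch against the Spark 2.3.1 interfaces (these signatures changed in later releases). All names are placeholders, and the "writer" just logs rows instead of writing to a real system.

package com.example

import java.util.Optional

import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriter, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.types.StructType

// Entry point for writes: hands Spark a DataSourceWriter for this job.
class SimpleLogSink extends DataSourceV2 with DataSourceRegister with WriteSupport {
  override def shortName(): String = "simple-log-sink"
  override def createWriter(
      jobId: String,
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new SimpleLogWriter(jobId))
}

// Driver side: creates the writer factory and receives every task's commit message.
class SimpleLogWriter(jobId: String) extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] = new SimpleLogWriterFactory
  override def commit(messages: Array[WriterCommitMessage]): Unit =
    println(s"Job $jobId committed ${messages.length} task(s)")
  override def abort(messages: Array[WriterCommitMessage]): Unit =
    println(s"Job $jobId aborted")
}

// Serialized to executors; creates one DataWriter per task attempt.
class SimpleLogWriterFactory extends DataWriterFactory[Row] {
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new SimpleLogDataWriter(partitionId)
}

// Executor side: writes rows (here to the task's stdout) and reports a commit message.
class SimpleLogDataWriter(partitionId: Int) extends DataWriter[Row] {
  private var count = 0L
  override def write(record: Row): Unit = { count += 1; println(record) }
  override def commit(): WriterCommitMessage = SimpleCommitMessage(partitionId, count)
  override def abort(): Unit = {}
}

// The per-task "commit" message sent back to the driver-side DataSourceWriter.
case class SimpleCommitMessage(partitionId: Int, rowCount: Long) extends WriterCommitMessage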
Practical Considerations
Slide 22
Know Your Data Source
• Configuration
• Partitions
• Data schema
• Parallelism approach
• Batch and/or streaming
• Restart / recovery
Slide 23
V2 API IS STILL EVOLVING
• SPARK-22386 - Data Source V2 Improvements
• SPARK-23507 - Migrate existing data sources
• SPARK-24073 - DataReaderFactory Renamed in 2.4
• SPARK-24252, SPARK-25006 - DataSourceV2: Add catalog support
• So why use V2? It is future-ready, and the alternatives to it need significantly more time and effort!
See https://www.youtube.com/watch?v=O9kpduk5D48
Slide 24
About...
Conversant
• Digital marketing unit of Epsilon under Alliance Data Systems (ADS)
• (Significant) player in internet advertising.
We see about 80% of internet ad bids in the US
• Secret sauce = anonymous cross-device profiles driving personalized messaging
Me (Jayesh Thakrar)
• Sr. Software Engineer (jthakrar@conversantmedia.com)
• https://www.linkedin.com/in/jayeshthakrar/
Slide 25
Questions?
Slide 26