Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)

© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl / Spark+AI Summit 2019

Jacek Laskowski
● Freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training | Speaking
● "The Internals Of" online books
● Among contributors to Apache Spark
● Among Confluent Community Catalysts (Class of 2019 - 2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on Twitter for more #ApacheSpark #ApacheKafka #KafkaStreams

Friendly reminder
Pictures...take a lot of pictures! 📷

Why Should You Care?
1. Why would you ever consider developing a new data source for Spark SQL?
2. Let structured queries access data in external systems (e.g. Splice Machine, Google Cloud Spanner)
3. Make the loading or writing process self-contained
a. Hidden from developers, who'd focus on what to do with the data, not how to make the data available in a proper format

Data Source / Data Provider
1. Data Source is a pluggable “abstraction” in Spark SQL for loading and saving data
a. Abstraction in a loose meaning
b. Also known as Data Provider or Data Format or Relation Provider
2. Built-In Data Sources: parquet, kafka, avro, json, etc.
3. All available for developers, data engineers, and data scientists
a. Scala, Java, Python, SQL
4. Allows for new data sources (the goal of the session! 🎯)
5. Source or Reader for loading data
6. Sink or Writer for saving data
7. Read up on Data Sources in the official documentation

Before Developing New Data Source
1. What Apache Spark version?
2. Data Source API V1 vs Data Source API V2?
3. Loading and/or Saving Data?
4. Spark SQL only?
5. Spark Structured Streaming?
a. Micro-Batch Stream Processing?
b. Continuous Stream Processing?

DataFrameReader (1 of 2)
1. SparkSession.read to start describing a data flow
a. Creates a DataFrameReader
2. DataFrameReader is a fluent interface to describe the input data source
3. Used to “load” data from external storage systems (e.g. file systems, key-value stores, etc.)
a. No physical data movement yet
b. Metadata of an input node in a data flow (graph)
4. DataFrameReader.load to finish describing the input
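
The steps above can be sketched as follows. This is a minimal illustration, not code from the session: the object name DataFrameReaderDemo, the helper describeInput, and the temporary file are assumptions made for the example, and the built-in "text" format is used only because it needs no schema inference. Rows are read when an action such as count runs, not when load returns.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, DataFrameReader, SparkSession}

// Hypothetical demo object; the names are illustrative only.
object DataFrameReaderDemo {

  // Describes an input node of the data flow without running any job.
  def describeInput(spark: SparkSession, path: String): DataFrame = {
    // SparkSession.read creates a DataFrameReader (fluent interface).
    val reader: DataFrameReader = spark.read.format("text")
    // DataFrameReader.load finishes describing the input.
    reader.load(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DataFrameReaderDemo")
      .getOrCreate()

    // A throwaway input file so the sketch is self-contained.
    val file = Files.createTempFile("demo", ".txt")
    Files.write(file, "hello\nworld\n".getBytes(StandardCharsets.UTF_8))

    val df = describeInput(spark, file.toString)
    df.printSchema()    // metadata only
    println(df.count()) // an action: the rows are read here

    spark.stop()
  }
}
```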

DataFrameReader (2 of 2)
Worth noticing:
1. DataSource.lookupDataSource
2. DataSourceV2
3. ReadSupport
4. DataSourceV2Relation
5. loadV1Source


loadV1Source = DataSource.resolveRelation
1. loadV1Source loads a DataSource API V1 data source

Data Source API
1. DataSourceRegister
2. 👉 Data Source API V1
3. 👉 Data Source API V2

Friendly reminders
1. Pictures...take a lot of pictures! 📷
2. It should be live coding, shouldn’t it? 🤔

Data Source API V1
a. SchemaRelationProvider
b. RelationProvider
c. FileFormat
d. CreatableRelationProvider
2. BaseRelation
a. PrunedFilteredScan
b. InsertableRelation
c. PrunedScan
d. TableScan
e. CatalystScan
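
A minimal sketch of a read-only Data Source API V1 source, assuming Spark 2.4.x on the classpath. It combines RelationProvider, BaseRelation, TableScan, and DataSourceRegister from the lists above. The names DemoSource, DemoRelation, DemoV1App, the short name "demo", and the "rows" option are all hypothetical and are not from the session.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical provider: the entry point Spark instantiates by class name.
class DemoSource extends RelationProvider with DataSourceRegister {
  // The short name resolves only if this class is also listed in
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "demo"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // "rows" is an assumed option name for this sketch.
    val rows = parameters.getOrElse("rows", "3").toLong
    new DemoRelation(sqlContext, rows)
  }
}

// Hypothetical relation: a fixed schema plus a full table scan.
class DemoRelation(override val sqlContext: SQLContext, rows: Long)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  // TableScan: produce every row, with no column pruning or filter pushdown.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.range(0, rows).map(id => Row(id))
}

object DemoV1App {
  // The fully-qualified class name is used, so no service registration is needed.
  def load(spark: SparkSession, rows: Long): DataFrame =
    spark.read
      .format(classOf[DemoSource].getName)
      .option("rows", rows.toString)
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DemoV1App")
      .getOrCreate()
    load(spark, rows = 5).show()
    spark.stop()
  }
}
```

The sketch assumes the classes are compiled into an application. Classes defined interactively in spark-shell may not be instantiable by name.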

Data Source API V2
2. DataSourceV2
3. ReadSupport
4. WriteSupport
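
A comparable read-only sketch for Data Source API V2, written against the Spark 2.4.x interfaces in org.apache.spark.sql.sources.v2 (this API was reworked in Spark 3.0, so the code is not expected to compile there). The names DemoSourceV2, DemoReader, DemoPartition, DemoV2App, and the "rows" option are hypothetical. WriteSupport is left out to keep the example short.

```scala
import java.util.{Arrays, List => JList}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical V2 source: the DataSourceV2 marker interface plus ReadSupport.
class DemoSourceV2 extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DemoReader(options.getLong("rows", 3L)) // "rows" is an assumed option
}

// Driver side: reports the schema and plans the input partitions.
class DemoReader(rows: Long) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  // A single partition keeps the sketch simple.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Arrays.asList[InputPartition[InternalRow]](new DemoPartition(rows))
}

// Serialized to executors; creates the reader that produces the rows.
class DemoPartition(rows: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = -1L
      override def next(): Boolean = { current += 1; current < rows }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
}

object DemoV2App {
  def load(spark: SparkSession, rows: Long): DataFrame =
    spark.read
      .format(classOf[DemoSourceV2].getName)
      .option("rows", rows.toString)
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DemoV2App")
      .getOrCreate()
    load(spark, rows = 5).show()
    spark.stop()
  }
}
```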

“The Internals Of” Online Books
1. The Internals of Spark SQL
2. The Internals of Spark Structured Streaming
3. The Internals of Apache Spark

Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on Stack Overflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn
