Extending Spark SQL 2.4
with New Data Sources
Live Coding Session
© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl / Spark+AI Summit 2019
● Freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training | Speaking
● "The Internals Of" online books
● Among contributors to Apache Spark
● Among Confluent Community Catalysts (Class of 2019-2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on Twitter for more #ApacheSpark
#ApacheKafka #KafkaStreams
Jacek Laskowski
Friendly reminder
Pictures...take a lot of pictures! 📷
Why Should You Care?
1. Why would you ever consider developing a new data
source for Spark SQL?
2. Let structured queries access data in external systems
(e.g. Splice Machine, Google Cloud Spanner)
3. Make loading or writing process self-contained
a. Hidden from developers, who can focus on what to do with the data,
not on how to make the data available in a proper format
Data Source / Data Provider
1. Data Source is a pluggable “abstraction” in Spark SQL for loading and saving
data
a. Abstraction in a loose meaning
b. Also known as Data Provider or Data Format or Relation Provider
2. Built-In Data Sources: parquet, kafka, avro, json, etc.
3. All available for developers, data engineers, and data scientists
a. Scala, Java, Python, SQL
4. Allows for new data sources
5. Source or Reader for loading data
6. Sink or Writer for saving data
7. Read up on Data Sources in the official documentation
The goal of the session! 🎯
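The loading and saving sides described above can be sketched with a built-in data source. This is a minimal illustration, assuming Spark 2.4 is on the classpath; the object name LoadAndSaveDemo, the sample rows, and the temporary directory are made up for the example.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LoadAndSaveDemo {
  // Writes a tiny dataset as JSON, then loads it back through the same format alias.
  def roundTrip(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical scratch location; any writable path would do.
    val dir = Files.createTempDirectory("datasource-demo").resolve("people").toString

    val people = Seq((1, "Ada"), (2, "Grace")).toDF("id", "name")

    // Writer side: format(...) selects the data source, save(...) triggers the write.
    people.write.format("json").mode(SaveMode.Overwrite).save(dir)

    // Reader side: the same alias selects the data source for loading.
    spark.read.format("json").load(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("load-and-save")
      .getOrCreate()
    try roundTrip(spark).show()
    finally spark.stop()
  }
}
```

A new data source plugs into exactly these two calls: only the name passed to format(...) changes.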
Before Developing New Data Source
1. What Apache Spark version?
2. Data Source API V1 vs Data Source API V2?
3. Loading and/or Saving Data?
4. Spark SQL only?
5. Spark Structured Streaming?
a. Micro-Batch Stream Processing?
b. Continuous Stream Processing?
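To make the last question concrete, here is a small sketch of how the two stream processing modes are selected on the query side. It assumes Spark 2.4 with the built-in rate source and console sink; the object name StreamingModesDemo and the intervals are illustrative only. A data source has to support the chosen mode for the query to start.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object StreamingModesDemo {
  // Starts a streaming query over the built-in "rate" source with the given trigger.
  def start(spark: SparkSession, trigger: Trigger): StreamingQuery =
    spark.readStream
      .format("rate")                  // built-in source that generates rows
      .option("rowsPerSecond", "1")
      .option("numPartitions", "1")
      .load()
      .writeStream
      .format("console")
      .trigger(trigger)
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("streaming-modes")
      .getOrCreate()

    val modes = Seq(
      Trigger.ProcessingTime("2 seconds"), // micro-batch stream processing
      Trigger.Continuous("2 seconds"))     // continuous stream processing (checkpoint interval)

    for (trigger <- modes) {
      val query = start(spark, trigger)
      query.awaitTermination(5000)         // let the query run for about 5 seconds
      query.stop()
    }
    spark.stop()
  }
}
```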
DataFrameReader (1 of 2)
1. SparkSession.read to start describing a data flow
a. Creates a DataFrameReader
2. DataFrameReader is a fluent interface to describe the
input data source
3. Used to “load” data from external storage systems (e.g.
file systems, key-value stores, etc.)
a. No physical data movement yet
b. Metadata of an input node in a data flow (graph)
4. DataFrameReader.load to finish describing the input
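The fluent flow from SparkSession.read to DataFrameReader.load might look like the sketch below. It assumes Spark 2.4 and the built-in csv source; the object name DataFrameReaderDemo, the schema, and the temporary file are invented for the example. The schema is given explicitly so that describing the input does not need an inference pass over the data.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DataFrameReaderDemo {
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // Describes the input; no rows are read until an action runs.
  def describeInput(spark: SparkSession): DataFrame = {
    // Hypothetical input file, created here so the sketch is self-contained.
    val file = Files.createTempFile("reader-demo", ".csv")
    Files.write(file, "1,Ada\n2,Grace\n".getBytes("UTF-8"))

    spark.read                   // SparkSession.read creates a DataFrameReader
      .format("csv")             // which data source to use
      .schema(schema)            // explicit schema instead of inference
      .option("header", "false")
      .load(file.toString)       // finishes the description and returns a DataFrame
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-reader")
      .getOrCreate()
    try {
      val df = describeInput(spark)
      df.printSchema()           // metadata only
      println(df.count())        // first action: the file is actually read here
    } finally spark.stop()
  }
}
```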
DataFrameReader (2 of 2)
Worth noticing:
1. DataSource.lookupDataSource
2. DataSourceV2
3. ReadSupport
4. DataSourceV2Relation
5. loadV1Source
loadV1Source = DataSource.resolveRelation
1. loadV1Source loads a DataSource API V1 data source
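How the names on these two slides fit together can be modelled without Spark at all. The following is a toy sketch of the branching in DataFrameReader.load as I read it in Spark 2.4: every trait and class below is a stand-in defined for the example, not the real Spark type, and the strings returned only name the path that would be taken.

```scala
object LoadDispatchSketch {
  // Stand-ins for the Spark 2.4 interfaces named on the slide (not the real types).
  trait DataSourceV2
  trait ReadSupport extends DataSourceV2

  // Hypothetical providers covering the three cases.
  class V1Provider
  class V2ReadableProvider extends ReadSupport
  class V2WriteOnlyProvider extends DataSourceV2

  // Toy registry standing in for DataSource.lookupDataSource.
  private val registry: Map[String, Class[_]] = Map(
    "v1" -> classOf[V1Provider],
    "v2-read" -> classOf[V2ReadableProvider],
    "v2-write-only" -> classOf[V2WriteOnlyProvider])

  def lookupDataSource(format: String): Class[_] =
    registry.getOrElse(format,
      throw new ClassNotFoundException(s"Failed to find data source: $format"))

  // Stand-in for loadV1Source, which delegates to DataSource.resolveRelation.
  def loadV1Source(format: String): String = "DataSource.resolveRelation"

  // Mirrors the decision: V2 with ReadSupport gets a DataSourceV2Relation,
  // everything else falls back to the V1 path.
  def load(format: String): String = {
    val cls = lookupDataSource(format)
    if (classOf[DataSourceV2].isAssignableFrom(cls)) {
      val ds = cls.getDeclaredConstructor().newInstance()
      if (ds.isInstanceOf[ReadSupport]) "DataSourceV2Relation"
      else loadV1Source(format)
    } else {
      loadV1Source(format)
    }
  }

  def main(args: Array[String]): Unit =
    Seq("v1", "v2-read", "v2-write-only").foreach { f =>
      println(s"$f -> ${load(f)}")
    }
}
```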
Data Source API
1. DataSourceRegister
2. 👉 Data Source API V1
3. 👉 Data Source API V2
Friendly reminders
1. Pictures...take a lot of pictures! 📷
2. It should be a live coding, shouldn’t it? 🤔
Data Source API V1
1. DataSourceRegister
a. SchemaRelationProvider
b. RelationProvider
c. FileFormat
d. CreatableRelationProvider
2. BaseRelation
a. PrunedFilteredScan
b. InsertableRelation
c. PrunedScan
d. TableScan
e. CatalystScan
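A read-only data source built from these V1 pieces could look like the sketch below. It is not the code from the live session: the package demo.v1, the "rows" option, and the generated rows are all hypothetical, and it assumes Spark 2.4. It combines RelationProvider and DataSourceRegister on the provider side with BaseRelation and TableScan on the relation side.

```scala
package demo.v1

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Spark resolves format("demo.v1") to demo.v1.DefaultSource by naming convention.
class DefaultSource extends RelationProvider with DataSourceRegister {
  // Alias usable in format(...) only once this class is listed in
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "demo-v1"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext, parameters.getOrElse("rows", "3").toInt)
}

// A relation with a fixed schema and a full table scan (no pruning, no filters).
class DemoRelation(override val sqlContext: SQLContext, rows: Int)
    extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("label", StringType, nullable = false)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext
      .parallelize(1 to rows)
      .map(i => Row(i, s"row-$i"))
}

object DataSourceV1Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("data-source-v1")
      .getOrCreate()
    try spark.read.format("demo.v1").option("rows", "5").load().show()
    finally spark.stop()
  }
}
```

PrunedScan or PrunedFilteredScan would replace TableScan once the source can skip columns or push filters down; CreatableRelationProvider or InsertableRelation would add the saving side.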
Data Source API V2
1. DataSourceRegister
2. DataSourceV2
3. ReadSupport
4. WriteSupport
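For comparison, a read-only V2 data source for Spark 2.4 can be sketched as follows. The reader-side interfaces used here (DataSourceReader, InputPartition, InputPartitionReader) are specific to 2.4 and changed in other releases; the package demo.v2, the "end" option, and the two-way split are assumptions made for the example, not the session's code.

```scala
package demo.v2

import java.util.{Arrays, List => JList}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Spark resolves format("demo.v2") to demo.v2.DefaultSource by naming convention.
class DefaultSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new RangeSourceReader(options.getLong("end", 10L))
}

// Driver side: reports the schema and splits the range [0, end) into two partitions.
class RangeSourceReader(end: Long) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
    val mid = end / 2
    Arrays.asList[InputPartition[InternalRow]](
      new RangePartition(0L, mid),
      new RangePartition(mid, end))
  }
}

// Serialized to executors, where it creates the actual reader.
class RangePartition(start: Long, end: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new RangePartitionReader(start, end)
}

// Executor side: iterates over the values start, start + 1, ..., end - 1.
class RangePartitionReader(start: Long, end: Long) extends InputPartitionReader[InternalRow] {
  private var current = start - 1

  override def next(): Boolean = { current += 1; current < end }
  override def get(): InternalRow = InternalRow(current)
  override def close(): Unit = ()
}

object DataSourceV2Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("data-source-v2")
      .getOrCreate()
    try spark.read.format("demo.v2").option("end", "6").load().show()
    finally spark.stop()
  }
}
```

Mixing in WriteSupport would add the saving side in the same way ReadSupport adds loading.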
“The Internals Of” Online Books
1. The Internals of Spark SQL
2. The Internals of Spark Structured Streaming
3. The Internals of Apache Spark
Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on StackOverflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn
