Spark Structured APIs
Using Databricks
Presented By:
Raviyanshu Singh
Software Consultant
Knoldus Inc
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode. Feel free to step out of the
session if you need to take
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
Our Agenda
01 What is Spark
02 What’s an RDD
03 Dataframes
04 Datasets
05 Databricks
06 Demo
What is Spark?
Unified Analytics Engine
Apache Spark is a unified engine designed for large-scale distributed data
processing, on premises in data centers or in the cloud.
Spark’s design philosophy is based
on these principles:
● Speed
● Ease of Use
● Modularity
● Extensibility
Spark APIs Trio
RDD, Dataframe & Datasets
The Timeline of Three
● RDD (2011): Distributed collections of JVM objects. Functional operators (map, filter, etc.).
● Dataframe (2013): Distributed collections of Row objects. Expression-based operations and UDFs. Fast, efficient internal representations.
● Datasets (2015): Internally rows, externally JVM objects. “Best of both worlds”: type safe + fast.
Whatʼs RDD?
[Resilient Distributed Datasets]
● An RDD represents an immutable, partitioned collection of records that can be operated on in
parallel.
● RDDs give you complete control because every record in an RDD is just a Java or Python object.
Characteristics of an RDD: Dependencies, Partitions, and a Compute Function (Partition => Iterator[T])
RDD Characteristics
1. Dependencies
➢ A list of dependencies that tells Spark how an RDD is constructed from its inputs.
➢ Spark can recreate an RDD from these dependencies by replaying the operations on them.
(This characteristic gives RDDs resiliency.)
2. Partitions
➢ Partitions give Spark the ability to split the work and parallelize computation across executors.
➢ Spark also uses locality information to send work to executors close to the data.
(This characteristic gives RDDs distribution.)
3. Compute Function
➢ An abstract method that computes one partition (the input split) within a TaskContext and produces an
iterator over values of type T:
compute(split: Partition, context: TaskContext): Iterator[T]
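The three characteristics above can be sketched with a toy model in plain Scala. This is only an illustration under stated assumptions: ToyRDD, ToyPartition, ToySource, and ToyMapped are hypothetical names, not Spark classes, and the TaskContext parameter is left out for brevity.

```scala
// Toy model of the three RDD characteristics. These are NOT Spark's real classes.
final case class ToyPartition(index: Int)

abstract class ToyRDD[T] {
  def dependencies: Seq[ToyRDD[_]]               // 1. lineage: the RDDs this one is built from
  def partitions: Seq[ToyPartition]              // 2. units of parallel work
  def compute(split: ToyPartition): Iterator[T]  // 3. how to produce one partition's records

  // Nothing is stored: every partition is recomputed from the lineage on demand.
  def collect(): Seq[T] = partitions.flatMap(p => compute(p).toSeq)
}

// A source with no parents: deals an in-memory sequence out to partitions round-robin.
final class ToySource[T](data: Seq[T], numPartitions: Int) extends ToyRDD[T] {
  val dependencies: Seq[ToyRDD[_]] = Nil
  val partitions: Seq[ToyPartition] = (0 until numPartitions).map(ToyPartition(_))
  def compute(split: ToyPartition): Iterator[T] =
    data.iterator.zipWithIndex.collect {
      case (x, i) if i % numPartitions == split.index => x
    }
}

// A mapped RDD: one parent, the same partitions, and compute delegates to the parent.
final class ToyMapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  val dependencies: Seq[ToyRDD[_]] = Seq(parent)
  val partitions: Seq[ToyPartition] = parent.partitions
  def compute(split: ToyPartition): Iterator[B] = parent.compute(split).map(f)
}

object ToyRddDemo {
  def main(args: Array[String]): Unit = {
    val doubled = new ToyMapped(new ToySource(Seq(1, 2, 3, 4, 5), 2), (x: Int) => x * 2)
    // Each partition can be computed independently, and recomputed if it is lost.
    doubled.partitions.foreach(p => println(s"$p -> ${doubled.compute(p).toList}"))
  }
}
```

Because ToyMapped keeps only a reference to its parent and a function, a lost partition can be rebuilt by calling compute again, which is the idea behind the resiliency described above.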
Visualizing RDD
Simple &
Elegant
Whatʼs the Problem?
RDDs Express How-to, Not What-to
● The compute function (or computation) is opaque to Spark.
● Slow for non-JVM languages like Python.
● No optimization by Spark.
● No data compression techniques.
Together, these lead to inadvertent inefficiencies.
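To make the "how-to versus what-to" point concrete, here is a minimal sketch that computes the same per-name average twice: once with RDD lambdas and once with the structured API introduced in the next slides. The sample data, the object name HowVersusWhat, and the local SparkSession setup are assumptions for illustration; in a Databricks notebook a `spark` session is already provided.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object HowVersusWhat {
  // Hypothetical sample data: (name, age) pairs.
  val people: Seq[(String, Int)] =
    Seq(("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25))

  // RDD style: spell out HOW to compute the average. Spark sees only opaque lambdas.
  def avgWithRdd(spark: SparkSession): Map[String, Double] =
    spark.sparkContext.parallelize(people)
      .map { case (name, age) => (name, (age, 1)) }
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .toMap

  // DataFrame style: state WHAT is wanted. Spark can inspect and optimize the query.
  def avgWithDataFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    people.toDF("name", "age").groupBy("name").agg(avg("age").as("avg_age"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("HowVersusWhat").getOrCreate()
    println(avgWithRdd(spark))
    avgWithDataFrame(spark).show()
    spark.stop()
  }
}
```

Both functions produce the same averages; the difference is that the second one exposes the grouping and the aggregate to Spark instead of hiding them inside user functions.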
Dataframe
Solution is in structuring
What do we mean by structuring?
● Ordering and structuring let you arrange your data in a
tabular format.
● Computation is expressed using common patterns such as filtering, selecting,
and counting.
The DataFrame API
Distributed in-memory tables with named columns and schemas, where each column has a
specific data type (String, Int, Timestamp, etc.).
To the human eye, a DataFrame looks like a table.
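As a rough sketch of "named columns and schemas", the snippet below declares a schema explicitly and builds a small DataFrame from it. The column names (Id, First, Hits), the sample rows, and the object name SchemaSketch are assumptions chosen to resemble the later examples in this deck, not data from the presentation.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaSketch {
  // Each column has a name and one specific data type (hypothetical columns).
  val schema: StructType = StructType(Seq(
    StructField("Id", IntegerType, nullable = false),
    StructField("First", StringType, nullable = true),
    StructField("Hits", IntegerType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SchemaSketch").getOrCreate()
    // Hypothetical rows that conform to the schema above.
    val rows = Seq(Row(1, "Jules", 4535), Row(2, "Brooke", 8908))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.printSchema() // column names and types
    df.show()        // tabular view of the rows
    spark.stop()
  }
}
```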
Visualizing Dataframes
With Custom Data
Spark Operations on Data
Manoeuvring Data
Spark operations fall into two groups: transformations and actions.
Transformations
● Transform a Spark DF into a new
DF without altering the original data.
● This gives DataFrames their immutability property.
● Examples: orderBy(), groupBy(), filter(), select()
Actions
● Actions are operations that return a
raw value.
● An action triggers the lazy evaluation of all the
recorded transformations.
● Examples: show(), take(), count(), collect()
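The split between transformations and actions can be seen in a short sketch. The sample rows, the object name LazySketch, and the local SparkSession are assumptions for illustration; on Databricks the `spark` session already exists.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LazySketch {
  // Only transformations here: each call returns a new DataFrame and records a plan.
  def popularIds(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(("a", 5), ("b", 20000), ("c", 15000)).toDF("Id", "Hits") // hypothetical data
    df.filter(col("Hits") > 10000).select("Id").orderBy("Id")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("LazySketch").getOrCreate()
    val popular = popularIds(spark) // no job has run yet
    popular.explain()               // prints the plan Spark has recorded so far
    println(popular.count())        // action: triggers execution of the recorded plan
    popular.show()                  // action: runs the plan again and prints the rows
    spark.stop()
  }
}
```

Nothing is computed until count() or show() is called, which is what allows Spark to look at the whole chain of transformations before deciding how to run it.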
Common Dataframe Ops
Projections & Filter
➢ Projections return only the chosen columns; filters return only the rows matching a certain relational condition.
➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method.
// The $"..." column syntax requires: import spark.implicits._
val topHits = df.select("Id", "First", "Url")
  .where($"Hits" > 10000)
Renaming, Adding, and Dropping Columns
➢ withColumnRenamed() renames a column, withColumn() adds a new column, and
drop() removes the column named in its argument.
import org.apache.spark.sql.functions.{col, to_timestamp}
val newDf = df.withColumnRenamed("First", "First_Name")
  .withColumnRenamed("Last", "Last_Name")
// Parse the "Published" string into a timestamp column, then drop the original column
val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy"))
  .drop("Published")
Common Dataframe Ops
Aggregation
➢ Transformations and actions on DataFrames, such as groupBy(), orderBy(), and count(), let you group rows by column name and
then aggregate counts across the groups.
import org.apache.spark.sql.functions.{col, desc}
// Count rows per campaign, most frequent first
val mostShare = dfWithTS.select("Campaigns", "First_Name")
  .where(col("Campaigns").isNotNull)
  .groupBy("Campaigns")
  .count()
  .orderBy(desc("count"))
The Datasets API
A Type-Safe one
According to the Dataset Documentation:
➢ A strongly typed collection of domain-specific objects that can be
transformed in parallel using functional or relational operations. Each
Dataset [in Scala] also has an untyped view called a DataFrame, which
is a Dataset of Row.
Structured APIs
● DataFrame (untyped API): DataFrame = Dataset[Row]; an alias in Scala.
● Dataset (typed API): Dataset[T]; available in Scala & Java.
Visualizing Datasets
Case Class (Type-Safe Hero)
Datasets Ops
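A minimal sketch of a typed Dataset built from a case class follows. The Blogger case class, its fields, the sample rows, and the object name DatasetSketch are hypothetical and only illustrate the idea; they are not taken from the presenter's demo.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type. It is defined at the top level so Spark can derive an encoder.
final case class Blogger(id: Int, first: String, hits: Long)

object DatasetSketch {
  def topBloggers(spark: SparkSession): Dataset[Blogger] = {
    import spark.implicits._
    val ds = Seq(
      Blogger(1, "Jules", 4535L),
      Blogger(2, "Brooke", 8908L),
      Blogger(3, "Denny", 25568L)
    ).toDS()
    // Typed filter: the compiler checks that `hits` exists and is numeric.
    ds.filter((b: Blogger) => b.hits > 5000)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("DatasetSketch").getOrCreate()
    import spark.implicits._
    topBloggers(spark).map((b: Blogger) => b.first).show() // typed map to Dataset[String]
    topBloggers(spark).toDF().printSchema()                // the untyped view: Dataset[Row]
    spark.stop()
  }
}
```

A misspelled field such as b.hitz fails at compile time here, whereas a misspelled column name in a DataFrame expression is only reported when Spark analyzes the query at runtime.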
Databricks?
A Lakehouse Company
● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale.
● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud
infrastructure on your behalf.
Common Tools In Databricks
Core Data Tasks
● REST API
● Interactive Notebooks
● ML Model Serving
● Workflows Scheduler
● Source Control (Git)
● SQL Editor & Dashboard
● Compute Management
● Data Ingestion
DEMO
Thank You !
