Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I ?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I ?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O'Reilly Japan
'Learning Spark' is the 27th book
Awarded the Rakuten Tech Award Silver 2010 for translating 'Hadoop: The Definitive Guide'
A bed for 6 cats
[Slides: book covers of works from 2015 (available Jan 2016?) and past works]
Motivation for today's talk
I want to deal with my 'Big' data, WITH PYTHON !!
Apache Spark
You may already have heard a lot: a fast, distributed data processing framework with high-level APIs
Written in Scala, runs in the JVM
[Stack diagram: OS; HDFS and HBase; YARN; MapReduce, Hive, etc.; Impala etc. (in-memory SQL engines); Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it's fast
No need to write temporary data to storage every time
No need to invoke a JVM process every time
[Diagram, MapReduce vs. Spark: in MapReduce, every map and reduce stage invokes its own JVM and does HDFS I/O. In Spark, the executor JVM is invoked once; functions f1 through f7 work against in-memory RDDs, with I/O only for reading data into an RDD (f1), persisting to storage (f4), and shuffling (f5).]
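To make the contrast concrete, here is a minimal PySpark sketch (assuming an existing SparkContext sc and a hypothetical HDFS path) where data is read once, cached in memory, and reused by several actions without touching storage again:

# Minimal sketch: read once, keep the RDD in memory, reuse it.
lines = sc.textFile("hdfs:///data/events.txt")   # hypothetical path
words = lines.flatMap(lambda line: line.split())
words.persist()                   # keep the RDD in executor memory

print(words.count())              # first action: reads HDFS, fills the cache
print(words.distinct().count())   # later actions reuse the in-memory RDD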
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells: PySpark for Python users, SparkR for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
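Beyond the shells, an application is just a script that builds its own context; a minimal PySpark sketch (file name and path hypothetical, run with spark-submit):

# wordcount.py: a minimal standalone PySpark application.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount")
sc = SparkContext(conf=conf)

counts = (sc.textFile("hdfs:///data/events.txt")    # hypothetical path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

sc.stop()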
The Web UI tells us a lot
http://<address>:4040
Performance problems with those languages
Data processing with those languages may be several times slower than with JVM languages
The reason lies in the architecture: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up the performance gap
DataFrame APIs come to the rescue!
DataFrame
Tabular data with a schema, built on RDDs
Successor of SchemaRDD (since 1.4)
Rich set of APIs for data operations
Or, you can simply use SQL!
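A minimal sketch of both styles, assuming a Spark 1.4-era SQLContext named sqlContext and a hypothetical tuple RDD:

# Build a DataFrame from an RDD of tuples (column names hypothetical).
df = sqlContext.createDataFrame(
    sc.parallelize([("abc", 1), ("xyz", 2)]), ["name", "value"])

# DataFrame API style:
df.filter(df["name"] == "abc").show()

# Or simply use SQL (Spark 1.x temp-table API):
df.registerTempTable("t")
sqlContext.sql("SELECT * FROM t WHERE name = 'abc'").show()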
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
Obviously, the performance is almost the same as with JVM languages
Only code goes through
DataFrame APIs compared to RDD APIs, by example
[Diagram, RDD API: the driver ships Python code (lambda items: items[0] == 'abc') to the executor; rows of the cached DataFrame are transferred from the JVM to the Python worker for evaluation, and the result is transferred back.]
[Diagram, DataFrame API: the driver sends only the expression filter(df["_1"] == "abc"); the filter runs entirely inside the executor JVM against the cached DataFrame, and no row data crosses into Python.]
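The same contrast in code, as a minimal sketch against a hypothetical cached DataFrame df with a string column _1:

# RDD-style filter: the lambda runs in Python workers, so every row
# is serialized out of the JVM, evaluated, and serialized back.
slow = df.rdd.filter(lambda row: row[0] == "abc")

# DataFrame-style filter: the expression is evaluated inside the JVM;
# no row data crosses into Python.
fast = df.filter(df["_1"] == "abc")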
Watch out for UDFs
You can write UDFs in Python
You can use lambdas in Python, too
Once you use them, data flows between the two worlds

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name)).collect()
Make it small first, then use UDFs
Filter or sample your 'big' data with DataFrame APIs, then use UDFs
The SQL optimizer does not take UDF cost into account when making plans (so far)
[Diagram: 'BIG' data in a DataFrame, filtered with 'native' DataFrame APIs, becomes 'small' data in a DataFrame; only then run whatever operations you need with UDFs.]
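A hedged sketch of the pattern (DataFrame and column names hypothetical): shrink the data with a native filter first, then apply the Python UDF to the small result only.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

# Shrink the data with native DataFrame APIs first...
small = df.filter(df.fname.startswith("tama"))

# ...then pay the JVM-to-Python cost only for the small result.
small.select(slen(small.name)).collect()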
Make it small first, then use UDFs (cont.)
The same pattern in SQL: the native LIKE predicate filters the data before the UDF runs
(note: in Spark 1.x a Python UDF must also be registered, e.g. via sqlContext.registerFunction, before it can be called from SQL)

slen = udf(lambda s: len(s), IntegerType())
sqc.sql(
    'select … from df '
    'where fname like "tama%" '
    'and slen(name)').collect()
# the LIKE predicate is processed first!
Ingesting Data
It's slow to deal with files like CSVs from a non-JVM driver
Anyway, convert raw data to 'DataFrame-native' formats like Parquet first
You can process such files directly from JVM processes (executors) even when using non-JVM languages
[Diagram: local data passes through the non-JVM driver over Py4J into the executor JVM's DataFrame, then out to HDFS (Parquet).]
Ingesting Data (cont.)
[Diagram: once the data is in Parquet on HDFS, only code crosses Py4J between the driver machine and the executor JVM; the executors read the Parquet files directly.]
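A minimal conversion sketch under those assumptions (Spark 1.4-era API, hypothetical paths and columns): parse the CSV once, write it out as Parquet, and do all later work against the Parquet copy.

# One-time conversion: raw CSV -> Parquet.
raw = sc.textFile("hdfs:///raw/people.csv")          # hypothetical path
rows = raw.map(lambda line: line.split(",")) \
          .map(lambda f: (f[0], f[1], int(f[2])))
df = sqlContext.createDataFrame(rows, ["fname", "name", "age"])
df.write.parquet("hdfs:///warehouse/people.parquet")

# From here on, executors read Parquet directly; only code crosses Py4J.
people = sqlContext.read.parquet("hdfs:///warehouse/people.parquet")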
Appendix: Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage reduces I/O significantly
High compression rate
Projection pushdown
Today's workloads are becoming CPU-intensive: Parquet offers very fast reads and is designed with CPU internals in mind
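For example, selecting a few columns from the (hypothetical) Parquet-backed DataFrame above lets Spark read only those column chunks:

# Projection pushdown: only the 'name' and 'age' column chunks are
# read from the Parquet files; other columns are never touched.
people.select("name", "age").show()

# Filters on Parquet data can also be pushed toward the scan.
people.select("name").filter(people.age > 30).show()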
