PySpark
@
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB
pydata.tokyo
▸ Twitter
20170927 PyData.Tokyo: The Basics of the Basics of Distributed Processing for Data Science People, and the Essentials of PySpark
8 11
Wes McKinney blog
▸ http://qiita.com/tamagawa-ryuji
▸ CPU
▸ PyData.Tokyo
PySpark
▸ Spark Hadoop
▸ PySpark
▸ Spark/Hadoop PyData
PySpark
▸ SSD
▸ CPU
Parquet
S3
CPU
https://www.slideshare.net/kumagi/ss-78765920/4
▸ groupby
N
▸ N
N
▸ …
…
▸ CPU/
▸ CPU/
▸ 1
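Only fragments of these bullets survive (groupby, N), so the following is a guess at the idea rather than a reconstruction of the slide: split the data into N partitions, aggregate each partition independently, then merge the partial results. It is a minimal plain-Python sketch, not Spark code. Threads stand in for machines, and the function names `split`, `partial_sum` and `merge` are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split(records, n):
    """Split records into n partitions (round-robin)."""
    return [records[i::n] for i in range(n)]

def partial_sum(partition):
    """Local groupby-sum on one partition; no other partition is needed."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals

def merge(partials):
    """Combine the per-partition results; on a cluster this is where data moves between nodes."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the counts key by key
    return dict(merged)

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
N = 3  # number of partitions / workers

with ThreadPoolExecutor(max_workers=N) as pool:
    partials = list(pool.map(partial_sum, split(records, N)))

result = merge(partials)
print(result)
```

The per-partition step scales with the number of workers, while the merge step is the part that needs communication, which is why a groupby tends to be the expensive operation in a distributed setting.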
Hadoop Spark
▸ n /n
▸ Amazon EMR
▸ Microsoft Azure HDInsight
▸ Cloudera Altus
▸ Databricks Community Edition Spark
▸ PyData + Jupyter PySpark
Spark Hadoop
Spark Hadoop
[Figure: from Hadoop 0.x to Spark; one software stack per panel]
▸ Hadoop 0.x: OS, HDFS, MapReduce
▸ Hadoop 1.x: OS, HDFS, MapReduce, HBase, Hive etc.
▸ Hadoop 2.x + Spark: OS, HDFS, YARN, MapReduce, HBase, Hive etc., Impala (SQL), Spark (Spark Streaming, MLlib, GraphX, Spark SQL)
▸ Further panels: Spark (Spark Streaming, MLlib, GraphX, Spark SQL) with YARN, with Mesos, and with Windows
Spark Hadoop
Hadoop Spark
[Figure: MapReduce vs. Spark]
▸ MapReduce panel: map (JVM), reduce (JVM), map (JVM), reduce (JVM), HDFS
▸ Spark panel: f1, f2, f3, f4, f5, f6, f7, RDD, Executor JVM, HDFS
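The contrast in this figure can be imitated in plain Python. This is only an analogy, not Hadoop or Spark code: `staged` writes every intermediate result to a temporary file before the next step reads it back, standing in for separate jobs that hand data over through storage, while `pipelined` chains the same functions in memory, the way f1, f2, ... are applied inside one executor. The step functions and both function names are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical per-record steps standing in for f1, f2, f3 in the figure.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def staged(data, steps):
    """Run each step as a separate pass that reads its input file and writes an output file."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "stage0.json")
        with open(path, "w") as f:
            json.dump(data, f)
        files_written = 1
        for i, step in enumerate(steps, start=1):
            with open(path) as f:
                records = json.load(f)
            path = os.path.join(tmpdir, f"stage{i}.json")
            with open(path, "w") as f:
                json.dump([step(r) for r in records], f)
            files_written += 1
        with open(path) as f:
            result = json.load(f)
    return result, files_written

def pipelined(data, steps):
    """Chain the steps lazily in memory; nothing is materialized until list()."""
    stream = iter(data)
    for step in steps:
        stream = map(step, stream)
    return list(stream)

staged_result, files_written = staged([1, 2, 3], steps)
pipelined_result = pipelined([1, 2, 3], steps)
print(staged_result, files_written, pipelined_result)
```

Both versions compute the same answer; the difference is how often the data is written out and read back along the way.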
Spark Hadoop
Spark
▸ Hadoop MapReduce
▸ Spark API MapReduce API
▸ Hadoop
PySpark
(Py)Spark
▸ / Spark
▸ PyData
▸ Spark
▸ Spark Hadoop
PyData
PySpark
Spark 1.2
PySpark …
(Py)Spark
PySpark
PySpark
RDD API DataFrame API
▸ RDD Resilient Distributed Dataset =
Spark Java
▸ DataFrame RDD
/ R data.frame
▸ Python RDD API DataFrame API Scala
/ Java
PySpark
DataFrame API
RDD | DataFrame / Dataset
MLlib | ML
GraphX | GraphFrame
Spark Streaming | Structured Streaming
[Figure: PySpark with the RDD API. Driver (JVM); Storage; three worker nodes, each with an Executor (JVM) and a Python VM]
[Figure: PySpark with the DataFrame API. Driver (JVM); Storage; three worker nodes, each with an Executor (JVM) and a Python VM]
PySpark
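▸ RDD API: data moves between the Executor JVM and the Python VM
▸ DataFrame API: processing stays inside the JVM
▸ UDFs run in the Python VM
▸ UDFs can also be written in Scala or Java
▸ On Spark 2.x, use DataFrame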
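The bullets above set the RDD API (Executor JVM and Python VM) against the DataFrame API (JVM). One way to picture the difference is a single-process simulation, shown below. It does not use Spark and makes no claim about Spark's actual wire format: `pickle` stands in for whatever serialization happens at the JVM/Python boundary, and `python_side`, `rdd_style` and `dataframe_style` are hypothetical names. In the first style every batch of records is serialized, processed by a Python function and serialized back; in the second only a small description of the work crosses the boundary.

```python
import pickle

rows = [{"id": i, "price": i * 10} for i in range(1000)]

def python_side(batch_bytes):
    """Stands in for the Python worker: deserialize, apply a Python function, serialize the result."""
    batch = pickle.loads(batch_bytes)
    doubled = [row["price"] * 2 for row in batch]
    return pickle.dumps(doubled)

def rdd_style(rows, batch_size=100):
    """Every batch crosses the boundary twice; count the bytes that had to be copied."""
    results, bytes_moved = [], 0
    for start in range(0, len(rows), batch_size):
        sent = pickle.dumps(rows[start:start + batch_size])
        received = python_side(sent)
        bytes_moved += len(sent) + len(received)
        results.extend(pickle.loads(received))
    return results, bytes_moved

def dataframe_style(rows):
    """Only a description of the work is sent; the 'engine' evaluates it where the data lives."""
    plan = ("multiply", "price", 2)
    op, column, factor = plan
    if op != "multiply":
        raise ValueError(f"unsupported operation: {op}")
    results = [row[column] * factor for row in rows]
    return results, len(pickle.dumps(plan))

rdd_results, rdd_bytes = rdd_style(rows)
df_results, df_bytes = dataframe_style(rows)
print(rdd_bytes, df_bytes)
```

The two styles return the same values, but the first moves all of the data across the boundary and back, while the second moves only the plan. A Python UDF inside a DataFrame query brings back the first pattern for the columns it touches.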

Spark PyData
Spark PyData
Spark PyData
▸ Spark
▸ Python PyData
▸
▸ Parquet
▸ Apache Arrow
Spark PyData
▸ CSV JSON
▸ Parquet: Spark DataFrame API; Python fastparquet / pyarrow
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
Spark PyData
Parquet
https://parquet.apache.org/documentation/latest/
zip CSV
I/O
[Figure: Parquet file layout. The file is a sequence of ROW BLOCKs. Inside each ROW BLOCK the values are stored column by column: COLUMN #0 ROW #0 … ROW #N, then COLUMN #1 ROW #0 … ROW #N, and so on up to COLUMN #M ROW #N.]
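The layout in the figure can be sketched with plain Python lists. This is a toy model of the idea only, not the Parquet format itself (no encoding, compression or metadata); the schema in `COLUMNS` and the function names are assumptions made for the example. The slide calls the unit a ROW BLOCK; the Parquet documentation uses the term row group.

```python
# Hypothetical sample data: one tuple per row.
rows = [
    (1, "tokyo", 3.5),
    (2, "osaka", 4.0),
    (3, "tokyo", 2.5),
    (4, "sapporo", 5.0),
]
COLUMNS = ("id", "city", "score")  # hypothetical schema

def to_row_blocks(rows, block_size):
    """Split rows into blocks, then store each block column by column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        # zip(*chunk) transposes the row tuples into one tuple per column
        blocks.append({name: list(values) for name, values in zip(COLUMNS, zip(*chunk))})
    return blocks

def read_column(blocks, name):
    """Read one column; the other columns in each block are never touched."""
    values = []
    for block in blocks:
        values.extend(block[name])
    return values

blocks = to_row_blocks(rows, block_size=2)
first_block = blocks[0]
scores = read_column(blocks, "score")
print(first_block)
print(scores)
```

Because the values of one column sit next to each other, a query that needs only `score` skips `id` and `city` entirely, and runs of similar values compress well. Both effects reduce I/O compared with a zipped CSV, which must be decompressed and parsed in full.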
Spark PyData
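Spark
# Assumes an existing SparkSession `spark`, plus csvFilename, theSchema and filename defined elsewhere.
# Read the CSV with an explicit schema, reduce to 20 partitions, and save (Parquet is Spark's default format).
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')
fastparquet
import pandas as pd
from fastparquet import write
# Load the CSV into a pandas DataFrame and write it as uncompressed Parquet.
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
pyarrow
import pyarrow as pa
import pyarrow.parquet as pq
# Convert the pandas DataFrame `pdf` to an Arrow table and write it as gzip-compressed Parquet.
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')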
Spark PyData
▸ pandas CSV Spark
Spark pandas
…
▸ Spark - pandas
▸ pandas → Spark …
▸ Apache Arrow
Spark PyData
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS
▸ /
https://arrow.apache.org
Spark PyData
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog


Wes OK
▸ Apache Arrow pandas 10 

https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
PySpark Python Spark