Python, Pandas, Spark 2.0
Sky
20161215 Python / Pandas / Spark miscellaneous talk
•
• Python 2000
(**)
• db tech showcase MongoDB
•
• FB: Ryuji Tamagawa
• Twitter : tamagawa_ryuji
2017
• Python Spark
•
•
• Python / Pandas
• Spark 2.0
Part 1 :
•
•
•
csv
Python
Pandas Python
Jupyter Notebook
Jenkins
Spark 2.0
• Spark API: RDD (~1.3), DataFrame / Dataset (1.4~)
• DataFrame API
RDD API Python Spark
DataFrame
• RDB /
• R Pandas Spark
Spark
R / Pandas
Spark
+
Part 2 :
CSV
zip
RDB
Parquet
Excel
CSV
Feather
Spark
Pandas / Spark
•
• CPU
•
• Pandas read_csv zip CSV
Pandas
2
• CSV CPU
Pandas zip CSV
CPU …
• Parquet !
•
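The bullets above mention Pandas read_csv together with zip and CSV. The following is a minimal sketch, assuming the point is that read_csv can read a zip archive holding a single CSV file directly, without unpacking it first. The file name sample.zip, the column names, and the two helper functions are hypothetical and chosen only for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical sample data; the columns are made up for this sketch.
SAMPLE_CSV = "id,name,value\n1,abc,10\n2,def,20\n3,abc,30\n"


def write_zipped_csv(zip_path, csv_name, text):
    """Create a zip archive that contains exactly one CSV file."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(csv_name, text)


def read_zipped_csv(zip_path):
    """Read the CSV straight out of the archive.

    read_csv decompresses on the fly; with compression="zip" the archive
    must contain a single data file.
    """
    return pd.read_csv(zip_path, compression="zip")


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "sample.zip")
    write_zipped_csv(zip_path, "sample.csv", SAMPLE_CSV)
    df = read_zipped_csv(zip_path)

print(df)
```

Both the decompression and the CSV parsing in this sketch run on the CPU, which is consistent with the slide pairing CSV with CPU. No timing figures are claimed here.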
: Parquet
I/O
•
• Spark Parquet
• Python Parquet
HDFS / S3
Parquet Parquet
SSD
Parquet Parquet
Parquet
No
No
Yes
HDD
•
• I/O Pandas
• Spark
• DataFrame Pandas → Spark
Spark → Pandas Pandas → Spark
• Apache Arrow
CPU
~2010
2010~
SSD
CPU 

Apache Spark 2.0
• 1.x
• 2.0
1.x
• DataFrame API Python
• databricks 

http://go.databricks.com/mastering-apache-spark-2.0
•
Spark 2.0
• CPU
• CPU
• SQL DataFrame
• + SSD
• CSV zip
Pandas read_csv
Python + Spark
• Python serialize
• DataFrame API UDF
UDF Scala/Java
• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api
Executor
JVM
DataFrame,
Cached
Python
lambda items: items[0] == 'abc'
transfer
DataFrame,
result
transfer
Driver
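The diagram above shows rows being transferred from the JVM executor to a Python process so that a lambda can be evaluated, and the result being transferred back. The following is a rough, standard-library-only sketch of why that round trip costs CPU time. It is not PySpark code: pickle stands in for the serializer as an assumption, and the function names, batch size, and sample rows are hypothetical.

```python
import pickle
import time


def filter_in_process(rows, predicate):
    """Apply the predicate directly, with no serialization boundary."""
    return [row for row in rows if predicate(row)]


def filter_across_boundary(rows, predicate, batch_size=1000):
    """Simulate shipping rows to a separate Python worker and back.

    Each batch is pickled and unpickled (the transfer to the worker),
    filtered, and then the surviving rows are pickled and unpickled
    again (the transfer of the result).
    """
    kept = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        received = pickle.loads(pickle.dumps(batch))
        result = [row for row in received if predicate(row)]
        kept.extend(pickle.loads(pickle.dumps(result)))
    return kept


# The predicate from the slide.
predicate = lambda items: items[0] == 'abc'

# Hypothetical sample rows: every third row starts with 'abc'.
rows = [('abc' if i % 3 == 0 else 'xyz', i) for i in range(30000)]

t0 = time.perf_counter()
direct = filter_in_process(rows, predicate)
t1 = time.perf_counter()
shipped = filter_across_boundary(rows, predicate)
t2 = time.perf_counter()

print("in-process:      %.4f s" % (t1 - t0))
print("with round trip: %.4f s" % (t2 - t1))
print("same result:", direct == shipped)
```

Both paths return the same rows; the second one simply does extra serialization work per batch. Expressing the same filter through the DataFrame API rather than a Python lambda lets the comparison be evaluated inside the JVM, so this kind of transfer is avoided for that step.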