PySpark
Next generation cloud
computing engine using Python
Wisely Chen
Yahoo! Taiwan Data team
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013; OSDC 2007, 2014; Webconf 2013; PHPConf 2012; RubyConf 2012
Taiwan Data Team
• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning
Agenda
• What is Spark?
• What is PySpark?
• How to write PySpark applications?
• PySpark demo
• Q&A
What is Spark?
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce, Spark
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From Cloudera CTO http://0rz.tw/y3OfM
What is Spark?
Spark is 3X~25X faster than MapReduce
From Matei’s paper: http://0rz.tw/VVqgP
Running time (s), MR vs. Spark:
• Logistic regression: MR 76, Spark 3
• KMeans: MR 106, Spark 33
• PageRank: MR 171, Spark 23
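The "3X~25X" claim can be checked against the chart values with a few lines of arithmetic. This is only a sketch: it assumes the smaller number in each pair is the Spark running time and the larger one is MapReduce, which is how the chart reads but is not stated in the text.

```python
# Running times in seconds as read from the chart.
# Assumption: the smaller value of each pair belongs to Spark.
running_time_s = {
    "Logistic regression": {"MR": 76, "Spark": 3},
    "KMeans": {"MR": 106, "Spark": 33},
    "PageRank": {"MR": 171, "Spark": 23},
}

# Speedup = MapReduce time divided by Spark time.
speedups = {name: t["MR"] / t["Spark"] for name, t in running_time_s.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Under that reading, the ratios span roughly 3x (KMeans) to 25x (logistic regression), which matches the range quoted on the slide.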
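Most machine learning algorithms need iterative computing
[Figure: PageRank on a four-node graph (a, b, c, d) over the 1st, 2nd and 3rd iterations. Every rank starts at 1.0; each iteration reads the current ranks, writes a temporary result, and that result is the input of the next iteration. Node a goes 1.0 -> 1.85 -> 1.31.]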
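The iterative pattern can be sketched in plain Python. The link graph below is hypothetical (it is not the graph on the slide, so the numbers will differ), and the 0.85 damping factor is an assumption based on the common PageRank formulation. The point is only that each iteration consumes the previous iteration's output.

```python
# Hypothetical link graph: page -> pages it links to (not the slide's graph).
links = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a"],
}

def pagerank_step(ranks, links, damping=0.85):
    """One iteration: every page splits its rank evenly among its out-links."""
    contribs = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            contribs[target] += share
    return {page: (1 - damping) + damping * c for page, c in contribs.items()}

ranks = {page: 1.0 for page in links}  # every page starts at rank 1.0
for i in range(3):
    # The result of iteration i is the input of iteration i + 1.
    ranks = pagerank_step(ranks, links)
    print(i + 1, {page: round(rank, 2) for page, rank in ranks.items()})
```

Because every step depends on the previous result, where that intermediate result is stored (disk or memory) dominates the cost of the whole loop.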
HDFS is 100x slower than memory
• MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> Iter N
• Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> Iter N
PageRank algorithm on 1 billion URL records
• First iteration (HDFS) takes 200 sec
• 2nd iteration (mem) takes 7.4 sec
• 3rd iteration (mem) takes 7.7 sec
What is PySpark?
Spark API
• Multi Language API
• JVM: Scala, Java
• PySpark: Python
PySpark
• Process via Python
• CPython
• Python lib (NumPy, SciPy…)
• Store and transfer data in Spark
• HDFS access/Networking/Fault-recovery
• Scheduling/broadcast/checkpointing
Spark Architecture
[Diagram: Client, Master (JVM), and three Workers, each running a Task on its own data block (Block1, Block2, Block3).]
PySpark Architecture
[Diagram: Python Code, Master (JVM), and three Workers (JVM) holding Block1, Block2, Block3, each paired with a Py Proc.]
PySpark Architecture
[Diagram: the Python Code communicates with the JVM via a Py4J Socket and the Local FS; Master (JVM) and three Workers (JVM) holding Block1, Block2, Block3.]
PySpark Architecture
[Diagram: Master (JVM) and three Workers (JVM) holding Block1, Block2, Block3; the Python Code is shipped to every worker as Py code.]
Python functions and closures are serialized using PiCloud’s CloudPickle module
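A small experiment suggests why a dedicated serializer is needed for shipping functions. The sketch below uses only the standard-library `pickle` to show that a closure cannot be serialized by default. The `cloudpickle` import is optional and assumed to be the standalone package of that name; the helper names `make_filter` and `can_pickle` are made up for illustration.

```python
import pickle

def make_filter(threshold):
    # The returned lambda is a closure: it captures `threshold`.
    return lambda word: len(word) > threshold

def can_pickle(obj):
    """Return True if the standard-library pickler can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

keep_long = make_filter(3)
print(can_pickle(len))        # named built-in: pickled by reference
print(can_pickle(keep_long))  # closure: rejected by the stdlib pickler

try:
    import cloudpickle  # optional third-party package, not required here
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    # cloudpickle serializes the function by value, so it survives a round trip.
    restored = pickle.loads(cloudpickle.dumps(keep_long))
    print(restored("documentation"))
```

The standard pickler stores functions by name, which only works when the receiving process can import the same name. Lambdas and closures written in a driver script have no such importable name, hence serialization by value.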
PySpark Architecture
[Diagram: Master (JVM) and three Workers (JVM) holding Block1, Block2, Block3, each with its own Py Proc.]
On launch, workers start Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
A lot of Python processes
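The pipe mechanism can be illustrated with the standard library alone. This is a toy stand-in, not PySpark's actual worker protocol: a parent process sends both code and data to a child Python process over pipes and reads the results back. `WORKER_CODE` and `run_in_python_subprocess` are hypothetical names.

```python
import subprocess
import sys

# Code the child process runs; it stands in for the user's function.
WORKER_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(len(line.split()))\n"
)

def run_in_python_subprocess(lines):
    """Send lines to a child Python process via stdin, read results from stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return [int(value) for value in proc.stdout.split()]

word_counts = run_in_python_subprocess(
    ["You can find", "the latest Spark documentation"]
)
print(word_counts)  # number of words per input line
```

Every record crosses a process boundary in both directions here, which hints at why running many Python processes next to the JVM workers adds serialization and startup overhead.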
How to write PySpark applications?
Python Word Count
    # spark is the SparkContext
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
• Access data via the Spark API
• Process via Python
Python Word Count
    counts = file.flatMap(lambda line: line.split(" "))
Original text:
    You can find the latest Spark documentation, including the guide
List:
    ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Python Word Count
    .map(lambda word: (word, 1))
List:
    ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Tuple list:
    [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Python Word Count
    .reduceByKey(lambda a, b: a + b)
Tuple list:
    [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Reduced tuple list:
    [('You', 1), ('can', 1), ('find', 1), ('the', 2), ..., ('guide', 1)]
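The three steps can be reproduced without a cluster. The sketch below emulates the semantics of `flatMap`, `map` and `reduceByKey` on an ordinary Python list. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this example, not part of the Spark API, and nothing here is distributed.

```python
from functools import reduce

def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]

def reduce_by_key(func, pairs):
    """Group values by key, then fold each group's values with func."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return [(key, reduce(func, values)) for key, values in groups.items()]

lines = ["You can find the latest Spark documentation, including the guide"]

words = flat_map(lambda line: line.split(" "), lines)  # list of words
pairs = list(map(lambda word: (word, 1), words))       # list of (word, 1) tuples
counts = reduce_by_key(lambda a, b: a + b, pairs)      # one tuple per distinct word

print(counts)
```

The lambdas are the same ones used in the word count; only the containers differ. In PySpark the same functions run per partition on the workers, and `reduceByKey` additionally shuffles tuples so that equal keys meet on the same worker.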
Can I use Python ML libs on PySpark?
PySpark + scikit-learn
    sgd = lm.SGDClassifier(loss='log')
    for ii in range(ITERATIONS):
        sgd = sc.parallelize(…) \
                .mapPartitions(lambda x: …) \
                .reduce(lambda x, y: merge(x, y))
• Use scikit-learn in single mode (master)
• Cluster operation
• Use scikit-learn functions in cluster mode, dealing with partial data
Source code is from: http://0rz.tw/o2CHT
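The train-per-partition-then-merge pattern can be sketched without Spark or scikit-learn. The slide elides the real arguments and the body of `merge`, so everything below is an assumption made for illustration: a one-parameter least-squares model trained by SGD, two hand-written partitions, and a `merge` that sums weights so they can be averaged. It is not the code from the linked source.

```python
from functools import reduce

def train_on_partition(weight, partition, lr=0.05):
    """One SGD pass of 1-D least squares (y ~ weight * x) over one partition."""
    for x, y in partition:
        weight -= lr * (weight * x - y) * x
    return (weight, 1)  # (sum of weights, number of partial models)

def merge(a, b):
    """Combine two partial results so they can be averaged afterwards."""
    return (a[0] + b[0], a[1] + b[1])

# Two hypothetical partitions of (x, y) samples drawn from y = 2x.
partitions = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]

ITERATIONS = 20
weight = 0.0
for ii in range(ITERATIONS):
    # Stand-in for mapPartitions: each partition trains on its own data only.
    partials = map(lambda part: train_on_partition(weight, part), partitions)
    # Stand-in for reduce: combine the partial models pairwise.
    total, n = reduce(merge, partials)
    weight = total / n  # the averaged model starts the next iteration

print(round(weight, 3))
```

Averaging parameters is only one possible merge strategy. It works for this convex toy problem, but whether it suits a given scikit-learn estimator depends on the model.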
PySpark supports MLlib
• MLlib is Spark's machine learning library
• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")
• Check it out on http://0rz.tw/M35Rz
DEMO 1 :
Recommendation using ALS
(Data : MovieLens)
DEMO 2:
Interactive Shell
Conclusion
Join Us
• Our team’s work has been highlighted at top conferences worldwide
• Hadoop Summit San Jose 2013
• Hadoop Summit Amsterdam 2014
• MSTR World Las Vegas 2014
• SparkSummit San Francisco 2014
• Jenkins Conf Palo Alto 2013
Thank you
