Python and Big data - An Introduction to Spark (PySpark)
Hitesh Dharmdasani
About me
• Security Researcher, Malware Reversing Engineer, Developer
• GIT > GMU > Berkeley > FireEye > On Stage
• Bootstrapping a few ideas
• Hiring!
(Diagram: Information Security, Big Data, Machine Learning, and Me)
What will we talk about?
• What is Spark?
• How does Spark do things?
• PySpark and data processing primitives
• Example Demo - Playing with Network Logs
• Streaming and Machine Learning in Spark
• When to use Spark
http://bit.do/PyBelgaumSpark
http://tinyurl.com/PyBelgaumSpark
What will we NOT talk about
• Writing production level jobs
• Fine Tuning Spark
• Integrating Spark with Kafka and the like
• Nooks and crannies of Spark
• But glad to talk about it offline
The Common Scenario
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
Python
Process 1 Process 2 Process 3 Process 4 Process 5 …
You write 1 job. Then you chunk, cut, slice and dice.
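The manual approach on this slide can be sketched in plain Python roughly as follows. This is only an illustration: the `chunk` and `count_errors` helpers and the sample log lines are hypothetical, and a thread pool stands in for the separate processes so that the sketch stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size = len(items)
    return [items[i * size // n_chunks:(i + 1) * size // n_chunks]
            for i in range(n_chunks)]


def count_errors(lines):
    """The 'one job': count the lines that contain the word ERROR."""
    return sum(1 for line in lines if "ERROR" in line)


# Hypothetical sample data standing in for a large log file.
log_lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]

# The slicing, the scheduling and the merging are all done by hand.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_errors, chunk(log_lines, 3)))

total_errors = sum(partial_counts)
```

Every new job needs the same chunking and merging boilerplate again, which is the pain point the rest of the talk addresses.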
Compute where the data is
• Paradigm shift in computing
• Don't load all the data into one place and do operations
• State your operations and send code to the machine
• Sending code to machine >>> Getting data over network
MapReduce
public static class MyFirstMapper {
    public void map(...) { . . . }
}
public static class MyFirstReducer {
    public void reduce(...) { . . . }
}
public static class MySecondMapper {
    public void map(...) { . . . }
}
public static class MySecondReducer {
    public void reduce(...) { . . . }
}

Job job = new Job(conf, "First");
job.setMapperClass(MyFirstMapper.class);
job.setReducerClass(MyFirstReducer.class);
/* Job 1 goes to disk */
if (job.isSuccessful()) {
    Job job2 = new Job(conf, "Second");
    job2.setMapperClass(MySecondMapper.class);
    job2.setReducerClass(MySecondReducer.class);
}
This also looks ugly if you ask me!
What is Spark?
• Open Source, Lightning-Fast Cluster Computing
• Focus on Speed and Scale
• Developed at the AMPLab, UC Berkeley, by Matei Zaharia
• Most active Apache Project in 2014 (even more than Hadoop)
• Recently beat MapReduce in sorting 100 TB of data by being 3X faster and using 10X fewer machines
What is Spark?
(Diagram: Spark on top of Some Data (NTFS, NFS, HDFS, Amazon S3 …))
Languages: Java, Python, Scala
Libraries: MLlib, Streaming, ETL, SQL, …, GraphX
What is Spark?
Spark
Some Data (NTFS, NFS, HDFS, Amazon S3 …)
• Inherently distributed
• Computation happens where the data resides
What is different from MapReduce?
• Uses main memory for caching
• Dataset is partitioned and stored in RAM/Disk for iterative queries
• Large speedups for iterative operations when in-memory caching is used
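Why caching helps iterative work can be shown with a toy model in plain Python. This is a sketch, not Spark code: `ToyDataset`, `slow_load` and `run_iterations` are hypothetical names, and the counter only imitates the cost of re-reading data from disk. Spark's RDDs expose a `cache()` method that plays the role of the `cache()` call here.

```python
class ToyDataset:
    """A toy stand-in for a dataset that is expensive to load."""

    def __init__(self, loader):
        self._loader = loader     # function that reads the data from "disk"
        self._cached = None       # in-memory copy, filled on first use
        self._use_cache = False

    def cache(self):
        """Ask for the data to be kept in memory after the first load."""
        self._use_cache = True
        return self

    def records(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


loads = {"count": 0}


def slow_load():
    """Pretend to read from disk and count how often that happens."""
    loads["count"] += 1
    return [90, 14, 20, 86, 43, 55, 30, 94]


def run_iterations(dataset, n):
    """Make n passes over the dataset, as an iterative algorithm would."""
    return [sum(dataset.records()) for _ in range(n)]


loads["count"] = 0
run_iterations(ToyDataset(slow_load), 5)
uncached_loads = loads["count"]   # one load per pass

loads["count"] = 0
run_iterations(ToyDataset(slow_load).cache(), 5)
cached_loads = loads["count"]     # a single load, reused by every pass
```

With five passes, the uncached dataset is loaded five times and the cached one only once; the gap grows with the number of iterations.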
Spark Internals
The Init
• Creating a SparkContext
• It is Spark's gateway to the cluster
• In interactive mode, a SparkContext is created for you as 'sc'
$ pyspark
...
...
SparkContext available as sc.
>>> sc
<pyspark.context.SparkContext at 0xdeadbeef>
Spark Internals
The Key Idea



Resilient Distributed Datasets
• Basic unit of abstraction of data
• Immutable
• Persistence
>>> data = [90, 14, 20, 86, 43, 55, 30, 94]
>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:364
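To make "distributed" and "immutable" concrete, here is a plain Python sketch of the same list split into partitions. The `split_into_partitions` helper is hypothetical and the slicing rule is an assumption chosen for clarity; it is not a description of how Spark itself distributes the data.

```python
def split_into_partitions(data, num_partitions):
    """Cut a list into num_partitions contiguous slices of near-equal size."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


data = [90, 14, 20, 86, 43, 55, 30, 94]
partitions = split_into_partitions(data, 3)

# A transformation never edits a partition in place; it builds new ones.
incremented = [[x + 1 for x in part] for part in partitions]
```

Each slice could live on a different machine, and because `partitions` is left untouched, it can always be reused or recomputed.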
Spark Internals
Operations on RDDs - Transformations & Actions
Spark Internals
(Diagram: Spark Context, File/Collection, RDD, Transformations)
Spark Internals
Lazy Evaluation
Now what?
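Lazy evaluation can be illustrated with Python's own lazy `map`. This is an analogy rather than Spark code: `map_func` and the `calls` list are hypothetical, and `list(...)` plays the role of an action such as `collect()`.

```python
calls = []


def map_func(x):
    calls.append(x)   # record that work was actually done
    return x + 1


data = [90, 14, 20, 86]

# "Transformation": only describes the work, nothing runs yet.
lazy_result = map(map_func, data)
calls_before_action = len(calls)

# "Action": asking for the values forces the computation.
collected = list(lazy_result)
calls_after_action = len(calls)
```

No call to `map_func` happens until the result is requested, which is the answer to "Now what?": an action is needed to make anything run.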
Spark Internals
(Diagram: Spark Context, File/Collection, RDD, Transformations, Actions)
Spark Internals
Transformation Operations on RDDs
Map
def mapFunc(x):
    return x + 1

rdd_2 = rdd_1.map(mapFunc)
Filter
def filterFunc(x):
    # Keep only even numbers
    return x % 2 == 0

rdd_2 = rdd_1.filter(filterFunc)
Spark Internals
Transformation Operations on RDDs
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• sample
• union
• intersection
• distinct
• groupByKey
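For readers new to these names, the following plain Python sketch shows what three of the listed transformations compute on a small local list. It uses no Spark at all; the sample `lines` are made up, and the real operations run per partition across a cluster.

```python
from collections import defaultdict
from itertools import chain

lines = ["a b", "b c", "a"]

# flatMap: map each element to a sequence, then flatten the sequences.
words = list(chain.from_iterable(line.split() for line in lines))

# distinct: drop duplicates (sorted here only to get a stable order).
unique_words = sorted(set(words))

# groupByKey: collect all values that share a key.
pairs = [(word, 1) for word in words]
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

word_counts = {key: sum(values) for key, values in grouped.items()}
```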
Spark Internals
>>> increment_rdd = distData.map(mapFunc)
>>> increment_rdd.collect()
[91, 15, 21, 87, 44, 56, 31, 95]
>>> increment_rdd.filter(filterFunc).collect()
[44, 56]
OR
>>> distData.map(mapFunc).filter(filterFunc).collect()
[44, 56]
Spark Internals
Fault Tolerance and Lineage
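The idea behind lineage is that a dataset remembers the steps that produced it, so a lost result can be rebuilt from the source instead of being restored from a replica. Below is a toy model in plain Python; `ToyRDD` is a hypothetical class written for this illustration and is far simpler than Spark's real machinery.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source, steps=()):
        self.source = source        # the original, durable input
        self.steps = tuple(steps)   # ordered (kind, function) pairs

    def map(self, func):
        return ToyRDD(self.source, self.steps + (("map", func),))

    def filter(self, func):
        return ToyRDD(self.source, self.steps + (("filter", func),))

    def compute(self):
        """Replay the recorded steps over the source data."""
        items = list(self.source)
        for kind, func in self.steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


rdd = (ToyRDD([90, 14, 20, 86, 43, 55, 30, 94])
       .map(lambda x: x + 1)
       .filter(lambda x: x % 2 == 0))

result = rdd.compute()
result = None               # simulate losing the computed data
recovered = rdd.compute()   # rebuild it by replaying the lineage
```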
Moving to the Terminal
Spark Streaming
(Diagram: inputs Kafka, Flume, HDFS, Twitter, ZeroMQ; outputs HDFS, Cassandra, NFS, TextFile; the data in between is handled as RDDs)
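Spark Streaming treats a live stream as a sequence of small batches, each handled like an ordinary RDD. The plain Python sketch below imitates that idea under a simplifying assumption: batches are cut by element count, whereas Spark Streaming cuts them by time interval. The `micro_batches` helper and the sample events are hypothetical.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical request log arriving as a stream.
events = ["GET /", "GET /login", "POST /login", "GET /admin", "GET /"]

# The same per-batch logic is applied to every micro-batch.
get_counts = [sum(1 for event in batch if event.startswith("GET"))
              for batch in micro_batches(events, 2)]
```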
MLlib
• Machine Learning Primitives in Spark
• Provides training and classification at scale
• Exploits Spark's ability for iterative computation (Linear Regression, Random Forest)
• Currently the most active area of work within Spark
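Why machine learning benefits from fast iterative computation can be seen in a tiny linear regression fitted by gradient descent. This is a plain Python sketch with made-up data and hand-picked settings (`learning_rate`, 2000 iterations); it does not use MLlib and does not show MLlib's API.

```python
# Made-up training data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0          # model: y = w * x + b
learning_rate = 0.05
n = len(xs)

for _ in range(2000):
    # Every iteration makes a full pass over the same dataset.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

The loop reads the same data thousands of times, so keeping that data in memory between passes is where the speedup comes from.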
How can I use all this?
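(Diagram: tweets are loaded from HDFS into Spark + MLlib, where bad tweets are used to build a model; Spark Streaming applies the model to live tweets, labels them Good or Bad, and the bad ones are reported to Twitter)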
To Spark or not to Spark
• Iterative computations
• “Don't fix something that is not broken”
• Lower learning barrier
• Large one-time compute
• Single MapReduce operation

