By Tsai Li Ming

PyData + Spark Meetup (SG) - 17 Nov 2015

http://about.me/tsailiming
Presentation and source code are available here:

http://github.com/tsailiming/pydatasg-17Nov2015
What is Spark?
• Developed at UC Berkeley in 2009. Open sourced in 2010.

• Fast and general engine for large-scale data processing

• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

• Multi-step Directed Acyclic Graphs (DAGs): many stages, compared to Hadoop's single Map and Reduce stages

• Rich Scala, Java and Python APIs. R too!

• Interactive shell

• Active development
What is Spark?
Spark Stack
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
Speed matters
http://hblok.net/blog/storage/
2011
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Logistic Regression Performance
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
How Spark works
Resilient Distributed Datasets (RDDs)
• Basic abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel

• RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat.

• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc.

• Rich APIs for Transformations and Actions

• Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
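Transformations on an RDD are lazy: nothing is computed until an action asks for a result, which is what makes multi-step DAGs and caching worthwhile. A plain-Python analogy of this behaviour (using a generator, not Spark itself):

```python
# A plain-Python analogy of lazy RDD transformations (this is NOT Spark code):
# generators, like transformations, do no work until an "action" consumes them.

log = []

def numbers():
    for n in [1, 2, 3, 4]:
        log.append(n)          # record when each element is actually produced
        yield n

squared = (n * n for n in numbers())   # "transformation": nothing runs yet
assert log == []                       # no elements have been produced so far

total = sum(squared)                   # "action": triggers the whole pipeline
assert total == 30
assert log == [1, 2, 3, 4]
```

In Spark, caching a lazily built RDD (e.g. with MEMORY_ONLY) avoids recomputing the pipeline when several actions reuse it.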
RDD Operations
Transformations: map, flatMap, filter, sample, union, join, distinct, groupByKey, reduceByKey, mapValues, sortByKey

Actions: reduce, count, first, saveAsTextFile
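To make the pair-wise operations concrete, here is a sketch of what groupByKey and reduceByKey compute, modelled in plain Python with a list of (key, value) tuples (an illustration of the semantics, not Spark code):

```python
# Plain-Python sketches of groupByKey and reduceByKey semantics (NOT Spark code).
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3)]

def group_by_key(kv):
    """Collect all values for each key, like rdd.groupByKey()."""
    out = defaultdict(list)
    for k, v in kv:
        out[k].append(v)
    return dict(out)

def reduce_by_key(f, kv):
    """Fold values per key with f, like rdd.reduceByKey(f)."""
    out = {}
    for k, v in kv:
        out[k] = f(out[k], v) if k in out else v
    return out

assert group_by_key(pairs) == {"a": [1, 3], "b": [2]}
assert reduce_by_key(lambda a, b: a + b, pairs) == {"a": 4, "b": 2}
```

In Spark, reduceByKey is usually preferred over groupByKey for aggregations, since it can combine values on each node before shuffling.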
Spark Example
Wordcount Example
// package org.myorg;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        // conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Hadoop MapReduce

Spark Scala
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Spark Python
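The Spark version is just flatMap -> map -> reduceByKey over lines of text. The same pipeline expressed in plain Python (no cluster needed), to show the logic being distributed:

```python
# Word count in plain Python, mirroring the Spark pipeline above (NOT Spark code):
# flatMap = nested generator, map + reduceByKey = Counter over words.
from collections import Counter

lines = ["to be or not to be"]
words = (word for line in lines for word in line.split(" "))  # flatMap
counts = Counter(words)                                       # map + reduceByKey

assert counts["to"] == 2
assert counts["be"] == 2
assert counts["or"] == 1
```

Spark runs exactly this logic, but partitions the lines across the cluster and merges the per-partition counts in the reduceByKey shuffle.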
Spark SQL and Dataframe Example
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.read.json("people.json")

# Show the content of the DataFrame
df.show()
## age  name
## null Michael
## 30   Andy
## 19   Justin

# Print the schema in a tree format
df.printSchema()
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20

# Select people older than 21
df.filter(df['age'] > 21).show()
## age name
## 30  Andy

# Count people by age
df.groupBy("age").count().show()
## age  count
## null 1
## 19   1
## 30   1
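To see what the filter and groupBy calls compute, here is the same data modelled as a list of dicts in plain Python (an illustration of the semantics, not the DataFrame API; the rows match the slide's output):

```python
# Plain-Python sketch of the DataFrame filter/groupBy above (NOT Spark code).
from collections import Counter

people = [{"name": "Michael", "age": None},
          {"name": "Andy", "age": 30},
          {"name": "Justin", "age": 19}]

# df.filter(df['age'] > 21): a null age never satisfies the comparison
older = [p["name"] for p in people if p["age"] is not None and p["age"] > 21]
assert older == ["Andy"]

# df.groupBy("age").count(): one group per distinct age, including null
by_age = Counter(p["age"] for p in people)
assert by_age == {None: 1, 30: 1, 19: 1}
```

Spark evaluates the same predicates, but lazily and with a query optimizer, so filters can be pushed down into the data source.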
andypetrella/spark-notebook

(forked from Scala notebook)
Apache Zeppelin
Notebooks for Spark
Actual Demo
PySpark with Jupyter
Thank You!
