Apache Spark Tutorial
Farzad Nozarian
4/18/15 @AUT
Purpose
This tutorial provides a quick introduction to using Spark. We will first
introduce the API through Spark’s interactive shell, then show how to write
applications in Scala.
To follow along with this guide, first download a packaged release of Spark
from the Spark website.
2
Interactive Analysis with the Spark Shell - Basics
• Spark’s shell provides a simple way to learn the API, as well as a powerful tool
to analyze data interactively.
• It is available in either Scala or Python.
• Start it by running the following in the Spark directory:
• RDDs can be created from Hadoop InputFormats (such as HDFS files) or by
transforming other RDDs (a short sketch of other creation options follows the code below).
• Let’s make a new RDD from the text of the README file in the Spark source
directory:
3
./bin/spark-shell
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
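As a brief aside that is not on the original slide (the HDFS URI is a placeholder), RDDs can also be loaded straight from HDFS or built from an in-memory collection:
scala> val hdfsFile = sc.textFile("hdfs://namenode:8020/data/README.md") // hypothetical HDFS path
scala> val numbers = sc.parallelize(1 to 100) // RDD built from a local Scala collection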
Interactive Analysis with the Spark Shell - Basics
• RDDs have actions, which return values, and transformations, which return
pointers to new RDDs. Let’s start with a few actions:
• Now let’s use a transformation:
• We will use the filter transformation to return a new RDD with a subset of the
items in the file.
4
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
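Beyond count and first, other actions are handy for a quick look at the data; the one-liner below is an addition, not from the original slide:
scala> textFile.take(3) // Array of the first three lines (contents depend on your README.md)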
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can chain together transformations and actions:
• RDD actions and transformations can be used for more complex computations.
• Let’s say we want to find the line with the most words:
• The arguments to map and reduce are Scala function literals (closures), and can
use any language feature or Scala/Java library.
5
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can easily call functions declared elsewhere.
• We’ll use the Math.max() function to make the previous code easier to understand:
• One common data flow pattern is MapReduce, as popularized by Hadoop.
• Spark can implement MapReduce flows easily:
6
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
Interactive Analysis with the Spark Shell - More on RDD Operations
• Here, we combined the flatMap, map and reduceByKey transformations to
compute the per-word counts in the file as an RDD of (String, Int) pairs.
• To collect the word counts in our shell, we can use the collect action:
7
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3),
(Because,1), (Python,2), (agree,1), (cluster.,1), ...)
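As an optional extension that is not in the original slides, the counts can be sorted to list the most frequent words; the swap-then-sortByKey pattern below is one common way to do it:
scala> wordCounts.map(pair => (pair._2, pair._1)).sortByKey(ascending = false).take(10)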
Interactive Analysis with the Spark Shell - Caching
• Spark also supports pulling data sets into a cluster-wide in-memory cache.
• This is very useful when data is accessed repeatedly:
• Querying a small “hot” dataset.
• Running an iterative algorithm like PageRank.
• Let’s mark our linesWithSpark dataset to be cached:
8
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
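Two related calls, added here as a brief aside rather than taken from the slides: unpersist drops an RDD from the cache, and persist accepts an explicit storage level:
scala> import org.apache.spark.storage.StorageLevel
scala> linesWithSpark.unpersist() // remove the cached data
scala> linesWithSpark.persist(StorageLevel.MEMORY_AND_DISK) // cache again, spilling to disk if needed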
Self-Contained Applications
9
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
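One small addition that is not part of the original listing: in a real application you would typically stop the SparkContext once the work is done, so its resources are released cleanly.
    sc.stop() // add as the last statement inside main()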
Self-Contained Applications (Cont.)
• This program just counts the number of lines containing ‘a’ and the
number containing ‘b’ in the Spark README.
• Note that you’ll need to replace YOUR_SPARK_HOME with the location
where Spark is installed.
• Note that applications should define a main() method instead of
extending scala.App. Subclasses of scala.App may not work correctly.
• Unlike the earlier examples with the Spark shell, which initializes its own
SparkContext, we initialize a SparkContext as part of the program.
• We pass the SparkContext constructor a SparkConf object, which contains
information about our application (a brief configuration sketch follows this slide).
10
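The configuration sketch below is an illustration only; the master URL and the memory setting are assumptions for local testing rather than part of the original example (in practice the master is usually supplied on the spark-submit command line):
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]")              // run locally with two worker threads
  .set("spark.executor.memory", "1g") // hypothetical tuning setting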
Self-Contained Applications (Cont.)
• Our application depends on the Spark API, so we’ll also include an sbt
configuration file, simple.sbt, which declares Spark as a dependency.
• For sbt to work correctly, we’ll need to lay out SimpleApp.scala and
simple.sbt according to the typical directory structure.
• Then we can create a JAR package containing the application’s code and
use the spark-submit script to run our program.
11
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
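An optional variation, not shown on the slide: marking the Spark dependency as "provided" keeps the Spark classes out of the packaged JAR, since spark-submit supplies them at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"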
Self-Contained Applications (Cont.)
12
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23
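As a final hedged variation (the master host below is a placeholder, not from the slides), the same JAR could be submitted to a standalone cluster simply by changing the master URL:
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master spark://master-host:7077 \
  target/scala-2.10/simple-project_2.10-1.0.jar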