Using Apache Spark
Bojan Babic
Apache Spark
spark.incubator.apache.org
github.com/apache/incubator-spark
user@spark.incubator.apache.org
The Spark Community
INTRODUCTION TO APACHE SPARK
What is Spark?
Fast and Expressive Cluster Computing System Compatible with Apache Hadoop
Efficient
• General execution graphs
• In-memory storage
Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
Key Concepts
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• In Hadoop terms an RDD corresponds to a partitioned dataset; elsewhere it is comparable to a dataframe
Operations (sketch below)
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
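A minimal sketch of the two kinds of operations, assuming a SparkContext sc as in the interactive shell (hypothetical data; transformations are lazy, only the action triggers computation):

  val nums  = sc.parallelize(1 to 1000)   // build an RDD through a parallel transformation
  val evens = nums.filter(_ % 2 == 0)     // transformation: recorded, not executed yet
  val strs  = evens.map(_.toString)       // transformation: still lazy
  println(strs.count())                   // action: runs the pipeline and returns 500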
Working With RDDs
[Diagram: transformations turn RDDs into new RDDs; an action finally returns a value]
textFile = sc.textFile("/var/log/hadoop/somelog")
linesWithSpark = textFile.filter(l => l.contains("Spark"))
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
Spark (batch) example
Load error messages from a log into memory, then interactively search for various patterns:
val sc = new SparkContext(config)
val lines = sc.textFile("hdfs://...")                    // base RDD
val messages = lines.filter(l => l.startsWith("ERROR"))  // transformed RDD
  .map(e => e.split("\t")(2))
messages.cache()                                         // keep the messages in memory
messages.filter(s => s.contains("mongo")).count()        // action
messages.filter(s => s.contains("500")).count()
[Chart: Scaling Down]
Spark Streaming example
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second mini-batches
val kafkaParams = Map[String, String](...)
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc,
  kafkaParams,
  Map(topic -> threadsPerTopic),
  storageLevel = StorageLevel.MEMORY_ONLY_SER
).map(_._2)                                        // keep just the message value
val revenue = messages.filter(m => m.contains("[PURCHASE]"))
  .map(p => p.split("\t")(4).toDouble)             // field 4 holds the purchase amount
  .reduce(_ + _)                                   // total revenue per mini-batch
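The pipeline above only describes the computation; nothing runs until the streaming context is started. A short sketch of the final steps, assuming the ssc and revenue definitions above:

  revenue.print()         // emit each mini-batch's total to the driver log
  ssc.start()             // begin consuming from Kafka in 10-second batches
  ssc.awaitTermination()  // block the driver until the stream is stopped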
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:
val msgs = textFile.filter(l => l.contains("ERROR"))
  .map(e => e.split("\t")(2))
[Lineage diagram: HDFS File → filter(func = startsWith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
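The lineage is inspectable from the shell; a small sketch, assuming the msgs RDD above:

  // toDebugString prints the chain of parent RDDs Spark would replay
  // to rebuild a lost partition
  println(msgs.toDebugString)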
JOB EXECUTION
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN, or standalone mode
• Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Architecture diagram: your application holds a SparkContext with local threads, talking to a cluster manager; each worker runs a Spark executor on top of HDFS or other storage]
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles (sketch after the diagram)
[DAG diagram: RDDs A–F connected by map, groupBy, filter, and join, split into Stages 1–3; cached partitions are skipped on recomputation]
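A sketch of partitioning awareness with hypothetical pair RDDs, assuming a shell SparkContext sc: pre-partitioning and caching one side of a join lets the scheduler avoid re-shuffling it.

  import org.apache.spark.HashPartitioner
  val users  = sc.parallelize(Seq((1, "Ana"), (2, "Bo")))
    .partitionBy(new HashPartitioner(8))
    .cache()                        // cached partitions are reused across jobs
  val events = sc.parallelize(Seq((1, "click"), (2, "view")))
  val joined = users.join(events)   // users keeps its partitioning; only events is shuffled
  joined.count()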
Advanced Features
• Controllable partitioning
– Speed up joins against a dataset
• Controllable storage formats
– Keep data serialized for efficiency, replicate to multiple nodes, cache on disk
• Shared variables: broadcasts, accumulators (sketch below)
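A hedged sketch of the shared variables above, with a hypothetical lookup table and data:

  val codes  = sc.broadcast(Map("500" -> "server error"))  // read-only copy, shipped once per node
  val errors = sc.accumulator(0)                           // counter, aggregated on the driver
  val logs   = sc.parallelize(Seq("500\tERROR", "200\tOK"))
  val labeled = logs.map { l =>
    if (l.contains("ERROR")) errors += 1                   // counted once an action runs
    (l, codes.value.getOrElse(l.split("\t")(0), "unknown"))
  }
  labeled.count()                                          // action: materializes the map
  println(errors.value)                                    // prints 1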
WORKING WITH SPARK
Using the Shell
Launching:
spark-shell
pyspark (IPYTHON=1)
Modes:
MASTER=local ./spark-shell              # local, 1 thread
MASTER=local[2] ./spark-shell           # local, 2 threads
MASTER=spark://host:port ./spark-shell  # cluster
SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you'd make your own (sketch below)
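A minimal sketch of a standalone entry point; the app name and master URL are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}
  object MyApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("MyApp")
        .setMaster("local[2]")       // or spark://host:port, yarn, mesos://...
      val sc = new SparkContext(conf)
      // ... build and run RDD pipelines with sc ...
      sc.stop()
    }
  }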
RDD Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
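A small sketch exercising a few of these operators on hypothetical pair RDDs, assuming a shell SparkContext sc:

  val clicks = sc.parallelize(Seq(("u1", 3), ("u2", 1), ("u1", 2)))
  val names  = sc.parallelize(Seq(("u1", "Ana"), ("u2", "Bo")))
  val totals = clicks.reduceByKey(_ + _)  // transformation: ("u1", 5), ("u2", 1)
  val joined = totals.join(names)         // transformation: ("u1", (5, "Ana")), ("u2", (1, "Bo"))
  joined.collect().foreach(println)       // collect is an action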
Add Spark to Your Project
• Scala: add the dependency to build.sbt
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0").
    exclude("org.mortbay.jetty", "servlet-api")
  ...
)
• make sure you exclude overlapping libs
… and add Spark-Streaming
• Scala: add the dependency to build.sbt
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"
[Charts: PageRank performance, and time per iteration (s) for other iterative algorithms]
CONCLUSION
Conclusion
• Spark offers a rich API to make data analytics fast: both fast to write and fast to run
• Achieves up to 100x speedups in real applications
• Growing community with 25+ companies contributing
Thanks!!!
Disclosure: parts of this presentation were inspired by Matei Zaharia's AMP meetup talks in Berkeley.