Lightning-fast cluster computing
Apache Spark
What is Apache Spark?
Cluster computing platform designed to be fast and general-purpose.
Fast
Universal
Highly Accessible
Unified stack
Comparison with MapReduce (MR)
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Examples: Word Count
val sc = new SparkContext(...)
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
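To see what each step of the word count produces, here is a minimal sketch that runs the same three steps on a plain Scala collection instead of an RDD. The input lines and the object name are made up for illustration, and groupBy plus reduce is only a local stand-in for Spark's reduceByKey, which does the same aggregation across the cluster.

```scala
object WordCountTrace {
  // Hypothetical in-memory input; the slide reads its lines from HDFS instead.
  val lines: Seq[String] = Seq("to be or", "not to be")

  // flatMap: one element per word instead of one element per line.
  val words: Seq[String] = lines.flatMap(line => line.split(" "))

  // map: pair every word with an initial count of 1.
  val pairs: Seq[(String, Int)] = words.map(word => (word, 1))

  // Local analog of reduceByKey(_ + _): group the pairs by word, then sum the counts.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) =>
      (word, group.map(_._2).reduce(_ + _))
    }

  def main(args: Array[String]): Unit = {
    println(words)
    println(pairs)
    println(counts)
  }
}
```

The intermediate values make the shape change visible: lines become words, words become (word, 1) pairs, and pairs collapse into one total per word.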
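Examples: Log Mining
val sc = new SparkContext(...)
val inputRDD = sc.textFile("log.txt")
// Transformations: nothing is read or computed yet
val errorsRDD = inputRDD.filter(line => line.contains("error"))
val warningsRDD = inputRDD.filter(line => line.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)
// Keep the result cached so the two actions below do not recompute it
badLinesRDD.persist()
// Actions: these trigger the actual computation
badLinesRDD.count()
badLinesRDD.collect()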
How does it work?
RDD
Resilient
Distributed
Dataset
Advanced: RDD as interface
Example: Hadoop RDD
partitions = one per HDFS block
dependencies = none
compute = read corresponding block
preferredLocations = HDFS block locations
partitioner = none
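The five properties listed above can be sketched as a small Scala trait. This is a simplified model for illustration only, not Spark's actual RDD class: MiniRDD, Block, BlockRDD and FilteredRDD are hypothetical names, partitions are plain indices, and the "blocks" are in-memory sequences standing in for HDFS blocks.

```scala
// Simplified stand-in for the five properties; not Spark's real RDD class.
trait MiniRDD[T] {
  def partitions: Seq[Int]                              // partition indices
  def dependencies: Seq[MiniRDD[_]]                     // parent datasets
  def compute(partition: Int): Iterator[T]              // how to produce one partition
  def preferredLocations(partition: Int): Seq[String]   // locality hints
  def partitioner: Option[Any => Int] = None            // optional key-to-partition function
}

object MiniRDD {
  // Evaluate every partition in order and gather the results locally.
  def collect[T](rdd: MiniRDD[T]): List[T] =
    rdd.partitions.toList.flatMap(p => rdd.compute(p))
}

// Hypothetical in-memory "block": the host that stores it and the lines it holds.
final case class Block(host: String, lines: Seq[String])

// Source dataset, mirroring the Hadoop RDD example: one partition per block, no parents.
final class BlockRDD(blocks: Vector[Block]) extends MiniRDD[String] {
  def partitions: Seq[Int] = blocks.indices
  def dependencies: Seq[MiniRDD[_]] = Nil
  def compute(partition: Int): Iterator[String] = blocks(partition).lines.iterator
  def preferredLocations(partition: Int): Seq[String] = Seq(blocks(partition).host)
}

// Derived dataset: same partitions as its parent, computed by filtering the parent's data.
final class FilteredRDD[T](parent: MiniRDD[T], keep: T => Boolean) extends MiniRDD[T] {
  def partitions: Seq[Int] = parent.partitions
  def dependencies: Seq[MiniRDD[_]] = Seq(parent)
  def compute(partition: Int): Iterator[T] = parent.compute(partition).filter(keep)
  def preferredLocations(partition: Int): Seq[String] = parent.preferredLocations(partition)
}
```

A derived dataset stores no data of its own. It only knows its parent and how to compute a partition from it, which is what lets a lost partition be rebuilt from its lineage.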
Directed Acyclic Graph (DAG)
hadoopRDD --filter--> errorsRDD
hadoopRDD --filter--> warningsRDD
errorsRDD, warningsRDD --union--> badLinesRDD
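The graph above can be inspected from code. The following is a sketch under a few assumptions: spark-core is on the classpath, Spark runs in local mode inside the current JVM, and a small in-memory sequence stands in for the log file. LineageDemo and its sample lines are invented for the example; toDebugString and dependencies are methods of Spark's RDD class.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageDemo {
  // Builds the same graph as the log-mining example, on in-memory sample data.
  def buildBadLines(sc: SparkContext): RDD[String] = {
    val inputRDD = sc.parallelize(Seq("error: disk full", "info: started", "warning: high load"))
    val errorsRDD = inputRDD.filter(line => line.contains("error"))
    val warningsRDD = inputRDD.filter(line => line.contains("warning"))
    errorsRDD.union(warningsRDD)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; the application name is arbitrary.
    val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val badLinesRDD = buildBadLines(sc)

      // No job has run yet: the calls above only recorded the graph.
      println(badLinesRDD.toDebugString)       // textual dump of the lineage
      println(badLinesRDD.dependencies.size)   // number of direct parents of the union node

      // An action makes Spark walk the graph and compute the data.
      println(badLinesRDD.count())
    } finally {
      sc.stop()
    }
  }
}
```

The exact text printed by toDebugString varies between Spark versions, so treat it as a debugging aid and not as a stable format.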
RDD Transformations
| Function Name | Purpose | Example |
|---|---|---|
| map() | Apply a function to each element in the RDD and return an RDD of the result. | rdd.map(x => x + 1) |
| flatMap() | Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned. Often used to extract words. | rdd.flatMap(x => x.to(3)) |
| filter() | Return an RDD consisting of only the elements that pass the condition passed to filter(). | rdd.filter(x => x != 1) |
| distinct() | Remove duplicates. | rdd.distinct() |
| union() | Produce an RDD containing elements from both RDDs. | rdd.union(other) |
| intersection() | Produce an RDD containing only elements found in both RDDs. | rdd.intersection(other) |
| join() | Perform an inner join between two RDDs of key/value pairs. | rdd.join(other) |
| groupByKey() | Group values with the same key. | rdd.groupByKey() |
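The transformations in the table can be tried in local mode. This sketch assumes spark-core is on the classpath. The object name and the sample data are made up, and results whose order Spark does not guarantee are sorted before printing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsDemo {
  def run(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(Seq(1, 2, 3, 3))
    val other = sc.parallelize(Seq(3, 4, 5))

    // Element-wise transformations.
    println(rdd.map(x => x + 1).collect().toList)
    println(rdd.flatMap(x => x.to(3)).collect().toList)
    println(rdd.filter(x => x != 1).collect().toList)

    // Set-like transformations; sorted because the result order is not guaranteed.
    println(rdd.distinct().collect().sorted.toList)
    println(rdd.union(other).collect().sorted.toList)
    println(rdd.intersection(other).collect().sorted.toList)

    // Key-based transformations need RDDs of (key, value) pairs.
    val ages = sc.parallelize(Seq(("ann", 31), ("bob", 27)))
    val cities = sc.parallelize(Seq(("ann", "Oslo"), ("cid", "Rome")))
    println(ages.join(cities).collect().toList) // inner join: only keys present on both sides

    val visits = sc.parallelize(Seq(("ann", 1), ("bob", 2), ("ann", 3)))
    println(visits.groupByKey().mapValues(_.toList.sorted).collect().sortBy(_._1).toList)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; the application name is arbitrary.
    val conf = new SparkConf().setAppName("transformations-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try run(sc) finally sc.stop()
  }
}
```

Every call except collect() returns a new RDD without doing any work. The collect() at the end of each line is the action that forces evaluation so there is something to print.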
RDD Actions
| Function Name | Purpose | Example |
|---|---|---|
| count() | Number of elements in the RDD. | rdd.count() |
| collect() | Return all elements from the RDD. | rdd.collect() |
| saveAsTextFile() | Save the RDD elements to an external storage system. | rdd.saveAsTextFile("hdfs://...") |
| take(num) | Return num elements from the RDD. | rdd.take(10) |
| reduce(func) | Combine the elements of the RDD together in parallel (e.g., sum). | rdd.reduce((x, y) => x + y) |
| takeOrdered(num)(ordering) | Return num elements based on the provided ordering. | rdd.takeOrdered(2)(myOrdering) |
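A short sketch of the actions from the table, again assuming spark-core on the classpath and a local master. ActionsDemo, Summary and the sample numbers are invented for illustration. saveAsTextFile is left out so the example does not write to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsDemo {
  // Results of several actions, gathered on the driver.
  final case class Summary(
      count: Long,
      firstTwo: List[Int],
      sum: Int,
      largestTwo: List[Int],
      all: List[Int]
  )

  def summarize(sc: SparkContext): Summary = {
    val rdd = sc.parallelize(Seq(5, 3, 1, 4, 2))
    Summary(
      count = rdd.count(),                 // number of elements
      firstTwo = rdd.take(2).toList,       // two elements, without scanning the whole RDD
      sum = rdd.reduce((x, y) => x + y),   // combines elements in parallel
      // The ordering decides which elements count as "first"; reversed, that is the largest.
      largestTwo = rdd.takeOrdered(2)(Ordering[Int].reverse).toList,
      all = rdd.collect().toList           // brings every element back to the driver
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("actions-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(summarize(sc)) finally sc.stop()
  }
}
```

Unlike transformations, each of these calls starts a job and returns a plain Scala value to the driver. collect() is only safe when the whole result fits in the driver's memory.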
RDD Caching
| Level | Space Used | CPU Time | In Memory | On Disk | Comments |
|---|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Y | N | |
| MEMORY_ONLY_SER | Low | High | Y | N | |
| MEMORY_AND_DISK | High | Medium | Some | Some | Spills to disk if there is too much data to fit in memory. |
| MEMORY_AND_DISK_SER | Low | High | Some | Some | Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory. |
| DISK_ONLY | Low | High | N | Y | |
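A storage level from the table is chosen by passing it to persist(). The sketch below assumes Spark 2.0 or later (for longAccumulator), spark-core on the classpath, and a local master. CachingDemo and mapCalls are hypothetical names. It counts how often a map function runs when two actions are executed on the same RDD, with and without caching.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  // Runs two actions on the same mapped RDD and reports how often the map function ran.
  def mapCalls(sc: SparkContext, level: Option[StorageLevel]): Long = {
    val calls = sc.longAccumulator("map calls")
    val doubled = sc.parallelize(1 to 100).map { x => calls.add(1L); x * 2 }

    // Mark the RDD for caching only when a storage level is given.
    level.foreach(l => doubled.persist(l))

    doubled.count() // first action: computes the RDD (and caches it if requested)
    doubled.count() // second action: reuses the cache, or recomputes from lineage

    doubled.unpersist() // release any cached blocks
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("caching-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"without persist: ${mapCalls(sc, None)} map calls")
      println(s"MEMORY_ONLY:     ${mapCalls(sc, Some(StorageLevel.MEMORY_ONLY))} map calls")
      println(s"DISK_ONLY:       ${mapCalls(sc, Some(StorageLevel.DISK_ONLY))} map calls")
    } finally {
      sc.stop()
    }
  }
}
```

On a healthy local run, the uncached variant should report roughly twice as many map calls as the cached ones, because the second action recomputes the RDD from its lineage. Calling persist() with no argument uses MEMORY_ONLY, and the storage level of an RDD cannot be changed once it has been assigned.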
How does it work?
Driver: main program which controls the flow
Executors: worker processes on the cluster nodes that execute the tasks
How does it work?
DAG Scheduler: coordination between RDDs, driver and nodes
What is a Spark Application?
Advanced Topics: Stages
Advanced Topics: Shuffling
Spark Stack
SQL
Streaming
Machine Learning
GraphX
Questions? If not… demo, everyone?

Apache Spark - Aram Mkrtchyan
