Spark
Session 7
2
MapReduce problems
• Forces your pipeline into map and reduce steps.
• This cannot accommodate every data analysis workflow.
• Data is read from disk for each MapReduce job, which is costly for iterative
algorithms (e.g. machine learning).
• Only a native Java programming interface.
3
Spark
4
• The solution: write a framework with the same capabilities as MapReduce, and more.
• Capable of reusing the Hadoop ecosystem.
• Around 20 highly efficient distributed operations that can be combined in any
order, rather than only 2 steps (map & reduce).
• Caching data in memory rather than writing it to disk (see the sketch below).
• Designed to be easier for new users to write their analyses, providing access
from other languages such as Scala, Python and R.
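A minimal sketch of the in-memory caching idea, using the standard RDD cache() API (the file path and search term are hypothetical):

# Load a dataset and keep it in memory, so repeated actions
# (typical of iterative algorithms) do not re-read it from disk.
data_RDD = sc.textFile("/user/input/data.txt")   # hypothetical path
data_RDD.cache()                                 # mark the RDD as cached in memory

# The first action materialises and caches the data ...
print(data_RDD.count())
# ... later actions reuse the in-memory copy instead of hitting disk again.
print(data_RDD.filter(lambda line: "spark" in line).count())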
5
MapReduce vs. Spark
6
7
8
Spark architecture
Driver program
9
The Driver Program is the process that runs the main() function of the application
and creates the SparkContext object. The purpose of the SparkContext is to
coordinate the Spark application, which runs as an independent set of processes on
a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster
managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the application code
can be JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
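A minimal sketch of how a driver program might create its SparkContext (the application name is a placeholder):

from pyspark import SparkConf, SparkContext

# Driver-side setup: build a configuration and create the SparkContext,
# which connects to the cluster manager and acquires executors.
conf = SparkConf().setAppName("word-count-demo")   # hypothetical app name
sc = SparkContext(conf=conf)

# ... define RDDs and run actions here; the SparkContext ships the code
# to the executors and schedules tasks on them ...

sc.stop()   # release the executors when the application is done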
Cluster manager
10
• The role of the cluster manager is to allocate resources across applications.
Spark can run on several different cluster managers.
• Supported cluster managers include Hadoop YARN, Apache Mesos and the
Standalone Scheduler.
• The Standalone Scheduler is Spark's own built-in cluster manager, which makes it
easy to install Spark on an empty set of machines.
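As an illustration, the cluster manager is usually selected through the master URL when the application is configured (host names and ports below are placeholders):

from pyspark import SparkConf

# The master URL tells Spark which cluster manager to use.
conf_standalone = SparkConf().setMaster("spark://master-host:7077")  # Standalone Scheduler
conf_yarn       = SparkConf().setMaster("yarn")                      # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://master-host:5050")  # Apache Mesos
conf_local      = SparkConf().setMaster("local[*]")                  # local mode, for testing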
Worker Node
11
• The worker node is a slave node.
• Its role is to run the application code in the cluster.
Executor (JVM)
12
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or on disk across them.
• It reads and writes data to external sources.
• Every application has its own executors.
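A small sketch of how executor resources are typically configured; the values below are illustrative only:

from pyspark import SparkConf

# Each executor is a JVM process on a worker node; these settings control
# how much memory and how many task slots each executor gets.
conf = (SparkConf()
        .set("spark.executor.memory", "2g")   # illustrative value
        .set("spark.executor.cores", "2"))    # illustrative value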
RDD
• RDD stands for “Resilient Distributed Dataset”.
• It is the way Spark represents and stores data (a data container).
• Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark
can recompute missing or damaged partitions caused by node failures.
• Distributed, since the data resides on multiple nodes.
• Dataset represents the records of the data you work with. The user can load the
dataset from an external source, such as a JSON file, CSV file, text file or a
database via JDBC, with no specific data structure required.
13
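A small sketch of the lineage idea: each transformation records how an RDD was derived, and toDebugString() shows that lineage (the data here is made up):

# Build an RDD through a chain of transformations.
numbers_RDD = sc.parallelize(range(100), 4)            # base RDD, 4 partitions
evens_RDD   = numbers_RDD.filter(lambda x: x % 2 == 0)
squares_RDD = evens_RDD.map(lambda x: x * x)

# The lineage graph (DAG) Spark would use to recompute lost partitions.
print(squares_RDD.toDebugString())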
Pyspark
14
• Apache Spark is written in the Scala programming language.
• PySpark is an interface to Apache Spark in Python.
• It not only allows you to write Spark applications using Python APIs, but also
provides the PySpark shell for interactively analyzing your data in a distributed
environment.
• PySpark supports most of Spark’s features, such as Spark SQL, DataFrames,
Streaming, MLlib (machine learning) and Spark Core.
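As a brief sketch of the Python API beyond plain RDDs, a DataFrame can be created and queried with Spark SQL (the CSV path is hypothetical):

from pyspark.sql import SparkSession

# Standard entry point for the DataFrame / SQL API.
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# Read a CSV file into a DataFrame and query it with Spark SQL.
df = spark.read.csv("/user/input/people.csv", header=True)   # hypothetical path
df.createOrReplaceTempView("people")
spark.sql("SELECT COUNT(*) FROM people").show()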
From the pyspark console
15
• Create an RDD (10 integers split across 3 partitions):
integer_RDD = sc.parallelize(range(10), 3)
• Collect all the data from the nodes back to the driver:
integer_RDD.collect()
• Check how the data is partitioned across the nodes:
integer_RDD.glom().collect()
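For reference, with the three partitions requested above the two collect calls would return roughly the following (exact partition boundaries may vary):

# integer_RDD.collect()
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# integer_RDD.glom().collect()   (one inner list per partition)
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]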
16
Read text file
• Read a local file:
text_RDD = sc.textFile("file:///home/cloudera/data.txt")
• Read from HDFS:
text_RDD = sc.textFile("/user/input/data.txt")
• Read the first line of the RDD:
text_RDD.take(1)
• Read all lines:
text_RDD.collect()
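A small follow-up sketch of what can be done with the loaded RDD (the search term is arbitrary):

# Count the number of lines in the file.
text_RDD.count()

# Keep only the lines that mention "spark" and look at a few of them.
text_RDD.filter(lambda line: "spark" in line).take(5)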
Word count example : MAP
# Split each line into a list of words.
def splitLine(line):
    return line.split()

# Turn each word into a (word, 1) pair.
def createPairs(word):
    return (word, 1)

txt_RDD = sc.textFile("/user/input/data.txt")
# flatMap flattens per-line word lists into one RDD of words; map builds (word, 1) pairs.
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)
17
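For comparison, the same map phase is often written inline with lambdas, as a one-liner:

# Equivalent to the named functions above.
pairs_RDD = txt_RDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1))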
Word count example : Reduce
# Add up two partial counts for the same word.
def sumCounts(a, b):
    return a + b

# reduceByKey merges all (word, 1) pairs that share the same word.
word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)
word_counts_RDD.collect()
18
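As a small extension not in the original slide, the most frequent words can be retrieved without collecting everything:

# Take the 5 words with the highest counts, ordered by descending count.
word_counts_RDD.takeOrdered(5, key=lambda pair: -pair[1])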