Spark
Session 7
2
MapReduce problems
• Forces your pipeline into map and reduce steps.
• This cannot accommodate every data analysis workflow.
• Data is read from disk for each MapReduce job, which is costly for iterative
algorithms (e.g. machine learning).
• Only a native Java programming interface.
3
Spark
4
• The solution: write a framework with the same capabilities as MapReduce, and more.
• Capable of reusing the Hadoop ecosystem.
• Around 20 highly efficient distributed operations that can be combined in any
order, rather than only 2 steps (map & reduce).
• Caching data in memory rather than writing it to disk (see the sketch below).
• Designed to be easier for new users to write their analyses, providing access
from other languages such as Scala, Python and R.
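A minimal sketch of the in-memory caching idea, using the standard RDD cache() API (the file path and search term are hypothetical):

# Load a dataset and keep it in memory, so repeated actions
# (typical of iterative algorithms) do not re-read it from disk.
data_RDD = sc.textFile("/user/input/data.txt")   # hypothetical path
data_RDD.cache()                                 # mark the RDD as cached in memory

# The first action materialises and caches the data ...
print(data_RDD.count())
# ... later actions reuse the in-memory copy instead of hitting disk again.
print(data_RDD.filter(lambda line: "spark" in line).count())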
5
MapReduce vs. Spark
6
7
8
Spark architecture
Driver program
9
The Driver Program is the process that runs the main() function of the application
and creates the SparkContext object. The purpose of the SparkContext is to
coordinate the Spark application, which runs as an independent set of processes on
a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster
managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the application code
can be JAR or Python files passed to the SparkContext.
• Finally, the SparkContext sends tasks to the executors to run.
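A minimal sketch of how a driver program might create its SparkContext (the application name is a placeholder):

from pyspark import SparkConf, SparkContext

# Driver-side setup: build a configuration and create the SparkContext,
# which connects to the cluster manager and acquires executors.
conf = SparkConf().setAppName("word-count-demo")   # hypothetical app name
sc = SparkContext(conf=conf)

# ... define RDDs and run actions here; the SparkContext ships the code
# to the executors and schedules tasks on them ...

sc.stop()   # release the executors when the application is done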
Cluster manager
10
• The role of the cluster manager is to allocate resources across applications.
Spark can run on several different cluster managers.
• Supported cluster managers include Hadoop YARN, Apache Mesos and the
Standalone Scheduler.
• The Standalone Scheduler is Spark's own built-in cluster manager, which makes it
easy to install Spark on an empty set of machines.
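As an illustration, the cluster manager is usually selected through the master URL when the application is configured (host names and ports below are placeholders):

from pyspark import SparkConf

# The master URL tells Spark which cluster manager to use.
conf_standalone = SparkConf().setMaster("spark://master-host:7077")  # Standalone Scheduler
conf_yarn       = SparkConf().setMaster("yarn")                      # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://master-host:5050")  # Apache Mesos
conf_local      = SparkConf().setMaster("local[*]")                  # local mode, for testing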
Worker Node
11
• The worker node is a slave node.
• Its role is to run the application code in the cluster.
Executor (JVM)
12
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or on disk across them.
• It reads and writes data to external sources.
• Every application has its own executors.
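A small sketch of how executor resources are typically configured; the values below are illustrative only:

from pyspark import SparkConf

# Each executor is a JVM process on a worker node; these settings control
# how much memory and how many task slots each executor gets.
conf = (SparkConf()
        .set("spark.executor.memory", "2g")   # illustrative value
        .set("spark.executor.cores", "2"))    # illustrative value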
RDD
• RDD stands for “Resilient Distributed Dataset”.
• It is the way Spark represents and stores data (a data container).
• Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark
can recompute missing or damaged partitions caused by node failures.
• Distributed, since the data resides on multiple nodes.
• Dataset represents the records of the data you work with. The user can load the
dataset from an external source, such as a JSON file, CSV file, text file or a
database via JDBC, with no specific data structure required.
13
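A small sketch of the lineage idea: each transformation records how an RDD was derived, and toDebugString() shows that lineage (the data here is made up):

# Build an RDD through a chain of transformations.
numbers_RDD = sc.parallelize(range(100), 4)            # base RDD, 4 partitions
evens_RDD   = numbers_RDD.filter(lambda x: x % 2 == 0)
squares_RDD = evens_RDD.map(lambda x: x * x)

# The lineage graph (DAG) Spark would use to recompute lost partitions.
print(squares_RDD.toDebugString())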
Pyspark
14
• Apache Spark is written in the Scala programming language.
• PySpark is an interface to Apache Spark in Python.
• It not only allows you to write Spark applications using Python APIs, but also
provides the PySpark shell for interactively analyzing your data in a distributed
environment.
• PySpark supports most of Spark’s features, such as Spark SQL, DataFrames,
Streaming, MLlib (machine learning) and Spark Core.
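As a brief sketch of the Python API beyond plain RDDs, a DataFrame can be created and queried with Spark SQL (the CSV path is hypothetical):

from pyspark.sql import SparkSession

# Standard entry point for the DataFrame / SQL API.
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# Read a CSV file into a DataFrame and query it with Spark SQL.
df = spark.read.csv("/user/input/people.csv", header=True)   # hypothetical path
df.createOrReplaceTempView("people")
spark.sql("SELECT COUNT(*) FROM people").show()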
From the pyspark console
15
• Create an RDD (10 integers split across 3 partitions):
integer_RDD = sc.parallelize(range(10), 3)
• Collect all the data from the nodes back to the driver:
integer_RDD.collect()
• Check how the data is partitioned across the nodes:
integer_RDD.glom().collect()
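For reference, with the three partitions requested above the two collect calls would return roughly the following (exact partition boundaries may vary):

# integer_RDD.collect()
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# integer_RDD.glom().collect()   (one inner list per partition)
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]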
16
Read text file
• Read a local file:
text_RDD = sc.textFile("file:///home/cloudera/data.txt")
• Read from HDFS:
text_RDD = sc.textFile("/user/input/data.txt")
• Read the first line of the RDD:
text_RDD.take(1)
• Read all lines:
text_RDD.collect()
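A small follow-up sketch of what can be done with the loaded RDD (the search term is arbitrary):

# Count the number of lines in the file.
text_RDD.count()

# Keep only the lines that mention "spark" and look at a few of them.
text_RDD.filter(lambda line: "spark" in line).take(5)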
Word count example : MAP
# Split each line into a list of words.
def splitLine(line):
    return line.split()

# Turn each word into a (word, 1) pair.
def createPairs(word):
    return (word, 1)

txt_RDD = sc.textFile("/user/input/data.txt")
# flatMap flattens per-line word lists into one RDD of words; map builds (word, 1) pairs.
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)
17
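For comparison, the same map phase is often written inline with lambdas, as a one-liner:

# Equivalent to the named functions above.
pairs_RDD = txt_RDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1))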
Word count example : Reduce
# Add up two partial counts for the same word.
def sumCounts(a, b):
    return a + b

# reduceByKey merges all (word, 1) pairs that share the same word.
word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)
word_counts_RDD.collect()
18
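As a small extension not in the original slide, the most frequent words can be retrieved without collecting everything:

# Take the 5 words with the highest counts, ordered by descending count.
word_counts_RDD.takeOrdered(5, key=lambda pair: -pair[1])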