Presented By:
Sarfaraz Hussain, Software Consultant, Knoldus Inc
Divyansh Jain, Software Consultant, Knoldus Inc
End-to-end working
of Apache Spark
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings; you are requested not to join sessions more than 5 minutes after the session start time.
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices on silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Agenda
01 Why and What is Spark?
02 Working of Spark
03 Operations in Spark
04 Task and Stages
05 DataFrame & DataSets
06 Demo
End-to-End??
Why Spark? (Distributed Computing)
Traditional Enterprise Approach:
- Data has to be split across different systems.
- Code has to be written separately for each system.
- No fault tolerance.
- Results have to be aggregated back from all the systems.
- System A is unaware of the data stored in System B and vice versa.
What is Spark?
The official definition of Apache Spark says that
“Apache Spark™ is a unified analytics engine
for large-scale data processing.” It is an
in-memory computation processing engine
where the data is kept in random access
memory (RAM) instead of some slow disk drives
and is processed in parallel.
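Concretely, all of the snippets on the following slides assume a SparkSession like the minimal sketch below (the application name and the local master are placeholder values, not part of the original deck):

import org.apache.spark.sql.SparkSession

// Entry point to Spark: exposes the RDD API via spark.sparkContext
// and the DataFrame/Dataset API directly.
val spark = SparkSession.builder()
  .appName("knolx-spark-demo")   // placeholder application name
  .master("local[*]")            // run locally on all cores; on a real cluster this comes from spark-submit
  .getOrCreate()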
Working of Spark
(Diagram: File1.txt distributed across a master-slave architecture)
Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
RDD result = find the values smaller than 10 in the number RDD
Master => Driver
Slave => Executor
RDD
Spark RDD is a resilient, partitioned, distributed and immutable collection of
data.
We can create an RDD using two methods:
- Load some data from a source.
- Create an RDD by transforming another RDD.
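A short sketch of both creation methods, reusing the File1.txt path from the surrounding slides:

// 1. Load some data from a source (here a text file, split into 3 partitions)
val lines = spark.sparkContext.textFile("path_to_File1.txt", 3)

// 2. Create an RDD by transforming another RDD
val numbers = lines.flatMap(_.split("\t")).map(_.toInt)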
Working of Spark
(Diagram: blocks stored with a replication factor, and recovery in case of node failure)
Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
B1 → B4 => 5,6
B2 → B5 => 9, 2
B3 → B6 => 1, 5
B4, B5, B6 => Result RDD
Lineage
- When a new RDD is derived from an existing RDD, the new RDD contains a pointer to the parent RDD, and Spark keeps track of all the dependencies between these RDDs; this record is called the lineage.
- In case of data loss, this lineage is used to rebuild the
data.
- The SparkContext (Driver) maintains the Lineage.
- It is also known as Dependency Graph.
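The lineage of an RDD can be inspected directly; toDebugString prints the chain of parent RDDs, i.e. the dependency graph (shown here for the result RDD from the earlier snippet):

// Prints the dependency graph that Spark would use to recompute lost partitions
println(result.toDebugString)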
Operations in Spark
- Transformation:
- Creation of new RDD from existing RDD.
- Action:
- Produces a non-RDD result and returns it to the user, e.g. as a value or a Scala/Java collection on the driver.
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
result.collect() → Action
Wait a minute???
Working of Spark (Lazy Evaluation)
Until we hit an Action, none of the above operations (transformations) actually execute.
- RDDs are lazily evaluated.
- RDDs are immutable.
- The result of an action is returned to the driver or written to an external storage system.
- An action sets the lazily built chain of RDDs into motion.
- An action is one of the ways of sending data from the Executors to the Driver.
- An action kicks off a job to execute on the cluster.
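A small illustration of this laziness using the same pipeline as above: the two transformations return immediately without reading the file, and only the action triggers a job.

val number = spark.sparkContext.textFile("path_to_File1.txt", 3)          // nothing is read yet
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(_ < 10)    // still nothing executes

val smallValues = result.collect()   // action: the job runs and the values are returned to the driver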
Here you go now!!
Let’s see them in action!
Word Count problem!
Apple
Banana
Orange
Apple
Cat
Dog
Cow
Orange
Cow
Banana
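A minimal word-count sketch over such input (assuming the words are stored one per line in a hypothetical words.txt):

val words = spark.sparkContext.textFile("words.txt")   // hypothetical input file, one word per line
val counts = words
  .map(word => (word, 1))                              // pair each word with a count of 1
  .reduceByKey(_ + _)                                  // sum the counts per word (this triggers a shuffle)
counts.collect().foreach(println)                      // action: e.g. (Apple,2), (Banana,2), (Cow,2), ...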
End-to-end working of Apache Spark
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types (illustrated in the sketch below):
1. Narrow Dependency
2. Wide Dependency
● Stages
Spark Execution Model
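Tying this to the word-count sketch above (an illustration, not taken from the slides): map is a narrow dependency because each output partition depends on a single input partition, while reduceByKey is a wide dependency because it shuffles data across partitions, and that shuffle is where Spark cuts the job into stages.

val pairs  = spark.sparkContext.textFile("words.txt").map(word => (word, 1))   // narrow dependencies only
val counts = pairs.reduceByKey(_ + _)                                          // wide dependency: shuffle → new stage

// The shuffle shows up in the lineage as a ShuffledRDD; everything before it runs in an earlier stage
println(counts.toDebugString)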
End-to-end working of Apache Spark
Spark Ecosystem
RDDs vs Dataframes / Datasets
Why use RDD?
- Offer control & flexibility
- Low level API
- Type Safe
- Encourage ‘How to’
RDD Example
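The example code on this slide is an image that is not preserved in the transcript; a representative sketch of the "how to" style the RDD API encourages, computing the average age per name from (name, age) pairs (the sample data is made up to match the DataFrame slides later):

// With RDDs the developer spells out *how* to compute the result, step by step
val people = spark.sparkContext.parallelize(Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)))

val avgAgeByName = people
  .map { case (name, age) => (name, (age, 1)) }                        // (name, (age, count))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))                   // sum ages and counts per name
  .map { case (name, (sum, count)) => (name, sum.toDouble / count) }   // divide to get the average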
Problem with RDD
- Focus on "How To" rather than "What To".
- Not much optimized by Spark; optimizing RDD code is the developer's responsibility.
- RDDs are a low-level API.
- Inadvertent inefficiencies.
Problem with RDD
Is this Optimized?
Structured APIs in Spark
→ Dataframes
→ Datasets
What sits above the RDD?
                  SQL        Dataframes     Datasets
Syntax errors     Run Time   Compile Time   Compile Time
Analysis errors   Run Time   Run Time       Compile Time

Analysis errors are reported before the distributed job starts.
RDD vs Dataframes vs Datasets
Dataframe Example
Output:
project  page  numRequests
en       23    45
en       24    200
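The code behind this slide is not preserved; a sketch that would produce output of this shape, assuming a hypothetical pagecounts.json file with project, page and numRequests fields:

import spark.implicits._

// With DataFrames we declare *what* we want; Catalyst decides how to execute it
val pagecounts = spark.read.json("pagecounts.json")   // hypothetical input file

val result = pagecounts
  .select("project", "page", "numRequests")
  .filter($"project" === "en")

result.show()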
Dataframe → SQL
Output:
project  page  numRequests
en       23    45
en       24    200
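The same query can be expressed in SQL by registering the DataFrame as a temporary view (a sketch; the view name is arbitrary):

pagecounts.createOrReplaceTempView("pagecounts")

val result = spark.sql(
  "SELECT project, page, numRequests FROM pagecounts WHERE project = 'en'")
result.show()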
Dataframe = Easy to write... Believe it!
name  age
Jim   20
Ann   31
Jin   30
Output:
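The output on this slide is an image; a sketch of the kind of one-liner the slide is pointing at, e.g. the average age per name over the table above:

import spark.implicits._

val people = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)).toDF("name", "age")

// Declarative: state *what* is needed and let Catalyst work out the execution plan
people.groupBy("name").avg("age").show()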
Analysis: Analyzing a logical plan to resolve references
Logical Optimization: Optimise the logical plan
Code generation: Compile parts of the query to Java
bytecode
Catalyst Optimizer
users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01")
Dataframe Optimization
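For a query like the one above, Catalyst can push the date filter below the join so that events are filtered before being joined; the effect can be inspected with explain() (a sketch, assuming users and events DataFrames with id, uid and date columns):

val joined = users.join(events, users("id") === events("uid"))
  .filter(events("date") > "2015-01-01")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan;
// in the optimized plan the date filter sits below the join
joined.explain(true)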
DataFrames are faster than RDDs
Dataset

Why:
- High level APIs & DSL
- Strong Type-Safety
- Ease of use & Readability
- What to Do

When:
- Structured Data Schema
- Code optimisations & performance
Dataframe and Dataset
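A short Dataset sketch of the strong type-safety: mapping rows to a case class so that field names and types are checked at compile time (a hedged example reusing the name/age data from the earlier slides):

import spark.implicits._

case class Person(name: String, age: Int)

// Typed API: person.age is checked by the compiler, unlike df("agee") which only fails at runtime
val peopleDS = Seq(Person("Jim", 20), Person("Ann", 31), Person("Jin", 30)).toDS()

peopleDS.filter(person => person.age > 25).show()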
DEMO
Q/A
Thank You !