Presented By:
Sarfaraz Hussain, Software Consultant, Knoldus Inc
Divyansh Jain, Software Consultant, Knoldus Inc
End-to-end working
of Apache Spark
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings; you are requested not to join sessions more than 5 minutes after the session start time.
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices on silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Agenda
01 Why and What is Spark?
02 Working of Spark
03 Operations in Spark
04 Task and Stages
05 DataFrame & DataSets
06 Demo
End-to-End??
Why Spark? (Distributed Computing)
Traditional Enterprise Approach:
- Data has to be split across different systems.
- Code has to be written separately for each system.
- No fault tolerance.
- Results have to be aggregated back from all the systems.
- System A is unaware of the data stored in System B and vice versa.
What is Spark?
The official definition of Apache Spark says that
“Apache Spark™ is a unified analytics engine
for large-scale data processing.” It is an
in-memory computation processing engine
where the data is kept in random access
memory (RAM) instead of some slow disk drives
and is processed in parallel.
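Concretely, all of the snippets on the following slides assume a SparkSession like the minimal sketch below (the application name and the local master are placeholder values, not part of the original deck):

import org.apache.spark.sql.SparkSession

// Entry point to Spark: exposes the RDD API via spark.sparkContext
// and the DataFrame/Dataset API directly.
val spark = SparkSession.builder()
  .appName("knolx-spark-demo")   // placeholder application name
  .master("local[*]")            // run locally on all cores; on a real cluster this comes from spark-submit
  .getOrCreate()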
Working of Spark
(Diagram: File1.txt distributed across a master-slave architecture)
Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
RDD result = find the values smaller than 10 in the number RDD
Master => Driver
Slave => Executor
RDD
Spark RDD is a resilient, partitioned, distributed and immutable collection of
data.
We can create an RDD using two methods:
- Load some data from a source.
- Create an RDD by transforming another RDD.
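A short sketch of both creation methods, reusing the File1.txt path from the surrounding slides:

// 1. Load some data from a source (here a text file, split into 3 partitions)
val lines = spark.sparkContext.textFile("path_to_File1.txt", 3)

// 2. Create an RDD by transforming another RDD
val numbers = lines.flatMap(_.split("\t")).map(_.toInt)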
Working of Spark
(Diagram: blocks stored with a replication factor, and recovery in case of node failure)
Working of Spark
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
B1 → B4 => 5,6
B2 → B5 => 9, 2
B3 → B6 => 1, 5
B4, B5, B6 => Result RDD
Lineage
- When a new RDD is derived from an existing RDD, the new RDD contains a pointer to the parent RDD, and Spark keeps track of all the dependencies between these RDDs; this record is called the lineage.
- In case of data loss, this lineage is used to rebuild the
data.
- The SparkContext (Driver) maintains the Lineage.
- It is also known as Dependency Graph.
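The lineage of an RDD can be inspected directly; toDebugString prints the chain of parent RDDs, i.e. the dependency graph (shown here for the result RDD from the earlier snippet):

// Prints the dependency graph that Spark would use to recompute lost partitions
println(result.toDebugString)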
Operations in Spark
- Transformation:
- Creation of new RDD from existing RDD.
- Action:
- Produces a non-RDD result and returns it to the user, e.g. as a value or a Scala/Java collection on the driver.
val number = spark.sparkContext.textFile("path_to_File1.txt", 3)
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(x => x < 10)
result.collect() → Action
Wait a minute???
Working of Spark (Lazy Evaluation)
Until we hit an Action, none of the above operations (transformations) actually execute.
- RDDs are lazily evaluated.
- RDDs are immutable.
- The result of an action is returned to the driver or written to an external storage system.
- An action sets the lazily built chain of RDDs into motion.
- An action is one of the ways of sending data from the Executors to the Driver.
- An action kicks off a job to execute on the cluster.
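A small illustration of this laziness using the same pipeline as above: the two transformations return immediately without reading the file, and only the action triggers a job.

val number = spark.sparkContext.textFile("path_to_File1.txt", 3)          // nothing is read yet
val result = number.flatMap(_.split("\t")).map(_.toInt).filter(_ < 10)    // still nothing executes

val smallValues = result.collect()   // action: the job runs and the values are returned to the driver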
Here you go now!!
Let’s see them in action!
Word Count problem!
Apple
Banana
Orange
Apple
Cat
Dog
Cow
Orange
Cow
Banana
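A minimal word-count sketch over such input (assuming the words are stored one per line in a hypothetical words.txt):

val words = spark.sparkContext.textFile("words.txt")   // hypothetical input file, one word per line
val counts = words
  .map(word => (word, 1))                              // pair each word with a count of 1
  .reduceByKey(_ + _)                                  // sum the counts per word (this triggers a shuffle)
counts.collect().foreach(println)                      // action: e.g. (Apple,2), (Banana,2), (Cow,2), ...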
End-to-end working of Apache Spark
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types (illustrated in the sketch below):
1. Narrow Dependency
2. Wide Dependency
● Stages
Spark Execution Model
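Tying this to the word-count sketch above (an illustration, not taken from the slides): map is a narrow dependency because each output partition depends on a single input partition, while reduceByKey is a wide dependency because it shuffles data across partitions, and that shuffle is where Spark cuts the job into stages.

val pairs  = spark.sparkContext.textFile("words.txt").map(word => (word, 1))   // narrow dependencies only
val counts = pairs.reduceByKey(_ + _)                                          // wide dependency: shuffle → new stage

// The shuffle shows up in the lineage as a ShuffledRDD; everything before it runs in an earlier stage
println(counts.toDebugString)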
End-to-end working of Apache Spark
Spark Ecosystem
RDDs vs Dataframes / Datasets
Why use RDD?
- Offer control & flexibility
- Low level API
- Type Safe
- Encourage ‘How to’
RDD Example
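The example code on this slide is an image that is not preserved in the transcript; a representative sketch of the "how to" style the RDD API encourages, computing the average age per name from (name, age) pairs (the sample data is made up to match the DataFrame slides later):

// With RDDs the developer spells out *how* to compute the result, step by step
val people = spark.sparkContext.parallelize(Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)))

val avgAgeByName = people
  .map { case (name, age) => (name, (age, 1)) }                        // (name, (age, count))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))                   // sum ages and counts per name
  .map { case (name, (sum, count)) => (name, sum.toDouble / count) }   // divide to get the average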
Problem with RDD
- Focus on "How To" rather than "What To".
- Not much optimized by Spark; optimizing RDD code is the developer's responsibility.
- RDDs are a low-level API.
- Inadvertent inefficiencies.
Problem with RDD
Is this Optimized?
Structured APIs in Spark
→ Dataframes
→ Datasets
What sits above the RDD?
                  SQL        Dataframes     Datasets
Syntax errors     Run Time   Compile Time   Compile Time
Analysis errors   Run Time   Run Time       Compile Time

Analysis errors are reported before the distributed job starts.
RDD vs Dataframes vs Datasets
Dataframe Example
Output:
project  page  numRequests
en       23    45
en       24    200
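The code behind this slide is not preserved; a sketch that would produce output of this shape, assuming a hypothetical pagecounts.json file with project, page and numRequests fields:

import spark.implicits._

// With DataFrames we declare *what* we want; Catalyst decides how to execute it
val pagecounts = spark.read.json("pagecounts.json")   // hypothetical input file

val result = pagecounts
  .select("project", "page", "numRequests")
  .filter($"project" === "en")

result.show()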
Dataframe → SQL
Output:
project  page  numRequests
en       23    45
en       24    200
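The same query can be expressed in SQL by registering the DataFrame as a temporary view (a sketch; the view name is arbitrary):

pagecounts.createOrReplaceTempView("pagecounts")

val result = spark.sql(
  "SELECT project, page, numRequests FROM pagecounts WHERE project = 'en'")
result.show()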
Dataframe = Easy to write... Believe it!
name  age
Jim   20
Ann   31
Jin   30
Output:
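The output on this slide is an image; a sketch of the kind of one-liner the slide is pointing at, e.g. the average age per name over the table above:

import spark.implicits._

val people = Seq(("Jim", 20), ("Ann", 31), ("Jin", 30)).toDF("name", "age")

// Declarative: state *what* is needed and let Catalyst work out the execution plan
people.groupBy("name").avg("age").show()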
Analysis: Analyzing a logical plan to resolve references
Logical Optimization: Optimise the logical plan
Code generation: Compile parts of the query to Java
bytecode
Catalyst Optimizer
users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01")
Dataframe Optimization
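For a query like the one above, Catalyst can push the date filter below the join so that events are filtered before being joined; the effect can be inspected with explain() (a sketch, assuming users and events DataFrames with id, uid and date columns):

val joined = users.join(events, users("id") === events("uid"))
  .filter(events("date") > "2015-01-01")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan;
// in the optimized plan the date filter sits below the join
joined.explain(true)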
DataFrames are faster than RDDs
Dataset

Why:
- High level APIs & DSL
- Strong Type-Safety
- Ease of use & Readability
- What to Do

When:
- Structured Data Schema
- Code optimisations & performance
Dataframe and Dataset
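A short Dataset sketch of the strong type-safety: mapping rows to a case class so that field names and types are checked at compile time (a hedged example reusing the name/age data from the earlier slides):

import spark.implicits._

case class Person(name: String, age: Int)

// Typed API: person.age is checked by the compiler, unlike df("agee") which only fails at runtime
val peopleDS = Seq(Person("Jim", 20), Person("Ann", 31), Person("Jin", 30)).toDS()

peopleDS.filter(person => person.age > 25).show()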
DEMO
Q/A
Thank You !