Apache Spark
1
Introduction to Big Data.
• Big data refers to large, diverse sets of information that grow at
ever-increasing rates. The term encompasses the volume of
information, the velocity or speed at which it is created and
collected, and the variety or scope of the data points being
covered (commonly known as the "Three V's" of big data).
• Big data can be collected from social networks and websites, from personal
electronics, through questionnaires, product purchases, and electronic check-ins,
among many other sources.
2
Why is Big Data Important?
• Improved Decision-Making
• Enhanced Customer Experience
• Analytics Research
• Risk Management
• Public Policy and Governance
3
Main Goal of Big Data.
• The goals of big data can be to increase the speed at which products get to
market, to reduce the amount of time and resources required to gain market
adoption, to target the right audiences, and to keep customers coming back
for more.
4
Advantages of Big Data
5
Disadvantages of Big Data
6
Introduction to Apache Spark
• Apache Spark is an open-source unified analytics engine used for large-scale data processing, hereafter
referred to as Spark. Spark is designed to be fast, flexible, and easy to use, making it a popular choice for
processing large-scale data sets. Spark can run operations on billions or trillions of records on distributed
clusters up to 100 times faster than traditional disk-based applications.
• Spark can run on single-node machines or multi-node machines (clusters). Spark reuses
data by using an in-memory cache to speed up machine learning algorithms that
repeatedly call a function on the same dataset. This lowers latency, making Spark
many times faster than MapReduce, especially for machine learning and
interactive analytics. Apache Spark can also process real-time streaming data.
• It is also a multi-language engine that provides APIs (Application Programming
Interfaces) and libraries for several programming languages such as Java, Scala, Python, and
R, allowing developers to work with Spark using the language they are most comfortable
with (see the example below).
7
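Below is a minimal PySpark sketch of these ideas; the app name, data, and local master URL are illustrative assumptions, not from the slides. It distributes a collection, caches it in memory, and reuses the cached data across two actions.

from pyspark.sql import SparkSession

# Start a local, single-node session; on a cluster the master would point to
# YARN, Mesos, Kubernetes, or a standalone cluster manager instead.
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

# Distribute a collection as an RDD and cache it in memory so repeated
# computations reuse it instead of recomputing from scratch.
numbers = spark.sparkContext.parallelize(range(1_000_000)).cache()

print(numbers.sum())                                   # first action materializes the cache
print(numbers.filter(lambda x: x % 2 == 0).count())    # reuses the cached data

spark.stop()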
Features of Apache Spark
8
• Speed
Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data
processing. It achieves this speed through controlled partitioning and in-memory computation.
• Powerful Caching
A simple programming layer provides powerful caching and disk-persistence
capabilities.
• Deployment
It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
• Real-Time
It offers real-time computation and low latency because of in-memory computation.
• Polyglot
Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written
in any of these four languages. It also provides a shell in Scala and Python.
9
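A short sketch of the caching and persistence features; the dataset sizes and storage levels chosen here are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[*]").appName("caching").getOrCreate()

df = spark.range(0, 10_000_000)              # a simple single-column DataFrame

df.cache()                                    # keep the data around across actions
print(df.count())                             # first action materializes the cache
print(df.filter(df.id % 2 == 0).count())      # answered from the cached copy
df.unpersist()

# For data that may not fit in memory, an explicit disk-backed level can be chosen.
big = spark.range(0, 50_000_000).persist(StorageLevel.DISK_ONLY)
print(big.count())
big.unpersist()

spark.stop()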
Spark Ecosystem
10
As shown in the image below, the Spark ecosystem is composed of various components: Spark
SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
• Spark Core
Spark Core is the base engine for large-scale parallel and distributed data
processing. Additional libraries built on top of the core allow diverse
workloads such as streaming, SQL, and machine learning.
• Spark Streaming
Spark Streaming is the component of Spark which is used to process real-
time streaming data. Thus, it is a useful addition to the core Spark API.
• Spark SQL
Spark SQL is the module in Spark that integrates relational processing
with Spark’s functional programming API. It supports querying data either
via SQL or via the Hive Query Language.
11
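A small Spark SQL sketch: a DataFrame is registered as a temporary view and queried with plain SQL. The table, columns, and figures are made-up illustrations, not from the slides.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sparksql").getOrCreate()

emp = spark.createDataFrame(
    [("Alice", "Sales", 5000), ("Bob", "Sales", 4000), ("Cara", "IT", 6500)],
    ["name", "dept", "salary"],
)
emp.createOrReplaceTempView("employees")

# Relational processing expressed in SQL, executed by Spark's engine.
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()

spark.stop()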
• GraphX
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it
extends the Spark RDD with a Resilient Distributed Property Graph.
• MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform
machine learning in Apache Spark.
• SparkR
It is an R package that provides a distributed data frame implementation. It
also supports operations such as selection, filtering, and aggregation, but on
large datasets.
12
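As a sketch of MLlib, the snippet below trains a logistic regression model on a tiny, made-up dataset; the feature values and labels are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("mllib").getOrCreate()

# Toy training data: a feature vector and a binary label per row.
train = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 1.1]), 0.0),
        (Vectors.dense([0.1, 1.5]), 0.0),
        (Vectors.dense([2.0, 1.0]), 1.0),
        (Vectors.dense([2.5, 0.5]), 1.0),
    ],
    ["features", "label"],
)

model = LogisticRegression(maxIter=10).fit(train)        # training is distributed by Spark
model.transform(train).select("features", "prediction").show()

spark.stop()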
Lifecycles of Spark Applications
13
Apache Spark Architecture
Overview:
Apache Spark has a well-defined layered architecture where all the Spark components and layers
are loosely coupled. This architecture is further integrated with various extensions and
libraries. The Apache Spark architecture is based on two main abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
14
Resilient Distributed Dataset(RDD)
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and is capable of rebuilding data on failure
• Distributed: Distributed data among the multiple nodes in a cluster
• Dataset: Collection of partitioned data with values
15
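A quick sketch of how an RDD's data is partitioned; the partition count and data are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("rdd-partitions").getOrCreate()
sc = spark.sparkContext

# Distribute the data across 4 partitions; on a cluster, partitions can live on different nodes.
rdd = sc.parallelize(range(12), numSlices=4)

print(rdd.getNumPartitions())     # 4
print(rdd.glom().collect())       # elements grouped per partition, e.g. [[0, 1, 2], [3, 4, 5], ...]

spark.stop()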
• RDDs are split into chunks based on a key.
• RDDs are highly resilient, allowing them to
quickly recover from issues.
• The same data chunks are replicated across
multiple executor nodes.
• If one executor node fails, another will still
process the data.
• RDDs are divided into logical partitions.
• These partitions may be computed on different
nodes of the cluster.
• This setup allows for the parallel execution of
transformations or actions on the complete
dataset.
With RDDs, you can perform two types of
operations (see the sketch below):
•Transformations: They are the operations that
are applied to create a new RDD.
•Actions: They are applied on an RDD to instruct
Apache Spark to apply computation and pass the
result back to the driver.
16
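A minimal illustration of transformations versus actions; the data is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("ops").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: they only describe new RDDs, nothing runs yet.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation and return a result to the driver.
print(evens.collect())    # [4, 16]
print(squares.count())    # 5

spark.stop()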
Driver Program:
• Create a Spark Context, acting as a gateway to all Spark functionalities (similar to a database
connection).
• Spark Context manages jobs with the cluster manager, splitting them into tasks.
Execution Process:
• Jobs are split into multiple tasks, distributed to worker nodes.
• Tasks are executed on partitioned RDDs within worker nodes.
• Results are returned to the Spark Context.
Task Management:
• Spark Context handles job distribution and execution within the cluster.
• RDDs created in Spark Context can be distributed and cached across nodes.
Scalability:
• Increasing the number of workers:
• Divides jobs into more partitions for parallel execution.
• Increases execution speed.
• Adds memory, allowing more data to be cached for faster execution.
17
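A sketch of a driver program creating a Spark Context; the master URL, app name, and job are illustrative assumptions.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("driver-example")
    .setMaster("local[2]")      # on a real cluster: "yarn", "spark://host:7077", etc.
)
sc = SparkContext(conf=conf)    # the gateway to Spark; talks to the cluster manager

# The context splits this job into tasks and schedules them on worker nodes.
result = sc.parallelize(range(100), 4).map(lambda x: x * 2).sum()
print(result)                   # 9900

sc.stop()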
What are: Job, Stage, and Task in Apache Spark
Concept of Job in Spark
• A job in Spark refers to a sequence of
transformations on data. Whenever an action such as
count(), first(), collect(), or save() is called on
an RDD (Resilient Distributed Dataset), a job is
created.
Concept of Stage in Spark
• A stage in Spark represents a sequence of
transformations that can be executed in a single
pass, i.e., without any shuffling of data.
Concept of Task in Spark
• A task in Spark is the smallest unit of work that
can be scheduled. Each stage is divided into tasks.
A task is a unit of execution that runs on a single
machine.
18
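The sketch below shows how one action triggers one job, and how a shuffle (reduceByKey) splits that job into two stages, each running one task per partition; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("jobs").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"], 4)
pairs = words.map(lambda w: (w, 1))               # transformation: no job yet
counts = pairs.reduceByKey(lambda x, y: x + y)    # transformation that requires a shuffle

# The action below triggers one job; the shuffle splits it into two stages.
print(counts.collect())
print(counts.toDebugString().decode())            # lineage showing the shuffle/stage boundary

spark.stop()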
Introduction to PySpark
• Apache Spark is a fast cluster-computing framework used for processing, querying, and analyzing
big data. Because it is based on in-memory computation, it has an advantage over several other big data
frameworks.
• PySpark helps data scientists interface with RDDs in Apache Spark and Python through its library Py4j.
• Example: Media is one of the biggest industries moving toward online streaming. Netflix uses Apache
Spark for real-time stream processing to provide personalized online recommendations to its customers.
Why Go for Python?
• Easy to Learn
• A vast set of Libraries
• Huge Community Support
19
Why do we Need PySpark?
 Handles Big Data Efficiently: Traditional tools struggle with large datasets, whereas PySpark processes
them smoothly in a distributed computing environment.
 Speed and Performance: In-memory processing makes PySpark faster than disk-based frameworks like
Hadoop MapReduce, which is crucial for real-time data analysis.
 Versatility: PySpark supports structured as well as unstructured data from various sources, making it
highly versatile.
 Advanced Analytics: Built-in libraries for machine learning (ML) and graph
processing enable comprehensive data analysis and modelling.
 Python Compatibility: Python compatibility ensures an easy transition for Python users, helping to
broaden the user base and empowering collaboration.
 All-in-One Solution: PySpark reduces complexity by combining multiple tools and technologies into
one framework.
20
What is a DataFrame?
• A DataFrame is a table-like structure that can be manipulated using
SQL-like methods. A DataFrame can be thought of as a table with
rows and columns. Each column is a field, and each row is a record.
For instance, the DataFrame below has two fields, age and name, and
three records: (null, Michael), (30, Andy), and (19, Justin).
21
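A sketch that builds the DataFrame described above; the session setup is an illustrative assumption (in the PySpark shell, spark already exists).

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-example").getOrCreate()

# Two fields (age, name) and three records, with a null age for Michael.
people = spark.createDataFrame(
    [(30, "Andy"), (19, "Justin"), (None, "Michael")],
    ["age", "name"],
)
people.printSchema()   # age: long (nullable), name: string (nullable)
people.show()          # prints the three records as a small table

spark.stop()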
How to Create a DataFrame?
# Each record must be a tuple, even with a single column; 'spark' is the
# SparkSession (created automatically in the PySpark shell).
Emp = [('Me',), ('Myself',)]
Column = ['Name']
DF = spark.createDataFrame(data=Emp, schema=Column)
DF.show()
22
Difference Between RDD and DataFrame
RDD
• It provides a low-level API.
• There are no built-in
optimization techniques.
• It supports structured and
unstructured data.
• It does not enforce a schema.
DataFrame
• It provides a high-level API.
• It offers several built-in
optimization techniques.
• It supports structured and semi-
structured data.
• It supports a schema.
A short example contrasting the two APIs follows this slide.
23
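A sketch showing the same aggregation written once against the low-level RDD API and once against the high-level DataFrame API; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
sc = spark.sparkContext

data = [("Sales", 5000), ("Sales", 4000), ("IT", 6500)]

# RDD: low-level, schema-free; the key/value logic is written by hand.
totals_rdd = sc.parallelize(data).reduceByKey(lambda x, y: x + y)
print(totals_rdd.collect())

# DataFrame: high-level and schema-aware; the query is planned and optimized by Spark.
df = spark.createDataFrame(data, ["dept", "salary"])
df.groupBy("dept").sum("salary").show()

spark.stop()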
Spark File Formats
Spark supports many file formats, including:
• Text
• CSV
• JSON
• Parquet
• ORC (Optimised Row Columnar)
• Sequence
• Avro
ORC
ORC (Optimised Row Columnar) is a columnar file format. It has faster reads but slower writes and
also supports compression.
Sequence
A SequenceFile is a flat file consisting of binary key/value pairs. SequenceFile is splittable and also
supports compression.
Avro
Avro is a row-based (record-oriented) file format. It seamlessly supports record schema changes over time. It has
faster writes compared to columnar file formats.
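A short sketch of writing and reading a few of these formats with the DataFrame API; the paths and data are illustrative, and Avro additionally requires the external spark-avro package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("formats").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

df.write.mode("overwrite").csv("/tmp/demo_csv", header=True)    # text-based, row-oriented
df.write.mode("overwrite").json("/tmp/demo_json")               # semi-structured, row-oriented
df.write.mode("overwrite").parquet("/tmp/demo_parquet")         # columnar, compressed
df.write.mode("overwrite").orc("/tmp/demo_orc")                 # columnar, compressed
# Avro (needs the spark-avro package on the classpath):
# df.write.mode("overwrite").format("avro").save("/tmp/demo_avro")

spark.read.parquet("/tmp/demo_parquet").show()

spark.stop()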