2. Introduction to Big Data.
• Big data refers to large, diverse sets of information that grow at
ever-increasing rates. The term encompasses the volume of
information, the velocity or speed at which it is created and
collected, and the variety or scope of the data points being
covered (commonly known as the "Three V's" of big data).
• Big data can be collected from social networks and websites, from personal
electronics, through questionnaires, product purchases, and electronic check-ins,
among many other sources.
3. Why is Big Data Important?
• Improved Decision-Making
• Enhanced Customer Experience
• Analytics Research
• Risk Management
• Public Policy and Governance
4. Main Goal of Big Data.
• The goals of big data can be to increase the speed at which products get to
market, to reduce the amount of time and resources required to gain market
adoption, to target the right audiences, and to keep customers coming back
for more.
7. Introduction to Apache Spark
• Apache Spark is an open-source unified analytics engine for large-scale data processing, hereafter
referred to as Spark. Spark is designed to be fast, flexible, and easy to use, making it a popular choice for
processing large-scale data sets. Spark can run operations on billions or trillions of records on distributed
clusters up to 100 times faster than traditional applications.
• Spark can run on single-node machines or multi-node machines (clusters). Spark reuses
data through an in-memory cache, which speeds up machine learning algorithms that
repeatedly call a function on the same dataset (a minimal caching sketch follows this list).
This lowers latency, making Spark many times faster than MapReduce, especially for
machine learning and interactive analytics. Apache Spark can also process real-time streams.
• It is also a multi-language engine that provides APIs (Application Programming
Interfaces) and libraries for several programming languages, including Java, Scala, Python, and
R, allowing developers to work with Spark in the language they are most comfortable
with.
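The lines below are a minimal PySpark sketch (not from the slides) of the in-memory caching idea: a dataset reused by several actions is cached once and then read from memory. The app name and example computations are illustrative assumptions.

from pyspark.sql import SparkSession

# Minimal sketch: cache a dataset that several computations reuse.
spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()

data = spark.range(1_000_000).cache()   # mark the dataset for in-memory caching
data.count()                            # first action materialises the cache
data.selectExpr("avg(id)").show()       # later passes read from memory, not from disk
data.selectExpr("max(id)").show()

spark.stop()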
9. • Speed
Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data
processing. It achieves this speed through controlled partitioning.
• Powerful Caching
A simple programming layer provides powerful caching and disk persistence
capabilities.
• Deployment
It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
• Real-Time
It offers real-time computation and low latency thanks to in-memory computation.
• Polyglot
Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written
in any of these four languages. It also provides a shell in Scala and Python.
10. Spark Eco-System
The Spark ecosystem is composed of various components: Spark SQL, Spark Streaming, MLlib,
GraphX, and the Spark Core API.
11. • Spark Core
Spark Core is the base engine for large-scale parallel and distributed data
processing. Additional libraries built on top of the core enable diverse
workloads such as streaming, SQL, and machine learning.
• Spark Streaming
Spark Streaming is the component of Spark used to process real-time
streaming data. It is therefore a useful addition to the core Spark API.
• Spark SQL
Spark SQL is the Spark module that integrates relational processing with
Spark's functional programming API. It supports querying data either via
SQL or via the Hive Query Language (a short sketch follows this list).
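As a minimal illustration (not from the slides), the following PySpark sketch registers a DataFrame as a temporary view and queries it with SQL; the view name and sample rows are assumptions for the example.

from pyspark.sql import SparkSession

# Minimal sketch: relational processing over a DataFrame with Spark SQL.
spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

people = spark.createDataFrame(
    [(None, "Michael"), (30, "Andy"), (19, "Justin")], ["age", "name"]
)
people.createOrReplaceTempView("people")                       # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 20").show()     # prints Andy

spark.stop()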
12. • GraphX
GraphX is the Spark API for graphs and graph-parallel computation. It
extends the Spark RDD with a Resilient Distributed Property Graph.
• MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform
machine learning in Apache Spark (a minimal sketch follows this list).
• SparkR
SparkR is an R package that provides a distributed data frame implementation.
It supports operations such as selection, filtering, and aggregation, but on
large datasets.
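A minimal sketch (not from the slides) of machine learning in Spark, using the DataFrame-based pyspark.ml API; the tiny training set and parameters are assumptions chosen for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

# Minimal sketch: fit a logistic regression model on a toy DataFrame.
spark = SparkSession.builder.appName("ml-demo").master("local[*]").getOrCreate()

training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (0.0, Vectors.dense([2.0, 1.0])),
     (1.0, Vectors.dense([0.1, 1.3]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=10).fit(training)
print(model.coefficients)                 # learned weights for the two features

spark.stop()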
14. Apache Spark Architecture
Overview:
Apache Spark has a well-defined layered architecture in which all the Spark components and layers
are loosely coupled. This architecture is further integrated with various extensions and
libraries. Apache Spark Architecture is based on two main abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
15. Resilient Distributed Dataset (RDD)
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault-tolerant and capable of rebuilding data on failure
• Distributed: Data is distributed among multiple nodes in a cluster
• Dataset: A collection of partitioned data with values
16. • RDDs are split into chunks based on a key.
• RDDs are highly resilient, allowing them to
quickly recover from issues.
• The same data chunks are replicated across
multiple executor nodes.
• If one executor node fails, another will still
process the data.
• RDDs are divided into logical partitions.
• These partitions may be computed on different
nodes of the cluster.
• This setup allows for the parallel execution of
transformations or actions on the complete
dataset.
With RDDs, you can perform two types of
operations (a minimal sketch follows this list):
• Transformations: Operations that are
applied to create a new RDD.
• Actions: Operations applied to an RDD that
instruct Apache Spark to run the computation
and return the result to the driver.
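The following is a minimal PySpark sketch (not from the slides) of transformations versus actions; the sample numbers and partition count are illustrative assumptions.

from pyspark.sql import SparkSession

# Minimal sketch: transformations build new RDDs lazily; an action triggers
# the computation and returns the result to the driver.
spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10), numSlices=4)   # RDD split into 4 partitions
evens   = numbers.filter(lambda x: x % 2 == 0)     # transformation: new RDD
squares = evens.map(lambda x: x * x)               # transformation: new RDD
print(squares.collect())                           # action: [0, 4, 16, 36, 64]

spark.stop()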
17. Driver Program:
• Creates a Spark Context, which acts as a gateway to all Spark functionality (similar to a database
connection); a minimal sketch follows this list.
• The Spark Context works with the cluster manager to manage jobs, splitting them into tasks.
Execution Process:
• Jobs are split into multiple tasks, distributed to worker nodes.
• Tasks are executed on partitioned RDDs within worker nodes.
• Results are returned to the Spark Context.
Task Management:
• Spark Context handles job distribution and execution within the cluster.
• RDDs created in Spark Context can be distributed and cached across nodes.
Scalability:
• Increasing the number of workers:
• Divides jobs into more partitions for parallel execution.
• Increases execution speed.
• Enhances memory size, allowing job caching for faster execution.
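Below is a minimal sketch (not from the slides) of a driver program that creates a Spark Context and runs a small distributed computation; the app name, master URL, and partition count are assumptions.

from pyspark import SparkConf, SparkContext

# Minimal sketch: the driver creates a SparkContext, the gateway to the cluster.
conf = SparkConf().setAppName("driver-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100), numSlices=8)   # data distributed across 8 partitions
print(rdd.sum())                                # tasks run on workers; the result returns to the driver

sc.stop()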
18. What are Job, Stage, and Task in Apache Spark?
Concept of a Job in Spark
• A job in Spark refers to a sequence of
transformations on data. Whenever an action such as
count(), first(), collect(), or save() is called on an
RDD (Resilient Distributed Dataset), a job is
created.
Concept of a Stage in Spark
• A stage in Spark represents a sequence of
transformations that can be executed in a single
pass, i.e., without any shuffling of data.
Concept of a Task in Spark
• A task in Spark is the smallest unit of work that
can be scheduled. Each stage is divided into tasks.
A task is a unit of execution that runs on a single
machine (see the sketch after this list).
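A minimal sketch (not from the slides) of how these concepts map onto code: the single action triggers one job, the shuffle in reduceByKey splits it into two stages, and each stage runs one task per partition. The word list and partition count are illustrative assumptions.

from pyspark import SparkConf, SparkContext

# Minimal sketch: one action -> one job; a shuffle -> a stage boundary.
sc = SparkContext(conf=SparkConf().setAppName("job-demo").setMaster("local[*]"))

words  = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
pairs  = words.map(lambda w: (w, 1))             # narrow transformation: stays in the same stage
counts = pairs.reduceByKey(lambda x, y: x + y)   # shuffle boundary: starts a new stage
print(counts.collect())                          # action: triggers the job

sc.stop()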
19. Introduction to PySpark
• Apache Spark is a fast cluster-computing framework used for processing, querying, and analyzing
big data. Being based on in-memory computation, it has an advantage over several other big data
frameworks.
• PySpark lets data scientists interface with RDDs in Apache Spark from Python through the Py4j library.
• Example: Media is one of the biggest industries moving toward online streaming. Netflix uses Apache
Spark for real-time stream processing to provide personalized online recommendations to its customers.
Why Go for Python?
• Easy to Learn
• A vast set of Libraries
• Huge Community Support
20. Why do we Need PySpark?
Handles Big Data Efficiently: Traditional tools struggle with large datasets, whereas PySpark processes
them smoothly in a distributed computing environment.
Speed and Performance: In-memory processing makes PySpark faster than disk-based frameworks like
Hadoop MapReduce, which is crucial for real-time data analysis.
Versatility: PySpark supports structured as well as unstructured data from various sources, making it
highly versatile.
Advanced Analytics: It provides built-in libraries for machine learning (ML) and graph
processing, enabling comprehensive data analysis and modelling.
Python Compatibility: Python compatibility ensures an easy transition for Python users, helping to
broaden the user base and empowering collaboration.
All-in-One Solution: PySpark reduces complexity by combining multiple tools and technologies into
one framework.
21. What is a Data Frame?
• A DataFrame is a table-like structure that can be manipulated using
SQL-like methods. A DataFrame can be thought of as a table with
rows and columns: each column is a field, and each row is a record.
For instance, a DataFrame might have two fields, age and name, and
three records: (null, Michael), (30, Andy), and (19, Justin).
22. How to create a Data Frame?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()    # entry point, needed before createDataFrame
Emp = [('Me',), ('Myself',)]                  # one-element tuples, one row each
Column = ['Name']
DF = spark.createDataFrame(data=Emp, schema=Column)
DF.show()
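DF.show() prints a single Name column with two rows, Me and Myself. Note that each row must be a one-element tuple such as ('Me',); a bare ('Me') is just the string 'Me' and cannot be used to build a row.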
23. Difference Between RDD and Data Frame
RDD
• It exposes a low-level API.
• There are no built-in
optimization techniques.
• It supports structured and
unstructured data.
• It does not enforce a schema.
Data Frame
• It exposes a high-level API.
• Several optimization
techniques are built in.
• It supports structured and semi-
structured data.
• It enforces a schema.
A minimal sketch contrasting the two follows.
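The sketch below (not from the slides) shows the same data handled as a low-level RDD and as a schema-aware DataFrame; the column names and sample rows are assumptions.

from pyspark.sql import SparkSession

# Minimal sketch: the same records as an RDD (no schema) and as a DataFrame (with schema).
spark = SparkSession.builder.appName("rdd-vs-df").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("Me", 25), ("Myself", 30)])    # low-level API, no schema
df  = rdd.toDF(["Name", "Age"])                       # DataFrame adds column names and types
df.filter(df.Age > 26).show()                         # high-level API, optimised execution
print(df.rdd.map(lambda row: row.Name).collect())     # back down to an RDD when needed

spark.stop()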
24. Spark File Formats.
Spark supports many file formats:
• Text
• CSV
• JSON
• Parquet
• ORC (Optimised Row Columnar)
• Sequence
• Avro
25. ORC
ORC (Optimised Row Columnar) is a columnar file format. It has faster reads but slower writes and
also supports compression.
Sequence
A SequenceFile is a flat file consisting of binary key/value pairs. SequenceFiles are splittable and also
support compression.
Avro
Avro is a record-based file format. It seamlessly supports record schema changes over time. It has
faster writes than columnar file formats (a short read/write sketch follows).
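A minimal sketch (not from the slides) of writing and reading a few of these formats with PySpark; the /tmp paths and sample data are assumptions, and Avro is omitted here because it additionally requires the external spark-avro package.

from pyspark.sql import SparkSession

# Minimal sketch: write the same DataFrame in several formats, then read one back.
spark = SparkSession.builder.appName("formats-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame([("Me", 25), ("Myself", 30)], ["Name", "Age"])

df.write.mode("overwrite").csv("/tmp/people_csv", header=True)   # row-based text format
df.write.mode("overwrite").parquet("/tmp/people_parquet")        # columnar format
df.write.mode("overwrite").orc("/tmp/people_orc")                # columnar format

spark.read.parquet("/tmp/people_parquet").show()                 # read it back

spark.stop()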