Big Data Training
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certified, plus certifications in Azure, Snowflake, etc.
• You can find me on
• LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium: https://medium.com/@vishalrv1904
Agenda
• Big Data Overview
• Dimensions of Big Data
• Traditional Approach and Limitations
• Hadoop Overview
• Spark Overview
• Hive Overview
• Other Big Data Frameworks
Big Data Overview
What is Big Data?
• Smartphone users collectively generate approximately 40 exabytes of data every month.
• According to Forbes, 2.5 quintillion bytes of data are created every day.
What is Big Data?
• A collection of data so huge and complex that no traditional data management tool can store or process it.
Dimensions of Big Data
The 6 V’s of Big Data
• Volume
• The scale of data.
• Velocity
• Speed of data.
• Variety
• Diversity of data.
• Veracity
• Accuracy of data.
• Value
• Insights gained from data.
• Variability
• How often data can change.
Big Data Phases
Big Data Phases
• Data Collection
• Data Cleansing / Validation
• Data Transformation
• Data Storage
• Data Visualization
Different Pipelines:
• ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
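These phases are typically wired together as a pipeline. A minimal PySpark ETL sketch, where the file paths and column names (event_id, timestamp) are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw data (hypothetical path and schema).
raw = spark.read.csv("raw_events.csv", header=True, inferSchema=True)

# Transform: cleanse/validate, then reshape.
clean = (raw
         .dropna(subset=["event_id"])              # validation: drop incomplete rows
         .withColumn("day", F.to_date("timestamp")))

# Load: store the result for downstream analysis/visualization.
clean.write.mode("overwrite").parquet("warehouse/events")
```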
Traditional Approach
Traditional Approach
• Traditionally, an enterprise would use a single computer to store and process big data.
• Limitations:
• A single processor becomes a bottleneck when processing the data.
• A single machine cannot scale to store and handle ever-growing volumes of data.
Traditional Approach
• Google’s solution:
• Solved the processing problem with an algorithm called MapReduce.
• MapReduce divides a task into small parts and assigns them to many computers.
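A minimal, framework-free sketch of the idea in pure Python, using a hypothetical word-count job:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: aggregate all values that share the same key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data is everywhere"]
# On a real cluster each document would be mapped on a different node;
# here we simulate that by chaining the per-document map outputs.
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```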
Hadoop Overview
Hadoop Overview
• Building on the solution published by Google, Doug Cutting and his team developed an open-source project called Hadoop.
Hadoop Overview
• MapReduce
• Framework for distributed data processing.
• Maps data to key/value pairs.
• Reduces intermediate results to final output.
• Largely supplanted by Spark these days.
• YARN (Yet Another Resource Negotiator)
• Manages cluster resources for multiple data processing frameworks.
• HDFS (Hadoop Distributed File System)
• Distributes data blocks across the cluster in a redundant manner.
Spark Overview
Spark Overview
• Hadoop MapReduce must persist data back to disk after every Map or Reduce step.
• This disk I/O slows processing.
• Spark is a distributed processing framework for big data.
• Apache Spark is best known for its speed: it runs up to 100 times faster in memory and about ten times faster on disk than Hadoop MapReduce, since it processes data in memory (RAM).
• Supports Java, Scala, Python, and R.
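A minimal PySpark quick-start sketch; the application name is arbitrary and local mode is used only for the demo:

```python
from pyspark.sql import SparkSession

# Create (or reuse) the application's entry point.
spark = (SparkSession.builder
         .appName("big-data-training")  # arbitrary app name
         .master("local[*]")            # run locally on all cores for the demo
         .getOrCreate())

# Build a small in-memory DataFrame and run a computation on it.
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "score"])
print(df.count())  # 2

spark.stop()
```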
Spark Components
[Diagram: the Spark stack, with Spark Core at the base and the Spark SQL, Spark Streaming, MLlib, and GraphX libraries on top]
How Spark Works
• Spark applications run as independent sets of processes on a cluster.
• Executors run computations and store data.
• The Spark context sends application code and tasks to the executors.
• A cluster manager (e.g., YARN) allocates cluster resources.
Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 1.x introduced three entry points:
• Spark Context:
• The entry point of every Spark application.
• Creating a Spark Context is the first step to using RDDs and connecting to a Spark cluster.
• SQL Context:
• Used for Spark SQL execution and structured data processing.
• Hive Context:
• Used by applications to communicate with Hive.
Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 2.x introduced the Spark Session:
• A single entry point that combines the Spark context, SQL context, and Hive context.
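A brief sketch of the unification, assuming a Spark 2.x or later install:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("contexts-demo").getOrCreate()

# The older entry points are still reachable through the unified session:
sc = spark.sparkContext          # Spark 1.x SparkContext (for RDDs)
rdd = sc.parallelize([1, 2, 3])  # RDD API via the embedded context
print(rdd.sum())                 # 6

# SQLContext/HiveContext functionality is exposed directly on the session:
spark.sql("SELECT 1 AS answer").show()
```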
Resilient Distributed Dataset (RDD) & DataFrame
• The RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
• A DataFrame is organized into named columns.
• DataFrames support APIs such as select, agg, sum, avg, etc.
• DataFrames support Spark SQL.
• The Catalyst optimizer is available for DataFrames.
• Both are fault-tolerant, immutable distributed collections of objects, meaning they cannot be changed once created.
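A small sketch contrasting the two APIs; the column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD: low-level, schema-less collection manipulated with lambdas.
rdd = spark.sparkContext.parallelize([("IN", 10), ("US", 7), ("IN", 5)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(totals_rdd.collect())  # [('IN', 15), ('US', 7)] (order may vary)

# DataFrame: named columns, declarative API, optimized by Catalyst.
df = spark.createDataFrame(rdd, ["country", "cases"])
df.groupBy("country").agg(F.sum("cases").alias("total")).show()

# Immutability: transformations return a *new* DataFrame.
upper_df = df.withColumn("country", F.upper(F.col("country")))
```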
Different Types of Evaluation
• Eager evaluation:
• The evaluation strategy you are most probably familiar with; it is used in most programming languages.
• Lazy evaluation:
• An evaluation strategy that delays the evaluation of an expression until its value is needed.
• Lazy evaluation means you can apply as many TRANSFORMATIONS as you want, but Spark will not start executing the process until an ACTION is called.
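A sketch of lazy evaluation in practice; the filter values are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
df = spark.range(1_000_000)  # DataFrame with a single 'id' column

# Transformations: nothing is computed yet; Spark only records the lineage.
evens = df.filter(df.id % 2 == 0)
small_evens = evens.filter(evens.id < 100)

# Action: only now does Spark build a plan and actually execute it.
print(small_evens.count())  # 50
```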
Transformations & Actions
• Transformations are the instructions you use to modify the DataFrame in the way you want; they are lazily executed.
• Narrow transformations (no data movement between partitions):
• select
• filter
• withColumn
• Wide transformations (require a shuffle across partitions):
• groupBy
• repartition
• Actions are eager statements that ask for a value to be computed immediately (see the sketch below):
• show, collect, save, count.
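A minimal sketch chaining narrow and wide transformations before a single action; the schema and values are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()
df = spark.createDataFrame(
    [("kerala", 120), ("delhi", 80), ("kerala", 60)],
    ["state", "cases"],
)

result = (df
          .select("state", "cases")                      # narrow
          .filter(F.col("cases") > 50)                   # narrow
          .withColumn("state", F.upper(F.col("state")))  # narrow
          .groupBy("state")                              # wide (shuffle)
          .sum("cases"))

result.show()  # action: execution starts here
```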
Spark’s Catalyst Optimizer
• As you apply transformations, Spark stores them in a Directed Acyclic Graph (DAG).
• Once the DAG is constructed, Spark’s Catalyst optimizer performs a set of rule-based and cost-based optimizations to determine a logical and then a physical plan of execution.
• The Catalyst optimizer groups operations together, reducing the number of passes over the data and improving performance.
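You can inspect what Catalyst produces with explain(); a quick sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

query = df.filter(F.col("value") > 1).select("key")

# Prints the parsed/analyzed/optimized logical plans and the physical plan;
# note how Catalyst pushes the filter down and prunes unused columns.
query.explain(extended=True)
```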
Spark Hands-on
Spark Assignment
• Input:
• A COVID data CSV file.
• Expected outputs:
• Convert all state names to lowercase.
• Find the day with the greatest number of COVID cases.
• Find the state with the second-largest number of COVID cases.
• Find the Union Territory with the fewest deaths.
• Find the state with the lowest death-to-total-confirmed-cases ratio.
• Find the month with the most newly recovered cases.
• If the month is 02, it should be displayed as February.
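A starter sketch for the first two tasks; the file name and column names (State, Date, Confirmed) are assumptions about the dataset and should be adjusted to the actual CSV:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid-assignment").getOrCreate()

# Assumed file name and schema; adjust to the actual dataset.
df = spark.read.csv("covid_data.csv", header=True, inferSchema=True)

# Task 1: convert all state names to lowercase.
df = df.withColumn("State", F.lower(F.col("State")))

# Task 2 (pattern for the remaining tasks): day with the most cases.
(df.groupBy("Date")
   .agg(F.sum("Confirmed").alias("total_cases"))
   .orderBy(F.desc("total_cases"))
   .show(1))
```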
Hive Overview
Apache Hive
• Uses familiar SQL syntax (HiveQL).
• Scalable: works with “big data” on a cluster.
• Most appropriate for data warehouse applications.
• Easy OLAP queries, far easier than writing MapReduce in Java.
• Interactive and highly optimized.
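Hive itself is queried with HiveQL; to keep the examples in one language, here is a sketch issuing HiveQL through Spark’s Hive support. The sales table and its columns are hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark use the Hive metastore and HiveQL features.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical warehouse table and columns.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (region STRING, amount DOUBLE)
""")
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()
```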
Other Big Data Frameworks
• Apache Pig:
• Introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
• Apache HBase:
• Non-relational, petabyte-scale database.
• In-memory, based on Google’s Bigtable, built on top of HDFS.
• Presto:
• Can connect to many different “big data” databases and data stores at once and query across them.
• Interactive queries at petabyte scale.
• Apache Zeppelin:
• Interactively run scripts/code against your data (a notebook interface).
Questions
