Big data and HDFS
Section 2
Agenda
 DFS
 What are the problems with big data?
 Hadoop is the solution! What is Hadoop?
 HDFS
 YARN
DFS
• A distributed file system (DFS) is a file system that lets clients access file storage on multiple hosts through a computer network as if they were accessing local storage.
• Files are spread across multiple storage servers and multiple locations, which enables users to share data and storage resources.
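To make the idea concrete, here is a minimal Python sketch (mine, not from the slides) that splits a file into fixed-size blocks and spreads them round-robin over a set of hypothetical nodes; the block size, file name, and node names are all made up:

```python
# A minimal sketch: split a local file into fixed-size blocks and assign each
# block to a node round-robin, roughly how a DFS partitions data. The block
# size is illustrative; HDFS, for example, defaults to 128 MB blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, illustrative only


def partition(path, nodes, block_size=BLOCK_SIZE):
    """Return a list of (block_id, node, size_in_bytes) placements."""
    placement = []
    with open(path, "rb") as f:
        block_id = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            placement.append((block_id, nodes[block_id % len(nodes)], len(data)))
            block_id += 1
    return placement


if __name__ == "__main__":
    # Hypothetical file and a hypothetical cluster of three nodes.
    print(partition("data.csv", ["node-1", "node-2", "node-3"]))
```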
Why store data in general?
• Long-term information storage
• Access the result of a process later
• Store large amounts of information
• Enable access by multiple processes
So how do we cope with growing data? Buy a bigger disk? Copy the data to an external hard drive? Neither approach scales, for personal or work data alike.
[Diagram: a DFS cluster, with storage nodes grouped into racks.]
[Diagram: a data file split into blocks 1-5 and stored across the nodes of a rack; block 5 can be analyzed directly on the node that holds it.]
[Diagram: the same blocks replicated across several racks, so each block lives on multiple nodes.]
[Diagram: three readers reading replicas of block 5 at the same time, i.e. high concurrency vs. low consistency.]
Distributed File System (DFS)
A DFS provides data scalability, fault tolerance, high concurrency, data replication, and data partitioning. When many storage computers (nodes, grouped into racks) are connected through the network, we call it a DFS.
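The replication and fault-tolerance points can be illustrated with a short sketch (a toy, not HDFS's real rack-aware placement policy): each block gets three replicas on distinct nodes, so losing one node leaves every block readable.

```python
REPLICATION = 3  # three replicas per block, a common HDFS default; illustrative here


def place_replicas(blocks, nodes, replication=REPLICATION):
    """Naively map each block to `replication` distinct nodes by rotation."""
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(blocks)
    }


def still_readable(placement, failed_node):
    """Every block survives as long as one of its replicas is on a live node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


placement = place_replicas(blocks=[1, 2, 3, 4, 5],
                           nodes=["node-1", "node-2", "node-3", "node-4"])
print(placement)
print("Readable after node-2 fails:", still_readable(placement, "node-2"))  # True
```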
Big data problems
• Storing huge and exponentially growing datasets.
• Processing data with complex structure.
• The bottleneck of bringing huge amounts of data to the computation unit.
Hadoop is the solution! What is Hadoop?
Hadoop Ecosystem
• It is a set of services and frameworks (open-source projects) that solve big data problems.
• You can consider it a suite that encompasses a number of services (ingesting, storing, analyzing and maintaining data) inside it.
• Example: to analyze transaction data from an RDBMS, we first need to ingest it into the Hadoop Distributed File System (HDFS), as in the sketch below.
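As a rough illustration of that ingestion step (in practice tools like Apache Sqoop or Flume do this job), a sketch using Python's built-in sqlite3 plus the third-party `hdfs` WebHDFS client might look like the following; the NameNode address, database file, and table are all hypothetical:

```python
import csv
import io
import sqlite3

from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

# Hypothetical endpoints: adjust the NameNode HTTP address, user, and DB path.
client = InsecureClient("http://namenode:9870", user="hadoop")
conn = sqlite3.connect("transactions.db")

# Dump the (hypothetical) transactions table to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in conn.execute("SELECT id, amount, created_at FROM transactions"):
    writer.writerow(row)

# Write the CSV into HDFS so downstream tools (Hive, Spark, ...) can read it.
client.write("/data/transactions.csv", data=buffer.getvalue(), overwrite=True)
```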
HDFS
• The Hadoop Distributed File System is the core component, or you could say the backbone, of the Hadoop Ecosystem.
• HDFS is what makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
• It helps us store our data across various nodes while maintaining a log of the stored data (metadata).
• HDFS has two core components: the NameNode and the DataNodes.
HDFS
1. The NameNode is the main node and it doesn't store the actual data. It contains metadata, much like a log file or a table of contents. It therefore requires less storage but high computational resources.
2. All your data, on the other hand, is stored on the DataNodes, which therefore require more storage resources. These DataNodes are commodity hardware (like your laptops and desktops) in the distributed environment; that's why Hadoop solutions are very cost-effective.
3. You always communicate with the NameNode when writing data. It then tells the client which DataNodes to store and replicate the data on, as in the toy model below.
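A toy model of that division of responsibilities (purely illustrative; real HDFS uses RPC between client, NameNode, and DataNodes): the NameNode keeps only metadata, i.e. which blocks make up a file and where their replicas live, while DataNodes hold the actual block bytes.

```python
class NameNode:
    """Stores metadata only: file name -> list of (block_id, replica locations)."""

    def __init__(self):
        self.metadata = {}

    def add_file(self, name, block_ids, datanodes, replication=3):
        self.metadata[name] = [
            (b, [datanodes[(i + r) % len(datanodes)] for r in range(replication)])
            for i, b in enumerate(block_ids)
        ]

    def locate(self, name):
        return self.metadata[name]  # tells a client where to find each block


class DataNode:
    """Stores the actual block bytes on commodity hardware."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


nn = NameNode()
nn.add_file("logs.txt", block_ids=[0, 1, 2],
            datanodes=["dn-1", "dn-2", "dn-3", "dn-4"])
print(nn.locate("logs.txt"))  # metadata only: no file contents live here
```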
For more information on how HDFS works, visit: https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/
YARN
• Yet Another Resource Negotiator (YARN).
• Consider YARN the brain of your Hadoop Ecosystem: it drives all your processing activities by allocating resources and scheduling tasks.
• It has two major components: the ResourceManager and the NodeManager.
• The ResourceManager controls all the resources and decides who gets what.
• The NodeManager is responsible for launching containers with the specified resource constraints.
• NodeManagers are installed on every DataNode. A simplified allocation sketch follows.
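A simplified sketch of that split (an illustration only, not the real YARN protocol): the ResourceManager tracks free capacity per node and decides which NodeManager gets each container request; real YARN also tracks vcores, queues, and ApplicationMasters.

```python
class ResourceManager:
    """Toy scheduler: grants containers on the node with the most free memory."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)  # NodeManager name -> free memory (MB)

    def allocate(self, app, memory_mb):
        node = max(self.free, key=self.free.get)
        if self.free[node] < memory_mb:
            return None  # no node can satisfy the request right now
        self.free[node] -= memory_mb
        # The real NodeManager on `node` would now launch the container.
        return {"app": app, "node": node, "memory_mb": memory_mb}


rm = ResourceManager({"node-1": 8192, "node-2": 4096})
print(rm.allocate("wordcount", 2048))  # lands on node-1
print(rm.allocate("wordcount", 4096))  # node-1 still has the most free memory
```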
YARN components (a container corresponds to a machine in the diagram):
• ResourceManager: made up of the Scheduler and the Application Manager.
• Scheduler: performs the scheduling of jobs based on policies and priorities defined by the administrator.
• Application Manager: monitors the health of the Application Master in the cluster and manages failover in case of failures.
• NodeManager: hosts the containers and the Application Master on each machine.
• Application Master: manages the execution of tasks within these containers, monitors their progress, and handles any task failures or reassignments.
• Hive: a data warehousing and SQL-like query language tool that allows analysts to interact with data stored in HDFS using familiar SQL syntax (see the sketch after this list).
• Pig: a high-level scripting platform for processing and analyzing large datasets. It simplifies complex data transformations using a simple scripting language.
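For instance, an analyst could query Hive from Python via the third-party PyHive client; this is a hedged sketch in which the HiveServer2 host, port, user, and table are all hypothetical:

```python
from pyhive import hive  # third-party client: pip install pyhive

# Hypothetical HiveServer2 endpoint and table.
conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()

# Plain SQL-like HiveQL over data that actually lives in HDFS.
cursor.execute("SELECT category, COUNT(*) FROM transactions GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)

conn.close()
```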
• Spark: although not originally part of Hadoop, Apache Spark is often integrated into the ecosystem. It offers an in-memory processing engine for faster data processing, machine learning, and graph processing.
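A minimal PySpark sketch of that in-memory processing; the HDFS path and column name are hypothetical, and on a real cluster the session would typically run on YARN:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster this would be submitted to YARN.
spark = SparkSession.builder.appName("transactions-demo").getOrCreate()

# Read a (hypothetical) CSV from HDFS and aggregate it in memory.
df = spark.read.csv("hdfs:///data/transactions.csv", header=True, inferSchema=True)
df.groupBy("category").count().show()

spark.stop()
```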
• ZooKeeper: in Hadoop, ZooKeeper can be considered a centralized repository where distributed applications can put data and retrieve data. It makes a distributed system work together as a whole through its synchronization, serialization, and coordination features.
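As a small illustration, distributed processes can share coordination state through ZooKeeper with the third-party kazoo client (a sketch; the ensemble address and znode path are hypothetical):

```python
from kazoo.client import KazooClient  # third-party client: pip install kazoo

# Hypothetical ZooKeeper ensemble address.
zk = KazooClient(hosts="zookeeper-1:2181")
zk.start()

# One process publishes a small piece of shared state under a znode...
zk.ensure_path("/app/config")
zk.set("/app/config", b"active-shard=3")

# ...and any other process in the cluster can read it back.
value, stat = zk.get("/app/config")
print(value.decode(), "| version:", stat.version)

zk.stop()
```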
Advantages of Hadoop for Big Data
• Speed. Hadoop's concurrent processing, MapReduce model (sketched below), and HDFS let users run complex queries in just a few seconds.
• Diversity. Hadoop's HDFS can store different data formats: structured, semi-structured, and unstructured.
• Cost-effective. Hadoop is an open-source data framework.
• Resilient. Data stored on a node is replicated to other cluster nodes, ensuring fault tolerance.
• Scalable. Since Hadoop runs in a distributed environment, you can easily add more servers.
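To give a feel for the MapReduce model behind the speed claim (a pure-Python sketch, not Hadoop's actual Java MapReduce API): the map step emits (word, 1) pairs and the reduce step sums the counts per key.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["big data and HDFS", "big data and Hadoop"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'and': 2, ...}
```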
Next we will begin with Hadoop.
Editor's Notes
  • #16 (DFS summary): commodity cluster instead of
  • #17 (big data problems): it breaks down complex tasks into smaller sub-tasks, which are processed in parallel by map and reduce functions.
  • #20 (Hadoop Ecosystem): Apache Sqoop, which is part of CDH, is that tool. Sqoop is a tool used to automatically load our relational data from MySQL into HDFS, while preserving the structure. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data into HDFS.
  • #31 (HDFS architecture): there are two files associated with the metadata. FsImage: contains the complete state of the file system namespace since the start of the NameNode. EditLogs: contains all the recent modifications made to the file system with respect to the most recent FsImage.
  • #40 (ZooKeeper): distributed applications are difficult to coordinate and work with. Because many machines are involved, race conditions and deadlocks are common problems when implementing distributed applications. A race condition occurs when a machine tries to perform two or more operations at once; this can be resolved with ZooKeeper's serialization feature. A deadlock occurs when two or more computers attempt to access the same shared resource simultaneously.