Hadoop Introduction
Apache Hadoop is a Java software framework that allows for the distributed processing
of large data sets across clusters of computers using a simple programming model.
Hadoop Distributed File System
•  Distributed, scalable and reliable
•  Fault-tolerant storage system
Map-Reduce Programming Model
•  High-performance parallel data processing
•  Employs the divide-and-conquer principle
A class teacher of class 5 needs to find the name of the student with the highest marks
in each subject.
Input
Total students : 50
Total subjects : 5
Time to process each subject per student : 1min
Total time spent by one person working alone : 50 x 5 x 1min = 250mins
Our Goal
To minimize the total time spent
Output
Subject 1 : S1-98
Subject 2 : S13-95
Subject 3 : S1-97
Subject 4 : S23-100
Subject 5 : S8-99
a) HDFS: Distribute the data into blocks across multiple nodes
Distribute the papers across 5 peons. Each peon gets the papers of 10 students for
each subject (50 papers each).
b) Map Phase: Apply business logic on the distributed data in parallel
Each peon produces, from his 10 students, a list of subjects with the name and marks
of the top student in each subject.
Total time spent: 50mins (all peons work in parallel)
c) Reduce Phase: Iterate over the map phase output and get the final result
Records left: 5 candidates for each of the 5 subjects (25 records). Time to pick the
student with the highest marks per subject: 25mins
Total time spent: 50 + 25 = 75mins
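The timing argument in the example above can be reproduced with a small simulation. This is only a sketch in plain Python, not Hadoop code: the marks are randomly generated placeholder data, and the helper names (`make_papers`, `split`, `map_phase`, `reduce_phase`) are hypothetical, chosen for illustration.

```python
import random

NUM_STUDENTS = 50
NUM_SUBJECTS = 5
NUM_WORKERS = 5          # the five peons from the example
MINUTES_PER_RECORD = 1   # cost of looking at one (student, subject) paper


def make_papers(seed=0):
    """Generate placeholder marks: one (subject, student, marks) record per paper."""
    rng = random.Random(seed)
    return [
        (f"Subject {sub}", f"S{stu}", rng.randint(0, 100))
        for stu in range(1, NUM_STUDENTS + 1)
        for sub in range(1, NUM_SUBJECTS + 1)
    ]


def split(papers, workers):
    """Distribution step: give each worker the papers of an equal share of students."""
    per_worker = len(papers) // workers
    return [papers[i * per_worker:(i + 1) * per_worker] for i in range(workers)]


def map_phase(chunk):
    """Map: within one worker's chunk, keep the best (student, marks) per subject."""
    best = {}
    for subject, student, marks in chunk:
        if subject not in best or marks > best[subject][1]:
            best[subject] = (student, marks)
    return best


def reduce_phase(map_outputs):
    """Reduce: merge the per-worker winners into one winner per subject."""
    final = {}
    for partial in map_outputs:
        for subject, (student, marks) in partial.items():
            if subject not in final or marks > final[subject][1]:
                final[subject] = (student, marks)
    return final


papers = make_papers()
chunks = split(papers, NUM_WORKERS)
map_outputs = [map_phase(c) for c in chunks]   # on a cluster these run in parallel
result = reduce_phase(map_outputs)

# Time model from the slides: 1 minute per record examined.
sequential_minutes = len(papers) * MINUTES_PER_RECORD                       # 250
map_minutes = max(len(c) for c in chunks) * MINUTES_PER_RECORD              # 50
reduce_minutes = sum(len(m) for m in map_outputs) * MINUTES_PER_RECORD      # 25
parallel_minutes = map_minutes + reduce_minutes                             # 75

for subject in sorted(result):
    student, marks = result[subject]
    print(f"{subject} : {student}-{marks}")
print(f"Sequential: {sequential_minutes}mins, Map-Reduce: {parallel_minutes}mins")
```

The map phase costs only as much as the slowest worker because the chunks are processed side by side, while the reduce phase still has to touch every record the mappers emit.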
Social Media Data
Analyzing Web Clickstream Data
Server Log Data
Machine and Sensor Data
HDFS Layer
Stores files across storage nodes in a Hadoop cluster
Consists of :
•  Namenode & Datanodes
Map-Reduce Engine
Processes vast amounts of data in parallel on large clusters in a
reliable & fault-tolerant manner
Consists of :
•  Job Tracker & Task Trackers
Storage & Replication of Blocks in HDFS
[Figure: an HDFS Client sends a file write request. The file is divided into blocks (Block 1, Block 2, Block 3, Block 4), which are stored as HDFS blocks on Datanode_1, Datanode_2 and Datanode_3 under the Namenode.]
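The splitting and replication shown in the figure can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the round-robin placement in `place_replicas` is a hypothetical stand-in for HDFS's real rack-aware policy, and the 500 MB file is an assumed example. The 128 MB block size is the default in Hadoop 2.x and later; the example uses 2 replicas to keep the output readable, whereas the HDFS default replication factor is 3.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, default block size in Hadoop 2.x and later
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication):
    """Toy placement (an assumption, not HDFS's policy): rotate through the
    datanodes so that the replicas of one block land on distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[f"Block {block + 1}"] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


datanodes = ["Datanode_1", "Datanode_2", "Datanode_3"]
blocks = split_into_blocks(500 * MB)                 # hypothetical 500 MB file
block_map = place_replicas(len(blocks), datanodes, replication=2)

# The block-to-datanode mapping is the kind of metadata the Namenode keeps.
for name, nodes in block_map.items():
    print(name, "->", ", ".join(nodes))
```

Because every block lives on more than one datanode, losing a single node leaves at least one copy of each block available.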
[Figure: a client submits a Map-Reduce job to the Job Tracker. Task Tracker_1, Task Tracker_2 and Task Tracker_3 execute the individual Map-Reduce tasks assigned by the Job Tracker, each working on the HDFS blocks (Block 1 to Block 4) stored on its own machine.]
Task Trackers read their input from the HDFS blocks stored on the Data Node that runs
on the same machine as the Task Tracker.
Each slave machine runs both a Task Tracker and a Data Node.
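The idea of running a task on the machine that already holds its data can be sketched as a small scheduling function. This is a simplified assumption-laden illustration: `assign_tasks` and the host names are hypothetical, and the real Job Tracker schedules in response to Task Tracker heart-beats and also takes free slots and rack locality into account.

```python
def assign_tasks(block_locations, trackers):
    """Assign one map task per block, preferring a tracker that stores the block.

    block_locations: block name -> list of hosts holding a replica
    trackers:        list of hosts running a Task Tracker
    Returns: block name -> (chosen host, True if the read is local)
    """
    load = {t: 0 for t in trackers}
    assignment = {}
    for block, hosts in block_locations.items():
        local = [h for h in hosts if h in load]
        candidates = local or trackers      # no local tracker: fall back to a remote read
        chosen = min(candidates, key=lambda h: load[h])   # least-loaded candidate
        load[chosen] += 1
        assignment[block] = (chosen, chosen in hosts)
    return assignment


# Block layout as in the figure: Block 3 and Block 4 share the third machine.
block_locations = {
    "Block 1": ["node1"],
    "Block 2": ["node2"],
    "Block 3": ["node3"],
    "Block 4": ["node3"],
}
plan = assign_tasks(block_locations, ["node1", "node2", "node3"])
for block, (host, is_local) in plan.items():
    print(block, "->", host, "(local)" if is_local else "(remote)")
```

Moving the computation to the data avoids shipping whole blocks over the network, which is why the Task Tracker and Data Node are co-located on each slave machine.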
Hadoop Daemons
NameNode
•  Maps a block to the Datanodes
•  Controls read/write access to files
•  Manages Replication Engine for Blocks
DataNode
•  Responsible for serving read and write requests (block creation, deletion, and
replication)
JobTracker
•  Accepts Map-Reduce jobs from the clients
•  Assigns tasks to the Task Trackers & monitors their status
TaskTracker
•  Worker daemon, runs Map-Reduce tasks
•  Sends heart-beat to Job Tracker
•  Retrieves Job resources from HDFS
Hadoop Services
•  HDFS
•  MapReduce
•  YARN
YARN stands for “Yet Another Resource Negotiator”, a framework that provides a generic
resource management solution for Hadoop clusters.
Allows easy integration of multiple data processing algorithms with the data stored in
HDFS
Query Language
Pig Scripting
Coordination Service
Columnar Database
Log Management
Data Exchange
Designing Workflow
Machine Learning
Messaging System
a)  Apache Website: http://hadoop.apache.org/
b)  Learning YARN: https://www.packtpub.com/big-data-and-business-intelligence/learning-yarn
c)  Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920033448.do