Hadoop
Distributed File System
(HDFS)
SEMINAR GUIDE
Mr. PRAMOD PAVITHRAN
HEAD OF DIVISION
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING, CUSAT
PRESENTED BY
VIJAY PRATAP SINGH
REG NO: 12110083
S7, CS-B
ROLL NO: 81
CONTENTS
WHAT IS HADOOP
PROJECT COMPONENTS IN HADOOP
MAP/REDUCE
HDFS
ARCHITECTURE
WRITE & READ IN HDFS
GOALS OF HADOOP
COMPARISON WITH OTHER SYSTEMS
CONCLUSION
REFERENCES
WHAT IS HADOOP ?
o Hadoop is an open-source software framework.
o The Hadoop framework consists of two main layers:
● Distributed file system (HDFS)
● Execution engine (MapReduce)
o Supports data-intensive distributed applications.
o Licensed under the Apache v2 license.
o It enables applications to work with thousands of computation-independent computers and petabytes of data.
WHY HADOOP ?
PROJECT COMPONENTS IN HADOOP
MAP/REDUCE
o Hadoop is a popular open-source implementation of MapReduce.
o MapReduce is a programming model for processing large data sets.
o MapReduce is typically used for distributed computing on clusters of computers.
o MapReduce can take advantage of data locality, processing data on or near the storage nodes to reduce data transmission.
o The model is inspired by the map and reduce functions of functional programming.
o "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.
o "Reduce" step: The master node then collects the answers to all the sub-problems and combines them to form the final output.
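The two steps can be illustrated with a word count run in a single Python process. This is only a sketch of the programming model, not Hadoop code: the names `map_step`, `shuffle`, `reduce_step`, and `word_count` are assumptions chosen for this example, and a real job would run the map and reduce work on many nodes.

```python
from collections import defaultdict


def map_step(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # "Map": each chunk is an independent sub-problem.
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    # "Reduce": the grouped intermediate answers are combined into the output.
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data nodes"]))
```

Because each call to `map_step` depends only on its own chunk, the chunks could be handed to different machines without changing the result.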
MAP REDUCE ENGINE
HDFS
Highly scalable file system
◦ 6K nodes and 120PB
◦ Add commodity servers and disks to scale storage and IO bandwidth
Supports parallel reading & processing of data
◦ Optimized for streaming reads/writes of large files
◦ Bandwidth scales linearly with the number of nodes and disks
Fault tolerant & easy management
◦ Built-in redundancy
◦ Tolerates disk and node failure
◦ Automatically manages addition/removal of nodes
◦ One operator per 3K nodes
Scalable, Reliable & Manageable
LIMITATIONS OF EXISTING DATA ANALYTICS ARCHITECTURE
BIG DATA
INCREASING BIG DATA
HADOOP'S APPROACH
ARCHITECTURE OF HADOOP
HADOOP MASTER/SLAVE ARCHITECTURE
ARCHITECTURE OF HDFS
CLIENT INTERACTION TO HADOOP
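HDFS WRITE
[Diagram: a client, the Name Node, and DataNodes 1, 7 and 9 in racks labelled Rack 1 and Rack 5, with two rack switches under a core switch. Blocks A, B and C are shown.]
o Client to Name Node: "I want to write file.txt, Block A"
o Name Node to client: "OK, write to DataNodes [1, 7, 9]"
o Rack awareness table on the Name Node: Rack 1: DN 1; Rack 2: DN 7, 9
o Pipeline setup messages: "Ready DN 7, 9", then "Ready DN 9", then "Ready"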
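The Name Node's answer on this slide (DataNodes 1, 7 and 9, spread over two racks) can be mimicked with a small sketch. The function `choose_pipeline` and the `rack_map` table are hypothetical stand-ins for the rack-awareness data. The rule used here, one replica in the first rack and two in one other rack, is a simplifying assumption for illustration and not HDFS's actual placement code.

```python
def choose_pipeline(rack_map, first_rack):
    """Pick three DataNodes: one in first_rack and two in one other rack.

    rack_map is a hypothetical {rack_name: [datanode_id, ...]} table that
    stands in for the Name Node's rack-awareness information.
    """
    if not rack_map.get(first_rack):
        raise ValueError(f"no DataNode available in {first_rack!r}")
    first = rack_map[first_rack][0]
    for rack, nodes in rack_map.items():
        # Look for a different rack that can hold the other two replicas.
        if rack != first_rack and len(nodes) >= 2:
            return [first, nodes[0], nodes[1]]
    raise ValueError("no second rack with two DataNodes available")


racks = {"Rack 1": [1], "Rack 2": [7, 9]}
pipeline = choose_pipeline(racks, "Rack 1")
print(pipeline)
```

Spreading the replicas over two racks means the block survives the loss of a whole rack, while two of the three copies still travel over a single rack switch.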
PIPELINED WRITE
[Diagram: the same client, Name Node, racks and switches as on the HDFS WRITE slide. Block A is shown on each of DataNodes 1, 7 and 9.]
o Block A is passed along the pipeline until all three DataNodes hold a copy
o Messages shown: "Block Received", "Success"
o Name Node metadata: File.txt = Block A: DN 1, 7, 9
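The pipelined write can be sketched as a toy simulation: the block is copied from one DataNode to the next, and the Name Node metadata is updated only once every node in the pipeline holds it. The function `pipelined_write` and the dictionaries `datanodes` and `namenode_meta` are assumptions made for this example. Real HDFS streams a block as packets over the network, which this sketch does not model.

```python
def pipelined_write(block_id, data, pipeline, datanodes, namenode_meta):
    """Copy one block along `pipeline` and record its locations on success."""
    # Forward hop by hop: client -> first DataNode -> second -> third.
    for dn in pipeline:
        datanodes.setdefault(dn, {})[block_id] = data
    # Acknowledgements travel back in reverse order; every node must confirm.
    acks = [dn for dn in reversed(pipeline) if datanodes[dn].get(block_id) == data]
    success = len(acks) == len(pipeline)
    if success:
        # "Block Received": the Name Node now knows where the block lives.
        namenode_meta[block_id] = list(pipeline)
    return success


datanodes, namenode_meta = {}, {}
ok = pipelined_write("A", b"block A bytes", [1, 7, 9], datanodes, namenode_meta)
print(ok, namenode_meta)
```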
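HDFS READ
[Diagram: a client, the Name Node, and DataNodes 1, 7 and 9 in racks labelled Rack 1 and Rack 5, with two rack switches under a core switch. Block A is shown on each DataNode.]
o Client to Name Node: "I want to read file.txt, Block A"
o Name Node to client: "Available at DataNodes [1, 7, 9]"
o Name Node metadata: File.txt = Block A: DN 1, 7, 9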
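The read path can be sketched in the same toy style: the client looks up the block's locations in the Name Node metadata and reads from the first replica it can reach, moving on to the next one if a DataNode is unavailable. The function `read_block` and the `unreachable` parameter are hypothetical names introduced for this illustration.

```python
def read_block(block_id, namenode_meta, datanodes, unreachable=frozenset()):
    """Return (datanode_id, data) from the first reachable replica of a block."""
    for dn in namenode_meta[block_id]:  # locations supplied by the Name Node
        if dn in unreachable:
            continue  # try the next replica instead
        data = datanodes.get(dn, {}).get(block_id)
        if data is not None:
            return dn, data
    raise IOError(f"no reachable replica for block {block_id!r}")


namenode_meta = {"A": [1, 7, 9]}
datanodes = {dn: {"A": b"block A bytes"} for dn in (1, 7, 9)}
print(read_block("A", namenode_meta, datanodes))
print(read_block("A", namenode_meta, datanodes, unreachable={1}))
```

Only the location lookup involves the Name Node. The block data itself comes straight from a DataNode.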
HDFS SHELL COMMANDS
● bin/hadoop fs -ls
● bin/hadoop fs -mkdir
● bin/hadoop fs -copyFromLocal
● bin/hadoop fs -copyToLocal
● bin/hadoop fs -moveToLocal
● bin/hadoop fs -rm
● bin/hadoop fs -tail
● bin/hadoop fs -chmod
● bin/hadoop fs -setrep -R -w 4 /dir1/s-dir/
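For scripting, the shell commands above can be wrapped in a thin Python helper. This is a minimal sketch that assumes a Hadoop installation with the `bin/hadoop` launcher used on the slide. The helper names `fs_command` and `run_fs` and the example file paths are hypothetical. The options of `-setrep` are placed before the replication factor and path, following the documented usage `-setrep [-R] [-w] <numReplicas> <path>`.

```python
import subprocess

HADOOP_BIN = "bin/hadoop"  # launcher path from the slide; adjust for the local install


def fs_command(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return [HADOOP_BIN, "fs", *map(str, args)]


def run_fs(*args):
    """Run the command and return its standard output (needs a Hadoop install)."""
    result = subprocess.run(fs_command(*args), check=True,
                            capture_output=True, text=True)
    return result.stdout


# Hypothetical paths, for illustration only; nothing is executed here.
upload = fs_command("-copyFromLocal", "local.txt", "/user/demo/local.txt")
set_replication = fs_command("-setrep", "-R", "-w", 4, "/dir1/s-dir/")
print(" ".join(set_replication))
```

Building the argument list separately from running it keeps the commands easy to inspect or log before they touch the cluster.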
GOALS OF HDFS
Very Large Distributed File System
◦ 10K nodes, 100 million files, 10PB
Assumes Commodity Hardware
◦ Files are replicated to handle hardware failure
◦ Detect failures and recover from them
Optimized for Batch Processing
◦ Data locations exposed so that computations can move to where data resides
◦ Provides very high aggregate bandwidth
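The idea of moving computation to where the data resides can be sketched as a small scheduling rule: given the DataNodes that hold a block and the nodes currently free, prefer a free node that already stores the block. The function `pick_node` is a hypothetical helper for illustration and is not how Hadoop's scheduler is actually implemented.

```python
def pick_node(block_locations, free_nodes):
    """Return (node, is_local): prefer a free node that already stores the block."""
    for node in block_locations:
        if node in free_nodes:
            return node, True  # the task runs next to its data, no transfer needed
    if not free_nodes:
        raise ValueError("no free node available")
    # Fall back to any free node; the block must then be read over the network.
    return min(free_nodes), False


print(pick_node([1, 7, 9], {7, 12}))
print(pick_node([1, 7, 9], {3, 12}))
```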
SCALABILITY OF HADOOP
EASE TO PROGRAMMERS
HADOOP VS. OTHER SYSTEMS
HADOOP USERS
TO LEARN MORE
Source code
◦ http://hadoop.apache.org/version_control.html
◦ http://svn.apache.org/viewvc/hadoop/common/trunk/
Hadoop releases
◦ http://hadoop.apache.org/releases.html
Contribute to it
◦ http://wiki.apache.org/hadoop/HowToContribute
CONCLUSION
HDFS provides a reliable, scalable, and manageable solution for working with huge amounts of data
Future secure
HDFS has been deployed in clusters of 10 to 4K DataNodes
◦ Used in production at companies such as Yahoo!, Facebook, Twitter, and eBay
◦ Many enterprises, including financial companies, use Hadoop
REFERENCES
[1] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 23–34, 2007.
[2] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Third Edition, May 2012.
[3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox. The Hadoop Distributed Filesystem: Balancing Portability and Performance. Rice University, Houston, TX.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. Yahoo!, Sunnyvale, California, USA.
[5] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. Efficient Big Data Processing in Hadoop MapReduce. Information Systems Group, Saarland University.
Thank you.
Queries
