Hadoop Architecture
Presented by :
Yojana Nanaware
ME(CSE-I)
Agenda
• What is Hadoop?
• Why, When, Where?
• Hadoop : How?
• Hadoop Architecture
• Hadoop Common
• HDFS
• Hadoop Map/Reduce
• Process
• Hadoop Community
• Conclusion
• References
What is Hadoop?
• A SMART WAY TO STORE & ANALYZE
DATA
• Created by Douglas Reed Cutting, who also
originated the open-source projects Lucene
and Nutch
• Open-source project administered by
Apache Software Foundation. Hadoop
consists of two key services:
What is Hadoop?
– Hadoop Distributed File System (HDFS).
– Map/Reduce .
• Hadoop runs large-scale, high-performance
processing jobs reliably, in spite of system
changes or failures
Why Hadoop?
• Need to process 100TB datasets
• On 1 node :
– Scanning @ 50 MB/s = 23 days
– MTBF = 3 years
• On a 1000-node cluster :
– Scanning @ 50 MB/s per node = 33 mins
– MTBF = 1 day
• Need an efficient, reliable & usable
framework
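The figures above can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes), an even split of the data across nodes, and independent node failures; the function names are illustrative only.

```python
# Back-of-the-envelope check of the scan-time and MTBF figures above.
# Assumptions: decimal units, data split evenly across nodes,
# and node failures that are independent of each other.

DATASET_BYTES = 100 * 10**12      # 100 TB
SCAN_RATE = 50 * 10**6            # 50 MB/s per node
NODE_MTBF_DAYS = 3 * 365          # 3 years per node


def scan_seconds(nodes: int) -> float:
    """Time to scan the dataset when it is split evenly across nodes."""
    return DATASET_BYTES / (SCAN_RATE * nodes)


def cluster_mtbf_days(nodes: int) -> float:
    """Rough expected time until some node in the cluster fails."""
    return NODE_MTBF_DAYS / nodes


single_days = scan_seconds(1) / 86_400
cluster_minutes = scan_seconds(1000) / 60

print(f"1 node:     {single_days:.1f} days, MTBF {cluster_mtbf_days(1):.0f} days")
print(f"1000 nodes: {cluster_minutes:.1f} min, MTBF {cluster_mtbf_days(1000):.1f} days")
```

The point of the comparison: a 1000-node cluster cuts the scan from weeks to minutes, but some node fails roughly every day, so the framework has to treat failure as routine.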
Where & When?
• Where
– Batch Data
Processing, not
real-time/ user
facing
– Highly parallel data
intensive distributed
application
– Very large
production
deployments
• When
– Process lots of
unstructured data
– When your processing
can easily be made
parallel
– Running batch jobs is
acceptable
– When you have
access to lots of cheap
hardware
Hadoop : How?
• Commodity hardware cluster
• Distributed File System
– Modeled on GFS
• Distributed Processing Framework
– Using Map/Reduce metaphor
• Open source, written in Java
– Began as an Apache Lucene subproject
Hadoop Architecture
Hadoop consists of :
•Hadoop Common
– Supports the other Hadoop subprojects
•HDFS
– Provides high-throughput access to application
data
•MapReduce
– Processes large data sets on compute clusters
Hadoop Common
• It is a set of utilities
• Includes File system, RPC, & Serialization
libraries
HDFS
• Primary storage system
• Creates multiple replicas of data blocks &
distributes them on compute nodes
throughout a cluster to enable reliable,
extremely rapid computations.
• Replication & locality
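A toy model can make replication and locality concrete. The sketch below is not HDFS's actual placement policy: the block size, replication factor of 3, node names, and simple rotating placement are all assumptions chosen for illustration.

```python
# Toy model of HDFS-style block replication and data locality.
# Assumptions: rotating placement, made-up node names and sizes.
# Real HDFS placement also considers racks and node load.


def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after a node fails."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


def local_node_for(placement, block_id, busy=()):
    """Pick a node that already holds the block, so the task runs next to its data."""
    for node in placement[block_id]:
        if node not in busy:
            return node
    return None  # no local node is free; a real scheduler would read remotely


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=300, block_size=64, nodes=nodes)
```

With three replicas per block, losing any single node leaves every block readable, and a task for a given block can usually be scheduled on a machine that already stores it.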
HDFS Architecture
Hadoop MapReduce
• The Map/Reduce programming model
– Framework
– Pluggable user code
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for
– log processing
– web search indexing
– Ad-hoc queries
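The correspondence between the Unix pipeline and the Map/Reduce stages can be sketched in memory. The sample log lines and the "ERROR" pattern below are made up for illustration, and the three function names are hypothetical.

```python
# In-memory analogue of `grep | sort | uniq -c`, phrased as map | shuffle | reduce.
# The log lines and the search pattern are invented sample data.

from itertools import groupby


def map_phase(lines, pattern):
    """Like grep: keep matching lines, emitting each one as a key."""
    return [line for line in lines if pattern in line]


def shuffle_phase(keys):
    """Like sort: bring identical keys next to each other."""
    return sorted(keys)


def reduce_phase(sorted_keys):
    """Like uniq -c: count each run of identical keys."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted_keys)]


log = [
    "ERROR disk full",
    "INFO job started",
    "ERROR disk full",
    "ERROR timeout",
]
counts = reduce_phase(shuffle_phase(map_phase(log, "ERROR")))
print(counts)
```

The sort in the middle is what makes the final counting step cheap: once equal keys are adjacent, the reducer only needs a single pass.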
Map/Reduce Implementation
1. Input files are split
2. Assign master &
workers
3. Map tasks
4. Writing intermediate
data to disk
5. Intermediate data
read & sort
6. Reduce tasks
7. Return results
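The seven steps above can be walked through in a single process. This is only a sketch: the split size, number of reducers, hash partitioning, and the in-memory lists standing in for disk are assumptions, and `run_job` is a hypothetical helper, not a Hadoop API.

```python
# Single-process walk-through of the seven steps above.
# Assumptions: in-memory lists stand in for intermediate files on disk,
# and keys are routed to reducers by hash. On a real cluster the map and
# reduce tasks would run on separate worker machines.

from collections import defaultdict


def run_job(records, map_fn, reduce_fn, split_size=2, num_reducers=2):
    # 1. Split the input.
    splits = [records[i:i + split_size] for i in range(0, len(records), split_size)]

    # 2. The master hands out one map task per split and one reduce task
    #    per partition (here, simply loop iterations).
    # 3-4. Each map task emits (key, value) pairs and writes them to the
    #      partition chosen by the key.
    intermediate = [[] for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                intermediate[hash(key) % num_reducers].append((key, value))

    # 5. Each reducer reads its partition and sorts it so equal keys are adjacent.
    # 6. The reduce function runs once per key.
    output = {}
    for partition in intermediate:
        grouped = defaultdict(list)
        for key, value in sorted(partition):
            grouped[key].append(value)
        for key, values in grouped.items():
            output[key] = reduce_fn(key, values)

    # 7. Return the combined result.
    return output


# Usage: highest reading per city (sample data invented for this sketch).
readings = ["pune,31", "delhi,40", "pune,35", "delhi,38", "mumbai,33"]
hottest = run_job(
    readings,
    map_fn=lambda rec: [(rec.split(",")[0], int(rec.split(",")[1]))],
    reduce_fn=lambda city, temps: max(temps),
)
print(hottest)
```

Because every pair with the same key lands in the same partition, each reducer can work on its keys without talking to the others.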
Example of Map/Reduce word count
• Read text files & count how often words
occur.
– The input is text files
– The output is a text file
• Each line : word, tab, count
• Map – Produce pairs of (word, count)
• Reduce – For each word, sum up the
counts
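The word count example can be written out as a small, local program. This sketch runs in one process rather than on Hadoop; splitting on whitespace and lower-casing words are assumptions, and the function names are illustrative.

```python
# Word count in the map/reduce style described above.
# Assumptions: words are separated by whitespace and compared case-insensitively.

from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Return output lines in 'word<TAB>count' format, sorted by word."""
    pairs = [pair for line in lines for pair in map_words(line)]
    totals = reduce_counts(pairs)
    return [f"{word}\t{count}" for word, count in sorted(totals.items())]


result = word_count(["the quick brown fox", "the lazy dog", "The end"])
print("\n".join(result))
```

Each output line follows the format given above: the word, a tab, then its count.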
Process
• Installation
– Requirements : Linux,
Java 1.6, sshd, rsync
– Configure SSH for
passwordless
authentication
– Unpack Hadoop
distribution
– Edit a few configuration
files
– Format the DFS on the
name node
– Start all the daemon
processes
• Execution
– Compile your job into a
jar file
– Copy input data into the
HDFS
– Execute bin/hadoop jar
with relevant arguments
– Monitor tasks via the web
interface (optional)
– Examine output when
job is complete
Hadoop Community
• Hadoop Users
– Adobe
– Alibaba
– Amazon
– AOL
– Facebook
– Google
– IBM
• Major Contributors
– Apache
– Cloudera
– Yahoo
Conclusion
• Designed to run on cheap commodity
hardware
• Handles data replication & node failure
• Cost saving & efficient & reliable data
processing
References
• http://www.newyorksys.com/hadoop-online-training
• Hadoop on Wikipedia
(http://en.wikipedia.org/wiki/Hadoop)
• http://hadoop.apache.org/core/docs/current/api/

