Hadoop Fundamentals
Satish Mittal
InMobi
Why Hadoop?
Big Data
• Sources: Server logs, clickstream, machine, sensor, social…
• Use-cases: batch/interactive/real-time
Scalable
o Petabytes of data
Economical
o Use commodity hardware
o Share clusters among many applications
Reliable
o Failure is common when you run thousands of machines. Handle it well in
the SW layer.
Simple programming model
o Applications must be simple to write and maintain
What is needed from a Distributed Platform?
Hadoop is a petabyte-scale distributed data storage and data
processing infrastructure
 Based on Google's GFS and MapReduce papers
 Contributed mostly by Yahoo! in the initial years; it now has a
much more widespread developer and user base
 1000s of nodes, PBs of data in storage
What is Hadoop?
• Cheap JBODs for storage
• Move processing to where data is
Location awareness (topology)
• Assume hardware failures to be the norm
• Map & Reduce primitives are fairly simple yet powerful
Most set operations can be performed using these primitives
• Isolation
Hadoop Basics
Hadoop Distributed File System
(HDFS)
Goals:
 Fault tolerant, scalable, distributed storage system
 Designed to reliably store very large files across machines in a
large cluster
Assumptions:
 Files are written once and read several times
 Applications perform large sequential streaming reads
 Not a POSIX-compliant, Unix-like file system
 Access via command line or Java API
HDFS
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across
cluster nodes
• Blocks are replicated to handle hardware failure
• Filesystem keeps checksums of data for corruption detection and
recovery
• HDFS exposes block placement so that computation can be migrated
to the data
HDFS – Data Model
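The block arithmetic behind this data model can be sketched in a few lines of plain Python (the 64 MB block size and the `dfs.block.size` property name are Hadoop 1.x defaults; nothing here touches a real cluster):

```python
import math

def split_into_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks a file of the given size occupies.

    64 MB was the Hadoop 1.x default block size (dfs.block.size);
    the last block may be smaller than the block size.
    """
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / block_size_bytes)

# A 1 GB file with 64 MB blocks occupies 16 blocks.
print(split_into_blocks(1024 * 1024 * 1024))  # 16
```

Each of those blocks is then stored (and replicated) independently across the datanodes.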
HDFS - Architecture
• Namenode is a single point of failure (HA for the NN became
available in 2.0 Alpha)
• Responsible for managing the list of active data nodes and the
FS namespace (files, directories, blocks and their
locations)
• Block placement policy
• Ensuring adequate replicas
• Writing edit logs durably
Namenode
• Service to allow data to be streamed in & out
• The block is the unit of data that a datanode understands
• Sends block reports to the Namenode periodically
• Checksum checks, disk usage stats are managed by datanode
• Clients talk to datanode for actual data
• As long as there is at least one data node available to service file
blocks, failures in datanodes can be tolerated, albeit at lower
performance.
Datanode
HDFS – Write pipeline
[Diagram: DFS Client, Namenode, and Data nodes 1–3 spread across Rack 1
and Rack 2]
• Client asks the Namenode to create the file and allocate a block (1);
the Namenode returns a pipeline of datanodes, DN 1, 2 & 3 (2)
• Client streams the data to the first datanode, which forwards it down
the pipeline; acks flow back up the pipeline to the client (3a, 4a, 5a)
• Client asks the Namenode to complete the file once the replicas have
acknowledged (3b)
• Default is 3 replicas, but configurable
• Blocks are placed (writes are pipelined):
First replica on the writer's node
Second replica on a node in a different rack
Third replica on another node in that same remote rack
• Clients read from closest replica
• If the replication for a block drops below target, it is
automatically re-replicated.
HDFS – Block placement
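The cluster-wide default replication factor is set in hdfs-site.xml; a minimal fragment (the `dfs.replication` property is standard, and the value shown is simply the usual default, overridable per file):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- cluster-wide default; can be overridden per file -->
    <value>3</value>
  </property>
</configuration>
```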
• Data is checked with CRC32
• File Creation
‣ Client computes checksum per block
‣ DataNode stores the checksum
• File access
‣ Client retrieves the data and checksum from DataNode
‣ If validation fails, the client tries other replicas
HDFS – Data correctness
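The same idea can be illustrated with Python's built-in CRC32 (a single-machine sketch only: real HDFS checksums fixed-size chunks of each block rather than whole blocks, and `read_with_validation` is a hypothetical helper, not an HDFS API):

```python
import zlib

def checksum(data: bytes) -> int:
    # CRC32, as HDFS uses for data integrity; HDFS actually
    # checksums fixed-size chunks of each block, not whole blocks.
    return zlib.crc32(data)

def read_with_validation(replicas, expected_crc):
    """Try replicas in order; return the first whose checksum matches."""
    for data in replicas:
        if checksum(data) == expected_crc:
            return data
    raise IOError("all replicas failed checksum validation")

good = b"hello hdfs"
corrupt = b"hellX hdfs"  # simulated bit rot on one replica
crc = checksum(good)
# The corrupt replica is skipped; the good one is returned.
assert read_with_validation([corrupt, good], crc) == good
```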
Simple commands
• hadoop fs -ls, -du, -rm, -rmr, -chown, -chmod
Uploading files
• hadoop fs -put foo mydata/foo
• cat ReallyBigFile | hadoop fs -put - mydata/ReallyBigFile
Downloading files
• hadoop fs -get mydata/foo foo
• hadoop fs -cat mydata/ReallyBigFile | grep "the answer is"
• hadoop fs -cat mydata/foo
Admin
• hadoop dfsadmin -report
• hadoop fsck
Interacting with HDFS
Map-Reduce
Say we have 100s of machines available to us. How do we write
applications on them?
As an example, consider the problem of creating an index for search.
‣ Input: Hundreds of documents
‣ Output: A mapping of word to document IDs
‣ Resources: A few machines
Map-Reduce Application
The problem: Inverted Index
Input (e.g. document 1): "Farmer1 has the following animals:
bees, cows, goats. Some other animals …"
Output (word ➝ document IDs):
Animals: 1, 2, 3, 4, 12
Bees: 1, 2, 23, 34
Dog: 3, 9
Farmer1: 1, 7
…
Building an inverted index
Map phase (each machine builds a partial index):
Machine1: Animals: 1, 3 | Dog: 3
Machine2: Animals: 2, 12 | Bees: 23
Machine3: Dog: 9 | Farmer1: 7
Shuffle (partial postings grouped by word):
Machine4: Animals: 1, 3 | Animals: 2, 12 | Bees: 23
Machine5: Dog: 3 | Dog: 9 | Farmer1: 7
Reduce (postings merged per word):
Machine4: Animals: 1, 2, 3, 12 | Bees: 23
Machine5: Dog: 3, 9 | Farmer1: 7
In our example
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, “doc1, doc3, …”)]
General form:
‣ Two functions: Map and Reduce
‣ Operate on key and value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ Primitives present in Lisp and other functional languages
Same principle extended to distributed computing
‣ Map and Reduce tasks run on distributed sets of machines
This is Map-Reduce
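Those signatures can be simulated on a single machine in plain Python, with the shuffle done by a dictionary (`map_fn` and `run_mapreduce` are illustrative names; no Hadoop is involved):

```python
from collections import defaultdict

def map_fn(doc_num, text):
    # Map: (K1, V1) -> list(K2, V2): emit (word, doc-num) per word
    return [(word.lower() for word in [w])[0] if False else (word.lower(), doc_num)
            for word in text.split()]

def reduce_fn(word, doc_nums):
    # Reduce: (K2, list(V2)) -> (K3, V3): merge postings for one word
    return (word, sorted(set(doc_nums)))

def run_mapreduce(docs):
    # Shuffle: group all mapped pairs by key, as the framework would
    grouped = defaultdict(list)
    for doc_num, text in docs.items():
        for word, d in map_fn(doc_num, text):
            grouped[word].append(d)
    return dict(reduce_fn(w, ds) for w, ds in grouped.items())

index = run_mapreduce({1: "bees cows", 3: "dog", 9: "dog"})
# index == {"bees": [1], "cows": [1], "dog": [3, 9]}
```

The framework's real value is running `map_fn` and `reduce_fn` on many machines at once while the shuffle moves data between them.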
Abstracts functionality common to all Map/Reduce applications
‣ Distribute tasks to multiple machines
‣ Sorts, transfers and merges the intermediate data produced by the Map phase
on all machines and delivers it to the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines, faulty tasks transparently
Provides pluggable APIs and configuration mechanisms for writing applications
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc…
Provides status about jobs to users
Map-Reduce Framework
MR – Architecture
[Diagram: the Job Client submits a job to the Job Tracker; Task
Trackers send heartbeats to the Job Tracker and receive task
assignments in return, reporting progress back; map output is
shuffled between Task Trackers; tasks read and write HDFS through
DFS clients]
• All user code runs in isolated JVM
• Client computes splits
• JT just schedules these splits (one mapper per split)
• Mapper, Reducer, Partitioner, Combiner and any custom
Input/OutputFormat run in the user JVM
• Idempotence: tasks can simply be re-run on failure
Map-Reduce
Hadoop HDFS + MR cluster
Machines with Datanodes (D) and Tasktrackers (T) co-located on each
worker node
[Diagram: JobTracker and Namenode run as the masters; the client
submits jobs to the JobTracker and gets block locations from the
Namenode; an HTTP monitoring UI sits on top]
• Input: A bunch of large text files
• Desired Output: Frequencies of Words
WordCount: Hello World of Hadoop
Hadoop – Two services in one
Mapper
‣ Input: value: lines of text of input
‣ Output: key: word, value: 1
Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum
Launching program
‣ Defines the job
‣ Submits job to cluster
Word Count Example
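The data flow above can be sketched on a single machine in plain Python (a real Hadoop job would implement the Mapper and Reducer interfaces, typically in Java; this only mirrors the flow):

```python
from collections import defaultdict

def mapper(line):
    # Output per word: key = word, value = 1
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Output: key = word, value = sum of counts
    return word, sum(counts)

def word_count(lines):
    shuffled = defaultdict(list)        # stands in for sort/shuffle
    for line in lines:
        for word, one in mapper(line):
            shuffled[word].append(one)
    return dict(reducer(w, c) for w, c in shuffled.items())

print(word_count(["hello hadoop", "hello world"]))
# {'hello': 2, 'hadoop': 1, 'world': 1}
```

The launching program's job is everything this sketch hides: packaging the code, defining input/output paths and formats, and submitting the job to the cluster.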
Questions ?
Thank You!
mailto: satish.mittal@inmobi.com