Hadoop/MapReduce/HDFS
Team:
Wasnaa AL-Mawee
Praveen Bhat
Class: CS6550
Department of Computer Science
Western Michigan University
Introduction
• We live in the data age
  - Facebook: 1.01 billion daily active users
  - New York Stock Exchange: about 1 terabyte of new trade data per day
  - Internet Archive: stores approximately 2 petabytes
[Figure: sources of data: enterprise, social media, sensor, public, transaction]
Introduction
• Characteristics of data
  - Humongous in volume
  - Structured, semi-structured, and unstructured
  - Growing faster than one can imagine
• We call it Big Data!
[Figure: Big Data characterized by volume, variety, and velocity]
What is the problem?
Year   Drive capacity   Transfer speed
1990   1,370 MB         4.4 MB/s
2010   1 TB             100 MB/s
2013   4 TB             146 MB/s
• Reading a whole drive takes far longer, because capacity has grown much faster than transfer speed.
• Traditional data storage mechanisms are insufficient.
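The figures above can be turned into a rough read-time estimate. The sketch below is only an illustration: it assumes a purely sequential read, treats 1 terabyte as 1,000,000 MB, and the function names and the 100-drive example are made up for this note rather than taken from the slides.

```python
# Time to read an entire drive at its sequential transfer speed,
# using the capacity and speed figures quoted on the slide.
# Assumption: 1 terabyte is treated as 1,000,000 MB (decimal units).
DRIVES = {
    1990: (1_370, 4.4),        # (capacity in MB, transfer speed in MB/s)
    2010: (1_000_000, 100.0),
    2013: (4_000_000, 146.0),
}


def full_read_seconds(capacity_mb, speed_mb_per_s):
    """Seconds needed to stream the whole drive once."""
    return capacity_mb / speed_mb_per_s


def parallel_read_seconds(capacity_mb, speed_mb_per_s, drives):
    """Same data split evenly over `drives` disks that are read in parallel."""
    return full_read_seconds(capacity_mb, speed_mb_per_s) / drives


if __name__ == "__main__":
    for year, (capacity, speed) in DRIVES.items():
        minutes = full_read_seconds(capacity, speed) / 60
        print(f"{year}: about {minutes:,.1f} minutes for one full read")
    # Hypothetical cluster: the 2013 data spread over 100 drives.
    spread = parallel_read_seconds(4_000_000, 146.0, 100)
    print(f"2013 data over 100 drives: about {spread / 60:,.1f} minutes")
```

Under these assumptions a full read grows from roughly five minutes in 1990 to several hours in 2013, which is the motivation for spreading data over many drives and reading them in parallel.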
What do we do?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.”
—Grace Hopper, Computer Scientist
• Create a cluster of systems
• Store data in clustered systems
• Process data sets independently of one another
Hadoop
Hadoop is a framework for running applications on large clusters built from commodity hardware.
In other words, a reliable shared storage and analysis system.
Hadoop Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Journey of Hadoop
2002: Started by Doug Cutting and Mike Cafarella as a text search library
2003: Google's distributed file system paper published
2006: Yahoo hired Doug Cutting and supported Hadoop
2008: Yahoo announced that its search index was generated by a 10,000-core Hadoop cluster
2009: Won the minute sort by sorting 500 GB in 59 seconds
2013: More than half of the Fortune 50 use Hadoop
Current projects under Apache Hadoop
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• Spark
• Tez
• ZooKeeper
Hadoop Distributed File System (HDFS)
• A file system that manages storage across a network of machines
• Designed to handle
  - Very large files: terabytes, petabytes
  - Streaming data access: write once, read many times
  - Commodity hardware: commonly available hardware
Namenodes and Datanodes
• Two types of nodes operating in a master-worker pattern
• Namenode
  - Master node
  - Manages the filesystem namespace
  - Maintains metadata for all the files and directories in the tree
• Datanode
  - Workhorses of the file system
  - Store and retrieve blocks when told to by a client or the Namenode
  - Periodically report to the Namenode
HDFS Architecture
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Client reading files from HDFS
[Figure: the client asks the Name Node for block locations; Data Nodes 1, 2, 5, 6, 8, and 9 sit behind network switches and hold the replicas of blocks A, B, and C]
Client to Name Node: "Tell me the block locations of results.txt"
Name Node metadata for results.txt:
  Blk A: DN1, DN5, DN6
  Blk B: DN8, DN1, DN2
  Blk C: DN5, DN8, DN9
• Client receives a Data Node list for each block
• Picks the first Data Node for each block
• Reads blocks sequentially
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Client-Read-from-HDFS.PNG
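The read steps above can be sketched as a toy in-memory model. This is not the Hadoop client API: `NameNode`, `DataNode`, and `read_file` are hypothetical names invented for this illustration, the block contents are made up, and only the block placement is copied from the slide.

```python
# Toy, in-memory model of the HDFS read path described on the slide.
# All class and function names are hypothetical; this is not Hadoop code.

class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}                  # block name -> bytes

    def read_block(self, block):
        return self.blocks[block]


class NameNode:
    """Holds metadata only: file name -> ordered (block, Data Node ids)."""

    def __init__(self):
        self.block_map = {}

    def block_locations(self, filename):
        return self.block_map[filename]


def read_file(filename, name_node, data_nodes):
    parts = []
    # 1. Ask the Name Node for the Data Node list of every block.
    for block, locations in name_node.block_locations(filename):
        first = locations[0]              # 2. Pick the first Data Node.
        # 3. Read the blocks one after another, directly from Data Nodes.
        parts.append(data_nodes[first].read_block(block))
    return b"".join(parts)


# Block placement copied from the slide's metadata.
name_node = NameNode()
name_node.block_map["results.txt"] = [
    ("A", [1, 5, 6]),
    ("B", [8, 1, 2]),
    ("C", [5, 8, 9]),
]

contents = {"A": b"alpha ", "B": b"beta ", "C": b"gamma"}   # made-up data
data_nodes = {i: DataNode(i) for i in (1, 2, 5, 6, 8, 9)}
for block, locations in name_node.block_map["results.txt"]:
    for node_id in locations:
        data_nodes[node_id].blocks[block] = contents[block]

print(read_file("results.txt", name_node, data_nodes))
```

The point of the design is that the Name Node only answers the metadata question; the file bytes travel directly between the client and the Data Nodes.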
Writing files to HDFS
[Figure: the client splits file.txt into blocks A, B, and C; the cluster contains Data Nodes 1, 5, 6, ... N]
Client to Name Node: "I want to write blocks A, B, C of file.txt"
Name Node to client: "OK. Write to Data Nodes 1, 5, 6"
• Client consults the Name Node
• Client writes the block directly to one Data Node
• That Data Node replicates the block
• Cycle repeats for the next block
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Writing-Files-to-HDFS.PNG
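A minimal sketch of the write cycle above, again as a toy model rather than real Hadoop code. The class names, the replication factor of 3, and the round-robin target selection are assumptions made for this example; the actual Name Node uses its own placement policy.

```python
# Toy model of the HDFS write path: consult the Name Node, write to one
# Data Node, let the Data Nodes replicate, repeat for the next block.
# Names and the round-robin placement are assumptions for illustration.

REPLICATION = 3


class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

    def write_block(self, block, data, pipeline):
        """Store the block, then forward it to the next node in the pipeline."""
        self.blocks[block] = data
        if pipeline:
            pipeline[0].write_block(block, data, pipeline[1:])


class NameNode:
    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.block_map = {}               # (filename, block) -> Data Node ids
        self._next = 0

    def allocate(self, filename, block):
        """Choose target Data Nodes for one block (simple round-robin)."""
        count = len(self.node_ids)
        targets = [self.node_ids[(self._next + i) % count]
                   for i in range(REPLICATION)]
        self._next += 1
        self.block_map[(filename, block)] = targets
        return targets


def write_file(filename, blocks, name_node, data_nodes):
    for block, data in blocks.items():
        targets = name_node.allocate(filename, block)   # consult Name Node
        first, *rest = [data_nodes[t] for t in targets]
        first.write_block(block, data, rest)            # write to one node only


name_node = NameNode([1, 5, 6, 9])
data_nodes = {i: DataNode(i) for i in (1, 5, 6, 9)}
write_file("file.txt", {"A": b"a", "B": b"b", "C": b"c"}, name_node, data_nodes)
print(name_node.block_map)
```

In this sketch the client hands each block to a single Data Node, and replication happens node to node, so the client's network link carries each block only once.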
What is MapReduce?
• MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
• Published in 2004 by Google engineers Jeffrey Dean and Sanjay Ghemawat.
MapReduce Features
• Large-scale distributed data processing
• Parallel programming
• Simple but restricted
• Load balancing
• Handling of machine failures
When should we use MapReduce?
Query
• Indexing and search, such as building an inverted index
• Classification
• Filtering
Analytics
• Sorting and merging
• Frequency distribution
• Summarization and statistics
• SQL-style queries: GROUP BY, HAVING, etc.
• Generation of graphics
Others
• Message passing, such as the breadth-first search algorithm
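As an illustration of the message-passing use case, here is one possible way to phrase breadth-first search as repeated map and reduce rounds. It runs in a single process, assumes every node appears as a key in the adjacency dictionary, and all names (`map_fn`, `reduce_fn`, `bfs`) are hypothetical choices for this sketch.

```python
from collections import defaultdict

INF = float("inf")


def map_fn(node, record):
    """Map: re-emit the node's own state and send dist+1 to each neighbour."""
    dist, neighbours = record
    out = [(node, ("graph", dist, neighbours))]
    if dist != INF:
        for neighbour in neighbours:
            out.append((neighbour, ("dist", dist + 1)))
    return out


def reduce_fn(node, values):
    """Reduce: keep the adjacency list and the smallest distance seen."""
    best, neighbours = INF, []
    for value in values:
        if value[0] == "graph":
            neighbours = value[2]
        best = min(best, value[1])
    return (best, neighbours)


def bfs(graph, source):
    state = {n: (0 if n == source else INF, adj) for n, adj in graph.items()}
    while True:
        groups = defaultdict(list)          # shuffle: group messages by node
        for node, record in state.items():
            for key, value in map_fn(node, record):
                groups[key].append(value)
        new_state = {n: reduce_fn(n, vals) for n, vals in groups.items()}
        if new_state == state:              # no distance changed: finished
            return {n: dist for n, (dist, _) in new_state.items()}
        state = new_state


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances = bfs(graph, "a")
print(distances)
```

Each round extends the search frontier by one hop, so the number of MapReduce rounds grows with the depth of the graph; that is the main cost of expressing iterative graph algorithms this way.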
MapReduce Inspiration!
- Read massive data
- Map: extract data from each record
  map(in_key, in_value) -> list of (out_key, intermediate_value)
- Shuffle and sort
- Reduce: aggregate, filter, summarize, and transform
  reduce(out_key, list of intermediate_value) -> list of out_value
- Write the result
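The pipeline above can be sketched in a few lines using word counting as the job. This is a single-process illustration of the programming model, not Hadoop's API; `run_mapreduce`, `map_fn`, and `reduce_fn` are names chosen for this example, and the input lines are invented.

```python
from collections import defaultdict


def map_fn(_line_no, line):
    """Map: emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_word, counts):
    """Reduce: sum all intermediate values collected for one key."""
    return [sum(counts)]


def run_mapreduce(records, mapper, reducer):
    # Map phase: every (in_key, in_value) record is processed independently.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(mapper(in_key, in_value))

    # Shuffle and sort: group intermediate values by out_key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce phase: one call per key, visited in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


records = list(enumerate(["the quick brown fox", "the lazy dog", "the fox"]))
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Because each map call sees only one record and each reduce call sees only one key, both phases can in principle be spread across many machines without changing the two user-supplied functions.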
MapReduce Process Architecture
MapReduce Examples
1. Word Counting
2. Inverted indexes
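For the second example, a hedged sketch of an inverted index in map/reduce style: the map step emits (word, document id) pairs and the reduce step collapses them into a sorted posting list. The documents, ids, and function names are invented for illustration, and everything runs in one process.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, doc_id) once per distinct word in a document."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def reduce_fn(_word, doc_ids):
    """Reduce: turn the document ids for one word into a sorted posting list."""
    return sorted(set(doc_ids))


def build_inverted_index(documents):
    groups = defaultdict(list)              # shuffle: group doc ids by word
    for doc_id, text in documents.items():
        for word, value in map_fn(doc_id, text):
            groups[word].append(value)
    return {word: reduce_fn(word, groups[word]) for word in sorted(groups)}


documents = {
    "doc1": "hadoop stores data in hdfs",
    "doc2": "mapreduce processes data",
    "doc3": "hadoop runs mapreduce",
}
index = build_inverted_index(documents)
print(index["hadoop"])
```

Compared with word counting, only the emitted value changes (a document id instead of 1) and the reducer collects instead of summing, which shows how far the same two-function pattern stretches.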
MapReduce Algorithms
1. MapReduce-based disease propagation detection
2. MapReduce-based trading strategies
3. MapReduce-based graph processing
Final Note
• The open source community is taking newer and larger steps
– Spark, Ceph, OpenStack
• Need for better processing
– Batch processing + Streaming
• Time to move on from Hadoop?
