Hadoop/MapReduce/HDFS

Team:
Wasnaa AL-Mawee
Praveen Bhat
Class: CS6550
Department of Computer Science
Western Michigan University

Introduction
• We live in the data age
  - Facebook: 1.01 billion daily active users
  - New York Stock Exchange: about 1 terabyte of new trade data per day
  - Internet Archive: stores approximately 2 petabytes
• Data sources: Enterprise, Social Media, Sensor, Public, Transaction

Introduction
• Characteristics of data
  - Humongous in volume
  - Structured, semi-structured, and unstructured
  - Growing faster than anyone could have imagined
• We call it Big Data: Volume, Variety, Velocity

What is the problem?
Year   Storage drive capacity   Transfer speed
1990   1,370 MB                 4.4 MB/s
2010   1 terabyte               100 MB/s
2013   4 terabytes              146 MB/s
• Capacity has grown much faster than transfer speed, so reading a full drive takes more time.
• Traditional data storage mechanisms are insufficient.
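The gap between capacity and transfer speed can be checked with simple arithmetic. The sketch below uses the figures quoted on this slide and assumes decimal units (1 terabyte = 1,000,000 MB) and a sustained sequential read at the quoted speed. The function name is illustrative only.

```python
# Time to read an entire drive once at its sustained transfer speed.
# Capacities and speeds are the ones quoted on this slide.
# Assumption: 1 terabyte = 1,000,000 MB (decimal units).

DRIVES = {
    1990: (1_370, 4.4),        # (capacity in MB, transfer speed in MB/s)
    2010: (1_000_000, 100.0),
    2013: (4_000_000, 146.0),
}


def full_read_seconds(capacity_mb: float, speed_mb_per_s: float) -> float:
    """Seconds needed to stream the whole drive once."""
    return capacity_mb / speed_mb_per_s


if __name__ == "__main__":
    for year, (capacity_mb, speed) in DRIVES.items():
        seconds = full_read_seconds(capacity_mb, speed)
        print(f"{year}: {seconds / 3600:6.2f} hours to read the full drive")
```

Under these assumptions a full read grows from roughly 5 minutes in 1990 to about 2.8 hours in 2010 and about 7.6 hours in 2013.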

What do we do?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.”
—Grace Hopper, Computer Scientist
• Create a cluster of systems
• Store the data across the clustered systems
• Process partitions of the data independently of one another
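The idea of processing partitions independently can be sketched on a single machine. In this minimal example, threads stand in for cluster nodes and a plain sum stands in for the per-node work. Both are assumptions for illustration, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Cut records into at most `parts` non-overlapping chunks of similar size."""
    size = max(1, -(-len(records) // parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Stand-in for the work one node does on its own chunk: sum the values."""
    return sum(chunk)


def process_in_parallel(records, workers=4):
    """Process every chunk independently, then combine the partial results."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(process_in_parallel(list(range(1000))))
```

No chunk needs data from any other chunk, which is what lets the work spread across many machines.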

Hadoop
Hadoop is a framework for running applications on large clusters built of
commodity hardware.
In other words, a reliable shared storage and analysis system.
Hadoop Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce

Journey of Hadoop
2002: Started by Doug Cutting and Mike Cafarella as a text search library
2003: Google’s distributed file system paper published
2006: Yahoo hired Doug Cutting and supported Hadoop
2008: Yahoo announced that its search index was generated by a 10,000-core Hadoop cluster
2009: Won the minute sort by sorting 500 GB in 59 seconds
2013: More than half of the Fortune 50 use Hadoop

Current projects under Apache Hadoop
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• Spark
• Tez
• ZooKeeper

Hadoop Distributed File System (HDFS)
• A file system that manages storage across a network of machines
• Designed to handle
  - Very large files: terabytes, petabytes
  - Streaming data access: write once, read many times
  - Commodity hardware: commonly available hardware
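A very large file is stored as a sequence of blocks spread across machines (the read and write slides show blocks A, B, and C). This slide does not give a block size or replication factor, so the 128 MB block size and three replicas below are assumptions chosen only to illustrate the arithmetic.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size; not stated in the slides
REPLICATION = 3       # assumed; the read slide lists three Data Nodes per block


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    one_terabyte_mb = 1_000_000
    print(block_count(one_terabyte_mb), "blocks")
    print(raw_storage_mb(one_terabyte_mb), "MB of raw storage")
```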

Namenodes and Datanodes
• Two types of nodes operating in a master-worker pattern
• Namenode
  - Master node
  - Manages the filesystem namespace
  - Maintains metadata for all the files and directories in the tree
• Datanode
  - Workhorses of the file system
  - Store and retrieve blocks when told to by a client or the Namenode
  - Periodically report to the Namenode
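The division of labour between the two node types can be modelled in a few lines. This is a toy in-memory model, not the Hadoop API: the class and method names are hypothetical, and a real Namenode tracks far more state.

```python
class NameNode:
    """Toy master: holds only metadata, never file contents."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of Datanode ids

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode_id, block_ids):
        """Record which blocks a Datanode says it currently holds."""
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def locate(self, path):
        """Map each block of a file to the Datanodes known to hold it."""
        return {b: sorted(self.block_locations.get(b, ())) for b in self.namespace[path]}


class DataNode:
    """Toy worker: stores block contents and reports them to the master."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def report_to(self, namenode):
        namenode.receive_block_report(self.node_id, list(self.blocks))
```

The Namenode learns where blocks live only from the reports, which is why the periodic report matters.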

HDFS Architecture
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Client reading files from HDFS
Client to Name Node: “Tell me the block locations of results.txt”
Name Node to client:
  Blk A = 1, 5, 6
  Blk B = 1, 2, 8
  Blk C = 5, 8, 9
Name Node metadata for results.txt:
  Blk A: DN1, DN5, DN6
  Blk B: DN8, DN1, DN2
  Blk C: DN5, DN8, DN9
[Figure: Data Nodes 1, 2, 5, 6, 8, and 9, grouped behind three switches, each holding replicas of blocks A, B, and C]
• Client receives the Data Node list for each block
• Picks the first Data Node for each block
• Reads blocks sequentially
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Client-Read-from-HDFS.PNG
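The three read steps can be sketched as follows. The block locations are the ones in the Name Node metadata on this slide. The block contents, the dictionary layout, and the function names are hypothetical stand-ins, not the HDFS client API.

```python
# Block locations as listed in the Name Node metadata on this slide.
METADATA = {
    "results.txt": [
        ("A", [1, 5, 6]),
        ("B", [8, 1, 2]),
        ("C", [5, 8, 9]),
    ]
}

# Hypothetical block contents per Data Node (node id -> {block id: bytes}).
DATA_NODES = {
    1: {"A": b"alpha ", "B": b"beta "},
    2: {"B": b"beta "},
    5: {"A": b"alpha ", "C": b"gamma"},
    6: {"A": b"alpha "},
    8: {"B": b"beta ", "C": b"gamma"},
    9: {"C": b"gamma"},
}


def get_block_locations(filename):
    """Step 1: ask the Name Node for the Data Node list of each block."""
    return METADATA[filename]


def read_file(filename):
    """Steps 2 and 3: pick the first Data Node per block, read blocks in order."""
    parts = []
    for block_id, nodes in get_block_locations(filename):
        first_node = nodes[0]
        parts.append(DATA_NODES[first_node][block_id])
    return b"".join(parts)


if __name__ == "__main__":
    print(read_file("results.txt"))
```

The Name Node only answers the location query. The block data itself flows directly from the Data Nodes to the client.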

Writing files to HDFS
Client to Name Node: “I want to write blocks A, B, C of file.txt”
Name Node to client: “OK. Write to Data Nodes 1, 5, 6”
[Figure: the client holds file.txt as Blk A, Blk B, Blk C and sends them to Data Nodes 1, 5, 6, ..., N]
• Client consults the Name Node
• Writes the block directly to one Data Node
• Data Node replicates the block
• Cycle repeats for the next block
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Writing-Files-to-HDFS.PNG
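The write cycle above can be sketched in the same toy style. The slide says only that the Data Node replicates the block. Forwarding it node to node in a chain, and always choosing the first three nodes as targets, are assumptions of this sketch, and all names are hypothetical.

```python
class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes

    def write(self, block_id, data, remaining_targets):
        """Store the block, then forward it to the next target, if any."""
        self.blocks[block_id] = data
        if remaining_targets:
            remaining_targets[0].write(block_id, data, remaining_targets[1:])


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication

    def choose_targets(self, block_id):
        # Assumption: always the first `replication` nodes; a real cluster
        # would weigh rack placement and free space.
        return self.datanodes[: self.replication]


def write_file(namenode, blocks):
    """For each block: consult the Name Node, then write to the first target."""
    for block_id, data in blocks:
        targets = namenode.choose_targets(block_id)
        targets[0].write(block_id, data, targets[1:])


if __name__ == "__main__":
    nodes = [DataNode(i) for i in (1, 5, 6, 7)]
    write_file(NameNode(nodes), [("A", b"a"), ("B", b"b"), ("C", b"c")])
    for node in nodes:
        print(node.node_id, sorted(node.blocks))
```

The client sends each block once, and the copies are made between Data Nodes.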

What is MapReduce?
• MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm
on a cluster.
• Published in 2004 by Google engineers Jeffrey
Dean and Sanjay Ghemawat.

MapReduce Features
• Large-scale distributed data processing
• Parallel programming.
• Simple but restricted.
• Load Balancing
• Handling machine failure

When should we use MapReduce?
Query
• Index and search such as inverted index
• Classification
• Filtering
Analytics
• Sorting and merging
• Frequency distribution
• Summarization and statistics
• SQL-based queries: group by, having, etc.
• Generation of graphics
Others
• Message passing, such as the breadth-first search algorithm

MapReduce Inspiration!
- Read massive data
- Map: extract data from each record
  map(in_key, in_value) -> list of (out_key, intermediate_value)
- Shuffle and sort
- Reduce: aggregate, filter, summarize, and transform
  reduce(out_key, list of intermediate_value) -> list of out_value
- Write the result
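The five steps can be run in memory to see how the two signatures fit together. This is a single-process sketch, not the Hadoop MapReduce API. The driver `run_mapreduce` and the inverted-index mapper and reducer (one of the use cases listed earlier) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle-and-sort, reduce pipeline."""
    # Map: each (in_key, in_value) record yields (out_key, intermediate_value) pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(mapper(in_key, in_value))

    # Shuffle and sort: group the intermediate values by out_key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: each (out_key, list of values) group yields a list of output values.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def index_mapper(doc_id, text):
    """Emit (word, doc_id) for every word in a document."""
    return [(word, doc_id) for word in text.lower().split()]


def index_reducer(word, doc_ids):
    """Collapse the doc ids seen for one word into a sorted, de-duplicated list."""
    return sorted(set(doc_ids))


DOCS = [("d1", "hadoop stores data"), ("d2", "hadoop processes data")]
INDEX = run_mapreduce(DOCS, index_mapper, index_reducer)

if __name__ == "__main__":
    for word, doc_ids in INDEX.items():
        print(word, doc_ids)
```

Only the mapper and reducer are specific to the problem. The driver stays the same for every job, which is the "simple but restricted" trade-off.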

MapReduce Process Architecture

MapReduce Examples
1. Word Counting
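Word counting is commonly written as a mapper that emits (word, 1) and a reducer that sums the ones. The sketch below is a single-process illustration, not Hadoop code. It splits on whitespace only, so punctuation handling is left out, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line_number, line):
    """Map: emit (word, 1) for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, ones):
    """Reduce: sum the 1s emitted for a single word."""
    return sum(ones)


def word_count(lines):
    """Run map, then sort by word (the shuffle), then reduce each group."""
    pairs = [
        pair
        for number, line in enumerate(lines)
        for pair in map_words(number, line)
    ]
    pairs.sort(key=itemgetter(0))
    return {
        word: reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    }


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

Sorting before grouping mirrors the shuffle-and-sort step: all pairs for one word end up adjacent, so the reducer sees them together.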

MapReduce Algorithms
1. MapReduce-based disease propagation detection
2. MapReduce-based trading strategies
3. MapReduce-based graph processing algorithms

Final Note!
• The open source community is taking newer and larger steps
– Spark, Ceph, OpenStack
• Need for better processing
– Batch processing + Streaming
• Time to move on from Hadoop?

