Hadoop MapReduce Paradigm

Gandhinagar Institute of Technology
SUBJECT – DMBI (2170715)
Prepared By – Tarj Mehta (170120107074)
Guided By – Prof. Nisha Khurana

Have you ever wondered how Google runs queries over its
mountain of data?
How is Facebook able to deal so quickly with such large
quantities of information?

The problem…
• In the early days, companies paid database vendors to house their
data.
• This approach worked well for small to medium amounts of data.
• In the early 2000s, Google ran into a problem.
• It had to pay large amounts of money to database vendors such as
Oracle, IBM and Microsoft to hold its data, so data processing
was becoming expensive.

The solution…
• To address the problem, a Google Labs team developed an algorithm.
• The algorithm allowed a calculation over a large data set to be
chopped into smaller chunks (tuples of data).
• The small chunks were mapped to many computers.
• When required, a further calculation brings the chunk results back
together and produces the resulting data set.
• This algorithm is called MapReduce.
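The chop, distribute and combine idea above can be illustrated with a minimal Python sketch. This is not Google's or Hadoop's implementation: the function names (`split_into_chunks`, `process_chunk`, `chunked_total`) are hypothetical, the "calculation" is just a sum, and worker threads stand in for the many computers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Chop one large list into smaller lists of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one chunk: here, a partial sum."""
    return sum(chunk)


def chunked_total(values, chunk_size=4, workers=3):
    """Sum a list by processing its chunks separately, then combining."""
    chunks = split_into_chunks(values, chunk_size)
    # Assumption: each worker thread stands in for one computer in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Bring the partial results back together into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(chunked_total(list(range(1, 11))))
```

Because each chunk is processed without looking at any other chunk, the chunks could in principle be handled on different machines, with only the small partial results sent back to be combined.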

Hadoop
• The MapReduce algorithm was later used to develop an open-source
project called Hadoop.
• Hadoop allows different applications to run using the MapReduce
algorithm.
• Put simply, data is processed in parallel rather than serially.
• Hadoop is written in Java, and MapReduce jobs are typically coded
in Java.

Why Hadoop?
• In organizations with more than 10 terabytes of data, computationally
complex work such as statistical simulation takes a long time to run.
• Hadoop plays a central role in statistical analysis, ETL (Extract,
Transform, Load) processing and business applications.

The Algorithm
• MapReduce is based on sending the computation to where the data
resides.
• It has 2 stages:
1. Map stage: the mapper processes the input data, a file or directory
stored in the Hadoop Distributed File System (HDFS). The input file is
passed to the mapper line by line and converted into smaller chunks.
2. Reduce stage: the reducer processes the data that comes from the
mapper. After processing, it produces a new set of output, which is
stored in HDFS.
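The two stages can be sketched with the classic word-count example. This is a plain-Python illustration, not the Hadoop API: `mapper`, `group_by_key`, `reducer` and `word_count` are hypothetical names, and a list of strings stands in for a file in HDFS. The grouping step is added so the sketch runs on its own. In Hadoop, the framework gathers the values for each key between the map and reduce stages.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Collect every value emitted for the same key (a stand-in for the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Reduce stage: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run the map stage line by line, group by key, then run the reduce stage."""
    pairs = [pair for line in lines for pair in mapper(line)]
    grouped = group_by_key(pairs)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The mapper sees one line at a time and the reducer sees one key at a time, which is what lets both stages be split across many machines.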

• During a MapReduce job, Hadoop sends the map and reduce tasks to
the appropriate servers in the cluster.
• Hadoop manages all the details of data passing, such as issuing
tasks, verifying task completion and copying data around the cluster.
• Most of the computation is done on the nodes that hold the data, which
reduces network traffic.
• After the given tasks complete, the cluster collects and reduces the
data to form the result and sends it back to the Hadoop server.
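The preference for running tasks on the nodes that already hold the data can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions, not how Hadoop's scheduler is actually implemented: `assign_map_tasks`, the block ids and the node names are all hypothetical, and every free node is assumed able to take any number of tasks.

```python
def assign_map_tasks(block_locations, free_nodes):
    """Assign each data block to a node, preferring one that already stores it.

    block_locations: dict mapping block id -> list of nodes holding a copy.
    free_nodes: non-empty list of nodes that can accept tasks.
    Returns a dict mapping block id -> (node, is_local).
    """
    assignments = {}
    for index, (block, holders) in enumerate(sorted(block_locations.items())):
        local_nodes = [node for node in holders if node in free_nodes]
        if local_nodes:
            # Data-local task: the block is read from the node's own disk.
            assignments[block] = (local_nodes[0], True)
        else:
            # No free node holds the block, so it must travel over the network.
            fallback = free_nodes[index % len(free_nodes)]
            assignments[block] = (fallback, False)
    return assignments


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n3"], "b3": ["n4"]}
    for block, (node, is_local) in assign_map_tasks(blocks, ["n1", "n3"]).items():
        print(block, node, "local" if is_local else "remote")
```

In this toy run only block b3 has to be copied to another node. The more tasks that can be placed locally, the less data crosses the network.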

References
• https://www.youtube.com/watch?v=9s-vSeWej1U
• https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
