Tutorial5

B.V.V Sangha’s
BASAVESHWAR ENGINEERING COLLEGE (Autonomous), BAGALKOT-587103
Department of Computer Science and Engineering
TUTORIAL ASSIGNMENT – 5
“Seminar on Hadoop Ecosystem MapReduce”
Submitted by:
Shweta Policepatil
M.Tech 1st Sem CSE
Roll No: 09
Submitted to:
Dr. S M Hatture
Course Coordinator

Contents
Overview
What is MapReduce?
MapReduce Data flow
Word Count Example
Features
Implementations
MapReduce v/s Parallel Databases
Conclusion

Overview
MapReduce is a programming model and an
associated implementation for processing and
generating large data sets. It was first applied to
collecting and analyzing website data for search optimization.
Apache, the open source organization, began
using MapReduce in the “Nutch” project, which
is an open source web search engine that is still
active today.

What is MapReduce?
It is a programming model.
Provides
- Automatic parallelization and distribution.
Hadoop is capable of running MapReduce
programs written in various languages:
- Java, Ruby, Python, and C++
Performs large-scale data analysis.

Example: Word Count
Consider the problem of counting the number
of occurrences of each word in a large
collection of documents.
How would you do it in parallel?

Solution:
- Divide documents among workers
- Each worker parses its documents to find all
words and outputs (word, count) pairs.
- Partition (word, count) pairs across workers
based on word.
- For each word at a worker, locally add up
counts.
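The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "workers" are simulated one after another in a single process, and the names parse, partition, add_up and word_count are hypothetical helpers, not part of Hadoop. A byte-sum is used to pick the worker because Python's built-in hash of strings changes between runs.

```python
from collections import Counter, defaultdict

def parse(document):
    """Worker step: output a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def partition(pairs, num_workers):
    """Send each pair to a worker chosen from the word itself,
    so that all pairs for the same word reach the same worker."""
    buckets = defaultdict(list)
    for word, count in pairs:
        worker_id = sum(word.encode("utf-8")) % num_workers  # deterministic choice
        buckets[worker_id].append((word, count))
    return buckets

def add_up(pairs):
    """Worker step: locally add up the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(documents, num_workers=3):
    # Steps 1 and 2: divide the documents and parse each one.
    pairs = [pair for doc in documents for pair in parse(doc)]
    # Step 3: partition the pairs across workers based on the word.
    buckets = partition(pairs, num_workers)
    # Step 4: each worker sums its own words; no word spans two workers.
    result = {}
    for worker_pairs in buckets.values():
        result.update(add_up(worker_pairs))
    return result

if __name__ == "__main__":
    print(word_count(["the crew of the space shuttle", "the crew"]))
```

Because a word is always routed to the same worker, the per-worker totals can simply be merged without any further addition.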

Big document:
The crew of the space shuttle Endeavor recently
returned to Earth as ambassadors, harbingers of
a new era of space exploration. Scientists at
NASA are saying that the recent assembly of the
Dextre bot is the first step in a long-term
space-based man/machine partnership. "The work
we're doing now -- the robotics we're doing --
is what we're going to need ……

MAP (provided by the programmer):
Read input and produce a set of key-value pairs
(The, 1)
(crew, 1)
(of, 1)
(the, 1)
(space, 1)
(shuttle, 1)
(Endeavor, 1)
(recently, 1)
….

Group by key:
Collect all pairs with the same key
(crew, 1)
(crew, 1)
(space, 1)
(the, 1)
(the, 1)
(the, 1)
(shuttle, 1)
(recently, 1)
…

Reduce (provided by the programmer):
Collect all values belonging to the key and output
(crew, 2)
(space, 1)
(the, 3)
(shuttle, 1)
(recently, 1)
…

Each stage passes (key, value) pairs to the next.
The data is read sequentially; only sequential reads are needed.

MapReduce Word Count Scenario
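The division of labour in this scenario, where the programmer supplies only the Map and Reduce functions and the framework groups by key, can be sketched as follows. This is a minimal single-process illustration, not Hadoop's API: map_reduce, wc_map and wc_reduce are hypothetical names, and lower-casing words in the mapper is an assumption made so that "The" and "the" are counted together.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    """Tiny stand-in for the framework: map, group by key, reduce."""
    # Map: read the input sequentially and collect the emitted (key, value) pairs.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Group by key: sorting places all pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: hand each key and the list of its values to the user's function.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

def wc_map(line):
    """Provided by the programmer: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def wc_reduce(word, counts):
    """Provided by the programmer: add up all counts collected for one word."""
    return (word, sum(counts))

if __name__ == "__main__":
    lines = ["The crew of the space shuttle", "the crew"]
    for word, total in map_reduce(lines, wc_map, wc_reduce):
        print(word, total)
```

Swapping in a different pair of map and reduce functions changes the job without touching the grouping logic, which is the point of the model.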

Features
Java and C++ APIs
Each task can process data sets larger than RAM
Automatic re-execution on failure
- In a large cluster, some nodes are always slow
- Framework re-executes failed tasks
Locality optimizations
- Queries HDFS for locations of input data
- Tasks are scheduled close to the inputs when
possible.
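The automatic re-execution feature listed above can be pictured with a small retry loop. This is a sketch under assumptions, not Hadoop's scheduler: run_with_retries and max_attempts are hypothetical names, and a raised exception stands in for a failed task.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(); if it raises, re-execute it, up to max_attempts in total.

    Returns (result, attempts_used). max_attempts is an illustrative
    parameter, not a Hadoop setting.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # a failed task is simply tried again
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_task():
        # Simulated task that fails on its first two runs.
        calls["n"] += 1
        if calls["n"] < 3:
            raise IOError("simulated node failure")
        return "done"

    print(run_with_retries(flaky_task))
```

Re-running a task is safe here because the task only reads its input and returns a value; the same property is what lets a framework restart map or reduce tasks freely.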

Implementations
Google
- Not available outside Google
Hadoop
- An open-source implementation in Java
- Uses HDFS for stable storage
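And several related systems, such as Cassandra
(a distributed database originally developed at
Facebook), which can serve as a data source for
Hadoop MapReduce jobs.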

MapReduce v/s Parallel Databases
Data
- MapReduce: Designed for unstructured data
- Parallel Database: Designed for structured relational data
Query Interface
- MapReduce: Programs written in a variety of languages (some SQL support)
- Parallel Database: SQL
Query Execution
- MapReduce: Materializes results between Map and Reduce phases
- Parallel Database: Pipelines results between operators
Job Granularity
- MapReduce: Determined by data storage block size (runtime scheduler)
- Parallel Database: Entire query
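The query-execution contrast above, materializing results between phases versus pipelining them between operators, can be illustrated with a toy two-stage computation. This is only an analogy in plain Python, assuming a list stands for a materialized intermediate result and a generator stands for a pipeline; the function names and the log are hypothetical.

```python
def materialized(records, log):
    """Stage 1 finishes and stores ALL its output before stage 2 starts."""
    intermediate = []
    for r in records:
        log.append(f"stage1 {r}")
        intermediate.append(r * 2)      # full intermediate result is kept
    output = []
    for m in intermediate:
        log.append(f"stage2 {m}")
        output.append(m + 1)
    return output

def pipelined(records, log):
    """Each value flows to stage 2 as soon as stage 1 produces it."""
    def stage1():
        for r in records:
            log.append(f"stage1 {r}")
            yield r * 2                 # nothing is stored between stages
    output = []
    for m in stage1():
        log.append(f"stage2 {m}")
        output.append(m + 1)
    return output

if __name__ == "__main__":
    log_a, log_b = [], []
    print(materialized([1, 2], log_a), log_a)
    print(pipelined([1, 2], log_b), log_b)
```

Both versions return the same answer; only the order of work differs. Keeping the intermediate result around costs storage, but it means a failed second stage can be rerun without repeating the first.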

Conclusion
In conclusion, we learned that MapReduce is a
programming model for processing large data
sets, and why it is used instead of parallel
databases for such workloads. The distributed
file system provides the platform for MapReduce:
jobs run on data stored in HDFS, which is managed
by the NameNode and DataNodes, so that tasks can
process big data close to where it is stored.

