Hadoop and MapReduce

HADOOP
Framework and
Applications

Prepared by: TEAM HADOOP slide1/22

CONTENTS
• WHY HADOOP?

• INTRODUCTION TO MapReduce


WHAT?
“... to create building blocks for programmers
who just happen to have lots of data to
store, lots of data to analyze, or lots of machines
to coordinate, and who don't have the
time, the skill, or the inclination to become
distributed systems experts to build the
infrastructure to handle it.”
-Tom White

Source: Hadoop: The Definitive Guide


WHAT?
• Hadoop contains many subprojects:
  • Hadoop Common
  • Chukwa
  • HBase
  • ZooKeeper
  • Pig
  • Zombie
  • Hive
  • MapReduce

We will focus on MapReduce


WHO & WHEN?
• Pre-2004: Cutting and Cafarella develop
open source projects for web-scale
indexing, crawling and search.


WHO & WHEN?
 2004: Jeffrey Dean and Sanjay
Ghemawat introduce map reduce model
used internally at Google.


WHO & WHEN?
 2006:Hadoop becomes official Apache
project, Cutting joins Yahoo!Yahoo
adopts Hadoop.


TRENDS


WHO USES IT?


Roughly how long to read 1TB
from a commodity hard disk?


Around 4 hours.

WITH HADOOP...

62 seconds.
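
The gap between the two figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a sustained sequential read rate of about 70 MB/s for a single commodity disk, and an idealized cluster in which every disk reads an equal share in parallel with no overhead. The read rate, the disk count and the function name `read_seconds` are illustrative assumptions, not figures from the slides.

```python
TB = 10**12  # bytes in one terabyte (decimal)

def read_seconds(total_bytes, mb_per_second, disks=1):
    """Idealized time to read total_bytes split evenly across `disks` disks."""
    bytes_per_second = mb_per_second * 10**6
    return total_bytes / (bytes_per_second * disks)

# One disk at an assumed 70 MB/s: roughly four hours.
single = read_seconds(TB, mb_per_second=70)
print(f"1 disk: {single / 3600:.1f} hours")

# Assumed 230 disks reading in parallel: on the order of a minute.
parallel = read_seconds(TB, mb_per_second=70, disks=230)
print(f"230 disks: {parallel:.0f} seconds")
```

A real job also pays for network transfer, sorting and scheduling, so the number of machines needed in practice is larger than this ideal estimate suggests.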


INTRODUCTION TO MapReduce

"Break large problem into smaller parts, solve in
parallel, combine results."


Typical scenario
• How many times does the word 'IT' appear?
You could count by hand, but could you do it
for a 30,000-page document?


Map Reduce Typical Illustration


Map Reduce paradigm

Input -> Map -> Shuffle/Sort -> Reduce -> Output


Map Reduce paradigm
• Map: transforms each input record into
intermediate (key, value) pairs


Map Reduce paradigm
• Reduce: transforms all records for a given
key into the final output.
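
The word-count scenario from the earlier slide makes the map and reduce steps concrete. What follows is a minimal single-process sketch in plain Python, not the Hadoop API: the names `map_fn`, `shuffle`, `reduce_fn` and `word_count` are hypothetical, and folding words to upper case and stripping punctuation are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for each word in one record."""
    for word in line.split():
        # Assumption: ignore case and trailing punctuation when counting.
        yield word.strip(".,!?").upper(), 1

def shuffle(pairs):
    """Shuffle/Sort: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: collapse all values for one key into the final output."""
    return key, sum(values)

def word_count(lines):
    intermediate = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

lines = ["IT is everywhere.", "Is it? It is."]
counts = word_count(lines)
print(counts["IT"])
```

In a real cluster the map calls run on the machines that hold the input blocks, and the grouping step moves data across the network. Here everything runs in one process purely to show the data flow.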


Map reduce principles

• Move code to data (local computation)

• Abstract away fault tolerance, synchronization, etc.

• Allow programs to scale transparently w.r.t. size of input


Implementation: Hardware

Prepared by: TEAM HADOOP sroy choudhury7@gmail.com slide 19/22

Map Reduce: strengths
• Batch, offline jobs

• Write-once, read-many across full data set

• Usually, though not always, simple computations

• I/O bound by disk/network bandwidth


What it's not:

• High-performance parallel computing, e.g. MPI

• Low-latency random access relational database

• Always the right solution


THANK YOU!
QUESTIONS?

