This document provides an introduction to and overview of Apache Hadoop. It begins with an outline and explains why Hadoop matters given the rapid growth of data. It then describes Hadoop's core components: HDFS for distributed storage and MapReduce for distributed computing. The document explains how Hadoop achieves scalability and fault tolerance, gives examples of how Hadoop is used in production at large companies, and concludes with a discussion of the broader Hadoop ecosystem and an invitation for questions.
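To give a concrete feel for the MapReduce model the document describes, here is a minimal word-count sketch using Hadoop's standard org.apache.hadoop.mapreduce Java API. It is an illustrative example, not taken from the original document; the class name WordCount and the command-line input/output paths are assumptions for the sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each line in an input split, emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The key idea the example shows is that the programmer writes only the per-record map and per-key reduce logic; the framework handles splitting the HDFS input across nodes, shuffling intermediate pairs, and re-running failed tasks, which is where the scalability and fault tolerance mentioned above come from.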