Hadoop Puzzlers
Aaron Myers & Daniel Templeton
Cloudera, Inc.
Your Hosts
Aaron “ATM” Myers
• AKA “Cash Money”
• Software Engineer
• Apache Hadoop Committer
Daniel Templeton
• Certification Developer
• Crusty, old HPC guy
• Likes Perl
©2014 Cloudera, Inc. All rights reserved.2
What is a Hadoop Puzzler?
• Shameless knockoff of Josh Bloch’s Java Puzzlers talks
• We’ll walk through a puzzle
• You vote on the answer
• We all learn a valuable lesson
Let’s try it, OK?
An Easy One
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key,
      Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class MaxReduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key,
      Iterable<IntWritable> values,
      Context c) … {
    IntWritable max = new IntWritable(0);
    for (IntWritable v : values)
      if (v.get() > max.get())
        max = v;
    c.write(key, max);
  }
}
The data:
A,1
A,5
A,3
The results:
a) A 5
b) A 1
c) A 3
d) The job fails
An Easy One (Answer)
c) A 3
max = v stores a reference to the reused IntWritable, so after the loop max holds whichever value was read last (3, assuming the values arrive in input order).
An Easy One (Moral)
• MapReduce reuses Writables whenever it can
• That includes while iterating through the values
• Always store the value itself, not a reference to the Writable!
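The reuse described in this moral can be reproduced without a cluster. The sketch below is only an illustration: IntBox and ReusingValues are hypothetical stand-ins (not Hadoop classes) that mimic an iterator handing back the same mutable object on every step. aliasedMax follows the same logic as MaxReduce, while copiedMax keeps the primitive value instead.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for IntWritable: a mutable box around an int. */
final class IntBox {
  private int value;
  void set(int value) { this.value = value; }
  int get() { return value; }
}

/** Hypothetical stand-in for the reducer's values: next() always returns the SAME box, refilled. */
final class ReusingValues implements Iterable<IntBox> {
  private final List<Integer> data;
  private final IntBox shared = new IntBox();

  ReusingValues(List<Integer> data) { this.data = data; }

  @Override
  public Iterator<IntBox> iterator() {
    final Iterator<Integer> it = data.iterator();
    return new Iterator<IntBox>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public IntBox next() {
        shared.set(it.next());   // overwrite the shared object in place
        return shared;
      }
    };
  }
}

final class WritableReuseDemo {
  /** Same logic as MaxReduce: keeps a reference to the box. */
  static int aliasedMax(Iterable<IntBox> values) {
    IntBox max = new IntBox();
    for (IntBox v : values)
      if (v.get() > max.get())
        max = v;                 // max now aliases the shared box
    return max.get();
  }

  /** Keeps the primitive value instead of the box. */
  static int copiedMax(Iterable<IntBox> values) {
    int max = 0;
    for (IntBox v : values)
      if (v.get() > max)
        max = v.get();
    return max;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 5, 3);
    System.out.println(aliasedMax(new ReusingValues(data)));  // prints 3
    System.out.println(copiedMax(new ReusingValues(data)));   // prints 5
  }
}
```

Once max points at the shared box, the comparison v.get() > max.get() compares the box with itself, so the result is simply the last value the iterator produced.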
A Sinking Feeling
public class AsyncSubmit
    extends Configured
    implements Tool {
  public static void main(String[] args)
      throws Exception {
    int ret = ToolRunner.run(
        new Configuration(),
        new AsyncSubmit(), args);
    System.exit(ret);
  }

  public int run(String[] args)
      throws Exception {
    Job job = Job.getInstance(getConf());
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job,
        new Path(args[0]));
    FileOutputFormat.setOutputPath(job,
        new Path(args[1]));
    job.waitForCompletion(false);
    return job.isComplete() ? 1 : 0;
  }
}
The data:
The complete works of
William Shakespeare
The results:
a) Fails to compile
b) The job fails
c) Exits with 0
d) Exits with 1
A Sinking Feeling (Answer)
d) Exits with 1
waitForCompletion(false) still blocks until the job finishes, so isComplete() is true and run() returns 1.
A Sinking Feeling (Moral)
• Read the API docs!
• Sometimes the obvious meanings of methods and parameters aren’t correct
• The parameter of waitForCompletion() controls whether status output is printed, not whether the call blocks
• The driver does wait for the job to exit but does not print all the job status information
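One way to see the exit-code consequence is a small simulation. This is a sketch under stated assumptions: JobHandle is a hypothetical interface that only mirrors the names of the two Job methods used on the slide, and finishedJob is a fake that completes immediately. It is not Hadoop code, and the 0-on-success mapping is just the usual shell convention.

```java
/** Hypothetical stand-in for the two Job methods used on the slide. */
interface JobHandle {
  /** Blocks until the job finishes; 'verbose' only controls progress output. Returns true on success. */
  boolean waitForCompletion(boolean verbose);

  /** True once the job has finished, whether it succeeded or failed. */
  boolean isComplete();
}

final class ExitCodeDemo {
  /** Mirrors the slide: reports 1 for any finished job. */
  static int slideExitCode(JobHandle job) {
    job.waitForCompletion(false);
    return job.isComplete() ? 1 : 0;
  }

  /** Conventional mapping: 0 on success, non-zero on failure. */
  static int conventionalExitCode(JobHandle job) {
    return job.waitForCompletion(false) ? 0 : 1;
  }

  /** Fake job (an assumption for this sketch) that finishes immediately with the given outcome. */
  static JobHandle finishedJob(final boolean succeeded) {
    return new JobHandle() {
      private boolean done = false;
      @Override public boolean waitForCompletion(boolean verbose) { done = true; return succeeded; }
      @Override public boolean isComplete() { return done; }
    };
  }

  public static void main(String[] args) {
    System.out.println(slideExitCode(finishedJob(true)));          // prints 1
    System.out.println(conventionalExitCode(finishedJob(true)));   // prints 0
    System.out.println(conventionalExitCode(finishedJob(false)));  // prints 1
  }
}
```

The slide's version returns 1 whether the job succeeded or failed, so a calling script cannot tell the two apart.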
Do-over
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key,
      Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class MaxReduceRedux
    extends Reducer<Text, Text, Text, IntWritable> {
  protected void reduce(Text key,
      Iterable<IntWritable> values,
      Context c) … {
    int max = 0;
    for (IntWritable v : values)
      if (v.get() > max)
        max = v.get();
    c.write(key, new IntWritable(max));
  }
}
The data:
A,1
A,5
The results:
a) A 5
b) A 1
c) A 1
   A 5
d) The job fails
Do-over (Answer)
c) A 1
   A 5
reduce() does not match the signature inherited from Reducer<Text,Text,Text,IntWritable>, so it is never called and the default reduce() passes every value through.
Do-over (Moral)
• Mismatched signatures can lead to unexpected behavior because the base class method implementations remain exposed
• ALWAYS use @Override!
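The overload-versus-override trap can be shown in plain Java. In this sketch, BaseReducer is a hypothetical stand-in for a framework base class with a pass-through default, and MismatchedMax and MatchedMax are illustrative names; none of them are Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a framework base class with a default pass-through reduce(). */
class BaseReducer<K, V> {
  protected final List<String> out = new ArrayList<>();

  protected void reduce(K key, Iterable<V> values) {
    for (V v : values) out.add(key + " " + v);   // identity: emit every value
  }

  /** The "framework" always calls reduce(K, Iterable<V>). */
  final List<String> run(K key, Iterable<V> values) {
    reduce(key, values);
    return out;
  }
}

/** Declared for Number values, but reduce() takes Iterable<Integer>: an overload, not an override. */
class MismatchedMax extends BaseReducer<String, Number> {
  // Adding @Override here would turn the mismatch into a compile error.
  protected void reduce(String key, Iterable<Integer> values) {
    int max = 0;
    for (Integer v : values) if (v > max) max = v;
    out.add(key + " " + max);
  }
}

/** Signature matches the declared type parameters, so @Override compiles and the method is called. */
class MatchedMax extends BaseReducer<String, Number> {
  @Override
  protected void reduce(String key, Iterable<Number> values) {
    int max = 0;
    for (Number v : values) if (v.intValue() > max) max = v.intValue();
    out.add(key + " " + max);
  }
}

final class OverrideDemo {
  public static void main(String[] args) {
    List<Number> data = Arrays.<Number>asList(1, 5);
    System.out.println(new MismatchedMax().run("A", data));  // prints [A 1, A 5]
    System.out.println(new MatchedMax().run("A", data));     // prints [A 5]
  }
}
```

Both subclasses compile, which is the point: without @Override nothing flags that MismatchedMax.reduce() is dead code.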
Joining Forces
hive> DESCRIBE table1;
OK
id      int
phone   string
state   string
Time taken: 0.236 seconds
hive> DESCRIBE table2;
OK
id      int
city    string
state   string
Time taken: 0.116 seconds
hive> CREATE TABLE table3 AS SELECT
    table2.*,table1.phone,table1.state
    AS s FROM table1 JOIN table2 ON
    (table1.id == table2.id);
…
hive> EXPORT TABLE table3 TO
    '/user/cloudera/table3.csv';
…
hive> exit
$ hadoop fs -cat table3.csv |
    head -1 | tr , '\n' | wc -l
The data:
hive> SELECT * FROM table1;
OK
1 6506506500 CA
2 2282282280 MS
Time taken: 1.006 seconds
hive> SELECT * FROM table2;
OK
1 Palo Alto CA
2 Gautier MS
Time taken: 1.202 seconds
The results:
a) 5
b) 4
c) 1
d) The join fails
Joining Forces (Answer)
c) 1
The exported line contains no commas, so tr has nothing to split on.
Joining Forces (Moral)
• Hive’s default delimiter is 0x01 (CTRL-A)
• Easy to assume export will use a sane delimiter – it doesn’t
• Incidentally, Hive’s join rules are pretty sane and work as you’d expect
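The effect of the delimiter on a field count can be checked in a few lines. This is a minimal sketch: the row below is a hypothetical reconstruction from the sample data on the slide (not actual export output), and countFields is an illustrative helper.

```java
final class DelimiterDemo {
  /** Hive's default field delimiter, 0x01 (Ctrl-A). */
  static final char CTRL_A = '\u0001';

  /** Hypothetical exported row, built from the sample data on the slide. */
  static String sampleRow() {
    return String.join(String.valueOf(CTRL_A),
        "1", "Palo Alto", "CA", "6506506500", "CA");
  }

  /** Counts the fields in a line for a given single-character delimiter. */
  static int countFields(String line, char delimiter) {
    int fields = 1;
    for (int i = 0; i < line.length(); i++)
      if (line.charAt(i) == delimiter) fields++;
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(countFields(sampleRow(), ','));     // prints 1
    System.out.println(countFields(sampleRow(), CTRL_A));  // prints 5
  }
}
```

Splitting on a comma sees one field because the row contains no commas; splitting on 0x01 recovers all five columns.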
Close Enough
public class MaxMap
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  Text k = new Text();
  IntWritable v = new IntWritable();

  protected void map(LongWritable key,
      Text val, Context c) … {
    String[] parts = val.toString().split(",");
    k.set(parts[0]);
    v.set(Integer.parseInt(parts[1]));
    c.write(k, v);
  }
}

public class Top20Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key,
      Iterable<IntWritable> values,
      Context c) … {
    float max = 0.0f;
    for (IntWritable v : values)
      if (v.get() > max) max = v.get();
    max *= 0.8f;
    for (IntWritable v : values)
      if (v.get() >= max)
        c.write(key, v);
  }
}
The data:
A,1
A,5
A,4
The results:
a) (no output)
b) A 5
c) A 5
   A 4
d) The job fails
Close Enough (Answer)
a) (no output)
The second loop over values finds nothing left to iterate, so nothing is written.
Close Enough (Moral)
• For scalability reasons, the values iterable is single-shot
• Subsequent iterators iterate over an empty collection
• Store values (not Writables!) in the first pass
• Better yet, restructure the logic to avoid storing all values in memory
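The single-shot behavior, and the "store values in the first pass" workaround, can be sketched as follows. SingleShotValues is a hypothetical stand-in for the reducer's values iterable (it hands out the same one-pass iterator every time), and plain Integers stand in for IntWritable values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the reducer's values: every iterator() call returns the same one-pass iterator. */
final class SingleShotValues implements Iterable<Integer> {
  private final Iterator<Integer> only;
  SingleShotValues(List<Integer> data) { this.only = data.iterator(); }
  @Override public Iterator<Integer> iterator() { return only; }
}

final class SingleShotDemo {
  /** Same shape as Top20Reduce: the second loop sees an exhausted iterator. */
  static List<Integer> twoLoops(Iterable<Integer> values) {
    float max = 0.0f;
    for (int v : values) if (v > max) max = v;
    max *= 0.8f;
    List<Integer> out = new ArrayList<>();
    for (int v : values) if (v >= max) out.add(v);
    return out;
  }

  /** Copies the primitive values during the only pass, then filters the copy. */
  static List<Integer> bufferThenFilter(Iterable<Integer> values) {
    List<Integer> seen = new ArrayList<>();
    float max = 0.0f;
    for (int v : values) {
      seen.add(v);
      if (v > max) max = v;
    }
    max *= 0.8f;
    List<Integer> out = new ArrayList<>();
    for (int v : seen) if (v >= max) out.add(v);
    return out;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 5, 4);
    System.out.println(twoLoops(new SingleShotValues(data)));          // prints []
    System.out.println(bufferThenFilter(new SingleShotValues(data)));  // prints [5, 4]
  }
}
```

Buffering works here because the sample is tiny; as the last bullet says, a key with many values would be better served by logic that does not hold them all in memory.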
Overbyte
public class MinLineMap
    extends Mapper<LongWritable, Text, Text, Text> {
  Text k = new Text();

  protected void map(LongWritable key,
      Text value, Context c) … {
    String val = value.toString();
    k.set(val.substring(0, 1));
    c.write(k, value);
  }
}

public class MinLineReduce
    extends Reducer<Text, Text, Text, IntWritable> {
  protected void reduce(Text key,
      Iterable<Text> values,
      Context c) … {
    int min = Integer.MAX_VALUE;
    for (Text v : values)
      if (v.getBytes().length < min)
        min = v.getBytes().length;
    c.write(key, new IntWritable(min));
  }
}
The data:
Hadoop
Spark
Hive
Sqoop2
The results:
a) H 4
   S 5
b) H 6
   S 5
c) H 6
   S 6
d) The job fails
Overbyte (Answer)
c) H 6
   S 6
getBytes() returns the reused backing array, whose length can exceed the length of the current value.
Overbyte (Moral)
• Writables get reused in loops
• In addition, Text.getBytes() returns the backing byte array, which is reused across calls and may be longer than the current contents
• Net result is wrongness
• Text.getLength() is the correct way to get the length of a Text
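The difference between the backing array's length and the valid length can be illustrated with a small model. ReusableText below is a hypothetical stand-in, not the real Text class; it assumes a buffer that is only reallocated when it is too small, which is the behavior this moral describes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for Text: a reusable buffer that is only reallocated when too small. */
final class ReusableText {
  private byte[] bytes = new byte[0];
  private int length = 0;

  void set(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length < utf8.length) bytes = new byte[utf8.length];  // grow only when needed
    System.arraycopy(utf8, 0, bytes, 0, utf8.length);
    length = utf8.length;
  }

  /** Backing array; may be longer than the current content. */
  byte[] getBytes() { return bytes; }

  /** Number of valid bytes in the backing array. */
  int getLength() { return length; }
}

final class OverbyteDemo {
  /** Same measurement as MinLineReduce: the backing array's length. */
  static int minByArrayLength(String... lines) {
    ReusableText t = new ReusableText();
    int min = Integer.MAX_VALUE;
    for (String line : lines) {
      t.set(line);
      min = Math.min(min, t.getBytes().length);
    }
    return min;
  }

  /** Uses the valid length instead. */
  static int minByGetLength(String... lines) {
    ReusableText t = new ReusableText();
    int min = Integer.MAX_VALUE;
    for (String line : lines) {
      t.set(line);
      min = Math.min(min, t.getLength());
    }
    return min;
  }

  public static void main(String[] args) {
    System.out.println(minByArrayLength("Hadoop", "Hive"));  // prints 6
    System.out.println(minByGetLength("Hadoop", "Hive"));    // prints 4
  }
}
```

After "Hadoop" the buffer holds six bytes, so "Hive" fits without reallocation and the array length stays at 6 even though only four bytes are valid.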
What We Learned
• Beware of reuse of Writables
• Always use @Override so your compiler can help you
• Don’t assume you know what a method does because of the name or parameters – read the docs!
• Sometimes scalability is inconvenient
One Closing Note
• Hadoop is still not easy
• Being good takes effort and experience
• Recognizing Hadoop talent can be hard
• Cloudera is working to make Hadoop talent easier to recognize through certification
http://cloudera.com/content/cloudera/en/training/certification.html
Aaron Myers & Daniel Templeton

Hadoop Puzzlers

  • 1. 1 Hadoop Puzzlers Aaron Myers & Daniel Templeton Cloudera, Inc.
  • 2. 2 Your Hosts Aaron “ATM” Myers • AKA “Cash Money” • Software Engineer • Apache Hadoop Committer Daniel Templeton • Certification Developer • Crusty, old HPC guy • Likes Perl ©2014 Cloudera, Inc. All rights reserved.2
  • 3. 3 What is a Hadoop Puzzler ©2014 Cloudera, Inc. All rights reserved.3 • Shameless knockoff of Josh Bloch’s Java Puzzlers talks • We’ll walk through a puzzle • You vote on the answer • We all learn a valuable lesson
  • 4. 4 ©2014 Cloudera, Inc. All rights reserved.4 Let’s try it, OK?
  • 5. 5 An Easy One public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.5
  • 6. 6 An Easy One The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.6
  • 7. 7 An Easy One public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.7 A 1 A 5 A 3
  • 8. 8 An Easy One The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.8
  • 9. 9 An Easy One (Answer) The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.9
  • 10. 10 An Easy One (Problem) public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.10
  • 11. 11 An Easy One (Moral) ©2014 Cloudera, Inc. All rights reserved.11 • MapReduce reuses Writables whenever it can • That includes while iterating through the values • Always be careful to only store the value instead of the Writable!
  • 12. 12 A Sinking Feeling public class AsyncSubmit extends Configured implements Tool { public static void main(String[] args) throws Exception { int ret = ToolRunner.run( new Configuration(), new AsyncSubmit(), args); System.exit(ret); } public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); job.setNumReduceTasks(0); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(false); return job.isComplete() ? 1 : 0; } } ©2014 Cloudera, Inc. All rights reserved.12
  • 13. 13 A Sinking Feeling The data: The complete works of William Shakespeare The results: a) Fails to compile b) The job fails c) Exits with 0 d) Exits with 1 ©2014 Cloudera, Inc. All rights reserved.13
  • 16. A Sinking Feeling (Answer)
      The data: The complete works of William Shakespeare
      The results:
        a) Fails to compile
        b) The job fails
        c) Exits with 0
        d) Exits with 1  (correct)
  • 17. A Sinking Feeling (Problem)
      public class AsyncSubmit extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
          int ret = ToolRunner.run(
              new Configuration(), new AsyncSubmit(), args);
          System.exit(ret);
        }
        public int run(String[] args) throws Exception {
          Job job = Job.getInstance(getConf());
          job.setNumReduceTasks(0);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.waitForCompletion(false);  // problem: the argument means "verbose", not "async"; this call still blocks
          return job.isComplete() ? 1 : 0;
        }
      }
  • 18. A Sinking Feeling (Moral)
      • Read the API docs!
      • Sometimes the obvious meanings of methods and parameters aren’t correct
      • The parameter to waitForCompletion() only controls whether status output is printed
      • With false, the driver still waits for the job to finish; it just does not print the job status information
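In Hadoop's `Job`, `waitForCompletion(boolean verbose)` blocks and returns whether the job succeeded, so a driver can turn that result into a conventional exit code. The sketch below shows only that mapping; `JobLike` is a hypothetical one-method stand-in so the example compiles without Hadoop on the classpath, and the 0-for-success convention is an assumption about what the driver wants to report.

```java
/** Hypothetical stand-in for the single Job method discussed here; not Hadoop's class. */
interface JobLike {
    /** Blocks until the job finishes; verbose only toggles progress output. True means success. */
    boolean waitForCompletion(boolean verbose);
}

final class DriverExit {
    /** Maps the job outcome to the usual process convention: 0 for success, 1 for failure. */
    static int exitCodeFor(JobLike job) {
        boolean succeeded = job.waitForCompletion(true);
        return succeeded ? 0 : 1;
    }
}
```

Against a real `Job`, the equivalent line in `run()` would be `return job.waitForCompletion(true) ? 0 : 1;`.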
  • 19. Do-over
      public class MaxMap
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();
        protected void map(LongWritable key, Text val, Context c) … {
          String[] parts = val.toString().split(",");
          k.set(parts[0]);
          v.set(Integer.parseInt(parts[1]));
          c.write(k, v);
        }
      }
      public class MaxReduceRedux
          extends Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
          int max = 0;
          for (IntWritable v: values)
            if (v.get() > max)
              max = v.get();
          c.write(key, new IntWritable(max));
        }
      }
  • 20. Do-over
      The data:
        A,1
        A,5
      The results:
        a) A 5
        b) A 1
        c) A 1
           A 5
        d) The job fails
  • 23. Do-over (Answer)
      The data:
        A,1
        A,5
      The results:
        a) A 5
        b) A 1
        c) A 1
           A 5  (correct)
        d) The job fails
  • 24. Do-over (Problem)
      public class MaxMap
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();
        protected void map(LongWritable key, Text val, Context c) … {
          String[] parts = val.toString().split(",");
          k.set(parts[0]);
          v.set(Integer.parseInt(parts[1]));
          c.write(k, v);
        }
      }
      public class MaxReduceRedux
          extends Reducer<Text, Text, Text, IntWritable> {  // problem: declares Text input values...
        protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {  // ...but takes IntWritable, so this overloads instead of overriding
          int max = 0;
          for (IntWritable v: values)
            if (v.get() > max)
              max = v.get();
          c.write(key, new IntWritable(max));
        }
      }
  • 25. Do-over (Moral)
      • Mismatched signatures can lead to unexpected behavior: a method that does not actually override leaves the base class implementation in place
      • ALWAYS use @Override!
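The overload-versus-override trap is plain Java and can be shown in miniature. In this sketch `MiniReducer` is a hypothetical base class with an identity `reduce()`, loosely modeled on the behavior the moral describes; none of these names come from Hadoop, and the raw-typed call in `runWith` is an assumption standing in for a framework that instantiates the class reflectively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical miniature of a framework base class with an identity reduce(). */
class MiniReducer<K, V> {
    protected void reduce(K key, Iterable<V> values, List<String> out) {
        for (V v : values)
            out.add(key + " " + v);   // default: pass every value through
    }

    /** The "framework" always calls reduce() through the base-class signature. */
    final void run(K key, Iterable<V> values, List<String> out) {
        reduce(key, values, out);
    }
}

/** Declares <String, String> but writes reduce() for Integer values: an overload, not an override. */
class MaxOverloaded extends MiniReducer<String, String> {
    // Adding @Override to this method makes the compiler reject it.
    protected void reduce(String key, Iterable<Integer> values, List<String> out) {
        int max = 0;
        for (int v : values)
            if (v > max) max = v;
        out.add(key + " " + max);
    }
}

/** Type parameters agree with the method, so @Override compiles and the method is used. */
class MaxOverridden extends MiniReducer<String, Integer> {
    @Override
    protected void reduce(String key, Iterable<Integer> values, List<String> out) {
        int max = 0;
        for (int v : values)
            if (v > max) max = v;
        out.add(key + " " + max);
    }
}

final class OverrideDemo {
    /** Raw-typed call, the way a reflective framework would drive the reducer. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List<String> runWith(MiniReducer reducer) {
        List<String> out = new ArrayList<>();
        reducer.run("A", Arrays.asList(1, 5), out);
        return out;
    }
}
```

`runWith(new MaxOverloaded())` yields both lines, "A 1" and "A 5", because the identity method in the base class runs; `runWith(new MaxOverridden())` yields the single line "A 5".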
  • 26. Joining Forces
      hive> DESCRIBE table1;
      OK
      id      int
      phone   string
      state   string
      Time taken: 0.236 seconds
      hive> DESCRIBE table2;
      OK
      id      int
      city    string
      state   string
      Time taken: 0.116 seconds
      hive> CREATE TABLE table3 AS
            SELECT table2.*, table1.phone, table1.state AS s
            FROM table1 JOIN table2 ON (table1.id == table2.id);
      …
      hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
      …
      hive> exit
      $ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l
  • 27. Joining Forces
      The data:
        hive> SELECT * FROM table1;
        OK
        1    6506506500    CA
        2    2282282280    MS
        Time taken: 1.006 seconds
        hive> SELECT * FROM table2;
        OK
        1    Palo Alto    CA
        2    Gautier      MS
        Time taken: 1.202 seconds
      The results:
        a) 5
        b) 4
        c) 1
        d) The join fails
  • 30. Joining Forces (Answer)
      The data:
        hive> SELECT * FROM table1;
        OK
        1    6506506500    CA
        2    2282282280    MS
        Time taken: 1.006 seconds
        hive> SELECT * FROM table2;
        OK
        1    Palo Alto    CA
        2    Gautier      MS
        Time taken: 1.202 seconds
      The results:
        a) 5
        b) 4
        c) 1  (correct)
        d) The join fails
  • 31. Joining Forces (Problem)
      hive> DESCRIBE table1;
      OK
      id      int
      phone   string
      state   string
      Time taken: 0.236 seconds
      hive> DESCRIBE table2;
      OK
      id      int
      city    string
      state   string
      Time taken: 0.116 seconds
      hive> CREATE TABLE table3 AS
            SELECT table2.*, table1.phone, table1.state AS s
            FROM table1 JOIN table2 ON (table1.id == table2.id);
      …
      hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv';
      …
      hive> exit
      $ hadoop fs -cat table3.csv | head -1 | tr , '\n' | wc -l   # problem: the fields are separated by 0x01, not commas
  • 32. Joining Forces (Moral)
      • Hive’s default field delimiter is 0x01 (CTRL-A)
      • It’s easy to assume export will use a sane delimiter; it doesn’t
      • Incidentally, Hive’s join rules are pretty sane and work as you’d expect
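The effect of the delimiter on the field count can be checked in a few lines. This sketch assumes the first row of table3 is stored as five fields joined by 0x01, which is how the moral describes Hive's default; `HiveDelimiterDemo` and its methods are illustrative names only.

```java
import java.util.regex.Pattern;

final class HiveDelimiterDemo {
    /** Hive's default field delimiter for text data: 0x01 (Ctrl-A). */
    static final char CTRL_A = '\u0001';

    /** Counts fields by splitting on one delimiter character; -1 keeps trailing empty fields. */
    static int countFields(String line, char delimiter) {
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1).length;
    }

    /** Assumed layout of table3's first row: id, city, state, phone, s. */
    static String sampleRow() {
        return String.join(String.valueOf(CTRL_A),
                "1", "Palo Alto", "CA", "6506506500", "CA");
    }
}
```

Splitting `sampleRow()` on a comma finds one field, which is what the `tr , '\n' | wc -l` pipeline counts; splitting on `CTRL_A` finds all five.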
  • 33. Close Enough
      public class MaxMap
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();
        protected void map(LongWritable key, Text val, Context c) … {
          String[] parts = val.toString().split(",");
          k.set(parts[0]);
          v.set(Integer.parseInt(parts[1]));
          c.write(k, v);
        }
      }
      public class Top20Reduce
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
          float max = 0.0f;
          for (IntWritable v: values)
            if (v.get() > max)
              max = v.get();
          max *= 0.8f;
          for (IntWritable v: values)
            if (v.get() >= max)
              c.write(key, v);
        }
      }
  • 34. Close Enough
      The data:
        A,1
        A,5
        A,4
      The results:
        a) (no output)
        b) A 5
        c) A 5
           A 4
        d) The job fails
  • 37. Close Enough (Answer)
      The data:
        A,1
        A,5
        A,4
      The results:
        a) (no output)  (correct)
        b) A 5
        c) A 5
           A 4
        d) The job fails
  • 38. Close Enough (Problem)
      public class MaxMap
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable();
        protected void map(LongWritable key, Text val, Context c) … {
          String[] parts = val.toString().split(",");
          k.set(parts[0]);
          v.set(Integer.parseInt(parts[1]));
          c.write(k, v);
        }
      }
      public class Top20Reduce
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context c) … {
          float max = 0.0f;
          for (IntWritable v: values)
            if (v.get() > max)
              max = v.get();
          max *= 0.8f;
          for (IntWritable v: values)  // problem: second pass over a single-shot iterable sees no values
            if (v.get() >= max)
              c.write(key, v);
        }
      }
  • 39. Close Enough (Moral)
      • For scalability reasons, the values iterable is single-shot
      • Subsequent iterators iterate over an empty collection
      • Store values (not Writables!) in the first pass
      • Better yet, restructure the logic to avoid storing all values in memory
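A single-shot iterable is easy to imitate, which makes the empty second pass visible outside a cluster. In the sketch below `SingleShot` is a hypothetical stand-in that returns the same underlying iterator from every `iterator()` call, an assumption matching the behavior the moral describes; `bufferThenFilter` follows the "store values in the first pass" advice and therefore still holds every value in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical single-shot Iterable: every iterator() call returns the same iterator. */
final class SingleShot<T> implements Iterable<T> {
    private final Iterator<T> source;
    SingleShot(Iterator<T> source) { this.source = source; }
    @Override public Iterator<T> iterator() { return source; }
}

final class TopFractionDemo {
    /** Same shape as the slide's reducer: the second loop finds the iterator exhausted. */
    static List<Integer> twoPasses(Iterable<Integer> values) {
        float max = 0.0f;
        for (int v : values)
            if (v > max) max = v;
        max *= 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : values)
            if (v >= max) out.add(v);
        return out;
    }

    /** One pass over the input, buffering plain ints, then a pass over the buffer. */
    static List<Integer> bufferThenFilter(Iterable<Integer> values) {
        List<Integer> seen = new ArrayList<>();
        int max = 0;
        for (int v : values) {
            seen.add(v);
            if (v > max) max = v;
        }
        float threshold = max * 0.8f;
        List<Integer> out = new ArrayList<>();
        for (int v : seen)
            if (v >= threshold) out.add(v);
        return out;
    }
}
```

For the values 1, 5, 4, `twoPasses` returns an empty list and `bufferThenFilter` returns 5 and 4.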
  • 40. Overbyte
      public class MinLineMap
          extends Mapper<LongWritable, Text, Text, Text> {
        Text k = new Text();
        protected void map(LongWritable key, Text value, Context c) … {
          String val = value.toString();
          k.set(val.substring(0, 1));
          c.write(k, value);
        }
      }
      public class MinLineReduce
          extends Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<Text> values, Context c) … {
          int min = Integer.MAX_VALUE;
          for (Text v: values)
            if (v.getBytes().length < min)
              min = v.getBytes().length;
          c.write(key, new IntWritable(min));
        }
      }
  • 41. Overbyte
      The data:
        Hadoop
        Spark
        Hive
        Sqoop2
      The results:
        a) H 4
           S 5
        b) H 6
           S 5
        c) H 6
           S 6
        d) The job fails
  • 44. Overbyte (Answer)
      The data:
        Hadoop
        Spark
        Hive
        Sqoop2
      The results:
        a) H 4
           S 5
        b) H 6
           S 5
        c) H 6
           S 6  (correct)
        d) The job fails
  • 45. Overbyte (Problem)
      public class MinLineMap
          extends Mapper<LongWritable, Text, Text, Text> {
        Text k = new Text();
        protected void map(LongWritable key, Text value, Context c) … {
          String val = value.toString();
          k.set(val.substring(0, 1));
          c.write(k, value);
        }
      }
      public class MinLineReduce
          extends Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<Text> values, Context c) … {
          int min = Integer.MAX_VALUE;
          for (Text v: values)
            if (v.getBytes().length < min)  // problem: length of the reused backing array, not of the current value
              min = v.getBytes().length;
          c.write(key, new IntWritable(min));
        }
      }
  • 46. Overbyte (Moral)
      • Writables get reused in loops
      • In addition, Text.getBytes() returns the backing byte array, which is reused from previous values and may be longer than the current value
      • Net result is wrongness
      • Text.getLength() is the correct way to get the length of a Text
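The difference between the backing array's length and the value's length can be seen with a small model. `ReusableText` below is a hypothetical stand-in for `Text`, written on the assumption that the backing array grows when needed and never shrinks; only the method names `getBytes()` and `getLength()` are borrowed from the real class.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for Text: a backing array that grows but never shrinks. */
final class ReusableText {
    private byte[] bytes = new byte[0];
    private int length;

    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < utf8.length)
            bytes = new byte[utf8.length];   // grow only when the new value does not fit
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    /** Backing array; only the first getLength() bytes belong to the current value. */
    byte[] getBytes() { return bytes; }

    /** Number of valid bytes in the current value. */
    int getLength() { return length; }
}

final class MinLengthDemo {
    /** Same shape as the slide's reducer: measures the array, not the value. */
    static int minByArrayLength(String... lines) {
        ReusableText t = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            t.set(line);
            min = Math.min(min, t.getBytes().length);
        }
        return min;
    }

    /** Measures the valid byte count instead. */
    static int minByGetLength(String... lines) {
        ReusableText t = new ReusableText();
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            t.set(line);
            min = Math.min(min, t.getLength());
        }
        return min;
    }
}
```

Fed "Hadoop" and then "Hive", `minByArrayLength` returns 6 and `minByGetLength` returns 4.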
  • 47. What We Learned
      • Beware of reuse of Writables
      • Always use @Override so your compiler can help you
      • Don’t assume you know what a method does because of the name or parameters; read the docs!
      • Sometimes scalability is inconvenient
  • 48. One Closing Note
      • Hadoop is still not easy
      • Being good takes effort and experience
      • Recognizing Hadoop talent can be hard
      • Cloudera is working to make Hadoop talent easier to recognize through certification
        http://cloudera.com/content/cloudera/en/training/certification.html
  • 49. Aaron Myers & Daniel Templeton