Map reduce in Hadoop BIG DATA ANALYTICS
I am Archana R,
Assistant Professor in the
Dept. of CS, SACWC.
I am here because I love
to give presentations.
MapReduce
 MapReduce is a programming model for efficient distributed computing
 It works like a Unix pipeline
 cat input | grep | sort | uniq -c | cat > output
 Input | Map | Shuffle & Sort | Reduce | Output
 Efficiency from
 Streaming through data, reducing seeks
 Pipelining
 A good fit for a lot of applications
 Log processing
 Web index building
MapReduce - Dataflow
MapReduce - Features
 Fine grained Map and Reduce tasks
 Improved load balancing
 Faster recovery from failed tasks
 Automatic re-execution on failure
 In a large cluster, some nodes are always slow or flaky
 Framework re-executes failed tasks
 Locality optimizations
 With large data, bandwidth to data is a problem
 Map-Reduce + HDFS is a very effective solution
 Map-Reduce queries HDFS for locations of input data
 Map tasks are scheduled close to the inputs when possible
Word Count Example
 Mapper
 Input: value: lines of text of input
 Output: key: word, value: 1
 Reducer
 Input: key: word, value: set of counts
 Output: key: word, value: sum
 Launching program
 Defines this job
 Submits job to cluster
Word Count Dataflow
Word Count Mapper
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // map() is an instance method of the Mapper interface, so it cannot be static
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) { // StringTokenizer uses hasMoreTokens(), not hasNext()
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
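The slides show only the Mapper; the shuffle-and-sort and Reducer behavior can be sketched in plain Java without any Hadoop dependency. This is a simulation of the dataflow, not the framework API — the class and method names are invented for illustration:

```java
import java.util.*;

public class WordCountSim {
    // Simulates Map -> Shuffle & Sort -> Reduce for word count.
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) pairs. The TreeMap plays the role of
        // shuffle & sort, grouping all values for a key in sorted key order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the set of counts for each word.
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the quick brown fox", "the lazy dog")));
        // → {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```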
Word Count Example
 Jobs are controlled by configuring JobConfs
 JobConfs are maps from attribute names to string values
 The framework defines attributes to control how the job is executed
 conf.set("mapred.job.name", "MyApp");
 Applications can add arbitrary values to the JobConf
 conf.set("my.string", "foo");
 conf.setInt("my.integer", 12); // set() takes a String value; use setInt() for numbers
 The JobConf is available to all tasks
Putting it all together
 Create a launching program for your application
 The launching program configures:
 The Mapper and Reducer to use
 The output key and value types (input types are inferred from the InputFormat)
 The locations for your input and output
 The launching program then submits the job and typically waits for it to complete
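The steps above can be sketched as a driver class using the classic `org.apache.hadoop.mapred` API that the Mapper slide uses. The class names `WordCount`, `Map`, and `Reduce` are assumptions carried over from this example, not fixed names:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Output key/value types; input types are inferred from the InputFormat
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The Mapper and Reducer to use (Map and Reduce assumed defined elsewhere)
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // The locations for input and output
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to complete
        JobClient.runJob(conf);
    }
}
```

This fragment configures and submits a job, so it only compiles and runs against a Hadoop installation.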
Input and Output Formats
 A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used
 A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used
 These default to TextInputFormat and TextOutputFormat, which process line-based text data
 Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
 These are file-based, but they are not required to be
How many Maps and Reduces
 Maps
 By default, one per HDFS block being processed
 Otherwise, the number of maps can be specified as a hint
 The number of maps can also be controlled by specifying the minimum split size
 The actual sizes of the map inputs are computed by:
 max(min(block_size, data / #maps), min_split_size)
 Reduces
 Unless the amount of data being processed is small, use:
 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
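The split-size formula above can be made concrete with a small calculation. The block size, data size, and map counts below are illustrative assumptions, not Hadoop defaults:

```java
public class SplitSizeExample {
    // max(min(block_size, data / #maps), min_split_size)
    public static long splitSize(long blockSize, long totalData, long requestedMaps, long minSplitSize) {
        return Math.max(Math.min(blockSize, totalData / requestedMaps), minSplitSize);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;        // 64 MB HDFS block (illustrative)
        long totalData = 10L * 1024 * 1024 * 1024; // 10 GB of input (illustrative)

        // Hint of 1000 maps: 10 GB / 1000 is smaller than a block,
        // so the hint wins and splits shrink below one block.
        System.out.println(splitSize(blockSize, totalData, 1000, 1));

        // Hint of 10 maps: 10 GB / 10 is larger than a block,
        // so splits stay capped at one block each.
        System.out.println(splitSize(blockSize, totalData, 10, 1));
    }
}
```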
Some handy tools
 Partitioners
 Combiners
 Compression
 Counters
 Speculation
 Zero Reduces
 Distributed File Cache
 Tool
Partitioners
 Partitioners are application code that define how keys are assigned to reduces
 Default partitioning spreads keys evenly, but randomly
 Uses key.hashCode() % num_reduces
 Custom partitioning is often required, for example, to produce a total order in the output
 Should implement Partitioner interface
 Set by calling conf.setPartitionerClass(MyPart.class)
 To get a total order, sample the map output keys and pick boundary values that divide the keys into
roughly equal buckets, then use those boundaries in your partitioner
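Both partitioning styles above can be sketched in plain Java. The hash version masks the sign bit because Java's `%` can return a negative result for negative hash codes; the range boundaries ("h", "p") are invented for illustration — a real total-order partitioner would derive them by sampling, as described above:

```java
public class PartitionExample {
    // Default-style hash partitioning: spreads keys evenly, but randomly.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    public static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // A toy total-order partitioner over 3 reduces with assumed boundaries:
    // keys before "h" go to reduce 0, before "p" to reduce 1, the rest to reduce 2.
    public static int rangePartition(String key) {
        if (key.compareTo("h") < 0) return 0;
        if (key.compareTo("p") < 0) return 1;
        return 2;
    }

    public static void main(String[] args) {
        System.out.println(partition("word", 4));
        System.out.println(rangePartition("apple")); // → 0
        System.out.println(rangePartition("mango")); // → 1
        System.out.println(rangePartition("zebra")); // → 2
    }
}
```

With range partitioning, concatenating the sorted output of reduce 0, 1, 2 yields a totally ordered result — which the default hash partitioner cannot guarantee.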
Compression
 Compressing the outputs and intermediate data will often yield huge performance gains
 Can be specified via a configuration file or set programmatically
 Set mapred.output.compress to true to compress job output
 Set mapred.compress.map.output to true to compress map outputs
 Compression Types (mapred(.map)?.output.compression.type)
 “block” - Groups of keys and values are compressed together
 “record” - Each value is compressed individually
 Block compression is almost always best
 Compression Codecs (mapred(.map)?.output.compression.codec)
 Default (zlib) - slower, but more compression
 LZO - faster, but less compression
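Set programmatically, the properties above look like this (a sketch against the old `mapred` API; GzipCodec stands in for the default zlib-based codec):

```java
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressionConfig {
    public static void enableCompression(JobConf conf) {
        // Compress the job's final output, grouping records into blocks
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        // Also compress intermediate map outputs
        conf.setBoolean("mapred.compress.map.output", true);
        // Pick a codec: gzip/zlib here; LZO would trade compression for speed
        conf.setClass("mapred.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);
    }
}
```

This is a configuration fragment and only compiles against the Hadoop libraries.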
Counters
 Often Map/Reduce applications have countable events
 For example, the framework counts records into and out of the Mapper and Reducer
 To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
 Define nice names in a MyClass_Counter.properties file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2