Map reduce in Hadoop BIG DATA ANALYTICS
I am Archana R,
Assistant Professor in the
Dept. of CS, SACWC.
I am here because I love
to give presentations.
MapReduce
 MapReduce is a programming model for efficient distributed computing
 It works like a Unix pipeline
 cat input | grep | sort | uniq -c | cat > output
 Input | Map | Shuffle & Sort | Reduce | Output
 Efficiency from
 Streaming through data, reducing seeks
 Pipelining
 A good fit for a lot of applications
 Log processing
 Web index building
MapReduce - Dataflow
MapReduce - Features
 Fine grained Map and Reduce tasks
 Improved load balancing
 Faster recovery from failed tasks
 Automatic re-execution on failure
 In a large cluster, some nodes are always slow or flaky
 Framework re-executes failed tasks
 Locality optimizations
 With large data, bandwidth to data is a problem
 Map-Reduce + HDFS is a very effective solution
 Map-Reduce queries HDFS for locations of input data
 Map tasks are scheduled close to the inputs when possible
Word Count Example
 Mapper
 Input: value: lines of text of input
 Output: key: word, value: 1
 Reducer
 Input: key: word, value: set of counts
 Output: key: word, value: sum
 Launching program
 Defines this job
 Submits job to cluster
Word Count Dataflow
Word Count Mapper
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // map() is an instance method of the Mapper interface, so it cannot be static
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) { // StringTokenizer uses hasMoreTokens(), not hasNext()
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
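The slides show only the Mapper; the shuffle-and-sort and Reducer behavior can be sketched in plain Java without any Hadoop dependency. This is a simulation of the dataflow, not the framework API — the class and method names are invented for illustration:

```java
import java.util.*;

public class WordCountSim {
    // Simulates Map -> Shuffle & Sort -> Reduce for word count.
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) pairs. The TreeMap plays the role of
        // shuffle & sort, grouping all values for a key in sorted key order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the set of counts for each word.
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("the quick brown fox", "the lazy dog")));
        // → {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```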
Word Count Example
 Jobs are controlled by configuring JobConfs
 JobConfs are maps from attribute names to string values
 The framework defines attributes to control how the job is executed
 conf.set("mapred.job.name", "MyApp");
 Applications can add arbitrary values to the JobConf
 conf.set("my.string", "foo");
 conf.setInt("my.integer", 12); // set() takes a String value; use setInt() for numbers
 The JobConf is available to all tasks
Putting it all together
 Create a launching program for your application
 The launching program configures:
 The Mapper and Reducer to use
 The output key and value types (input types are inferred from the InputFormat)
 The locations for your input and output
 The launching program then submits the job and typically waits for it to complete
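The steps above can be sketched as a driver class using the classic `org.apache.hadoop.mapred` API that the Mapper slide uses. The class names `WordCount`, `Map`, and `Reduce` are assumptions carried over from this example, not fixed names:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Output key/value types; input types are inferred from the InputFormat
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The Mapper and Reducer to use (Map and Reduce assumed defined elsewhere)
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // The locations for input and output
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to complete
        JobClient.runJob(conf);
    }
}
```

This fragment configures and submits a job, so it only compiles and runs against a Hadoop installation.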
Input and Output Formats
 A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used
 A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used
 These default to TextInputFormat and TextOutputFormat, which process line-based text data
 Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
 These are file-based, but they are not required to be
How many Maps and Reduces
 Maps
 By default, one per HDFS block being processed
 Otherwise, the number of maps can be specified as a hint
 The number of maps can also be controlled by specifying the minimum split size
 The actual sizes of the map inputs are computed by:
 max(min(block_size, data / #maps), min_split_size)
 Reduces
 Unless the amount of data being processed is small, use:
 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
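The split-size formula above can be made concrete with a small calculation. The block size, data size, and map counts below are illustrative assumptions, not Hadoop defaults:

```java
public class SplitSizeExample {
    // max(min(block_size, data / #maps), min_split_size)
    public static long splitSize(long blockSize, long totalData, long requestedMaps, long minSplitSize) {
        return Math.max(Math.min(blockSize, totalData / requestedMaps), minSplitSize);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;        // 64 MB HDFS block (illustrative)
        long totalData = 10L * 1024 * 1024 * 1024; // 10 GB of input (illustrative)

        // Hint of 1000 maps: 10 GB / 1000 is smaller than a block,
        // so the hint wins and splits shrink below one block.
        System.out.println(splitSize(blockSize, totalData, 1000, 1));

        // Hint of 10 maps: 10 GB / 10 is larger than a block,
        // so splits stay capped at one block each.
        System.out.println(splitSize(blockSize, totalData, 10, 1));
    }
}
```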
Some handy tools
 Partitioners
 Combiners
 Compression
 Counters
 Speculation
 Zero Reduces
 Distributed File Cache
 Tool
Partitioners
 Partitioners are application code that define how keys are assigned to reduces
 Default partitioning spreads keys evenly, but randomly
 Uses key.hashCode() % num_reduces
 Custom partitioning is often required, for example, to produce a total order in the output
 Should implement Partitioner interface
 Set by calling conf.setPartitionerClass(MyPart.class)
 To get a total order, sample the map output keys and pick boundary values that divide the keys into
roughly equal buckets, then use those boundaries in your partitioner
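Both partitioning styles above can be sketched in plain Java. The hash version masks the sign bit because Java's `%` can return a negative result for negative hash codes; the range boundaries ("h", "p") are invented for illustration — a real total-order partitioner would derive them by sampling, as described above:

```java
public class PartitionExample {
    // Default-style hash partitioning: spreads keys evenly, but randomly.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    public static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // A toy total-order partitioner over 3 reduces with assumed boundaries:
    // keys before "h" go to reduce 0, before "p" to reduce 1, the rest to reduce 2.
    public static int rangePartition(String key) {
        if (key.compareTo("h") < 0) return 0;
        if (key.compareTo("p") < 0) return 1;
        return 2;
    }

    public static void main(String[] args) {
        System.out.println(partition("word", 4));
        System.out.println(rangePartition("apple")); // → 0
        System.out.println(rangePartition("mango")); // → 1
        System.out.println(rangePartition("zebra")); // → 2
    }
}
```

With range partitioning, concatenating the sorted output of reduce 0, 1, 2 yields a totally ordered result — which the default hash partitioner cannot guarantee.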
Compression
 Compressing the outputs and intermediate data will often yield huge performance gains
 Can be specified via a configuration file or set programmatically
 Set mapred.output.compress to true to compress job output
 Set mapred.compress.map.output to true to compress map outputs
 Compression Types (mapred(.map)?.output.compression.type)
 “block” - Groups of keys and values are compressed together
 “record” - Each value is compressed individually
 Block compression is almost always best
 Compression Codecs (mapred(.map)?.output.compression.codec)
 Default (zlib) - slower, but more compression
 LZO - faster, but less compression
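Set programmatically, the properties above look like this (a sketch against the old `mapred` API; GzipCodec stands in for the default zlib-based codec):

```java
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressionConfig {
    public static void enableCompression(JobConf conf) {
        // Compress the job's final output, grouping records into blocks
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.type", "BLOCK");
        // Also compress intermediate map outputs
        conf.setBoolean("mapred.compress.map.output", true);
        // Pick a codec: gzip/zlib here; LZO would trade compression for speed
        conf.setClass("mapred.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);
    }
}
```

This is a configuration fragment and only compiles against the Hadoop libraries.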
Counters
 Often Map/Reduce applications have countable events
 For example, the framework counts records into and out of the Mapper and Reducer
 To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
 Define nice names in a MyClass_Counter.properties file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2