First MR job - Inverted Index construction
Map Reduce - Introduction 
•Parallel Job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides : 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin-based framework for extensibility
Let's think scalability 
•Let’s go through an exercise of scaling a simple program to process a large data set. 
•Problem: count the number of times each word occurs in a set of documents. 
•Example: only one document with only one sentence –“Do as I say, not as I do.” 
•Pseudocode: A multiset is a set where each element also has a count 
define wordCount as Multiset; (assume a hash table) 
for each document in documentSet { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
display(wordCount);
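For concreteness, a minimal single-machine sketch of this pseudocode in Java, with a HashMap playing the role of the multiset (the one-sentence document stands in for the document set; this snippet is illustrative and not from the original slides): 

import java.util.Arrays; 
import java.util.HashMap; 
import java.util.List; 
import java.util.Map; 

public class LocalWordCount { 
    public static void main(String[] args) { 
        // Stand-in "document set": one document, one sentence. 
        List<String> documentSet = Arrays.asList("Do as I say, not as I do."); 
        // The multiset, implemented as a hash table from word to count. 
        Map<String, Integer> wordCount = new HashMap<String, Integer>(); 
        for (String document : documentSet) { 
            for (String token : document.split("\\s+")) {   // naive tokenizer: split on whitespace 
                Integer count = wordCount.get(token); 
                wordCount.put(token, count == null ? 1 : count + 1); 
            } 
        } 
        System.out.println(wordCount); 
    } 
}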
How about a billion documents? 
•Looping through all the documents using a single computer will be extremely time consuming. 
•You speed it up by rewriting the program so that it distributes the work over several machines. 
•Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the result of all the machines. 
define wordCount as Multiset; 
for each document in documentSubset { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
sendToSecondPhase(wordCount); 

define totalWordCount as Multiset; 
for each wordCount received from firstPhase { 
multisetAdd(totalWordCount, wordCount); 
}
Problems 
•Where are documents stored? 
–Having more machines for processing only helps up to a certain point— until the storage server can’t keep up. 
–You’ll also need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored in it. 
•wordCount (and totalWordCount) are stored in memory 
–When processing large document sets, the number of unique words can exceed the RAM storage of a machine. 
–Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two will become the bottleneck. 
•Solution: divide based on expected output! 
–Let’s say we have 26 machines for phase two. We assign each machine to only handle wordCount for words beginning with a particular letter of the alphabet.
Map-Reduce 
•MapReduce programs are executed in two main phases, called 
–mapping and 
–reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
•Map-Reduce Program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operate on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: Takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: Takes all intermediate values associated with the same intermediate key and merges them: reduce(k2, list(v2)) -> list(k3, v3)
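As a concrete trace of those signatures on the earlier one-sentence example (tokens lower-cased and punctuation stripped for illustration; this trace is not from the original slides): 

map("doc1", "Do as I say, not as I do") 
-> (do, 1), (as, 1), (i, 1), (say, 1), (not, 1), (as, 1), (i, 1), (do, 1) 
Shuffle & Sort groups the intermediate pairs by key 
-> (as, [1, 1]), (do, [1, 1]), (i, [1, 1]), (not, [1]), (say, [1]) 
reduce(as, [1, 1]) -> (as, 2), and so on for each key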
Map-Reduce on Hadoop
Putting things in context 
[Diagram: end-to-end data flow. Input files (File 1 … File N) stored in HDFS are divided into splits (Split 1 … Split M); on each machine (Machine 1 … Machine M) a Record Reader turns a split into (key, value) pairs for a map task (Map 1 … Map M), optionally followed by a combiner; a partitioner groups the map output into partitions (Partition 1 … Partition P), which are shuffled to the reduce tasks (Reducer 1 … Reducer R); the reducers write the output files (File 1 … File O) back to HDFS.]
Some MapReduce Terminology 
•Job – A “full program”: an execution of a Mapper and Reducer across a data set 
•Task – An execution of a Mapper or a Reducer on a slice of data 
–a.k.a. Task-In-Progress (TIP) 
•Task Attempt –A particular instance of an attempt to execute a task on a machine
Terminology Example 
•Running “Word Count” across 20 files is one job 
•20 files to be mapped simply means 20 map tasks + some number of reduce tasks 
•At least 20 map task attempts will be performed… more if a machine crashes, speculative execution kicks in, etc.
Task Attempts 
•A particular task will be attempted at least once, possibly more times if it crashes 
–If the same input causes crashes over and over, that input will eventually be abandoned 
•Multiple attempts at one task may occur in parallel with speculative execution turned on 
–Task ID from TaskInProgress is not a unique identifier; don’t use it that way
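Speculative execution can be switched off per job when its extra attempts are unwanted. A minimal sketch, assuming the classic Hadoop 1.x property names (the driver class name is a placeholder; later releases renamed these properties): 

JobConf conf = new JobConf(MyJob.class);   // MyJob: placeholder driver class 
// Hadoop 1.x property names for per-job speculative execution 
conf.setBoolean("mapred.map.tasks.speculative.execution", false); 
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);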
MapReduce: High Level 
[Diagram: a MapReduce job is submitted by a client computer to the JobTracker running on the master node; TaskTrackers on the slave nodes each run the individual task instances.]
Node-to-Node Communication 
•Hadoop uses its own RPC protocol 
•All communication begins in slave nodes 
–Prevents circular-wait deadlock 
–Slaves periodically poll for “status” message 
•Classes must provide explicit serialization
Nodes, Trackers, Tasks 
•Master node runs a JobTracker instance, which accepts Job requests from clients 
•TaskTracker instances run on slave nodes 
•TaskTracker forks a separate Java process for each task instance
Job Distribution 
•MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options 
•Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code 
•… Where’s the data distribution?
Data Distribution 
•Implicit in design of MapReduce! 
–All mappers are equivalent, so each one maps whatever data is local to its node in HDFS 
•If lots of data does happen to pile up on the same node, nearby nodes will map instead 
–Data transfer is handled implicitly by HDFS
Configuring With JobConf 
•MR Programs have many configurable options 
•JobConf objects hold (key, value) settings, mapping a String name to a value 
–e.g., “mapred.map.tasks” → 20 
–JobConf is serialized and distributed before running the job 
•Objects implementing JobConfigurable can retrieve elements from a JobConf
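A minimal sketch of both sides of that contract, assuming an application-defined setting named "myapp.min.token.length" (the class and property names are made up for illustration): 

import java.io.IOException; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.JobConf; 
import org.apache.hadoop.mapred.MapReduceBase; 
import org.apache.hadoop.mapred.Mapper; 
import org.apache.hadoop.mapred.OutputCollector; 
import org.apache.hadoop.mapred.Reporter; 

// MapReduceBase implements JobConfigurable, so configure() receives the 
// deserialized JobConf on the task side before any map() call. 
public class ConfiguredMapper extends MapReduceBase 
        implements Mapper<LongWritable, Text, Text, IntWritable> { 
    private int minLength; 

    public void configure(JobConf job) { 
        minLength = job.getInt("myapp.min.token.length", 1); 
    } 

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException { 
        for (String token : value.toString().split("\\s+")) { 
            if (token.length() >= minLength) { 
                output.collect(new Text(token), new IntWritable(1)); 
            } 
        } 
    } 
} 

Driver side: conf.setInt("myapp.min.token.length", 3);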
Job Launch Process: Client 
•Client program creates a JobConf 
–Identify classes implementing Mapper and Reducer interfaces 
•JobConf.setMapperClass(), setReducerClass() 
–Specify inputs, outputs 
•JobConf.setInputPath(), setOutputPath() 
–Optionally, other options too: 
•JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient 
•Pass JobConf to JobClient.runJob() or submitJob() 
–runJob() blocks, submitJob() does not 
•JobClient: 
–Determines proper division of input into InputSplits 
–Sends job data to master JobTracker server
Job Launch Process: JobTracker 
•JobTracker: 
–Inserts jar and JobConf (serialized to XML) in shared location 
–Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker 
•TaskTrackers running on slave nodes periodically query the JobTracker for work 
•Retrieve job-specific jar and config 
•Launch task in separate instance of Java 
–main() is provided by Hadoop
Job Launch Process: Task 
•TaskTracker.Child.main(): 
–Sets up the child TaskInProgress attempt 
–Reads XML configuration 
–Connects back to necessary MapReduce components via RPC 
–Uses TaskRunner to launch user process
Job Launch Process: TaskRunner 
•TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper 
–Task knows ahead of time which InputSplits it should be mapping 
–Calls Mapper once for each record retrieved from the InputSplit 
•Running the Reducer is much the same
Creating the Mapper 
•You provide the instance of Mapper 
–Should extend MapReduceBase 
•One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress 
–Exists in a separate process from all other instances of Mapper – no data sharing!
Mapper 
•void map(WritableComparable key, 
Writable value, 
OutputCollector output, 
Reporter reporter)
What is Writable? 
•Hadoop defines its own classes for strings (Text), integers (IntWritable), etc. 
•All values are instances of Writable 
•All keys are instances of WritableComparable
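When the built-in types are not enough, a custom value only needs Writable's two methods; a minimal sketch (the class name and fields are illustrative, not from the slides): 

import java.io.DataInput; 
import java.io.DataOutput; 
import java.io.IOException; 
import org.apache.hadoop.io.Writable; 

// A custom value holding a count and a total length, serialized field by field. 
public class WordStats implements Writable { 
    private int count; 
    private long totalLength; 

    public WordStats() { }                      // Hadoop requires a no-argument constructor 
    public WordStats(int count, long totalLength) { 
        this.count = count; 
        this.totalLength = totalLength; 
    } 

    public void write(DataOutput out) throws IOException { 
        out.writeInt(count); 
        out.writeLong(totalLength); 
    } 

    public void readFields(DataInput in) throws IOException { 
        count = in.readInt(); 
        totalLength = in.readLong(); 
    } 
} 

A key type would additionally implement WritableComparable by adding a compareTo() method.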
Getting Data To The Mapper 
[Diagram: an InputFormat breaks each input file into InputSplits; one RecordReader per split turns its bytes into records for a Mapper, and each Mapper produces intermediate output.]
Reading Data 
•Data sets are specified by InputFormats 
–Defines input data (e.g., a directory) 
–Identifies partitions of the data that form an InputSplit 
–Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends 
•TextInputFormat – Treats each newline-terminated line of a file as a value 
•KeyValueTextInputFormat – Maps newline-terminated text lines of “k SEP v” 
•SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata 
•SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs 
•FileInputFormat will read all files out of a specified directory and send them to the mapper 
•Delegates filtering of this file list to a method subclasses may override 
–e.g., create your own “xyzFileInputFormat” to read *.xyz from the directory list
Record Readers 
•Each InputFormat provides its own RecordReader implementation 
–Provides capability multiplexing 
•LineRecordReader – Reads a line from a text file 
•KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size 
•FileInputFormat will divide large files into chunks 
–Exact size controlled by mapred.min.split.size 
•RecordReaders receive file, offset, and length of chunk 
•Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
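A sketch of that “NeverChunkFile” idea with the old mapred API: subclass an input format and override isSplitable() so every file becomes exactly one split (the class name is illustrative): 

import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.mapred.TextInputFormat; 

// Hands each input file to a single mapper, however large the file is. 
public class NeverChunkFileInputFormat extends TextInputFormat { 
    protected boolean isSplitable(FileSystem fs, Path file) { 
        return false;   // never split, so one file == one InputSplit 
    } 
} 

Driver side (standard old-API call): conf.setInputFormat(NeverChunkFileInputFormat.class);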
Sending Data To Reducers 
•Map function receives an OutputCollector object 
–OutputCollector.collect() takes (k, v) elements 
•Any (WritableComparable, Writable) pair can be used
WritableComparator 
•Compares WritableComparable data 
–Will call WritableComparable.compare() 
–Can provide fast path for serialized data 
•JobConf.setOutputValueGroupingComparator()
Sending Data To The Client 
•Reporter object sent to the Mapper allows simple asynchronous feedback 
–incrCounter(Enum key, long amount) 
–setStatus(String msg) 
•Allows self-identification of input 
–InputSplit getInputSplit()
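A short sketch of the counter and status idiom inside a mapper (the class and counter names are made up for illustration): 

import java.io.IOException; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.MapReduceBase; 
import org.apache.hadoop.mapred.Mapper; 
import org.apache.hadoop.mapred.OutputCollector; 
import org.apache.hadoop.mapred.Reporter; 

// A word-count style mapper that reports counters and a status string as it runs. 
public class CountingMapper extends MapReduceBase 
        implements Mapper<LongWritable, Text, Text, IntWritable> { 

    public enum RecordCounters { EMPTY_LINES, TOKENS }   // illustrative counter names 

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException { 
        String line = value.toString(); 
        if (line.trim().isEmpty()) { 
            reporter.incrCounter(RecordCounters.EMPTY_LINES, 1); 
            return; 
        } 
        for (String token : line.split("\\s+")) { 
            reporter.incrCounter(RecordCounters.TOKENS, 1); 
            output.collect(new Text(token), new IntWritable(1)); 
        } 
        reporter.setStatus("processed line at byte offset " + key.get()); 
    } 
}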
Partition And Shuffle 
[Diagram: the intermediate output of each Mapper passes through a Partitioner; during shuffling, the pieces of each partition are collected from all of the mappers and delivered to the Reducer responsible for that partition.]
Partitioner 
•int getPartition(key, val, numPartitions) 
–Outputs the partition number for a given key 
–One partition == values sent to one Reduce task 
•HashPartitioner used by default 
–Uses key.hashCode() to return partition num 
•JobConf sets the Partitioner implementation
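This is the hook for the earlier “26 machines, one letter each” idea: a custom partitioner can route each word by its first letter. A minimal sketch with the old mapred API (the class name and bucketing scheme are illustrative): 

import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.JobConf; 
import org.apache.hadoop.mapred.Partitioner; 

// Routes each word to a reduce task chosen by its first letter. 
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> { 
    public void configure(JobConf job) { }   // nothing to configure 

    public int getPartition(Text key, IntWritable value, int numPartitions) { 
        String word = key.toString(); 
        char c = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0)); 
        int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 0;   // non-letters share bucket 0; skew is ignored 
        return bucket % numPartitions; 
    } 
} 

Driver side: conf.setPartitionerClass(FirstLetterPartitioner.class);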
Reduction 
•reduce(WritableComparable key, 
Iterator values, 
OutputCollector output, 
Reporter reporter) 
•Keys & values sent to one partition all go to the same reduce task 
•Calls are sorted by key –“earlier” keys are reduced and output before “later” keys
Finally: Writing The Output 
[Diagram: each Reducer writes its results through a RecordWriter, provided by the OutputFormat, into its own output file.]
WordCount M/R 
map(String filename, String document) 
{ 
List<String> T = tokenize(document); 
for each token in T { 
emit ((String)token, (Integer) 1); 
} 
} 
reduce(String token, List<Integer> values) 
{ 
Integer sum = 0; 
for each value in values { 
sum = sum + value; 
} 
emit ((String)token, (Integer) sum); 
}
Word Count: Java Mapper 
public static class MapClass extends MapReduceBase 
    implements Mapper<LongWritable, Text, Text, IntWritable> { 
  public void map(LongWritable key, Text value, 
                  OutputCollector<Text, IntWritable> output, 
                  Reporter reporter) throws IOException { 
    String line = value.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      Text word = new Text(itr.nextToken()); 
      output.collect(word, new IntWritable(1)); 
    } 
  } 
}
Word Count: Java Reduce 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, IntWritable, Text, IntWritable> { 
  public void reduce(Text key, 
                     Iterator<IntWritable> values, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException { 
    int sum = 0; 
    while (values.hasNext()) { 
      sum += values.next().get(); 
    } 
    output.collect(key, new IntWritable(sum)); 
  } 
}
Word Count: Java Driver 
public void run(String inPath, String outPath) 
    throws Exception { 
  JobConf conf = new JobConf(WordCount.class); 
  conf.setJobName("wordcount"); 
  conf.setMapperClass(MapClass.class); 
  conf.setReducerClass(Reduce.class); 
  FileInputFormat.addInputPath(conf, new Path(inPath)); 
  FileOutputFormat.setOutputPath(conf, new Path(outPath)); 
  JobClient.runJob(conf); 
}
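The driver above leaves a couple of settings implicit. In practice an old-API word count usually also declares the output key/value classes and, because summing is associative, reuses the reducer as a combiner; a hedged addition to the same run() method: 

conf.setOutputKeyClass(Text.class);           // type of keys emitted by map and reduce 
conf.setOutputValueClass(IntWritable.class);  // type of values emitted by map and reduce 
conf.setCombinerClass(Reduce.class);          // optional: pre-aggregate counts on the map side 
conf.setNumReduceTasks(2);                    // illustrative; pick to match the cluster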
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs 
•The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. 
•job_201206111011_0002 : 
–is the second (0002, job IDs are 1-based) job run by the jobtracker 
–which started at 10:11 on June 11, 2012 
•Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. 
•task_201206111011_0002_m_000003: 
–is the fourth (000003, task IDs are 0-based) 
–map (m) task of the job with ID job_201206111011_0002. 
–The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order that the tasks will be executed in. 
•Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. 
•attempt_201206111011_0002_m_000003_0: 
–is the first (0, attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise -description 
•The objectives for this exercise are: 
–Become familiar with decomposing a problem into Map and Reduce stages. 
–Get a sense for how MapReducecan be used in the real world. 
•An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. 
For example, if given the following 2 documents: 
Doc1: Buffalo buffalo buffalo. 
Doc2: Buffalo are mammals. 
we could construct the following inverted file index: 
Buffalo -> Doc1, Doc2 
buffalo -> Doc1 
buffalo. -> Doc1 
are -> Doc2 
mammals. -> Doc2
Exercise -tasks 
•Task -1: (30 min) 
–Write pseudo-code for map and reduce to solve inverted index problem 
–What are your K1, V1, K2, V2, etc.? 
–“Execute” your pseudo-code with the following example and explain what the Shuffle & Sort stage does with the keys and values 
•Task – 2: (30 min) 
–Use the distributed Python/Java code and execute it following the instructions 
•Where were the input and output data stored, and in what format? 
•What K1, V1, K2, V2 data types were used? 
•Task – 3: (45 min) 
•Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"? 
–Re-write your pseudo-code to determine (with your algorithm) and remove “noisy” words using the map-reduce framework. 
•Group / individual presentation (45 min)
Example: Inverted Index 
•Input: (filename, text) records 
•Output: list of files containing each word 
•Map: for each word in text.split(): output(word, filename) 
•Combine: unique filenames for each word 
•Reduce: def reduce(word, filenames): output(word, sort(filenames)) 
Inverted Index 
[Worked example with two input documents: hamlet.txt (“to be or not to be”) and 12th.txt (“be not afraid of greatness”). 
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). 
Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt).]
A better example 
•Billions of crawled pages and links 
•Generate an index of words linking to the web URLs in which they occur. 
–Input is split into url->pages (lines of pages) 
–Map looks for words in lines of page and puts out word -> link pairs 
–Group (k, v) pairs to generate word -> {list of links} 
–Reduce puts out pairs to output
Search Reverse Index 
public static class MapClass extends MapReduceBase 
    implements Mapper<Text, Text, Text, Text> { 
  private Text word = new Text(); 
  public void map(Text url, Text pageText, 
                  OutputCollector<Text, Text> output, 
                  Reporter reporter) throws IOException { 
    String line = pageText.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      // ignore unwanted and redundant words 
      word.set(itr.nextToken()); 
      output.collect(word, url); 
    } 
  } 
}
Search Reverse Index 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, Text, Text, Text> { 
  public void reduce(Text word, Iterator<Text> urls, 
                     OutputCollector<Text, Text> output, 
                     Reporter reporter) throws IOException { 
    // An Iterator is not Writable, so collect the URLs into a single Text value. 
    StringBuilder links = new StringBuilder(); 
    while (urls.hasNext()) { 
      if (links.length() > 0) links.append(", "); 
      links.append(urls.next().toString()); 
    } 
    output.collect(word, new Text(links.toString())); 
  } 
}
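For completeness, a hedged driver sketch for this job. It assumes the crawled pages arrive as “url <TAB> page text” lines, so KeyValueTextInputFormat delivers (Text url, Text pageText) records to the mapper; the enclosing class name is made up: 

public void run(String inPath, String outPath) throws Exception { 
  JobConf conf = new JobConf(SearchReverseIndex.class);   // SearchReverseIndex: assumed enclosing class 
  conf.setJobName("inverted-index"); 
  conf.setInputFormat(KeyValueTextInputFormat.class);     // "url \t page text" lines -> (Text, Text) records 
  conf.setMapperClass(MapClass.class); 
  conf.setReducerClass(Reduce.class); 
  conf.setOutputKeyClass(Text.class); 
  conf.setOutputValueClass(Text.class); 
  FileInputFormat.addInputPath(conf, new Path(inPath)); 
  FileOutputFormat.setOutputPath(conf, new Path(outPath)); 
  JobClient.runJob(conf); 
}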
End of session 
Day 1: First MR job - Inverted Index construction
