First MR job - Inverted Index construction
Map Reduce - Introduction 
•Parallel Job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides : 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin-based framework for extensibility
Let's think scalability 
•Let’s go through an exercise of scaling a simple program to process a large data set. 
•Problem: count the number of times each word occurs in a set of documents. 
•Example: only one document with only one sentence –“Do as I say, not as I do.” 
•Pseudocode: A multiset is a set where each element also has a count 
define wordCount as Multiset; (assume a hash table) 
for each document in documentSet { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
display(wordCount);
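For concreteness, a minimal single-machine sketch of this pseudocode in Java, with a HashMap playing the role of the multiset (the one-sentence document stands in for the document set; this snippet is illustrative and not from the original slides): 

import java.util.Arrays; 
import java.util.HashMap; 
import java.util.List; 
import java.util.Map; 

public class LocalWordCount { 
    public static void main(String[] args) { 
        // Stand-in "document set": one document, one sentence. 
        List<String> documentSet = Arrays.asList("Do as I say, not as I do."); 
        // The multiset, implemented as a hash table from word to count. 
        Map<String, Integer> wordCount = new HashMap<String, Integer>(); 
        for (String document : documentSet) { 
            for (String token : document.split("\\s+")) {   // naive tokenizer: split on whitespace 
                Integer count = wordCount.get(token); 
                wordCount.put(token, count == null ? 1 : count + 1); 
            } 
        } 
        System.out.println(wordCount); 
    } 
}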
How about a billion documents? 
•Looping through all the documents using a single computer will be extremely time consuming. 
•You speed it up by rewriting the program so that it distributes the work over several machines. 
•Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the result of all the machines. 
define wordCount as Multiset; 
for each document in documentSubset { 
T = tokenize(document); 
for each token in T { 
wordCount[token]++; 
} 
} 
sendToSecondPhase(wordCount); 

define totalWordCount as Multiset; 
for each wordCount received from firstPhase { 
multisetAdd(totalWordCount, wordCount); 
}
Problems 
•Where are documents stored? 
–Having more machines for processing only helps up to a certain point— until the storage server can’t keep up. 
–You’ll also need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored in it. 
•wordCount (and totalWordCount) are stored in memory 
–When processing large document sets, the number of unique words can exceed the RAM storage of a machine. 
–Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two will become the bottleneck. 
•Solution: divide based on expected output! 
–Let’s say we have 26 machines for phase two. We assign each machine to only handle wordCount for words beginning with a particular letter of the alphabet.
Map-Reduce 
•MapReduce programs are executed in two main phases, called 
–mapping and 
–reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
•Map-Reduce Program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operate on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: Takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: Takes all intermediate values associated with the same intermediate key and merges them: reduce(k2, list(v2)) -> list(k3, v3)
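As a concrete trace of those signatures on the earlier one-sentence example (tokens lower-cased and punctuation stripped for illustration; this trace is not from the original slides): 

map("doc1", "Do as I say, not as I do") 
-> (do, 1), (as, 1), (i, 1), (say, 1), (not, 1), (as, 1), (i, 1), (do, 1) 
Shuffle & Sort groups the intermediate pairs by key 
-> (as, [1, 1]), (do, [1, 1]), (i, [1, 1]), (not, [1]), (say, [1]) 
reduce(as, [1, 1]) -> (as, 2), and so on for each key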
Map-Reduce on Hadoop
Putting things in context 
[Diagram: end-to-end data flow. Input files (File 1 … File N) stored in HDFS are divided into splits (Split 1 … Split M); on each machine (Machine 1 … Machine M) a Record Reader turns a split into (key, value) pairs for a map task (Map 1 … Map M), optionally followed by a combiner; a partitioner groups the map output into partitions (Partition 1 … Partition P), which are shuffled to the reduce tasks (Reducer 1 … Reducer R); the reducers write the output files (File 1 … File O) back to HDFS.]
Some MapReduce Terminology 
•Job – A “full program”: an execution of a Mapper and Reducer across a data set 
•Task – An execution of a Mapper or a Reducer on a slice of data 
–a.k.a. Task-In-Progress (TIP) 
•Task Attempt –A particular instance of an attempt to execute a task on a machine
Terminology Example 
•Running “Word Count” across 20 files is one job 
•20 files to be mapped simply means 20 map tasks + some number of reduce tasks 
•At least 20 map task attempts will be performed… more if a machine crashes, speculative execution kicks in, etc.
Task Attempts 
•A particular task will be attempted at least once, possibly more times if it crashes 
–If the same input causes crashes over and over, that input will eventually be abandoned 
•Multiple attempts at one task may occur in parallel with speculative execution turned on 
–Task ID from TaskInProgress is not a unique identifier; don’t use it that way
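Speculative execution can be switched off per job when its extra attempts are unwanted. A minimal sketch, assuming the classic Hadoop 1.x property names (the driver class name is a placeholder; later releases renamed these properties): 

JobConf conf = new JobConf(MyJob.class);   // MyJob: placeholder driver class 
// Hadoop 1.x property names for per-job speculative execution 
conf.setBoolean("mapred.map.tasks.speculative.execution", false); 
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);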
MapReduce: High Level 
[Diagram: a MapReduce job is submitted by a client computer to the JobTracker running on the master node; TaskTrackers on the slave nodes each run the individual task instances.]
Node-to-Node Communication 
•Hadoop uses its own RPC protocol 
•All communication begins in slave nodes 
–Prevents circular-wait deadlock 
–Slaves periodically poll for “status” message 
•Classes must provide explicit serialization
Nodes, Trackers, Tasks 
•Master node runs a JobTracker instance, which accepts Job requests from clients 
•TaskTracker instances run on slave nodes 
•TaskTracker forks a separate Java process for each task instance
Job Distribution 
•MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options 
•Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code 
•… Where’s the data distribution?
Data Distribution 
•Implicit in design of MapReduce! 
–All mappers are equivalent, so each one maps whatever data is local to its node in HDFS 
•If lots of data does happen to pile up on the same node, nearby nodes will map instead 
–Data transfer is handled implicitly by HDFS
Configuring With JobConf 
•MR Programs have many configurable options 
•JobConf objects hold (key, value) settings, mapping a String name to a value 
–e.g., “mapred.map.tasks” → 20 
–JobConf is serialized and distributed before running the job 
•Objects implementing JobConfigurable can retrieve elements from a JobConf
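A minimal sketch of both sides of that contract, assuming an application-defined setting named "myapp.min.token.length" (the class and property names are made up for illustration): 

import java.io.IOException; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.JobConf; 
import org.apache.hadoop.mapred.MapReduceBase; 
import org.apache.hadoop.mapred.Mapper; 
import org.apache.hadoop.mapred.OutputCollector; 
import org.apache.hadoop.mapred.Reporter; 

// MapReduceBase implements JobConfigurable, so configure() receives the 
// deserialized JobConf on the task side before any map() call. 
public class ConfiguredMapper extends MapReduceBase 
        implements Mapper<LongWritable, Text, Text, IntWritable> { 
    private int minLength; 

    public void configure(JobConf job) { 
        minLength = job.getInt("myapp.min.token.length", 1); 
    } 

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException { 
        for (String token : value.toString().split("\\s+")) { 
            if (token.length() >= minLength) { 
                output.collect(new Text(token), new IntWritable(1)); 
            } 
        } 
    } 
} 

Driver side: conf.setInt("myapp.min.token.length", 3);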
Job Launch Process: Client 
•Client program creates a JobConf 
–Identify classes implementing Mapper and Reducer interfaces 
•JobConf.setMapperClass(), setReducerClass() 
–Specify inputs, outputs 
•JobConf.setInputPath(), setOutputPath() 
–Optionally, other options too: 
•JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient 
•Pass JobConf to JobClient.runJob() or submitJob() 
–runJob() blocks, submitJob() does not 
•JobClient: 
–Determines proper division of input into InputSplits 
–Sends job data to master JobTracker server
Job Launch Process: JobTracker 
•JobTracker: 
–Inserts jar and JobConf (serialized to XML) in shared location 
–Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker 
•TaskTrackers running on slave nodes periodically query the JobTracker for work 
•Retrieve job-specific jar and config 
•Launch task in separate instance of Java 
–main() is provided by Hadoop
Job Launch Process: Task 
•TaskTracker.Child.main(): 
–Sets up the child TaskInProgress attempt 
–Reads XML configuration 
–Connects back to necessary MapReduce components via RPC 
–Uses TaskRunner to launch user process
Job Launch Process: TaskRunner 
•TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper 
–Task knows ahead of time which InputSplits it should be mapping 
–Calls Mapper once for each record retrieved from the InputSplit 
•Running the Reducer is much the same
Creating the Mapper 
•You provide the instance of Mapper 
–Should extend MapReduceBase 
•One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress 
–Exists in a separate process from all other instances of Mapper – no data sharing!
Mapper 
•void map(WritableComparable key, 
Writable value, 
OutputCollector output, 
Reporter reporter)
What is Writable? 
•Hadoop defines its own classes for strings (Text), integers (IntWritable), etc. 
•All values are instances of Writable 
•All keys are instances of WritableComparable
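When the built-in types are not enough, a custom value only needs Writable's two methods; a minimal sketch (the class name and fields are illustrative, not from the slides): 

import java.io.DataInput; 
import java.io.DataOutput; 
import java.io.IOException; 
import org.apache.hadoop.io.Writable; 

// A custom value holding a count and a total length, serialized field by field. 
public class WordStats implements Writable { 
    private int count; 
    private long totalLength; 

    public WordStats() { }                      // Hadoop requires a no-argument constructor 
    public WordStats(int count, long totalLength) { 
        this.count = count; 
        this.totalLength = totalLength; 
    } 

    public void write(DataOutput out) throws IOException { 
        out.writeInt(count); 
        out.writeLong(totalLength); 
    } 

    public void readFields(DataInput in) throws IOException { 
        count = in.readInt(); 
        totalLength = in.readLong(); 
    } 
} 

A key type would additionally implement WritableComparable by adding a compareTo() method.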
Getting Data To The Mapper 
[Diagram: an InputFormat breaks each input file into InputSplits; one RecordReader per split turns its bytes into records for a Mapper, and each Mapper produces intermediate output.]
Reading Data 
•Data sets are specified by InputFormats 
–Defines input data (e.g., a directory) 
–Identifies partitions of the data that form an InputSplit 
–Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends 
•TextInputFormat – Treats each newline-terminated line of a file as a value 
•KeyValueTextInputFormat – Maps newline-terminated text lines of “k SEP v” 
•SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata 
•SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs 
•FileInputFormat will read all files out of a specified directory and send them to the mapper 
•Delegates filtering of this file list to a method subclasses may override 
–e.g., create your own “xyzFileInputFormat” to read *.xyz from the directory list
Record Readers 
•Each InputFormat provides its own RecordReader implementation 
–Provides capability multiplexing 
•LineRecordReader – Reads a line from a text file 
•KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size 
•FileInputFormat will divide large files into chunks 
–Exact size controlled by mapred.min.split.size 
•RecordReaders receive file, offset, and length of chunk 
•Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
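A sketch of that “NeverChunkFile” idea with the old mapred API: subclass an input format and override isSplitable() so every file becomes exactly one split (the class name is illustrative): 

import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.mapred.TextInputFormat; 

// Hands each input file to a single mapper, however large the file is. 
public class NeverChunkFileInputFormat extends TextInputFormat { 
    protected boolean isSplitable(FileSystem fs, Path file) { 
        return false;   // never split, so one file == one InputSplit 
    } 
} 

Driver side (standard old-API call): conf.setInputFormat(NeverChunkFileInputFormat.class);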
Sending Data To Reducers 
•Map function receives an OutputCollector object 
–OutputCollector.collect() takes (k, v) elements 
•Any (WritableComparable, Writable) pair can be used
WritableComparator 
•Compares WritableComparable data 
–Will call WritableComparable.compare() 
–Can provide fast path for serialized data 
•JobConf.setOutputValueGroupingComparator()
Sending Data To The Client 
•Reporter object sent to the Mapper allows simple asynchronous feedback 
–incrCounter(Enum key, long amount) 
–setStatus(String msg) 
•Allows self-identification of input 
–InputSplit getInputSplit()
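A short sketch of the counter and status idiom inside a mapper (the class and counter names are made up for illustration): 

import java.io.IOException; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.MapReduceBase; 
import org.apache.hadoop.mapred.Mapper; 
import org.apache.hadoop.mapred.OutputCollector; 
import org.apache.hadoop.mapred.Reporter; 

// A word-count style mapper that reports counters and a status string as it runs. 
public class CountingMapper extends MapReduceBase 
        implements Mapper<LongWritable, Text, Text, IntWritable> { 

    public enum RecordCounters { EMPTY_LINES, TOKENS }   // illustrative counter names 

    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException { 
        String line = value.toString(); 
        if (line.trim().isEmpty()) { 
            reporter.incrCounter(RecordCounters.EMPTY_LINES, 1); 
            return; 
        } 
        for (String token : line.split("\\s+")) { 
            reporter.incrCounter(RecordCounters.TOKENS, 1); 
            output.collect(new Text(token), new IntWritable(1)); 
        } 
        reporter.setStatus("processed line at byte offset " + key.get()); 
    } 
}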
Partition And Shuffle 
[Diagram: the intermediate output of each Mapper passes through a Partitioner; during shuffling, the pieces of each partition are collected from all of the mappers and delivered to the Reducer responsible for that partition.]
Partitioner 
•int getPartition(key, val, numPartitions) 
–Outputs the partition number for a given key 
–One partition == values sent to one Reduce task 
•HashPartitioner used by default 
–Uses key.hashCode() to return partition num 
•JobConf sets the Partitioner implementation
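This is the hook for the earlier “26 machines, one letter each” idea: a custom partitioner can route each word by its first letter. A minimal sketch with the old mapred API (the class name and bucketing scheme are illustrative): 

import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.JobConf; 
import org.apache.hadoop.mapred.Partitioner; 

// Routes each word to a reduce task chosen by its first letter. 
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> { 
    public void configure(JobConf job) { }   // nothing to configure 

    public int getPartition(Text key, IntWritable value, int numPartitions) { 
        String word = key.toString(); 
        char c = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0)); 
        int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 0;   // non-letters share bucket 0; skew is ignored 
        return bucket % numPartitions; 
    } 
} 

Driver side: conf.setPartitionerClass(FirstLetterPartitioner.class);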
Reduction 
•reduce(WritableComparable key, 
Iterator values, 
OutputCollector output, 
Reporter reporter) 
•Keys & values sent to one partition all go to the same reduce task 
•Calls are sorted by key –“earlier” keys are reduced and output before “later” keys
Finally: Writing The Output 
[Diagram: each Reducer writes its results through a RecordWriter, provided by the OutputFormat, into its own output file.]
WordCount M/R 
map(String filename, String document) 
{ 
List<String> T = tokenize(document); 
for each token in T { 
emit ((String)token, (Integer) 1); 
} 
} 
reduce(String token, List<Integer> values) 
{ 
Integer sum = 0; 
for each value in values { 
sum = sum + value; 
} 
emit ((String)token, (Integer) sum); 
}
Word Count: Java Mapper 
public static class MapClass extends MapReduceBase 
    implements Mapper<LongWritable, Text, Text, IntWritable> { 
  public void map(LongWritable key, Text value, 
                  OutputCollector<Text, IntWritable> output, 
                  Reporter reporter) throws IOException { 
    String line = value.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      Text word = new Text(itr.nextToken()); 
      output.collect(word, new IntWritable(1)); 
    } 
  } 
}
Word Count: Java Reduce 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, IntWritable, Text, IntWritable> { 
  public void reduce(Text key, 
                     Iterator<IntWritable> values, 
                     OutputCollector<Text, IntWritable> output, 
                     Reporter reporter) throws IOException { 
    int sum = 0; 
    while (values.hasNext()) { 
      sum += values.next().get(); 
    } 
    output.collect(key, new IntWritable(sum)); 
  } 
}
Word Count: Java Driver 
public void run(String inPath, String outPath) 
    throws Exception { 
  JobConf conf = new JobConf(WordCount.class); 
  conf.setJobName("wordcount"); 
  conf.setMapperClass(MapClass.class); 
  conf.setReducerClass(Reduce.class); 
  FileInputFormat.addInputPath(conf, new Path(inPath)); 
  FileOutputFormat.setOutputPath(conf, new Path(outPath)); 
  JobClient.runJob(conf); 
}
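The driver above leaves a couple of settings implicit. In practice an old-API word count usually also declares the output key/value classes and, because summing is associative, reuses the reducer as a combiner; a hedged addition to the same run() method: 

conf.setOutputKeyClass(Text.class);           // type of keys emitted by map and reduce 
conf.setOutputValueClass(IntWritable.class);  // type of values emitted by map and reduce 
conf.setCombinerClass(Reduce.class);          // optional: pre-aggregate counts on the map side 
conf.setNumReduceTasks(2);                    // illustrative; pick to match the cluster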
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs 
•The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. 
•job_201206111011_0002 : 
–is the second (0002, job IDs are 1-based) job run by the jobtracker 
–which started at 10:11 on June 11, 2012 
•Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. 
•task_201206111011_0002_m_000003: 
–is the fourth (000003, task IDs are 0-based) 
–map (m) task of the job with ID job_201206111011_0002. 
–The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order that the tasks will be executed in. 
•Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. 
•attempt_201206111011_0002_m_000003_0: 
–is the first (0, attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise -description 
•The objectives for this exercise are: 
–Become familiar with decomposing a problem into Map and Reduce stages. 
–Get a sense for how MapReducecan be used in the real world. 
•An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. 
For example, if given the following 2 documents: 
Doc1: Buffalo buffalo buffalo. 
Doc2: Buffalo are mammals. 
we could construct the following inverted file index: 
Buffalo -> Doc1, Doc2 
buffalo -> Doc1 
buffalo. -> Doc1 
are -> Doc2 
mammals. -> Doc2
Exercise -tasks 
•Task -1: (30 min) 
–Write pseudo-code for map and reduce to solve inverted index problem 
–What are your K1, V1, K2, V2, etc.? 
–“Execute” your pseudo-code with the following example and explain what the Shuffle & Sort stage does with the keys and values 
•Task – 2: (30 min) 
–Use the distributed Python/Java code and execute it following the instructions 
•Where were the input and output data stored, and in what format? 
•What K1, V1, K2, V2 data types were used? 
•Task – 3: (45 min) 
•Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"? 
–Re-write your pseudo-code to determine (with your algorithm) and remove “noisy” words using the map-reduce framework. 
•Group / individual presentation (45 min)
Example: Inverted Index 
•Input: (filename, text) records 
•Output: list of files containing each word 
•Map: for each word in text.split(): output(word, filename) 
•Combine: unique filenames for each word 
•Reduce: def reduce(word, filenames): output(word, sort(filenames)) 
Inverted Index 
[Worked example with two input documents: hamlet.txt (“to be or not to be”) and 12th.txt (“be not afraid of greatness”). 
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). 
Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt).]
A better example 
•Billions of crawled pages and links 
•Generate an index of words linking to the web URLs in which they occur. 
–Input is split into url->pages (lines of pages) 
–Map looks for words in lines of page and puts out word -> link pairs 
–Group (k, v) pairs to generate word -> {list of links} 
–Reduce puts out pairs to output
Search Reverse Index 
public static class MapClass extends MapReduceBase 
    implements Mapper<Text, Text, Text, Text> { 
  private Text word = new Text(); 
  public void map(Text url, Text pageText, 
                  OutputCollector<Text, Text> output, 
                  Reporter reporter) throws IOException { 
    String line = pageText.toString(); 
    StringTokenizer itr = new StringTokenizer(line); 
    while (itr.hasMoreTokens()) { 
      // ignore unwanted and redundant words 
      word.set(itr.nextToken()); 
      output.collect(word, url); 
    } 
  } 
}
Search Reverse Index 
public static class Reduce extends MapReduceBase 
    implements Reducer<Text, Text, Text, Text> { 
  public void reduce(Text word, Iterator<Text> urls, 
                     OutputCollector<Text, Text> output, 
                     Reporter reporter) throws IOException { 
    // An Iterator is not Writable, so collect the URLs into a single Text value. 
    StringBuilder links = new StringBuilder(); 
    while (urls.hasNext()) { 
      if (links.length() > 0) links.append(", "); 
      links.append(urls.next().toString()); 
    } 
    output.collect(word, new Text(links.toString())); 
  } 
}
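For completeness, a hedged driver sketch for this job. It assumes the crawled pages arrive as “url <TAB> page text” lines, so KeyValueTextInputFormat delivers (Text url, Text pageText) records to the mapper; the enclosing class name is made up: 

public void run(String inPath, String outPath) throws Exception { 
  JobConf conf = new JobConf(SearchReverseIndex.class);   // SearchReverseIndex: assumed enclosing class 
  conf.setJobName("inverted-index"); 
  conf.setInputFormat(KeyValueTextInputFormat.class);     // "url \t page text" lines -> (Text, Text) records 
  conf.setMapperClass(MapClass.class); 
  conf.setReducerClass(Reduce.class); 
  conf.setOutputKeyClass(Text.class); 
  conf.setOutputValueClass(Text.class); 
  FileInputFormat.addInputPath(conf, new Path(inPath)); 
  FileOutputFormat.setOutputPath(conf, new Path(outPath)); 
  JobClient.runJob(conf); 
}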
End of session 
Day 1: First MR job - Inverted Index construction
