6: MapReduce Applications

                            Zubair Nabi

                  zubair.nabi@itu.edu.pk


                          April 18, 2013




Zubair Nabi      6: MapReduce Applications   April 18, 2013   1 / 27
Outline




  1    The Anatomy of a MapReduce Application



  2    MapReduce Design Patterns



  3    Common MapReduce Application Types




  1    The Anatomy of a MapReduce Application

MapReduce job phases

 A MapReduce job can be divided into 4 phases:
     1    Input split: The input dataset is sliced into M splits, one per map task
     2    Map logic: The user-supplied map function is invoked on every record of a split
                In tandem, a sort phase ensures that the map output is locally
                sorted by key
                In addition, the key space is partitioned amongst the reducers
     3    Shuffle: Each reduce task fetches its partition of the map output from
          every map task
     4    Reduce logic: The user-provided reduce function is invoked
                Before the reduce function is applied, the fetched map outputs are
                merged so that each reduce task sees its key/value pairs sorted by key

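The four phases can be simulated in a single process. The following is a minimal sketch and not how any real framework is implemented; the function run_job, the parameters M and R, and the use of Python's built-in hash as the partitioner are all assumptions made for illustration.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn, M=2, R=2):
    """Toy single-process simulation of the four phases (illustrative only)."""
    # 1. Input split: slice the input into M splits, one per map task.
    splits = [records[i::M] for i in range(M)]

    # 2. Map logic: run map_fn, partition the output among R reducers,
    #    and sort each partition locally by key.
    map_outputs = []
    for split in splits:
        partitions = defaultdict(list)
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                partitions[hash(out_key) % R].append((out_key, out_value))
        map_outputs.append({p: sorted(pairs, key=itemgetter(0))
                            for p, pairs in partitions.items()})

    results = {}
    for r in range(R):
        # 3. Shuffle: reduce task r fetches partition r from every map task.
        fetched = [output.get(r, []) for output in map_outputs]
        # 4. Reduce logic: merge the sorted runs, then call reduce_fn per key.
        merged = merge(*fetched, key=itemgetter(0))
        for key, group in groupby(merged, key=itemgetter(0)):
            results[key] = reduce_fn(key, (value for _, value in group))
    return results


# Usage: word count over two input lines keyed by line number.
lines = [(0, "a b a"), (1, "b c")]
counts = run_job(lines,
                 lambda line_no, line: [(word, 1) for word in line.split()],
                 lambda word, ones: sum(ones))
```

The result does not depend on which reducer a key lands on, only on all values of a key reaching the same reducer.
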
Of mappers and reducers

          In the common case, programmers only need to write a map and a
          reduce function
          The user-provided map function is invoked for every record in the
          input file; by default a record is one line (this can be modified), with
          the line number as key and the line contents as value (Hadoop's default
          text input format passes the line's byte offset as key instead)
          The user-provided reduce function is invoked once for each key output
          by the map phase and is passed the associated values as an iterable

Wordcount: High-level view

          Input: A text corpus such as a Wikipedia dump, books from Project
          Gutenberg, etc.
          The map function is invoked once for each line of text
          Map output: Words as keys and 1 as values
          Reduce input: Key/value pairs of words and their values (1s)
          The reduce function is invoked once for each word with a list of 1s
          Reduce output: Words and their final counts

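The data at each stage can be traced on a tiny corpus. This is only a sketch: the three-line corpus is made up, and the grouping that a framework performs during shuffle and merge is done here with a plain dictionary.

```python
from collections import defaultdict

# Hypothetical three-line corpus standing in for a real text dump.
corpus = ["the quick brown fox", "the lazy dog", "the fox"]

# Map output: one (word, 1) pair per word occurrence.
map_output = [(word, 1) for line in corpus for word in line.split()]

# Reduce input: the values of each word collected into a list of 1s.
reduce_input = defaultdict(list)
for word, one in map_output:
    reduce_input[word].append(one)

# Reduce output: each word with its final count.
reduce_output = {word: sum(ones) for word, ones in reduce_input.items()}
```
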
Wordcount: Low-level view

          A new process, called MapRunner, is created for each map task
          MapRunner has a RecordReader instance that is used to read the
          input file
          RecordReader reads the input file in chunks and parses the chunks
          into lines
          MapRunner also has a Mapper instance with a map function,
          WordCountMapper in this case
          For each line parsed by RecordReader, MapRunner calls
          WordCountMapper.map() and passes it the line

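The interplay between RecordReader and MapRunner can be sketched as follows. The class names come from the slides, but the constructor arguments, the chunk size, and the RecordingMapper helper are assumptions made for illustration, not the framework's actual interface.

```python
import io


class RecordReader:
    """Reads a stream in fixed-size chunks and yields (line_number, line)."""

    def __init__(self, stream, chunk_size=8):
        self._stream = stream
        self._chunk_size = chunk_size

    def __iter__(self):
        line_number = 0
        pending = ""  # tail of the previous chunk that did not end in "\n"
        while True:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                break
            # Everything before the last "\n" is complete; keep the rest.
            *lines, pending = (pending + chunk).split("\n")
            for line in lines:
                yield line_number, line
                line_number += 1
        if pending:  # last line without a trailing newline
            yield line_number, pending


class MapRunner:
    """Feeds every record produced by the reader to mapper.map()."""

    def __init__(self, reader, mapper):
        self._reader = reader
        self._mapper = mapper

    def run(self):
        for key, value in self._reader:
            self._mapper.map(key, value)


class RecordingMapper:
    """Hypothetical mapper that only records what it is passed."""

    def __init__(self):
        self.seen = []

    def map(self, key, value):
        self.seen.append((key, value))


mapper = RecordingMapper()
MapRunner(RecordReader(io.StringIO("a b\nc d\ne"), chunk_size=4), mapper).run()
```

Lines that straddle a chunk boundary are reassembled through the pending buffer, so the mapper always sees whole lines regardless of the chunk size.
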
Wordcount: Low-level view (2)

          WordCountMapper has an OutputCollector instance which maintains
          an in-memory buffer for each output partition (one partition per reduce
          task)
          Each time WordCountMapper.map() is invoked, it tokenizes the line
          into words
          For each word, it writes the word as key and 1 as value to
          OutputCollector
          OutputCollector uses the Partitioner instance to select a partition
          buffer for each key
          Whenever the size of a partition buffer exceeds a configurable
          threshold, its contents are first sorted by key and then flushed to disk
          This process is repeated until the map logic has been applied to all lines
          within the input file

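The buffering behaviour described above might look roughly like this. It is a toy model under stated assumptions: the "disk" is an in-memory list of sorted runs, the threshold is counted in pairs rather than bytes, and the byte-sum partition function is a stand-in for a real Partitioner.

```python
from collections import defaultdict


class OutputCollector:
    """Buffers map output per partition and spills sorted runs (toy model)."""

    def __init__(self, num_reducers, spill_threshold=3):
        self._num_reducers = num_reducers
        self._spill_threshold = spill_threshold
        self._buffers = defaultdict(list)  # partition -> in-memory pairs
        self.spills = defaultdict(list)    # partition -> list of sorted runs

    def _partition(self, key):
        # Stand-in for the Partitioner: a deterministic hash of the key.
        return sum(key.encode("utf-8")) % self._num_reducers

    def collect(self, key, value):
        partition = self._partition(key)
        self._buffers[partition].append((key, value))
        if len(self._buffers[partition]) > self._spill_threshold:
            self._spill(partition)

    def _spill(self, partition):
        # Sort by key, then "flush to disk" (here: append to a list).
        run = sorted(self._buffers.pop(partition), key=lambda pair: pair[0])
        self.spills[partition].append(run)

    def close(self):
        # Flush whatever is still buffered once the map task finishes.
        for partition in list(self._buffers):
            if self._buffers[partition]:
                self._spill(partition)


collector = OutputCollector(num_reducers=1, spill_threshold=2)
for word in ["b", "a", "c", "a"]:
    collector.collect(word, 1)
collector.close()
```

Each spill is sorted on its own, which is why the reduce side later has to merge several sorted runs.
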
Wordcount: Low-level view (3)

          Once all maps have completed their execution, the reduce phase is
          started
          For each reduce task, a ReduceRunner process is created
          Each reduce task fetches its input partitions from the machines on
          which the map tasks ran
          All input partitions are then merged to get a globally sorted partition of
          key/value pairs

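Because every fetched partition is already sorted by key, the merge step can be a k-way merge rather than a full sort. A minimal sketch using the standard library's heapq.merge, with made-up data for three map tasks:

```python
from heapq import merge

# Sorted runs fetched from three map tasks for the same reduce partition
# (hypothetical data).
from_map_0 = [("apple", 1), ("cat", 1)]
from_map_1 = [("apple", 1), ("bird", 1), ("cat", 1)]
from_map_2 = [("bird", 1)]

# A k-way merge keeps the result sorted without re-sorting everything.
merged = list(merge(from_map_0, from_map_1, from_map_2,
                    key=lambda pair: pair[0]))
```
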
Wordcount: Low-level view (4)

          ReduceRunner contains a Reducer instance with a reduce function,
          WordCountReducer in this case
          For each word, ReduceRunner invokes WordCountReducer.reduce()
          and passes it the word and a list of its values (1s)
          WordCountReducer also has an OutputCollector instance with an
          in-memory buffer
          WordCountReducer.reduce() sums the list of values it is passed and
          writes the word and its final count to the OutputCollector
          This process is repeated until the reduce logic has been applied to all
          key/value pairs
          At the end of the entire job, each reduce task produces an output file
          with words and their number of occurrences

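On a sorted partition, invoking reduce once per key amounts to grouping adjacent pairs with equal keys. The sketch below assumes a function run_reduce in place of the ReduceRunner process and a plain list in place of the OutputCollector buffer.

```python
from itertools import groupby
from operator import itemgetter


class WordCountReducer:
    def __init__(self):
        self.output = []  # stand-in for the OutputCollector buffer

    def reduce(self, key, values):
        self.output.append((key, sum(values)))


def run_reduce(sorted_pairs, reducer):
    """Call reducer.reduce() once per key of an already sorted partition."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group))


reducer = WordCountReducer()
run_reduce([("apple", 1), ("apple", 1), ("bird", 1), ("cat", 1), ("cat", 1)],
           reducer)
```

groupby only groups adjacent items, so this works precisely because the merge step has already sorted the partition by key.
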
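Wordcount map in Java

// Requires java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.{IntWritable, Text} and org.apache.hadoop.mapreduce.Mapper.
// Assumes the enclosing Mapper declares the fields
//   Text word = new Text();  and  IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());   // reuse one Text object per token
        context.write(word, one);    // emit (word, 1)
    }
}
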
Wordcount reduce in Java

// Sums the 1s emitted for a word and writes (word, total).
// Assumes the enclosing Reducer declares: IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}

Wordcount map in Python

def map(self, key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        self._output_collector.collect(word, 1)

Wordcount reduce in Python

def reduce(self, key, values):
    # Sum the 1s collected for this word and emit (word, total).
    total = 0
    for value in values:
        total += value
    self._output_collector.collect(key, total)

  2    MapReduce Design Patterns

Bird’s-eye view

          The MapReduce paradigm is amenable to divide-and-conquer
          algorithms
          One way to look at MapReduce is that it is just a large-scale sorting
          platform
          User logic is only involved at specific hook points
          Algorithms must be expressed in terms of a small number of specific
          components that fit together in preset ways
                Like putting together a jigsaw puzzle in which all the other pieces have
                already been assembled and you only need to add two pieces: the map
                and the reduce pieces
          Fortunately, a large number of algorithms easily fit this rigid pattern

Programmer control

  The programmer has no control over
     1    The node in the cluster on which a map or reduce task runs
     2    The start and end time of a map or a reduce task
     3    The input key/value pairs processed by a specific map task
     4    The intermediate key/value pairs processed by a specific reduce task

Programmer control (2)

  The programmer does have control over
     1    The data structures to be used as keys and values
     2    Initialization code at the beginning of map/reduce tasks and
          termination code at the end
     3    Preservation of state across multiple invocations of the map/reduce
          function within a task
     4    The sort order of intermediate keys and, in turn, the order in which a
          reducer encounters keys
     5    Partitioning of the key space and, in turn, the set of keys that a
          particular reducer encounters

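Points 4 and 5 can be illustrated with a small sketch. The names first_letter_partitioner and shuffle are hypothetical, and the partitioner assumes that every key starts with an ASCII letter.

```python
def first_letter_partitioner(key, num_reducers):
    """Assign keys to reducers by alphabetical range of their first letter.

    Assumes every key starts with an ASCII letter.
    """
    return (ord(key[0].lower()) - ord("a")) * num_reducers // 26


def shuffle(keys, num_reducers, partitioner, descending=False):
    """Return, per reducer, the keys it would encounter and in which order."""
    partitions = {r: [] for r in range(num_reducers)}
    for key in keys:
        partitions[partitioner(key, num_reducers)].append(key)
    for assigned in partitions.values():
        assigned.sort(reverse=descending)  # the sort order is configurable
    return partitions


words = ["apple", "zebra", "mango", "banana", "yak"]
by_range = shuffle(words, 2, first_letter_partitioner, descending=True)
```

With two reducers, keys starting with a to m go to reducer 0 and the rest to reducer 1, and each reducer encounters its keys in descending order.
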
Multi-job algorithms

          Many algorithms cannot be easily expressed as a single MapReduce
          job
          Complex algorithms need to be decomposed into a sequence of jobs
                The output of one job becomes the input to the next
          Most iterative algorithms need to be run by an external driver
          program that performs the convergence check

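An external driver with a convergence check can be sketched as a loop around a job. The example chosen here, label propagation to find connected components, is an assumption for illustration and does not come from the slides; each call to run_iteration stands in for one complete MapReduce job.

```python
from collections import defaultdict


def run_iteration(labels, edges):
    """One MapReduce-style job: every node adopts the smallest label nearby."""
    # Map: each node sends its label to itself and to its neighbours.
    messages = defaultdict(list)
    for node, label in labels.items():
        messages[node].append(label)
        for neighbour in edges.get(node, []):
            messages[neighbour].append(label)
    # Reduce: keep the minimum label received by each node.
    return {node: min(received) for node, received in messages.items()}


def driver(labels, edges, max_iterations=50):
    """External driver: chain jobs until the output stops changing."""
    for iteration in range(1, max_iterations + 1):
        new_labels = run_iteration(labels, edges)
        if new_labels == labels:  # convergence check
            return labels, iteration
        labels = new_labels       # output of one job feeds the next
    return labels, max_iterations


# Two components: {1, 2, 3} and {4, 5}.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, iterations = driver({node: node for node in edges}, edges)
```

The convergence check lives outside the jobs because no single map or reduce task sees the whole output.
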
Local aggregation

          Network and disk latencies are expensive compared to other
          operations
          Decreasing the amount of data transferred over the network during the
          shuffle phase improves efficiency
          Aggressive use of combiners for commutative and associative
          operations can greatly reduce intermediate data
          Another strategy, dubbed “in-mapper combining”, can decrease not only
          the amount of intermediate data but also the number of key/value pairs
          emitted by the map tasks

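The difference between a plain mapper and in-mapper combining can be sketched as follows. The emit callback and the close() termination hook are assumptions standing in for whatever the framework provides.

```python
from collections import Counter


class PlainMapper:
    def __init__(self, emit):
        self._emit = emit

    def map(self, key, line):
        for word in line.split():
            self._emit(word, 1)  # one pair per occurrence


class InMapperCombiningMapper:
    def __init__(self, emit):
        self._emit = emit
        self._counts = Counter()  # state kept across map() calls

    def map(self, key, line):
        self._counts.update(line.split())  # aggregate locally, emit nothing

    def close(self):
        # Termination hook: emit one pair per distinct word.
        for word, count in self._counts.items():
            self._emit(word, count)


lines = ["a b a", "b a"]
plain_out, combined_out = [], []

plain = PlainMapper(lambda k, v: plain_out.append((k, v)))
combining = InMapperCombiningMapper(lambda k, v: combined_out.append((k, v)))
for number, line in enumerate(lines):
    plain.map(number, line)
    combining.map(number, line)
combining.close()
```

The price of emitting fewer pairs is memory: the mapper has to hold one counter per distinct key until the task ends.
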
  3    Common MapReduce Application Types

Counting and Summing

     1    Problem
                A number of documents, each with a set of terms
                Need to calculate the number of occurrences of each term (word count)
                or some arbitrary function over the terms (average response time in log
                files)
     2    Solution
                Map: For each term, emit the term and “1”
                Reduce: Take the sum (or any other operation) of each term’s values

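The same pattern covers the average response time case. The sketch assumes a made-up log format of "url milliseconds" per line. Since an average of averages is generally wrong, the map side emits (sum, count) pairs and the division happens only once in the reducer.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<url> <response time in ms>".
log_lines = ["/home 120", "/search 300", "/home 80", "/search 500", "/home 100"]


def map_fn(line):
    url, millis = line.split()
    yield url, (int(millis), 1)  # (partial sum, partial count)


def reduce_fn(url, values):
    total = count = 0
    for partial_sum, partial_count in values:
        total += partial_sum
        count += partial_count
    return url, total / count


# Grouping by key, which the framework would do during shuffle and merge.
grouped = defaultdict(list)
for line in log_lines:
    for url, value in map_fn(line):
        grouped[url].append(value)

averages = dict(reduce_fn(url, values) for url, values in grouped.items())
```
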
Collating



     1    Problem
                A set of items (e.g., terms in a number of documents) and some
                function of one item
                Need to group all items that have the same function value, either to
                store them together or to perform some computation over them
     2    Solution
                Map: For each item, compute the given function and emit the function
                value as key and the item as value
                Reduce: Either save all grouped items or perform further computation
                Example: Inverted index: items are words and the function is the
                document ID
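
A minimal sketch of collating, assuming a toy in-memory driver in place of the framework. The grouping function here is word length, chosen only for illustration; `map_collate`, `reduce_collate`, and `collate` are hypothetical names.

```python
from collections import defaultdict

def map_collate(item, func):
    """Map: emit the function value as key and the item as value."""
    yield func(item), item

def reduce_collate(key, items):
    """Reduce: keep the grouped items together (here as a sorted list)."""
    return key, sorted(items)

def collate(items, func):
    """Toy stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for item in items:
        for key, value in map_collate(item, func):
            groups[key].append(value)
    return dict(reduce_collate(key, values) for key, values in groups.items())

# Group words by their length (the "function of one item" in this example).
by_length = collate(["map", "reduce", "sort", "key", "split"], len)
print(by_length)
```

The reducer could equally perform further computation over each group, such as counting its members, instead of just storing them.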




Filtering, Parsing, and Validation



     1    Problem
                A set of records
                Need to collect all records that meet some condition or transform each
                record into another representation
     2    Solution
                Map: For each record, emit it if it passes the condition or emit its
                transformed version
                Reduce: Identity
                Example: Text parsing or transformation such as word capitalization
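
Both flavours of this pattern, filtering and transformation, can be sketched with an identity reduce. The driver and all function names below are assumptions for illustration; the transformation used is the word capitalization mentioned above.

```python
def map_filter(record, condition):
    """Map (filtering): emit the record only if it meets the condition."""
    if condition(record):
        yield record

def map_transform(record, transform):
    """Map (parsing/transformation): emit the transformed record."""
    yield transform(record)

def reduce_identity(records):
    """Reduce: identity, every record passes through unchanged."""
    yield from records

def run_job(records, mapper, *args):
    """Toy stand-in for the framework: apply the mapper, then the identity reduce."""
    mapped = (out for record in records for out in mapper(record, *args))
    return list(reduce_identity(mapped))

words = ["map", "reduce", "", "shuffle"]
non_empty = run_job(words, map_filter, bool)                 # drop empty records
capitalized = run_job(words, map_transform, str.capitalize)  # capitalize each word
print(non_empty, capitalized)
```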




Distributed Task Execution




     1    Problem
                Large computational problem
                Need to divide it into multiple parts and combine results from all parts to
                obtain a final result
     2    Solution
                Map: Perform the corresponding computation on one part
                Reduce: Combine all emitted results into a final one
                Example: RGB histogram calculation of bitmap images
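
The RGB histogram example could look roughly like this. It is a sketch under simplifying assumptions: each image is represented as a plain list of `(r, g, b)` tuples rather than a real bitmap, and the function names are hypothetical.

```python
from collections import Counter

def map_histogram(pixels):
    """Map: compute a partial histogram for one image (one part of the problem)."""
    partial = Counter()
    for r, g, b in pixels:
        partial[("R", r)] += 1
        partial[("G", g)] += 1
        partial[("B", b)] += 1
    return partial

def reduce_histograms(partials):
    """Reduce: combine all partial histograms into the final one."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total

# Two tiny hypothetical "images": two red pixels and one yellow pixel.
images = [
    [(255, 0, 0), (255, 0, 0)],
    [(255, 255, 0)],
]
histogram = reduce_histograms(map_histogram(image) for image in images)
print(histogram[("R", 255)], histogram[("G", 0)], histogram[("B", 0)])
```

Each call to `map_histogram` is independent of the others, which is what allows the parts to be processed on different nodes.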




Sorting




     1    Problem
                A set of records
                Need to sort records in some order
     2    Solution
                Map: Identity
                Reduce: Identity
                Also possible to sort by value: either perform a secondary sort or
                perform a value-to-key conversion
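
Because the framework sorts intermediate keys, identity map and reduce functions are enough to sort by key. The sketch below imitates that with an in-memory sort, and then sorts by value by moving the value into a composite key, as in the conversion mentioned above. The driver `run_sort_job` is an assumption standing in for the framework's shuffle and sort.

```python
from itertools import groupby

def identity_map(key, value):
    """Map: identity."""
    yield key, value

def identity_reduce(key, values):
    """Reduce: identity, re-emit every value under its key."""
    for value in values:
        yield key, value

def run_sort_job(pairs):
    """Toy stand-in for the framework: map, sort by key, group, reduce."""
    intermediate = [kv for k, v in pairs for kv in identity_map(k, v)]
    intermediate.sort(key=lambda kv: kv[0])  # the framework's sort by key
    output = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(identity_reduce(key, (v for _, v in group)))
    return output

pairs = [("b", 2), ("a", 9), ("c", 1), ("a", 3)]
by_key = run_sort_job(pairs)

# Sort by value: make (value, key) the composite key so the sort sees the value.
converted = [((v, k), None) for k, v in pairs]
by_value = [(k, v) for (v, k), _ in run_sort_job(converted)]
print(by_key, by_value)
```

On a real cluster with more than one reducer, a total order additionally needs a partitioner that assigns contiguous key ranges to reducers.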




References




     1    Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with
          MapReduce. Morgan and Claypool Publishers.
     2    MapReduce Patterns, Algorithms, and Use Cases:
          http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/





Topic 6: MapReduce Applications

  • 1. 6: MapReduce Applications Zubair Nabi zubair.nabi@itu.edu.pk April 18, 2013 Zubair Nabi 6: MapReduce Applications April 18, 2013 1 / 27
  • 2. Outline 1 The Anatomy of a MapReduce Application 2 MapReduce Design Patterns 3 Common MapReduce Application Types Zubair Nabi 6: MapReduce Applications April 18, 2013 2 / 27
  • 3. Outline 1 The Anatomy of a MapReduce Application 2 MapReduce Design Patterns 3 Common MapReduce Application Types Zubair Nabi 6: MapReduce Applications April 18, 2013 3 / 27
  • 4. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 5. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 6. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked In tandem a sort phase is also applied that ensures that map output is locally sorted by key Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 7. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked In tandem a sort phase is also applied that ensures that map output is locally sorted by key In addition, the key space is also partitioned amongst the reducers Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 8. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked In tandem a sort phase is also applied that ensures that map output is locally sorted by key In addition, the key space is also partitioned amongst the reducers 3 Shuffle: Map output is relayed to all reduce tasks Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 9. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked In tandem a sort phase is also applied that ensures that map output is locally sorted by key In addition, the key space is also partitioned amongst the reducers 3 Shuffle: Map output is relayed to all reduce tasks 4 Reduce logic: The user-provided reduce function is invoked Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 10. MapReduce job phases A MapReduce job can be divided into 4 phases: 1 Input split: The input dataset is sliced into M splits, one per map task 2 Map logic: The user-supplied map function is invoked In tandem a sort phase is also applied that ensures that map output is locally sorted by key In addition, the key space is also partitioned amongst the reducers 3 Shuffle: Map output is relayed to all reduce tasks 4 Reduce logic: The user-provided reduce function is invoked Before the application of the reduce function, the input keys are merged to get globally sorted key/value pairs Zubair Nabi 6: MapReduce Applications April 18, 2013 4 / 27
  • 11. Of mappers and reducers In the common case, programmers only need to write a map and a reduce function Zubair Nabi 6: MapReduce Applications April 18, 2013 5 / 27
  • 12. Of mappers and reducers In the common case, programmers only need to write a map and a reduce function The user-provided map function is invoked for every line (can be modified) in the input file and is passed the line number as key and line contents as value Zubair Nabi 6: MapReduce Applications April 18, 2013 5 / 27
  • 13. Of mappers and reducers In the common case, programmers only need to write a map and a reduce function The user-provided map function is invoked for every line (can be modified) in the input file and is passed the line number as key and line contents as value The user-provided reduce function is invoked for each key output by the map phase and is passed the set of associated values as iterable values Zubair Nabi 6: MapReduce Applications April 18, 2013 5 / 27
  • 14. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 15. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. The map function is invoked once for each text line Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 16. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. The map function is invoked once for each text line Map output: Words as keys and 1 as values Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 17. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. The map function is invoked once for each text line Map output: Words as keys and 1 as values Reduce input: Key/value pairs of words and values (1) Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 18. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. The map function is invoked once for each text line Map output: Words as keys and 1 as values Reduce input: Key/value pairs of words and values (1) The reduce function is invoked once for each word with a list of 1s Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 19. Wordcount: High-level view Input: A text corpus such as Wikipedia dump, books from Gutenberg, etc. The map function is invoked once for each text line Map output: Words as keys and 1 as values Reduce input: Key/value pairs of words and values (1) The reduce function is invoked once for each word with a list of 1s Reduce output: Words and their final counts Zubair Nabi 6: MapReduce Applications April 18, 2013 6 / 27
  • 20. Wordcount: Low-level view A new process is created for each map, called MapRunner Zubair Nabi 6: MapReduce Applications April 18, 2013 7 / 27
  • 21. Wordcount: Low-level view A new process is created for each map, called MapRunner MapRunner has a RecordReader instance that is used to read the input file Zubair Nabi 6: MapReduce Applications April 18, 2013 7 / 27
  • 22. Wordcount: Low-level view A new process is created for each map, called MapRunner MapRunner has a RecordReader instance that is used to read the input file RecordReader reads the input file in chunks and parses the chunks into lines Zubair Nabi 6: MapReduce Applications April 18, 2013 7 / 27
  • 23. Wordcount: Low-level view A new process is created for each map, called MapRunner MapRunner has a RecordReader instance that is used to read the input file RecordReader reads the input file in chunks and parses the chunks into lines MapRunner also has a Mapper instance with a map function, WordCountMapper in this case Zubair Nabi 6: MapReduce Applications April 18, 2013 7 / 27
  • 24. Wordcount: Low-level view A new process is created for each map, called MapRunner MapRunner has a RecordReader instance that is used to read the input file RecordReader reads the input file in chunks and parses the chunks into lines MapRunner also has a Mapper instance with a map function, WordCountMapper in this case For each line parse by RecordReader, MapRunner calls WordCountMapper.map() and passes it the line Zubair Nabi 6: MapReduce Applications April 18, 2013 7 / 27
  • 25. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 26. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Each time WordCountMapper.map() is invoked it, it tokenizes the line into words Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 27. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Each time WordCountMapper.map() is invoked it, it tokenizes the line into words For each word, it writes the word as key and 1 as value to OutputCollector Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 28. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Each time WordCountMapper.map() is invoked it, it tokenizes the line into words For each word, it writes the word as key and 1 as value to OutputCollector OutputCollector uses the Partitioner instance to select a partition buffer for each key Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 29. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Each time WordCountMapper.map() is invoked it, it tokenizes the line into words For each word, it writes the word as key and 1 as value to OutputCollector OutputCollector uses the Partitioner instance to select a partition buffer for each key Whenever the size of a partition buffer exceeds a configurable threshold, its contents are first sorted by key and then flushed to disk Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 30. Wordcount: Low-level view (2) WordCountMapper has an OutputCollector instance which maintains an in-memory buffer for each output partition (one partition per reduce) Each time WordCountMapper.map() is invoked it, it tokenizes the line into words For each word, it writes the word as key and 1 as value to OutputCollector OutputCollector uses the Partitioner instance to select a partition buffer for each key Whenever the size of a partition buffer exceeds a configurable threshold, its contents are first sorted by key and then flushed to disk This process is repeated till the map logic has been applied to all lines within the input file Zubair Nabi 6: MapReduce Applications April 18, 2013 8 / 27
  • 31. Wordcount: Low-level view (3) Once all maps have completed their execution, the reduce phase is started Zubair Nabi 6: MapReduce Applications April 18, 2013 9 / 27
  • 32. Wordcount: Low-level view (3) Once all maps have completed their execution, the reduce phase is started For each reduce task, a ReduceRunner process is created Zubair Nabi 6: MapReduce Applications April 18, 2013 9 / 27
  • 33. Wordcount: Low-level view (3) Once all maps have completed their execution, the reduce phase is started For each reduce task, a ReduceRunner process is created Each reduce task fetches its input partitions from machines on which map tasks were run Zubair Nabi 6: MapReduce Applications April 18, 2013 9 / 27
  • 34. Wordcount: Low-level view (3) Once all maps have completed their execution, the reduce phase is started For each reduce task, a ReduceRunner process is created Each reduce task fetches its input partitions from machines on which map tasks were run All input partitions are then merged to get a globally sorted partition of key/value pairs Zubair Nabi 6: MapReduce Applications April 18, 2013 9 / 27
  • 35. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 36. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s) Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 37. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s) WordCountReducer also has an OutputCollector instance with an in-memory buffer Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 38. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s) WordCountReducer also has an OutputCollector instance with an in-memory buffer WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 39. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s) WordCountReducer also has an OutputCollector instance with an in-memory buffer WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector This process is repeated till the reduce logic has been applied key/value pairs Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 40. Wordcount: Low-level view (4) ReduceRunner contains a Reducer instance with a reduce function, WordCountReducer in this case For each word, ReduceRunner invokes WordCountReducer.reduce() and passes it the word and a list of its values (1s) WordCountReducer also has an OutputCollector instance with an in-memory buffer WordCountReducer.reduce() sums the list of values it is passed and writes the word and its final count to the OutputCollector This process is repeated till the reduce logic has been applied key/value pairs At the end of the entire job, each reduce produces an output file with words and their number of occurrences Zubair Nabi 6: MapReduce Applications April 18, 2013 10 / 27
  • 41. Wordcount map in Java 1 public void map( Object key , Text value , Context context ) { 2 StringTokenizer itr = new StringTokenizer ( value . toString ()); 3 while (itr. hasMoreTokens ()) { 4 word.set(itr. nextToken ()); 5 context .write (word , one ); 6 } 7 } Zubair Nabi 6: MapReduce Applications April 18, 2013 11 / 27
  • 42. Wordcount reduce in Java 1 public void reduce (Text key , Iterable < IntWritable > values , 2 Context context ) { 3 int sum = 0; 4 for ( IntWritable val : values ) { 5 sum += val.get (); 6 } 7 result .set(sum ); 8 context .write(key , result ); 9 } Zubair Nabi 6: MapReduce Applications April 18, 2013 12 / 27
  • 43. Wordcount map in Python 1 def map(self , key , value ): 2 [self. _output_collector . collect (word , 1) for word in value . split (’ ’)] Zubair Nabi 6: MapReduce Applications April 18, 2013 13 / 27
  • 44. Wordcount reduce in Python 1 def reduce (self , key , values ): 2 sum__ = 0 3 for value in values : 4 sum__ += value 5 self. _output_collector . collect (key , sum__ ) Zubair Nabi 6: MapReduce Applications April 18, 2013 14 / 27
  • 45. Outline 1 The Anatomy of a MapReduce Application 2 MapReduce Design Patterns 3 Common MapReduce Application Types Zubair Nabi 6: MapReduce Applications April 18, 2013 15 / 27
  • 46. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 47. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms One way to look at MapReduce is that it is just a large-scale sorting platform Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 48. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms One way to look at MapReduce is that it is just a large-scale sorting platform User-logic is only involved at specific hook points Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 49. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms One way to look at MapReduce is that it is just a large-scale sorting platform User-logic is only involved at specific hook points Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 50. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms One way to look at MapReduce is that it is just a large-scale sorting platform User-logic is only involved at specific hook points Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways Like putting together a jigsaw puzzle in which all the other pieces have already been assembled and you only need to add two pieces: The map and the reduce pieces Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 51. Bird’s-eye view The MapReduce paradigm is amenable to divide-and-conquer algorithms One way to look at MapReduce is that it is just a large-scale sorting platform User-logic is only involved at specific hook points Algorithms must be expressed in terms of a small number of specific components that fit together in preset ways Like putting together a jigsaw puzzle in which all the other pieces have already been assembled and you only need to add two pieces: The map and the reduce pieces Fortunately a large number of algorithms easily fit this rigid pattern Zubair Nabi 6: MapReduce Applications April 18, 2013 16 / 27
  • 52. Programmer control The programmer has no control over 1 The location of a map or reduce task in terms of nodes in the cluster Zubair Nabi 6: MapReduce Applications April 18, 2013 17 / 27
  • 53. Programmer control The programmer has no control over 1 The location of a map or reduce task in terms of nodes in the cluster 2 The start and end time of a map or a reduce task Zubair Nabi 6: MapReduce Applications April 18, 2013 17 / 27
  • 54. Programmer control The programmer has no control over 1 The location of a map or reduce task in terms of nodes in the cluster 2 The start and end time of a map or a reduce task 3 The input key/value pairs processed by a specific map task Zubair Nabi 6: MapReduce Applications April 18, 2013 17 / 27
  • 55. Programmer control The programmer has no control over 1 The location of a map or reduce task in terms of nodes in the cluster 2 The start and end time of a map or a reduce task 3 The input key/value pairs processed by a specific map task 4 The intermediate key/value pairs processed by a specific reduce task Zubair Nabi 6: MapReduce Applications April 18, 2013 17 / 27
  • 60. Programmer control (2): The programmer does have control over: (1) the data structures to be used as keys and values; (2) initialization code at the beginning of map/reduce tasks and termination code at the end; (3) preservation of state across multiple invocations of map/reduce tasks; (4) the sort order of intermediate keys and, in turn, the order in which a reducer encounters keys; (5) partitioning of the key space and, in turn, the set of keys that a particular reducer encounters.
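Items (4) and (5) can be sketched as two user-supplied functions that a simulated shuffle consults. The names `first_letter_partitioner`, `shuffle`, and `sort_key` are illustrative assumptions, not the API of any particular framework.

```python
from collections import defaultdict

def first_letter_partitioner(key, num_reducers):
    """Hypothetical partitioner: decides which reducer sees a given key."""
    return (ord(key[0].lower()) - ord("a")) % num_reducers

def shuffle(pairs, num_reducers, partitioner, sort_key):
    """Simulated shuffle: partition the key space, then sort keys per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partitioner(key, num_reducers)][key].append(value)
    # sort_key controls the order in which each reducer encounters its keys
    return [sorted(b.items(), key=lambda kv: sort_key(kv[0])) for b in buckets]

pairs = [("apple", 1), ("bob", 1), ("cat", 1), ("axe", 1), ("apple", 1)]
# Custom order: longest keys first, ties broken alphabetically.
reducer_inputs = shuffle(pairs, 2, first_letter_partitioner,
                         sort_key=lambda k: (-len(k), k))
```

With two reducers, keys starting with "a" or "c" land on reducer 0 and keys starting with "b" on reducer 1, and each reducer sees its keys in the custom order.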
  • 64. Multi-job algorithms: Many algorithms cannot be easily expressed as a single MapReduce job. Complex algorithms need to be decomposed into a sequence of jobs, in which the output of one job becomes the input to the next. Most iterative algorithms need to be run by an external driver program that performs the convergence check.
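A driver loop of this kind might look like the sketch below, which labels connected components by repeatedly propagating the smallest node ID. The graph, the function names, and the in-process `run_job` helper are all assumptions made for illustration; reading `GRAPH` as a global stands in for side data that a real job would have to ship to each task.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Toy single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reduce_fn(key, groups[key])]

GRAPH = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}  # hypothetical adjacency lists

def propagate_map(record):
    node, label = record
    yield node, label                 # keep the node's own label
    for neighbour in GRAPH[node]:
        yield neighbour, label        # offer the label to each neighbour

def min_reduce(node, labels):
    yield node, min(labels)

def driver(max_iterations=20):
    """External driver: chain jobs until the output stops changing."""
    state = [(node, node) for node in sorted(GRAPH)]
    for iteration in range(1, max_iterations + 1):
        new_state = run_job(state, propagate_map, min_reduce)
        if new_state == state:        # convergence check lives outside the job
            return new_state, iteration
        state = new_state             # output of one job feeds the next
    return state, max_iterations

labels, iterations = driver()
```

The convergence test is ordinary sequential code in the driver; neither the map nor the reduce function knows that it is part of a loop.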
  • 68. Local aggregation: Network and disk latencies are expensive compared to other operations, so decreasing the amount of data transferred over the network during the shuffle phase improves efficiency. Aggressive use of combiners for commutative and associative operations can greatly reduce intermediate data. Another strategy, dubbed “in-mapper combining”, can decrease not only the amount of intermediate data but also the number of key/value pairs emitted by the map tasks.
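The difference between a plain mapper and in-mapper combining can be sketched as follows. The `setup`/`map`/`cleanup` method names and the `run_map_task` helper are assumptions loosely modelled on the usual task lifecycle, not a specific framework's interface.

```python
from collections import Counter

class PlainMapper:
    def map(self, line):
        for word in line.split():
            yield word, 1             # one pair per occurrence

class InMapperCombiningMapper:
    def setup(self):
        self.counts = Counter()       # state preserved across map() calls

    def map(self, line):
        self.counts.update(line.split())
        return iter(())               # emit nothing per record

    def cleanup(self):
        yield from self.counts.items()  # emit one pair per distinct word

def run_map_task(mapper, split):
    """Hypothetical task runner: setup, map each record, cleanup."""
    emitted = []
    if hasattr(mapper, "setup"):
        mapper.setup()
    for line in split:
        emitted.extend(mapper.map(line))
    if hasattr(mapper, "cleanup"):
        emitted.extend(mapper.cleanup())
    return emitted

split = ["to be or not to be", "to do"]
plain = run_map_task(PlainMapper(), split)
combined = run_map_task(InMapperCombiningMapper(), split)
```

The trade-off is memory: the in-mapper dictionary must fit in the task's heap, so a practical version would typically flush it once it grows past some threshold.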
  • 69. Outline: (1) The Anatomy of a MapReduce Application; (2) MapReduce Design Patterns; (3) Common MapReduce Application Types
  • 73. Counting and Summing: (1) Problem: A number of documents, each with a set of terms. Need to calculate the number of occurrences of each term (word count) or some arbitrary function over the terms (e.g., average response time in log files). (2) Solution: Map: for each term, emit the term and “1”. Reduce: take the sum (or any other operation) of each term’s values.
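For the "average response time" variant, one possible sketch is shown below. The log format (`endpoint milliseconds`), the function names, and the `run` helper are assumptions. The mapper emits a (sum, count) pair rather than a bare value, so that partial results could also be merged by a combiner without distorting the average.

```python
from collections import defaultdict

def map_log(line):
    endpoint, millis = line.split()   # assumed format: "<endpoint> <milliseconds>"
    yield endpoint, (float(millis), 1)

def reduce_avg(endpoint, pairs):
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    yield endpoint, total / count

def run(records, map_fn, reduce_fn):
    """Toy single-process stand-in for the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reduce_fn(key, groups[key])]

logs = ["/home 120", "/search 300", "/home 80", "/search 100", "/home 100"]
averages = dict(run(logs, map_log, reduce_avg))
```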
  • 78. Collating: (1) Problem: A number of documents with a set of terms and some function of one item. Need to group all items that have the same value of the function, either to store the items together or to perform some computation over them. (2) Solution: Map: for each item, compute the given function and emit the function value as key and the item as value. Reduce: either save all grouped items or perform further computation. Example: inverted index, where the key is a term and the grouped values are the IDs of the documents that contain it.
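An inverted index built with this pattern might be sketched as below; here the grouping key is the term and the collated items are document IDs. The document contents, function names, and the in-process `run` helper are made up for illustration.

```python
from collections import defaultdict

def map_document(record):
    doc_id, text = record
    for term in set(text.split()):    # emit each term once per document
        yield term, doc_id

def reduce_postings(term, doc_ids):
    yield term, sorted(doc_ids)       # collate the grouped items into a posting list

def run(records, map_fn, reduce_fn):
    """Toy single-process stand-in for the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reduce_fn(key, groups[key])]

docs = [("d1", "map reduce"), ("d2", "reduce sort"), ("d3", "map sort map")]
index = dict(run(docs, map_document, reduce_postings))
```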
  • 83. Filtering, Parsing, and Validation: (1) Problem: A set of records. Need to collect all records that meet some condition or transform each record into another representation. (2) Solution: Map: for each record, emit it if it passes the condition, or emit its transformed version. Reduce: identity. Example: text parsing or transformation such as word capitalization.
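Because the reducer is the identity, all of the work sits in the map function. A minimal sketch, assuming a hypothetical `name,age` record format:

```python
def parse_and_validate(line):
    """Map function: drop malformed records, emit a parsed and capitalized version."""
    parts = line.split(",")
    if len(parts) != 2 or not parts[1].strip().isdigit():
        return                        # record fails validation: emit nothing
    name, age = parts
    yield name.strip().capitalize(), int(age)

records = ["alice,30", "bad line", "bob, 25", "carol,abc"]
# Identity reduce: the job output is simply everything the mappers emitted.
cleaned = [pair for record in records for pair in parse_and_validate(record)]
```

Jobs like this are often run with no reduce phase at all, which avoids the shuffle entirely.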
  • 88. Distributed Task Execution: (1) Problem: A large computational problem. Need to divide it into multiple parts and combine the results from all parts to obtain a final result. (2) Solution: Map: perform the corresponding computation on each part. Reduce: combine all emitted results into a final one. Example: RGB histogram calculation of bitmap images.
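The RGB histogram example could be sketched as follows, treating each image as one part of the problem. Representing an image as a list of `(r, g, b)` tuples is an assumption made to keep the sketch free of image libraries, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def map_image(pixels):
    """Map: compute a partial histogram per channel for one image."""
    partial = {"R": Counter(), "G": Counter(), "B": Counter()}
    for r, g, b in pixels:
        partial["R"][r] += 1
        partial["G"][g] += 1
        partial["B"][b] += 1
    for channel, hist in partial.items():
        yield channel, hist

def reduce_channel(channel, partial_hists):
    """Reduce: merge the partial histograms into the final one."""
    total = Counter()
    for hist in partial_hists:
        total.update(hist)
    yield channel, total

def run(records, map_fn, reduce_fn):
    """Toy single-process stand-in for the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reduce_fn(key, groups[key])]

images = [[(255, 0, 0), (255, 0, 0)], [(255, 255, 0), (0, 0, 0)]]
histograms = dict(run(images, map_image, reduce_channel))
```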
  • 93. Sorting: (1) Problem: A set of records. Need to sort the records in some order. (2) Solution: Map: identity. Reduce: identity. The framework’s own sort on intermediate keys does the work. It is also possible to sort by value, either by performing a secondary sort or by performing a value-to-key conversion.
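Sorting by value is commonly done by moving the value into a composite key (the pattern Lin and Dyer call value-to-key conversion), partitioning on the original key only, and letting the framework sort. The sketch below simulates that with made-up sensor readings; the partitioner, record layout, and function names are assumptions.

```python
from itertools import groupby

readings = [("s2", 3, 0.7), ("s1", 2, 0.4), ("s1", 1, 0.9), ("s2", 1, 0.1)]

def map_reading(record):
    sensor, timestamp, value = record
    yield (sensor, timestamp), value  # the timestamp moves into the key

def partition(composite_key, num_reducers):
    # Partition on the natural key only, so one sensor stays on one reducer.
    return sum(map(ord, composite_key[0])) % num_reducers

def run_secondary_sort(records, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for record in records:
        for key, value in map_reading(record):
            partitions[partition(key, num_reducers)].append((key, value))
    output = {}
    for part in partitions:
        part.sort(key=lambda kv: kv[0])   # framework sort on the composite key
        # Group on the natural key; values arrive already ordered by timestamp.
        for sensor, group in groupby(part, key=lambda kv: kv[0][0]):
            output[sensor] = [(key[1], value) for key, value in group]
    return output

by_sensor = run_secondary_sort(readings)
```

The reducer never buffers and sorts values itself; the ordering falls out of the framework's key sort.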
  • 94. References: (1) Jimmy Lin and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers. (2) MapReduce Patterns, Algorithms, and Use Cases: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/