Top 3 Design Patterns in MapReduce

www.edureka.co/r-for-analytics
www.edureka.co/mapreduce-design-patterns

Agenda
Today we will take you through the following:
 Summarization Patterns
» Numerical Summarization (Hands On)
 Filter Patterns
» Finding Top K records (Hands On)
 Join Patterns
» Reduce side join (Hands On)

MapReduce Review

Why MapReduce Design Patterns - Question
Let's approach this topic with a few questions.
 Would you use standard sorting algorithms on the MapReduce framework?
» Quick Sort, Merge Sort etc.? No.
» Why?
 MapReduce imposes constraints, like any other framework
» You have to think in terms of Map tasks and Reduce tasks
» The programmer has little control over many aspects of execution
 But MapReduce does provide a number of techniques for controlling the flow of data

MapReduce Paradigm - Constraints (Contd.)
 Programmer has little control over many aspects of execution
» Where a mapper or reducer runs
» When a mapper or reducer begins or finishes
» Which input key-value pairs are processed by a specific mapper
» Which intermediate key-value pairs are processed by a specific reducer

Why MapReduce Design Patterns - Answer
 Because of the constraints discussed in the earlier slides
» Design patterns capture the best ways people have found to solve common problems within these constraints
 Because of the MapReduce techniques for controlling execution and flow of data
» Apply these techniques to problems in standard ways that people have already worked out
 Judicious use of the Distributed Cache or a sorting comparator can help in quite a few algorithms
 Scalability and efficiency concerns

Summarization Patterns – What is it
 Provides a high-level aggregate view of a data set when visual inspection of the whole data set is not feasible
 Groups similar data together and performs an operation such as
» Calculating a statistic, indexing, counting etc.
 Apply it to a new data set to quickly understand what's important and what to look at closely
 Example
» Number of hits per hour per location on a website, from a web log
» Average length of comments per user in blog comments
» Top ten salaries per profession, region-wise

Numerical Summarizations – Description
 General pattern for calculating an aggregate statistic over a data set
 Group records by a key field and calculate a numerical aggregate per group
» Min, max, sum, average, median, standard deviation etc.
 Use a combiner properly for an efficient implementation
 Example
» Take advertising actions based on the hours users are most active on your site
» Compute the hourly average of the amount users spend on your site
 Applicability – Use it when
» You are dealing with numerical data or counting
» The data can be grouped by fields

Numerical Summarizations – Structure
 Mapper
» Output key = field to group by; output value = numerical item to summarize
» Make sure only relevant items are output from the mapper, to reduce network traffic
 Combiner
» Use if the summarization operation in the reducer is associative and commutative
» Reduces the network traffic between Map tasks and Reduce tasks

Numerical Summarizations – Structure (Contd.)
 Partitioner
» Use a custom partitioner if the data is skewed
» This distributes computation more uniformly across reducers
 Reducer
» Each reducer applies the summarization function to the values received for each group key
» Output key = group key; output value = summarization statistic
» Job output is a set of part files containing a single record per reducer input group
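
The mapper, combiner and reducer roles above can be sketched in plain Python. This is a minimal in-memory simulation under stated assumptions, not Hadoop code: the record layout (user, value), the function names and the sample data are all hypothetical, and a dictionary stands in for the shuffle. Because min, max and count are associative and commutative, the same merge function serves as both combiner and reducer.

```python
def mapper(record):
    """Emit (group key, (min, max, count)) for one (user, value) record."""
    user, value = record
    yield user, (value, value, 1)

def combine(a, b):
    """Merge two partial (min, max, count) tuples; associative and commutative."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])

def reduce_by_key(pairs):
    """Group pairs by key and fold each group with combine()."""
    merged = {}
    for key, partial in pairs:
        merged[key] = combine(merged[key], partial) if key in merged else partial
    return merged

def run_job(splits):
    """Simulate the job: map plus combiner per input split, then shuffle and reduce."""
    shuffled = []
    for split in splits:
        mapped = [kv for record in split for kv in mapper(record)]
        # Combiner: pre-aggregate locally so fewer pairs cross the network.
        shuffled.extend(reduce_by_key(mapped).items())
    return reduce_by_key(shuffled)

# Hypothetical input: two splits of (user, value) records.
splits = [
    [("alice", 5), ("bob", 7), ("alice", 2)],
    [("alice", 9), ("bob", 1)],
]
result = run_job(splits)
print(result)  # one (min, max, count) record per group key
```

With the combiner in place, the first split sends two pairs to the reducers instead of three; on real data with many records per key the saving is much larger.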

Numerical Summarizations – Analogy, Performance
 Performance
» The crux of this pattern, grouping by key, is what MapReduce provides at its core
» Performs well when a combiner is used properly
» For a skewed data set, use a custom partitioner for improved performance
» Use an appropriate number of reducers

Numerical Summarizations – Use Cases
 Min/Max/Count
» Analytics to find minimum, maximum, count of an event
 Average/Median/Standard Deviation
» Analytics similar to Min/Max/Count
» Implementation is not as straightforward, because these operations are not associative
 Record Count
» Common analytics to get a heartbeat of the data flow rate over a particular interval
 Word Count
» Basic text analytics: counting the words in a document
» Hello World of MapReduce
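
To see why the average is less straightforward than min/max/count, the sketch below compares averaging per-mapper averages with one common workaround: carrying (sum, count) pairs through the combiner and dividing only in the reducer. The workaround is an assumption here, since the slides do not spell out an implementation, and the function names and data are hypothetical. Median and standard deviation need different handling and are not covered.

```python
def combine(a, b):
    """Merge two (sum, count) partials; safe to apply in any order or grouping."""
    return (a[0] + b[0], a[1] + b[1])

def partial_sums(values):
    """Combiner step: fold one mapper's values into a single (sum, count)."""
    total = (0, 0)
    for v in values:
        total = combine(total, (v, 1))
    return total

def average(partials):
    """Reducer step: merge all partials, then divide once at the end."""
    total = (0, 0)
    for p in partials:
        total = combine(total, p)
    return total[0] / total[1]

# Hypothetical values for one key, produced by two different mappers.
mapper_outputs = [[10, 20, 30], [50]]

correct = average(partial_sums(vs) for vs in mapper_outputs)
# Naive approach: average each mapper's values, then average those averages.
naive = sum(sum(vs) / len(vs) for vs in mapper_outputs) / len(mapper_outputs)
print(correct, naive)
```

The two results differ because the naive version gives the single value from the second mapper the same weight as the three values from the first.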

Min/Max/Count Example – Data Flow

DEMO
Min/Max/Count Example

Filtering Patterns – What is it
 Finding a subset of interest from a large data set
 So that further analytics can be applied on this subset
 These patterns don't alter the original dataset
Example:
 Sampling – to get a representative sample to feed to machine learning algorithms
 Selecting all records for a user to apply further analytics

Basic Filtering Pattern – Description
 Acts as a basic, abstract filtering pattern that some other patterns build on
 Filter out records that are not of interest and keep the ones that are
 A parallel processing system like Hadoop is required because of the large size of the original data set
 The filtered subset may be large or small
Example: To study the behaviour of users between 10 and 11 am, keep only the log records from that hour
Applicability – Use it when
 Widely applicable
 The data can be easily parsed to yield a filtering criterion

Basic Filtering Pattern – Structure

Basic Filtering Pattern – Structure (Contd.)
Mapper
 Applies the filtering criteria to each record it receives
 Outputs the records that match the filtering criteria
 Output key/value pairs are the same as the input key/value pairs
Combiner
 Not required; map-only job
Partitioner
 Not required; map-only job
Reducer
 Generally not required; map-only job
 But an identity reducer can be used
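
A map-only filter of this shape can be sketched in plain Python, using the 10 to 11 am example from the description slide. The log line format, the function names and the sample records are assumptions made for illustration; in a real job the key would typically be the byte offset of the line.

```python
from datetime import datetime

def in_window(record, start_hour=10, end_hour=11):
    """Filter predicate: True if the timestamp falls in [start_hour, end_hour)."""
    stamp = datetime.strptime(record.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S")
    return start_hour <= stamp.hour < end_hour

def mapper(key, record):
    """Emit the input pair unchanged if it passes the filter, otherwise nothing."""
    if in_window(record):
        yield key, record

# Hypothetical log lines: ISO timestamp first, then free-form fields.
log = [
    "2014-03-01T09:59:59 user=1 GET /home",
    "2014-03-01T10:00:00 user=2 GET /cart",
    "2014-03-01T10:45:10 user=1 GET /pay",
    "2014-03-01T11:00:00 user=3 GET /home",
]

# Map-only job: no shuffle or reduce, the mapper output is the job output.
kept = [kv for offset, line in enumerate(log) for kv in mapper(offset, line)]
for key, record in kept:
    print(key, record)
```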

Basic Filtering Pattern – Use Cases
 Closer view of data
 Removing low scoring data
 Distributed grep
 Data cleansing
 Simple random sampling
 Tracking a thread of events
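
As one instance of these use cases, simple random sampling is the same map-only filter with a random predicate: each record is kept with a fixed probability. The sketch below is plain Python under stated assumptions; `make_sampling_mapper`, `run_map_only` and the seed parameter are hypothetical names introduced for illustration.

```python
import random

def make_sampling_mapper(fraction, seed=None):
    """Build a mapper that keeps each record with probability `fraction`."""
    rng = random.Random(seed)

    def mapper(key, record):
        # random() returns a float in [0.0, 1.0), so fraction=1.0 keeps everything
        # and fraction=0.0 keeps nothing.
        if rng.random() < fraction:
            yield key, record

    return mapper

def run_map_only(mapper, records):
    """Simulate a map-only job over an in-memory list of records."""
    return [kv for i, rec in enumerate(records) for kv in mapper(i, rec)]

records = [f"record-{i}" for i in range(1000)]
sample = run_map_only(make_sampling_mapper(0.1, seed=42), records)
print(len(sample), "of", len(records), "records kept")
```

The sample size is only approximately 10% of the input, because each record is decided independently.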

Top Ten – Description
 Filter in a fixed and relatively small number (10) of records from a large data set
 Based on a ranking criterion that defines a total ordering
 You can manually look at this small number of records to see what's special about them
 Interesting for how one would implement Top Ten in MapReduce vis-a-vis SQL
» In SQL or any programming language you would sort and then take the top 10
» In MapReduce, total order sorting is complex and resource intensive
Example: Top ten users with the highest number of comments posted on StackOverflow in 2014

Top Ten – Applicability
Applicability – Use it when
 A comparator function is available for ranking records
 Number of output records much smaller than input records
» If not, one is better off sorting the whole dataset

Top Ten – Structure

Mapper
 In the setup() method, initialize an array of size k (= 10)
 In map(), insert the record into the array in sorted order
 If sizeOf(array) > 10, truncate the array to size 10, keeping the highest 10
 In cleanup(), read the array and output key = null and value = record
Combiner and custom Partitioner not required
Reducer
 Since the number of output records from the mappers is small, only 1 reducer is used
 The reducer does the same thing as the mapper: it keeps the highest 10 of the records it receives
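
The mapper and reducer steps above can be sketched as follows. This is a plain-Python simulation rather than Hadoop code: `top_k` plays the role of the sorted, truncated array in both the mapper and the reducer, `run_job` stands in for the framework, and the (comment count, user) data is hypothetical.

```python
import bisect

def top_k(items, k):
    """Return the k highest items, holding at most k + 1 in memory at a time."""
    best = []                      # kept in ascending order
    for item in items:
        bisect.insort(best, item)  # sorted insert, as in map()
        if len(best) > k:
            del best[0]            # drop the lowest, keeping the highest k
    return best

def run_job(splits, k):
    """Each mapper emits its local top k; a single reducer merges them."""
    mapper_outputs = []
    for split in splits:
        mapper_outputs.extend(top_k(split, k))   # what cleanup() would emit
    return sorted(top_k(mapper_outputs, k), reverse=True)

# Hypothetical (comment count, user) pairs, split across two mappers.
splits = [
    [(12, "ann"), (3, "bob"), (40, "cat")],
    [(25, "dan"), (7, "eve"), (31, "fay")],
]
result = run_job(splits, k=2)
print(result)
```

The single reducer sees at most k records per mapper, which is why one reducer is enough even for a very large input.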

Top Ten – Use Cases
 Outlier analysis
 Select interesting data for further BI systems which cannot handle Big Data sets
 Publish interesting dashboards

DEMO
Top Ten Example

Join Patterns – What is it
 Datasets generally exist in multiple sources
 Deriving the full value of the data requires merging these sources together
 Join Patterns are used for this purpose
 Performing joins on the fly on Big Data can be costly in terms of time
Example: Joining StackOverflow data from Comments & Posts on UserId

Join – Refresher
 Inner Join
 Outer Join
» Left Outer Join
» Right Outer Join
» Full Outer Join
 Anti Join
 Cartesian Product

Reduce Side Join – Description
 Easiest to implement, but can take the longest to execute
 Supports all types of join operation
 Can join multiple data sources, but is expensive in terms of network resources and time
 All data is transferred across the network
Example: Join PostLinks table data in StackOverflow to Posts data

Reduce Side Join – Description (Contd.)
 Applicability – Use it when
» Multiple large data sets need to be joined
» If one of the data sources is small, consider a replicated join instead
» The different data sources are linked by a foreign key
» You want all join operations to be supported

Reduce Side Join – Structure

Reduce Side Join – Structure (Contd.)
 Mapper
» Output key should be the foreign key
» Output value can be the whole record plus an identifier for the source it came from
» Use projection and output only the required fields
 Combiner
» Not required; no additional benefit
 Partitioner
» Use a custom partitioner if required
 Reducer
» Reducer logic depends on the type of join required
» Reducer receives the data from all the different sources per key
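
The structure above can be sketched as a plain-Python simulation of joining Posts and Comments on a user id. Everything specific here is an assumption made for illustration: the field names (`user_id`, `title`, `text`), the "P"/"C" source tags, the three join types shown, and the dictionary that stands in for the shuffle.

```python
from collections import defaultdict

def map_posts(post):
    """Key by the foreign key; tag the value with its source ("P")."""
    yield post["user_id"], ("P", post["title"])     # projection: title only

def map_comments(comment):
    """Key by the foreign key; tag the value with its source ("C")."""
    yield comment["user_id"], ("C", comment["text"])

def reducer(key, tagged_values, join_type):
    """Split the values by source tag, then combine them per the join type."""
    posts = [v for tag, v in tagged_values if tag == "P"]
    comments = [v for tag, v in tagged_values if tag == "C"]
    if join_type == "inner":
        for p in posts:
            for c in comments:
                yield key, p, c
    elif join_type == "left_outer":
        for p in posts:
            for c in comments or [None]:
                yield key, p, c
    elif join_type == "anti":
        # Keep only records whose key appears in exactly one source.
        if not comments:
            for p in posts:
                yield key, p, None
        if not posts:
            for c in comments:
                yield key, None, c
    else:
        raise ValueError(f"unsupported join type: {join_type}")

def run_job(posts, comments, join_type):
    groups = defaultdict(list)        # stands in for shuffle and sort
    for post in posts:
        for key, value in map_posts(post):
            groups[key].append(value)
    for comment in comments:
        for key, value in map_comments(comment):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key], join_type))
    return output

# Hypothetical sample data.
posts = [{"user_id": 1, "title": "Q1"}, {"user_id": 2, "title": "Q2"}]
comments = [{"user_id": 1, "text": "nice"}, {"user_id": 3, "text": "hmm"}]

for join_type in ("inner", "left_outer", "anti"):
    print(join_type, run_job(posts, comments, join_type))
```

Only the reducer changes between join types; the mappers and the shuffle are identical, which is why this pattern supports every join operation at the cost of moving all the data.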

Reduce Side Join – Performance
 Performance
» The whole data set moves across the network to the reducers
» You can optimize by using projection and sending only the required fields
» The number of reducers is typically higher than normal
» If you can use any other join type for your problem, use that instead

DEMO
Reduce Side Join Example

