Large Scale Data Analysis with
     Map/Reduce, part I
           Marin Dimitrov
        (technology watch #1)


              Feb 2010
Contents

• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
   –   Hadoop (Yahoo/Apache)
   –   Cloud MapReduce (Accenture)
   –   Elastic MapReduce (Hadoop on AWS)
   –   MR.Flow
• Some M/R algorithms
   – Graph algorithms, Text Indexing & retrieval



Contents



                       Part I

Distributed computing
      frameworks


Scalability & Parallelisation

• Scalability approaches
   – Scale up (vertical scaling)
       • Only one direction of improvement (bigger box)
   – Scale out (horizontal scaling)
       • Two directions – add more nodes + scale up each node
        • Can achieve roughly 4x the performance of a similarly priced scale-up
          system (ref?)
   – Hybrid (“scale out in a box”)
• Hard to parallelise
   – Algorithms with state
   – Dependencies from one iteration to another (recurrence, induction)




Parallelisation approaches

• Parallelisation approaches
   – Task decomposition
        • Distribute coarse-grained (synchronisation-wise) and computationally
         expensive tasks (otherwise too much coordination/management
         overhead)
       • Dependencies - execution order vs. data dependencies
       • Move the data to the processing (when needed)
   – Data decomposition
       • Each parallel task works with a data partition assigned to it (no sharing)
       • Data has regular structure, i.e. chunks expected to need the same
         amount of processing time
       • Two criteria: granularity (size of chunk) and shape (data exchange
         between chunk neighbours)
       • Move the processing to the data



Amdahl’s law

• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for
  parallelisation and by the serial processing part
• Amdahl’s law
   – max_speedup = 1 / ((1 - P) + P / N)

   – P: proportion of the program that can be parallelised (1-P still
     remains serial or overhead)
   – N: number of processors / parallel nodes
   – Example: P=75% (i.e. 25% serial or overhead)
  N (parallel nodes)    2         4         8        16        32     1024      64K
  Max speedup           1.60      2.29      2.91     3.37      3.66   3.99      3.99
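
A few lines of Python reproduce the table above (the 64K entry rounds to 4.0; the slide rounds it to 3.99):

```python
def max_speedup(p: float, n: int) -> float:
    """Amdahl's law: upper bound on speedup for parallel fraction p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# Reproduces the table for P = 0.75 (25% serial or overhead)
for n in (2, 4, 8, 16, 32, 1024, 65536):
    print(n, round(max_speedup(0.75, n), 2))
# 2 1.6 | 4 2.29 | 8 2.91 | 16 3.37 | 32 3.66 | 1024 3.99 | 65536 4.0
```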


Map/Reduce

• Google (OSDI 2004 paper), US patent granted (2010)
• General idea - co-locate data with computation nodes
   – Data decomposition (parallelization) – no data/order dependencies
     between tasks (except the Map-to-Reduce phase)
   – Try to utilise data locality (bandwidth is $$$)
   – Implicit data flow (higher abstraction level than MPI)
   – Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
   – Map - for each input (Ki,Vi) produce zero or more output pairs
     (Km,Vm)
   – Combine – optional intermediate aggregation (less M->R data
     transfer)
   – Reduce - for input pair (Km, list(V1,V2,…, Vn)) produce zero or more
     output pairs (Kr,Vr)
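
The contract above fits in a few lines; below is a toy single-process sketch of the Map / optional Combine / shuffle & sort / Reduce dataflow (assuming in-memory lists; real frameworks distribute each phase across nodes, but the structure is the same):

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Toy model: map -> optional combine (per map task) -> shuffle/sort -> reduce."""
    mapped = []
    for key, value in records:                 # one "map task" per input record
        pairs = list(mapper(key, value))
        if combiner is not None:               # local aggregation: less M->R transfer
            local = defaultdict(list)
            for k, v in pairs:
                local[k].append(v)
            pairs = [out for k, vs in local.items() for out in combiner(k, vs)]
        mapped.extend(pairs)
    mapped.sort(key=itemgetter(0))             # the framework's shuffle & sort
    results = []
    for k, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(k, [v for _, v in group]))  # (Km, list(V1..Vn))
    return results

# Tiny usage example (word count, see the next slides):
docs = [("doc1", "a b a"), ("doc2", "b")]
wc = run_mapreduce(docs,
                   mapper=lambda k, text: ((t, 1) for t in text.split()),
                   reducer=lambda term, counts: [(term, sum(counts))])
print(wc)  # [('a', 2), ('b', 2)]
```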
Map/Reduce (2)




[Figure: Map/Reduce execution overview, (C) Jimmy Lin]


Map/Reduce - examples

• In other words…
   – Map = partitioning of the data (compute part of a problem across
     several servers)
   – Reduce = processing of the partitions (aggregate the partial results
     from all servers into a single resultset)
   – The M/R framework takes care of grouping of partitions by key
• Example: word count
   – Map (1 task per document in the collection)
       • In: doc_x
       • Out: (term_1, count_{1,x}), (term_2, count_{2,x}), … where count_{i,x}
         is the number of occurrences of term_i in doc_x
   – Reduce (1 task per term in the collection)
       • In: (term_1, <count_{1,x}, count_{1,y}, …, count_{1,z}>)
       • Out: (term_1, SUM(count_{1,x}, count_{1,y}, …, count_{1,z}))
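
A minimal sketch of this word count, with a plain dict standing in for the framework's group-by-key step (the per-document Counter plays the role of the optional Combiner):

```python
from collections import Counter, defaultdict

def map_word_count(doc_id, text):
    # One map task per document: emit (term, local count)
    for term, count in Counter(text.lower().split()).items():
        yield term, count

def reduce_word_count(term, counts):
    yield term, sum(counts)

docs = {"doc1": "the chicken and the egg", "doc2": "the road"}
shuffle = defaultdict(list)                  # the framework's group-by-key
for doc_id, text in docs.items():
    for term, c in map_word_count(doc_id, text):
        shuffle[term].append(c)
totals = dict(kv for term, vs in shuffle.items()
              for kv in reduce_word_count(term, vs))
print(totals)  # {'the': 3, 'chicken': 1, 'and': 1, 'egg': 1, 'road': 1}
```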


Map/Reduce - examples (2)
• Example: Shortest path in graph (naïve)
   – Map: in (node_in, dist); out (node_out, dist+1) for each edge node_in -> node_out
   – Reduce: in (node_r, <dist_{a,r}, dist_{b,r}, …, dist_{c,r}>); out (node_r,
     MIN(dist_{a,r}, dist_{b,r}, …, dist_{c,r}))
   – Multiple M/R iterations required, start with (node_start, 0); a toy version
     of one such iteration is sketched below
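
A toy single-process sketch of this iterative scheme (assuming an in-memory adjacency list; the node's own distance is re-emitted so the Reduce MIN never loses it):

```python
def map_sssp(node, state):
    dist, neighbors = state
    yield node, dist                      # preserve the node's own distance
    if dist != float("inf"):
        for nbr in neighbors:
            yield nbr, dist + 1           # the slide's "dist+1"

def reduce_sssp(node, dists):
    return node, min(dists)

# Distances start at 0 for the source node, "infinity" elsewhere
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
dist = {n: (0 if n == "s" else float("inf")) for n in graph}

changed = True
while changed:                            # one M/R job per BFS frontier
    buckets = {}
    for n in graph:
        for k, v in map_sssp(n, (dist[n], graph[n])):
            buckets.setdefault(k, []).append(v)
    new = dict(reduce_sssp(k, vs) for k, vs in buckets.items())
    changed, dist = (new != dist), new
print(dist)  # {'s': 0, 'a': 1, 'b': 1, 'c': 2}
```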
• Example: Inverted indexing (full text search)
   – Map
       • In: doc_x
       • Out: (term_1, (doc_x, pos'_{1,x})), (term_1, (doc_x, pos''_{1,x})),
         (term_2, (doc_x, pos_{2,x})), …
   – Reduce
       • In: (term_1, <(doc_x, pos'_{1,x}), (doc_x, pos''_{1,x}), (doc_y, pos_{1,y}),
         …, (doc_z, pos_{1,z})>)
       • Out: (term_1, <(doc_x, <pos'_{1,x}, pos''_{1,x}, …>), (doc_y, <pos_{1,y}>),
         …, (doc_z, <pos_{1,z}>)>)



Map/Reduce - examples (3)

• Inverted index example rundown
• input
   – Doc1: “Why did the chicken cross the road?”
   – Doc2: “The chicken and egg problem”
   – Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
   – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)),
     (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)),
     (“road”,(doc1,7))
   – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)),
     (“egg”,(doc2,4)), (“problem”, (doc2,5))
   – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))



Map/Reduce - examples (4)

• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
   –   (“why”, <(doc1,1)>),
   –   (“did”, <(doc1,2)>),
   –   (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
   –   (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
   –   (“cross”, <(doc1,5)>)
   –   (“road”, <(doc1,7)>)
   –   (“and”, <(doc2,3)>)
   –   (“egg”, <(doc2,4)>)
   –   (“problem”, <(doc2,5)>)
   –   (“kentucky”, <(doc3,1)>)
   –   (“fried”, <(doc3,2)>)

Map/Reduce - examples (5)

• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
   –   (“why”, <(doc1,<1>)>),
   –   (“did”, <(doc1,<2>)>),
   –   (“the”, <(doc1, <3,6>), (doc2, <1>)>)
   –   (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
   –   (“cross”, <(doc1,<5>)>)
   –   (“road”, <(doc1,<7>)>)
   –   (“and”, <(doc2,<3>)>)
   –   (“egg”, <(doc2,<4>)>)
   –   (“problem”, <(doc2,<5>)>)
   –   (“kentucky”, <(doc3,<1>)>)
   –   (“fried”, <(doc3,<2>)>)
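
The whole rundown in a few lines of Python, again with a dict standing in for the framework's shuffle & sort:

```python
import re
from collections import defaultdict

docs = {
    "doc1": "Why did the chicken cross the road?",
    "doc2": "The chicken and egg problem",
    "doc3": "Kentucky Fried Chicken",
}

def map_invert(doc_id, text):
    # One map task per document: emit (term, (doc, position))
    for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
        yield term, (doc_id, pos)

postings = defaultdict(list)                 # shuffle & sort: group by term
for doc_id, text in docs.items():
    for term, p in map_invert(doc_id, text):
        postings[term].append(p)

def reduce_invert(term, hits):
    # Collapse (doc, pos) pairs into doc -> sorted position list
    by_doc = defaultdict(list)
    for doc_id, pos in sorted(hits):
        by_doc[doc_id].append(pos)
    return term, dict(by_doc)

index = dict(reduce_invert(t, hs) for t, hs in postings.items())
print(index["the"])      # {'doc1': [3, 6], 'doc2': [1]}
print(index["chicken"])  # {'doc1': [4], 'doc2': [2], 'doc3': [3]}
```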

Map/Reduce – pros & cons

• Good for
   – Lots of input, intermediate & output data
   – Little or no synchronisation required
   – “Read once”, batch oriented datasets (ETL)
• Bad for
   –   Fast response time
   –   Large amounts of shared data
   –   Fine-grained synchronisation required
   –   CPU intensive operations (as opposed to data intensive)




Dryad

• Microsoft Research (2007),
  http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
   – Focus on throughput, not latency
   – Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
   – Vertices -> processes (processing nodes)
   – Edges -> communication channels between the processes
• DAG model benefits
   – Generic scheduler
   – No deadlocks / deterministic
   – Easier fault tolerance

Dryad DAG jobs




[Figure: example Dryad job DAGs, (C) Michael Isard]

Dryad (3)

• The job graph can mutate during execution (runtime graph refinement)
• Channel types (one way)
   –   Files on a DFS
   –   Temporary file
   –   Shared memory FIFO
   –   TCP pipes
• Fault tolerance
   – Node fails => re-run
   – Input disappears => re-run upstream node
   – Node is slow => run a duplicate copy at another node, get first result




Dryad architecture & components




[Figure: Dryad architecture & components, (C) Mihai Budiu]




Dryad programming

• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
   – Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
   – LINQ to Dryad translator
• Distributed shell
   – Generalisation of the Unix shell & pipes
   – Many inputs/outputs per process!
   – Pipes span multiple machines




Dryad vs. Map/Reduce




[Figure: Dryad vs. Map/Reduce comparison, (C) Mihai Budiu]


Contents



                       Part II

Open Source Map/Reduce
      frameworks


Hadoop

• Originated in Apache Nutch (2004); Yahoo is currently the major
  contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
   –   HDFS – distributed filesystem
   –   HBase – distributed column store
   –   Pig – high level data flow language (Pig Latin)
   –   Hive – Hadoop based data warehouse
   –   ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
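
The Pipes/Streaming entry above lets any executable act as mapper or reducer over stdin/stdout; a minimal Streaming word-count pair might look like this (submitted with the hadoop-streaming jar and the -input/-output/-mapper/-reducer flags; the jar path varies across Hadoop releases):

```python
#!/usr/bin/env python
# mapper.py - read raw lines on stdin, emit one "term<TAB>1" pair per token
import sys

for line in sys.stdin:
    for term in line.strip().split():
        print("%s\t1" % term.lower())
```

```python
#!/usr/bin/env python
# reducer.py - Streaming input arrives sorted by key, so a single pass
# can sum the counts for each run of identical terms
import sys

current, total = None, 0
for line in sys.stdin:
    term, count = line.rstrip("\n").split("\t", 1)
    if term != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = term, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```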


Hadoop - Map/Reduce

• Components
  – Job client
  – Job Tracker
      • Only one
      • Scheduling, coordinating, monitoring, failure handling
  – Task Tracker
      • Many
       • Executes tasks received from the Job Tracker
      • Sends “heartbeats” and progress reports back to the Job Tracker
  – Task Runner
      • The actual Map or Reduce task started in a separate JVM
      • Crashes & failures do not affect the Task Tracker on the node!




Hadoop - Map/Reduce (2)




[Figure: Hadoop Map/Reduce job anatomy, (C) Tom White]


Hadoop - Map/Reduce (3)

• Integrated with HDFS
   – Map tasks executed on the HDFS node where the data is (data
     locality => reduced traffic)
   – Data locality is not possible for Reduce tasks
   – Intermediate outputs of Map tasks (nodes) are not stored on HDFS,
     but locally, and then sent to the proper Reduce task (node)
• Status updates
   – Task Runner => Task Tracker, progress updates every 3s
   – Task Tracker => Job Tracker, heartbeat + progress for all local tasks
     every 5s
   – If a task has no progress report for too long, it will be considered
     failed and re-started



Hadoop - Map/Reduce (4)

• Some extras
   – Counters
       •   Gather stats about a task
       •   Globally aggregated (Task Runner => Task Tracker => Job Tracker)
       •   M/R counters: M/R input records, M/R output records
       •   Filesystem counters: bytes read/written
       •   Job counters: launched M/R tasks, failed M/R tasks, …
   – Joins
       • Copy the small set on each node and perform joins locally. Useful when
         one dataset is very large, the other very small (e.g. “Scalable Distributed
         Reasoning using MapReduce” from VUA)
       • Map side join – data is joined before the Map function, very efficient but
         less flexible (datasets must be partitioned & sorted in a particular way)
       • Reduce side join – more general but less efficient (Map generates (K,V)
         pairs using the join key)
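
A sketch of the first ("copy join" / replicated join) variant, with the small dataset broadcast to every map task (in Hadoop this is typically distributed via the DistributedCache); no (K,V) pairs need to travel to a Reduce phase at all:

```python
# Small dataset, replicated to every node before the job starts:
cities = {1: "Sofia", 2: "Vienna"}                 # id -> name

def map_join(offset, record):
    # The large dataset streams through map tasks; the join happens locally
    user, city_id = record
    yield user, (city_id, cities.get(city_id, "unknown"))

orders = [("alice", 1), ("bob", 2), ("carol", 1)]  # the large side
print([kv for i, rec in enumerate(orders) for kv in map_join(i, rec)])
# [('alice', (1, 'Sofia')), ('bob', (2, 'Vienna')), ('carol', (1, 'Sofia'))]
```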


Hadoop - Map/Reduce (5)

• Built-in mappers and reducers
   – Chain – run a chain/pipe of sequential Maps (M+ R M*, i.e. one or more
     Maps, a Reduce, then zero or more Maps). The last Map output is the
     Task output
   – FieldSelection – select a list of fields from the input dataset to be
     used as MR keys/values
   – TokenCounterMapper, SumReducer – (remember the “word count”
     example?)
   – RegexMapper – matches a regex in the input key/value pairs




Cloud MapReduce

• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB,
  SQS)
   – fast (reported as up to 60 times faster than Hadoop/EC2 in some
     cases)
   – scalable & robust (no single point of bottleneck or failure)
   – simple (3 KLOC)
• Features
   – No need for centralised coordinator (JobTracker), just put job status
     in the cloud datastore (SimpleDB)
   – All data transfer & communication is handled by the Cloud
   – All I/O and storage is handled by the Cloud
Cloud MapReduce (2)




[Figure: Cloud MapReduce architecture, (C) Ricky Ho]



Cloud MapReduce (3)

• Job client workflow
   1.   Store input data (S3)
   2.   Create a Map task for each data split & put it into the Mapper
        Queue (SQS)
   3.   Create multiple Partition Queues (SQS)
   4.   Create Reducer Queue (SQS) & put a Reduce task for each Partition
        Queue
   5.   Create the Output Queue (SQS)
   6.   Create a Job Request (ref to all queues) and put it into SimpleDB
   7.   Start EC2 instances for Mappers & Reducers
   8.   Poll SimpleDB for job status
   9.   When the job completes, download results from S3
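
For illustration, steps 1-6 might look roughly like this using today's boto3 SDK (the actual Cloud MapReduce implementation is Java; all bucket, queue and domain names here are hypothetical):

```python
import json
import boto3  # illustrative only; the original Cloud MapReduce code is Java

s3, sqs = boto3.client("s3"), boto3.client("sqs")
sdb = boto3.client("sdb")                      # SimpleDB
bucket, job_id = "my-cmr-bucket", "job-42"     # hypothetical names

# 1-2. store input splits (S3) and enqueue one Map task per split (SQS)
map_q = sqs.create_queue(QueueName=f"{job_id}-map")["QueueUrl"]
for split in ("part-0", "part-1"):
    s3.put_object(Bucket=bucket, Key=f"{job_id}/{split}", Body=b"...")
    sqs.send_message(QueueUrl=map_q, MessageBody=json.dumps({"split": split}))

# 3-5. partition queues, reduce queue (one task per partition), output queue
part_qs = [sqs.create_queue(QueueName=f"{job_id}-part-{p}")["QueueUrl"]
           for p in range(4)]
reduce_q = sqs.create_queue(QueueName=f"{job_id}-reduce")["QueueUrl"]
for q in part_qs:
    sqs.send_message(QueueUrl=reduce_q, MessageBody=json.dumps({"partition": q}))
out_q = sqs.create_queue(QueueName=f"{job_id}-out")["QueueUrl"]

# 6. register the job (queue references + status) in SimpleDB
sdb.create_domain(DomainName="cmr-jobs")
sdb.put_attributes(DomainName="cmr-jobs", ItemName=job_id, Attributes=[
    {"Name": "map_queue", "Value": map_q},
    {"Name": "reduce_queue", "Value": reduce_q},
    {"Name": "output_queue", "Value": out_q},
])
# 7-9: start EC2 workers, poll SimpleDB for status, fetch results from S3
```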



Cloud MapReduce (4)

• Mapper workflow
   1.   Dequeue a Map task from the Mapper Queue
   2.   Fetch data from S3
   3.   Perform user-defined map function, add output (Km,Vm) pairs to the
        Partition Queue selected by hash(Km) => several partition keys may
        share the same partition queue!
   4.   When done remove Map task from Mapper Queue
• Reducer workflow
   1.   Dequeue a Reduce task from the Reducer Queue
   2.   Dequeue the (Km,Vm) pairs from the corresponding Partition Queue
        => several partitions may share the same queue!
   3.   Perform a user defined reduce function and add output pairs (Kr,Vr)
        to the Output Queue
   4.   When done remove the Reduce task from the Reducer Queue
MR.Flow

• Web based M/R editor
   – http://www.mr-flow.com
   – Reusable M/R modules
   – Execution & status monitoring (Hadoop clusters)




Contents



                    Part III

Some Map/Reduce
   algorithms


General considerations

• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished
  (dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map /
  before the Reduce phase
• Number & size of key/value pairs
   – Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
   – Use Combiners
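
A small illustration of the last point, contrasting a naive mapper with one that aggregates locally before anything is serialised (in-mapper combining); fewer, larger pairs mean less object creation and less M->R transfer:

```python
from collections import Counter

def mapper_naive(doc_id, text):
    # One (term, 1) pair per token: many small objects to serialise & shuffle
    for term in text.split():
        yield term, 1

def mapper_combining(doc_id, text):
    # Aggregate inside the map task, emit one pair per distinct term
    for term, count in Counter(text.split()).items():
        yield term, count

text = "to be or not to be"
print(len(list(mapper_naive("d1", text))))      # 6 pairs cross the network
print(len(list(mapper_combining("d1", text))))  # 4 pairs
```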

Graph algorithms

• Very suitable for M/R processing
   – Data (graph node) locality
   – “spreading activation” type of processing
   – Some algorithms with sequential dependency not suitable for M/R
       • Breadth-first search algorithms better than depth-first

• General Approach
   – Graph represented by adjacency lists
   – Map task – input: node + its adjacency list; perform some analysis
     over the node link structure; output: target key + analysis result
   – Reduce task – aggregate values by key
   – Perform multiple iterations (with a termination criterion)




Social Network Analysis

• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
   – U (target user) is fixed and its friends list copied to all cluster nodes
     (“copy join”); each cluster node stores part of the social graph
   – In: (X, <friendsX>), i.e. the local data for the cluster node
   – Out:
       • if (U, X) are friends => (U, <friendsX ∖ friendsU>), i.e. the users who
         are friends of X but not already friends of U
       • nil otherwise

• Reduce task
   – In: (U, <<friendsA ∖ friendsU>, <friendsB ∖ friendsU>, … >), i.e. the FOAF
     lists for all users A, B, etc. who are friends with U
   – Out (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is
     its total number of occurrences in all FOAF lists (sort/rank the result!)
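
A toy version of this Map/Reduce pair ("∖" is plain set difference; the target user's friend list is the replicated "copy join" side, assumed to be in memory on every node):

```python
from collections import Counter

friends = {
    "u": {"a", "b"},
    "a": {"u", "b", "x", "y"},
    "b": {"u", "x"},
    "x": {"a", "b"},
}
U = "u"  # target user; friends[U] is the replicated "copy join" side

def map_foaf(person, their_friends):
    # Emit candidates: friends of X that U does not have yet (set difference)
    if person in friends[U]:
        yield U, their_friends - friends[U] - {U}

def reduce_foaf(user, candidate_sets):
    # Rank candidates by how many mutual friends vouch for them
    counts = Counter(c for s in candidate_sets for c in s)
    return user, counts.most_common()

sets = [s for p, fs in friends.items() for _, s in map_foaf(p, fs)]
print(reduce_foaf(U, sets))  # ('u', [('x', 2), ('y', 1)])
```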
PageRank with M/R




[Figure: PageRank with M/R, (C) Jimmy Lin]
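
The original slide is a diagram; as a hedged sketch, one common formulation of a PageRank iteration as a M/R job looks like this (the graph structure is re-emitted alongside the rank mass so the Reduce can rebuild the adjacency lists; damping d and node count are assumed constants):

```python
def map_pagerank(node, state):
    rank, neighbors = state
    yield node, ("graph", neighbors)        # pass the structure along
    if neighbors:
        share = rank / len(neighbors)
        for nbr in neighbors:
            yield nbr, ("rank", share)      # distribute rank mass over out-links

def reduce_pagerank(node, values, d=0.85, n_nodes=3):
    neighbors, mass = [], 0.0
    for kind, v in values:
        if kind == "graph":
            neighbors = v
        else:
            mass += v
    return node, ((1 - d) / n_nodes + d * mass, neighbors)

graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
for _ in range(50):                         # one M/R job per iteration
    buckets = {}
    for node, state in graph.items():
        for k, v in map_pagerank(node, state):
            buckets.setdefault(k, []).append(v)
    graph = dict(reduce_pagerank(k, vs) for k, vs in buckets.items())
print({n: round(r, 3) for n, (r, _) in graph.items()})
# ≈ {'a': 0.388, 'b': 0.215, 'c': 0.397}
```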




Text Indexing & Retrieval

• Indexing is very suitable for M/R
   – Focus on scalability, not on latency & response time
   – Batch oriented
• Map task
   – emit (Term, (DocID, position))
• Reduce task
   – Group pairs by Term and sort by DocID




Text Indexing & Retrieval (2)




[Figure: inverted indexing with M/R, (C) Jimmy Lin]



Text Indexing & Retrieval (3)

• Retrieval not suitable for M/R
   – Focus on response time
   – Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
   – http://katta.sourceforge.net/
   – Distributed Lucene indexing with Hadoop (HDFS)
   – Multicast querying & ranking




Useful links

• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential
  Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide




Q&A




    Questions?




