Large Scale Data Analysis with
     Map/Reduce, part I
           Marin Dimitrov
        (technology watch #1)


              Feb 2010
Contents

• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
   –   Hadoop (Yahoo/Apache)
   –   Cloud MapReduce (Accenture)
   –   Elastic MapReduce (Hadoop on AWS)
   –   MR.Flow
• Some M/R algorithms
   – Graph algorithms, Text Indexing & retrieval



Contents



                       Part I

Distributed computing
      frameworks


Scalability & Parallelisation

• Scalability approaches
   – Scale up (vertical scaling)
       • Only one direction of improvement (bigger box)
   – Scale out (horizontal scaling)
       • Two directions – add more nodes + scale up each node
        • Can achieve roughly 4x the performance of a similarly priced scale-up
          system (ref?)
   – Hybrid (“scale out in a box”)
• Hard to parallelise
   – Algorithms with state
   – Dependencies from one iteration to another (recurrence, induction)




Parallelisation approaches

• Parallelisation approaches
   – Task decomposition
        • Distribute coarse-grained (synchronisation-wise) and computationally
         expensive tasks (otherwise too much coordination/management
         overhead)
       • Dependencies - execution order vs. data dependencies
       • Move the data to the processing (when needed)
   – Data decomposition
       • Each parallel task works with a data partition assigned to it (no sharing)
       • Data has regular structure, i.e. chunks expected to need the same
         amount of processing time
       • Two criteria: granularity (size of chunk) and shape (data exchange
         between chunk neighbours)
       • Move the processing to the data



Amdahl’s law

• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for
  parallelisation and by the serial processing part
• Amdahl’s law
   – max_speedup = 1 / ((1 - P) + P / N)

   – P: proportion of the program that can be parallelised (1-P still
     remains serial or overhead)
   – N: number of processors / parallel nodes
   – Example: P=75% (i.e. 25% serial or overhead)
  N (parallel nodes)    2         4         8        16        32     1024      64K
  Max speedup           1.60      2.29      2.91     3.37      3.66   3.99      3.99
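
A few lines of Python reproduce the table above (the 64K entry rounds to 4.0; the slide rounds it to 3.99):

```python
def max_speedup(p: float, n: int) -> float:
    """Amdahl's law: upper bound on speedup for parallel fraction p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# Reproduces the table for P = 0.75 (25% serial or overhead)
for n in (2, 4, 8, 16, 32, 1024, 65536):
    print(n, round(max_speedup(0.75, n), 2))
# 2 1.6 | 4 2.29 | 8 2.91 | 16 3.37 | 32 3.66 | 1024 3.99 | 65536 4.0
```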


Map/Reduce

• Google (OSDI 2004 paper), US patent granted (2010)
• General idea - co-locate data with computation nodes
   – Data decomposition (parallelization) – no data/order dependencies
     between tasks (except the Map-to-Reduce phase)
   – Try to utilise data locality (bandwidth is $$$)
   – Implicit data flow (higher abstraction level than MPI)
   – Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
   – Map - for each input (Ki,Vi) produce zero or more output pairs
     (Km,Vm)
   – Combine – optional intermediate aggregation (less M->R data
     transfer)
   – Reduce - for input pair (Km, list(V1,V2,…, Vn)) produce zero or more
     output pairs (Kr,Vr)
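
The contract above fits in a few lines; below is a toy single-process sketch of the Map / optional Combine / shuffle & sort / Reduce dataflow (assuming in-memory lists; real frameworks distribute each phase across nodes, but the structure is the same):

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Toy model: map -> optional combine (per map task) -> shuffle/sort -> reduce."""
    mapped = []
    for key, value in records:                 # one "map task" per input record
        pairs = list(mapper(key, value))
        if combiner is not None:               # local aggregation: less M->R transfer
            local = defaultdict(list)
            for k, v in pairs:
                local[k].append(v)
            pairs = [out for k, vs in local.items() for out in combiner(k, vs)]
        mapped.extend(pairs)
    mapped.sort(key=itemgetter(0))             # the framework's shuffle & sort
    results = []
    for k, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(k, [v for _, v in group]))  # (Km, list(V1..Vn))
    return results

# Tiny usage example (word count, see the next slides):
docs = [("doc1", "a b a"), ("doc2", "b")]
wc = run_mapreduce(docs,
                   mapper=lambda k, text: ((t, 1) for t in text.split()),
                   reducer=lambda term, counts: [(term, sum(counts))])
print(wc)  # [('a', 2), ('b', 2)]
```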
Map/Reduce (2)




[Figure: Map/Reduce execution overview, (C) Jimmy Lin]


Map/Reduce - examples

• In other words…
   – Map = partitioning of the data (compute part of a problem across
     several servers)
   – Reduce = processing of the partitions (aggregate the partial results
     from all servers into a single resultset)
   – The M/R framework takes care of grouping of partitions by key
• Example: word count
   – Map (1 task per document in the collection)
       • In: doc_x
       • Out: (term_1, count_{1,x}), (term_2, count_{2,x}), … where count_{i,x}
         is the number of occurrences of term_i in doc_x
   – Reduce (1 task per term in the collection)
       • In: (term_1, <count_{1,x}, count_{1,y}, …, count_{1,z}>)
       • Out: (term_1, SUM(count_{1,x}, count_{1,y}, …, count_{1,z}))
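
A minimal sketch of this word count, with a plain dict standing in for the framework's group-by-key step (the per-document Counter plays the role of the optional Combiner):

```python
from collections import Counter, defaultdict

def map_word_count(doc_id, text):
    # One map task per document: emit (term, local count)
    for term, count in Counter(text.lower().split()).items():
        yield term, count

def reduce_word_count(term, counts):
    yield term, sum(counts)

docs = {"doc1": "the chicken and the egg", "doc2": "the road"}
shuffle = defaultdict(list)                  # the framework's group-by-key
for doc_id, text in docs.items():
    for term, c in map_word_count(doc_id, text):
        shuffle[term].append(c)
totals = dict(kv for term, vs in shuffle.items()
              for kv in reduce_word_count(term, vs))
print(totals)  # {'the': 3, 'chicken': 1, 'and': 1, 'egg': 1, 'road': 1}
```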


Map/Reduce - examples (2)
• Example: Shortest path in graph (naïve)
   – Map: in (node_in, dist); out (node_out, dist+1) for each edge node_in -> node_out
   – Reduce: in (node_r, <dist_{a,r}, dist_{b,r}, …, dist_{c,r}>); out (node_r,
     MIN(dist_{a,r}, dist_{b,r}, …, dist_{c,r}))
   – Multiple M/R iterations required, start with (node_start, 0); a toy version
     of one such iteration is sketched below
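
A toy single-process sketch of this iterative scheme (assuming an in-memory adjacency list; the node's own distance is re-emitted so the Reduce MIN never loses it):

```python
def map_sssp(node, state):
    dist, neighbors = state
    yield node, dist                      # preserve the node's own distance
    if dist != float("inf"):
        for nbr in neighbors:
            yield nbr, dist + 1           # the slide's "dist+1"

def reduce_sssp(node, dists):
    return node, min(dists)

# Distances start at 0 for the source node, "infinity" elsewhere
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
dist = {n: (0 if n == "s" else float("inf")) for n in graph}

changed = True
while changed:                            # one M/R job per BFS frontier
    buckets = {}
    for n in graph:
        for k, v in map_sssp(n, (dist[n], graph[n])):
            buckets.setdefault(k, []).append(v)
    new = dict(reduce_sssp(k, vs) for k, vs in buckets.items())
    changed, dist = (new != dist), new
print(dist)  # {'s': 0, 'a': 1, 'b': 1, 'c': 2}
```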
• Example: Inverted indexing (full text search)
   – Map
       • In: doc_x
       • Out: (term_1, (doc_x, pos'_{1,x})), (term_1, (doc_x, pos''_{1,x})),
         (term_2, (doc_x, pos_{2,x})), …
   – Reduce
       • In: (term_1, <(doc_x, pos'_{1,x}), (doc_x, pos''_{1,x}), (doc_y, pos_{1,y}),
         …, (doc_z, pos_{1,z})>)
       • Out: (term_1, <(doc_x, <pos'_{1,x}, pos''_{1,x}, …>), (doc_y, <pos_{1,y}>),
         …, (doc_z, <pos_{1,z}>)>)



Map/Reduce - examples (3)

• Inverted index example rundown
• input
   – Doc1: “Why did the chicken cross the road?”
   – Doc2: “The chicken and egg problem”
   – Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
   – map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)),
     (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)),
     (“road”,(doc1,7))
   – map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)),
     (“egg”,(doc2,4)), (“problem”, (doc2,5))
   – map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))



Map/Reduce - examples (4)

• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
   –   (“why”, <(doc1,1)>),
   –   (“did”, <(doc1,2)>),
   –   (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
   –   (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
   –   (“cross”, <(doc1,5)>)
   –   (“road”, <(doc1,7)>)
   –   (“and”, <(doc2,3)>)
   –   (“egg”, <(doc2,4)>)
   –   (“problem”, <(doc2,5)>)
   –   (“kentucky”, <(doc3,1)>)
   –   (“fried”, <(doc3,2)>)

Map/Reduce - examples (5)

• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
   –   (“why”, <(doc1,<1>)>),
   –   (“did”, <(doc1,<2>)>),
   –   (“the”, <(doc1, <3,6>), (doc2, <1>)>)
   –   (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
   –   (“cross”, <(doc1,<5>)>)
   –   (“road”, <(doc1,<7>)>)
   –   (“and”, <(doc2,<3>)>)
   –   (“egg”, <(doc2,<4>)>)
   –   (“problem”, <(doc2,<5>)>)
   –   (“kentucky”, <(doc3,<1>)>)
   –   (“fried”, <(doc3,<2>)>)
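
The whole rundown in a few lines of Python, again with a dict standing in for the framework's shuffle & sort:

```python
import re
from collections import defaultdict

docs = {
    "doc1": "Why did the chicken cross the road?",
    "doc2": "The chicken and egg problem",
    "doc3": "Kentucky Fried Chicken",
}

def map_invert(doc_id, text):
    # One map task per document: emit (term, (doc, position))
    for pos, term in enumerate(re.findall(r"\w+", text.lower()), start=1):
        yield term, (doc_id, pos)

postings = defaultdict(list)                 # shuffle & sort: group by term
for doc_id, text in docs.items():
    for term, p in map_invert(doc_id, text):
        postings[term].append(p)

def reduce_invert(term, hits):
    # Collapse (doc, pos) pairs into doc -> sorted position list
    by_doc = defaultdict(list)
    for doc_id, pos in sorted(hits):
        by_doc[doc_id].append(pos)
    return term, dict(by_doc)

index = dict(reduce_invert(t, hs) for t, hs in postings.items())
print(index["the"])      # {'doc1': [3, 6], 'doc2': [1]}
print(index["chicken"])  # {'doc1': [4], 'doc2': [2], 'doc3': [3]}
```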

Map/Reduce – pros & cons

• Good for
   – Lots of input, intermediate & output data
   – Little or no synchronisation required
   – “Read once”, batch oriented datasets (ETL)
• Bad for
   –   Fast response time
   –   Large amounts of shared data
   –   Fine-grained synchronisation required
   –   CPU intensive operations (as opposed to data intensive)




Dryad

• Microsoft Research (2007),
  http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
   – Focus on throughput, not latency
   – Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
   – Vertices -> processes (processing nodes)
   – Edges -> communication channels between the processes
• DAG model benefits
   – Generic scheduler
   – No deadlocks / deterministic
   – Easier fault tolerance

Dryad DAG jobs




[Figure: example Dryad job DAGs, (C) Michael Isard]

Dryad (3)

• The job graph can mutate during execution (runtime graph refinement)
• Channel types (one way)
   –   Files on a DFS
   –   Temporary file
   –   Shared memory FIFO
   –   TCP pipes
• Fault tolerance
   – Node fails => re-run
   – Input disappears => re-run upstream node
   – Node is slow => run a duplicate copy at another node, get first result




Dryad architecture & components




[Figure: Dryad architecture & components, (C) Mihai Budiu]




Dryad programming

• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
   – Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
   – LINQ to Dryad translator
• Distributed shell
   – Generalisation of the Unix shell & pipes
   – Many inputs/outputs per process!
   – Pipes span multiple machines




Dryad vs. Map/Reduce




[Figure: Dryad vs. Map/Reduce comparison, (C) Mihai Budiu]


Contents



                       Part II

Open Source Map/Reduce
      frameworks


Hadoop

• Originated in Apache Nutch (2004); Yahoo is currently the major
  contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
   –   HDFS – distributed filesystem
   –   HBase – distributed column store
   –   Pig – high level data flow language (Pig Latin)
   –   Hive – Hadoop based data warehouse
   –   ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
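
The Pipes/Streaming entry above lets any executable act as mapper or reducer over stdin/stdout; a minimal Streaming word-count pair might look like this (submitted with the hadoop-streaming jar and the -input/-output/-mapper/-reducer flags; the jar path varies across Hadoop releases):

```python
#!/usr/bin/env python
# mapper.py - read raw lines on stdin, emit one "term<TAB>1" pair per token
import sys

for line in sys.stdin:
    for term in line.strip().split():
        print("%s\t1" % term.lower())
```

```python
#!/usr/bin/env python
# reducer.py - Streaming input arrives sorted by key, so a single pass
# can sum the counts for each run of identical terms
import sys

current, total = None, 0
for line in sys.stdin:
    term, count = line.rstrip("\n").split("\t", 1)
    if term != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = term, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```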


Hadoop - Map/Reduce

• Components
  – Job client
  – Job Tracker
      • Only one
      • Scheduling, coordinating, monitoring, failure handling
  – Task Tracker
      • Many
       • Executes tasks received from the Job Tracker
      • Sends “heartbeats” and progress reports back to the Job Tracker
  – Task Runner
      • The actual Map or Reduce task started in a separate JVM
      • Crashes & failures do not affect the Task Tracker on the node!




Hadoop - Map/Reduce (2)




[Figure: Hadoop Map/Reduce job anatomy, (C) Tom White]


Hadoop - Map/Reduce (3)

• Integrated with HDFS
   – Map tasks executed on the HDFS node where the data is (data
     locality => reduced traffic)
   – Data locality is not possible for Reduce tasks
   – Intermediate outputs of Map tasks (nodes) are not stored on HDFS,
     but locally, and then sent to the proper Reduce task (node)
• Status updates
   – Task Runner => Task Tracker, progress updates every 3s
   – Task Tracker => Job Tracker, heartbeat + progress for all local tasks
     every 5s
   – If a task has no progress report for too long, it will be considered
     failed and re-started



Hadoop - Map/Reduce (4)

• Some extras
   – Counters
       •   Gather stats about a task
       •   Globally aggregated (Task Runner => Task Tracker => Job Tracker)
       •   M/R counters: M/R input records, M/R output records
       •   Filesystem counters: bytes read/written
       •   Job counters: launched M/R tasks, failed M/R tasks, …
   – Joins
       • Copy the small set on each node and perform joins locally. Useful when
         one dataset is very large, the other very small (e.g. “Scalable Distributed
         Reasoning using MapReduce” from VUA)
       • Map side join – data is joined before the Map function, very efficient but
         less flexible (datasets must be partitioned & sorted in a particular way)
       • Reduce side join – more general but less efficient (Map generates (K,V)
         pairs using the join key)
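
A sketch of the first ("copy join" / replicated join) variant, with the small dataset broadcast to every map task (in Hadoop this is typically distributed via the DistributedCache); no (K,V) pairs need to travel to a Reduce phase at all:

```python
# Small dataset, replicated to every node before the job starts:
cities = {1: "Sofia", 2: "Vienna"}                 # id -> name

def map_join(offset, record):
    # The large dataset streams through map tasks; the join happens locally
    user, city_id = record
    yield user, (city_id, cities.get(city_id, "unknown"))

orders = [("alice", 1), ("bob", 2), ("carol", 1)]  # the large side
print([kv for i, rec in enumerate(orders) for kv in map_join(i, rec)])
# [('alice', (1, 'Sofia')), ('bob', (2, 'Vienna')), ('carol', (1, 'Sofia'))]
```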


Hadoop - Map/Reduce (5)

• Built-in mappers and reducers
   – Chain – run a chain/pipe of sequential Maps (M+ R M*, i.e. one or more
     Maps, a Reduce, then zero or more Maps). The last Map output is the
     Task output
   – FieldSelection – select a list of fields from the input dataset to be
     used as MR keys/values
   – TokenCounterMapper, SumReducer – (remember the “word count”
     example?)
   – RegexMapper – matches a regex in the input key/value pairs




Cloud MapReduce

• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB,
  SQS)
   – fast (reported as up to 60 times faster than Hadoop/EC2 in some
     cases)
   – scalable & robust (no single point of bottleneck or failure)
   – simple (3 KLOC)
• Features
   – No need for centralised coordinator (JobTracker), just put job status
     in the cloud datastore (SimpleDB)
   – All data transfer & communication is handled by the Cloud
   – All I/O and storage is handled by the Cloud
Cloud MapReduce (2)




[Figure: Cloud MapReduce architecture, (C) Ricky Ho]



Cloud MapReduce (3)

• Job client workflow
   1.   Store input data (S3)
   2.   Create a Map task for each data split & put it into the Mapper
        Queue (SQS)
   3.   Create multiple Partition Queues (SQS)
   4.   Create Reducer Queue (SQS) & put a Reduce task for each Partition
        Queue
   5.   Create the Output Queue (SQS)
   6.   Create a Job Request (ref to all queues) and put it into SimpleDB
   7.   Start EC2 instances for Mappers & Reducers
   8.   Poll SimpleDB for job status
   9.   When the job completes, download results from S3
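
For illustration, steps 1-6 might look roughly like this using today's boto3 SDK (the actual Cloud MapReduce implementation is Java; all bucket, queue and domain names here are hypothetical):

```python
import json
import boto3  # illustrative only; the original Cloud MapReduce code is Java

s3, sqs = boto3.client("s3"), boto3.client("sqs")
sdb = boto3.client("sdb")                      # SimpleDB
bucket, job_id = "my-cmr-bucket", "job-42"     # hypothetical names

# 1-2. store input splits (S3) and enqueue one Map task per split (SQS)
map_q = sqs.create_queue(QueueName=f"{job_id}-map")["QueueUrl"]
for split in ("part-0", "part-1"):
    s3.put_object(Bucket=bucket, Key=f"{job_id}/{split}", Body=b"...")
    sqs.send_message(QueueUrl=map_q, MessageBody=json.dumps({"split": split}))

# 3-5. partition queues, reduce queue (one task per partition), output queue
part_qs = [sqs.create_queue(QueueName=f"{job_id}-part-{p}")["QueueUrl"]
           for p in range(4)]
reduce_q = sqs.create_queue(QueueName=f"{job_id}-reduce")["QueueUrl"]
for q in part_qs:
    sqs.send_message(QueueUrl=reduce_q, MessageBody=json.dumps({"partition": q}))
out_q = sqs.create_queue(QueueName=f"{job_id}-out")["QueueUrl"]

# 6. register the job (queue references + status) in SimpleDB
sdb.create_domain(DomainName="cmr-jobs")
sdb.put_attributes(DomainName="cmr-jobs", ItemName=job_id, Attributes=[
    {"Name": "map_queue", "Value": map_q},
    {"Name": "reduce_queue", "Value": reduce_q},
    {"Name": "output_queue", "Value": out_q},
])
# 7-9: start EC2 workers, poll SimpleDB for status, fetch results from S3
```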



Cloud MapReduce (4)

• Mapper workflow
   1.   Dequeue a Map task from the Mapper Queue
   2.   Fetch data from S3
   3.   Perform user-defined map function, add output (Km,Vm) pairs to the
        Partition Queue selected by hash(Km) => several partition keys may
        share the same partition queue!
   4.   When done remove Map task from Mapper Queue
• Reducer workflow
   1.   Dequeue a Reduce task from the Reducer Queue
   2.   Dequeue the (Km,Vm) pairs from the corresponding Partition Queue
        => several partitions may share the same queue!
   3.   Perform a user defined reduce function and add output pairs (Kr,Vr)
        to the Output Queue
   4.   When done remove the Reduce task from the Reducer Queue
MR.Flow

• Web based M/R editor
   – http://www.mr-flow.com
   – Reusable M/R modules
   – Execution & status monitoring (Hadoop clusters)




Contents



                    Part III

Some Map/Reduce
   algorithms


General considerations

• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished
  (dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map /
  before the Reduce phase
• Number & size of key/value pairs
   – Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
   – Use Combiners
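
A small illustration of the last point, contrasting a naive mapper with one that aggregates locally before anything is serialised (in-mapper combining); fewer, larger pairs mean less object creation and less M->R transfer:

```python
from collections import Counter

def mapper_naive(doc_id, text):
    # One (term, 1) pair per token: many small objects to serialise & shuffle
    for term in text.split():
        yield term, 1

def mapper_combining(doc_id, text):
    # Aggregate inside the map task, emit one pair per distinct term
    for term, count in Counter(text.split()).items():
        yield term, count

text = "to be or not to be"
print(len(list(mapper_naive("d1", text))))      # 6 pairs cross the network
print(len(list(mapper_combining("d1", text))))  # 4 pairs
```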

Graph algorithms

• Very suitable for M/R processing
   – Data (graph node) locality
   – “spreading activation” type of processing
   – Some algorithms with sequential dependency not suitable for M/R
       • Breadth-first search algorithms better than depth-first

• General Approach
   – Graph represented by adjacency lists
   – Map task – input: node + its adjacency list; perform some analysis
     over the node link structure; output: target key + analysis result
   – Reduce task – aggregate values by key
   – Perform multiple iterations (with a termination criterion)




Social Network Analysis

• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
   – U (target user) is fixed and its friends list copied to all cluster nodes
     (“copy join”); each cluster node stores part of the social graph
   – In: (X, <friendsX>), i.e. the local data for the cluster node
   – Out:
       • if (U, X) are friends => (U, <friendsX ∖ friendsU>), i.e. the users who
         are friends of X but not already friends of U
       • nil otherwise

• Reduce task
   – In: (U, <<friendsA ∖ friendsU>, <friendsB ∖ friendsU>, … >), i.e. the FOAF
     lists for all users A, B, etc. who are friends with U
   – Out (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is
     its total number of occurrences in all FOAF lists (sort/rank the result!)
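
A toy version of this Map/Reduce pair ("∖" is plain set difference; the target user's friend list is the replicated "copy join" side, assumed to be in memory on every node):

```python
from collections import Counter

friends = {
    "u": {"a", "b"},
    "a": {"u", "b", "x", "y"},
    "b": {"u", "x"},
    "x": {"a", "b"},
}
U = "u"  # target user; friends[U] is the replicated "copy join" side

def map_foaf(person, their_friends):
    # Emit candidates: friends of X that U does not have yet (set difference)
    if person in friends[U]:
        yield U, their_friends - friends[U] - {U}

def reduce_foaf(user, candidate_sets):
    # Rank candidates by how many mutual friends vouch for them
    counts = Counter(c for s in candidate_sets for c in s)
    return user, counts.most_common()

sets = [s for p, fs in friends.items() for _, s in map_foaf(p, fs)]
print(reduce_foaf(U, sets))  # ('u', [('x', 2), ('y', 1)])
```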
PageRank with M/R




[Figure: PageRank with M/R, (C) Jimmy Lin]
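
The original slide is a diagram; as a hedged sketch, one common formulation of a PageRank iteration as a M/R job looks like this (the graph structure is re-emitted alongside the rank mass so the Reduce can rebuild the adjacency lists; damping d and node count are assumed constants):

```python
def map_pagerank(node, state):
    rank, neighbors = state
    yield node, ("graph", neighbors)        # pass the structure along
    if neighbors:
        share = rank / len(neighbors)
        for nbr in neighbors:
            yield nbr, ("rank", share)      # distribute rank mass over out-links

def reduce_pagerank(node, values, d=0.85, n_nodes=3):
    neighbors, mass = [], 0.0
    for kind, v in values:
        if kind == "graph":
            neighbors = v
        else:
            mass += v
    return node, ((1 - d) / n_nodes + d * mass, neighbors)

graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
for _ in range(50):                         # one M/R job per iteration
    buckets = {}
    for node, state in graph.items():
        for k, v in map_pagerank(node, state):
            buckets.setdefault(k, []).append(v)
    graph = dict(reduce_pagerank(k, vs) for k, vs in buckets.items())
print({n: round(r, 3) for n, (r, _) in graph.items()})
# ≈ {'a': 0.388, 'b': 0.215, 'c': 0.397}
```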




Text Indexing & Retrieval

• Indexing is very suitable for M/R
   – Focus on scalability, not on latency & response time
   – Batch oriented
• Map task
   – emit (Term, (DocID, position))
• Reduce task
   – Group pairs by Term and sort by DocID




Text Indexing & Retrieval (2)




[Figure: inverted indexing with M/R, (C) Jimmy Lin]



Text Indexing & Retrieval (3)

• Retrieval not suitable for M/R
   – Focus on response time
   – Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
   – http://katta.sourceforge.net/
   – Distributed Lucene indexing with Hadoop (HDFS)
   – Multicast querying & ranking




Useful links

• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential
  Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide




Q&A




    Questions?




