MapReduce basics
        Harisankar H,
     PhD student, DOS lab,
     Dept. CSE, IIT Madras

          6-Feb-2013

http://harisankarh.wordpress.com
Distributed processing ?
 • Processing distributed across multiple
   machines/servers




Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
Why distributed processing?
– Reduce execution time of large jobs
   • E.g., extracting URLs from terabytes of data
   • 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
   • Other nodes take over the jobs if some of the
     nodes fail
      – Typically, if you have 10,000 servers, on average one will
        fail per day
Issues in distributed processing
• Traditionally realized using special-purpose
  implementations
   – E.g., indexer, log processor
• Implementation is really hard at the socket-programming level
   – Fault-tolerance
      • Keep track of failure, reassignment of tasks
   – Hand-coded parallelization
   – Scheduling across heterogeneous nodes
   – Locality
      • Minimise movement of data for computation
   – How to distribute data?
• Results in:
   – Complex, brittle, non-generic code
   – Reimplementation of common features like fault-tolerance,
     distribution
Need for a generic abstraction for
         distributed processing

App programmer <-> abstraction <-> systems developer

                   Separation of concerns

                  App programmer: express app logic
                  Systems developer: performance, fault handling etc.

 • Tradeoff between genericity and performance
   – More generic => usually less performance
 • MapReduce is probably a sweet spot where you
   get both to some extent
MapReduce abstraction (app programmer’s view)
  • Model input and output as <key,value> pairs
  • Provide map() and reduce() functions which
    act on <k,v> pairs
  • Input: set of <k,v> pairs: {k,v}
      – For each input <k,v>:
             map(k1,v1) → list(k2,v2)
      – For each unique output key from map:
             reduce(k2, combined list(v2)) → list(v3)

The system takes care of distributing the tasks across thousands of machines,
handling locality, fault-tolerance, etc.
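The two-function contract above is all the app programmer writes. As a rough illustration (plain Java with made-up class and interface names — a sketch of the abstraction, not the Hadoop API), the framework's job boils down to: call map() on every input pair, group the intermediate pairs by key (the shuffle), and call reduce() once per unique key:

```java
import java.util.*;

public class MiniMapReduce {
    // Hypothetical single-machine stand-ins for the user-supplied functions
    interface Mapper { List<Map.Entry<String, String>> map(String k1, String v1); }
    interface Reducer { String reduce(String k2, List<String> v2s); }

    // The "framework": map every input pair, shuffle by key, reduce per key
    public static Map<String, String> run(Map<String, String> input, Mapper m, Reducer r) {
        Map<String, List<String>> groups = new TreeMap<>();    // shuffle: group by k2
        for (Map.Entry<String, String> e : input.entrySet())
            for (Map.Entry<String, String> kv : m.map(e.getKey(), e.getValue()))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, String> output = new TreeMap<>();          // one reduce() per unique k2
        for (Map.Entry<String, List<String>> g : groups.entrySet())
            output.put(g.getKey(), r.reduce(g.getKey(), g.getValue()));
        return output;
    }

    public static void main(String[] args) {
        // Word count expressed in the abstraction: map emits (word, "1"),
        // reduce counts the list of "1"s for each word
        Mapper m = (doc, text) -> {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            for (String w : text.split(" ")) out.add(Map.entry(w, "1"));
            return out;
        };
        Reducer r = (word, ones) -> Integer.toString(ones.size());
        System.out.println(run(Map.of("doc1", "a b a", "doc2", "b c"), m, r));
        // prints {a=2, b=2, c=1}
    }
}
```

The grouping step is exactly the work the system performs between the map and reduce phases; here it is a single in-memory map, while a real engine does it across machines.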
Example: word count
• Problem:
   – Count the number of occurrences of each unique
     word in a big collection of documents
• Input <k,v> set:
   – <document name, document contents>
      • Organize the files in this format
• Output:
   – <word, count>
      • Get it in output files
• Next step:
   – Define the map() and reduce() functions
Word count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Program in Java

public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
Implementing MapReduce abstraction

App programmer <-> abstraction <-> systems developer


 • Looked at the application programmer’s view
 • Need a platform which implements the
   MapReduce abstraction
 • Hadoop is a popular open-source
   implementation of the MapReduce abstraction
 • Questions for the platform developer
   – How to
      •   parallelize ?
      •   handle faults ?
      •   provide locality ?
      •   distribute the data ?
Basics of platform implementation
• parallelize ?
   – Each map can be executed independently in parallel
   – After all maps have finished execution, all reduces can be
     executed in parallel
• handle faults ?
   – map() and reduce() have no internal state
      • Simply re-execute in case of a failure
• distribute the data ?
   – Have a distributed file system (e.g., HDFS)
• provide locality ?
   – Prefer to execute map() on the nodes holding its input <k,v>
     pair
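Because map() and reduce() keep no internal state, re-execution alone is a complete fault-handling strategy: there is nothing to roll back. A minimal sketch of the idea (the runner and its names are ours, not from any MR implementation):

```java
import java.util.function.Supplier;

public class RetryingRunner {
    // Re-execute a stateless task until it succeeds or attempts run out.
    // Safe only because the task has no side effects on shared state.
    public static <T> T runWithRetry(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();          // e.g. execute one map task
            } catch (RuntimeException e) {  // e.g. worker crashed mid-task
                last = e;                   // no cleanup needed: task kept no state
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] failures = {2};  // simulate a worker that fails twice, then succeeds
        int result = runWithRetry(() -> {
            if (failures[0]-- > 0) throw new RuntimeException("worker lost");
            return 42;  // the (re)computed task output
        }, 5);
        System.out.println(result);  // 42
    }
}
```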
MapReduce implementation
• Distributed File System (DFS) +
  MapReduce (MR) engine
  – Specifically, the MR engine uses a DFS
• Distributed file system
  – Files split into large chunks and stored in the
    distributed file system (e.g., HDFS)
  – Large chunks: typically 64 MB per block
  – Can have a master-slave architecture
     • Master assigns and manages replicated blocks in the
       slaves
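The chunking arithmetic is plain ceiling division; a small sketch (helper names are ours, 64 MB block size as on the slide):

```java
public class BlockSplitter {
    static final long BLOCK_SIZE = 64L * 1024 * 1024;  // 64 MB per block, as on the slide

    // Number of blocks needed to store a file of the given size (ceiling division)
    public static long numBlocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Byte offset at which block i of a file starts
    public static long blockStart(long i) {
        return i * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;       // a 200 MB file
        System.out.println(numBlocks(fileSize));  // 4 blocks: 64 + 64 + 64 + 8 MB
    }
}
```

Each block is then replicated across slaves by the master; the block boundaries also become natural units for assigning map tasks close to the data.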
MapReduce engine
• Has a master-slave architecture
  – Master co-ordinates the task execution across
    workers
  – Workers perform the map() and reduce()
    functions
     • Read and write blocks to/from the DFS
  – Master keeps track of worker failures and
    reassigns tasks if necessary
     • Failure detection usually done through timeouts
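Timeout-based failure detection can be sketched as the master keeping the last heartbeat time per worker (a simplified, single-threaded illustration with an explicit clock parameter; a real master would use wall-clock time and handle concurrency):

```java
import java.util.*;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // A worker reports in; the master records the heartbeat time
    public void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout are presumed failed,
    // and their tasks become candidates for reassignment
    public Set<String> suspectedFailed(long nowMillis) {
        Set<String> failed = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) failed.add(e.getKey());
        }
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor(10_000);  // 10 s timeout
        m.heartbeat("worker-1", 0);
        m.heartbeat("worker-2", 0);
        m.heartbeat("worker-2", 8_000);                     // worker-2 keeps reporting
        System.out.println(m.suspectedFailed(15_000));      // [worker-1]
    }
}
```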
(Slide: diagram of the master and workers communicating over the network)
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
  – Model map() and reduce() jobs appropriately
  – Use combine() functions
     • combine(<k,[v]>) → <k,[v]>
     • combine() executes after the map()s finish on each node
         – map() →[same node]→ combine() →[network]→ reduce()

• Make map jobs of roughly equal expected
  execution times
• Try to make reduce() jobs less skewed
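A combiner is just a local reduce run on one node's map output before anything crosses the network; for word count, summing locally means at most one pair per distinct word is shipped. A sketch (class and method names are ours):

```java
import java.util.*;

public class CombinerDemo {
    // combine(): locally sum the 1s emitted by map() on one node,
    // so only one pair per distinct word crosses the network
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // map() on one node emitted 6 pairs for the text "to be or not to be"
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : "to be or not to be".split(" ")) mapOutput.add(Map.entry(w, 1));

        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size());  // 6 pairs before combining
        System.out.println(combined.size());   // 4 pairs shipped to reducers
    }
}
```

This only works because addition is associative and commutative: combine() must produce intermediate values that the real reduce() can still merge correctly.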
Pros and cons of MapReduce
• Advantages
  – Simple, easy-to-use distributed processing system
  – Reasonably generic
  – Exploits locality for performance
  – Simple and less buggy implementation
• Issues
  – Not a magic bullet that fits all problems
       • Difficult to model iterative and recursive computations
           – E.g.: k-means clustering
           – Generate-Map-Reduce
       • Difficult to model streaming computations
       • Centralized entities like the master become bottlenecks
       • Most real-world problems require large chains of MR jobs
Summary
  • Today
       –   Distributed processing issues, MR programming model
       –   Sample MR job
       –   How MR can be implemented
       –   Pros and cons of MR, tips for better performance
  • Tomorrow
       – Details specific to Hadoop
       – Downloading and setting up of Hadoop on a cluster

Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data
processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
Hadoop components
• HDFS
  – Master: NameNode
  – Slave: DataNode
• MapReduce engine
  – Master: JobTracker
  – Slave: TaskTracker
