Adaptive Execution Support for
   Malleable Computation
         Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Outline
• Introduce the key ideas of 3 selected papers
• Discussion
FORMLESS
• FORMLESS: Scalable Utilization of Embedded
  Manycores in Streaming Applications
  [LCTES’12]
  – Functionally-cOnsistent stRucturally-MalLEable
    Streaming Specification
  – Actor-oriented specification models
  – Space exploration scheme
     • to customize the application specification to better
       fit the target platform (a toy sketch follows)
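A minimal Python sketch of the idea, not the paper's actual notation: a hypothetical sort pipeline leaves its degree of data parallelism and chunk size open as structural parameters, and an exhaustive exploration loop scores each instantiation against a toy platform cost model (all names and cost numbers are illustrative assumptions).

```python
import itertools

# Hypothetical malleable spec: a sort pipeline whose degree of data
# parallelism and chunk size are left open (structural parameters).
PARAM_SPACE = {"num_workers": [1, 2, 4, 8, 16],
               "chunk_size": [64, 256, 1024]}

def instantiate(num_workers, chunk_size):
    """Fix the free parameters: split -> N sort workers -> merge."""
    workers = [f"sort_{i}" for i in range(num_workers)]
    edges = [("split", w) for w in workers] + [(w, "merge") for w in workers]
    return {"actors": ["split", *workers, "merge"],
            "edges": edges, "chunk_size": chunk_size}

def estimate_cost(graph, cores=8, items=4096):
    """Toy platform model: computation shrinks with usable cores,
    communication grows with message count and graph edges."""
    workers = len(graph["actors"]) - 2
    compute = items / min(workers, cores)
    comm = 2.0 * (items / graph["chunk_size"]) + 0.5 * len(graph["edges"])
    return compute + comm

# Exhaustive exploration of the (small) design space.
best = min((instantiate(w, c) for w, c in
            itertools.product(*PARAM_SPACE.values())),
           key=estimate_cost)
print(len(best["actors"]) - 2, "workers, chunk", best["chunk_size"])
```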
FORMLESS (cont.)
• Space exploration for platform-driven
  instantiation
FORMLESS (cont.)
• Example: FORMLESS specification of the sort
  example, with actor specifications and several
  example instantiations
Dynamic Load Balancing
• A Distributed and Adaptive Dynamic Load
  Balancing Scheme for Parallel Processing of
  Medium-Grain Tasks
  [IEEE Journal, 1990]
  – Challenge: Allocate and distribute tasks
    dynamically with minimum run time overhead.
  – Design: A distributed and adaptive load balancing
    scheme for medium-grain tasks
Dynamic Load Balancing (cont.)
• Key idea 1: Neighborhood average strategy
  – Attempts to balance load within a neighborhood
    by distributing tasks
     • such that all neighbors have loads close to the
       neighborhood average.
  – The decision of when to balance load is based on
    neighborhood state information, which is checked
    periodically.
     • Each processor maintains status information for all of
       its neighbors (see the sketch below).
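A toy Python sketch of one balancing round (the data structures, the threshold, and the push-only protocol are illustrative assumptions, not the paper's exact algorithm): an overloaded processor pushes tasks to its most idle neighbors until everyone is near the neighborhood mean.

```python
def balance_step(loads, neighbors, me, threshold=1):
    """One balancing round for processor `me` under the neighborhood
    average strategy: if `me` is above its neighborhood's average load,
    push tasks to underloaded neighbors until all are near the mean."""
    group = [me, *neighbors[me]]
    avg = sum(loads[p] for p in group) / len(group)
    surplus = loads[me] - avg
    if surplus <= threshold:              # only clearly overloaded nodes act
        return {}
    transfers = {}
    for p in sorted(neighbors[me], key=loads.get):   # most idle first
        deficit = avg - loads[p]
        if deficit <= 0 or surplus <= 0:
            break
        n = int(min(deficit, surplus))
        if n > 0:
            transfers[p] = n
            surplus -= n
    return transfers

# Example: processor 0 holds 12 tasks; its mesh neighbors hold 2 and 4.
loads = {0: 12, 1: 2, 2: 4}
print(balance_step(loads, {0: [1, 2]}, me=0))   # {1: 4, 2: 2}
```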
Dynamic Load Balancing (cont.)
• Key idea 2: Grain Size Control
  – If the cost of making work available to another
    processor exceeds the cost of executing it on the
    local processor, it does not pay to decompose and
    parallelize work below a certain size or granularity.
  – Granularity control: determine when to stop breaking
    a computation down into parallel subcomputations; at
    that frontier, a node is treated as a leaf and
    executed sequentially (sketched below).
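A sketch of the rule in Python, using a divide-and-conquer sum; the GRAIN constant stands in for the break-even point, which in practice would be derived from measured communication and execution costs rather than hard-coded.

```python
from concurrent.futures import ProcessPoolExecutor

# Assumed break-even grain: below this many items, shipping the chunk to
# another processor would cost more than just summing it locally.
GRAIN = 100_000

def chunk_sum(lo, hi):
    return sum(range(lo, hi))

def grain_controlled_sum(n, workers=4):
    """Decompose [0, n) into parallel chunks, but stop at GRAIN: a range
    at or below the grain is a leaf node and runs sequentially."""
    if n <= GRAIN:
        return chunk_sum(0, n)                 # not worth parallelizing
    bounds = list(range(0, n, GRAIN)) + [n]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(chunk_sum, lo, hi)
                   for lo, hi in zip(bounds, bounds[1:])]
        return sum(f.result() for f in futures)

if __name__ == "__main__":
    assert grain_controlled_sum(1_000_000) == sum(range(1_000_000))
```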
Adaptive Load Balancing
• Compiler and Run-Time Support for Adaptive
  Load Balancing in Software Distributed Shared
  Memory Systems
  [1998]
  – Use information provided by the compiler to help
    the run-time system distribute the work of the
    parallel loops
      • according to the relative power of the processors
      • while minimizing communication and page sharing
        (a toy partitioning sketch follows)
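A minimal sketch of that distribution policy (the function name and the speed values are illustrative; in the actual system the relative powers come from run-time measurement): the loop's iteration space is cut into contiguous per-processor ranges sized in proportion to speed.

```python
def partition_by_power(n_iters, speeds):
    """Split the iteration space [0, n_iters) into one contiguous range
    per processor, sized in proportion to relative processor power."""
    total, bounds, start = sum(speeds), [], 0
    for i, s in enumerate(speeds):
        # The last processor takes the remainder so the ranges tile exactly.
        end = (n_iters if i == len(speeds) - 1
               else start + round(n_iters * s / total))
        bounds.append((start, end))
        start = end
    return bounds

# Example: four processors, the third one twice as fast as the others.
print(partition_by_power(1000, [1.0, 1.0, 2.0, 1.0]))
# [(0, 200), (200, 400), (400, 800), (800, 1000)]
```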
Adaptive Load Balancing (cont.)
• Compile-Time Support for Load Balancing
    – The compiler is built on the SUIF system, which is
      organized as a set of compiler passes.
    – The SUIF pass extracts the shared data access
      patterns in each of the SPMD regions, and feeds
      this information to the run-time system.
        • also responsible for adding hooks in the parallelized
          code to allow the run-time library to change the load
          distribution

--------
SUIF: Stanford University Intermediate Format
SPMD: Single-Program Multiple-Data
Adaptive Load Balancing (cont.)
– Access pattern extraction
   • SUIF pass walks through the program looking for
     accesses to shared memory.
– Prefetching
   • Use the access pattern information to prefetch data
     through prefetching calls.
– Load balancing interface and strategy
   • The compiler can direct the run-time to choose
     between two partitioning strategies for distributing
     the parallel loops (both sketched below):
      1. Shifting of loop boundaries
      2. Multiple loop bounds
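The following sketch shows what the two strategies could look like for a loop of n iterations. Illustrative assumptions only: the block size and the weighted dealing scheme are not taken from the paper.

```python
def shift_boundaries(n, weights):
    """Strategy 1: each process keeps one contiguous range, but the cut
    points shift with the weights. Contiguous ranges touch the fewest
    pages, which limits page sharing."""
    total, out, start = sum(weights), [], 0
    for i, w in enumerate(weights):
        end = n if i == len(weights) - 1 else start + round(n * w / total)
        out.append([(start, end)])
        start = end
    return out

def multiple_bounds(n, weights, block=8):
    """Strategy 2: each process owns several disjoint blocks, dealt out
    in proportion to its weight. More flexible, but touches more pages."""
    total = sum(weights)
    credit = [0.0] * len(weights)
    out = [[] for _ in weights]
    for start in range(0, n, block):
        credit = [c + w / total for c, w in zip(credit, weights)]
        p = max(range(len(weights)), key=credit.__getitem__)
        credit[p] -= 1.0
        out[p].append((start, min(start + block, n)))
    return out

print(shift_boundaries(32, [1, 3]))   # [[(0, 8)], [(8, 32)]]
print(multiple_bounds(32, [1, 3]))    # proc 1 gets 3 of every 4 blocks
```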
Adaptive Load Balancing (cont.)
• Run-Time Load Balancing Support
  – The run-time library is responsible for keeping
    track of the progress of each process
     • collect statistics about the execution time of each
       parallel task, and
     • adjust the load accordingly
  – Load balancing vs. Locality management
     • need to avoid unnecessary movement of data and
       minimize page sharing
      • Locality-conscious load balancing: the run-time library
        uses the compiler-supplied information about which loop
        distribution strategy to use (see the sketch below).
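A toy sketch of that feedback loop (the class and method names are hypothetical, not the paper's run-time API): per-process timings are smoothed into speed estimates, and the next invocation of the loop is repartitioned from them.

```python
class LoopBalancer:
    """Toy run-time support: record how fast each process got through its
    share of a parallel loop, then repartition the next invocation."""

    def __init__(self, n_procs):
        self.speeds = [1.0] * n_procs          # uniform initial guess

    def record(self, proc, iters_done, elapsed):
        # Exponentially smooth the observed rate to damp noisy timings.
        rate = iters_done / max(elapsed, 1e-9)
        self.speeds[proc] = 0.5 * self.speeds[proc] + 0.5 * rate

    def next_partition(self, n_iters):
        """Contiguous ranges sized by relative speed (i.e. the
        'shifting loop boundaries' strategy)."""
        total, bounds, start = sum(self.speeds), [], 0
        for i, s in enumerate(self.speeds):
            end = (n_iters if i == len(self.speeds) - 1
                   else start + round(n_iters * s / total))
            bounds.append((start, end))
            start = end
        return bounds

bal = LoopBalancer(2)
bal.record(0, iters_done=500, elapsed=1.0)   # proc 0 ran at 500 iters/s
bal.record(1, iters_done=500, elapsed=2.0)   # proc 1 ran at 250 iters/s
print(bal.next_partition(1000))              # [(0, 666), (666, 1000)]
```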
Algorithms for Scheduling
• Scheduling Malleable Parallel Tasks: An
  Asymptotic Fully Polynomial-Time
  Approximation Scheme [2002]
• Mapping and Scheduling Heterogeneous Tasks
  using Genetic Algorithms [1995]


Editor's Notes

  • #5: Design space exploration for platform-driven instantiation of a FORMLESS specification.
  • #6: FORMLESS specification of the sort example: A) Actor specifications. B-D) Example instantiations.
  • #7: The scheme attempts to balance load within a neighborhood by distributing tasks such that all neighbors have loads close to the neighborhood average.
  • #9: In terms of processing time, the average grain size is defined as (Total Sequential Execution Time / Total Number of Messages Processed).
  • #12: The goal is to minimize execution time by considering both the communication and the computation components.