WorkflowSim: A Toolkit for Simulating
  Scientific Workflows in Distributed
             Environments
         Weiwei Chen, Ewa Deelman
Outline
§  Introduction
  –  Scientific workflows?
  –  Distributed environments?
§  Challenge
  –  Large scale, system overhead
§  Solution
  –  Workflow Overhead and Failure Model
§  Validation and Application
  –  Overhead Robustness
  –  Fault Tolerant Clustering

Scientific Applications
§  Scientists often need to
  –  Integrate diverse components and data
  –  Automate data processing steps
  –  Reproduce/analyze/share previous results
  –  Track the provenance of data products
  –  Execute processing steps efficiently
  –  Reliably execute applications

       Scientific Workflows provide
       solutions to these problems


Scientific Workflows
§  DAG model (Directed Acyclic Graph)
  –  Node: computational activities
  –  Directed edge: data dependencies
  –  Task: a process that users would like to execute
  –  Job: a single unit for execution with one or more tasks
  –  Task clustering: the process of grouping tasks into jobs
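
The task/job distinction can be illustrated with a minimal sketch of one possible clustering scheme: tasks at the same DAG level are merged into at most k jobs. The function names, the round-robin grouping, and the toy workflow are assumptions made for illustration, not WorkflowSim's actual API.

```python
from collections import defaultdict

def task_levels(edges, tasks):
    """Return {task: level}; a task's level is 1 + the deepest level of its parents."""
    parents = defaultdict(list)
    for src, dst in edges:
        parents[dst].append(src)
    levels = {}

    def level(task):
        if task not in levels:
            levels[task] = 1 + max((level(p) for p in parents[task]), default=0)
        return levels[task]

    for task in tasks:
        level(task)
    return levels

def horizontal_clustering(edges, tasks, k):
    """Group the tasks of each level into at most k jobs (round-robin)."""
    by_level = defaultdict(list)
    for task, lvl in task_levels(edges, tasks).items():
        by_level[lvl].append(task)
    jobs = []
    for lvl in sorted(by_level):
        level_tasks = by_level[lvl]
        n_jobs = min(k, len(level_tasks))
        buckets = [[] for _ in range(n_jobs)]
        for i, task in enumerate(level_tasks):
            buckets[i % n_jobs].append(task)
        jobs.extend(buckets)
    return jobs

# Hypothetical fork-join workflow: a -> b1..b4 -> c
tasks = ["a", "b1", "b2", "b3", "b4", "c"]
middle = ("b1", "b2", "b3", "b4")
edges = [("a", b) for b in middle] + [(b, "c") for b in middle]
jobs = horizontal_clustering(edges, tasks, k=2)
```

With k=2, the four middle tasks are merged into two jobs of two tasks each, while the single-task levels stay as one-task jobs.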




Workflow Management System
§  Common Features of WMS
  –  Maps abstract workflows to executable workflows
  –  Handles data dependencies
  –  Replica selection, transfers, registration, cleanup
  –  Task clustering, workflow partitioning, scheduling
  –  Reliability and fault tolerance
  –  Monitoring and troubleshooting
§  Existing WMS
  –  Pegasus, Askalon, Taverna, Kepler, Triana



Workflow Simulation
§  Benefits
  –  Save effort in system setup and execution
  –  Reproduce experimental results
  –  Control system environments (failures)
§  Trace based Workflow Simulation
  –  Import trace from a completed execution
  –  Vary workflow structures and system environments
§  Challenges (CloudSim, GridSim, etc.)
  –  System overhead and failures
  –  Multiple levels of computational activities (task/job)
  –  A hierarchy of management components
Comparison
                          CloudSim              WorkflowSim
Execution Model           Task, Bag of Tasks    Task, Job, DAG
Failure and Monitoring    No                    Yes
Overhead Model            Data Transfer Delay   Data Transfer Delay, Workflow Engine Delay, Clustering Delay, …
Site Selection            Single                Multiple
Optimization Techniques   Scheduling            Scheduling, Job Retry, Clustering, Partitioning, …

     WorkflowSim is an extension of
     CloudSim, but it is workflow aware
System Architecture
§  Submit Host
  –  Workflow Mapper
  –  Clustering Engine
  –  Workflow Engine
  –  Local Scheduler
§  Execution Site
  –  Remote Scheduler
  –  Worker Nodes
  –  Failure Generator
  –  Failure Monitor

Workflow Overhead
  –  Workflow Engine Delay
  –  Queue Delay
  –  Postscript Delay
  –  Clustering Delay
  –  Data Transfer Delay




Example: Clustering Delay
n is the number of tasks per level; k is the number of jobs per level.

\[ \frac{\mathrm{ClusteringDelay}|_{k=i}}{\mathrm{ClusteringDelay}|_{k=j}} = \frac{n/i}{n/j} = \frac{j}{i} \]



                         mProjectPP, mDiffFit, and
                         mBackground are the major
                         jobs of Montage

Overhead is not constant.
It follows diverse distributions and patterns.
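
As a worked illustration of the clustering delay example, the sketch below assumes that the delay of a clustered job is inversely proportional to k (so the ratio of delays at k=i and k=j is j/i). The function name and the measured value are hypothetical.

```python
def scale_clustering_delay(delay_at_i, i, j):
    """Estimate the clustering delay at k=j from a measurement at k=i,
    assuming delay(k=i) / delay(k=j) = j / i."""
    if i <= 0 or j <= 0:
        raise ValueError("k must be positive")
    return delay_at_i * i / j

measured_delay = 12.0  # seconds at k=2 (made-up number, not from a real trace)
predicted_delay = scale_clustering_delay(measured_delay, i=2, j=8)
```

Under this assumption, quadrupling k from 2 to 8 cuts the estimated delay to a quarter of the measured value.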

Validation
\[ \mathrm{Accuracy} = \frac{\text{Predicted Overall Runtime}}{\text{Real Overall Runtime}} \]
Ideal case: Accuracy = 1.0
k: maximum number of jobs per level




                                        Overheads have
                                        the biggest impact
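
The accuracy metric is the ratio of the simulated (predicted) overall runtime to the runtime of the real execution. A minimal sketch follows; the function name and the runtimes are made-up values for illustration only.

```python
def accuracy(predicted_runtime, real_runtime):
    """Ratio of predicted to real overall runtime; 1.0 is the ideal case."""
    if real_runtime <= 0:
        raise ValueError("real runtime must be positive")
    return predicted_runtime / real_runtime

# Hypothetical runtimes in seconds; a simulator that ignores overheads
# would typically predict a shorter runtime than the real execution.
without_overhead = accuracy(predicted_runtime=600.0, real_runtime=1000.0)
with_overhead = accuracy(predicted_runtime=950.0, real_runtime=1000.0)
```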




Application: Overhead Robustness
§  Overhead robustness: the influence of
    overheads on the workflow runtime for DAG
    scheduling heuristics.
§  Inaccurate estimation (under- or over-estimated)
    of workflow overheads influences the overall
    runtime of workflows.
§  Research Merits:
     –  Sensitivity of heuristics
     –  Overhead friendly heuristics



Application: Overhead Robustness
§  Increase or Reduce Overhead by a Factor
    (Weight)
  –  Underestimation and overestimation
§  Heuristics Evaluated
  –  FCFS: First Come First Serve
  –  MCT: Minimum Completion Time
  –  MinMin: The job with the minimum completion time is
     selected and assigned to the fastest resource.
  –  MaxMin: The job with the maximum completion time is
     selected and assigned to its best available resource.
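
The difference between MinMin and MaxMin can be sketched with a small greedy scheduler for independent jobs. This follows the common textbook formulation (each job's best resource is the one giving the earliest completion time), which is an assumption here; the job sizes and resource speeds are hypothetical, and overheads are left out for brevity.

```python
def schedule(jobs, speeds, pick):
    """Greedy scheduling of independent jobs.

    jobs:   {job_name: work} in arbitrary work units
    speeds: resource speeds in work units per second
    pick:   "minmin" or "maxmin"
    Returns (assignment, makespan).
    """
    ready = [0.0] * len(speeds)   # time at which each resource becomes free
    remaining = dict(jobs)
    assignment = {}
    while remaining:
        # Earliest completion time (and its resource) for every unscheduled job.
        best = {
            name: min((ready[r] + work / s, r) for r, s in enumerate(speeds))
            for name, work in remaining.items()
        }
        chooser = min if pick == "minmin" else max
        name = chooser(best, key=lambda j: best[j][0])
        finish, resource = best[name]
        ready[resource] = finish
        assignment[name] = resource
        del remaining[name]
    return assignment, max(ready)

jobs = {"j1": 2.0, "j2": 4.0, "j3": 8.0}
speeds = [2.0, 1.0]
minmin_assignment, minmin_makespan = schedule(jobs, speeds, "minmin")
maxmin_assignment, maxmin_makespan = schedule(jobs, speeds, "maxmin")
```

In this toy case MinMin places every job on the fast resource, while MaxMin schedules the largest job first and uses both resources, which gives a shorter makespan. Which heuristic wins in general depends on the workload and on how accurately overheads are estimated.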



Application: Overhead Robustness
[Figure: plot regions annotated MCT>MinMin and MCT<MinMin]




    Accurate estimation of Clustering
    Delay is more important

Workflow Failure
§  Failures have a significant influence on
    performance
§  Classifying Transient Failures
   –  Task Failure: a task fails; other tasks within the
      same job may not fail
   –  Job Failure: a job fails; all of its tasks fail
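
The two failure classes can be sketched as two ways of sampling the outcome of one job. The function names and the failure probabilities `alpha` (per task) and `beta` (per job) are assumptions for illustration; they do not reproduce WorkflowSim's Failure Generator.

```python
import random

def failed_tasks_task_failure(num_tasks, alpha, rng):
    """Task failure: each task fails independently with probability alpha.
    Returns the indices of the failed tasks (possibly empty)."""
    return [i for i in range(num_tasks) if rng.random() < alpha]

def failed_tasks_job_failure(num_tasks, beta, rng):
    """Job failure: the whole job fails with probability beta,
    in which case every task in it counts as failed."""
    return list(range(num_tasks)) if rng.random() < beta else []

rng = random.Random(42)  # fixed seed so a run can be repeated
partial = failed_tasks_task_failure(num_tasks=8, alpha=0.3, rng=rng)
whole = failed_tasks_job_failure(num_tasks=8, beta=0.3, rng=rng)
```

Under the task failure model a job can fail partially, so only some of its tasks need to be rerun; under the job failure model the result is all or nothing.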




System Architecture
§  Submit Host
  –  Job Retry
  –  Reclustering
§  Execution Site
  –  Failure Generator
  –  Failure Monitor




Application: Fault Tolerant Clustering
§  Task clustering can reduce execution overhead
§  A job composed of multiple tasks may have a
    greater risk of suffering from failures
§  Reclustering and Job Retry are proposed
  –  No Optimization (NOOP) retries the failed jobs.
  –  Dynamic Clustering (DC) decreases the clusters.size if
     the measured job failure rate is high.




Application: Fault Tolerant Clustering
  –  Selective Reclustering (SR) retries the failed tasks in a job.
  –  Dynamic Reclustering (DR) retries the failed tasks in a job and also
     decreases the clusters.size if the measured job failure rate is high.
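
The four strategies (NOOP, DC, SR, DR) differ in which tasks are resubmitted and whether the cluster size shrinks. The sketch below is one possible reading: the `threshold` for a "high" failure rate and the halving of the cluster size are assumptions, since the slides do not state how much clusters.size is decreased.

```python
def chunk(tasks, size):
    """Split a task list into jobs of at most `size` tasks."""
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]

def resubmit(job, failed, strategy, cluster_size, failure_rate, threshold=0.5):
    """Return (jobs_to_retry, new_cluster_size) for one failed job.

    job:    list of task ids in the failed job
    failed: set of task ids that actually failed
    """
    # Assumption: halve the cluster size when the failure rate is high.
    shrunk = max(1, cluster_size // 2) if failure_rate > threshold else cluster_size
    if strategy == "NOOP":   # retry the failed job unchanged
        return [list(job)], cluster_size
    if strategy == "DC":     # retry all tasks, with smaller clusters if needed
        return chunk(list(job), shrunk), shrunk
    retry = [t for t in job if t in failed]
    if strategy == "SR":     # retry only the failed tasks
        return chunk(retry, cluster_size), cluster_size
    if strategy == "DR":     # retry only the failed tasks, with smaller clusters
        return chunk(retry, shrunk), shrunk
    raise ValueError(f"unknown strategy: {strategy}")

job, failed = [1, 2, 3, 4], {1, 2, 4}
plans = {s: resubmit(job, failed, s, cluster_size=4, failure_rate=0.8)
         for s in ("NOOP", "DC", "SR", "DR")}
```

DR combines the two ideas: it skips tasks that already succeeded and reduces the number of tasks exposed to failure in each retried job.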




Performance
§  Reclustering (DR/DC/SR) significantly reduces the influence of
    failures compared to NOOP
§  DR outperforms the other techniques.
[Figure: the lower the better]




Conclusion
§  WorkflowSim helps researchers evaluate
    their workflow optimization techniques with
    better accuracy and wider support.
  –  Modeling Overhead and Failures
  –  Distributed and Hierarchical components
  –  Workflow Techniques
§  It is necessary to consider data
    dependencies, workflow failures, and system
    overhead together.



Acknowledgements
§    Pegasus Team: http://pegasus.isi.edu
§    FutureGrid & XSEDE
§    Funded by NSF grant IIS-0905032.
§    More info: wchen@isi.edu
§    Available on request
§    Q&A




Overhead Distribution




                   Montage




Overhead Distribution




                   CyberShake




Broadband




Task Failure Model and Job Failure Model




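\[ k^{*} = \frac{n}{r}, \qquad t^{*}_{total} = \frac{k^{*} t + d}{1 - \beta} \]

\[ k^{*} = \frac{-d + \sqrt{d^{2} - \frac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r, \qquad t^{*}_{total} = \frac{n\,(k^{*} t + d)}{r\,k^{*}\,(1-\alpha)^{k^{*}}} \]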
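
To illustrate the trade-off captured by the α-based model, the sketch below evaluates the expected total runtime n(kt + d) / (r k (1 − α)^k) for every integer k and picks the minimum by brute force instead of using a closed form. The meanings assigned to the symbols (n tasks, r resources, t runtime per task, d overhead per job, α per-task failure probability, k tasks per job), the search range up to n/r, and all numeric values are assumptions for illustration.

```python
def expected_total_runtime(k, n, r, t, d, alpha):
    """Expected total runtime when n tasks are clustered k per job on r resources.
    Each job costs k*t + d and is retried until all k tasks succeed."""
    return n * (k * t + d) / (r * k * (1 - alpha) ** k)

def best_cluster_size(n, r, t, d, alpha):
    """Brute-force search over k = 1 .. n // r for the smallest expected runtime."""
    candidates = range(1, max(1, n // r) + 1)
    return min(candidates,
               key=lambda k: expected_total_runtime(k, n, r, t, d, alpha))

# Hypothetical parameters: 1000 tasks, 10 resources, 10 s per task, 50 s overhead per job.
k_no_failures = best_cluster_size(n=1000, r=10, t=10.0, d=50.0, alpha=0.0)
k_with_failures = best_cluster_size(n=1000, r=10, t=10.0, d=50.0, alpha=0.02)
```

Without failures the largest allowed cluster wins because it amortizes the per-job overhead best. With a non-zero task failure probability, large jobs are retried more often, so the best k moves to an intermediate value.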




Related Papers


§    Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012.
§    Improving Scientific Workflow Performance using Policy Based Data Placement, IEEE Policy 2012.
§    Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012.
§    Workflow Overhead Analysis and Optimizations, WORKS 2011.
§    Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints, PPAM 2011.
§    Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems, Scientific Programming Journal 2005.



