Exploiting Multicores to Optimize Business Process Execution
Daniele Bonetta, Achille Peternier, Cesare Pautasso, Walter Binder
Faculty of Informatics
University of Lugano - USI
Switzerland
http://sosoa.inf.usi.ch
daniele.bonetta@usi.ch
Tuesday, December 14, 2010
BP Execution Engine
Focus: Business Process Runtime Execution Environment
[Diagram: a client invokes a composite Web service hosted by the BP execution engine, which in turn invokes several Web services]

How to scale?
[Diagram: a growing number of clients invoke the composite Web service hosted by the BP execution engine, which invokes several Web services]
More clients == More BP Instances

Multicores
[Figure: multicore processor die with many cores (IBM Power7)]

Outline
1. Multicore Issues
2. JOpera Business Process Execution Engine
   1. Thread Level Parallelism
   2. CPU/Core Level Parallelism
3. Experimental Results
4. Conclusion

Multicore Issues
• Number of cores
• Type of cores (e.g. SMT)
• On-chip cache layout (e.g. L2, L3, ...)
• On-board memory layout (e.g. NUMA, ccNUMA, ...)

Multicore Issues
• Number and type of cores: thread migrations, context switches
• Cache and memory layout: data locality, contention

BP Execution Engine
Java Business Process Execution Engine

3 Layers Approach (Abstraction Layers)
Concurrent Business Process Instances
OS Threads
Hardware Cores

Engine Architecture
[Diagram: Request Handler, Kernel, and Invoker components connected through a Request Queue and an Execution Queue]

BP Execution
[Animation: a business process request moving through the Request Handler, Request Queue, Kernel, Execution Queue, and Invoker]

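The staged structure on the Engine Architecture slides can be sketched as two thread pools linked by queues. This is only an illustration, not JOpera code: it assumes the Request Queue sits between the Request Handler and the Kernel and the Execution Queue between the Kernel and the Invoker, which the slide layout suggests but does not state. The request format, the `STOP` sentinel, and all function names are hypothetical.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel used to shut a worker down


def kernel_worker(request_queue, execution_queue):
    """Take process requests and turn them into task invocations."""
    while True:
        request = request_queue.get()
        if request is STOP:
            break
        # Assumption: a request names one process instance and its tasks.
        for task in request["tasks"]:
            execution_queue.put((request["id"], task))


def invoker_worker(execution_queue, results, lock):
    """Run task invocations (upper() stands in for a Web service call)."""
    while True:
        item = execution_queue.get()
        if item is STOP:
            break
        instance_id, task = item
        with lock:
            results.append((instance_id, task.upper()))


def run_engine(requests, n_kernel=2, n_invoker=2):
    request_queue = queue.Queue()
    execution_queue = queue.Queue()
    results, lock = [], threading.Lock()

    kernels = [threading.Thread(target=kernel_worker,
                                args=(request_queue, execution_queue))
               for _ in range(n_kernel)]
    invokers = [threading.Thread(target=invoker_worker,
                                 args=(execution_queue, results, lock))
                for _ in range(n_invoker)]
    for thread in kernels + invokers:
        thread.start()

    # Role of the request handler: enqueue incoming client requests.
    for request in requests:
        request_queue.put(request)

    # Shut down stage by stage so that no queued work is lost.
    for _ in kernels:
        request_queue.put(STOP)
    for thread in kernels:
        thread.join()
    for _ in invokers:
        execution_queue.put(STOP)
    for thread in invokers:
        thread.join()
    return results
```

The sizes of the two pools (`n_kernel`, `n_invoker`) are the knobs that the later thread-level experiments vary.
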
Deployment on Multicores
[Diagram: the Request Handler, Kernel, and Invoker run as parallel threads on top of two groups of 6 cores]
How?

OverHPC Library
Software stack, top to bottom:
JOpera Engine (Java)
OverHPC (JNI, C, Java)
libpfm
Linux Kernel
Multicore Hardware

OverHPC Library
1. Control and change per-thread scheduling
2. Measure low-level thread performance data

OverHPC Library API
1) Control and change per-thread scheduling
Thread-Core Dynamic Affinity Binding:
getThreadPID()
getThreadAffinity()
setThreadAffinity()
getAffinityInfo()

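The slides list the affinity calls without signatures, so the sketch below does not use OverHPC. It shows the same idea with the Linux affinity calls exposed by Python's standard `os` module, which play roughly the role of getThreadAffinity() and setThreadAffinity(). The helper name `pin_current_thread` is hypothetical.

```python
import os


def pin_current_thread(cores):
    """Restrict the caller to `cores` and return the resulting core set.

    Returns None on platforms without the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # cores the caller may use now
    target = set(cores) & allowed       # never request a forbidden core
    if not target:
        return allowed                  # nothing usable: leave unchanged
    # pid 0 means "the caller"; on Linux this is the calling thread.
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

Calling `pin_current_thread({0})` from inside a worker thread would bind that worker to core 0, provided core 0 is in its allowed set.
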
OverHPC Library API
2) Measure low-level thread performance
Hardware Performance Counters:
getEventsFromCache()
getEventsFromThread()
bindEventsToCore()
bindEventsToThread()

Evaluation
[Diagram: four process structures used as workloads. DAG (<flow>) and Parallel (<flow>) over tasks A, B, C, D; Sequential (<sequence>) over tasks A, B, C, D; Loop (<while>) with Inc and Test tasks]

Hardware Setup
2x CPUs, each with 6 cores, 3 cache levels, 1 last level cache
[Diagram, per CPU: cores C1 to C6, each with its own L1 and L2 cache, sharing one L3 cache]

Experimental Setup
Concurrent Business Process Instances: up to 30’000
OS Threads: k
Hardware Cores: 12

Thread-level Parallelism
How many threads?
Is it enough to just increase the number of concurrent threads in the pools as the number of instances grows?

Thread-level Parallelism
Just increasing the number of threads...
[Plot: throughput (instances/sec, 200 to 1800) vs. number of threads per pool (0 to 140) for the ForEach, Sequential, Parallel, and Loop workloads]

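The kind of sweep behind such a plot can be sketched as follows: run a fixed number of instances with different pool sizes and record the throughput. This is a toy harness, not the benchmark used in the talk; `fake_instance` is a hypothetical stand-in that only sleeps, and the numbers it produces say nothing about JOpera.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_instance(_):
    """Stand-in for one process instance: mostly waits, like a remote call."""
    time.sleep(0.001)


def throughput(pool_size, n_instances=200):
    """Instances completed per second with `pool_size` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fake_instance, range(n_instances)))
    return n_instances / (time.perf_counter() - start)


def sweep(pool_sizes, n_instances=200):
    """Map each pool size to the throughput measured with it."""
    return {size: throughput(size, n_instances) for size in pool_sizes}
```

For example, `sweep([1, 2, 4, 8, 16])` returns one throughput value per pool size, which can then be plotted against the thread count.
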
Experimental Setup
Concurrent Business Process Instances: up to 30’000
OS Threads: 24
Hardware Cores: 12

Experimental Setup
2x CPUs, each with 6 cores, 3 cache levels, 1 last level cache
2 thread pools: Kernel and Invoker

CPU Affinity Binding
Policy 1: Default
Unconstrained scheduling of threads by the OS

Policy 2: per CPU
Constrain each thread pool within a CPU

Policy 3: per Core
Policy 2 + constrain each thread to a specific core

Policy 4: Interleaved
Mix thread pools across CPUs

[Diagrams: each policy is drawn on two CPUs, each with cores C1 to C6, per-core L1 and L2 caches, and one L3 cache]

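The four policies can be written down as a function from a thread to the set of cores it may run on. The sketch assumes cores are numbered contiguously per CPU (0 to 5 on the first, 6 to 11 on the second) and that pool 0 is the Kernel pool and pool 1 the Invoker pool. The slides do not give the exact interleaved mapping, so the alternation used here is only one possible reading.

```python
N_CPUS, CORES_PER_CPU = 2, 6   # 2 CPUs with 6 cores each, as in the setup


def cpu_cores(cpu):
    """Core ids belonging to one CPU (assumed numbered contiguously)."""
    start = cpu * CORES_PER_CPU
    return set(range(start, start + CORES_PER_CPU))


ALL_CORES = cpu_cores(0) | cpu_cores(1)


def affinity(policy, pool, index):
    """Allowed cores for thread `index` of `pool` (0 = Kernel, 1 = Invoker)."""
    if policy == "default":
        return set(ALL_CORES)            # unconstrained: the OS decides
    if policy == "per_cpu":
        return cpu_cores(pool)           # each pool stays within one CPU
    if policy == "per_core":
        # Per CPU, plus each thread fixed to one core of that CPU.
        return {pool * CORES_PER_CPU + index % CORES_PER_CPU}
    if policy == "interleaved":
        # Assumption: threads of a pool alternate between the two CPUs.
        return cpu_cores((pool + index) % N_CPUS)
    raise ValueError(f"unknown policy: {policy}")
```

With 12 threads per pool, `per_core` places two threads of the same pool on every core, while `interleaved` puts half of each pool on each CPU.
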
Performance Layers
Concurrent Business Process Instances: 5’000 - 30’000
  Metrics: throughput, walltime, ...
OS Threads: 24
  Metrics: thread migrations, context switches, ...
Hardware Cores: 12
  Metrics: hardware performance counters (cache misses, ...)

Experimental Results
Relative Speedup with 30k instances
[Bar chart: relative speedup (0.7 to 1.3) of the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 30000 instances]
2 x AMD Barcelona 6-core processors with 2 LLC

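The quantities on the axes of this chart can be computed as in the sketch below. It assumes that speedup is measured relative to the Default policy, which the chart suggests but the slide does not state. Any throughput values passed in are the caller's own measurements; none are taken from the slides.

```python
import math


def relative_speedup(throughput, baseline="Default"):
    """Throughput of each policy divided by the baseline policy's."""
    base = throughput[baseline]
    return {policy: value / base for policy, value in throughput.items()}


def geomean(values):
    """Geometric mean, a common way to average ratios across workloads."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

Given per-workload speedups of one policy, `geomean` produces the single Geomean bar for that policy.
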
HPC-Based Validation
Ineffective SW prefetches: a prefetch request for a memory address that is already in the cache
L3 cache evictions: the data that needs to be stored in the cache is bigger than the available free space

Experimental Results
Ineffective SW prefetches
[Bar chart: relative ineffective SW prefetches (0.7 to 1.2) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 30000 instances]

Experimental Results
L3 Cache evictions
[Bar chart: relative L3 evictions (0.4 to 2) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 30000 instances]

Experimental Results
Relative Speedup with 10k instances
[Bar chart: relative speedup (0.7 to 1.2) of the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 10000 instances]

Experimental Results
Ineffective SW prefetches
[Bar chart: relative ineffective SW prefetches (0.7 to 1.2) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 10000 instances]

Experimental Results
L3 Cache evictions
[Bar chart: relative L3 evictions (0.4 to 1.8) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 10000 instances]

Experimental Results
Relative Speedup with 5k instances
[Bar chart: relative speedup (0.7 to 1.3) of the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 5000 instances]

Experimental Results
Ineffective SW prefetches
[Bar chart: relative ineffective SW prefetches (0.7 to 1.2) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 5000 instances]

Experimental Results
L3 Cache evictions
[Bar chart: relative L3 evictions (0.4 to 2) for the Default, Per CPU, Per core, and Interleaved policies on the DAG, Parallel, Sequential, and Loop workloads, plus their geometric mean; 5000 instances]

Experimental Results
Correlation Coefficients (Hardware events - JOpera throughput)

Workload Size            Ineffective   L3 Cache
(Number of Instances)    SW Pref       Evictions
 5000                    0.9842        0.9456
10000                    0.9125        0.9883
30000                    0.9661        0.9946

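The slide does not say which correlation coefficient is reported. Assuming it is the usual Pearson coefficient between a hardware event count and the measured throughput, it could be computed as in this sketch. The function name and its inputs are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all left unnormalised: the 1/n cancels.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Values close to 1 or -1 indicate that the event count tracks throughput almost linearly across the measured configurations.
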
Conclusion
• Multicore machines offer powerful hardware parallelism, but what matters is not just the number of processing elements (PEs)
• Performance depends on how a limited number of threads is mapped to the hardware
• Multicore-aware thread scheduling significantly impacts performance (up to 10% speedup)

Thank you!
OverHPC Library: http://sosoa.inf.usi.ch
JOpera business process execution engine: http://www.jopera.org
Twitter: @jopera_org
me: daniele.bonetta@usi.ch

Service Oriented Architectures and Web Services
Real-time Mashups di Web Service Geografici
Towards Scalable Service Composition on Multicores
WS-* vs. RESTful Services
RESTful Service Composition with JOpera
SOA2010 SOA with REST
USI SCUBE Associate Member
Lighweight Collaboration Management (Mashups09@OOPSLA)
Some REST Design Patterns (and Anti-Patterns) - SOA Symposium 2009

Recently uploaded (20)

PPT
Teaching material agriculture food technology
PPTX
Cloud computing and distributed systems.
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Encapsulation theory and applications.pdf
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Approach and Philosophy of On baking technology
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
KodekX | Application Modernization Development
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Teaching material agriculture food technology
Cloud computing and distributed systems.
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Digital-Transformation-Roadmap-for-Companies.pptx
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Unlocking AI with Model Context Protocol (MCP)
Encapsulation theory and applications.pdf
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Reach Out and Touch Someone: Haptics and Empathic Computing
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
MYSQL Presentation for SQL database connectivity
Approach and Philosophy of On baking technology
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Per capita expenditure prediction using model stacking based on satellite ima...
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
KodekX | Application Modernization Development
How UI/UX Design Impacts User Retention in Mobile Apps.pdf

Exploiting Multicores to Optimize Business Process Execution

  • 1. Exploiting Multicores to Optimize Business Process Execution Daniele Bonetta, Achille Peternier, Cesare Pautasso, Walter Binder Faculty of Informatics University of Lugano - USI Switzerland http://sosoa.inf.usi.ch daniele.bonetta@usi.ch Tuesday, December 14, 2010
  • 2. BP Execution Engine. Focus: Business Process Runtime Execution Environment. [Diagram: a client calls a composite Web service, implemented by the BP execution engine, which invokes several Web services.]
  • 3. How to scale? [Diagram: many clients call the composite Web service implemented by the BP execution engine, which invokes several Web services.]
  • 4. How to scale? [Diagram: the same setup with many more clients.]
  • 5. How to scale? More clients == More BP Instances. [Diagram: many clients call the composite Web service, labelled as the Web service composition engine, which invokes several Web services.]
  • 6. How to scale? More clients == More BP Instances. [Diagram: many clients call the composite Web service, labelled as the Web service composition engine, which invokes several Web services.]
  • 7. Multicores. [Figure: the cores of an IBM Power7 processor.]
  • 8. Outline: 1. Multicore Issues; 2. JOpera Business Process Execution Engine (2.1 Thread Level Parallelism, 2.2 CPU/Core Level Parallelism); 3. Experimental Results; 4. Conclusion
  • 9. Multicore Issues • Number of cores • Type of cores (e.g. SMT) • On Chip Caching Layout (e.g. L2, L3...) • On Board Memory Layout (e.g. NUMA, NUMA-CC, ...)
  • 10. Multicore Issues • Number and type of cores: thread migrations, context switches • Cache layout • Memory layout
  • 11. Multicore Issues • Number and type of cores: thread migrations, context switches • Cache and memory layout: data locality, contention
  • 12. BP Execution Engine: Java Business Process Execution Engine
  • 13. 3 Layers Approach: Concurrent Business Process Instances; OS Threads; Hardware Cores
  • 14. Abstraction Layers: Concurrent Business Process Instances; OS Threads; Hardware Cores
  • 15. Engine Architecture: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
  • 16. Abstraction Layers: Concurrent Business Process Instances; OS Threads; Hardware Cores
  • 17. Engine Architecture: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
  • 18. BP Execution: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
  • 19. BP Execution: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
  • 20. BP Execution: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
  • 21. BP Execution: Request Handler, Request Queue, Kernel, Execution Queue, Invoker
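The architecture slides only name the components, so the following is a minimal sketch under assumptions, not JOpera's implementation. It assumes that the request handler puts incoming requests on the request queue, that a pool of kernel threads moves each request forward onto the execution queue, and that a pool of invoker threads completes it. The class name `TwoStageEngineSketch`, the integer request ids, and the 10-second timeout are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical two-stage pipeline: request queue -> kernel pool -> execution queue -> invoker pool.
final class TwoStageEngineSketch {

    /** Pushes `requests` dummy requests through both stages and returns how many completed. */
    static int run(int requests, int kernelThreads, int invokerThreads) {
        BlockingQueue<Integer> requestQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> executionQueue = new LinkedBlockingQueue<>();
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);

        ExecutorService kernelPool = Executors.newFixedThreadPool(kernelThreads);
        ExecutorService invokerPool = Executors.newFixedThreadPool(invokerThreads);

        // Kernel threads: take a request and schedule the corresponding work (assumed behavior).
        for (int i = 0; i < kernelThreads; i++) {
            kernelPool.execute(() -> {
                try {
                    while (true) {
                        executionQueue.put(requestQueue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool is shutting down
                }
            });
        }

        // Invoker threads: perform the scheduled work (here: just count it).
        for (int i = 0; i < invokerThreads; i++) {
            invokerPool.execute(() -> {
                try {
                    while (true) {
                        executionQueue.take();
                        completed.incrementAndGet();
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool is shutting down
                }
            });
        }

        // Request handler role: enqueue the incoming requests.
        for (int id = 0; id < requests; id++) {
            requestQueue.add(id);
        }

        try {
            done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            kernelPool.shutdownNow();
            invokerPool.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + run(1000, 4, 4));
    }
}
```

Decoupling the two pools with queues is what makes the pool sizes and the placement of each pool on cores independent tuning knobs, which is the question the following slides raise.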
  • 22. Abstraction Layers: Concurrent Business Process Instances; OS Threads; Hardware Cores
  • 23. Deployment on Multicores: the Kernel, Invoker, and Request handler run as parallel threads on top of 12 cores.
  • 24. Deployment on Multicores: the Kernel, Invoker, and Request handler run as parallel threads on top of 12 cores. How?
  • 25. OverHPC Library. Software stack, top to bottom: JOpera Engine (Java); OverHPC (JNI, C, Java); libpfm; Linux Kernel; Multicore Hardware
  • 26. OverHPC Library. Software stack, top to bottom: JOpera Engine (Java); OverHPC (JNI, C, Java); libpfm; Linux Kernel; Multicore Hardware. Purpose: 1. Control and change per-thread scheduling 2. Measure low-level thread performance data
  • 27. OverHPC Library API 1) Control and change per-thread scheduling: Thread-Core Dynamic Affinity Binding • getThreadPID() • getThreadAffinity() • setThreadAffinity() • getAffinityInfo()
  • 28. OverHPC Library API 2) Measure low-level thread performance: Hardware Performance Counters • getEventsFromCache() • getEventsFromThread() • bindEventsToCore() • bindEventsToThread()
  • 29. Evaluation. [Diagram: four benchmark process structures over activities A, B, C, D: DAG (<flow>), Parallel (<flow>), Sequential (<sequence>), and Loop (<while>, with Inc and Test steps).]
  • 30. Hardware Setup: 2x CPUs, each with 6 cores (C1-C6), 3 cache levels (one L1 and one L2 per core), and 1 last level cache (L3)
  • 31. Experimental Setup: Concurrent Business Process Instances: up to 30’000; OS Threads: k; Hardware Cores: 12
  • 32. Thread-level Parallelism: How many threads? Is it enough to increase the number of concurrent threads in the pools as the number of instances increases?
  • 33. Thread-level Parallelism: Just increasing the number of threads... [Plot: throughput (instances/sec) versus number of threads per pool; series: ForEach, Sequential, Parallel, Loop.]
  • 34. Experimental Setup: Concurrent Business Process Instances: up to 30’000; OS Threads: 24; Hardware Cores: 12
  • 35. Experimental Setup: 2x CPUs, each with 6 cores (C1-C6), 3 cache levels, and 1 last level cache (L3). 2 thread pools: Kernel, Invoker
  • 36. CPU Affinity Binding. Policy 1: Default. Unconstrained scheduling of threads by the OS. [Diagram: two CPUs, each with cores C1-C6, per-core L1 and L2 caches, and one L3 cache.]
  • 37. CPU Affinity Binding. Policy 2: per CPU. Constrain each thread pool within a CPU. [Diagram: same two-CPU layout.]
  • 38. CPU Affinity Binding. Policy 3: per Core. Policy 2 + constrain each thread on a specific core. [Diagram: same two-CPU layout.]
  • 39. CPU Affinity Binding. Policy 4: Interleaved. Mix thread pools across CPUs. [Diagram: same two-CPU layout.]
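The four policies above can be sketched as a function from (policy, pool, thread) to a set of allowed cores. This is an illustration under assumptions, not the OverHPC API: cores 0-5 are taken to belong to the first CPU and 6-11 to the second, pool 0 stands for the Kernel pool and pool 1 for the Invoker pool, and "Interleaved" is interpreted as alternating cores between the pools so that each pool spans both CPUs. The slides do not spell out the exact interleaving, and the class and method names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical mapping of the four affinity policies to core masks (2 CPUs x 6 cores, 2 pools).
final class AffinityPolicies {
    enum Policy { DEFAULT, PER_CPU, PER_CORE, INTERLEAVED }

    static final int CPUS = 2;
    static final int CORES_PER_CPU = 6;
    static final int TOTAL_CORES = CPUS * CORES_PER_CPU;
    static final int POOLS = 2; // 0 = Kernel, 1 = Invoker (assumed numbering)

    /** Returns the cores on which thread `threadIndex` of pool `poolIndex` may run. */
    static BitSet allowedCores(Policy policy, int poolIndex, int threadIndex) {
        BitSet mask = new BitSet(TOTAL_CORES);
        switch (policy) {
            case DEFAULT:
                // No constraint: the OS may schedule the thread on any core.
                mask.set(0, TOTAL_CORES);
                break;
            case PER_CPU:
                // The whole pool is confined to the cores of one CPU.
                mask.set(poolIndex * CORES_PER_CPU, (poolIndex + 1) * CORES_PER_CPU);
                break;
            case PER_CORE:
                // As PER_CPU, but each thread is pinned to a single core of that CPU.
                mask.set(poolIndex * CORES_PER_CPU + threadIndex % CORES_PER_CPU);
                break;
            case INTERLEAVED:
                // Assumed interpretation: pools alternate cores, so each pool spans both CPUs.
                for (int core = poolIndex; core < TOTAL_CORES; core += POOLS) {
                    mask.set(core);
                }
                break;
        }
        return mask;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " kernel[0]=" + allowedCores(p, 0, 0)
                    + " invoker[0]=" + allowedCores(p, 1, 0));
        }
    }
}
```

On Linux, a mask like this would typically be applied per thread through a native call such as `sched_setaffinity`, which is the kind of operation the `setThreadAffinity()` entry in the OverHPC API list suggests.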
  • 40. Experimental Setup: Concurrent Business Process Instances: up to 30’000; OS Threads: 24; Hardware Cores: 12
  • 41. Performance Layers: Concurrent Business Process Instances (5’000 - 30’000): throughput, walltime, ...; OS Threads (24): thread migrations, context switches, ...; Hardware Cores (12): hardware performance counters (cache misses, ...)
  • 42. Experimental Results: Relative Speedup with 30k instances. [Bar chart: relative speedup of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 30000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 43. HPC-Based Validation • Ineffective SW prefetches: a prefetch request for a memory address that is already in the cache • L3 cache evictions: the data that needs to be stored in the cache is larger than the available free space
  • 44. Experimental Results: Ineffective SW prefetches. [Bar chart: relative ineffective SW prefetches of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 30000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 45. Experimental Results: L3 Cache evictions. [Bar chart: relative L3 evictions of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 30000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 46. Experimental Results: Relative Speedup with 10k instances. [Bar chart: relative speedup of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 10000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 47. Experimental Results: Ineffective SW prefetches. [Bar chart: relative ineffective SW prefetches of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 10000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 48. Experimental Results: L3 Cache evictions. [Bar chart: relative L3 evictions of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 10000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 49. Experimental Results: Relative Speedup with 5k instances. [Bar chart: relative speedup of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 5000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 50. Experimental Results: Ineffective SW prefetches. [Bar chart: relative ineffective SW prefetches of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 5000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
  • 51. Experimental Results: L3 Cache evictions. [Bar chart: relative L3 evictions of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean; 5000 instances; 2 x AMD Barcelona 6 cores processors with 2 LLC.]
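The charts report relative values per workload plus a "Geomean" bar. The slides do not define the metric, so the sketch below assumes that relative speedup is the throughput measured under a policy divided by the throughput measured under the Default policy, and that "Geomean" is the geometric mean of those ratios across workloads. The class name and the numbers in `main` are hypothetical placeholders, not measurements from the talk.

```java
// Hypothetical helper for ratio-based results and their geometric mean.
final class SpeedupSketch {

    /** Assumed definition: throughput under a policy relative to the Default policy. */
    static double relativeSpeedup(double policyThroughput, double defaultThroughput) {
        return policyThroughput / defaultThroughput;
    }

    /** Geometric mean, computed in log space to avoid overflow on long products. */
    static double geometricMean(double[] ratios) {
        if (ratios.length == 0) {
            throw new IllegalArgumentException("need at least one ratio");
        }
        double logSum = 0.0;
        for (double r : ratios) {
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        // Placeholder throughputs (instances/sec) for four workloads; not data from the slides.
        double[] defaultRuns = {100.0, 200.0, 150.0, 120.0};
        double[] policyRuns = {110.0, 190.0, 165.0, 120.0};
        double[] ratios = new double[defaultRuns.length];
        for (int i = 0; i < ratios.length; i++) {
            ratios[i] = relativeSpeedup(policyRuns[i], defaultRuns[i]);
        }
        System.out.println("geomean speedup: " + geometricMean(ratios));
    }
}
```

The geometric mean is the usual choice for averaging ratios because it treats a 2x gain and a 2x loss symmetrically, which an arithmetic mean does not.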
  • 52. Experimental Results: Correlation Coefficients (Hardware events - JOpera throughput)
      Workload Size (Number of Instances) | Ineffective SW Prefetches | L3 Cache Evictions
      5000 | 0.9842 | 0.9456
      10000 | 0.9125 | 0.9883
      30000 | 0.9661 | 0.9946
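The slide does not say which correlation coefficient was used. Assuming it is the Pearson coefficient between a hardware event count and the measured throughput across runs, it could be computed as in this sketch. The class name is hypothetical and the sample arrays in `main` are placeholders, not data from the experiments.

```java
// Hypothetical Pearson correlation helper; the choice of Pearson is an assumption.
final class CorrelationSketch {

    /** Pearson correlation coefficient between two equally long samples. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The 1/n factors cancel, so the raw sums can be used directly.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Placeholder samples: one value per policy run (not measurements from the talk).
        double[] eventMetric = {1.00, 0.95, 0.90, 1.05};
        double[] throughputMetric = {1.00, 0.97, 0.93, 1.04};
        System.out.println("r = " + pearson(eventMetric, throughputMetric));
    }
}
```

A coefficient with magnitude close to 1, as in the table above, indicates that the hardware event counts and the throughput vary together almost linearly across the runs.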
  • 53. Conclusion • Multicore machines offer powerful hardware parallelism, but what matters is not just the number of processing elements (PEs) • Performance depends on how a limited number of threads is mapped to the hardware • Multicore-aware thread scheduling significantly impacts performance (up to 10% speedup)
  • 54. Thank you! OverHPC Library: http://sosoa.inf.usi.ch JOpera business process execution engine: http://www.jopera.org Twitter: @jopera_org me: daniele.bonetta@usi.ch