Exploiting Multicores to Optimize Business Process Execution

Daniele Bonetta,
Achille Peternier, Cesare Pautasso, Walter Binder
Faculty of Informatics
University of Lugano - USI
Switzerland
http://sosoa.inf.usi.ch
daniele.bonetta@usi.ch
Tuesday, December 14, 2010

BP Execution Engine
Focus: Business Process Runtime Execution Environment
[Diagram: a Client invokes a Composite Web Service hosted by the BP Execution Engine, which invokes several Web Services]


How to scale?
[Diagram: a growing number of Clients invoke the Composite Web Service hosted by the BP Execution Engine, which invokes several Web Services]

More clients == More BP Instances
[Diagram: many Clients invoke the Composite Service hosted by the Composition Engine, which invokes several Web Services]

More clients <= More BP Instances

Multicores
[Figure: multicore chip layout, IBM Power7]

Outline
1. Multicore Issues
2. JOpera Business Process Execution Engine
   1. Thread Level Parallelism
   2. CPU/Core Level Parallelism
3. Experimental Results
4. Conclusion


Multicore Issues

• Number of cores
• Type of cores (e.g. SMT)
• On Chip Caching Layout (e.g. L2, L3...)
• On Board Memory Layout (e.g. NUMA, NUMA-CC, ...)


Multicore Issues

• Cores Num, Cores Type: Thread Migrations, Context Switches
• Cache Layout, Memory Layout: Data Locality, Contention


BP Execution Engine

Java Business Process Execution Engine


3 Layers Approach
Concurrent Business
Process Instances

OS Threads

Hardware Cores


Abstraction Layers
Concurrent Business
Process Instances

OS Threads

Hardware Cores


Engine Architecture
[Diagram: Request Handler, Kernel, and Invoker components connected through a Request Queue and an Execution Queue]

BP Execution
[Diagram: the same components and queues, showing a business process being executed]
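A minimal sketch of how such a two-queue engine could be wired, assuming that kernel threads move work from the request queue to the execution queue and that invoker threads consume the execution queue. The class name, the method name, and the wiring are illustrative assumptions, not JOpera code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical two-stage pipeline; not taken from the JOpera code base.
class TwoQueueEngineSketch {

    /** Pushes `instances` requests through both pools and returns how many completed. */
    static int run(int instances, int kernelThreads, int invokerThreads) {
        BlockingQueue<Integer> requestQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> executionQueue = new LinkedBlockingQueue<>();
        CountDownLatch remaining = new CountDownLatch(instances);
        ExecutorService kernelPool = Executors.newFixedThreadPool(kernelThreads);
        ExecutorService invokerPool = Executors.newFixedThreadPool(invokerThreads);

        // Kernel threads: take a process request and schedule its task for execution.
        for (int i = 0; i < kernelThreads; i++) {
            kernelPool.execute(() -> {
                try {
                    while (true) {
                        executionQueue.put(requestQueue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool is shutting down
                }
            });
        }
        // Invoker threads: take a scheduled task and "invoke" it (a no-op here).
        for (int i = 0; i < invokerThreads; i++) {
            invokerPool.execute(() -> {
                try {
                    while (true) {
                        executionQueue.take();
                        remaining.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool is shutting down
                }
            });
        }
        try {
            // Request handler role: enqueue one request per process instance.
            for (int id = 0; id < instances; id++) {
                requestQueue.put(id);
            }
            boolean finished = remaining.await(10, TimeUnit.SECONDS);
            return finished ? instances : instances - (int) remaining.getCount();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            kernelPool.shutdownNow();
            invokerPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4, 4) + " instances completed");
    }
}
```

The two pool sizes are the tuning knobs that the thread-level experiments vary.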


Deployment on Multicores
[Diagram: the Request Handler, Kernel, and Invoker each run on parallel threads]

How?


OverHPC Library

JOpera Engine (Java)

OverHPC (JNI, C, Java)

libpfm

Linux Kernel

Multicore Hardware


OverHPC Library

JOpera Engine (Java)

OverHPC (JNI, C, Java)

libpfm

Linux Kernel

Multicore Hardware

1. Control and change per-thread scheduling
2. Measure low-level thread performance data


OverHPC Library API

1) Control and Change per-thread scheduling
Thread-Core Dynamic Affinity Binding
getThreadPID()
getThreadAffinity()
setThreadAffinity()
getAffinityInfo()


OverHPC Library API

2) Measure low level thread performance:
Hardware Performance Counters
getEventsFromCache()
getEventsFromThread()
bindEventsToCore()
bindEventsToThread()


Evaluation
[Figure: four benchmark process structures built from tasks A, B, C, D]
• DAG (<flow>)
• Parallel (<flow>)
• Sequential (<sequence>)
• Loop (<while>, with Inc and Test steps)


Hardware Setup

2x CPUs, each with 6 cores, 3 cache levels, 1 last level cache

L3 Cache
L2 L2 L2 L2 L2 L2
L1 L1 L1 L1 L1 L1
C1 C2 C3 C4 C5 C6


Experimental Setup

Concurrent Business Process Instances: up to 30’000
OS Threads: k
Hardware Cores: 12


Thread-level Parallelism
How many threads?

Just increase the number of concurrent threads in the pools as the number of instances grows?


Thread-level Parallelism
Just increasing the number of threads...
[Plot: throughput (instances/sec, 0 to 1800) versus number of threads per pool (0 to 140) for the ForEach, Sequential, Parallel, and Loop workloads]

Experimental Setup

Concurrent Business Process Instances: up to 30’000
OS Threads: 24
Hardware Cores: 12


Experimental Setup
2x CPUs, each with 6 cores, 3 cache levels, 1 last level cache

2 thread pools: Kernel, Invoker


CPU Affinity Binding

Policy 1: Default
Unconstrained scheduling of threads by the OS

Policy 2: per CPU
Constrain each thread pool within a CPU

Policy 3: per Core
Policy 2 + constrain each thread to a specific core

Policy 4: Interleaved
Mix thread pools across CPUs

[Diagram, repeated for each policy: two CPUs, each with one L3 cache and six cores C1 to C6 with their own L1 and L2 caches]
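The four policies can be read as rules for computing the set of cores on which each thread may run. The sketch below assumes cores 0 to 5 belong to the first CPU and 6 to 11 to the second, that pool 0 is the Kernel pool and pool 1 the Invoker pool, and that "interleaved" means alternating cores between the pools. The slides do not spell out these details, and the class is illustrative; it does not call OverHPC.

```java
import java.util.BitSet;

// Hypothetical mapping of thread pools to cores; core numbering is an assumption.
class AffinityPolicySketch {
    static final int CPUS = 2;
    static final int CORES_PER_CPU = 6;
    static final int TOTAL_CORES = CPUS * CORES_PER_CPU;

    enum Policy { DEFAULT, PER_CPU, PER_CORE, INTERLEAVED }

    /** Cores on which thread `threadIndex` of pool `pool` (0 or 1) may be scheduled. */
    static BitSet allowedCores(Policy policy, int pool, int threadIndex) {
        BitSet mask = new BitSet(TOTAL_CORES);
        switch (policy) {
            case DEFAULT:
                // No constraint: the OS may place the thread on any core.
                mask.set(0, TOTAL_CORES);
                break;
            case PER_CPU: {
                // Each pool stays on the six cores of one CPU.
                int first = pool * CORES_PER_CPU;
                mask.set(first, first + CORES_PER_CPU);
                break;
            }
            case PER_CORE:
                // As PER_CPU, but each thread is pinned to a single core of that CPU.
                mask.set(pool * CORES_PER_CPU + threadIndex % CORES_PER_CPU);
                break;
            case INTERLEAVED:
                // Assumed layout: pools alternate cores, so each pool spans both CPUs.
                for (int core = pool; core < TOTAL_CORES; core += 2) {
                    mask.set(core);
                }
                break;
        }
        return mask;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " kernel thread 0: " + allowedCores(p, 0, 0)
                    + ", invoker thread 0: " + allowedCores(p, 1, 0));
        }
    }
}
```

A mask computed this way would then be applied to the running thread through an affinity call such as the setThreadAffinity() entry listed in the OverHPC API slide, whose exact signature is not shown in the deck.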


Performance Layers

Concurrent Business Process Instances (5’000 - 30’000): Throughput, Walltime, ...
OS Threads (24): Thread Migrations, Context sw, ...
Hardware Cores (12): Hardware Performance Counters (Cache miss, ...)


Experimental Results
Relative Speedup with 30k instances
[Bar chart, 30000 instances: relative speedup (0.7 to 1.3) of the Default, Per CPU, Per core, and Interleaved policies for the DAG, Parallel, Sequential, and Loop workloads and their geometric mean]

2 x AMD Barcelona 6 cores processors with 2 LLC
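A short sketch of how a relative speedup and its geometric mean across workloads can be computed, assuming the speedup is the throughput under a policy divided by the throughput under the Default policy. The slides do not state the exact definition, and the numbers in main are made up for illustration.

```java
// Illustrative helper; the definitions are assumptions, the values are not measurements.
class SpeedupSketch {

    /** Throughput under a policy divided by throughput under the baseline policy. */
    static double relativeSpeedup(double policyThroughput, double baselineThroughput) {
        return policyThroughput / baselineThroughput;
    }

    /** Geometric mean, computed through logarithms to avoid overflow on long inputs. */
    static double geometricMean(double[] values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }

    public static void main(String[] args) {
        // Made-up throughputs (instances/sec) for four workloads: baseline vs. one policy.
        double[] baseline = {1000, 1200, 800, 900};
        double[] policy = {1100, 1200, 760, 990};
        double[] speedups = new double[baseline.length];
        for (int i = 0; i < baseline.length; i++) {
            speedups[i] = relativeSpeedup(policy[i], baseline[i]);
        }
        System.out.println("geomean speedup = " + geometricMean(speedups));
    }
}
```

The geometric mean is the usual choice for averaging ratios, since it treats a speedup of 2 and a slowdown of 0.5 symmetrically.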

HPC-Based Validation

Ineffective SW prefetches
A prefetch request for a memory address that is already in the cache

L3 cache evictions
Data that needs to be stored in the cache is larger than the available free space


Ineffective SW prefetches
[Bar chart, 30000 instances: relative ineffective SW prefetches (0.7 to 1.2) for the Default, Per CPU, Per core, and Interleaved policies]

L3 Cache evictions
[Bar chart, 30000 instances: relative L3 evictions (0.4 to 2) for the Default, Per CPU, Per core, and Interleaved policies]


[Bar chart, 10000 instances: relative speedup (0.7 to 1.2) for the Default, Per CPU, Per core, and Interleaved policies]

[Bar chart, 10000 instances: y-axis label not preserved (0.7 to 1.2), same four policies]

L3 Cache evictions
[Bar chart, 10000 instances: L3 evictions (0.4 to 1.8) for the Default, Per CPU, Per core, and Interleaved policies]



[Bar chart, 5000 instances: relative speedup (0.7 to 1.3) for the Default, Per CPU, Per core, and Interleaved policies]

[Bar chart, 5000 instances: y-axis label not preserved (0.7 to 1.2), same four policies]

L3 Cache evictions
[Bar chart, 5000 instances: L3 evictions (0.4 to 2) for the Default, Per CPU, Per core, and Interleaved policies]


Correlation Coefficients
(Hardware events - JOpera throughput)

Workload Size           Ineffective    L3 Cache
(Number of Instances)   SW Pref        Evictions

5000                    0.9842         0.9456
10000                   0.9125         0.9883
30000                   0.9661         0.9946
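The slides do not say which correlation coefficient was used. The sketch below assumes the Pearson coefficient and shows how it could be computed between a series of hardware event counts and the matching throughput values. The class name and the sample data are illustrative.

```java
// Pearson correlation sketch; the choice of Pearson is an assumption.
class CorrelationSketch {

    /** Pearson correlation coefficient of two equally long series. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The 1/n factors cancel between numerator and denominator.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Made-up series, one value per scheduling policy.
        double[] events = {1.00, 0.92, 0.85, 1.10};
        double[] throughput = {1.00, 1.05, 1.09, 0.93};
        System.out.println("r = " + pearson(events, throughput));
    }
}
```

A coefficient close to 1 or -1 indicates a strong linear relationship; the sign tells whether the two series move together or in opposite directions.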


Conclusion

• Multicore machines offer powerful hardware parallelism, but what matters is not just the number of processing elements (PEs)
• Performance depends on how a limited number of threads is mapped to the hardware
• Multicore-aware thread scheduling significantly impacts performance (up to 10% speedup)


Thank you!
OverHPC Library:
http://sosoa.inf.usi.ch

JOpera business process execution engine:
http://www.jopera.org

Twitter:
@jopera_org

me:
daniele.bonetta@usi.ch

