C++ on its way to exascale and beyond
– The HPX Parallel Runtime System
Thomas Heller (thomas.heller@cs.fau.de)
January 21, 2016
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 671603.
What is Exascale anyway?
Exascale in numbers
• An Exascale computer is supposed to execute 10¹⁸ floating point operations per second
• Exa: 10¹⁸ = 1 000 000 000 000 000 000
• People on Earth: 7.3 billion = 7.3 × 10⁹
• Imagine each person is able to compute one operation per second. It takes:
⇒ 136 986 301 seconds
⇒ 2 283 105 minutes
⇒ 38 051 hours
⇒ 1 585 days
⇒ more than 4 years
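Written out as one line of arithmetic (same numbers as above):

\[
\frac{10^{18}\ \text{operations}}{7.3 \times 10^{9}\ \text{persons} \times 1\ \text{op/s per person}}
\approx 1.37 \times 10^{8}\ \text{s} \approx 4.3\ \text{years}
\]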
Why do we need that many calculations?
Challenges
• How do we program those beasts?
⇒ Massively parallel processors
⇒ Massive numbers of compute nodes
⇒ Deep memory hierarchies
• How can we design the architecture to be affordable?
⇒ Biggest operational cost is energy
⇒ Power envelope of 20 MW
⇒ Current fastest computer (Tianhe-2): 17 MW
Current Development
Current #1 system:
• Tianhe-2: 33.9 PFLOPS
• ≈ 3.4% of an Exaflop
Hardware Trends
• ARM: low-power ARM64 cores (possibly with embedded GPU accelerators)
• IBM: POWER + NVIDIA accelerators
• Intel: Knights Landing (Xeon Phi) many-core processor
How will C++ deal with all that?!?
Challenges
• Programmability
• Expressing Parallelism
• Expressing Data Locality
The 4 Horsemen of the Apocalypse: SLOW
Starvation
Latency
Overhead
Waiting for contention
State of the Art
• Modern architectures impose massive challenges on programmability in
the context of performance portability
• Massive increase in on-node parallelism
• Deep memory hierarchies
• The only portable parallelization solutions for C++ programmers (today):
OpenMP and MPI
• Hugely successful for years
• Widely used and supported
• Simple use for simple use cases
• Very portable
• Highly optimized
State of the Art – Parallelism in C++
• C++11 introduced lower-level abstractions
• std::thread, std::mutex, std::future, etc.
• Fairly limited; more is needed
• C++ needs stronger support for higher-level parallelism
• Several proposals to the Standardization Committee have been accepted or are
under consideration
• Technical Specification: Concurrency (P0159, note: misnomer)
• Technical Specification: Parallelism (P0024)
• Other smaller proposals: resumable functions, task regions, executors
• Currently there is no overarching vision related to higher-level parallelism
• Goal is to standardize a ‘big story’ by 2020
• No need for OpenMP, OpenACC, OpenCL, etc.
Stepping Aside – Introducing HPX
HPX – A general purpose parallel Runtime System
• Solidly based on a theoretical foundation – a well defined, new execution
model (ParalleX)
• Exposes a coherent and uniform, standards-oriented API for ease of
programming parallel and distributed applications.
• Enables writing fully asynchronous code using hundreds of millions of threads.
• Provides unified syntax and semantics for local and remote operations.
• Open Source: Published under the Boost Software License
HPX – A general purpose parallel Runtime System
HPX represents an innovative mixture of
• A global system-wide address space (AGAS - Active Global Address
Space)
• Fine-grained parallelism and lightweight synchronization
• Combined with implicit, work-queue-based, message-driven computation
• Full semantic equivalence of local and remote execution, and
• Explicit support for hardware accelerators (through percolation)
HPX 101 – The programming model
[Diagram: N localities (compute nodes), each with its own memory and a thread scheduler running many lightweight HPX threads. The Active Global Address Space (AGAS) service and the Parcelport connect all localities into a single global address space, so local and remote operations use the same syntax:]

future<id_type> id =
    new_<Component>(locality, ...);
future<R> result =
    async(id.get(), action, ...);
HPX 101 – Overview
                         | Synchronous              | Asynchronous                    | Fire & Forget
R f(p...)                | (returns R)              | (returns future<R>)             | (returns void)
-------------------------+--------------------------+---------------------------------+---------------------------------
C++
Functions (direct)       | f(p...)                  | async(f, p...)                  | apply(f, p...)
C++ Standard Library
Functions (lazy)         | bind(f, p...)(...)       | async(bind(f, p...), ...)       | apply(bind(f, p...), ...)
HPX
Actions (direct)         | HPX_ACTION(f, a)         | HPX_ACTION(f, a)                | HPX_ACTION(f, a)
                         | a()(id, p...)            | async(a(), id, p...)            | apply(a(), id, p...)
Actions (lazy)           | HPX_ACTION(f, a)         | HPX_ACTION(f, a)                | HPX_ACTION(f, a)
                         | bind(a(), id, p...)(...) | async(bind(a(), id, p...), ...) | apply(bind(a(), id, p...), ...)

In addition: dataflow(func, f1, f2);
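A minimal sketch of how the action forms above are used. This is hedged: the registration macro in released HPX versions is spelled HPX_PLAIN_ACTION, and the hpx/hpx_main.hpp convenience header is assumed to bootstrap the runtime.

#include <hpx/hpx_main.hpp>          // starts the HPX runtime around main()
#include <hpx/include/actions.hpp>
#include <hpx/include/async.hpp>
#include <iostream>

int square(int i) { return i * i; }
HPX_PLAIN_ACTION(square, square_action);     // make square() remotely invocable

int main()
{
    hpx::id_type here = hpx::find_here();    // the locality we are running on

    int direct = square_action()(here, 6);                       // synchronous
    hpx::future<int> f = hpx::async(square_action(), here, 7);   // asynchronous
    hpx::apply(square_action(), here, 8);                        // fire & forget

    std::cout << direct << " " << f.get() << std::endl;          // prints 36 49
    return 0;
}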
The Future, an example
int universal_answer() { return 42; }

void deep_thought()
{
    future<int> promised_answer = async(&universal_answer);
    // do other things for 7.5 million years
    cout << promised_answer.get() << endl;   // prints 42, eventually
}
Compositional facilities
• Sequential composition of futures
future<string> make_string()
{
    future<int> f1 = async([]() -> int { return 123; });
    future<string> f2 = f1.then(
        [](future<int> f) -> string
        {
            // here .get() won't block
            return to_string(f.get());
        });
    return f2;
}
Compositional facilities
• Parallel composition of futures
future<int> test_when_all()
{
    future<int> future1 = async([]() -> int { return 125; });
    future<string> future2 = async([]() -> string { return string("hi"); });

    auto all_f = when_all(future1, future2);

    future<int> result = all_f.then(
        [](auto f) -> int
        {
            return do_work(f.get());
        });
    return result;
}
Dataflow – The new 'async' (HPX)
• What if one or more arguments to 'async' are futures themselves?
• Normal behavior: pass the futures through to the function
• Extended behavior: wait for the futures to become ready before invoking the
function:
template <typename F, typename... Args>
future<result_of_t<F(Args...)>>
// requires(is_callable<F(Args...)>)
dataflow(F&& f, Args&&... args);
• If ArgN is a future, then the invocation of F will be delayed
• Non-future arguments are passed through (see the sketch below)
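A minimal, self-contained sketch of this behavior (the header choice and the toy add() function are assumptions; the point is that dataflow only invokes the callable once all future arguments are ready):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>      // hpx::dataflow
#include <iostream>
#include <utility>

// The futures handed to the callable arrive ready, so .get() never blocks here.
int add(hpx::future<int> a, hpx::future<int> b)
{
    return a.get() + b.get();
}

int main()
{
    hpx::future<int> fa = hpx::async([] { return 17; });
    hpx::future<int> fb = hpx::async([] { return 25; });

    // add() is invoked only after both fa and fb have become ready
    hpx::future<int> sum = hpx::dataflow(add, std::move(fa), std::move(fb));

    std::cout << sum.get() << std::endl;   // prints 42
    return 0;
}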
Parallel Algorithms
Concepts of Parallelism – Parallel Execution Properties
• The execution restrictions applicable to the work items
• In what sequence the work items have to be executed
• Where the work items should be executed
• The parameters of the execution environment
Concepts and Types of Parallelism
[Diagram: the application is written against parallel concepts (futures, async, dataflow, parallel algorithms, fork-join, etc.). These concepts are parameterized by execution policies (the restrictions), executors (the sequence of execution and where it happens), and executor parameters (the grain size).]
Execution Policies (std)
• Specify execution guarantees (in terms of thread-safety) for executed
parallel tasks:
• sequential_execution_policy: seq
• parallel_execution_policy: par
• parallel_vector_execution_policy: par_vec
• In the Parallelism TS these policies are used for parallel algorithms only (see the sketch below)
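A minimal sketch of the three policies applied to a parallel algorithm, using HPX's implementation of the Parallelism TS (header and namespace names follow the HPX API of that time and are assumptions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);

    using namespace hpx::parallel;

    for_each(seq,     v.begin(), v.end(), [](int& i) { i += 1; });   // sequential
    for_each(par,     v.begin(), v.end(), [](int& i) { i += 1; });   // parallel
    for_each(par_vec, v.begin(), v.end(), [](int& i) { i += 1; });   // parallel + vectorized
    return 0;
}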
Execution Policies (Extensions)
• Asynchronous Execution Policies:
• sequential_task_execution_policy: seq(task)
• parallel_task_execution_policy: par(task)
• In both cases the formerly synchronous functions return a future<>
• Instructs the parallel construct to be executed asynchronously
• Allows integration with asynchronous control flow (see the sketch below)
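A minimal sketch of an asynchronous policy in use (notation as on this slide; the header name is an assumption based on the 2016-era HPX API):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_each.hpp>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);

    using namespace hpx::parallel;

    // par(task): the algorithm is launched asynchronously and returns a future
    auto done = for_each(par(task), v.begin(), v.end(), [](int& i) { i *= 2; });

    // ... do other work here while the loop runs ...

    done.wait();   // or attach a continuation with done.then(...)
    return 0;
}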
Executors
• Executors are objects responsible for
• Creating execution agents on which work is performed (P0058)
• In P0058 this is limited to parallel algorithms, here much broader use
• Abstraction of the (potentially platform-specific) mechanisms for launching
work
• Responsible for defining the Where and How of the execution of tasks (see the
sketch below)
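A minimal sketch of binding a parallel algorithm to an executor via .on() (parallel_executor follows the HPX naming of that time; the exact type and header names are assumptions):

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_executors.hpp>
#include <hpx/include/parallel_for_each.hpp>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);

    using namespace hpx::parallel;

    parallel_executor exec;   // decides where and how the iterations are run

    for_each(par.on(exec), v.begin(), v.end(), [](int& i) { i += 1; });
    return 0;
}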
Execution Parameters
Allows controlling the grain size of the work
• i.e. the number of iterations of a parallel for_each run on the same thread
• Similar to OpenMP scheduling policies: static, guided, dynamic
• Much finer control (see the sketch below)
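A minimal sketch of attaching an executor parameter that fixes the chunk (grain) size; static_chunk_size and the header name follow the HPX API of that time and are assumptions:

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_executor_parameters.hpp>
#include <hpx/include/parallel_for_each.hpp>
#include <vector>

int main()
{
    std::vector<int> v(100000, 1);

    using namespace hpx::parallel;

    // run the loop in chunks of 1000 consecutive iterations per HPX thread
    static_chunk_size chunk(1000);

    for_each(par.with(chunk), v.begin(), v.end(), [](int& i) { i += 1; });
    return 0;
}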
Putting it all together – SAXPY routine with data locality
• a[i] = b[i] ∗ x + c[i], for i from 0 to N − 1
• Using parallel algorithms
• Explicit Control over data locality
• No raw Loops
Putting it all together – SAXPY routine with data locality
Complete serial version:
std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

// binary std::transform takes only the begin iterator of the second range
std::transform(b.begin(), b.end(),
    c.begin(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, no data locality:
std::vector<double> a = ...;
std::vector<double> b = ...;
std::vector<double> c = ...;
double x = ...;

parallel::transform(parallel::par,
    b.begin(), b.end(),
    c.begin(), c.end(), a.begin(),
    [x](double bb, double cc)
    {
        return bb * x + cc;
    });
Putting it all together – SAXPY routine with data locality
Parallel version, with explicit data locality:
std::vector<double, numa_allocator> a = ...;
std::vector<double, numa_allocator> b = ...;
std::vector<double, numa_allocator> c = ...;
double x = ...;

for (auto& numa_executor : numa_executors) {
    parallel::transform(
        parallel::par.on(numa_executor),
        b.begin() + ..., b.begin() + ...,
        c.begin() + ..., c.begin() + ..., a.begin() + ...,
        [x](double bb, double cc)
        { return bb * x + cc; });
}
Case Studies
LibGeoDecomp
• C++ Auto-parallelizing framework
• Open Source
• High scalability
• Wide range of platform support
• http://www.libgeodecomp.org
LibGeoDecomp
Futurizing the Simulation Flow
Basic Simulation flow:
for (Region r: innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid);
++step;
for (Region r: outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
for (Region r: outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
for (Region r: innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
LibGeoDecomp
Futurizing the Simulation Flow
Futurized Simulation flow:
parallel for (Region r: innerRegion) {
    update(r, oldGrid, newGrid, step);
}
swap(oldGrid, newGrid); ++step;
parallel for (Region r: outerGhostZoneRegion) {
    notifyPatchProviders(r, oldGrid);
}
parallel for (Region r: outerGhostZoneRegion) {
    update(r, oldGrid, newGrid, step);
}
parallel for (Region r: innerGhostZoneRegion) {
    notifyPatchAccepters(r, oldGrid);
}
(Each stage is attached as a continuation of the previous one; a sketch of this
futurization with HPX follows below.)
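A hedged, self-contained sketch of the futurization idea using hpx::async and hpx::dataflow (the three stage functions are toy stand-ins for the pseudocode above, not LibGeoDecomp's real API):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>      // hpx::dataflow
#include <iostream>
#include <utility>

// Toy stand-ins for the simulation stages shown above.
void update_inner()        { std::cout << "update inner region\n"; }
void exchange_ghostzones() { std::cout << "exchange ghost zones\n"; }
void update_ghostzones()   { std::cout << "update ghost zones\n"; }

int main()
{
    hpx::future<void> stage = hpx::async(update_inner);

    // each stage becomes a continuation of the previous one
    stage = hpx::dataflow([](hpx::future<void>) { exchange_ghostzones(); },
                          std::move(stage));
    stage = hpx::dataflow([](hpx::future<void>) { update_ghostzones(); },
                          std::move(stage));

    stage.get();   // wait for this (single) time step to finish
    return 0;
}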
HPXCL – Extending the Global Address Space
• All GPU devices are addressable globally
• GPU memory can be allocated and referenced remotely
• Events are extensions of the shared state
⇒ API embedded into the already existing future facilities
From async to GPUs
Spawning single tasks not feasible
⇒ offload a work group (Think of parallel::for_each)
auto devices = hpx::opencl::find_devices(
    hpx::find_here(), CL_DEVICE_TYPE_GPU).get();

// create buffers, programs and kernels ...
hpx::opencl::buffer buf = devices[0].create_buffer(
    CL_MEM_READ_WRITE, 4711);

auto write_future = buf.enqueue_write(
    some_vec.begin(), some_vec.end());
auto kernel_future = kernel.enqueue(dim, write_future);
From async to GPUs
Spawning single tasks not feasible
⇒ offload a work group (Think of parallel::for_each)
• Proof of concept
• Future directions:
• Embed OpenCL devices behind execution policies and executors
• Hide OpenCL details behind parallel algorithms
• Hide OpenCL buffer management behind "distributed data structures"
Mandelbrot example
[Architecture diagram: a webserver, a Google Maps API client, a work queue, a generator, and worker processes that compute the Mandelbrot tiles.]
Acknowledgements to Martin Stumpf
LibGeoDecomp
Performance Results
[Plot: Execution Times of HPX and MPI N-Body Codes (SMP, Weak Scaling). X axis: number of cores on one node (1-16); Y axis: time in seconds (0-70). Series: Sim HPX, Sim MPI, Comm HPX, Comm MPI.]
LibGeoDecomp
Performance Results
[Plot: Weak Scaling Results for HPX N-Body Code (Single Xeon Phi, Futurized). X axis: number of cores (0-60); Y axis: performance in GFLOPS (0-1600). Series: 1, 2, 3 and 4 threads per core.]
LibGeoDecomp
Performance Results
[Plot: Weak Scaling Results for HPX N-Body Codes (Host Cores and Xeon Phi Accelerator). X axis: number of nodes, 16 cores on the host plus a full Xeon Phi each (0-16); Y axis: performance in TFLOPS (0-30). Series: HPX, Peak.]
STREAM Benchmark
[Plot: TRIAD STREAM Results (50 million data points). X axis: number of cores per NUMA domain (1-12); Y axis: bandwidth in GB/s (10-80). Series: HPX and OpenMP, each on 1 and 2 NUMA domains.]
Matrix Transpose
[Plot: Matrix Transpose (SMP, 24k x 24k matrices). X axis: number of cores per NUMA domain (1-12); Y axis: data transfer rate in GB/s (0-60). Series: HPX and OMP, each on 1 and 2 NUMA domains.]
Matrix Transpose
[Plot: Matrix Transpose (SMP, 24k x 24k matrices). X axis: number of cores per NUMA domain (1-12); Y axis: data transfer rate in GB/s (0-60). Series: HPX (2 NUMA domains), MPI (1 NUMA domain, 12 ranks), MPI (2 NUMA domains, 24 ranks), MPI+OMP (2 NUMA domains).]
Matrix Transpose
[Plot: Matrix Transpose (Xeon Phi, 24k x 24k matrices). X axis: number of cores (0-60); Y axis: data transfer rate in GB/s (0-50). Series: HPX and OMP with 1, 2 and 4 PUs per core.]
Matrix Transpose
[Plot: Matrix Transpose (Distributed, 18k x 18k elements per node). X axis: number of nodes, 16 cores each (2-8); Y axis: data transfer rate in GB/s (0-35). Series: HPX, MPI.]
What’s beyond Exascale?
Conclusions
Higher-level parallelization abstractions in C++:
• uniform, versatile, and generic
• All of this is enabled by use of modern C++ facilities
• Runtime system (fine-grain, task-based schedulers)
• Performant, portable implementation
Parallelism is here to stay!
• Massively parallel hardware is already part of our daily lives!
• Parallelism is observable everywhere:
⇒ IoT: massive numbers of devices existing in parallel
⇒ Embedded: massively parallel, energy-aware systems (Epiphany, DSPs, FPGAs)
⇒ Automotive: massive amounts of parallel sensor data to process
• We all need solutions for dealing with this efficiently and pragmatically
More Information
• https://github.com/STEllAR-GROUP/hpx
• http://stellar-group.org
• hpx-users@stellar.cct.lsu.edu
• #STE||AR @ irc.freenode.org
Collaborations:
• FET-HPC (H2020): AllScale (https://allscale.eu)
• NSF: STORM (http://storm.stellar-group.org)
• DOE: Part of X-Stack