Deploying a Task-based Runtime System on Raspberry Pi Clusters
Patrick Diehl and Steven R. Brandt
Louisiana State University
{pdiehl,sbrandt}@cct.lsu.edu
Extreme Scale Programming Models and Middleware (ESPM2’20)
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 1 / 23
Motivation
Arm® technology is emerging in supercomputers and data centers, e.g., Fugaku, the fastest supercomputer in the Top500.
Low power consumption.
Low cost of a Raspberry Pi and of building a small cluster: one cluster of 4 nodes costs around $200.
Outline
1 Tools
2 Hardware and Software
3 Benchmarks
4 Results
Memory
Computation time
Energy consumption
5 Conclusion & Outlook
Tools
HPX
The C++ Standard Library for Concurrency and Parallelism
Figure: HPX architecture. Thread scheduling, the Active Global Address Space, the parcel transport layer, local control objects, and performance monitoring sit between the HPX application and the operating system.
HPX's lightweight user-level threads reduce context-switching overhead.
The Active Global Address Space (AGAS) provides a unified view of the application.
The parcel layer overlaps communication with computation.
Reference
Kaiser et al. (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software, 5(53), 2352, https://doi.org/10.21105/joss.02352
Phylanx
An Asynchronous Distributed Array Computing Toolkit
Runs Python code in parallel within the HPX runtime system.
Python is a commonly used language in machine learning and deep learning.
Reference
Tohid, R., et al. "Asynchronous execution of Python code on task-based runtime systems." 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018.
Hardware and Software
Raspberry Pi cluster & Software
Table: Specification/architecture of the three nodes utilised in the benchmarks.
Model               Raspberry Pi 3B   Raspberry Pi 3B+   Raspberry Pi 4B
Micro-architecture  Arm® v8-A         Arm® v8-A          Arm® v8
Processor model     Cortex-A53        Cortex-A53         Cortex-A72
Number of CPUs      1                 1                  1
Cores per CPU       4                 4                  4
Total cores         4                 4                  4
Frequency           1.2 GHz           1.4 GHz            1.5 GHz
Memory              1 GB              1 GB               4 GB
Table: Overview of the compilers, software, and operating system used.
Operating System  Ubuntu 20.04 LTS for Arm®   Kernel      5.4
Compilers         gcc 9.30.1                  boost       1.71
hwloc             2.1.0                       gperftools  2.7
lapack            3.8                         blaze [1]   75179e6
HPX [2]           5b9de48ab1
[1] https://bitbucket.org/blaze-lib/blaze
[2] https://github.com/STEllAR-GROUP/hpx
Benchmarks
2D Jacobi Solver (Shared memory)
2D stencil based on the Jacobi method using
Standard grid layout for GCC's auto-vectorization (-O3)
Virtual Node Scheme [1] for explicit vectorization
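As a rough illustration of the stencil itself, the following pure-Python sketch performs Jacobi sweeps on a small grid stored in the standard row-by-row layout. The function names, the four-point average, and the fixed boundary values are assumptions made for illustration; the benchmarked code is not reproduced here.

```python
def jacobi_sweep(grid):
    """Return a new grid after one Jacobi sweep.

    Interior points become the average of their four neighbours.
    Boundary points are copied unchanged (an assumed boundary treatment).
    """
    ny, nx = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # Jacobi reads only from the old iterate
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            new[y][x] = 0.25 * (grid[y - 1][x] + grid[y + 1][x]
                                + grid[y][x - 1] + grid[y][x + 1])
    return new


def jacobi(grid, steps):
    """Apply `steps` Jacobi sweeps and return the final grid."""
    for _ in range(steps):
        grid = jacobi_sweep(grid)
    return grid
```

Because each sweep writes into a separate array, the inner loop over x has no loop-carried dependency, which is the property that makes such a loop a candidate for vectorization in a compiled language.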
Roofline model [2] to predict the optimal performance:
P_optimal = Memory Bandwidth × AI
where the arithmetic intensity (AI) is 1/24 for double precision and 1/12 for single precision.
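The roofline estimate can be written as a one-line helper. This is a sketch under the assumption that the memory bandwidth is given in MB/s and the arithmetic intensity in lattice updates per byte, so that the result is in million lattice updates per second; the function and constant names are illustrative.

```python
AI_DOUBLE = 1.0 / 24.0  # arithmetic intensity for double precision (from the slide)
AI_SINGLE = 1.0 / 12.0  # arithmetic intensity for single precision (from the slide)


def roofline_peak(memory_bandwidth, arithmetic_intensity):
    """Optimal performance of a bandwidth-bound kernel: bandwidth times AI."""
    return memory_bandwidth * arithmetic_intensity
```

For example, a hypothetical measured bandwidth of 2400 MB/s would give `roofline_peak(2400.0, AI_DOUBLE)`, about 100, and twice that for single precision.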
References
[1] P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, "Grid: A next generation data parallel C++ QCD library," arXiv preprint arXiv:1512.03487, 2015.
[2] S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, "The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures," in Hot Chips, vol. 20, 2008, pp. 24–26.
1D heat equation solver (Distributed memory)
Parameters: heat transfer coefficient k = 0.5, time step dt = 1, and grid spacing dx = 1.
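To show how these parameters enter the computation, here is a minimal single-process sketch of an explicit finite-difference update for the 1D heat equation. The explicit scheme, the periodic boundaries, and the function names are assumptions made for this sketch; the distributed version partitions the grid across nodes, which is not shown.

```python
def heat_step(u, k=0.5, dt=1.0, dx=1.0):
    """One explicit time step: u_i += k*dt/dx^2 * (u_{i-1} - 2*u_i + u_{i+1}).

    Periodic boundaries are assumed: u[-1] wraps on the left and the
    modulo wraps on the right.
    """
    n = len(u)
    c = k * dt / (dx * dx)
    return [u[i] + c * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def solve_heat(u0, steps, k=0.5, dt=1.0, dx=1.0):
    """Advance the initial condition u0 by `steps` time steps."""
    u = list(u0)
    for _ in range(steps):
        u = heat_step(u, k, dt, dx)
    return u
```

With k = 0.5 and dt = dx = 1, the factor k*dt/dx^2 equals 0.5, so each new value is simply the mean of its two neighbours.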
Figure: Initial conditions (plotted quantity: u).
Figure: Solution (plotted quantity: u).
Results
STREAM TRIAD Benchmark
Figure: Memory bandwidth (in MB/s) versus core count (1 to 4) for the Rpi 3B, Rpi 3B+, and Rpi 4, measured with the STREAM TRIAD benchmark with an array size of 10M elements.
The Pi 3B and 3B+ have very low memory bandwidth.
A single processing unit (PU) already saturates it.
The Pi 4 shows the same behavior, but with double the memory bandwidth.
Conclusion: The memory bus can handle only a certain amount of bandwidth and concurrency at the same time.
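For reference, the TRIAD kernel and the way a bandwidth figure is derived from its run time can be sketched as follows. Pure Python lists are used only to keep the example self-contained, so the number `measure` reports reflects interpreter overhead rather than the memory bus; the function names and the default problem size are assumptions.

```python
import time


def triad(a, b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_mb_per_s(n, seconds, bytes_per_value=8):
    """Bandwidth implied by one TRIAD pass: two arrays read, one written."""
    return 3 * n * bytes_per_value / seconds / 1e6


def measure(n=100_000, q=3.0):
    """Time one TRIAD pass and return the result array and the implied MB/s."""
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = time.perf_counter() - start
    return a, triad_bandwidth_mb_per_s(n, elapsed)
```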
2D stencil (Raspberry Pi 4)
Figure: 2D stencil (Raspberry Pi 4): performance (in MLUP/s) versus core count (1 to 4) in single and double precision for the scalar (auto-vectorized) and vector (explicitly vectorized) variants, together with the expected peak. Grid size of 4096×4096 iterated over 100 time steps.
Conclusion: Best performance on 2 cores.
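The performance metric in these plots can be computed from a run time as sketched below, assuming that MLUP/s stands for million lattice updates per second and that every grid point counts as one update per time step. The function name is illustrative.

```python
def mlups(nx, ny, time_steps, seconds):
    """Million lattice updates per second for an nx-by-ny grid."""
    return nx * ny * time_steps / seconds / 1e6
```

For the 4096×4096 grid and 100 time steps used here, one run performs 1,677,721,600 updates, so a hypothetical run time of `t` seconds corresponds to `mlups(4096, 4096, 100, t)`.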
2D stencil (Raspberry Pi 3B+)
Figure: 2D stencil (Raspberry Pi 3B+): performance (in MLUP/s) versus core count (1 to 4) in single and double precision for the scalar (auto-vectorized) and vector (explicitly vectorized) variants, together with the expected peak. Grid size of 4096×4096 iterated over 100 time steps.
Conclusion: We cannot achieve the expected peak performance.
2D stencil (Raspberry Pi 3B)
Figure: 2D stencil (Raspberry Pi 3B): performance (in MLUP/s) versus core count (1 to 4) in single and double precision for the scalar (auto-vectorized) and vector (explicitly vectorized) variants, together with the expected peak. Grid size of 4096×4096 iterated over 100 time steps.
Conclusion: The Pi 3B and 3B+ perform similarly, because the two models differ only in clock speed.
Multi-Node Benchmark
Figure: Execution time in seconds for various node counts (1 to 4) using all 4 threads, on the Raspberry Pi 3B, 3B+, and 4, for Np=30M, Nt=100; Np=60M, Nt=100; and Np=60M, Nt=500.
Conclusion: Multi-node codes can scale well.
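Scaling results like these are commonly summarized by speedup and parallel efficiency relative to the single-node run. The helpers below are a small sketch of those two definitions; the names are illustrative, and any timings passed to them would have to be read off the plots.

```python
def speedup(t_one_node, t_n_nodes):
    """Speedup of an n-node run relative to the single-node run."""
    return t_one_node / t_n_nodes


def parallel_efficiency(t_one_node, t_n_nodes, nodes):
    """Fraction of ideal (linear) speedup achieved on `nodes` nodes."""
    return speedup(t_one_node, t_n_nodes) / nodes
```

With hypothetical timings of 120 s on one node and 40 s on four nodes, the speedup would be 3 and the efficiency 0.75.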
Multi-Node Benchmark
Figure: Execution time in seconds for various node counts (1 to 4) using 1, 2, 3, and 4 threads per node, on the Raspberry Pi 3B, 3B+, and 4.
Conclusion: Additional threads per node provide little performance gain and actually hurt performance on the Pi 3B and 3B+.
Phylanx: ALS Benchmark
Figure: Performance of the ALS benchmark: execution time in seconds for various core counts (1 to 4) on the Rpi 4, Rpi 3B, and Rpi 3B+.
Conclusion: Our ALS code is fastest on 2 cores, but probably needs more
development.
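For readers unfamiliar with alternating least squares, the idea can be sketched in a few lines of pure Python. The sketch fits a rank-1 model without regularization, which is a deliberate simplification; it is not the Phylanx implementation benchmarked here, and the function name and the use of None for missing ratings are assumptions.

```python
def als_rank1(ratings, iterations=10):
    """Fit ratings[i][j] ~ u[i] * v[j] by alternating least squares.

    `ratings` is a list of equally long rows; None marks a missing entry.
    """
    m, n = len(ratings), len(ratings[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iterations):
        # Fix v and solve the scalar least-squares problem for each u[i].
        for i in range(m):
            seen = [j for j in range(n) if ratings[i][j] is not None]
            den = sum(v[j] ** 2 for j in seen)
            u[i] = sum(ratings[i][j] * v[j] for j in seen) / den if den else 0.0
        # Fix u and solve the scalar least-squares problem for each v[j].
        for j in range(n):
            seen = [i for i in range(m) if ratings[i][j] is not None]
            den = sum(u[i] ** 2 for i in seen)
            v[j] = sum(ratings[i][j] * u[i] for i in seen) / den if den else 0.0
    return u, v
```

Each half-step is an independent least-squares solve per row or per column, which is what makes the method attractive for parallel execution.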
Cost with respect to Power Consumption
Figure: 1D stencil: cost (in 1e-5 US ¢) with respect to power consumption on the Raspberry Pi 3B, 3B+, and 4 for the 1D stencil code, using 30 million stencil points per iteration and a total of 100 iterations.
Figure: ALS: cost (in 1e-5 US ¢) with respect to power consumption on the Raspberry Pi 3B, 3B+, and 4 for the Alternating Least Squares (ALS) benchmark in Phylanx on the MovieLens 20M dataset.
The power consumption of each model was obtained while loading all four cores with the Linux command stress [3].
[3] https://linux.die.net/man/1/stress
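One plausible way to turn a measured power draw and a run time into an electricity cost is sketched below. The slides do not state the power values or the electricity price used, so every input to this helper is an assumption, as is the function name.

```python
def energy_cost_cents(power_watts, seconds, cents_per_kwh):
    """Electricity cost in US cents of running at `power_watts` for `seconds`."""
    kwh = power_watts * seconds / 3.6e6  # 1 kWh = 3.6e6 joules
    return kwh * cents_per_kwh
```

Dividing the result by 1e-5 gives the cost in the units used on the plot axes.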
Conclusion & Outlook
Conclusion
Limited memory bandwidth limits the utilization of all cores.
Ubuntu Server 20.04 supports both 32-bit and 64-bit.
The CPU frequency setting “performance” was used instead of “ondemand”.
The cluster provides modest performance at a reasonable cost.
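On Linux, the active frequency setting is usually exposed per CPU through the cpufreq sysfs files. The sketch below reads them to check that every core reports “performance”. The helper names are hypothetical, the default path assumes the standard cpufreq layout, and the slides do not say how the setting was applied on the cluster.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())
```

Running `all_performance(read_governors())` before a benchmark is a cheap guard against timing runs with a different frequency setting by accident.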
Outlook
Use the small and affordable cluster for teaching parallel and distributed computing.
The interface for attaching sensors could be used in field studies to collect data, and Phylanx could process the data before uploading it to more powerful devices for analysis.
Use a larger cluster and more sophisticated Arm® hardware.
This work is licensed under a Creative
Commons “Attribution-NonCommercial-
NoDerivatives 4.0 International” license.