On Dynamic Load Balancing on Graphics Processors
Daniel Cederman and Philippas Tsigas
Chalmers University of Technology

Overview
Motivation
Methods
Experimental evaluation
Conclusion

The problem setting
Work
Offline: Task, Task, Task, Task, Task, Task, Task
Online: Task, Task, Task, Task

Static Load Balancing
Diagram (animation frames): four processors, four tasks, four subtasks.

Dynamic Load Balancing
Diagram: four processors, four tasks, four subtasks.

Task sharing
Check condition: Work done? If yes, done.
Acquire task: Try to get a task from the task set. Got task? If no, retry.
Perform task: New tasks? If no, continue.
Add task: Add the new tasks to the task set.

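The loop on this slide can be sketched on the CPU as follows. This is a minimal, single-threaded C++ sketch, not code from the talk: `TaskSet`, `run_task_sharing`, the `pending` counter used for the "work done?" test, and the rule that a task of value n > 0 spawns two tasks of value n - 1 are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <deque>

// Hypothetical stand-in for the shared task set (single-threaded here).
struct TaskSet {
    std::deque<int> tasks;
    bool try_get(int& task) {
        if (tasks.empty()) return false;
        task = tasks.front();
        tasks.pop_front();
        return true;
    }
    void add(int task) { tasks.push_back(task); }
};

// Runs the task-sharing loop until no task is queued or in progress.
// `pending` counts tasks that were added but not yet finished.
// Illustrative workload: a task of value n > 0 spawns two tasks of value n - 1.
// Returns the number of tasks performed.
int run_task_sharing(TaskSet& set, std::atomic<int>& pending) {
    int performed = 0;
    while (pending.load() != 0) {          // Work done?
        int task;
        if (!set.try_get(task)) continue;  // Got task? No, retry.
        ++performed;                       // Perform task.
        if (task > 0) {                    // New tasks?
            pending.fetch_add(2);          // count them before they become visible
            set.add(task - 1);             // Add task.
            set.add(task - 1);
        }
        pending.fetch_sub(1);              // this task is finished
    }
    assert(set.tasks.empty());
    return performed;
}
```

Incrementing `pending` before the new tasks are added keeps the counter from reaching zero while work is still outstanding, which matters once several workers share the set.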
System Model
CUDA
Global memory: gather and scatter, Compare-And-Swap, Fetch-And-Inc
Multiprocessors: maximum number of concurrent thread blocks
Diagram: global memory shared by the multiprocessors, each running several thread blocks.

Synchronization
Blocking: Uses mutual exclusion to allow only one process at a time to access the object.
Lock-free: Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
Wait-free: Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.

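The three progress guarantees can be contrasted with a small C++ sketch built on `std::atomic`. It is a CPU-side illustration under assumptions, not code from the talk: the names `SpinLock`, `blocking_increment`, `lockfree_max` and `waitfree_increment` are hypothetical, and the wait-free label for `fetch_add` only holds on hardware that executes it as a single instruction.

```cpp
#include <atomic>
#include <cassert>

// Blocking: a spin lock lets one caller at a time touch the object.
// A caller that stalls while holding the lock stalls everyone else.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() { while (locked_.exchange(true, std::memory_order_acquire)) { } }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

int blocking_increment(SpinLock& lock, int& counter) {
    lock.lock();
    int previous = counter++;
    lock.unlock();
    return previous;
}

// Lock-free: a compare-and-swap retry loop. One caller may retry many
// times, but a retry normally means another caller's update went through.
// Returns the maximum stored after the call completes its own update.
int lockfree_max(std::atomic<int>& maximum, int value) {
    int current = maximum.load();
    while (current < value &&
           !maximum.compare_exchange_weak(current, value)) {
        // compare_exchange_weak refreshed `current`; the loop re-checks it.
    }
    return current < value ? value : current;
}

// Wait-free (assuming a native fetch-and-add instruction): every call
// finishes in a bounded number of its own steps.
int waitfree_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1);
}
```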
Load Balancing Methods
Blocking Task Queue
Non-blocking Task Queue
Task Stealing
Static Task List

Blocking queue
Diagram (animation frames): thread blocks TB 1, TB 2, ..., TB n share one queue with a lock (shown as Free), a Head and a Tail; task T1 is added to the queue.

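A lock-protected queue of this shape might look like the following C++ sketch. It assumes a fixed-capacity circular buffer and a lock word updated with compare-and-swap; the class name `BlockingTaskQueue` and its interface are hypothetical, and the sketch runs on the CPU rather than as CUDA code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounded task queue guarded by a single lock word.
class BlockingTaskQueue {
    std::atomic<int> lock_{0};       // 0 = free, 1 = held
    std::vector<int> buffer_;
    std::size_t head_ = 0;           // index of the next task to dequeue
    std::size_t tail_ = 0;           // index of the next free slot

    void acquire() {
        int expected = 0;
        // Spin until the lock word changes from free to held.
        while (!lock_.compare_exchange_weak(expected, 1)) expected = 0;
    }
    void release() { lock_.store(0); }

public:
    explicit BlockingTaskQueue(std::size_t capacity) : buffer_(capacity) {
        assert(capacity > 0);
    }

    bool enqueue(int task) {
        acquire();
        bool ok = tail_ - head_ < buffer_.size();
        if (ok) buffer_[tail_++ % buffer_.size()] = task;
        release();
        return ok;                   // false when the queue is full
    }

    bool dequeue(int& task) {
        acquire();
        bool ok = head_ != tail_;
        if (ok) task = buffer_[head_++ % buffer_.size()];
        release();
        return ok;                   // false when the queue is empty
    }
};
```

Every enqueue and dequeue serializes on the same lock word, so all thread blocks contend on one memory location even when they touch opposite ends of the queue.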
Non-blocking Queue
Diagram (animation frames): thread blocks TB 1, TB 2, ..., TB n access the queue concurrently; it holds T1 to T4 between Head and Tail, and T5 is added.
Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA 01]

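To illustrate how a queue can work without a lock, here is a deliberately simplified C++ sketch. It is not the algorithm from the cited paper: it assumes a non-circular array whose slots are never reused, non-negative task values, and a tail index that is only a hint which any caller may help advance. The name `LockFreeTaskQueue` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified lock-free FIFO over an array whose slots are used once.
class LockFreeTaskQueue {
    static constexpr int kEmpty = -1;   // sentinel; tasks must be >= 0
    std::vector<std::atomic<int>> slots_;
    std::atomic<std::size_t> head_{0};  // next slot to dequeue
    std::atomic<std::size_t> tail_{0};  // hint: every slot below it is filled

public:
    explicit LockFreeTaskQueue(std::size_t capacity) : slots_(capacity) {
        for (auto& slot : slots_) slot.store(kEmpty);
    }

    bool enqueue(int task) {
        assert(task >= 0);
        for (;;) {
            std::size_t t = tail_.load();
            if (t >= slots_.size()) return false;   // array exhausted
            int expected = kEmpty;
            // Claim the slot by swapping the sentinel for the task.
            bool won = slots_[t].compare_exchange_strong(expected, task);
            // The slot is now filled, by us or by someone else, so it is
            // safe to help move the tail hint forward either way.
            tail_.compare_exchange_strong(t, t + 1);
            if (won) return true;
        }
    }

    bool dequeue(int& task) {
        for (;;) {
            std::size_t h = head_.load();
            if (h >= slots_.size()) return false;
            int value = slots_[h].load();
            if (value == kEmpty) return false;      // nothing enqueued yet
            // Whoever advances the head owns the value read from slot h.
            if (head_.compare_exchange_strong(h, h + 1)) {
                task = value;
                return true;
            }
        }
    }
};
```

A failed compare-and-swap here always means another caller completed a step, so no caller can hold the others up by stalling. Because slots are not recycled, the sketch sidesteps the reuse problems that a circular queue has to solve.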
Task stealing
Diagram (animation frames): each thread block (TB 1, TB 2, ..., TB n) has its own set of tasks; tasks T1 to T5 are added and removed across the frames.
Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]

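The per-block task set used for stealing can be sketched as a bounded double-ended queue: the owning block pushes and pops at one end, and other blocks steal from the other end with compare-and-swap. The C++ below is a simplified illustration with sequentially consistent atomics, not a transcription of the cited algorithm; `WorkStealingDeque` and its method names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified bounded work-stealing deque (one owner, many thieves).
class WorkStealingDeque {
    std::vector<std::atomic<int>> buffer_;
    std::atomic<std::int64_t> top_{0};     // next index thieves steal from
    std::atomic<std::int64_t> bottom_{0};  // next index the owner pushes to

    std::atomic<int>& slot(std::int64_t i) {
        return buffer_[static_cast<std::size_t>(i) % buffer_.size()];
    }

public:
    explicit WorkStealingDeque(std::size_t capacity) : buffer_(capacity) {
        assert(capacity > 0);
    }

    // Owner only: add a task at the bottom.
    bool push(int task) {
        std::int64_t b = bottom_.load();
        std::int64_t t = top_.load();
        if (b - t >= static_cast<std::int64_t>(buffer_.size())) return false;
        slot(b).store(task);
        bottom_.store(b + 1);
        return true;
    }

    // Owner only: take the most recently pushed task.
    bool pop(int& task) {
        std::int64_t b = bottom_.load() - 1;
        bottom_.store(b);                  // reserve the bottom element
        std::int64_t t = top_.load();
        if (t > b) {                       // deque was empty
            bottom_.store(b + 1);
            return false;
        }
        int value = slot(b).load();
        if (t == b) {                      // last element: race with thieves
            bool won = top_.compare_exchange_strong(t, t + 1);
            bottom_.store(b + 1);
            if (!won) return false;
        }
        task = value;
        return true;
    }

    // Any other block: take the oldest task.
    bool steal(int& task) {
        std::int64_t t = top_.load();
        std::int64_t b = bottom_.load();
        if (t >= b) return false;          // nothing to steal
        int value = slot(t).load();
        if (!top_.compare_exchange_strong(t, t + 1)) return false;  // lost race
        task = value;
        return true;
    }
};
```

The owner only needs a compare-and-swap when a single task is left, so in the common case a block works on its own tasks without contending with the others.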
Static Task List
Diagram (animation frames): the In list holds T1 to T4, one per thread block (TB 1 to TB 4); new tasks T5, T6 and T7 are written to the Out list.

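A static task list with separate In and Out arrays can be sketched as below. The slides only show the two arrays, so the sketch assumes that the Out list becomes the next In list once every task of a pass has been handled, and that new tasks claim their Out slot with fetch-and-inc. `run_static_task_list`, `PassStats` and the workload (a task of value n > 0 produces two tasks of value n - 1) are hypothetical, and the inner loop stands in for the thread blocks of one kernel launch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct PassStats {
    int passes;           // number of In/Out rounds
    int tasks_performed;  // total tasks handled
};

// Processes `in` pass by pass until a pass produces no new tasks.
PassStats run_static_task_list(std::vector<int> in, std::size_t out_capacity) {
    std::vector<int> out(out_capacity);
    PassStats stats{0, 0};
    while (!in.empty()) {
        std::atomic<std::size_t> out_count{0};
        // Each iteration plays the role of one thread block taking one
        // fixed entry of the read-only In list.
        for (std::size_t block = 0; block < in.size(); ++block) {
            int task = in[block];
            ++stats.tasks_performed;
            for (int child = 0; task > 0 && child < 2; ++child) {
                // Fetch-and-inc hands every new task a unique Out slot.
                std::size_t index = out_count.fetch_add(1);
                assert(index < out.size());
                out[index] = task - 1;
            }
        }
        // End of pass: the filled part of Out becomes the next In.
        in.assign(out.begin(),
                  out.begin() + static_cast<std::ptrdiff_t>(out_count.load()));
        ++stats.passes;
    }
    return stats;
}
```

Within a pass nothing is ever removed from a shared structure; the only contended operation is the counter increment for the Out list.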
Octree Partitioning
Bandwidth bound

Four-in-a-row
Computation intensive

Graphics Processors
8800GT: 14 multiprocessors, 57 GB/sec bandwidth
9600GT: 8 multiprocessors, 57 GB/sec bandwidth

Blocking Queue – Octree/9600GT
Blocking Queue – Octree/8800GT
Blocking Queue – Four-in-a-row
Non-blocking Queue – Octree/9600GT
Non-blocking Queue – Octree/8800GT
Non-blocking Queue – Four-in-a-row
Task stealing – Octree/9600GT
Task stealing – Octree/8800GT
Task stealing – Four-in-a-row
Static List
Octree Comparison
Previous work
Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003
Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998
Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005

Conclusion
Synchronization plays a significant role in dynamic load balancing.
Lock-free data structures and synchronization scale well and look promising also in general-purpose GPU programming.
Locks perform poorly.
It is good that operations such as CAS and FAA have been introduced in the new GPUs.
Work stealing could outperform static load balancing.

Thank you!
http://www.cs.chalmers.se/~dcs
