IMPROVING GPU FREQUENCY SCALING FOR GPU WORKLOADS
TYPICAL DVFS BASED GPU BOOST MECHANISM
• GPU frequency boosting is wired through the devfreq governor
• The governor monitors GPU busyness and tries to keep the current load
under a given target load by adjusting the GPU frequency, using tunables
such as settling time, bias, damp and rampdown_delay
• Roughly: boost_freq = bias * freq * (load - target) / target
• Ideal for sustained loads and for burstiness within a high-load window
• More aggressive tunings make the governor more reactive
• However, they also lead to constant GPU overpowering
• E.g. a target_load that is too low or a rampdown_delay that is too high
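As a rough illustration of the boost formula above, the following Python sketch evaluates it for two target loads. The function name `boost_term`, the units (MHz, with load and target on the same arbitrary scale) and the example values are assumptions made for illustration; they are not taken from any governor source.

```python
def boost_term(freq_mhz, load, target_load, bias=1.0):
    """Evaluate bias * freq * (load - target) / target from the slide.

    load and target_load are assumed to share one scale; the result
    is in the same unit as freq_mhz. A negative result means the load
    is below target.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    return bias * freq_mhz * (load - target_load) / target_load


# Hypothetical numbers: 300 MHz current frequency, load of 600.
moderate = boost_term(300, load=600, target_load=400)
aggressive = boost_term(300, load=600, target_load=200)
print(moderate, aggressive)  # 150.0 600.0
```

For the same load, halving target_load quadruples the boost term in this example, which is the kind of over-boosting a too-low target_load can cause.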
PROBLEM
• Low-latency VR use cases typically present repetitive and bursty GPU
workloads
• They need guaranteed GPU horsepower exactly when the workload
gets scheduled
• Load decays quickly (but has a high chance of repeating), so the
frequency needs to fall quickly and ramp back up just as quickly
• Typical use cases exhibiting this kind of burstiness are camera
post-processing, edge detection, ATW...
• The current governor's slower response in ramping up the frequency
clearly shows up as low overall perf/watt
JUST IN (SUBMIT) TIME FREQ SCALING
• Density of work submission (per unit time) forms the basis of GPU load
• There is a delay (on the order of ms) between a submit and the
governor’s visibility of the resulting load
• This translates to latency in the effective GPU frequency boost
• A short boost pulse in the submit code path takes care of the ramp-up
latency
• This inherently makes the frequency follow the workload
• The governor is now more likely to see a lower load and pull the
frequency down
• Effective GPU frequency comes down to fmax@vmin for the profiled use
cases (giving better perf/watt)
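The submit-time boost pulse described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the class `SubmitBoost`, its method names, the 5 ms pulse length and the 790/109 MHz values are hypothetical and do not come from the driver; the slides do not say how the pulse is implemented.

```python
class SubmitBoost:
    """Toy model of a short frequency boost pulse triggered on work submit."""

    def __init__(self, boost_freq_mhz, pulse_ms):
        self.boost_freq_mhz = boost_freq_mhz  # assumed boost level
        self.pulse_ms = pulse_ms              # assumed pulse duration
        self._boost_until_ms = None

    def on_submit(self, now_ms):
        # Called from the (hypothetical) submit path: start or extend the pulse.
        self._boost_until_ms = now_ms + self.pulse_ms

    def effective_freq(self, now_ms, governor_freq_mhz):
        # While the pulse is active, never run below the boost level;
        # afterwards the governor's own choice applies again.
        if self._boost_until_ms is not None and now_ms < self._boost_until_ms:
            return max(governor_freq_mhz, self.boost_freq_mhz)
        return governor_freq_mhz


boost = SubmitBoost(boost_freq_mhz=790, pulse_ms=5)
before = boost.effective_freq(now_ms=0, governor_freq_mhz=109)
boost.on_submit(now_ms=10)
during = boost.effective_freq(now_ms=12, governor_freq_mhz=109)
after = boost.effective_freq(now_ms=15, governor_freq_mhz=109)
print(before, during, after)  # 109 790 109
```

In this model the frequency rises immediately at submit time instead of waiting for the governor to observe the load, and it returns to the governor's value as soon as the pulse expires.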
PERF/POWER DATA ACROSS USE CASES
| Use case | GPU intensive section | Section time (ms) | Avg GPU busyness | Avg GPU frequency (MHz) | Avg GPU power (mW) | Avg total power, VDD_IN (mW) | % perf/watt increase |
|---|---|---|---|---|---|---|---|
| Pupil Detection (with JIT scaling) | Edge Detection | 11.004 | 34 | 497 | 471 | 5488 | 99.623182 |
| Pupil Detection (with default scaling) | Edge Detection | 21.158 | 182 | 293 | 421 | 5286 | |
| Passthrough camera (with JIT scaling) | Camera to Display (e2e) | 40.599 | 219 | 596 | 856 | 7591 | 4.5763017 |
| Passthrough camera (with default scaling) | Camera to Display (e2e) | 45.466 | 590 | 283 | 837 | 8129 | |
| Passthrough camera (with max gpu) | Camera to Display (e2e) | 40.025 | 153 | 1331 | 1377 | 8677 | |
PUPIL DETECTION WITH CURRENT FREQ SCALING
| Metric | Avg | Max | Min |
|---|---|---|---|
| GPU intensive code latency (ms) | 21.158 | 843.41 | 7.289 |
| GPU busyness | 182 | 401 | 57 |
| GPU frequency (MHz) | 293 | 595 | 109 |
| GPU power (mW) | 421 | 534 | 152 |
PUPIL DETECTION WITH JIT FREQ SCALING
| Metric | Avg | Max | Min |
|---|---|---|---|
| GPU intensive code latency (ms) | 11.004 | 957.52 | 5890 |
| GPU busyness | 34 | 504 | 10 |
| GPU frequency (MHz) | 497 | 790 | 109 |
| GPU power (mW) | 471 | 610 | 152 |
PASSTHROUGH WITH DEFAULT FREQ SCALING
| Metric | Avg | Max | Min |
|---|---|---|---|
| GPU intensive code latency (ms) | 45.466 | 82.425 | 33.461 |
| GPU busyness | 590 | 946 | 173 |
| GPU frequency (MHz) | 283 | 693 | 109 |
| GPU power (mW) | 837 | 838 | 761 |
PASSTHROUGH WITH JIT FREQ SCALING
| Metric | Avg | Max | Min |
|---|---|---|---|
| GPU intensive code latency (ms) | 40.599 | 63.626 | 31.690 |
| GPU busyness | 219 | 390 | 67 |
| GPU frequency (MHz) | 596 | 790 | 303 |
| GPU power (mW) | 856 | 914 | 762 |
