Making GPU resets less painful on Linux
André Almeida @ Igalia
OSS NA 2023
1
Hi!
Kernel developer
Working on the Steam Deck
2
Oh no, my GPU hung!
You are playing your game on Linux
Something wrong is sent to the device
???
Game over, reboot your machine
3
Modern GPUs are complex
Really complex
AMD Radeon RX 7900 XTX
96 Compute units, 384 texture units, 6 shader
engines, 58 B transistors...
Shaders are Turing Complete
4
Modern GPUs are complex
[Diagram: AMD display core block diagram. Hardware blocks: DCHUB, HUBP (n), DPP (n), MPC, OPP, OPTC, DIO, DCCG, DMU, AZ, MMHUB, HUBBUB, DWB (n). Signal legend: global sync, pixel data, sideband signal, config. bus, SDP, monitor. Code structs: dc_plane, dc_stream, dc_state, dc_link. Notes: floating point calculation, bit-depth reduction/dither.]
5
Modern GPUs are complex
If you have an infinite loop on the CPU, it's not that
bad
CPU programs have virtual memory and a virtual
processor
Things might be more bare-bones on the GPU
But on a GPU, the display won't be able to update
6
Detecting GPU hangs
From the hardware to the application
7
Detecting GPU hangs
Device to Kernel
Submit a job and wait until it is done
Check fences
Or timeouts
The driver does a GPU reset
This can be a "soft" reset, a single hw engine reset, or
a full device reset
More complete resets are more destructive
Now, report to userspace
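The detection and escalation steps on this slide can be modeled as a small simulation. This is a sketch under stated assumptions, not driver code: `Job`, `FakeGpu`, `ResetLevel`, and the 1000 ms timeout are hypothetical names and values that stand in for fences, the driver timeout, and the hardware.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class ResetLevel(IntEnum):
    """Reset kinds, least to most destructive (illustrative names)."""
    SOFT = 1
    ENGINE = 2
    FULL_DEVICE = 3


@dataclass
class Job:
    name: str
    duration_ms: int              # time the job needs before its fence signals
    fence_signaled: bool = False


@dataclass
class FakeGpu:
    """Toy device: a reset succeeds only at or above min_level_to_recover."""
    min_level_to_recover: ResetLevel
    resets_done: list = field(default_factory=list)

    def try_reset(self, level: ResetLevel) -> bool:
        self.resets_done.append(level)
        return level >= self.min_level_to_recover


def wait_for_fence(job: Job, timeout_ms: int) -> bool:
    """Return True if the job's fence signals before the timeout expires."""
    job.fence_signaled = job.duration_ms <= timeout_ms
    return job.fence_signaled


def submit(gpu: FakeGpu, job: Job, timeout_ms: int = 1000):
    """Submit a job; on timeout, escalate resets until the device recovers.

    Returns None when the job completed, otherwise the reset level that
    recovered the device (what would then be reported to userspace).
    """
    if wait_for_fence(job, timeout_ms):
        return None
    for level in ResetLevel:      # SOFT, then ENGINE, then FULL_DEVICE
        if gpu.try_reset(level):
            return level
    raise RuntimeError("device did not recover from a full reset")
```

Trying the least destructive reset first mirrors the point that more complete resets destroy more state; a real driver may pick the reset kind differently.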
8
Reporting GPU hangs
Kernel to Mesa
DRM has no API for that
I915_GET_RESET_STATS
AMDGPU_CTX_OP_QUERY_STATE2
MSM_PARAM_FAULTS
Return -ERROR for ioctls
It's not really hw specific
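Since the information is not really hardware specific, the usermode side can be sketched generically. The example below assumes the kernel exposes a device-wide reset counter and, optionally, a per-context "guilty" flag. `KernelResetInfo` and its fields are hypothetical and do not reproduce the layout of any of the ioctls listed above; the result names mirror the reset status values of GL_ARB_robustness.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResetStatus(Enum):
    NO_ERROR = "GL_NO_ERROR"
    GUILTY = "GL_GUILTY_CONTEXT_RESET_ARB"
    INNOCENT = "GL_INNOCENT_CONTEXT_RESET_ARB"
    UNKNOWN = "GL_UNKNOWN_CONTEXT_RESET_ARB"


@dataclass
class KernelResetInfo:
    """Hypothetical result of a reset query for one context."""
    reset_count: int                  # device-wide resets seen so far
    context_guilty: Optional[bool]    # None when the kernel cannot tell


def classify(info: KernelResetInfo, count_at_creation: int) -> ResetStatus:
    """Map kernel reset information to a per-context reset status."""
    if info.reset_count == count_at_creation:
        return ResetStatus.NO_ERROR   # no reset since the context was created
    if info.context_guilty is None:
        return ResetStatus.UNKNOWN
    return ResetStatus.GUILTY if info.context_guilty else ResetStatus.INNOCENT
```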
9
Reporting GPU hangs
Mesa to application
APIs provide a way to tell apps that a reset
happened:
VK_ERROR_DEVICE_LOST
GL_ARB_robustness
Non-robust GL apps are just killed
Applications then can recreate the context
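The recreate-the-context step can be illustrated with a toy render loop. This is a sketch, not real Vulkan or OpenGL code: `DeviceLost` stands in for VK_ERROR_DEVICE_LOST or a robustness reset notification, and `FakeContext`, `render`, and `max_recreations` are hypothetical.

```python
class DeviceLost(Exception):
    """Stands in for VK_ERROR_DEVICE_LOST or a GL reset notification."""


class FakeContext:
    """Toy context that loses the device once on each frame in pending_hangs."""

    def __init__(self, pending_hangs):
        self.pending_hangs = pending_hangs    # set shared across contexts

    def draw(self, frame):
        if frame in self.pending_hangs:
            self.pending_hangs.discard(frame)
            raise DeviceLost(f"GPU reset while drawing frame {frame}")


def render(frames, create_context, max_recreations=3):
    """Draw `frames` frames, recreating the context after each reset.

    Returns the list of frames drawn and the number of recreations.
    """
    ctx = create_context()
    recreations = 0
    rendered = []
    frame = 0
    while frame < frames:
        try:
            ctx.draw(frame)
        except DeviceLost:
            if recreations == max_recreations:
                raise                 # give up instead of looping forever
            recreations += 1
            ctx = create_context()    # old context and its resources are gone
            continue                  # retry the same frame on the new context
        rendered.append(frame)
        frame += 1
    return rendered, recreations
```

The cap on recreations is a design choice of this sketch: an application that is itself the cause of the hang would otherwise reset the GPU in a loop.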
10
What happens in practice?
DRM <-> Mesa: it's not really hw specific
How about we have a DRM_GET_RESET_STATE?
WIP: DRM documentation explaining what DRM
drivers and usermode drivers (Mesa) should do when
a reset happens, with a DRM IOCTL to query the reset
state
What happens in practice?
Each vendor reacts differently to resets
My focus is on amdgpu
The state was that any kind of reset was
unrecoverable
Just a black screen and an unresponsive system. Access via
ssh/tty sometimes worked
Pierre-Eric (AMD) and I fixed this for the KDE compositor
radeonsi wasn't following the spec
More testing is needed for robustness
12
What happens in practice?
Other OSs have more control over the stack, so they can
be more reliable
In particular on the compositor side, so it's easier to
get a standard behavior
13
Good reporting of GPU hangs
Apart from reporting to userspace that the GPU was
reset, it would be nice to tell what happened
Currently Mesa developers have a hard time figuring
out what in the game caused the hang
14
Good reporting of GPU hangs
GPU hangs have two main sources:
Hardware settings (voltage, frequency)
Application errors (infinite loops)
There's no way to distinguish this right now
15
Good reporting of GPU hangs
Ideally without overhead, so it can be enabled by default
WIP: AMDGPU_INFO_GUILTY_APP to capture data
about the hung app (e.g. buffer in use)
These callbacks need to be platform specific
They read some registers
16
Good reporting of GPU hangs
Challenge: when the GPU hangs the hardware state
can be a bit unreliable.
How to get the right info correctly?
Using the GPU in "debug" mode or inserting fences,
barriers and extra information causes overhead
No easy way to deploy to all users
17
Roadmap for better GPU resets
Standardization of how DRM reports GPU hangs to
userspace
of how the usermode driver deals with a hang and with
the guilty application
of what compositors should do after a hang
Better hang log
Show which buffer caused the hang
Dump hardware state reliably
devcoredump
18
Links
https://lore.kernel.org/lkml/20230501185747.33519-1-andrealmeid@igalia.com/
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
https://lore.kernel.org/lkml/20230227204000.56787-1-andrealmeid@igalia.com/
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22290
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22253
19
Thanks!
andrealmeid@igalia.com
igalia.com/jobs
20