HSA Kernel Code 
(KFD v0.6) 
Advisor: Prof. 徐慰中
Student: 黃昱儒 
2014/7/25
Agenda 
● Introduction to HSA 
o hUMA 
o User Level Queueing 
● HSA Driver 
o Concepts 
▪ Flow Overview 
▪ User & Hardware Queues 
o Source Code Detail 
● IOMMU 
o Concepts 
▪ GCR3 
▪ PPR 
o Source Code Detail
hUMA
User Level Queuing - Before HSA
User Level Queuing
[Figure: several user-level queues feed one HSA device. Application 1 owns Queue 1 and Queue 2; Application 3 owns Queue 1.]
Each queue consists of:
1. AQL packets
2. A ring
3. A doorbell
● The application kicks the doorbell
● The HSA device accesses the application’s ring
● The IOMMU performs the address translation (VA->PA)
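The queue anatomy above (AQL packet, ring, doorbell) can be sketched as a small C model. This is only an illustration under stated assumptions: the names `user_queue`, `enqueue_packet` and `device_drain`, the ring size, and the convention that the doorbell holds the write index are all hypothetical and are not taken from the KFD sources. The doorbell is plain memory here, whereas on real hardware it is an MMIO location.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8u            /* hypothetical ring depth (power of two) */
#define AQL_PACKET_BYTES 64u    /* AQL packets are 64 bytes long */

/* Hypothetical stand-in for one AQL packet. */
typedef struct { uint8_t bytes[AQL_PACKET_BYTES]; } aql_packet;

/* Hypothetical user-level queue: ring, indices and doorbell. */
typedef struct {
    aql_packet ring[RING_SIZE];
    uint64_t write_index;   /* advanced by the application */
    uint64_t read_index;    /* advanced by the (simulated) device */
    uint64_t doorbell;      /* MMIO on real hardware, plain memory here */
} user_queue;

/* Application side: copy a packet into the ring, then kick the doorbell.
 * Returns 0 on success, -1 if the ring is full. */
int enqueue_packet(user_queue *q, const aql_packet *pkt)
{
    assert(q != NULL && pkt != NULL);
    if (q->write_index - q->read_index == RING_SIZE)
        return -1;
    memcpy(&q->ring[q->write_index % RING_SIZE], pkt, sizeof *pkt);
    q->write_index++;
    q->doorbell = q->write_index;   /* the "kick" */
    return 0;
}

/* Device side (simulated): consume every packet published through the
 * doorbell and return how many were consumed. */
unsigned device_drain(user_queue *q)
{
    unsigned consumed = 0;
    assert(q != NULL);
    while (q->read_index < q->doorbell) {
        q->read_index++;
        consumed++;
    }
    return consumed;
}
```

Note that no system call appears on the submission path: once the ring and doorbell are mapped, the application and the device interact directly.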
HSA Software Stack
● Application
● Runtime Library
o open(“/dev/kfd”)
o ioctl(KFD_IOC_SET_MEMORY_POLICY)
o ioctl(KFD_IOC_CREATE_QUEUE)
o ioctl(KFD_IOC_DESTROY_QUEUE)
● HSA-aware Kernel
o KFD
o IOMMU Driver
● Hardware
o HSA Device
o IOMMU
Concepts - HSA Run Flow
● Initialization
o Application: create user queues
o KFD Driver: create HW queue with user queue information
● Computation (user - HW interaction)
o Application: enqueue AQL packets, kick doorbell, and wait for the signal
o KFD Driver: nothing
● Finish
o Application: finish and destroy queues
o KFD Driver: release HW queue
Scheduling Policy
1. Hardware scheduling with oversubscription allowed (more queues than HW slots)
2. Hardware scheduling without oversubscription, so create_queue requests fail when we run out of HW slots
3. No hardware scheduling, so the driver manually assigns queues to HW slots by programming registers
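The three policies differ mainly in what happens when a queue is requested and no hardware slot is free. A minimal sketch of that admission decision follows. The enum names echo the policy names used later in the deck, but the numeric values, the function `admit_queue`, and the use of -EBUSY for the non-oversubscribed hardware-scheduling case are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Policy names follow the deck; the numeric values are assumptions. */
enum sched_policy {
    SCHED_POLICY_HWS = 0,                 /* HW scheduling, oversubscription allowed */
    SCHED_POLICY_HWS_NO_OVERSUBSCRIPTION, /* HW scheduling, no oversubscription */
    SCHED_POLICY_NO_HWS                   /* driver assigns HW slots itself */
};

/* Hypothetical helper: returns 0 if one more queue may be created,
 * -EBUSY if the policy forbids it because all HW slots are in use. */
int admit_queue(enum sched_policy policy, unsigned active_queues,
                unsigned hw_slots)
{
    assert(hw_slots > 0);
    if (policy == SCHED_POLICY_HWS)
        return 0;   /* the HW scheduler multiplexes queues onto slots */
    return active_queues < hw_slots ? 0 : -EBUSY;
}
```

With 24 hardware slots, `admit_queue(SCHED_POLICY_HWS, 100, 24)` succeeds, while the same request under either of the other two policies is rejected.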
Software Scheduler
[Figure: user queues of two processes (pasid=0 and pasid=1, each with queue_id=0 and queue_id=1). Every queue has a ring_base_address and a doorbell. The driver takes a slot from the free hardware queue_id bitmap, selects it as (pipe, queue) through the queue acquire register, and programs it through the HSA GPU’s configuration registers (MMIO address, physical address).]
Hardware Scheduler
[Figure: the kernel_queue (ring_base_address, doorbell) is selected as (pipe=4, queue=0) through the queue acquire register and programmed through the HSA GPU’s configuration registers (MMIO address, physical address).]
Hardware Scheduler - No Oversubscription
[Figure: run list for 3 processes, built from PM4 Type-3 packets]
● IT_RUN_LIST: run_list
● IT_MAP_PROCESS: page_table_base, pasid, sh_mem_config
● IT_MAP_QUEUES: mqd_addr (Memory Queue Descriptor)
Hardware Scheduler - Oversubscription
[Figure: run list built from PM4 Type-3 packets, ending with a further IT_RUN_LIST packet]
● IT_RUN_LIST: run_list
● IT_MAP_PROCESS: page_table_base, pasid, sh_mem_config
● IT_MAP_QUEUES: mqd_addr (Memory Queue Descriptor)
● IT_RUN_LIST: run_list
[Figure: driver data structures grouped by scope: per application, per device, per HW queue, and structures used only for HW scheduling.]
IOCTL Commands Provided by KFD
● KFD_IOC_CREATE_QUEUE
o Create a hardware queue from the application’s information (e.g., ring base address)
● KFD_IOC_DESTROY_QUEUE
o Release a hardware queue
● KFD_IOC_UPDATE_QUEUE
● KFD_IOC_SET_MEMORY_POLICY
o Set the cache coherency policy
● KFD_IOC_GET_CLOCK_COUNTERS
o Get the GPU clock counter
● KFD_IOC_GET_PROCESS_APERTURES
o Get the GPU’s aperture information
● KFD_IOC_PMC_ACQUIRE_ACCESS
● KFD_IOC_PMC_RELEASE_ACCESS
o Exclusive access to the performance counters
HSA Driver Flow
● System initialization
○ module_init
○ device_init (called by radeon)
● Application opens the “/dev/kfd” device
● Application sends ioctls
○ KFD_IOC_SET_MEMORY_POLICY
○ KFD_IOC_CREATE_QUEUE
● Application sends ioctl
○ KFD_IOC_DESTROY_QUEUE
● Application terminates
module_init(kfd_module_init) 
● radeon_kfd_pasid_init 
o Initialize PASID bitmap 
● radeon_kfd_chardev_init 
o register_chrdev: /dev/kfd 
o kfd_ops 
▪ Defines the open and ioctl member functions
kgd2kfd_device_init 
● radeon_kfd_doorbell_init(kfd); 
● radeon_kfd_interrupt_init(kfd); 
● amd_iommu_set_invalidate_ctx_cb(kfd->pdev, 
iommu_pasid_shutdown_callback); 
● device_queue_manager_init(kfd); 
o dqm->initialize 
● dqm->start(kfd->dqm);
dqm->initialize for KFD_SCHED_POLICY_NO_HWS*
● Prepare pipe, queue bitmap
kfd_open 
● radeon_kfd_create_process(current) 
o Create kfd_process 
o Assign PASID
KFD_IOC_SET_MEMORY_POLICY
● Two policies
o cache_policy_coherent
o cache_policy_noncoherent
● Okra
o default policy = cache_policy_coherent
o alternate policy = cache_policy_noncoherent
radeon_kfd_bind_process_to_device
● Called when the user application sends an ioctl command
● amd_iommu_bind_pasid()
o Registers this kfd_process with the IOMMU
KFD_IOC_CREATE_QUEUE
● Creates a queue from information supplied by userspace
● pqm_create_queue
● Returns queue_id and doorbell_address to userspace
o queue_id is per kfd_process
o doorbell_address maps to a device MMIO address
pqm_create_queue
● find_available_queue_slot
o Assigns a qid (per kfd_process)
● dqm->register_process
o Registers the process with the dqm (device queue manager)
● create_cp_queue
o Creates the queue from the queue_properties supplied by the application
o Maps the doorbell MMIO address into the application
● dqm->create_queue
● dqm->execute_queue
dqm->create_queue for KFD_SCHED_POLICY_NO_HWS
● init_mqd (memory queue descriptor)
o Stores the queue configuration from the application
● Finds an unused (pipe, queue) in the dqm (device queue manager)
o If none is free, returns -EBUSY
o Maximum = 56
dqm->execute_queue for KFD_SCHED_POLICY_NO_HWS
● Writes the queue configuration to the device
● load_mqd
o ring_base_addr
o doorbell_offset
o queue_priority
o ...
[Figure: user queues of two processes (pasid=0 and pasid=1, each with queue_id=0 and queue_id=1). Every queue has a ring_base_address and a doorbell. Each process can have up to 1024 queues. The driver takes a slot from the free hardware queue_id bitmap, selects it as (pipe, queue) through the queue select register, and programs it through the HSA GPU’s configuration registers (MMIO address, physical address).]
kgd2kfd_device_init 
● radeon_kfd_doorbell_init(kfd); 
● radeon_kfd_interrupt_init(kfd); 
● device_iommu_pasid_init(kfd); 
● kfd_topology_add_device(kfd); 
● amd_iommu_set_invalidate_ctx_cb(kfd->pdev, 
iommu_pasid_shutdown_callback); 
● device_queue_manager_init(kfd); 
o dqm->initialize 
● dqm->start(kfd->dqm);
dqm->start for KFD_SCHED_POLICY_HWS*
● pm_init (packet manager)
● kernel_queue_init
o kernel_queue doorbell
o kernel_queue ring address
o load_mqd writes the kernel_queue configuration to the device
dqm->create_queue for KFD_SCHED_POLICY_HWS*
● init_mqd (memory queue descriptor)
o Stores the queue configuration from the application
dqm->execute_queue for KFD_SCHED_POLICY_HWS*
● dqm->destroy_queues
● pm_send_runlist
o pm_create_runlist_ib
▪ Constructs the PM4 packets of type MAP_PROCESS and MAP_QUEUES
● The packets contain the application’s ring address
o pm->kernel_queue->acquire_packet_buffer
▪ Gets an unused entry of the kernel_queue
o pm_create_runlist
▪ Constructs a PM4 packet of type RUN_LIST
o pm->kernel_queue->submit_packet
▪ Kicks the kernel queue’s doorbell
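Every packet in the run list starts with a PM4 Type-3 header. The sketch below encodes and decodes such a header using the layout of the radeon driver's PACKET3 macro: type in bits 31:30, count (body DWORDs minus one) in bits 29:16, and opcode in bits 15:8. The helper names are hypothetical, and the opcode value used in the usage note is only an illustrative number, not a confirmed IT_MAP_PROCESS encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Build a Type-3 header for a packet whose information body holds
 * body_dwords DWORDs. The count field stores body_dwords - 1. */
uint32_t pm4_type3_header(uint8_t opcode, uint32_t body_dwords)
{
    assert(body_dwords >= 1 && body_dwords <= 0x4000u);
    return (3u << 30)
         | (((body_dwords - 1u) & 0x3FFFu) << 16)
         | ((uint32_t)opcode << 8);
}

/* Decoders for the three header fields. */
unsigned pm4_packet_type(uint32_t header) { return header >> 30; }
unsigned pm4_opcode(uint32_t header)      { return (header >> 8) & 0xFFu; }
unsigned pm4_body_dwords(uint32_t header) { return ((header >> 16) & 0x3FFFu) + 1u; }
```

For example, `pm4_type3_header(0xA1, 4)` yields 0xC003A100: type 3, a count field of 3, and opcode 0xA1.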
Software Scheduling
● dqm->initialize
o Prepare (pipe, queue) bitmap
● dqm->start
● kfd_open
o Create kfd_process
o Assign PASID
● ioctl(CREATE_QUEUE)
o Get queue_id
o Map doorbell to application
● dqm->create_queue
o init_mqd
o Find unused (pipe, queue) to assign HW queue_id
● dqm->execute_queue
o Write queue configuration to device
Hardware Scheduling
● dqm->initialize
● dqm->start
o pm_init
o kernel_queue_init
● kfd_open
o Create kfd_process
o Assign PASID
● dqm->create_queue
o init_mqd
● dqm->execute_queue
o Create pm4 packet
o Kick kernel_queue’s doorbell
● ioctl(CREATE_QUEUE)
o Get queue_id
o Map doorbell to application
Application Computation ...
● The HW has ring_base_addr (a userspace address)
o The application enqueues AQL packets and waits for the signal
● The application has the HW doorbell MMIO address
o Used to kick the hardware
● The driver does nothing
● Until the application sends ioctl(KFD_IOC_DESTROY_QUEUE) or the application finishes
Hardware Queue Deactivation
1. Application sends ioctl(KFD_IOC_DESTROY_QUEUE)
2. Task exit notifier
Hardware Queue Deactivation (1)
● ioctl(KFD_IOC_DESTROY_QUEUE)
● pqm_destroy_queue
o dqm->destroy_queue
o Restore queue, pipe bitmap
o dqm->execute_queues(dqm);
dqm->destroy_queue for KFD_SCHED_POLICY_NO_HWS
● destroy_mqd
o acquire_queue(kgd, pipe_id, queue_id);
o write_register(kgd, CP_HQD_DEQUEUE_REQUEST, DEQUEUE_REQUEST_DRAIN);
dqm->destroy_queue for KFD_SCHED_POLICY_HWS*
● dqm->destroy_queues
o pm_send_unmap_queue
▪ Sends a PM4 packet of type UNMAP_QUEUES
o pm_send_query_status(KFD_FENCE_COMPLETED)
Hardware Queue Deactivation (2)
● The task exit notifier calls iommu_pasid_shutdown_callback
o Registered in kgd2kfd_device_init -> amd_iommu_set_invalidate_ctx_cb
o Called from the mmu_notifier’s release function (the mmu_notifier is registered in radeon_kfd_bind_process_to_device -> amd_iommu_bind_pasid)
iommu_pasid_shutdown_callback 
● pqm_destroy_queue 
o dqm->destroy_queue 
o Restore queue, pipe bitmap 
o dqm->execute_queues(dqm);
Introduction to IOMMU
● The user application writes AQL packets into the ring, whose address is a virtual address
● Device accesses therefore need VA to PA translation
[Figure: doorbell and ring address]
[Figure: the HSA GPU’s device table entry leads to the GCR3 table, indexed by PASID (here PASID=2). That entry is assigned kfd_process->mm->pgd, so translation ends at a physical address.]
PRI & PPR
● The operating system is usually required to pin memory pages used for I/O.
● The IOMMU provides a mechanism that lets peripherals use unpinned pages for I/O.
● Only supported by AMD IOMMU_v2
PRI & PPR
● PRI (page request interface)
o The peripheral requests memory management service from the host OS (e.g., page fault service for the peripheral)
o Issued by the peripheral
● PPR (peripheral page service request)
o When the IOMMU receives a valid PRI request, it creates a PPR message in the request log to request changes to the virtual address space
o Issued by the IOMMU as an interrupt
● Used to request I/O page table changes
o The IOMMU driver can register a PPR notifier
module_init(amd_iommu_v2_init) 
● amd_iommu_register_ppr_notifier(&ppr_nb); 
o PPR callback 
▪ ppr_notifier function
Set IOMMU With PASID
● amd_iommu_bind_pasid
● Called when a kfd_process is created
o mmu_notifier_register(&pasid_state->mn, pasid_state->mm);
o amd_iommu_domain_set_gcr3(dev_state->domain, pasid, __pa(pasid_state->mm->pgd));
PRI & PPR Flow
1. The peripheral issues a PRI to the IOMMU
2. The IOMMU writes a PPR request to the PPR log (the log entry contains fault address, pasid, device_id, tag, flags)
3. The IOMMU sends an interrupt to the CPU
PPR Flow
1. When the irq comes: status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
2. Check: if (status & MMIO_STATUS_PPR_INT_MASK)
3. ppr_notifier is called (registered in amd_iommu_v2_init)
4. do_fault
do_fault
● Calls the get_user_pages() API to bring the faulting pages into memory
o Arguments include the mm_struct and fault_addr
Flow Review
● Application
● Runtime Library
o open(“/dev/kfd”)
o ioctl(KFD_IOC_SET_MEMORY_POLICY)
o ioctl(KFD_IOC_CREATE_QUEUE)
o ioctl(KFD_IOC_DESTROY_QUEUE)
● HSA-aware Kernel
o KFD
o IOMMU Driver
● Hardware
o HSA Device
o IOMMU
Q&A 
Thanks!
Reference
● https://github.com/HSAFoundation/HSA-Drivers-Linux-AMD
● http://www.hsafoundation.com/standards/



Editor's Notes

  • #11: User queue with information
  • #12: sched_policy is a module_param, so it can be changed at insmod time
  • #13: Done with the driver’s help. The ring address is a VA. SW scheduling: 7*8 = 56
  • #14: Oversubscription condition: (dqm->processes_count >= VMID_PER_DEVICE /* 8 */) || (dqm->queue_count >= PIPE_PER_ME_CP_SCHEDULING * QUEUES_PER_PIPE /* 24 */). SW scheduling: 7*8 = 56
  • #15: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/10/R6xx_R7xx_3D.pdf HSA-compliant HW needs to understand the PM4 packet format of radeon: http://www.spinics.net/linux/lists/kernel/msg1784187.html Type-0 packet: write N DWORDs in the information body to the N consecutive registers, or to the register, pointed to by the BASE_INDEX field of the packet header. Type-3 packet: carry out the operation indicated by field IT_OPCODE.
  • #16: Same as #15. Radeon R7 for Kaveri.
  • #17: per_device_data radeon_dev
  • #18: KFD is the HSA driver!
  • #19: Start code
  • #21: kfd_topology_add_device: dev->gpu_id
  • #24: Wait for spec
  • #25: per_device_data
  • #28: Wrap all mmio access to radeon
  • #30: Driver’s help
  • #31: kfd_topology_add_device: dev->gpu_id
  • #32: packet_manager’s most important member: kernel_queue
  • #36: Same as #15.
  • #37: Same as #15.
  • #43: The query is also a packet
  • #47: SMMU functionality
  • #49: Previously this made no difference, because the IOMMU only dealt with device addresses. Now the data in an AQL packet is a VA