Hardware/Software Co-Design for Efficient
Microkernel Execution
Martin Děcký
martin.decky@huawei.com
February 2019
Martin Děcký, FOSDEM, February 3rd
2019 Hardware/Software Co-Design for Efficient Microkernel Execution 2
Who Am I
Passionate programmer and operating systems enthusiast
With a specific inclination towards multiserver microkernels
HelenOS developer since 2004
Research Scientist from 2006 to 2018
Charles University (Prague), Distributed Systems Research Group
Senior Research Engineer since 2017
Huawei Technologies (Munich), German Research Center, Central
Software Institute, OS Kernel Lab
Microkernel Multiserver Systems are better than Monolithic Systems
Monolithic OS Design is Flawed
Biggs S., Lee D., Heiser G.: The Jury Is In: Monolithic OS Design Is Flawed: Microkernel-based Designs Improve Security, ACM 9th Asia-Pacific Workshop on Systems (APSys), 2018
“While intuitive, the benefits of the small TCB have not been quantified to
date. We address this by a study of critical Linux CVEs, where we examine
whether they would be prevented or mitigated by a microkernel-based
design. We find that almost all exploits are at least mitigated to less than
critical severity, and 40 % completely eliminated by an OS design based
on a verified microkernel, such as seL4.”
Problem Statement
Microkernel design ideas go as far back as 1969
RC 4000 Multiprogramming System nucleus (Per Brinch Hansen)
Isolation of unprivileged processes, inter-process communication,
hierarchical control
Even after 50 years they are not fully accepted as mainstream
Hardware and software used to be designed independently
Designing CPUs used to be an extremely complicated and costly process
Operating systems used to be written after the CPUs were designed
Hardware designs used to be rather conservative
Problem Statement (2)
Mainstream ISAs used to be designed in a rather conservative way
Can you name some really revolutionary ISA features since IBM
System/370 Advanced Function?
Requirements on the new ISAs usually follow the needs of the
mainstream operating systems running on the past ISAs
No wonder microkernels suffer performance penalties compared to
monolithic systems
The more fine-grained the architecture, the more penalties it suffers
Let us design the hardware with microkernels in mind!
The Vicious Cycle
CPUs do not support microkernels properly
Microkernels suffer performance penalties
Microkernels are not in the mainstream
No requirements on CPUs from microkernels
(which leads back to CPUs not supporting microkernels properly)
Any Ideas?
Communication between Address Spaces
Control and data flow between subsystems
Monolithic kernel
Function calls
Passing arguments in registers and on the stack
Passing direct pointers to memory structures
Multiserver microkernel
IPC via microkernel syscalls
Passing arguments in a subset of registers
Privilege level switch, address space switch
Scheduling (in case of asynchronous IPC)
Data copying or memory sharing with page granularity
Communication between Address Spaces (2)
Is the kernel round-trip of the IPC necessary?
Suggestion for synchronous IPC: Extended Jump/Call and Return instructions
that also switch the address space
Communicating parties identified by a “call gate” (capability) containing the target
address space and the PC of the IPC handler (implicit for return)
Call gates stored in a TLB-like hardware cache (CLB)
CLB populated by the microkernel, similarly to a TLB-only memory management architecture
Suggestion for asynchronous IPC: Using CPU cache lines as the buffers for the
messages
Async Jump/Call, Async Return and Async Receive instructions
Using the CPU cache like an extended register stack engine
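The synchronous suggestion above can be pictured with a small software model. This is only a sketch under stated assumptions: the names `CallGate`, `CLB`, `Cpu`, `xcall` and `xret`, the LRU replacement policy and the refill callback are hypothetical illustrations and do not come from the talk or from any real ISA.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CallGate:
    """Hypothetical call gate: target address space and IPC handler PC."""
    target_asid: int
    handler_pc: int


class CLB:
    """Toy model of a TLB-like cache of call gates (LRU policy is an assumption)."""

    def __init__(self, capacity, refill):
        self.capacity = capacity
        self.refill = refill          # stands in for the microkernel miss handler
        self.entries = OrderedDict()  # gate id -> CallGate
        self.misses = 0

    def lookup(self, gate_id):
        if gate_id in self.entries:
            self.entries.move_to_end(gate_id)
            return self.entries[gate_id]
        # Miss: the microkernel is involved only here, as with a software-filled TLB.
        self.misses += 1
        gate = self.refill(gate_id)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used gate
        self.entries[gate_id] = gate
        return gate


@dataclass
class Cpu:
    clb: CLB
    asid: int
    pc: int
    link_stack: list = field(default_factory=list)

    def xcall(self, gate_id, return_pc):
        """Extended call: switch the address space and jump to the handler."""
        gate = self.clb.lookup(gate_id)
        self.link_stack.append((self.asid, return_pc))
        self.asid, self.pc = gate.target_asid, gate.handler_pc

    def xret(self):
        """Extended return: the target is implicit (the saved caller)."""
        self.asid, self.pc = self.link_stack.pop()


# The microkernel's gate table, consulted only on a CLB miss.
gate_table = {7: CallGate(target_asid=2, handler_pc=0x4000)}
cpu = Cpu(clb=CLB(capacity=4, refill=gate_table.__getitem__), asid=1, pc=0x1000)

cpu.xcall(7, return_pc=0x1004)
state_in_server = (cpu.asid, cpu.pc)
cpu.xret()
state_after_return = (cpu.asid, cpu.pc)
cpu.xcall(7, return_pc=0x1008)   # the second call hits in the CLB
cpu.xret()
```

In this model the kernel round-trip happens once, on the first use of a gate; later calls through the same gate only touch the CLB.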
Communication between Address Spaces (3)
Bulk data
Observation: Memory sharing is actually quite efficient for large amounts
of data (multiple pages)
Overhead is caused primarily by creating and tearing down the shared
pages
Data needs to be page-aligned
Sub-page granularity and dynamic data structures
Suggestion: Using CPU cache lines as shared buffers
Much finer granularity than pages (typically 64 to 128 bytes)
A separate virtual-to-cache mapping mechanism before the standard
virtual-to-physical mapping
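The granularity argument can be made concrete with a little arithmetic. The sketch below assumes a 4096-byte page (an assumption, the slides do not state a page size), a 64-byte cache line (the lower end of the range given above) and a payload that starts on a granule boundary; `shared_footprint` is a hypothetical helper name.

```python
def shared_footprint(payload_bytes, granule_bytes):
    """Bytes that must be shared to expose payload_bytes in whole granules."""
    granules = -(-payload_bytes // granule_bytes)  # ceiling division
    return granules * granule_bytes


PAGE = 4096        # assumed page size in bytes
CACHE_LINE = 64    # lower end of the 64 to 128 byte range

payload = 200      # a small message, in bytes
page_cost = shared_footprint(payload, PAGE)
line_cost = shared_footprint(payload, CACHE_LINE)

print("page-granular sharing:", page_cost, "bytes,", page_cost - payload, "unused")
print("line-granular sharing:", line_cost, "bytes,", line_cost - payload, "unused")
```

For small or irregular data the page-granular footprint is dominated by padding, which is the case the cache-line suggestion targets.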
Fast Context Switching
Current microsecond-scale latency hiding mechanisms
Hardware multi-threading
Effective
Does not scale beyond a few threads
Operating system context switching
Scales for any thread count
Too slow (order of 10 µs)
Goal: Finding a sweet spot between the two mechanisms
Fast Context Switching (2)
Suggestion: Hardware cache for contexts
Again, similar mechanism to TLB-only memory management
Dedicated instructions for context store, context restore, context switch, context
save, context load
Context data could be potentially ABI-optimized
Autonomous mechanism for event-triggered context switch (e.g. external
interrupt)
Efficient hardware mechanism for latency hiding
The equivalent of fine/coarse-grained simultaneous multithreading
The software scheduler is in charge of setting the scheduler policy
The CPU is in charge of scheduling the contexts based on ALU, cache and other resource
availability
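One way to picture the hardware context cache is as a small, software-refilled store of register sets. The sketch below is a toy model only: the class and method names (`ContextCache`, `ctx_load`, `ctx_switch`), the LRU eviction and the miss exception are assumptions made for illustration, not the instruction semantics proposed in the talk.

```python
from collections import OrderedDict


class ContextMiss(Exception):
    """Raised when the requested context is not in the hardware cache."""


class ContextCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # context id -> register dict
        self.current = None

    def ctx_load(self, ctx_id, regs):
        """Software fills a slot; returns the evicted (id, regs) pair, if any."""
        evicted = None
        if ctx_id not in self.slots and len(self.slots) >= self.capacity:
            evicted = self.slots.popitem(last=False)  # least recently used
        self.slots[ctx_id] = dict(regs)
        self.slots.move_to_end(ctx_id)
        return evicted

    def ctx_switch(self, ctx_id):
        """Switch without software involvement when the context is cached."""
        if ctx_id not in self.slots:
            raise ContextMiss(ctx_id)
        self.slots.move_to_end(ctx_id)
        self.current = ctx_id
        return self.slots[ctx_id]


backing_store = {1: {"pc": 0x100}, 2: {"pc": 0x200}, 3: {"pc": 0x300}}
cache = ContextCache(capacity=2)
cache.ctx_load(1, backing_store[1])
cache.ctx_load(2, backing_store[2])
cache.ctx_switch(1)                      # hit, context 1 becomes most recent

try:
    cache.ctx_switch(3)                  # miss, software must refill
    missed = False
except ContextMiss:
    missed = True
    evicted = cache.ctx_load(3, backing_store[3])
    if evicted is not None:
        backing_store[evicted[0]] = evicted[1]   # write back the victim
    regs = cache.ctx_switch(3)
```

As with a software-managed TLB, the slow path (miss, write-back, refill) stays in software while the common case is a plain hardware switch.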
User Space Interrupt Processing
Extension of the fast context switching mechanism
Efficient delivery of interrupt events to user space device drivers
Without the routine microkernel intervention
An interrupt could be directly handled by a preconfigured hardware context in
user space
A clear path towards moving even the timer interrupt handler and the scheduler from
kernel space to user space
Going back to interrupt-driven handling of peripherals with extremely low latency requirements (instead of polling)
The usual pain point: Level-triggered interrupts
Some coordination with the platform interrupt controller is probably needed
to automatically mask the interrupt source
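The level-triggered pain point can be illustrated with a tiny simulation. It is a sketch under assumptions: `LevelIrqLine` and `simulate` are hypothetical names, time is counted in abstract cycles, and the user space driver is assumed to service the device only at a later cycle, which is exactly when an unmasked level-triggered source would keep re-firing.

```python
class LevelIrqLine:
    """Toy level-triggered interrupt source behind a platform controller."""

    def __init__(self):
        self.asserted = False   # driven by the device
        self.masked = False     # controlled by the interrupt controller

    def pending(self):
        return self.asserted and not self.masked


def simulate(cycles, service_at, auto_mask):
    """Count deliveries when the driver services the device at cycle service_at."""
    line = LevelIrqLine()
    line.asserted = True
    deliveries = 0
    for cycle in range(cycles):
        if line.pending():
            deliveries += 1
            if auto_mask:
                line.masked = True   # hardware masks the source on delivery
        if cycle == service_at:
            line.asserted = False    # the user space driver quiesces the device
            line.masked = False      # and re-enables the source
    return deliveries


storm = simulate(cycles=10, service_at=4, auto_mask=False)
masked = simulate(cycles=10, service_at=4, auto_mask=True)
print("deliveries without automatic masking:", storm)
print("deliveries with automatic masking:", masked)
```

Without automatic masking the same event is delivered on every cycle until the driver runs; with it, the user space context is entered once.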
Capabilities as First-Class Entities
Capabilities as unforgeable object identifiers
But eventually each access to an object needs to be bounds-checked and
translated into the (flat) virtual address space
Suggestion: Embedding the capability reference in pointers
RV128 (128-bit variant of RISC-V) would provide 64 bits for the capability
reference and 64 bits for object offset
128-bit flat pointers are probably useless anyway
Besides the (somewhat narrow) use in the microkernel, this could be useful
for other purposes
Simplifying the implementation of managed languages’ VMs
Working with multiple virtual address spaces at once
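The 64 + 64 bit split can be sketched in a few lines. Everything concrete here is an assumption made for illustration: placing the capability reference in the upper half, the `cap_table` layout of (base, length) pairs, and the helper names `make_pointer`, `split_pointer` and `translate`. Real hardware would perform the check and translation in the memory pipeline rather than in software.

```python
MASK64 = (1 << 64) - 1


def make_pointer(cap_ref, offset):
    """Pack a 64-bit capability reference and a 64-bit offset into 128 bits."""
    if not (0 <= cap_ref <= MASK64 and 0 <= offset <= MASK64):
        raise ValueError("field does not fit in 64 bits")
    return (cap_ref << 64) | offset


def split_pointer(pointer):
    """Return the (capability reference, object offset) pair."""
    return pointer >> 64, pointer & MASK64


def translate(pointer, cap_table):
    """Bounds-check the offset and translate to a flat virtual address."""
    cap_ref, offset = split_pointer(pointer)
    base, length = cap_table[cap_ref]   # KeyError models an invalid capability
    if offset >= length:
        raise IndexError("offset outside the object")
    return base + offset


# Hypothetical capability table: reference -> (base virtual address, length).
cap_table = {0x2A: (0x7000_0000, 4096)}

ptr = make_pointer(0x2A, 0x10)
address = translate(ptr, cap_table)

try:
    translate(make_pointer(0x2A, 4096), cap_table)
    out_of_bounds_rejected = False
except IndexError:
    out_of_bounds_rejected = True
```

The same pointer layout would let a single instruction stream name objects in several address spaces, since the capability reference selects the object before any flat address is formed.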
Prior Art
Nordström S., Lindh L., Johansson L., Skoglund T.: Application Specific
Real-Time Microkernel in Hardware, 14th IEEE-NPSS Real Time
Conference, 2005
Offloading basic microkernel operations (e.g. thread creation, context
switching) to hardware shown to improve performance by 15 % on
average and up to 73 %
This was a coarse-grained approach
Hardware message passing in Intel SCC and Tilera TILE-G64/TILE-Pro64
Asynchronous message passing with tight software integration
Prior Art (2)
El Hajj I., Merritt A., Zellweger G., Milojicic D., Achermann R., Faraboschi
P., Hwu W., Roscoe T., Schwan K.: SpaceJMP: Programming with Multiple
Virtual Address Spaces, 21st ACM ASPLOS, 2016
Practical programming model for using multiple virtual address spaces on
commodity hardware (evaluated on DragonFly BSD and Barrelfish)
Useful for data-centric applications for sharing large amounts of memory between
processes
Intel IA-32 Task State Segment (TSS)
Hardware-based context switching
Historically, it was used by Linux
The primary reason for its removal was not performance, but portability
Prior Art (3)
Intel VT-x VM Functions (VMFUNC)
Efficient cross-VM function calls
Switching the EPT and passing register arguments
Current implementation limited to 512 entry points
Practically usable even for very fine-grained virtualization with the
granularity of individual functions
Liu Y., Zhou T., Chen K., Chen H., Xia Y.: Thwarting Memory Disclosure with
Efficient Hypervisor-enforced Intra-domain Isolation, 22nd ACM SIGSAC
Conference on Computer and Communications Security, 2015
– “The cost of a VMFUNC is similar with a syscall”
– “… hypervisor-level protection at the cost of system calls”
SkyBridge paper to appear at EuroSys 2019
Prior Art (4)
Woodruff J., Watson R. N. M., Chisnall D., Moore S., Anderson J., Davis B., Laurie
B., Neumann P. G., Norton R., Roe M.: The CHERI capability model: Revisiting RISC
in an age of risk, 41st ACM Annual International Symposium on Computer
Architecture, 2014
Hardware-based capability model for byte-granularity memory protection
Extension of the 64-bit MIPS ISA
Evaluated on an extended MIPS R4000 FPGA soft-core
32 capability registers (256 bits)
Limitation: Inflexible design mostly due to the tight backward compatibility with a 64-bit
ISA
Intel MPX
Several design and implementation issues, deemed not production-ready
Summary
Traditionally, hardware has not been designed to accommodate the
requirements of microkernel multiserver operating systems
Microkernels thus suffer performance penalties
This prevented them from replacing monolithic operating systems and closed
the vicious cycle
Hardware design is hopefully becoming more accessible and democratic
E.g. RISC-V
Co-designing the hardware and software might help us gain the benefits
of the microkernel multiserver design with no performance penalties
However, it requires some out-of-the-box thinking
Acknowledgements
OS Kernel Lab at Huawei Technologies
Javier Picorel
Haibo Chen
Huawei Dresden R&D Lab
Focusing on microkernel research, design and development
Basic research
Applied research
Prototype development
Collaboration with academia and other technology companies
Looking for senior operating system researchers, designers, developers and
experts
Previous microkernel experience is a big plus
“A startup within a large company”
Shaping the future product portfolio of Huawei
Including hardware/software co-design via HiSilicon
Q&A
Thank You!
