GCMA: Guaranteed
Contiguous Memory Allocator
SeongJae Park <sj38.park@gmail.com>
These slides were presented during
The Kernel Summit 2018
(https://events.linuxfoundation.org/events/linux-kernel-summit-2018/)
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
I, SeongJae Park
● SeongJae Park <sj38.park@gmail.com>
● PhD candidate at Seoul National University
● Interested in memory management and parallel programming for OS kernels
Contiguous Memory
Requirements
Who Wants Physically Contiguous Memory?
● Virtual memory avoids the need for physically contiguous memory
○ The CPU uses virtual addresses; virtual regions can be mapped to any physical regions
○ The MMU maps virtual addresses to physical addresses
● In particular, Linux uses demand paging and reclaim
[Diagram: the application's CPU issues logical addresses; the MMU translates them to physical addresses in physical memory]
Devices using Direct Memory Access (DMA)
● Lots of devices need large internal buffers
(e.g., a 12-megapixel camera, a 10 Gbps network interface card, …)
● Because the MMU sits behind the CPU, devices that address memory
directly cannot use virtual addresses
[Diagram: the CPU accesses physical memory through the MMU, while a DMA device addresses physical memory directly]
Systems using Huge Pages
● A system can have multiple page sizes, usually 4 KiB (regular) and 2 MiB (huge)
● Using huge pages improves system performance by reducing TLB misses
● The importance of huge pages is growing
○ Modern workloads including big data, cloud, and machine learning are memory intensive
○ Systems with terabytes of memory are not rare now
○ Huge pages are especially important in virtual machine based environments
● A 2 MiB huge page is just 512 physically contiguous regular pages
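The arithmetic behind that last point can be checked directly; a small illustrative snippet (not kernel code), using buddy-allocator "order" terminology where order n means 2^n contiguous regular pages:

```python
# A 2 MiB huge page spans exactly 512 regular 4 KiB pages,
# which makes it an order-9 block in buddy-allocator terms.
REGULAR_PAGE = 4 * 1024        # 4 KiB
HUGE_PAGE = 2 * 1024 * 1024    # 2 MiB

def pages_per_huge_page():
    return HUGE_PAGE // REGULAR_PAGE

def huge_page_order():
    # Smallest order whose block size reaches the huge page size.
    order = 0
    while (REGULAR_PAGE << order) < HUGE_PAGE:
        order += 1
    return order
```

So a single huge-page allocation is exactly a request for 512 physically contiguous regular pages (order 9).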
Existing Solutions
H/W Solutions: IOMMU, Scatter / Gather DMA
● The idea is simple: add MMU-like hardware for devices
● An IOMMU and scatter/gather DMA give devices the illusion of contiguous memory
● Additional hardware increases power consumption and price, which
low-end devices cannot afford
● Useless for many cases, including huge pages
[Diagram: the device's DMA addresses pass through an IOMMU that translates them to physical addresses, analogous to the CPU's MMU]
Even H/W Solutions Impose Overhead
● Hardware-based solutions are well known for low overhead
● Low as it is, though, the overhead is inevitable
Christoph Lameter et al., “User space contiguous memory allocation for DMA”,
Linux Plumbers Conference 2017
Reserved Area Technique
● Reserve a sufficient amount of contiguous area at boot time and
let only contiguous memory allocations use the area
● Simple and effective for contiguous memory allocation
● Memory space utilization can be low if the reserved area is not fully used
● Widely adopted despite the utilization problem
[Diagram: normal process allocations are served from system memory (narrowed by the reservation), while contiguous allocations are served from the reserved area]
CMA: Contiguous Memory Allocator
● Software-based solution in the Linux kernel
● A generously extended version of the reserved area technique
○ Mainly focuses on memory utilization
○ Lets movable pages use the reserved area
○ If a contiguous memory allocation requires pages that are in use as movable pages,
moves those pages out of the reserved area and uses the vacated area for the allocation
○ Solves the memory utilization problem well, because general pages are movable
● In other words, CMA gives different priorities to the clients of the reserved area
○ Primary client: contiguous memory allocation
○ Secondary client: movable page allocation
CMA in Real-land
● Measured the latency of taking a photo with the Camera app on a Raspberry Pi 2
under memory stress (Blogbench), 30 times
● 1.6 seconds worst-case latency with the reserved area technique
● 9.8 seconds worst-case latency with CMA; unacceptable
The Phantom Menace: Latency of CMA
● Measured the latency of CMA in the same situation
● It is clear that the camera latency is bounded by the latency of CMA
CMA: Why So Slow?
● In short, CMA's secondary client is not as nice as expected
● Moving a page out is an expensive task (copying, rmap control, LRU churn, …)
● If someone is holding a page, it cannot be moved until the holder releases
it (e.g., get_user_page())
● Result: unexpectedly long latencies and even failures
● That is why CMA has not been adopted by many devices
● Raspberry Pi does not officially support CMA because of this problem
Buddy Allocator
● Adaptively splits and merges adjacent contiguous pages
● Highly optimized and heavily used in the Linux kernel
● Supports only power-of-two-order contiguous allocations, up to order
MAX_ORDER - 1
● Under fragmentation, it does time-consuming compaction and retries, or just
fails
○ Not good news for tasks requesting contiguous memory, either
● THP uses the buddy allocator to allocate huge pages
○ Quickly falls back to regular pages if it fails to allocate contiguous pages
○ As a result, it cannot be used on highly fragmented memory
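The order limit above can be sketched as a toy helper (illustrative only; MAX_ORDER was 11 in common kernel configurations of this era, so the largest buddy block is 2^10 = 1024 pages, i.e. 4 MiB):

```python
MAX_ORDER = 11  # common kernel default; largest block is order MAX_ORDER - 1

def order_for(nr_pages):
    """Smallest buddy order whose block covers nr_pages,
    or None if the request exceeds the buddy allocator's limit."""
    order = 0
    while (1 << order) < nr_pages:
        order += 1
    return order if order <= MAX_ORDER - 1 else None
```

A 2 MiB huge page (512 pages, order 9) fits, but anything larger than 1024 contiguous pages is simply out of the buddy allocator's reach, regardless of fragmentation.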
THP Performance under High Fragmentation
● Default: THP disabled; Thp: THP always
● .f suffix: on fragmented memory
● Under fragmentation, the THP benefits disappear
(Lower is better)
Guaranteed Contiguous
Memory Allocator
GCMA: Guaranteed Contiguous Memory Allocator
● GCMA is a variant of CMA that guarantees
○ Fast latency and allocation success
○ While keeping memory utilization high as well
● The idea is simple
○ Follow CMA's primary / secondary client idea to keep memory utilization
○ Unlike CMA, accept only nice secondary clients
■ For fast latency, they should vacate the reserved area as soon as required, without
any work such as moving content
■ For guaranteed success, they should be out of kernel control
■ For memory utilization, most pages should be eligible to be secondary clients
■ In short: frontswap and cleancache
Secondary Clients of GCMA
● Pages for a write-through mode frontswap
○ Can be discarded immediately because the contents are written through to the swap device
○ The kernel thinks they are already swapped out
○ Most anonymous pages can be covered
○ We recommend using Zram as the swap device to minimize the write-through overhead
● Pages for cleancache
○ Can be discarded because the contents in storage are up to date
○ The kernel thinks they are already evicted
○ Lots of file-backed pages can be the case
● Additional pro 1: these pages are already expected not to be accessed again soon
○ Discarding secondary-client pages would not affect system performance much
● Additional pro 2: overhead occurs only under severe workloads
○ In the peaceful case, no overhead at all
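The write-through property is what makes discarding safe at any moment. A minimal sketch of the idea, with plain dicts standing in for the swap device and the reserved-area cache (hypothetical structures, not the kernel frontswap API):

```python
# Write-through frontswap sketch: every store goes to the swap device AND
# to the cache in the reserved area, so cached copies are always disposable.
class WriteThroughFrontswap:
    def __init__(self):
        self.swap_device = {}   # authoritative copy (e.g., Zram)
        self.cache = {}         # copy kept in the GCMA reserved area

    def store(self, key, page):
        self.swap_device[key] = page   # written through: device is up to date
        self.cache[key] = page

    def load(self, key):
        if key in self.cache:          # fast path: hit in the reserved area
            return self.cache[key]
        return self.swap_device[key]   # cache was discarded; read the device

    def discard(self, key):
        # Safe at any time: the swap device already holds the content.
        self.cache.pop(key, None)
```

Because `discard` never loses data, vacating the reserved area for a contiguous allocation needs no copying at all, only dropping cache entries.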
GCMA: Workflow
● Reserve a memory area at boot time
● If a page is swapped out or evicted from the page cache, keep the content of
the page in the reserved area
○ If the system requires the content of the page again, give it back from the reserved area
○ If a contiguous memory allocation requires area being used by such pages, discard those
pages and use the area for the contiguous memory allocation
[Diagram: pages reclaimed or swapped out of system memory are cached in the GCMA reserved area in front of the swap device and files, and are given back on a hit; contiguous allocations are served from the reserved area, normal allocations from system memory]
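The workflow can be sketched as a toy simulation (hypothetical structures, not kernel code; it elides the physical contiguity of the frames themselves and only models the client priorities):

```python
# Toy GCMA: frames in the reserved area cache swapped-out / evicted page
# contents; the primary client reclaims frames by discarding, never by moving.
class ToyGCMA:
    def __init__(self, nr_frames):
        self.free = set(range(nr_frames))  # free frames in the reserved area
        self.cached = {}                   # key -> frame holding cached content

    def put(self, key):
        """Keep a swapped-out or evicted page's content in the reserved area."""
        if not self.free:
            return False                   # area full: content simply not cached
        self.cached[key] = self.free.pop()
        return True

    def get(self, key):
        """Hit: give the content back and release its frame."""
        frame = self.cached.pop(key, None)
        if frame is not None:
            self.free.add(frame)
        return frame is not None

    def alloc_contig(self, count):
        """Primary client: discard cached pages as needed (cheap, no copying)."""
        while len(self.free) < count and self.cached:
            _, frame = self.cached.popitem()
            self.free.add(frame)
        if len(self.free) < count:
            return None                    # request larger than the toy area
        return [self.free.pop() for _ in range(count)]
```

The key contrast with CMA is in `alloc_contig`: vacating the area is a constant-time discard per page, with no copying, rmap work, or dependence on who holds a reference.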
GCMA Architecture
● Reuses the CMA interface
○ Users can turn CMA into GCMA entirely, or use both selectively on a single system
● DMEM: Discardable Memory
○ An abstraction serving as the backend of frontswap and cleancache
○ Works as a last-chance cache for secondary-client pages, using the GCMA reserved area
○ Manages an index to pages using a hash table of RB-tree based buckets
○ Evicts pages in LRU order
[Diagram: cleancache and frontswap sit on top of the Dmem interface; Dmem logic and GCMA logic, alongside CMA logic under the CMA interface, manage the reserved area]
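A simplified sketch of a Dmem-style index (the real Dmem uses a hash table of RB-tree buckets in the kernel; here a plain dict stands in for each bucket, with a global LRU list driving eviction):

```python
from collections import OrderedDict

class ToyDmem:
    """Index of cached pages: hashed buckets plus LRU eviction."""
    def __init__(self, nr_buckets, capacity):
        self.buckets = [dict() for _ in range(nr_buckets)]
        self.lru = OrderedDict()          # keys, oldest first
        self.capacity = capacity

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def store(self, key, page):
        if len(self.lru) >= self.capacity:
            old, _ = self.lru.popitem(last=False)   # evict the LRU entry
            self._bucket(old).pop(old, None)
        self._bucket(key)[key] = page
        self.lru[key] = None
        self.lru.move_to_end(key)

    def load(self, key):
        page = self._bucket(key).get(key)
        if page is not None:
            self.lru.move_to_end(key)               # refresh recency
        return page
```

Bucketing bounds per-lookup work, while the LRU order decides which secondary-client page is sacrificed first when the reserved area fills up.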
GCMA Implementation
● Implemented on Linux v3.18, v4.10 and v4.17
● About 1,500 lines of code
● Ported, evaluated on Raspberry Pi 2 and a high-end server
● Available under GPL v3 at: https://github.com/sjp38/linux.gcma
● Submitted to LKML (https://lkml.org/lkml/2015/2/23/480)
Evaluations of
GCMA for a Device
Experimental Setup
● Device setup
○ Raspberry Pi 2
○ ARM cortex-A7 900 MHz
○ 1 GiB LPDDR2 SDRAM
○ Class 10 SanDisk 16 GiB microSD card
● Configurations
○ Baseline: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB reserved area
○ CMA: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB CMA area
○ GCMA: Linux rpi-v3.18.11 + 100 MiB Zram swap +
256 MiB GCMA area
● Workloads
○ Contig Mem Alloc: Contiguous memory allocation using CMA or GCMA
○ Blogbench: Realistic file system benchmark
○ Camera shot: repeatedly taking pictures using the Raspberry Pi 2 camera module, with a 10-second
interval between shots
Contig Mem Alloc without Background Workload
● Average of 30 allocations
● GCMA shows 14.89x to 129.41x lower latency than CMA
● CMA even failed once to allocate 32,768 contiguous pages, even though
no background task was running
4MiB Contig Mem Alloc w/o Background Workload
● The latency of GCMA is more predictable than that of CMA
4MiB Contig Mem Allocation w/ Background Task
● Background workload: Blogbench
● CMA takes 5 seconds (unacceptable!) for a single allocation in the bad case
Camera Latency w/ Background Task
● The camera is a realistic, important application of the Raspberry Pi 2
● Background task: Blogbench
● GCMA keeps the latency as low as the reserved area technique configuration
Blogbench on CMA / GCMA
● Scores are normalized to those of the Baseline configuration
● ‘/ cam’ means the camera workload runs in the background
● The system using GCMA even slightly outperforms the CMA version, owing to GCMA’s light
overhead
● Allocation using CMA even degrades system performance, because it must
move pages out of the reserved area
Evaluations of
GCMA for THP
Experimental Setup
● Device setup
○ Intel Xeon E7-8870 v3
○ L2 TLB with 1024 entries
○ 600 GiB DDR4 memory
○ 500 GiB Intel SSD
○ Linux v4.10 based kernels
● Configurations
○ Nothp: THP disabled
○ Thp.bd: Buddy allocator based THP enabled
○ Thp.cma: GCMA based THP enabled
○ Thp.bc: Buddy allocator based, GCMA fallback using THP enabled
○ .f suffix: On fragmented system
● Workloads
○ 429.mcf and 471.omnetpp from SPEC CPU2006
○ TPC-H Power test
Performance of SPEC CPU 2006
● Two memory-intensive workloads were chosen
● Fragmentation decreases the performance of regular pages, too
● GCMA-based THP shows 2.56x higher performance than the original
THP under fragmentation
● The impact is workload dependent, though
Performance of TPC-H
● Power test: measures the latency of each query of an OLAP workload
● Runtimes are normalized to Thp.bc.f
● Queries 17 and 19 produce speedups of more than 2x
● Four queries (3, 9, 14 and 20) show speedups of 1.5x-2x
Plan for GCMA
Unifying Solutions Under CMA Interface
● There are many solutions and interfaces for contiguous memory allocation
● Too many interfaces can confuse newcomers; we don’t want
to add to the confusion with GCMA
● GCMA is developed to coexist with CMA rather than substitute for it; it
already uses the CMA interface
● The interface could select the secondary clients for each CMA region:
○ None (Reservation)
○ Migratables (CMA)
○ Discardables (GCMA)
● We are currently developing a patchset; it aims to be merged into the mainline
● The patchset will include updated evaluation results
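The per-region selection could look roughly like this (a hypothetical interface sketch to illustrate the three choices above, not the actual patchset):

```python
from enum import Enum

class SecondaryClient(Enum):
    """Per-region policy for who may borrow the reserved area."""
    NONE = "reservation"     # plain reserved area: no secondary client
    MIGRATABLE = "cma"       # movable pages (CMA behavior)
    DISCARDABLE = "gcma"     # frontswap/cleancache pages (GCMA behavior)

def describe_region(name, client):
    # Hypothetical helper: summarize a region's configured policy.
    return f"{name}: secondary client = {client.value}"
```

One interface, three policies: the reservation technique, CMA, and GCMA become configuration choices per region rather than separate allocators.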
Conclusion
● Contiguous memory allocation needs improvement
○ H/W solutions are expensive for low-end devices, impose overhead, and cannot provide real
contiguous memory
○ CMA is slow in many cases
○ The buddy allocator is restricted
● GCMA guarantees fast latency, allocation success, and reasonable memory
utilization
○ It achieves these goals by using only nice secondary clients: the pages for frontswap and cleancache
○ The allocation latency of GCMA is as low as that of the reserved area technique
○ GCMA can be used as a THP allocation fallback
○ 130x and 10x shorter latencies for contiguous memory allocation and taking a photo, respectively,
compared to the CMA-based system
○ More than 50% and 100% performance improvements on 7 and 3 of 24 realistic workloads, respectively
● We plan to release the official version and updated evaluation results in the near
future
Thanks
