Practicing Linux Crash/Panic Issue on Production and Cloud Server
2019 Shanghai Open Source Summit China
Gavin Guo
Technical Lead - Sustaining Engineering
gavin.guo@canonical.com
Migrating KSM page causes the VM lock up as the KSM page merging list is too large
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513
Enterprise Case Study
Case Description
After numad is enabled and several VMs are running on the same host
machine, softlockup messages can be observed in the VMs' dmesg.
CPU: 3 PID: 22468 Comm: kworker/u32:2 Not tainted 4.4.0-47-generic #68-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-252:0)
[<ffffffff81104388>] smp_call_function_many+0x1f8/0x260
[<ffffffff810727d5>] native_flush_tlb_others+0x65/0x150
[<ffffffff81072b35>] flush_tlb_page+0x55/0x90
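The guest-side stack explains the symptom: a cross-CPU TLB flush makes the
sending vCPU spin-wait until every target CPU has acknowledged the flush IPI,
and if the host never schedules a target vCPU, the sender spins long enough
for the softlockup watchdog to fire. Below is a simplified sketch of the
waiting path, loosely modeled on the kernel's csd_lock_wait(); the struct
layout and details differ across kernel versions.

struct call_single_data {
    unsigned int flags;           /* CSD_FLAG_LOCK while the request is in flight */
    void (*func)(void *info);     /* the flush callback the target CPU must run */
    void *info;
};

#define CSD_FLAG_LOCK 0x01

/* The sender spins here until the target CPU has run func and cleared
 * CSD_FLAG_LOCK; a target vCPU that the host never schedules keeps the
 * sender spinning past the watchdog threshold. */
static void csd_lock_wait(struct call_single_data *csd)
{
    while (READ_ONCE(csd->flags) & CSD_FLAG_LOCK)
        cpu_relax();
}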
This looked like a known issue. Linus had proactively worked on the bug when Dave
Jones[3] reported it on a bare-metal machine. Tinoco[2] also hit it in a nested KVM
environment, where the lockup happened while an IPI was being sent from a vCPU, and
the problem appeared to come from the LAPIC emulation in VMX. Chris Arges was also
involved in the debugging: Ingo Molnar provided a debugging patch, and Chris added
some hacks to print extra diagnostic information. Unfortunately, after a long
investigation, the root cause remained unknown.
Investigation on the VM side
[1]. smp/call: Detect stuck CSD locks https://patchwork.kernel.org/patch/6153801/
[2]. smp_call_function_single lockups https://lkml.org/lkml/2015/2/11/247
[3]. frequent lockups in 3.18rc4 https://lkml.org/lkml/2014/11/14/656
I prepared a hotfix kernel that resends the IPI and prints diagnostic
information when the softlockup happens. Unfortunately, the hotfix
kernel never printed the error message, which means my original
theory was wrong!
The hotfix kernel source:
http://kernel.ubuntu.com/git/gavinguo/ubuntu-xenial.git/log/?h=sf000103690-csd-lock-debug
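The idea behind the hotfix is sketched below in hypothetical form, following
the stuck-CSD-lock detection approach of [1]; the threshold constant and
function name are invented for illustration, and the actual branch differs in
detail. Instead of spinning forever, the sender times out, logs the stuck CSD,
and re-sends the IPI in case the first one was lost.

/* Hypothetical sketch: CSD_RESEND_THRESHOLD and the function name are
 * invented; arch_send_call_function_single_ipi() is the real kernel
 * primitive for sending a function-call IPI to one CPU. */
#define CSD_RESEND_THRESHOLD (1ULL << 28)   /* spins before assuming a lost IPI */

static void csd_lock_wait_with_resend(struct call_single_data *csd, int target_cpu)
{
    unsigned long long spins = 0;

    while (READ_ONCE(csd->flags) & CSD_FLAG_LOCK) {
        if (++spins == CSD_RESEND_THRESHOLD) {
            pr_err("csd: CPU#%d stuck, re-sending IPI\n", target_cpu);
            arch_send_call_function_single_ipi(target_cpu);
            spins = 0;
        }
        cpu_relax();
    }
}

Since the hotfix kernel never logged this message, the lost-IPI theory did not
hold for this case.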
Investigation on the VM side
Since I could not find any clue inside the VMs, I moved on to
investigating the host side.
Host Machine - Hung task Backtrace
# ksmd
crash> bt 615
PID: 615 TASK: ffff881fa174a940 CPU: 15 COMMAND: "ksmd"
#0 [ffff881fa1087cc0] __schedule at ffffffff818207ee
#1 [ffff881fa1087d10] schedule at ffffffff81820ee5
#2 [ffff881fa1087d28] rwsem_down_read_failed at ffffffff81823d60
#3 [ffff881fa1087d98] call_rwsem_down_read_failed at ffffffff813f8324
#4 [ffff881fa1087df8] ksm_scan_thread at ffffffff811e613d
#5 [ffff881fa1087ec8] kthread at ffffffff810a0528
#6 [ffff881fa1087f50] ret_from_fork at ffffffff8182538f
Host Machine - Hung task Backtrace
# khugepaged
crash> bt 616
PID: 616 TASK: ffff881fa1749b80 CPU: 11 COMMAND: "khugepaged"
#0 [ffff881fa108bc60] __schedule at ffffffff818207ee
#1 [ffff881fa108bcb0] schedule at ffffffff81820ee5
#2 [ffff881fa108bcc8] rwsem_down_write_failed at ffffffff81823b32
#3 [ffff881fa108bd50] call_rwsem_down_write_failed at ffffffff813f8353
#4 [ffff881fa108bda8] khugepaged at ffffffff811f58ef
#5 [ffff881fa108bec8] kthread at ffffffff810a0528
#6 [ffff881fa108bf50] ret_from_fork at ffffffff8182538f
Host Machine - Hung task Backtrace
# qemu-system-x86
crash> bt 12555
PID: 12555 TASK: ffff885fa1af6040 CPU: 55 COMMAND: "qemu-system-x86"
#0 [ffff885f9a043a50] __schedule at ffffffff818207ee
#1 [ffff885f9a043aa0] schedule at ffffffff81820ee5
#2 [ffff885f9a043ab8] rwsem_down_read_failed at ffffffff81823d60
#3 [ffff885f9a043b28] call_rwsem_down_read_failed at ffffffff813f8324
#4 [ffff885f9a043b88] kvm_host_page_size at ffffffffc02cfbae [kvm]
#5 [ffff885f9a043ba8] mapping_level at ffffffffc02ead1f [kvm]
#6 [ffff885f9a043bd8] tdp_page_fault at ffffffffc02f0b8a [kvm]
#7 [ffff885f9a043c50] kvm_mmu_page_fault at ffffffffc02ea794 [kvm]
#8 [ffff885f9a043c80] handle_ept_violation at ffffffffc01acda3 [kvm_intel]
#9 [ffff885f9a043cb8] vmx_handle_exit at ffffffffc01afdab [kvm_intel]
#10 [ffff885f9a043d48] vcpu_enter_guest at ffffffffc02e026d [kvm]
#11 [ffff885f9a043dc0] kvm_arch_vcpu_ioctl_run at ffffffffc02e698f [kvm]
#12 [ffff885f9a043e08] kvm_vcpu_ioctl at ffffffffc02ce09d [kvm]
#13 [ffff885f9a043ea0] do_vfs_ioctl at ffffffff81220bef
Host Machine - Hung task Backtrace
We can see that the three tasks above are all waiting on the
mmap_sem. The most interesting part is the backtrace of numad,
reconstructed by disassembly analysis of its call stack:
crash> bt 2950
#1 [ffff885f8fb4fb78] smp_call_function_many
#2 [ffff885f8fb4fbc0] native_flush_tlb_others
#3 [ffff885f8fb4fc08] flush_tlb_page
#4 [ffff885f8fb4fc30] ptep_clear_flush
#5 [ffff885f8fb4fc60] try_to_unmap_one
#6 [ffff885f8fb4fcd0] rmap_walk_ksm
#7 [ffff885f8fb4fd28] rmap_walk
#8 [ffff885f8fb4fd80] try_to_unmap
#9 [ffff885f8fb4fdc8] migrate_pages
#10 [ffff885f8fb4fe80] do_migrate_pages
Host Machine - Hung task Backtrace
By disassembling the code, I finally found that the stable_node->hlist is
2,306,920 entries long (2,306,920 x 4 KB, i.e. around 9.2 GB of memory,
merged into a single page).
rmap_item list (stable_node->hlist):
stable_node: 0xffff881f836ba000
stable_node->hlist->first = 0xffff883f3e5746b0
struct hlist_head {
[0] struct hlist_node *first;
}
struct hlist_node {
[0] struct hlist_node *next;
[8] struct hlist_node **pprev;
}
crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
$ wc -l rmap_item.lst
2306920 rmap_item.lst
KSM merge list extraction
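For reference, here is a minimal sketch, using the struct layout shown above,
of what the crash list command effectively did: walk the singly linked chain
from stable_node->hlist->first and count the rmap_items hanging off one
stable_node.

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Walk the chain exactly as the crash "list" command did, one rmap_item
 * per hlist_node; on this dump it would return 2306920. */
static unsigned long count_rmap_items(struct hlist_head *head)
{
    unsigned long n = 0;
    struct hlist_node *pos;

    for (pos = head->first; pos; pos = pos->next)
        n++;
    return n;
}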
Introduction to the KSM Stable Tree
(Stable/Unstable tree)
[Diagram: with /sys/kernel/mm/ksm/merge_across_nodes=0, KSM keeps one
stable tree of stable_nodes and one unstable tree of rmap_items per NUMA
node: root_stable_tree[nr_node_ids] and root_unstable_tree[nr_node_ids],
shown here for nodes 0-3.]
The merge list is as long as 2,306,920 entries.
Automatic NUMA balancing
Local/Remote access
[Diagram: two NUMA nodes, Node 0 (CPU 0-3) and Node 1 (CPU 4-7), running
Processes A-H and their pages, with arrows marking local vs. remote
memory accesses.]
Based on memory access latency, it would be better to migrate Process D
to node 1 and Process E to node 0. The pages that Process A accesses
remotely can be migrated to node 0. However, CPU load also needs to be
considered before migrating the processes.
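As a concrete illustration of the migration primitive involved, the sketch
below uses the move_pages(2) system call from libnuma to ask the kernel to
migrate one page of the calling process to a given node. The buffer and node
number are illustrative; link with -lnuma.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t page_size = 4096;
    void *buf;
    void *pages[1];
    int nodes[1] = { 0 };          /* target NUMA node (illustrative) */
    int status[1];
    long ret;

    if (posix_memalign(&buf, page_size, page_size))
        return 1;
    memset(buf, 0, page_size);     /* fault the page in first */

    pages[0] = buf;
    ret = move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (ret < 0)
        perror("move_pages");
    else
        printf("page is now on node %d\n", status[0]);
    return 0;
}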
https://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg
FlameGraph of the performance problem
When migrating the KSM pages, numad has to send IPIs to flush the
related TLB entries on every CPU that has ever used the PTE.
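To make the cost concrete, here is a simplified pseudo-C rendering of the
loop behind that flame graph, based on the numad backtrace above. Field
names are simplified; the real mm/ksm.c code walks anon_vma chains per
rmap_item and takes additional locks.

/* Simplified sketch of the migration hot path, per the backtrace:
 * rmap_walk() -> rmap_walk_ksm() -> try_to_unmap_one().  Every
 * rmap_item chained on the stable_node is one mapping of the merged
 * page, and each unmap does ptep_clear_flush(), i.e. one TLB-flush
 * IPI broadcast.  With 2,306,920 entries, migrating this single page
 * means millions of IPI round trips while mmap_sem is held. */
static void rmap_walk_ksm_sketch(struct page *page,
                                 struct stable_node *stable_node)
{
    struct rmap_item *rmap_item;

    hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
        /* Clear the PTE for this mapping and flush stale TLB entries
         * on every CPU that may cache it:
         *   ptep_clear_flush()
         *     -> flush_tlb_page()
         *       -> native_flush_tlb_others()
         *         -> smp_call_function_many()   (IPI + spin-wait) */
        try_to_unmap_one(page, rmap_item->vma, rmap_item->address, NULL);
    }
}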
Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
https://www.spinics.net/lists/linux-mm/msg125880.html
80b18dfa53bb ksm: optimize refile of stable_node_dup at the head of the chain
8dc5ffcd5a74 ksm: swap the two output parameters of chain/chain_prune
0ba1d0f7c41c ksm: cleanup stable_node chain collapse case
b4fecc67cc56 ksm: fix use after free with merge_across_nodes = 0
2c653d0ee2ae ksm: introduce ksm_max_page_sharing per page deduplication limit
Solution
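With these commits applied, the chain length is capped by the new
/sys/kernel/mm/ksm/max_page_sharing knob (256 by default, per the patch
description), so a single stable_node can no longer accumulate millions of
rmap_items. A minimal sketch for inspecting the knob on a patched kernel:

#include <stdio.h>

int main(void)
{
    /* Upstream sysfs path added by commit 2c653d0ee2ae; the value bounds
     * how many rmap_items may share one stable_node, which in turn bounds
     * the cost of rmap walks like the one that hung this host. */
    FILE *f = fopen("/sys/kernel/mm/ksm/max_page_sharing", "r");
    unsigned long limit;

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%lu", &limit) == 1)
        printf("KSM max_page_sharing = %lu\n", limit);
    fclose(f);
    return 0;
}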
