SLES Performance Enhancements for Large NUMA Systems
Scott Norton – HP
Davidlohr Bueso – SUSE Labs (formerly HP)
SUSECon, Orlando, FL
November 2014
© 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Agenda 
1. Why focus on locking?
2. Proof of concept
3. Methodology
4. Performance enhancements for 3.0.101-0.15-default
5. Performance enhancements for 3.0.101-0.30-default
6. Performance enhancements for 3.0.101-0.30-bigsmp
7. 3.0.101-0.30-bigsmp vs. 3.0.101-0.8-default
8. Performance enhancements for SLES 12
9. Wrap-up
2
Why Focus on Locking?
Cache Line Contention in Large NUMA Systems 
•Lock contention (particularly spinning lock contention) is the primary, and probably worst, cause of cache line contention 
•Cache line contention does have a “cost” associated with NUMA systems, but it is not the same “cost” that you experience with local vs. remote memory latency in NUMA systems 
•However, it's not only about lock contention
−Cache line contention can also come from sharing cache lines due to poor data structure layout – two fields in a data structure that are accessed by completely different processes/threads, but end up in the same cache line
−Worst case: an unrelated and frequently accessed field occupies the same cache line as a heavily contended lock (see the false-sharing sketch below)
−Other atomic operations, such as atomic-add, can also generate cache line contention
−Additionally, the processor's cache prefetch mechanism may also cause false cache line contention
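To make the data-layout point concrete, here is a minimal, hypothetical C sketch (not taken from any kernel source): a heavily contended spinlock sharing a cache line with an unrelated hot counter, and a padded variant that gives the lock its own cache line. The structure names and the 64-byte line size are assumptions for illustration.

#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64                     /* assumed cache-line size */

/* Bad layout: the hot statistics counter shares a cache line with the
 * heavily contended lock, so counter updates bounce the lock's line. */
struct conn_bad {
    pthread_spinlock_t lock;              /* heavily contended */
    unsigned long stats_counter;          /* frequently updated, unrelated field */
};

/* Better layout: align both members so the lock and the counter live in
 * separate cache lines and no longer false-share. */
struct conn_good {
    alignas(CACHE_LINE) pthread_spinlock_t lock;
    alignas(CACHE_LINE) unsigned long stats_counter;
};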
4
Questions Driving Investigation 
•How is performance impacted when cache line contention is: 
•contained within a socket? 
•spread among multiple sockets? 
•using <n> cores in a socket vs <n> cores from multiple sockets? 
•using all cores in 4-socket, 8-socket, and 16-socket systems? 
5
Demonstrating Cache Line Contention Effects 
•Test program to show the cost of cache line contention in large NUMA systems: 
•Bind threads (1 per core) to specified cores. Memory is allocated from a specific node. 
•Once the threads are synchronized, perform a tight loop doing spin_lock/spin_unlock 1,000,000 times. This generates an extreme amount of cache line contention. The spinlock implementation was taken from a Linux 3.0 based kernel. 
•Based on the number of threads and the loop iteration count we can calculate the average number of “operations per second per CPU” when <N> CPUs are involved in the cache line contention. 
•This is not a real-world test. While this is a micro-benchmark, it does show the effects of cache line contention so that real code can be written with cache line contention in mind. (A simplified user-space sketch of the test loop follows the list of test systems below.)
•Test systems: 
•4-sockets / 60-cores / 120-threads, Intel Xeon E7-4890 v2, 1 TB
•8-sockets / 120-cores / 240-threads, Intel Xeon E7-2890 v2, 6 TB
•16-sockets / 240-cores / 480-threads, Intel Xeon E7-2890 v2, 12 TB
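The following is a minimal user-space sketch of the kind of test loop described above; it is not the original HP test program. It uses a pthread spinlock instead of the Linux 3.0 ticket-spinlock code and omits NUMA-node memory placement. CPU numbers are passed on the command line, one thread is bound per CPU, and all threads hammer the same lock for 1,000,000 iterations.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000UL

static pthread_spinlock_t lock;
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    cpu_set_t set;
    int cpu = (int)(long)arg;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                     /* bind this thread to one core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    pthread_barrier_wait(&barrier);         /* start all threads together */
    for (unsigned long i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);           /* tight lock/unlock loop: maximum */
        pthread_spin_unlock(&lock);         /* cache line contention */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int ncpus = argc - 1;                   /* CPUs to use, listed as arguments */
    pthread_t *tids = calloc(ncpus, sizeof(*tids));
    struct timespec t0, t1;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_barrier_init(&barrier, NULL, ncpus + 1);

    for (int i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, worker,
                       (void *)(long)atoi(argv[i + 1]));

    pthread_barrier_wait(&barrier);         /* release the workers... */
    clock_gettime(CLOCK_MONOTONIC, &t0);    /* ...and time only the loop */
    for (int i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d CPUs: %.6f s, %.0f ops/sec per CPU\n", ncpus, secs, ITERS / secs);
    return 0;
}

Running it first with CPUs from one socket and then with CPUs spread across sockets reproduces the fill-first vs. round-robin comparison shown on the following slides.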
6
Contention within a Socket: Increasing Core Count 
Performance degrades smoothly as more cores are involved in cache line contention 
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 2-core | % decrease from previous
Node 1 | Node 1 | 1-socket | 2-cores | 1.704489 | 5,866,861 | 0.0% | 0.0%
Node 1 | Node 1 | 1-socket | 3-cores | 2.783121 | 3,593,088 | 38.8% | 38.8%
Node 1 | Node 1 | 1-socket | 4-cores | 4.012157 | 2,492,425 | 57.5% | 30.6%
Node 1 | Node 1 | 1-socket | 5-cores | 5.506802 | 1,815,936 | 69.0% | 27.1%
Node 1 | Node 1 | 1-socket | 6-cores | 7.110453 | 1,406,380 | 76.0% | 22.6%
Node 1 | Node 1 | 1-socket | 7-cores | 7.834159 | 1,276,461 | 78.2% | 9.2%
Node 1 | Node 1 | 1-socket | 8-cores | 10.054136 | 994,616 | 83.0% | 22.1%
Node 1 | Node 1 | 1-socket | 9-cores | 11.185041 | 894,051 | 84.8% | 10.1%
Node 1 | Node 1 | 1-socket | 10-cores | 13.508867 | 740,255 | 87.4% | 17.2%
Node 1 | Node 1 | 1-socket | 11-cores | 14.839633 | 673,871 | 88.5% | 9.0%
Node 1 | Node 1 | 1-socket | 12-cores | 16.490477 | 606,411 | 89.7% | 10.0%
Node 1 | Node 1 | 1-socket | 13-cores | 19.138960 | 522,494 | 91.1% | 13.8%
Node 1 | Node 1 | 1-socket | 14-cores | 20.704514 | 482,986 | 91.8% | 7.6%
7
Contention across 2-Sockets 
•All 30 cores in 2 nodes/sockets participate in the cache line contention: 
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 1-socket
Node 0 | Node 1 | 1-socket | 15-cores | 2.107396 | 474,519 | 0.0%
Nodes 0-1 | Node 1 | 2-socket | 30-cores | 14.450938 | 69,200 | 85.4%
Nodes 1-2 | Node 1 | 2-socket | 30-cores | 14.897306 | 67,126 | 86.0%
Nodes 2-3 | Node 1 | 2-socket | 30-cores | 21.742537 | 45,993 | 90.4%
•There are two interesting points here: 
1) There is a huge drop in performance when going from 15-cores on 1-socket to 30-cores on 2-sockets
2) There is a smaller drop in performance when the lock's memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)
8
Contention across 2-Sockets: Increasing Core Count 
•We can see that the huge drop in performance occurs once we add a single core from the second socket.
•This is due to the need to go through QPI to handle the cache-to-cache traffic to resolve the cache line contention.
•This is a significant drop in performance when going through QPI.
•Add one core at a time, filling node/socket-0 first, then filling node/socket-1:
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 2-core | % decrease from previous
Node 0 | Node 1 | 1-socket | 13-cores | 1.649242 | 606,339 | 92.7% | 9.1%
Node 0 | Node 1 | 1-socket | 14-cores | 1.905878 | 524,693 | 93.7% | 13.5%
Node 0 | Node 1 | 1-socket | 15-cores | 1.649242 | 482,435 | 94.2% | 8.1%
Nodes 0-1 | Node 1 | 2-sockets | 16-cores | 1.905878 | 129,309 | 98.4% | 73.2%
Nodes 0-1 | Node 1 | 2-sockets | 17-cores | 8.348480 | 119,782 | 98.6% | 7.4%
Nodes 0-1 | Node 1 | 2-sockets | 18-cores | 8.264046 | 121,006 | 98.5% | -1.0%
Nodes 0-1 | Node 1 | 2-sockets | 30-cores | 15.146260 | 66,023 | 99.2% | 8.5%
9
Contention across 2-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 2, 8, and 14 cores 
•Cores spread among 2-sockets (round-robin) vs. all cores in one socket (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 2-cores | 2-cores | 0.120395 | 8,305,993
Nodes 0-1 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.314462 | 3,180,034
Nodes 1-2 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.305783 | 3,270,293
Nodes 2-3 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.453627 | 2,204,454
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.018527 | 981,810
Nodes 0-1 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.351590 | 298,366
Nodes 1-2 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.390266 | 294,962
Nodes 2-3 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 5.354243 | 186,768
Node 1 | Node 1 | 1-socket FF | 14-cores | 14-cores | 2.067889 | 483,585
Nodes 0-1 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.214167 | 160,923
Nodes 1-2 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.275140 | 159,359
Nodes 2-3 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 9.471300 | 105,582
1) The NUMA effect is visible when the memory is remote
2) Performance is best when all cores are in one socket
10
Contention across 4-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 4, 8, and 12 cores 
•Cores spread (round-robin) among 4-sockets vs. all cores in one socket (fill-first) 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 0.941517 | 1,062,116
Nodes 0-3 | Node 1 | 4-sockets RR | 2-cores | 8-cores | 5.421381 | 184,455
Node 1 | Node 1 | 1-socket FF | 12-cores | 12-cores | 1.794806 | 557,163
Nodes 0-3 | Node 1 | 4-sockets RR | 3-cores | 12-cores | 8.937035 | 111,894
Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 49.786041 | 20,086
•Cache line contention is clearly better when all the contention is contained within a single socket. 
•For the same core count, performance degrades as more sockets are involved in the contention 
11
Contention across 8-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 8, 16 and 24 cores 
•Cores spread (round-robin) among 8-sockets vs. all cores in two sockets (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
Nodes 0-1 | Node 1 | 2-sockets FF | 16-cores | 16-cores | 8.886286 | 112,533
Nodes 0-7 | Node 1 | 8-sockets RR | 2-cores | 16-cores | 22.296164 | 44,851
Nodes 0-1 | Node 1 | 2-sockets FF | 24-cores | 24-cores | 12.991910 | 76,626
Nodes 0-7 | Node 1 | 8-sockets RR | 3-cores | 24-cores | 36.197777 | 27,626
Nodes 0-7 | Node 1 | 8-sockets FF | 15-cores | 120-cores | 172.782623 | 5,788
•Cache line contention is clearly better when all the contention is contained within as few sockets as possible. 
12
Contention across 16-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 16, 32 and 64 cores 
•Cores spread (round-robin) among 16-sockets vs. all cores in 1/2/4 sockets (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
Nodes 0-1 | Node 1 | 2-sockets FF | 15-cores | 30-cores | 15.706788 | 63,667
Nodes 0-15 | Node 1 | 16-sockets RR | 2-cores | 32-cores | 53.217117 | 18,791
Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 74.909485 | 13,349
Nodes 0-15 | Node 1 | 16-sockets RR | 4-cores | 64-cores | 109.447632 | 9,137
Nodes 0-15 | Node 1 | 16-sockets RR | 15-cores | 240-cores | 410.881287 | 2,434
•Cache line contention is clearly better when all the contention is contained within as few sockets as possible. 
13
Inter- vs. Intra-Cache Line Contention Probability
•On a 4-socket/60-core system you have a 25% chance that any two random cores participating in the same cache line contention are on the same socket
•On an 8-socket/120-core system this is reduced to a 12.5% chance
•With a 16-socket/240-core system you have only a 6.25% chance
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
14
Why do we care so much? 
•Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU. 
•However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.
•Going from 40 or 80 cores to 240 cores (16 sockets) is a major jump.
•Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel. 
•As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system –this is a major concern when dealing with 8-and 16-socket platforms. 
•This has led us to pursue minimizing cache line contention within Linux kernel locking primitives. 
15
Proof of Concept
Background
The AIM7 fserver workload* scales poorly on an 8-socket/80-core NUMA platform with a 2.6-based kernel
* The workload was run with ramfs.
17
From the perf -g output, we find most of the CPU cycles are spent in file_move() and file_kill(). 
40 Users (4000 jobs) 
+ 9.40% reaim reaim [.] add_int 
+ 6.07% reaim libc-2.12.so [.] strncat 
….. 
-1.68% reaim [kernel.kallsyms] [k] _spin_lock 
-_spin_lock 
+ 50.36% lookup_mnt 
+ 7.45% __d_lookup 
+ 6.71% file_move 
+ 5.16% file_kill 
+ 2.46% handle_pte_fault 
Proportion of file_move() = 1.68% * 6.71% = 0.11% 
Proportion of file_kill() = 1.68% * 5.16% = 0.09%
Proportion of file_move() + file_kill() = 0.20%
400 users (40,000 jobs) 
-79.53% reaim [kernel.kallsyms] [k] _spin_lock 
-_spin_lock 
+ 34.28% file_move 
+ 34.20% file_kill 
+ 19.94% lookup_mnt 
+ 8.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 
+ 0.86% reaim [kernel.kallsyms] [k] _spin_lock_irqsave 
+ 0.63% reaim reaim [.] add_long 
Proportion of file_move() = 79.53% * 34.28% = 27.26% 
Proportion of file_kill() = 79.53% * 34.20% = 27.20% 
Proportion of file_move() + file_kill() = 54.46%
Analysis (1-2) 
This is significant spinlock contention! 
18
We use the ORC tool to monitor the coherency controller results 
(ORC is a platform dependent tool from HP that reads performance counters in the XNC node controllers) 
Coherency Controller Transactions Sent to Fabric Link (PRETRY number) 
Socket | Agent | 10 users | 40 users | 400 users
0 0 17,341 36,782 399,670,585 
0 8 36,905 45,116 294,481,463 
1 0 0 0 49,639 
1 8 0 0 25,720 
2 0 0 0 1,889 
2 8 0 0 1,914 
3 0 0 0 3,020 
3 8 0 0 3,025 
4 1 45 122 1,237,589 
4 9 0 110 1,224,815 
5 1 0 0 26,922 
5 9 0 0 26,914 
6 1 0 0 2,753 
6 9 0 0 2,854 
7 1 0 0 6,971 
7 9 0 0 6,897 
PRETRY indicates the associated read needs to be re-issued.
We can see that when users increase, PRETRY on socket 0 increases rapidly. 
There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock. 
PRETRY number on socket 0: 400 users = 400M + 294M = 694M 
Analysis (2-2) 
19
•Code snippet from the 2.6 based kernel for file_move() and file_kill():

extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
#define file_list_unlock() spin_unlock(&files_lock);

void file_move(struct file *file, struct list_head *list)
{
    if (!list)
        return;
    file_list_lock();
    list_move(&file->f_u.fu_list, list);
    file_list_unlock();
}

void file_kill(struct file *file)
{
    if (!list_empty(&file->f_u.fu_list)) {
        file_list_lock();
        list_del_init(&file->f_u.fu_list);
        file_list_unlock();
    }
}
Removing Cache Line Contention 
20 
•Contention on this global spinlock is the cause of all the cache line contention 
•We developed a prototype MCS/Queued spinlock to see its effect on cache line traffic 
•MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word 
•Implementation is available in the back-up slides 
•No efforts were made to make this a finer grained lock
Comparing the performance of the new kernel (blue line) vs. the original kernel (red line) 
Prototype Benchmark Results 
2.4x improvement in throughput with the MCS/Queued spinlock prototype!
21
•The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)
•The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.
400 users (40,000 jobs)
44.71% reaim [kernel.kallsyms] [k] _spin_lock 
-60.94%--lookup_mnt 
…. 
22.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 
-96.16%--__mutex_lock_slowpath 
…… 
1.19% reaim [kernel.kallsyms] [k] file_kill 
1.19% reaim [kernel.kallsyms] [k] file_move 
Proportion of lookup_mnt() = 27.2%
Proportion of __mutex_lock_slowpath() = 21.3%
Proportion of file_move() + file_kill() = 2.38%
Prototype Analysis (1-2) 
perf -g output of the kernel with the MCS/Queued spinlock prototype:
22
Coherency controller results of the kernel with the MCS/Queued spinlock 
Coherency Controller Transactions Sent to Fabric Link (PRETRY number) 
Socket | Agent | 10 users | 40 users | 400 users
0 0 18,216 24,560 83,720,570 
0 8 37,307 42,307 43,151,386 
1 0 0 0 0 
1 8 0 0 0 
2 0 0 0 0 
2 8 0 0 0 
3 0 0 0 0 
3 8 0 0 0 
4 1 52 222 16,786 
4 9 28 219 10,068 
5 1 0 0 0 
5 9 0 0 0 
6 1 0 0 0 
6 9 0 0 0 
7 1 0 0 0 
7 9 0 0 0 
We can see that as users increase, PRETRY in socket 0 also increases –but it is significantly lower than the kernel without the MCS/Queued lock. 
The PRETRY number for socket 0: 400 users = 84M + 43M = 127M. 
This value is about 1/5 of the original kernel (694M). 
This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock. 
Prototype Analysis (2-2) 
23
•The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm. 
•The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention. 
•However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic. 
•System time spent spinning on the lock dropped from 54% to 2%. 
•Lock algorithms can play a huge factor in the performance of large-scale systems 
•The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system 
•This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements. 
Proof of Concept – Conclusions
24
Methodology
Benchmarks and Tools Used 
•SPECjbb2005 / SPECjbb2013 
•AIM7 
•Swingbench OLTP 
•Swingbench DSS 
•Ebizzy 
•Perf-bench 
•Kernbench 
•Hackbench 
•Futextest 
•IOzone 
•fio 
•SLOB (Silly Little Oracle Benchmark) 
•Customer benchmarks and apps 
26 
•perf 
•ORC/ORC2 (HP internal tool) 
•LiKI (HP internal tool) 
•lockstat 
•numatop 
•slabtop 
•ftrace, strace 
•top, sar, vmstat, powertop 
SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
Development Process 
•Identify areas to improve performance and reduce cache line contention 
•Determine if an upstream patch already fixes the problem. 
•If no upstream patches exist: 
•Develop fix 
•Validate performance improvement 
•Submit the patches upstream and work through the upstream process to get them accepted 
•Depending on upstream feedback this may be an iterative process until the changes are accepted 
•Back-port patches to the SLES kernels 
•Validate performance improvement 
•Submit back-ported patches along with performance justification to SUSE for inclusion in SLES
•SUSE delivers a PTF kernel to HP containing the performance enhancements
•HP performs a final performance validation of the PTF kernels, sometimes with a customer
•Changes are put into the next SLES Maintenance Update for all SUSE customers to use.
27
SUSE / HP Collaboration 
•Bugzillas were filed for each set of patches 
•Technical discussions with SUSE Performance engineers 
•Discussions on the design of the patch, the back-port of the patch, the performance validation data, kABI impacts, etc 
•Engaged SUSE performance engineers in the analysis of the problem as needed 
•Sometimes restructured the back-ported patch to work better with SLES 11sp3 after consultation with SUSE 
•Some patches submitted upstream had initial review and feedback from SUSE performance engineers prior to the upstream submittal
•The recognition of customer benefit and the flexibility of both companies really made all this possible. 
28
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.15-default
Idle Balance 
•One of the problems with significant contention on blocking locks (such as a mutex) is that as more processes block on the mutex there are fewer runnable tasks – this causes the idle balancer to take processes from a different CPU's run queue, which in turn causes even further cache issues.
•Ensure that we don't attempt an idle balance operation when it takes longer to do the balancing than the time the CPU would be idle
•We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU (a simplified stand-alone sketch of this cut-off follows the results below)
•Max-time-to-balance is decayed at a rate of about 1% per second
•Improve the accuracy of the average CPU idle duration.
•Previously the average CPU idle duration was overestimated, resulting in too much idle balancing
30 
Java operations (16-sockets / 240-cores / 480-threads): 18,600 before changes → 93,609 after changes
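As an illustration of the cut-off logic described above, here is a stand-alone, simplified C sketch; it is not the actual scheduler code, and the field names and the roughly-1%-per-second decay are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-scheduler-domain bookkeeping (illustrative only). */
struct domain_balance_stats {
    uint64_t max_balance_cost_ns;   /* worst observed newidle-balance cost */
    uint64_t next_decay_ns;         /* when to next decay that maximum */
};

static bool should_idle_balance(struct domain_balance_stats *d,
                                uint64_t now_ns, uint64_t avg_idle_ns)
{
    /* Decay the recorded maximum by roughly 1% per second so one slow
     * balance does not suppress idle balancing forever. */
    if (now_ns >= d->next_decay_ns) {
        d->max_balance_cost_ns -= d->max_balance_cost_ns / 100;
        d->next_decay_ns = now_ns + 1000000000ULL;   /* +1 second */
    }

    /* Skip the balance when it is expected to take longer than the time
     * this CPU is expected to stay idle. */
    return avg_idle_ns > d->max_balance_cost_ns;
}

static void record_balance_cost(struct domain_balance_stats *d, uint64_t cost_ns)
{
    if (cost_ns > d->max_balance_cost_ns)
        d->max_balance_cost_ns = cost_ns;   /* remember the worst case seen */
}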
Mutex Atomics Reduction 
•Reduce the number of atomic operations when trying to acquire a mutex 
−Entry into the mutex lock slow path will cause 2 atomic _xchg instructions to be issued
−With the atomic decrement in the fast path, a total of 3 atomic read-modify-write instructions are issued in rapid succession
−This causes lots of cache bouncing when many tasks are trying to acquire the mutex simultaneously
−By using atomic_read() to check the value before calling atomic_xchg() or atomic_cmpxchg() we avoid unnecessary cache coherency traffic (a stand-alone sketch follows the results below)
•Also known as compare-compare-and-swap.
−For example:
if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1))
31 
Java operations (16-sockets / 240-cores / 480-threads): 93,609 before changes → 137,370 after changes
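A stand-alone sketch of the compare-before-compare-and-swap pattern above, written with C11 atomics rather than the kernel's atomic_t API; the function and field names are illustrative.

#include <stdatomic.h>
#include <stdbool.h>

/* count semantics (borrowed from the slide): 1 = unlocked, 0 = locked. */
static bool mutex_fast_trylock(atomic_int *count)
{
    /* Plain read first: if the mutex is not free, return without issuing
     * a read-modify-write, so the cache line can stay in shared state. */
    if (atomic_load_explicit(count, memory_order_relaxed) != 1)
        return false;

    /* Only now attempt the exclusive (line-invalidating) update. */
    int expected = 1;
    return atomic_compare_exchange_strong(count, &expected, 0);
}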
XFS Small Block Size Performance 
•A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications.
•Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4
•The primary patch introduces per-filesystem I/O completion workqueues (as opposed to global workqueues)
•Allows concurrency on the workqueues – blocking on one inode does not block others on a different inode.
•These patches were back-ported to SLES 11sp3 (and by default are now part of 11sp4 and 12)
•Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s
•Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s
•Also improves 16k over-write performance as well as 4k initial-write and over-write performance.
32
Futex Hash Size and Wait Queues 
•Hash Bucket Size
−Originally 256 hash buckets for the whole system
−Now 256 * nCPUs cache-line-aligned hash buckets
•Fewer collisions and more spinlocks, leading to more parallel futex call handling.
−Distribute the table among NUMA nodes instead of a single one.
−The perfect hash size would, of course, have a one-to-one hash-bucket:futex ratio.
•Lockless Wait Queue Size
−A common misuse of futexes is to make FUTEX_WAKE calls when there are no waiters.
−In FUTEX_WAKE, there's no reason to take the hb->lock if we already know the list is empty and thus there is no one to wake up.
−Use an independent atomic counter to keep track of the list size (a stand-alone sketch follows the results below).
−This can drastically reduce contention on the hash bucket spinlock.
33 
% of system time spent spinning on hb->lock for a large database: 43.71% before changes → 0.10% after changes
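A stand-alone illustration of the lockless FUTEX_WAKE check described above; this is not the kernel futex code, and the bucket layout, names, and use of a pthread mutex in place of hb->lock are assumptions.

#include <pthread.h>
#include <stdatomic.h>

struct futex_bucket {
    atomic_int waiters;          /* number of tasks queued on this bucket */
    pthread_mutex_t lock;        /* stands in for the kernel's hb->lock */
    /* ... wait list ... */
};

static int bucket_wake(struct futex_bucket *hb, int nr_wake)
{
    /* Common misuse: FUTEX_WAKE with nobody waiting. A plain atomic read
     * lets us bail out without ever touching the contended bucket lock. */
    if (atomic_load_explicit(&hb->waiters, memory_order_acquire) == 0)
        return 0;

    pthread_mutex_lock(&hb->lock);
    /* ... walk the wait list and wake up to nr_wake waiters ... */
    pthread_mutex_unlock(&hb->lock);
    return nr_wake;              /* illustrative: real code returns the number actually woken */
}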
Futex: Larger NUMA-aware Hash Table 
[Chart: Futex Hash Table Scaling – operations per second (0 to 1,400,000) vs. number of threads (32, 64, 80, 128, 256, 512) for four configurations: baseline, aligned buckets, large NUMA-aware table, and aligned buckets plus large NUMA-aware table.]
34
3.0.101-0.15-default: Perf Improvements 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
35 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.30-default
Mutex Optimizations 
•Unlock a mutex without acquiring the wait lock
−Workloads with high amounts of mutex contention would spend significant time spinning on the mutex's internal waiter lock, which then delays the mutex from getting unlocked.
−Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters (see the sketch below).
−Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.
•Mutex slowpath optimizations
−When a lock can't be acquired and a thread enters the mutex slowpath, it puts itself on the wait list and tries one last time to acquire the mutex.
−Changed the order so that acquisition of the mutex is attempted first
−If the mutex is acquired this way, we do not have to remove the thread from the waiter list
37 
Java operations (16-sockets / 240-cores / 480-threads): 109,933 before changes → 156,912 after changes
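A stand-alone sketch of the reordered unlock path described above; the mutex layout and names are illustrative, not the kernel's struct mutex, and a pthread mutex stands in for the internal waiter lock.

#include <pthread.h>
#include <stdatomic.h>

struct my_mutex {
    atomic_int count;            /* 1 = unlocked, 0 = locked, <0 = waiters */
    pthread_mutex_t wait_lock;   /* protects the waiter list */
    /* ... waiter list ... */
};

static void my_mutex_unlock(struct my_mutex *m)
{
    /* Release the mutex word FIRST: spinners on the fast path can now
     * take the mutex even if we stall on wait_lock below. */
    atomic_store_explicit(&m->count, 1, memory_order_release);

    /* Only then take the internal lock to wake a queued waiter, if any. */
    pthread_mutex_lock(&m->wait_lock);
    /* ... pick the first waiter off the list and wake it ... */
    pthread_mutex_unlock(&m->wait_lock);
}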
Lockref 
•Reference counts are normally used to track the lifecycle of data structures. 
−A reference count of zero means the structure is unused and is free to be released 
−A positive reference count indicates how many tasks are actively referencing the structure 
−When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended. 
•The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock protected reference count. 
−Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be updated atomically while no one is holding the lock (see the sketch below).
•The VFS layer makes heavy use of reference counts for dentry operations. 
−Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock contention on the dentry reference count update. 
−The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock. 
38 
% of time spent spinning on the dentry lock (AIM-7 short workload): 83.74% before changes → 0.01% after changes
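A stand-alone sketch of the lockref idea described above, using C11 atomics; the 32-bit lock/count split and names are illustrative rather than the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

union lockref {
    uint64_t combined;               /* lock and count updated as one word */
    struct {
        uint32_t lock;               /* 0 = unlocked */
        uint32_t count;              /* reference count */
    };
};

/* Try to take a reference without touching the spinlock: only succeeds
 * while nobody holds the lock, using a single 64-bit CAS on lock+count. */
static bool lockref_get_not_locked(_Atomic uint64_t *lr)
{
    union lockref cur, next;

    cur.combined = atomic_load(lr);
    while (cur.lock == 0) {
        next = cur;
        next.count++;
        if (atomic_compare_exchange_weak(lr, &cur.combined, next.combined))
            return true;             /* count bumped, lock never taken */
        /* CAS failure reloaded 'cur'; retry unless the lock got taken. */
    }
    return false;                    /* caller falls back to taking the lock */
}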
Lockref and Dentry Benchmark Results 
~ 7x improvement in throughput (JPM)! 
39 
[Chart: AIM7 short workload, 8-socket / 80-core / 1-TB – Jobs Per Minute vs. simulated users (10 to 2000), pre-lockref changes vs. post-lockref changes.]
SELinux Bit-Map Management 
•Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.
•Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance-optimized version.
−More efficient bit-map management
•The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload
40 
System time in the mls_level_isvalid() code path: 8.95% before changes → 0.12% after changes
Hugetlb: Parallel Page Faults 
•The kernel originally serialized hugetlb page faults, handling a single fault at a time.
−Workloads with large working sets backed by hugepages (e.g. databases or KVM guests) can suffer especially painful startup times due to this.
−The serialization provides protection from spurious OOM errors under conditions of low availability of free hugepages.
−This problem is specific to hugepages because it is normal to want to use every single hugepage in the system – with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.
•Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized (see the sketch below).
−The hash key is selected depending on the mapping type.
−Because the size of the table is static, this can, in theory, still produce contention if enough hugepages are being reserved, but in practice this has not been observed.
41 
Startup time (seconds) of a 10-GB Oracle DB (Data Mining): 37.5 before changes → 25.7 after changes
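A stand-alone sketch of the mutex-table idea described above; the table size, the hash, and the use of pthread mutexes are assumptions for illustration (the kernel derives its key from the mapping, as noted on the slide).

#include <pthread.h>
#include <stdint.h>

#define FAULT_MUTEX_TABLE_SIZE 256        /* illustrative, statically sized */

/* GCC designated-range initializer, as used in kernel-style code. */
static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE] =
    { [0 ... FAULT_MUTEX_TABLE_SIZE - 1] = PTHREAD_MUTEX_INITIALIZER };

/* Pick the mutex that serializes faults on one particular hugepage
 * (identified here by its mapping and page index), so faults on
 * different hugepages can proceed in parallel. */
static pthread_mutex_t *hugetlb_fault_mutex(const void *mapping,
                                            unsigned long idx)
{
    uint64_t key = (uint64_t)(uintptr_t)mapping ^
                   (idx * 0x9e3779b97f4a7c15ULL);   /* simple mixing hash */
    return &fault_mutex_table[key % FAULT_MUTEX_TABLE_SIZE];
}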
epoll_ctl() Scaling 
•Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance.
•Further investigation identified a set of patches already developed for the upstream Linux kernel:
•Don't take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it's not needed)
•Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals
•RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates. (A minimal user-space illustration of the contended epoll_ctl() pattern follows the results below.)
42 
Java operations (16-sockets / 240-cores / 480-threads): 87,903 before changes → 165,119 after changes
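For context, this is the user-space pattern that exercises the code path above: threads concurrently adding and removing file descriptors on an epoll instance with epoll_ctl(). The pipe file descriptor is just a convenient target for the example.

#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN };

    if (epfd < 0 || pipe(pipefd) < 0) {
        perror("setup");
        return 1;
    }
    ev.data.fd = pipefd[0];

    /* Before the patches, these calls could serialize on the global
     * epmutex; many threads doing this concurrently is what contended it. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        perror("EPOLL_CTL_ADD");
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL) < 0)
        perror("EPOLL_CTL_DEL");

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return 0;
}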
3.0.101-0.30-default: Perf Improvements 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
43 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
3.0.101-0.30-default: 946,552
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.30-bigsmp
Why the “bigsmp” kernel? 
•Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)
•The User Application ABI remains the same – all applications that run on the "default" SLES 11sp3 kernel have full binary and source compatibility with the "bigsmp" SLES 11sp3 kernel.
•There was a small possibility that this kABI breakage would impact kernel drivers and modules
•Rather than risk compatibility issues at customer sites, SUSE created the "bigsmp" flavor of the SLES 11sp3 kernel, which contains these additional performance enhancements.
•The bigsmp flavor of SLES 11sp3 has its own kABI
•Requires a recompile of kernel drivers and modules
•SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.
•All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA "default" kernels. Bigsmp will be an on-going flavor of SLES 11sp3 for all platforms.
45
Mutex MCS Optimizations 
•The MCS lock is a new locking primitive inside Linux
•Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.
•Maintains a list of spinning waiters.
•When the lock is released, the unlocker changes the local variable of the next spinner.
•This change causes that spinner to stop spinning and acquire the lock.
•Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when simple CAS (compare-and-swap) calls fail.
•Fair: the lock is passed to each locker in the order in which it arrived.
•Specialized cancelable MCS locking was applied internally to kernel mutexes
•The cancelable MCS lock is a specially tailored MCS variant: when a spinner needs to reschedule, it must be able to abort its spinning in order to block.
46 
Java operations (16-sockets / 240-cores / 480-threads): 137,268 before changes → 250,981 after changes
All Mutex Lock Performance Improvements 
•Performance measurements were done with a popular Java based workload (higher number of operations is better) 
•System used: 16-sockets, 240-cores, 480-threads 
47 
Number of operations:
Baseline prior to any mutex changes: 72,089
3 non-MCS mutex changes: 137,268 (90% over baseline)
All mutex changes: 250,891 (248% over baseline, 83% over non-MCS mutex)
Per-thread VMA caching 
•A process's address space is divided among VMAs (virtual memory areas), each storing a range of addresses that share similar properties, such as permissions.
−A common operation when dealing with memory is locating (find_vma()) the VMA that contains a given address.
•Traditionally the Linux kernel caches the last used VMA.
−Avoids expensive tree lookups (which scale poorly in multi-threaded programs).
−This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).
•Replace the single-entry cache with a small, per-thread hash table (see the sketch below).
−O(1) lookups/updates, cheap to maintain, and small overhead.
−Improves poor-locality hit rates to ~99.9%.
−Improves Oracle 11g Data Mining (4k pages) hit rates from 70% to 91%.
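A stand-alone sketch of a small per-thread VMA cache as described above; the cache size, the hash, and the omission of the kernel's invalidation-by-sequence-number are all simplifications.

#include <stddef.h>
#include <stdint.h>

#define VMACACHE_SIZE 4                     /* small, fixed-size cache */

struct vma { uintptr_t start, end; /* ... other fields ... */ };

static _Thread_local struct vma *vmacache[VMACACHE_SIZE];

static inline unsigned int vmacache_hash(uintptr_t addr)
{
    return (addr >> 12) & (VMACACHE_SIZE - 1);   /* O(1): a few page-number bits */
}

/* Fast path for find_vma(): return the cached VMA if it covers 'addr'. */
static struct vma *vmacache_find(uintptr_t addr)
{
    struct vma *v = vmacache[vmacache_hash(addr)];

    if (v && v->start <= addr && addr < v->end)
        return v;                   /* hit: no tree lookup needed */
    return NULL;                    /* miss: caller walks the VMA tree */
}

static void vmacache_update(uintptr_t addr, struct vma *v)
{
    vmacache[vmacache_hash(addr)] = v;  /* remember the most recent result */
}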
48
3.0.101-0.30-bigsmp: Perf Improvements 
•Performance measurements were done with a popular Java based workload (higher number of operations is better) 
49 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
3.0.101-0.30-default: 946,552
3.0.101-0.30-bigsmp: 994,771
3.0.101-0.30-bigsmp
vs.
3.0.101-0.8-default
3.0.101-0.8-default Scaling 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
51 
Java operations, 1/2/4/8/16 sockets:
1-socket / 15-cores: 69,426
2-sockets / 30-cores: 114,617
4-sockets / 60-cores: 180,922
8-sockets / 120-cores: 139,825
16-sockets / 240-cores: 6,502 (can't get valid result at 16-sockets)
•Scaling issues at 8- and 16-sockets. Other Linux distributions are similar.
3.0.101-0.30-bigsmp Scaling 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
52 
Java operations, 1/2/4/8/16 sockets:
1-socket / 15-cores: 71,517
2-sockets / 30-cores: 132,996
4-sockets / 60-cores: 260,423
8-sockets / 120-cores: 512,953
16-sockets / 240-cores: 994,771
Scaling is now approximately 1.9x for each doubling of sockets!
3.0.101-0.30-bigsmp: AIM-7 alltests 
53 
~4x performance improvement!
[Chart: AIM-7 alltests workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 custom 
54 
~5x performance improvement!
[Chart: AIM-7 custom workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 high_systime 
55 
~4x performance improvement!
[Chart: AIM-7 high_systime workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 shared 
56 
~3.75x performance improvement!
[Chart: AIM-7 shared workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
Performance Enhancements
SLES 12
IPC Lock with System V Semaphores 
•The global IPC spinlock is held too long in various System V semaphore paths
•When one semaphore array is doing lots of operations, scaling is poor
•Single-semaphore operations now use a finer-grained spinlock as an alternative to the coarse-grained IPC lock (see the sketch below).
•Introduced rcu_read_lock-only methods to obtain IPC objects
•Similar changes were made to System V Shared Memory and Message Queues
58 
% of time spent spinning on the IPC lock (Oracle 11gR2 Data Mining):
400 users: 21.86% before changes → 5.23% after changes
800 users: 84.35% before changes → 7.39% after changes
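A stand-alone sketch of the coarse-vs-fine locking split described above; the structures and names are illustrative (pthread mutexes stand in for the kernel spinlocks), and the real ipc/sem.c code must additionally keep the two locking modes coherent with each other.

#include <pthread.h>

struct sem {
    int value;
    pthread_mutex_t lock;            /* fine-grained, per-semaphore lock */
};

struct sem_array {
    pthread_mutex_t global_lock;     /* coarse lock for whole-array operations */
    int nsems;
    struct sem sems[];
};

/* Simple operation on a single semaphore: take only that semaphore's
 * lock, so operations on different semaphores in the same array can
 * run in parallel. */
static void sem_op_single(struct sem_array *sma, int num, int delta)
{
    struct sem *s = &sma->sems[num];

    pthread_mutex_lock(&s->lock);
    s->value += delta;
    pthread_mutex_unlock(&s->lock);
}

/* A complex operation touching several semaphores still takes the
 * array-wide lock. */
static void sem_op_multi(struct sem_array *sma)
{
    pthread_mutex_lock(&sma->global_lock);
    /* ... perform the multi-semaphore operation atomically ... */
    pthread_mutex_unlock(&sma->global_lock);
}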
SysV Semaphore Performance Improvements 
•Performance measurements were done with an Oracle 11gR2 Data Mining workload 
•System used: 8-sockets, 80-cores, 160-threads 
59 
100 users: 1,257 before changes → 2,805 after changes
400 users: 1,438 before changes → 2,665 after changes
800 users: 1,237 before changes → 2,751 after changes
World Record Performance
World record performance: HP CS900 for SAP HANA 
Key takeaways
•#1 and #2 overall max-jOPS results 
•#1 and #4 overall critical-jOPS results 
•#1 16-socket (16s) results on both max-jOPS and critical-jOPS 
•#1 8-socket (8s) results on max-jOPS 
•16s max-jOPS results 2.1X greater than Fujitsu 16s results 
•8s max-jOPS results are 2.2X greater than Sugon 8s results 
•8s max-jOPS results 1.1X greater than Fujitsu 16s results
•HP CS900 demonstrates excellent scaling from 8s to 16s 
SPECjbb2013 results (critical-jOPS / max-jOPS):
Sugon I980G10, 8-socket, Intel Xeon E7-8890 v2: 23,058 / 214,961
Fujitsu SPARC M10-4S, 16-socket, SPARC64 X: 168,127 / 308,936
Fujitsu SPARC M10-4S, 16-socket, SPARC64 X+: 198,418 / 425,348
HP ConvergedSystem 900 for SAP HANA 8s/6TB, 8-socket, Intel Xeon E7-2890 v2: 126,617 / 474,575
HP ConvergedSystem 900 for SAP HANA 16s/12TB, 16-socket, Intel Xeon E7-2890 v2: 247,581 / 888,164
HP ConvergedSystem 900 for SAP HANA, powered by SLES 11sp3, owns the top two SPECjbb2013 max-jOPS records.
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.
HP and SUSE Break 1-Million max-jOPS!!! 
SPECjbb2013 results (critical-jOPS / max-jOPS):
HP ConvergedSystem 900 for SAP HANA (8s/6TB), 8-socket, Intel Xeon E7-2890 v2: 129,876 / 524,790
HP ConvergedSystem 900 for SAP HANA (16s/12TB), 16-socket, Intel Xeon E7-2890 v2: 244,162 / 1,002,350
HP ConvergedSystem 900 for SAP HANA, powered by SLES 11sp3, holds the #1 16-socket SPECjbb2013 max-jOPS record.
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.
World record performance: HP CS900 for SAP HANA 
•The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system. 
•Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:
−TBD – provide link for new brief (Nov 2014)
−http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en (June 2014)
•Official benchmark results for HP ConvergedSystem 900 for SAP HANA on spec.org:
−TBD – provide link for new result (16s/240c/12TB) (Nov 2014)
−TBD – provide link for new result (8s/120c/6TB) (Nov 2014)
−http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB) (June 2014)
−http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB) (June 2014)
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
Wrap-up
Acknowledgements 
•HP Linux Kernel Performance team contributing this work: 
−Davidlohr Bueso 
−Tuan Bui 
−Waiman Long 
−Jason Low 
−Scott Norton 
−Thavatchai Makphaibulchoke 
−Tom Vaden 
−Aswin Chandramouleeswaran 
65 
•SUSE Linux R&D team contributing to this work: 
−Scott Bahling 
−Matthias Eckermann 
−Mike Galbraith 
−Mel Gorman 
−Jan Kara 
−Jeff Mahoney 
−Ahmad Sadeghpour 
−Miklos Szeredi
Thank you
Back-up Slides
We developed a prototype MCS/Queued lock to see the effect on cache line traffic 
(MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word) 
MCS/Queued Lock Prototype (1-2) 
typedef struct _local_qnode {
    volatile bool waiting;
    volatile struct _local_qnode *volatile next;
} local_qnode;

static inline void
mcsfile_lock_acquire(mcsglobal_qlock *global, local_qnode_ptr me)
{
    local_qnode_ptr pred;

    me->next = NULL;
    pred = xchg(global, me);         /* atomically append ourselves at the queue tail */
    if (pred == NULL)
        return;                      /* queue was empty: lock acquired */
    me->waiting = true;
    pred->next = me;                 /* link behind our predecessor */
    while (me->waiting);             /* spin on our own (local) memory */
}

static inline void
mcsfile_lock_release(mcsglobal_qlock *global, local_qnode_ptr me)
{
    local_qnode_ptr succ;

    if (!(succ = me->next)) {
        /* No visible successor: try to swing the tail back to NULL. */
        if (cmpxchg(global, me, NULL) == me)
            return;
        do {
            succ = me->next;
        } while (!succ);             /* a successor is arriving; wait for the link */
    }
    succ->waiting = false;           /* hand the lock to the next spinner */
}
68
Replacing the files_lock spinlock with the prototype mcsfiles_lock MCS/Queued spinlock 
extern mcsglobal_qlock mcsfiles_lock;
#define file_list_lock(x) mcsfile_lock_acquire(&mcsfiles_lock, &x);
#define file_list_unlock(x) mcsfile_lock_release(&mcsfiles_lock, &x);
void file_kill(struct file *file)
{
    volatile local_qnode lq;

    if (!list_empty(&file->f_u.fu_list)) {
        file_list_lock(lq);
        list_del_init(&file->f_u.fu_list);
        file_list_unlock(lq);
    }
}

void file_move(struct file *file, struct list_head *list)
{
    volatile local_qnode lq;

    if (!list)
        return;
    file_list_lock(lq);
    list_move(&file->f_u.fu_list, list);
    file_list_unlock(lq);
}
MCS/Queued Lock Prototype (2-2) 
69
AIM7 Benchmark Suite 
•Traditional UNIX system-level benchmark (written in C).
•Multiple forks, each of which concurrently executes a common, randomly-ordered set of subtests called jobs.
•Each of the over fifty kinds of jobs exercises a particular facet of system functionality
•Disk IO operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.
•AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and random mixed reads and writes.
•An AIM7 run consists of a series of subruns with the number of tasks, N, being increased after the end of each subrun.
•Each subrun continues until each task completes the common set of jobs. The performance metric, "Jobs completed per minute", is reported for each subrun.
•The result of the entire AIM7 run is a table showing the performance metric versus the number of tasks, N.
•Reference: "Filesystem Performance and Scalability in Linux 2.4.17", 2002.
70
perf-bench futex (1-2) 
•To measure some of the changes done by the futex hashtable patchset, a set of futex microbenchmarks was added to perf-bench:
−perf bench futex [<operation> <all>] 
•Measures latency of different operations: 
−Futex hash 
−Futex wake 
−Futex requeue/wait
perf-bench futex (2-2) 
$ perf bench futex wake
# Running 'futex/wake' benchmark:
Run summary [PID 4028]: blocking on 4 threads (at futex 0x7e20f4), waking up 1 at a time.
[Run 1]: Wokeup 4 of 4 threads in 0.0280 ms
[Run 2]: Wokeup 4 of 4 threads in 0.0880 ms
[Run 3]: Wokeup 4 of 4 threads in 0.0920 ms
…
[Run 9]: Wokeup 4 of 4 threads in 0.0990 ms
[Run 10]: Wokeup 4 of 4 threads in 0.0260 ms
Wokeup 4 of 4 threads in 0.0703 ms (+-14.22%)

$ perf bench futex hash
# Running 'futex/hash' benchmark:
Run summary [PID 4069]: 4 threads, each operating on 1024 futexes for 10 secs.
[thread 0] futexes: 0x1982700 ... 0x19836fc [ 3507916 ops/sec ]
[thread 1] futexes: 0x1983920 ... 0x198491c [ 3651174 ops/sec ]
[thread 2] futexes: 0x1984ab0 ... 0x1985aac [ 3557171 ops/sec ]
[thread 3] futexes: 0x1985c40 ... 0x1986c3c [ 3597926 ops/sec ]
Averaged 3578546 operations/sec (+-0.85%), total secs = 10

Embracing Complexity in Serverless! GOTO Serverless Bengaluru
Ableton Live Suite for MacOS Crack Full Download (Latest 2025)
"Secure File Sharing Solutions on AWS".pptx
Time Tracking Features That Teams and Organizations Actually Need

SLES Performance Enhancements for Large NUMA Systems

  • 7. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention within a Socket: Increasing Core Count
    Performance degrades smoothly as more cores are involved in cache line contention.
    Execution nodes: Node 1 | Memory node: Node 1 | Sockets used: 1-socket
    Cores used | Seconds | Ops per sec per core | % decrease from 2-core | % decrease from previous
    2-cores | 1.704489 | 5,866,861 | 0.0% | 0.0%
    3-cores | 2.783121 | 3,593,088 | 38.8% | 38.8%
    4-cores | 4.012157 | 2,492,425 | 57.5% | 30.6%
    5-cores | 5.506802 | 1,815,936 | 69.0% | 27.1%
    6-cores | 7.110453 | 1,406,380 | 76.0% | 22.6%
    7-cores | 7.834159 | 1,276,461 | 78.2% | 9.2%
    8-cores | 10.054136 | 994,616 | 83.0% | 22.1%
    9-cores | 11.185041 | 894,051 | 84.8% | 10.1%
    10-cores | 13.508867 | 740,255 | 87.4% | 17.2%
    11-cores | 14.839633 | 673,871 | 88.5% | 9.0%
    12-cores | 16.490477 | 606,411 | 89.7% | 10.0%
    13-cores | 19.138960 | 522,494 | 91.1% | 13.8%
    14-cores | 20.704514 | 482,986 | 91.8% | 7.6%
  • 8. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets
    •All 30 cores in 2 nodes/sockets participate in the cache line contention:
    Execution nodes | Memory node | Sockets used | Cores used | Seconds | Ops per sec per core | % decrease from 1-socket
    Node 0 | Node 1 | 1-socket | 15-cores | 2.107396 | 474,519 | 0.0%
    Nodes 0-1 | Node 1 | 2-socket | 30-cores | 14.450938 | 69,200 | 85.4%
    Nodes 1-2 | Node 1 | 2-socket | 30-cores | 14.897306 | 67,126 | 86.0%
    Nodes 2-3 | Node 1 | 2-socket | 30-cores | 21.742537 | 45,993 | 90.4%
    •There are two interesting points here:
    1) There is a huge drop in performance when going from 15-cores on 1-socket to 30-cores on 2-sockets
    2) There is a smaller drop in performance when the lock’s memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)
  • 9. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets: Increasing Core Count
    •We can see that the huge drop in performance occurs once we add a single core from the second socket.
    •This is due to the need to go through QPI to handle the cache-to-cache traffic to resolve the cache line contention.
    •This is a significant drop in performance when going through QPI.
    •Add one core at a time, filling node/socket-0 first, then filling node/socket-1:
    Execution nodes | Memory node | Sockets used | Cores used | Seconds | Ops per sec per core | % decrease from 2-core | % decrease from previous
    Node 0 | Node 1 | 1-socket | 13-cores | 1.649242 | 606,339 | 92.7% | 9.1%
    Node 0 | Node 1 | 1-socket | 14-cores | 1.905878 | 524,693 | 93.7% | 13.5%
    Node 0 | Node 1 | 1-socket | 15-cores | 1.649242 | 482,435 | 94.2% | 8.1%
    Nodes 0-1 | Node 1 | 2-sockets | 16-cores | 1.905878 | 129,309 | 98.4% | 73.2%
    Nodes 0-1 | Node 1 | 2-sockets | 17-cores | 8.348480 | 119,782 | 98.6% | 7.4%
    Nodes 0-1 | Node 1 | 2-sockets | 18-cores | 8.264046 | 121,006 | 98.5% | -1.0%
    Nodes 0-1 | Node 1 | 2-sockets | 30-cores | 15.146260 | 66,023 | 99.2% | 8.5%
  • 10. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 2, 8, and 14 cores
    •Cores spread among 2-sockets (round-robin) vs. all cores in one socket (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 2-cores | 2-cores | 0.120395 | 8,305,993
    Nodes 0-1 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.314462 | 3,180,034
    Nodes 1-2 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.305783 | 3,270,293
    Nodes 2-3 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.453627 | 2,204,454
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.018527 | 981,810
    Nodes 0-1 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.351590 | 298,366
    Nodes 1-2 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.390266 | 294,962
    Nodes 2-3 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 5.354243 | 186,768
    Node 1 | Node 1 | 1-socket FF | 14-cores | 14-cores | 2.067889 | 483,585
    Nodes 0-1 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.214167 | 160,923
    Nodes 1-2 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.275140 | 159,359
    Nodes 2-3 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 9.471300 | 105,582
    1) NUMA effect is visible when memory is remote
    2) Best performance when all cores are in one socket
  • 11. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 4-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 4, 8, and 12 cores
    •Cores spread (round-robin) among 4-sockets vs. all cores in one socket (fill-first)
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
    Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 0.941517 | 1,062,116
    Nodes 0-3 | Node 1 | 4-sockets RR | 2-cores | 8-cores | 5.421381 | 184,455
    Node 1 | Node 1 | 1-socket FF | 12-cores | 12-cores | 1.794806 | 557,163
    Nodes 0-3 | Node 1 | 4-sockets RR | 3-cores | 12-cores | 8.937035 | 111,894
    Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 49.786041 | 20,086
    •Cache line contention is clearly better when all the contention is contained within a single socket.
    •For the same core count, performance degrades as more sockets are involved in the contention
  • 12. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 8-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 8, 16 and 24 cores
    •Cores spread (round-robin) among 8-sockets vs. all cores in two sockets (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
    Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
    Nodes 0-1 | Node 1 | 2-sockets FF | 16-cores | 16-cores | 8.886286 | 112,533
    Nodes 0-7 | Node 1 | 8-sockets RR | 2-cores | 16-cores | 22.296164 | 44,851
    Nodes 0-1 | Node 1 | 2-sockets FF | 24-cores | 24-cores | 12.991910 | 76,626
    Nodes 0-7 | Node 1 | 8-sockets RR | 3-cores | 24-cores | 36.197777 | 27,626
    Nodes 0-7 | Node 1 | 8-sockets FF | 15-cores | 120-cores | 172.782623 | 5,788
    •Cache line contention is clearly better when all the contention is contained within as few sockets as possible.
  • 13. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 16-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 16, 32 and 64 cores
    •Cores spread (round-robin) among 16-sockets vs. all cores in 1/2/4 sockets (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
    Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
    Nodes 0-1 | Node 1 | 2-sockets FF | 15-cores | 30-cores | 15.706788 | 63,667
    Nodes 0-15 | Node 1 | 16-sockets RR | 2-cores | 32-cores | 53.217117 | 18,791
    Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 74.909485 | 13,349
    Nodes 0-15 | Node 1 | 16-sockets RR | 4-cores | 64-cores | 109.447632 | 9,137
    Nodes 0-15 | Node 1 | 16-sockets RR | 15-cores | 240-cores | 410.881287 | 2,434
    •Cache line contention is clearly better when all the contention is contained within as few sockets as possible.
  • 14. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Inter- vs Intra-Cache Line Contention Probability
    •On a 4-socket/60-core system you have a 25% chance that any two random cores participating in the same cache line contention are on the same socket
    •On an 8-socket/120-core system this is reduced to a 12.5% chance
    •With a 16-socket/240-core system you have only a 6.25% chance
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
    Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
    Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
    Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
    Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
  • 15. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Why do we care so much?
    •Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU.
    •However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.
    •Going from 40 or 80 cores to 240 cores (16 sockets) is a major jump.
    •Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel.
    •As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system –this is a major concern when dealing with 8- and 16-socket platforms.
    •This has led us to pursue minimizing cache line contention within Linux kernel locking primitives.
  • 16. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Proof of Concept
  • 17. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Background
    The AIM7 fserver workload* scales poorly on an 8-socket/80-core NUMA platform with a 2.6-based kernel.
    * The workload was run with ramfs.
  • 18. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Analysis (1-2)
    From the perf -g output, we find most of the CPU cycles are spent in file_move() and file_kill().
    40 users (4,000 jobs):
      + 9.40% reaim reaim [.] add_int
      + 6.07% reaim libc-2.12.so [.] strncat
      …
      - 1.68% reaim [kernel.kallsyms] [k] _spin_lock
        - _spin_lock
          + 50.36% lookup_mnt
          + 7.45% __d_lookup
          + 6.71% file_move
          + 5.16% file_kill
          + 2.46% handle_pte_fault
    Proportion of file_move() = 1.68% * 6.71% = 0.11%
    Proportion of file_kill() = 1.68% * 5.16% = 0.09%
    Proportion of file_move() + file_kill() = 0.20%
    400 users (40,000 jobs):
      - 79.53% reaim [kernel.kallsyms] [k] _spin_lock
        - _spin_lock
          + 34.28% file_move
          + 34.20% file_kill
          + 19.94% lookup_mnt
      + 8.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
      + 0.86% reaim [kernel.kallsyms] [k] _spin_lock_irqsave
      + 0.63% reaim reaim [.] add_long
    Proportion of file_move() = 79.53% * 34.28% = 27.26%
    Proportion of file_kill() = 79.53% * 34.20% = 27.20%
    Proportion of file_move() + file_kill() = 54.46%
    This is significant spinlock contention!
  • 19. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Analysis (2-2)
    We use the ORC tool to monitor the coherency controller results (ORC is a platform-dependent tool from HP that reads performance counters in the XNC node controllers).
    Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
    Socket | Agent | 10 users | 40 users | 400 users
    0 | 0 | 17,341 | 36,782 | 399,670,585
    0 | 8 | 36,905 | 45,116 | 294,481,463
    1 | 0 | 0 | 0 | 49,639
    1 | 8 | 0 | 0 | 25,720
    2 | 0 | 0 | 0 | 1,889
    2 | 8 | 0 | 0 | 1,914
    3 | 0 | 0 | 0 | 3,020
    3 | 8 | 0 | 0 | 3,025
    4 | 1 | 45 | 122 | 1,237,589
    4 | 9 | 0 | 110 | 1,224,815
    5 | 1 | 0 | 0 | 26,922
    5 | 9 | 0 | 0 | 26,914
    6 | 1 | 0 | 0 | 2,753
    6 | 9 | 0 | 0 | 2,854
    7 | 1 | 0 | 0 | 6,971
    7 | 9 | 0 | 0 | 6,897
    PRETRY indicates the associated read needs to be re-issued. We can see that when users increase, PRETRY on socket 0 increases rapidly. There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock.
    PRETRY number on socket 0: 400 users = 400M + 294M = 694M
  • 20. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Removing Cache Line Contention
    •Code snippet from the 2.6 based kernel for file_move() and file_kill():
      extern spinlock_t files_lock;
      #define file_list_lock() spin_lock(&files_lock);
      #define file_list_unlock() spin_unlock(&files_lock);

      void file_move(struct file *file, struct list_head *list)
      {
          if (!list)
              return;
          file_list_lock();
          list_move(&file->f_u.fu_list, list);
          file_list_unlock();
      }

      void file_kill(struct file *file)
      {
          if (!list_empty(&file->f_u.fu_list)) {
              file_list_lock();
              list_del_init(&file->f_u.fu_list);
              file_list_unlock();
          }
      }
    •Contention on this global spinlock is the cause of all the cache line contention
    •We developed a prototype MCS/Queued spinlock to see its effect on cache line traffic
    •MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word
    •Implementation is available in the back-up slides
    •No efforts were made to make this a finer grained lock
  • 21. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Benchmark Results
    Comparing the performance of the new kernel (blue line) vs. the original kernel (red line):
    2.4x improvement in throughput with the MCS/Queued spinlock prototype!
  • 22. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Analysis (1-2)
    •The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)
    •The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.
    perf -g output of the kernel with the MCS/Queued spinlock prototype, 400 users (40,000 jobs):
      44.71% reaim [kernel.kallsyms] [k] _spin_lock
        - 60.94% -- lookup_mnt
      …
      22.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
        - 96.16% -- __mutex_lock_slowpath
      …
      1.19% reaim [kernel.kallsyms] [k] file_kill
      1.19% reaim [kernel.kallsyms] [k] file_move
    Proportion of lookup_mnt() = 27.2%
    Proportion of __mutex_lock_slowpath() = 21.3%
    Proportion of file_move() + file_kill() = 2.38%
  • 23. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Analysis (2-2)
    Coherency controller results of the kernel with the MCS/Queued spinlock
    Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
    Socket | Agent | 10 users | 40 users | 400 users
    0 | 0 | 18,216 | 24,560 | 83,720,570
    0 | 8 | 37,307 | 42,307 | 43,151,386
    1 | 0 | 0 | 0 | 0
    1 | 8 | 0 | 0 | 0
    2 | 0 | 0 | 0 | 0
    2 | 8 | 0 | 0 | 0
    3 | 0 | 0 | 0 | 0
    3 | 8 | 0 | 0 | 0
    4 | 1 | 52 | 222 | 16,786
    4 | 9 | 28 | 219 | 10,068
    5 | 1 | 0 | 0 | 0
    5 | 9 | 0 | 0 | 0
    6 | 1 | 0 | 0 | 0
    6 | 9 | 0 | 0 | 0
    7 | 1 | 0 | 0 | 0
    7 | 9 | 0 | 0 | 0
    We can see that as users increase, PRETRY on socket 0 also increases –but it is significantly lower than the kernel without the MCS/Queued lock.
    The PRETRY number for socket 0: 400 users = 84M + 43M = 127M. This value is about 1/5 of the original kernel (694M).
    This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock.
  • 24. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Proof of Concept – Conclusions
    •The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm.
    •The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention.
    •However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic.
    •System time spent spinning on the lock dropped from 54% to 2%.
    •Lock algorithms can be a huge factor in the performance of large-scale systems
    •The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system
    •This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements.
  • 25. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Methodology
  • 26. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Benchmarks and Tools Used •SPECjbb2005 / SPECjbb2013 •AIM7 •Swingbench OLTP •Swingbench DSS •Ebizzy •Perf-bench •Kernbench •Hackbench •Futextest •IOzone •fio •SLOB (Silly Little Oracle Benchmark) •Customer benchmarks and apps 26 •perf •ORC/ORC2 (HP internal tool) •LiKI (HP internal tool) •lockstat •numatop •slabtop •ftrace, strace •top, sar, vmstat, powertop SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
  • 27. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Development Process
    •Identify areas to improve performance and reduce cache line contention
    •Determine if an upstream patch already fixes the problem.
    •If no upstream patches exist:
    •Develop fix
    •Validate performance improvement
    •Submit the patches upstream and work through the upstream process to get them accepted
    •Depending on upstream feedback this may be an iterative process until the changes are accepted
    •Back-port patches to the SLES kernels
    •Validate performance improvement
    •Submit back-ported patches along with performance justification to SUSE for inclusion in SLES
    •SUSE delivers a PTF kernel to HP containing the performance enhancements
    •HP performs a final performance validation of the PTF kernels, sometimes with a customer
    •Changes put into the next SLES Maintenance Update for all SUSE customers to use.
  • 28. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SUSE / HP Collaboration
    •Bugzillas were filed for each set of patches
    •Technical discussions with SUSE Performance engineers
    •Discussions on the design of the patch, the back-port of the patch, the performance validation data, kABI impacts, etc.
    •Engaged SUSE performance engineers in the analysis of the problem as needed
    •Sometimes restructured the back-ported patch to work better with SLES 11sp3 after consultation with SUSE
    •Some patches had initial review and feedback from SUSE performance engineers prior to the upstream submittal
    •The recognition of customer benefit and the flexibility of both companies really made all this possible.
  • 29. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.15-default
  • 30. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Idle Balance
    •One of the problems with significant lock contention on blocking locks (such as a mutex) is that as more processes block on the mutex there is less to run –this causes the idle balancer to take processes from a different CPU’s run queue. This in turn causes even further cache issues.
    •Ensure that we don’t attempt an idle balance operation when it takes longer to do the balancing than the time the CPU would be idle
    •We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU (see the sketch below)
    •Max-time-to-balance is decayed at a rate of about 1% per second
    •Improve the accuracy of the average CPU idle duration.
    •Previously the average CPU idle duration was over-estimated, resulting in too much idle balancing
    Java operations with 16-sockets / 240-cores / 480-threads: 18,600 before changes → 93,609 after changes
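    The cutoff logic described above can be sketched in a few lines of standalone C. This is not the scheduler code: the names (sched_domain_sim, max_balance_cost_ns, avg_idle_ns) and the exact arithmetic are assumptions made for illustration, following the description on the slide (skip newidle balancing when the worst observed balancing cost exceeds the CPU's expected idle time, and decay that worst cost by roughly 1% per second).

      /* Illustrative userspace sketch of the idle-balance cutoff; not kernel code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct sched_domain_sim {
          uint64_t max_balance_cost_ns;   /* worst observed cost of balancing this domain */
          uint64_t last_decay_ns;         /* timestamp of the last ~1%/sec decay          */
      };

      /* Skip newidle balancing when it is expected to cost more than the time
       * this CPU is expected to stay idle. */
      static bool should_idle_balance(const struct sched_domain_sim *sd, uint64_t avg_idle_ns)
      {
          return avg_idle_ns > sd->max_balance_cost_ns;
      }

      /* After a balance attempt: remember the worst cost, and decay it by ~1%
       * per elapsed second so a one-off spike does not disable balancing forever. */
      static void update_balance_cost(struct sched_domain_sim *sd, uint64_t cost_ns, uint64_t now_ns)
      {
          while (now_ns - sd->last_decay_ns > 1000000000ULL) {
              sd->max_balance_cost_ns -= sd->max_balance_cost_ns / 100;
              sd->last_decay_ns += 1000000000ULL;
          }
          if (cost_ns > sd->max_balance_cost_ns)
              sd->max_balance_cost_ns = cost_ns;
      }

      int main(void)
      {
          struct sched_domain_sim sd = { .max_balance_cost_ns = 0, .last_decay_ns = 0 };

          update_balance_cost(&sd, 150000, 0);   /* a 150us balance was observed */
          printf("avg_idle=100us -> balance? %d\n", should_idle_balance(&sd, 100000));
          printf("avg_idle=500us -> balance? %d\n", should_idle_balance(&sd, 500000));
          return 0;
      }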
  • 31. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex Atomics Reduction
    •Reduce the number of atomic operations when trying to acquire a mutex
    −Entry into the mutex lock slow path will cause 2 atomic xchg instructions to be issued
    −With the atomic decrement in the fast path, a total of 3 atomic read-modify-write instructions are issued in rapid succession
    −Causes lots of cache bouncing when many tasks are trying to acquire the mutex simultaneously
    −By using atomic_read() to check the value before calling atomic_xchg() or atomic_cmpxchg() we avoid unnecessary cache coherency traffic
    •Also known as compare-compare-and-swap (see the sketch below).
    −For example:
      if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1))
    Java operations with 16-sockets / 240-cores / 480-threads: 93,609 before changes → 137,370 after changes
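    A minimal userspace illustration of the compare-compare-and-swap idea using C11 atomics. toy_mutex_t, try_lock_cas() and try_lock_ccas() are invented names for this sketch; the kernel code differs, but the pattern of reading the value before attempting the atomic exchange is the same.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* count == 1 means unlocked, 0 means locked (mirrors the mutex fast path). */
      typedef struct { atomic_int count; } toy_mutex_t;

      /* Plain CAS attempt: always a read-modify-write, which pulls the cache line
       * into exclusive state even when the lock is visibly held by someone else. */
      static bool try_lock_cas(toy_mutex_t *m)
      {
          int expected = 1;
          return atomic_compare_exchange_strong(&m->count, &expected, 0);
      }

      /* Compare-compare-and-swap: a cheap read filters out hopeless attempts, so
       * only likely winners generate exclusive-state cache coherency traffic. */
      static bool try_lock_ccas(toy_mutex_t *m)
      {
          if (atomic_load_explicit(&m->count, memory_order_relaxed) != 1)
              return false;                 /* looks taken: don't dirty the line */
          int expected = 1;
          return atomic_compare_exchange_strong(&m->count, &expected, 0);
      }

      int main(void)
      {
          toy_mutex_t m = { .count = 1 };
          printf("ccas acquire: %d\n", try_lock_ccas(&m));   /* 1: acquired       */
          printf("cas  acquire: %d\n", try_lock_cas(&m));    /* 0: already taken  */
          return 0;
      }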
  • 32. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. XFS Small Block Size Performance
    •A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications (the write pattern is illustrated below).
    •Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4
    •The primary patch introduces per-filesystem I/O completion workqueues (as opposed to global workqueues)
    •Allows concurrency on the workqueues –blocking on one inode does not block others on a different inode.
    •These patches were back-ported to SLES 11sp3 (and by default now part of 11sp4 and 12)
    •Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s
    •Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s
    •Also improves 16k over-write performance as well as 4k initial-write and over-write performance.
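    The slow case is easy to reproduce from userspace with nothing more than POSIX calls; the sketch below issues synchronous 16k initial writes in a loop. The file name, total size and loop count are arbitrary choices for illustration, not the customer benchmark.

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
          const size_t bs = 16 * 1024;
          char *buf = malloc(bs);
          if (!buf)
              return 1;
          memset(buf, 'x', bs);

          /* O_SYNC makes every write wait for I/O completion, which is where
           * the per-filesystem completion workqueues matter. */
          int fd = open("testfile.dat", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }
          for (int i = 0; i < 1024; i++) {      /* 16 MB of 16k initial writes */
              if (write(fd, buf, bs) != (ssize_t)bs) {
                  perror("write");
                  break;
              }
          }
          close(fd);
          free(buf);
          return 0;
      }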
  • 33. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Futex Hash Size and Wait Queues
    •Hash Bucket Size
    −Originally 256 hash buckets for the whole system
    •256 * nCPUs cache-line-aligned hash buckets
    •Fewer collisions and more spinlocks leading to more parallel futex call handling.
    −Distribute the table among NUMA nodes instead of a single one.
    −The perfect hash size will of course have a one-to-one hash-bucket:futex ratio.
    •Lockless Wait Queue Size
    −A common misuse of futexes is to make FUTEX_WAKE calls when there are no waiters.
    −In FUTEX_WAKE, there’s no reason to take the hb->lock if we already know the list is empty and thus there is no one to wake up.
    −Use an independent atomic counter to keep track of the list size (see the sketch below).
    −This can drastically reduce contention on the hash bucket spinlock.
    % system time spinning on hb->lock for a large database: 43.71% before changes → 0.10% after changes
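    A rough sketch of the lockless wait-queue check, in userspace C for illustration only (struct toy_bucket and the function names are made up, and a pthread mutex stands in for the hash-bucket spinlock): waiters advertise themselves through an atomic counter, so the wake path can bail out without ever touching the lock when the counter reads zero.

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      /* Toy hash bucket: a lock plus an independent waiter counter that can be
       * read without taking the lock. */
      struct toy_bucket {
          pthread_mutex_t lock;
          atomic_int      waiters;
      };

      static void toy_wait_enqueue(struct toy_bucket *hb)
      {
          atomic_fetch_add(&hb->waiters, 1);      /* published before we block */
          pthread_mutex_lock(&hb->lock);
          /* ... add ourselves to the wait list, drop the lock, and block ... */
          pthread_mutex_unlock(&hb->lock);
      }

      static void toy_wait_dequeue(struct toy_bucket *hb)
      {
          atomic_fetch_sub(&hb->waiters, 1);
      }

      /* FUTEX_WAKE-style fast path: if nobody is queued, return without ever
       * touching hb->lock, so the no-waiter misuse generates no contention. */
      static int toy_wake(struct toy_bucket *hb)
      {
          if (atomic_load(&hb->waiters) == 0)
              return 0;
          pthread_mutex_lock(&hb->lock);
          /* ... walk the wait list and wake matching waiters ... */
          pthread_mutex_unlock(&hb->lock);
          return 1;
      }

      int main(void)
      {
          struct toy_bucket hb = { PTHREAD_MUTEX_INITIALIZER, 0 };
          printf("wake with no waiters: %d\n", toy_wake(&hb));  /* 0, lock never taken */
          toy_wait_enqueue(&hb);
          printf("wake with a waiter:   %d\n", toy_wake(&hb));  /* 1 */
          toy_wait_dequeue(&hb);
          return 0;
      }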
  • 34. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Futex: Larger NUMA-aware Hash Table
    (Chart: Futex Hash Table Scaling – operations per second at 512/256/128/80/64/32 threads for four variants: baseline, aligned buckets, large NUMA-aware table, and aligned buckets plus large NUMA-aware table.)
  • 35. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.15-default: Perf Improvements
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
  • 36. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.30-default
  • 37. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex Optimizations
    •Unlock a mutex without acquiring the wait lock
    −Workloads with high amounts of mutex contention would spend significant time spinning on the mutex’s internal waiter lock, which then delays the mutex from getting unlocked.
    −Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters.
    −Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.
    •Mutex slowpath optimizations
    −When a lock can’t be acquired and a thread enters the mutex slowpath, it puts itself on the wait list and tries one last time to acquire the mutex.
    −Changed the order and attempted the acquisition of the mutex first
    −If acquired, we do not have to remove the thread from the waiter list
    Java operations with 16-sockets / 240-cores / 480-threads: 109,933 before changes → 156,912 after changes
  • 38. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Lockref
    •Reference counts are normally used to track the lifecycle of data structures.
    −A reference count of zero means the structure is unused and is free to be released
    −A positive reference count indicates how many tasks are actively referencing the structure
    −When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended.
    •The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock-protected reference count.
    −Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be updated atomically while no one is holding the lock (see the sketch below).
    •The VFS layer makes heavy use of reference counts for dentry operations.
    −Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock contention on the dentry reference count update.
    −The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock.
    % time spinning on dentry lock (AIM-7 short workload): 83.74% before changes → 0.01% after changes
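    A stripped-down sketch of the lockref idea using a C11 64-bit atomic: the low 32 bits stand in for the spinlock and the high 32 bits for the reference count, and the count is bumped with a single cmpxchg only while the lock half reads unlocked. The type and function names are invented for this sketch; the kernel's lockref also handles the contended fallback and architecture details that are omitted here.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Toy lockref: low 32 bits = spinlock state (0 = unlocked), high 32 bits =
       * reference count, packed so both are covered by one 64-bit cmpxchg. */
      typedef struct { _Atomic uint64_t lock_count; } toy_lockref;

      static bool toy_lockref_get(toy_lockref *lr)
      {
          uint64_t old = atomic_load(&lr->lock_count);
          for (int retry = 0; retry < 100; retry++) {
              if ((uint32_t)old != 0)              /* someone holds the spinlock:   */
                  return false;                    /* caller falls back to locking  */
              uint64_t new = old + (1ULL << 32);   /* bump the count, lock stays 0  */
              if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
                  return true;                     /* count updated without the lock */
          }
          return false;
      }

      int main(void)
      {
          toy_lockref lr = { .lock_count = 0 };
          toy_lockref_get(&lr);
          toy_lockref_get(&lr);
          printf("refcount = %llu\n",
                 (unsigned long long)(atomic_load(&lr.lock_count) >> 32));
          return 0;
      }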
  • 39. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Lockref and Dentry Benchmark Results
    ~7x improvement in throughput (JPM)!
    (Chart: AIM7 short workload on 8-socket / 80-core / 1-TB – jobs per minute vs. simulated users (10 to 2000), pre-lockref changes vs. post-lockref changes.)
  • 40. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SELinux Bit-Map Management
    •Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.
    •Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance-optimized version.
    −More efficient bit-map management (see the sketch below)
    •The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload
    Change in system time for the mls_level_isvalid() code path: 8.95% before changes → 0.12% after changes
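    The flavor of that bit-map optimization can be shown with a standalone example: testing whether one bitmap is contained in another bit by bit versus a whole 64-bit word at a time. This is only an illustration of the idea, not the SELinux ebitmap code, and the map size here is arbitrary.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define MAP_WORDS 4   /* 256 bits, size chosen for illustration */

      /* Per-bit containment test, similar in spirit to the slow path. */
      static bool contains_bit_by_bit(const uint64_t *super, const uint64_t *sub)
      {
          for (int bit = 0; bit < MAP_WORDS * 64; bit++) {
              bool lo = (sub[bit / 64]   >> (bit % 64)) & 1;
              bool hi = (super[bit / 64] >> (bit % 64)) & 1;
              if (lo && !hi)
                  return false;
          }
          return true;
      }

      /* Word-at-a-time version: sub is contained in super iff (sub & ~super)
       * has no bits set in any word. */
      static bool contains_wordwise(const uint64_t *super, const uint64_t *sub)
      {
          for (int w = 0; w < MAP_WORDS; w++)
              if (sub[w] & ~super[w])
                  return false;
          return true;
      }

      int main(void)
      {
          uint64_t super[MAP_WORDS] = { ~0ULL, 0xff, 0, 1 };
          uint64_t sub[MAP_WORDS]   = { 0xf0f0, 0x0f, 0, 1 };
          printf("%d %d\n", contains_bit_by_bit(super, sub), contains_wordwise(super, sub));
          return 0;
      }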
  • 41. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Hugetlb: Parallel Page Faults
    •The kernel originally serialized hugetlb page faults, handling a single fault at a time.
    −Workloads with large working sets backed by hugepages (i.e.: databases or KVM guests) can especially suffer from painful startup times due to this.
    −Protection from spurious OOM errors under conditions of low availability of free hugepages.
    −This problem is specific to hugepages because it is normal to want to use every single hugepage in the system –with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.
    •Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized (see the sketch below).
    −The hash key is selected depending on the mapping type.
    −Because the size of the table is static, this can, in theory, still produce contention if reserving enough hugepages. But reality indicates that this is purely theoretical.
    Startup time (seconds) of a 10-GB Oracle DB (Data Mining): 37.5 before changes → 25.7 after changes
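    A userspace sketch of the table-of-mutexes idea (the table size, the hash, and names such as fault_mutex_for() are assumptions made for illustration): faults on different huge pages hash to different mutexes and proceed in parallel, while two faults on the same page still serialize.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      #define FAULT_MUTEX_TABLE_SIZE 64    /* illustrative; the real table is sized at boot */

      static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE];

      /* Pick a mutex based on which mapping and which huge page is faulting. */
      static pthread_mutex_t *fault_mutex_for(const void *mapping, uint64_t page_index)
      {
          uint64_t key = (uint64_t)(uintptr_t)mapping ^ (page_index * 0x9e3779b97f4a7c15ULL);
          return &fault_mutex_table[key % FAULT_MUTEX_TABLE_SIZE];
      }

      int main(void)
      {
          for (int i = 0; i < FAULT_MUTEX_TABLE_SIZE; i++)
              pthread_mutex_init(&fault_mutex_table[i], NULL);

          int dummy_mapping;                /* stands in for a struct address_space */
          pthread_mutex_t *m = fault_mutex_for(&dummy_mapping, 42);
          pthread_mutex_lock(m);
          /* ... allocate/instantiate the huge page for (mapping, index 42) ... */
          pthread_mutex_unlock(m);
          printf("used slot %ld\n", (long)(m - fault_mutex_table));
          return 0;
      }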
  • 42. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. epoll_ctl() Scaling
    •Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance (the userspace side of these operations is shown below).
    •Further investigation identified a set of patches already developed for the upstream Linux kernel:
    •Don’t take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it’s not needed)
    •Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals
    •RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates
    Java operations with 16-sockets / 240-cores / 480-threads: 87,903 before changes → 165,119 after changes
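    For reference, these are the userspace operations whose kernel path the patches speed up; the example below builds the kind of simple, non-nested topology for which EPOLL_CTL_ADD no longer needs the global epmutex. The eventfd descriptor is just a convenient stand-in for the sockets a real Java workload would register.

      #include <stdio.h>
      #include <sys/epoll.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      int main(void)
      {
          int epfd = epoll_create1(0);
          int evfd = eventfd(0, 0);
          if (epfd < 0 || evfd < 0) {
              perror("setup");
              return 1;
          }

          struct epoll_event ev = { .events = EPOLLIN, .data.fd = evfd };
          if (epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev) < 0)   /* simple topology: no epmutex needed */
              perror("EPOLL_CTL_ADD");

          /* ... the workload adds/removes descriptors like this from many threads ... */

          if (epoll_ctl(epfd, EPOLL_CTL_DEL, evfd, NULL) < 0)  /* DEL path now protected by RCU */
              perror("EPOLL_CTL_DEL");

          close(evfd);
          close(epfd);
          return 0;
      }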
  • 43. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-default: Perf Improvements
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
    3.0.101-0.30-default: 946,552
  • 44. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.30-bigsmp
  • 45. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Why the “bigsmp” kernel?
    •Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)
    •The User Application ABI remains the same –all applications that run on the “default” SLES 11sp3 kernel have full binary and source compatibility with the “bigsmp” SLES 11sp3 kernel.
    •There was a small possibility that this kABI breakage would impact kernel drivers and modules
    •Rather than risk compatibility issues at customer sites, SUSE created the “bigsmp” flavor of the SLES 11sp3 kernel which contains these additional performance enhancements.
    •The bigsmp flavor of SLES 11sp3 has its own kABI
    •Requires a recompile of kernel drivers and modules
    •SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.
    •All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA “default” kernels. Bigsmp will be an on-going flavor for SLES 11sp3 for all platforms.
  • 46. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex MCS Optimizations
    •The MCS lock is a new locking primitive inside Linux
    •Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.
    •Maintains a list of spinning waiters.
    •When a lock is released the unlocker changes the local variable of the next spinner.
    •This change causes the spinner to stop spinning and acquire the lock.
    •Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when simple CAS (compare-and-swap) calls fail.
    •Fair, passing the lock to each locker in the order that the locker arrived.
    •Specialized cancelable MCS locking was applied internally to kernel mutexes
    •The cancelable MCS lock is a specially tailored lock for MCS: when needing to reschedule, we need to abort the spinning in order to block.
    Java operations with 16-sockets / 240-cores / 480-threads: 137,268 before changes → 250,981 after changes
  • 47. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. All Mutex Lock Performance Improvements
    •Performance measurements were done with a popular Java based workload (higher number of operations is better)
    •System used: 16-sockets, 240-cores, 480-threads
    Number of operations:
    Baseline prior to any mutex changes: 72,089
    3 non-MCS mutex changes: 137,268 (90% over baseline)
    All mutex changes: 250,891 (248% over baseline, 83% over non-MCS mutex)
  • 48. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Per-thread VMA caching
    •A process’s address space is divided among VMAs (virtual memory areas) –each storing a range of addresses that share similar properties, such as permissions.
    −A common operation when dealing with memory is locating (find_vma()) a VMA that contains a range of addresses.
    •Traditionally the Linux kernel will cache the last used VMA.
    −Avoids expensive tree lookups (scales poorly in multi-threaded programs).
    −This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).
    •Replace the cache with a small, per-thread hash table (see the sketch below).
    −O(1) lookups/updates, cheap to maintain and small overhead.
    −Improves poor-locality hit rates to ~99.9%.
    −Improves Oracle 11g Data Mining (4k pages) hit rates from 70% to 91%.
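    A rough userspace sketch of the per-thread cache in front of an expensive range lookup. The cache size, the hash, and the names (toy_vma, find_vma_cached) are invented for illustration; the kernel's version fronts the mm rbtree and must also be invalidated when the address space changes, which this sketch ignores.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      struct toy_vma { uintptr_t start, end; };

      #define VMA_CACHE_SIZE 4
      static _Thread_local struct toy_vma *vma_cache[VMA_CACHE_SIZE];

      static unsigned cache_slot(uintptr_t addr)
      {
          return (addr >> 12) & (VMA_CACHE_SIZE - 1);   /* a few page-number bits */
      }

      /* Linear scan stands in for the expensive tree walk. */
      static struct toy_vma *slow_find_vma(struct toy_vma *table, size_t n, uintptr_t addr)
      {
          for (size_t i = 0; i < n; i++)
              if (addr >= table[i].start && addr < table[i].end)
                  return &table[i];
          return NULL;
      }

      static struct toy_vma *find_vma_cached(struct toy_vma *table, size_t n, uintptr_t addr)
      {
          unsigned slot = cache_slot(addr);
          struct toy_vma *vma = vma_cache[slot];
          if (vma && addr >= vma->start && addr < vma->end)
              return vma;                                /* O(1) hit */
          vma = slow_find_vma(table, n, addr);
          if (vma)
              vma_cache[slot] = vma;                     /* remember for next time */
          return vma;
      }

      int main(void)
      {
          struct toy_vma vmas[2] = { { 0x1000, 0x5000 }, { 0x9000, 0xb000 } };
          printf("%p\n", (void *)find_vma_cached(vmas, 2, 0x2000));
          printf("%p\n", (void *)find_vma_cached(vmas, 2, 0x2fff));  /* served from the cache */
          return 0;
      }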
  • 49. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: Perf Improvements
    •Performance measurements were done with a popular Java based workload (higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
    3.0.101-0.30-default: 946,552
    3.0.101-0.30-bigsmp: 994,771
  • 50. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp vs. 3.0.101-0.8-default
  • 51. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.8-default Scaling
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations at 1/2/4/8/16 sockets:
    1-socket / 15-cores: 69,426
    2-sockets / 30-cores: 114,617
    4-sockets / 60-cores: 180,922
    8-sockets / 120-cores: 139,825
    16-sockets / 240-cores: 6,502 (can’t get a valid result at 16-sockets)
    •Scaling issues at 8- and 16-sockets. Other Linux distributions are similar.
  • 52. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp Scaling
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations at 1/2/4/8/16 sockets:
    1-socket / 15-cores: 71,517
    2-sockets / 30-cores: 132,996
    4-sockets / 60-cores: 260,423
    8-sockets / 120-cores: 512,953
    16-sockets / 240-cores: 994,771
    Scaling is now approximately 1.9x for each doubling of sockets!
  • 53. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 alltests
    ~4x performance improvement!
    (Chart: AIM-7 alltests workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 54. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 custom
    ~5x performance improvement!
    (Chart: AIM-7 custom workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 55. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 high_systime
    ~4x performance improvement!
    (Chart: AIM-7 high_systime workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 56. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 shared
    ~3.75x performance improvement!
    (Chart: AIM-7 shared workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 57. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 12
  • 58. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. IPC Lock with System V Semaphores
    •The global IPC spinlock is held too long in various System V Semaphore paths
    •When one semaphore array is doing lots of operations, scaling is poor
    •Single semaphore arrays now use a finer-grained spinlock as an alternative to the coarse-grained IPC lock (see the sketch below).
    •Introduced rcu_read_lock-only methods to obtain IPC objects
    •Similar changes were made to System V Shared Memory and Message Queues
    % time spinning on the IPC lock (Oracle 11gR2 Data Mining): before changes: 21.86% (400 users), 84.35% (800 users); after changes: 5.23% (400 users), 7.39% (800 users)
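    A toy model of the finer-grained semaphore locking (the data structures and names are invented, and the kernel's actual exclusion scheme between the two paths is more subtle): operations touching a single semaphore take only that semaphore's lock, while multi-semaphore operations fall back to the array-wide lock plus the per-semaphore locks they touch.

      #include <pthread.h>
      #include <stdio.h>

      #define NSEMS 64

      /* Toy semaphore array: one coarse array-wide lock plus one lock per semaphore. */
      struct toy_sem       { pthread_mutex_t lock; int value; };
      struct toy_sem_array { pthread_mutex_t array_lock; struct toy_sem sems[NSEMS]; };

      /* Single-semaphore operation: only the per-semaphore lock is taken, so
       * operations on different semaphores in the same array run in parallel. */
      static void simple_op(struct toy_sem_array *sma, int idx, int delta)
      {
          pthread_mutex_lock(&sma->sems[idx].lock);
          sma->sems[idx].value += delta;
          pthread_mutex_unlock(&sma->sems[idx].lock);
      }

      /* Operation spanning two semaphores (assumes a != b): take the coarse lock
       * plus the per-semaphore locks involved, in index order to avoid deadlock. */
      static void complex_op(struct toy_sem_array *sma, int a, int b, int delta)
      {
          int lo = a < b ? a : b, hi = a < b ? b : a;
          pthread_mutex_lock(&sma->array_lock);
          pthread_mutex_lock(&sma->sems[lo].lock);
          pthread_mutex_lock(&sma->sems[hi].lock);
          sma->sems[a].value += delta;
          sma->sems[b].value -= delta;
          pthread_mutex_unlock(&sma->sems[hi].lock);
          pthread_mutex_unlock(&sma->sems[lo].lock);
          pthread_mutex_unlock(&sma->array_lock);
      }

      int main(void)
      {
          static struct toy_sem_array sma;
          pthread_mutex_init(&sma.array_lock, NULL);
          for (int i = 0; i < NSEMS; i++) {
              pthread_mutex_init(&sma.sems[i].lock, NULL);
              sma.sems[i].value = 0;
          }
          simple_op(&sma, 3, 1);
          complex_op(&sma, 3, 7, 2);
          printf("sem[3]=%d sem[7]=%d\n", sma.sems[3].value, sma.sems[7].value);
          return 0;
      }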
  • 59. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SysV Semaphore Performance Improvements
    •Performance measurements were done with an Oracle 11gR2 Data Mining workload
    •System used: 8-sockets, 80-cores, 160-threads
    Before changes: 1,257 (100 users), 1,438 (400 users), 1,237 (800 users)
    After changes: 2,805 (100 users), 2,665 (400 users), 2,751 (800 users)
  • 60. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World Record Performance
  • 61. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World record performance: HP CS900 for SAP HANA
    Key takeaways
    •#1 and #2 overall max-jOPS results
    •#1 and #4 overall critical-jOPS results
    •#1 16-socket (16s) results on both max-jOPS and critical-jOPS
    •#1 8-socket (8s) results on max-jOPS
    •16s max-jOPS results 2.1X greater than Fujitsu 16s results
    •8s max-jOPS results are 2.2X greater than Sugon 8s results
    •8s max-jOPS results 1.1X greater than Fujitsu 16s results
    •HP CS900 demonstrates excellent scaling from 8s to 16s
    HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 owns the top two SPECjbb2013 max-jOPS records (critical-jOPS / max-jOPS):
    Sugon I980G10, 8-socket, Intel Xeon E7-8890 v2: 23,058 / 214,961
    Fujitsu SPARC M10-4S, 16-socket, SPARC64 X: 168,127 / 308,936
    Fujitsu SPARC M10-4S, 16-socket, SPARC64 X+: 198,418 / 425,348
    HP ConvergedSystem 900 for SAP HANA 8s/6TB, 8-socket, Intel Xeon E7-2890 v2: 126,617 / 474,575
    HP ConvergedSystem 900 for SAP HANA 16s/12TB, 16-socket, Intel Xeon E7-2890 v2: 247,581 / 888,164
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.
  • 62. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. HP and SUSE Break 1-Million max-jOPS!!!
    HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 – #1 16-socket SPECjbb2013 max-jOPS record (critical-jOPS / max-jOPS):
    HP ConvergedSystem 900 for SAP HANA (8s/6TB), 8-socket, Intel Xeon E7-2890 v2: 129,876 / 524,790
    HP ConvergedSystem 900 for SAP HANA (16s/12TB), 16-socket, Intel Xeon E7-2890 v2: 244,162 / 1,002,350
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.
  • 63. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World record performance: HP CS900 for SAP HANA
    •The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system.
    •Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:
    TBD – provide link for new brief (Nov 2014)
    http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en (June 2014)
    •Official benchmark results for HP ConvergedSystem 900 for SAP HANA on spec.org:
    TBD – provide link for new result (16s/240c/12TB, Nov 2014)
    TBD – provide link for new result (8s/120c/6TB, Nov 2014)
    http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB, June 2014)
    http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB, June 2014)
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
  • 64. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Wrap-up
  • 65. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Acknowledgements •HP Linux Kernel Performance team contributing this work: −Davidlohr Bueso −Tuan Bui −Waiman Long −Jason Low −Scott Norton −Thavatchai Makphaibulchoke −Tom Vaden −Aswin Chandramouleeswaran 65 •SUSE Linux R&D team contributing to this work: −Scott Bahling −Matthias Eckermann −Mike Galbraith −Mel Gorman −Jan Kara −Jeff Mahoney −Ahmad Sadeghpour −Miklos Szeredi
  • 66. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Thank you
  • 67. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Back-up Slides
  • 68. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. MCS/Queued Lock Prototype (1-2)
    We developed a prototype MCS/Queued lock to see the effect on cache line traffic (MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word)

      typedef struct _local_qnode {
          volatile bool waiting;
          volatile struct _local_qnode *volatile next;
      } local_qnode;

      static inline void
      mcsfile_lock_acquire(mcsglobal_qlock *global, local_qnode_ptr me)
      {
          local_qnode_ptr pred;

          me->next = NULL;
          pred = xchg(global, me);
          if (pred == NULL)
              return;
          me->waiting = true;
          pred->next = me;
          while (me->waiting)
              ;                       /* spin on local mem */
      }

      static inline void
      mcsfile_lock_release(mcsglobal_qlock *global, local_qnode_ptr me)
      {
          local_qnode_ptr succ;

          if (!(succ = me->next)) {
              if (cmpxchg(global, me, NULL) == me)
                  return;
              do {
                  succ = me->next;
              } while (!succ);        /* wait for succ ready */
          }
          succ->waiting = false;
      }
  • 69. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. MCS/Queued Lock Prototype (2-2)
    Replacing the files_lock spinlock with the prototype mcsfiles_lock MCS/Queued spinlock

      extern mcsglobal_qlock mcsfiles_lock;
      #define file_list_lock(x) mcsfile_lock_acquire(&mcsfiles_lock, &x);
      #define file_list_unlock(x) mcsfile_lock_release(&mcsfiles_lock, &x);

      void file_kill(struct file *file)
      {
          volatile local_qnode lq;

          if (!list_empty(&file->f_u.fu_list)) {
              file_list_lock(lq);
              list_del_init(&file->f_u.fu_list);
              file_list_unlock(lq);
          }
      }

      void file_move(struct file *file, struct list_head *list)
      {
          volatile local_qnode lq;

          if (!list)
              return;
          file_list_lock(lq);
          list_move(&file->f_u.fu_list, list);
          file_list_unlock(lq);
      }
  • 70. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. AIM7 Benchmark Suite
    •Traditional UNIX system-level benchmark (written in C).
    •Multiple forks, each of which concurrently executes a common, randomly-ordered set of subtests called jobs.
    •Each of the over fifty kinds of jobs exercises a particular facet of system functionality
    •Disk IO operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.
    •AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and random mixed reads and writes.
    •An AIM7 run consists of a series of subruns with the number of tasks, N, being increased after the end of each subrun.
    •Each subrun continues until each task completes the common set of jobs. The performance metric, "Jobs completed per minute", is reported for each subrun.
    •The result of the entire AIM7 run is a table showing the performance metric versus the number of tasks, N.
    •Reference: “Filesystem Performance and Scalability in Linux 2.4.17”, 2002.
  • 71. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. perf-bench futex (1-2)
    •To measure some of the changes done by the futex hash table patchset, a set of futex microbenchmarks was added to perf-bench:
    −perf bench futex [<operation> <all>]
    •Measures latency of different operations:
    −Futex hash
    −Futex wake
    −Futex requeue/wait
  • 72. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. perf-bench futex (2-2)
      $ perf bench futex wake
      # Running 'futex/wake' benchmark:
      Run summary [PID 4028]: blocking on 4 threads (at futex 0x7e20f4), waking up 1 at a time.
      [Run 1]: Wokeup 4 of 4 threads in 0.0280 ms
      [Run 2]: Wokeup 4 of 4 threads in 0.0880 ms
      [Run 3]: Wokeup 4 of 4 threads in 0.0920 ms
      …
      [Run 9]: Wokeup 4 of 4 threads in 0.0990 ms
      [Run 10]: Wokeup 4 of 4 threads in 0.0260 ms
      Wokeup 4 of 4 threads in 0.0703 ms (+-14.22%)

      $ perf bench futex hash
      # Running 'futex/hash' benchmark:
      Run summary [PID 4069]: 4 threads, each operating on 1024 futexes for 10 secs.
      [thread 0] futexes: 0x1982700 ... 0x19836fc [ 3507916 ops/sec ]
      [thread 1] futexes: 0x1983920 ... 0x198491c [ 3651174 ops/sec ]
      [thread 2] futexes: 0x1984ab0 ... 0x1985aac [ 3557171 ops/sec ]
      [thread 3] futexes: 0x1985c40 ... 0x1986c3c [ 3597926 ops/sec ]
      Averaged 3578546 operations/sec (+-0.85%), total secs = 10