SLES Performance Enhancements for Large NUMA Systems
Scott Norton – HP
Davidlohr Bueso – SUSE Labs (formerly HP)
SUSECon, Orlando, FL
November 2014
© 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Agenda 
1. Why focus on locking?
2. Proof of concept
3. Methodology
4. Performance enhancements for 3.0.101-0.15-default
5. Performance enhancements for 3.0.101-0.30-default
6. Performance enhancements for 3.0.101-0.30-bigsmp
7. 3.0.101-0.30-bigsmp vs. 3.0.101-0.8-default
8. Performance enhancements for SLES 12
9. Wrap-up
2
Why Focus on Locking?
Cache Line Contention in Large NUMA Systems 
•Lock contention (particularly spinning lock contention) is the primary, and probably worst, cause of cache line contention 
•Cache line contention does have a “cost” associated with NUMA systems, but it is not the same “cost” that you experience with local vs. remote memory latency in NUMA systems 
•However, it's not only about lock contention
−Cache line contention can also come from sharing cache lines due to poor data structure layout – two fields in a data structure that are accessed by completely different processes/threads, but end up in the same cache line
−Worst case: an unrelated and frequently accessed field occupies the same cache line as a heavily contended lock (see the false-sharing sketch below)
−Other atomic operations, such as atomic-add, can also generate cache line contention
−Additionally, the processor's cache prefetch mechanism may also cause false cache line contention
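To make the data-layout point concrete, here is a minimal, hypothetical C sketch (not taken from any kernel source): a heavily contended spinlock sharing a cache line with an unrelated hot counter, and a padded variant that gives the lock its own cache line. The structure names and the 64-byte line size are assumptions for illustration.

#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64                     /* assumed cache-line size */

/* Bad layout: the hot statistics counter shares a cache line with the
 * heavily contended lock, so counter updates bounce the lock's line. */
struct conn_bad {
    pthread_spinlock_t lock;              /* heavily contended */
    unsigned long stats_counter;          /* frequently updated, unrelated field */
};

/* Better layout: align both members so the lock and the counter live in
 * separate cache lines and no longer false-share. */
struct conn_good {
    alignas(CACHE_LINE) pthread_spinlock_t lock;
    alignas(CACHE_LINE) unsigned long stats_counter;
};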
4
Questions Driving Investigation 
•How is performance impacted when cache line contention is: 
•contained within a socket? 
•spread among multiple sockets? 
•using <n> cores in a socket vs <n> cores from multiple sockets? 
•using all cores in 4-socket, 8-socket, and 16-socket systems? 
5
Demonstrating Cache Line Contention Effects 
•Test program to show the cost of cache line contention in large NUMA systems: 
•Bind threads (1 per core) to specified cores. Memory is allocated from a specific node. 
•Once the threads are synchronized, perform a tight loop doing spin_lock/spin_unlock 1,000,000 times. This generates an extreme amount of cache line contention. The spinlock implementation was taken from a Linux 3.0 based kernel. 
•Based on the number of threads and the loop iteration count we can calculate the average number of “operations per second per CPU” when <N> CPUs are involved in the cache line contention. 
•This is not a real-world test. While this is a micro-benchmark, it does show the effects of cache line contention so that real code can be written with cache line contention in mind. (A simplified user-space sketch of the test loop follows the list of test systems below.)
•Test systems: 
•4-sockets / 60-cores / 120-threads, Intel Xeon E7-4890 v2, 1 TB
•8-sockets / 120-cores / 240-threads, Intel Xeon E7-2890 v2, 6 TB
•16-sockets / 240-cores / 480-threads, Intel Xeon E7-2890 v2, 12 TB
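The following is a minimal user-space sketch of the kind of test loop described above; it is not the original HP test program. It uses a pthread spinlock instead of the Linux 3.0 ticket-spinlock code and omits NUMA-node memory placement. CPU numbers are passed on the command line, one thread is bound per CPU, and all threads hammer the same lock for 1,000,000 iterations.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000UL

static pthread_spinlock_t lock;
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    cpu_set_t set;
    int cpu = (int)(long)arg;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                     /* bind this thread to one core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    pthread_barrier_wait(&barrier);         /* start all threads together */
    for (unsigned long i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);           /* tight lock/unlock loop: maximum */
        pthread_spin_unlock(&lock);         /* cache line contention */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int ncpus = argc - 1;                   /* CPUs to use, listed as arguments */
    pthread_t *tids = calloc(ncpus, sizeof(*tids));
    struct timespec t0, t1;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_barrier_init(&barrier, NULL, ncpus + 1);

    for (int i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, worker,
                       (void *)(long)atoi(argv[i + 1]));

    pthread_barrier_wait(&barrier);         /* release the workers... */
    clock_gettime(CLOCK_MONOTONIC, &t0);    /* ...and time only the loop */
    for (int i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d CPUs: %.6f s, %.0f ops/sec per CPU\n", ncpus, secs, ITERS / secs);
    return 0;
}

Running it first with CPUs from one socket and then with CPUs spread across sockets reproduces the fill-first vs. round-robin comparison shown on the following slides.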
6
Contention within a Socket: Increasing Core Count 
Performance degrades smoothly as more cores are involved in cache line contention 
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 2-core | % decrease from previous
Node 1 | Node 1 | 1-socket | 2-cores | 1.704489 | 5,866,861 | 0.0% | 0.0%
Node 1 | Node 1 | 1-socket | 3-cores | 2.783121 | 3,593,088 | 38.8% | 38.8%
Node 1 | Node 1 | 1-socket | 4-cores | 4.012157 | 2,492,425 | 57.5% | 30.6%
Node 1 | Node 1 | 1-socket | 5-cores | 5.506802 | 1,815,936 | 69.0% | 27.1%
Node 1 | Node 1 | 1-socket | 6-cores | 7.110453 | 1,406,380 | 76.0% | 22.6%
Node 1 | Node 1 | 1-socket | 7-cores | 7.834159 | 1,276,461 | 78.2% | 9.2%
Node 1 | Node 1 | 1-socket | 8-cores | 10.054136 | 994,616 | 83.0% | 22.1%
Node 1 | Node 1 | 1-socket | 9-cores | 11.185041 | 894,051 | 84.8% | 10.1%
Node 1 | Node 1 | 1-socket | 10-cores | 13.508867 | 740,255 | 87.4% | 17.2%
Node 1 | Node 1 | 1-socket | 11-cores | 14.839633 | 673,871 | 88.5% | 9.0%
Node 1 | Node 1 | 1-socket | 12-cores | 16.490477 | 606,411 | 89.7% | 10.0%
Node 1 | Node 1 | 1-socket | 13-cores | 19.138960 | 522,494 | 91.1% | 13.8%
Node 1 | Node 1 | 1-socket | 14-cores | 20.704514 | 482,986 | 91.8% | 7.6%
7
Contention across 2-Sockets 
•All 30 cores in 2 nodes/sockets participate in the cache line contention: 
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 1-socket
Node 0 | Node 1 | 1-socket | 15-cores | 2.107396 | 474,519 | 0.0%
Nodes 0-1 | Node 1 | 2-socket | 30-cores | 14.450938 | 69,200 | 85.4%
Nodes 1-2 | Node 1 | 2-socket | 30-cores | 14.897306 | 67,126 | 86.0%
Nodes 2-3 | Node 1 | 2-socket | 30-cores | 21.742537 | 45,993 | 90.4%
•There are two interesting points here: 
1) There is a huge drop in performance when going from 15-cores on 1-socket to 30-cores on 2-sockets
2) There is a smaller drop in performance when the lock's memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)
8
Contention across 2-Sockets: Increasing Core Count 
•We can see that the huge drop in performance occurs once we add a single core from the second socket.
•This is due to the need to go through QPI to handle the cache-to-cache traffic to resolve the cache line contention.
•This is a significant drop in performance when going through QPI.
•Add one core at a time, filling node/socket-0 first, then filling node/socket-1:
Execution Nodes | Memory Node | Sockets Used | Cores Used | Seconds | Ops per Sec per Core | % decrease from 2-core | % decrease from previous
Node 0 | Node 1 | 1-socket | 13-cores | 1.649242 | 606,339 | 92.7% | 9.1%
Node 0 | Node 1 | 1-socket | 14-cores | 1.905878 | 524,693 | 93.7% | 13.5%
Node 0 | Node 1 | 1-socket | 15-cores | 1.649242 | 482,435 | 94.2% | 8.1%
Nodes 0-1 | Node 1 | 2-sockets | 16-cores | 1.905878 | 129,309 | 98.4% | 73.2%
Nodes 0-1 | Node 1 | 2-sockets | 17-cores | 8.348480 | 119,782 | 98.6% | 7.4%
Nodes 0-1 | Node 1 | 2-sockets | 18-cores | 8.264046 | 121,006 | 98.5% | -1.0%
Nodes 0-1 | Node 1 | 2-sockets | 30-cores | 15.146260 | 66,023 | 99.2% | 8.5%
9
Contention across 2-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 2, 8, and 14 cores 
•Cores spread among 2-sockets (round-robin) vs. all cores in one socket (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 2-cores | 2-cores | 0.120395 | 8,305,993
Nodes 0-1 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.314462 | 3,180,034
Nodes 1-2 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.305783 | 3,270,293
Nodes 2-3 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.453627 | 2,204,454
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.018527 | 981,810
Nodes 0-1 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.351590 | 298,366
Nodes 1-2 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.390266 | 294,962
Nodes 2-3 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 5.354243 | 186,768
Node 1 | Node 1 | 1-socket FF | 14-cores | 14-cores | 2.067889 | 483,585
Nodes 0-1 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.214167 | 160,923
Nodes 1-2 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.275140 | 159,359
Nodes 2-3 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 9.471300 | 105,582
1) The NUMA effect is visible when the memory is remote
2) Performance is best when all cores are in one socket
10
Contention across 4-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 4, 8, and 12 cores 
•Cores spread (round-robin) among 4-sockets vs. all cores in one socket (fill-first) 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 0.941517 | 1,062,116
Nodes 0-3 | Node 1 | 4-sockets RR | 2-cores | 8-cores | 5.421381 | 184,455
Node 1 | Node 1 | 1-socket FF | 12-cores | 12-cores | 1.794806 | 557,163
Nodes 0-3 | Node 1 | 4-sockets RR | 3-cores | 12-cores | 8.937035 | 111,894
Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 49.786041 | 20,086
•Cache line contention is clearly better when all the contention is contained within a single socket. 
•For the same core count, performance degrades as more sockets are involved in the contention 
11
Contention across 8-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 8, 16 and 24 cores 
•Cores spread (round-robin) among 8-sockets vs. all cores in two sockets (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
Nodes 0-1 | Node 1 | 2-sockets FF | 16-cores | 16-cores | 8.886286 | 112,533
Nodes 0-7 | Node 1 | 8-sockets RR | 2-cores | 16-cores | 22.296164 | 44,851
Nodes 0-1 | Node 1 | 2-sockets FF | 24-cores | 24-cores | 12.991910 | 76,626
Nodes 0-7 | Node 1 | 8-sockets RR | 3-cores | 24-cores | 36.197777 | 27,626
Nodes 0-7 | Node 1 | 8-sockets FF | 15-cores | 120-cores | 172.782623 | 5,788
•Cache line contention is clearly better when all the contention is contained within as few sockets as possible. 
12
Contention across 16-Sockets: Round-Robin vs. Fill-First 
•Contention measured across 16, 32 and 64 cores 
•Cores spread (round-robin) among 16-sockets vs. all cores in 1/2/4 sockets (fill-first): 
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
Nodes 0-1 | Node 1 | 2-sockets FF | 15-cores | 30-cores | 15.706788 | 63,667
Nodes 0-15 | Node 1 | 16-sockets RR | 2-cores | 32-cores | 53.217117 | 18,791
Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 74.909485 | 13,349
Nodes 0-15 | Node 1 | 16-sockets RR | 4-cores | 64-cores | 109.447632 | 9,137
Nodes 0-15 | Node 1 | 16-sockets RR | 15-cores | 240-cores | 410.881287 | 2,434
•Cache line contention is clearly better when all the contention is contained within as few sockets as possible. 
13
Inter- vs. Intra-Cache Line Contention Probability
•On a 4-socket/60-core system you have a 25% chance that any two random cores participating in the same cache line contention are on the same socket
•On an 8-socket/120-core system this is reduced to a 12.5% chance
•With a 16-socket/240-core system you have only a 6.25% chance
Execution Nodes | Memory Node | Sockets Used | Cores per Socket Used | Cores Used | Seconds | Ops per Sec per Core
Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
14
Why do we care so much? 
•Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU. 
•However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.
•Going from 40 or 80 cores to 240 cores (16 sockets) is a major jump.
•Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel. 
•As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system –this is a major concern when dealing with 8-and 16-socket platforms. 
•This has led us to pursue minimizing cache line contention within Linux kernel locking primitives. 
15
Proof of Concept
Background
The AIM7 fserver workload* scales poorly on an 8-socket/80-core NUMA platform with a 2.6-based kernel
* The workload was run with ramfs.
17
From the perf -g output, we find most of the CPU cycles are spent in file_move() and file_kill(). 
40 Users (4000 jobs) 
+ 9.40% reaim reaim [.] add_int 
+ 6.07% reaim libc-2.12.so [.] strncat 
….. 
-1.68% reaim [kernel.kallsyms] [k] _spin_lock 
-_spin_lock 
+ 50.36% lookup_mnt 
+ 7.45% __d_lookup 
+ 6.71% file_move 
+ 5.16% file_kill 
+ 2.46% handle_pte_fault 
Proportion of file_move() = 1.68% * 6.71% = 0.11% 
Proportion of file_kill() = 1.68% * 5.16% = 0.09%
Proportion of file_move() + file_kill() = 0.20%
400 users (40,000 jobs) 
-79.53% reaim [kernel.kallsyms] [k] _spin_lock 
-_spin_lock 
+ 34.28% file_move 
+ 34.20% file_kill 
+ 19.94% lookup_mnt 
+ 8.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 
+ 0.86% reaim [kernel.kallsyms] [k] _spin_lock_irqsave 
+ 0.63% reaim reaim [.] add_long 
Proportion of file_move() = 79.53% * 34.28% = 27.26% 
Proportion of file_kill() = 79.53% * 34.20% = 27.20% 
Proportion of file_move() + file_kill() = 54.46%
Analysis (1-2) 
This is significant spinlock contention! 
18
We use the ORC tool to monitor the coherency controller results 
(ORC is a platform dependent tool from HP that reads performance counters in the XNC node controllers) 
Coherency Controller Transactions Sent to Fabric Link (PRETRY number) 
Socket | Agent | 10 users | 40 users | 400 users
0 0 17,341 36,782 399,670,585 
0 8 36,905 45,116 294,481,463 
1 0 0 0 49,639 
1 8 0 0 25,720 
2 0 0 0 1,889 
2 8 0 0 1,914 
3 0 0 0 3,020 
3 8 0 0 3,025 
4 1 45 122 1,237,589 
4 9 0 110 1,224,815 
5 1 0 0 26,922 
5 9 0 0 26,914 
6 1 0 0 2,753 
6 9 0 0 2,854 
7 1 0 0 6,971 
7 9 0 0 6,897 
PRETRY indicates the associated read needs to be re-issued.
We can see that when users increase, PRETRY on socket 0 increases rapidly. 
There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock. 
PRETRY number on socket 0: 400 users = 400M + 294M = 694M 
Analysis (2-2) 
19
•Code snippet from the 2.6 based kernel for file_move() and file_kill():

extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
#define file_list_unlock() spin_unlock(&files_lock);

void file_move(struct file *file, struct list_head *list)
{
    if (!list)
        return;
    file_list_lock();
    list_move(&file->f_u.fu_list, list);
    file_list_unlock();
}

void file_kill(struct file *file)
{
    if (!list_empty(&file->f_u.fu_list)) {
        file_list_lock();
        list_del_init(&file->f_u.fu_list);
        file_list_unlock();
    }
}
Removing Cache Line Contention 
20 
•Contention on this global spinlock is the cause of all the cache line contention 
•We developed a prototype MCS/Queued spinlock to see its effect on cache line traffic 
•MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word 
•Implementation is available in the back-up slides 
•No efforts were made to make this a finer grained lock
Comparing the performance of the new kernel (blue line) vs. the original kernel (red line) 
Prototype Benchmark Results 
2.4x improvement in throughput with the MCS/Queued spinlock prototype!
21
•The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)
•The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.
400 users (40,000 jobs)
44.71% reaim [kernel.kallsyms] [k] _spin_lock 
-60.94%--lookup_mnt 
…. 
22.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner 
-96.16%--__mutex_lock_slowpath 
…… 
1.19% reaim [kernel.kallsyms] [k] file_kill 
1.19% reaim [kernel.kallsyms] [k] file_move 
Proportion of lookup_mnt() = 27.2%
Proportion of __mutex_lock_slowpath() = 21.3%
Proportion of file_move() + file_kill() = 2.38%
Prototype Analysis (1-2) 
perf -g output of the kernel with the MCS/Queued spinlock prototype:
22
Coherency controller results of the kernel with the MCS/Queued spinlock 
Coherency Controller Transactions Sent to Fabric Link (PRETRY number) 
Socket | Agent | 10 users | 40 users | 400 users
0 0 18,216 24,560 83,720,570 
0 8 37,307 42,307 43,151,386 
1 0 0 0 0 
1 8 0 0 0 
2 0 0 0 0 
2 8 0 0 0 
3 0 0 0 0 
3 8 0 0 0 
4 1 52 222 16,786 
4 9 28 219 10,068 
5 1 0 0 0 
5 9 0 0 0 
6 1 0 0 0 
6 9 0 0 0 
7 1 0 0 0 
7 9 0 0 0 
We can see that as users increase, PRETRY in socket 0 also increases –but it is significantly lower than the kernel without the MCS/Queued lock. 
The PRETRY number for socket 0: 400 users = 84M + 43M = 127M. 
This value is about 1/5 of the original kernel (694M). 
This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock. 
Prototype Analysis (2-2) 
23
•The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm. 
•The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention. 
•However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic. 
•System time spent spinning on the lock dropped from 54% to 2%. 
•Lock algorithms can play a huge factor in the performance of large-scale systems 
•The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system 
•This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements. 
Proof of Concept – Conclusions
24
Methodology
Benchmarks and Tools Used 
•SPECjbb2005 / SPECjbb2013 
•AIM7 
•Swingbench OLTP 
•Swingbench DSS 
•Ebizzy 
•Perf-bench 
•Kernbench 
•Hackbench 
•Futextest 
•IOzone 
•fio 
•SLOB (Silly Little Oracle Benchmark) 
•Customer benchmarks and apps 
26 
•perf 
•ORC/ORC2 (HP internal tool) 
•LiKI (HP internal tool) 
•lockstat 
•numatop 
•slabtop 
•ftrace, strace 
•top, sar, vmstat, powertop 
SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
Development Process 
•Identify areas to improve performance and reduce cache line contention 
•Determine if an upstream patch already fixes the problem. 
•If no upstream patches exist: 
•Develop fix 
•Validate performance improvement 
•Submit the patches upstream and work through the upstream process to get them accepted 
•Depending on upstream feedback this may be an iterative process until the changes are accepted 
•Back-port patches to the SLES kernels 
•Validate performance improvement 
•Submit back-ported patches along with performance justification to SUSE for inclusion in SLES
•SUSE delivers a PTF kernel to HP containing the performance enhancements
•HP performs a final performance validation of the PTF kernels, sometimes with a customer
•Changes are put into the next SLES Maintenance Update for all SUSE customers to use.
27
SUSE / HP Collaboration 
•Bugzillas were filed for each set of patches 
•Technical discussions with SUSE Performance engineers 
•Discussions on the design of the patch, the back-port of the patch, the performance validation data, kABI impacts, etc 
•Engaged SUSE performance engineers in the analysis of the problem as needed 
•Sometimes restructured the back-ported patch to work better with SLES 11sp3 after consultation with SUSE 
•Some patches submitted upstream had initial review and feedback from SUSE performance engineers prior to the upstream submittal
•The recognition of customer benefit and the flexibility of both companies really made all this possible. 
28
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.15-default
Idle Balance 
•One of the problems with significant contention on blocking locks (such as a mutex) is that as more processes block on the mutex there are fewer runnable tasks – this causes the idle balancer to take processes from a different CPU's run queue, which in turn causes even further cache issues.
•Ensure that we don't attempt an idle balance operation when it takes longer to do the balancing than the time the CPU would be idle
•We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU (a simplified stand-alone sketch of this cut-off follows the results below)
•Max-time-to-balance is decayed at a rate of about 1% per second
•Improve the accuracy of the average CPU idle duration.
•Previously the average CPU idle duration was overestimated, resulting in too much idle balancing
30 
Java operations (16-sockets / 240-cores / 480-threads): 18,600 before changes → 93,609 after changes
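As an illustration of the cut-off logic described above, here is a stand-alone, simplified C sketch; it is not the actual scheduler code, and the field names and the roughly-1%-per-second decay are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-scheduler-domain bookkeeping (illustrative only). */
struct domain_balance_stats {
    uint64_t max_balance_cost_ns;   /* worst observed newidle-balance cost */
    uint64_t next_decay_ns;         /* when to next decay that maximum */
};

static bool should_idle_balance(struct domain_balance_stats *d,
                                uint64_t now_ns, uint64_t avg_idle_ns)
{
    /* Decay the recorded maximum by roughly 1% per second so one slow
     * balance does not suppress idle balancing forever. */
    if (now_ns >= d->next_decay_ns) {
        d->max_balance_cost_ns -= d->max_balance_cost_ns / 100;
        d->next_decay_ns = now_ns + 1000000000ULL;   /* +1 second */
    }

    /* Skip the balance when it is expected to take longer than the time
     * this CPU is expected to stay idle. */
    return avg_idle_ns > d->max_balance_cost_ns;
}

static void record_balance_cost(struct domain_balance_stats *d, uint64_t cost_ns)
{
    if (cost_ns > d->max_balance_cost_ns)
        d->max_balance_cost_ns = cost_ns;   /* remember the worst case seen */
}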
Mutex Atomics Reduction 
•Reduce the number of atomic operations when trying to acquire a mutex 
−Entry into the mutex lock slow path will cause 2 atomic _xchg instructions to be issued
−With the atomic decrement in the fast path, a total of 3 atomic read-modify-write instructions are issued in rapid succession
−This causes lots of cache bouncing when many tasks are trying to acquire the mutex simultaneously
−By using atomic_read() to check the value before calling atomic_xchg() or atomic_cmpxchg() we avoid unnecessary cache coherency traffic (a stand-alone sketch follows the results below)
•Also known as compare-compare-and-swap.
−For example:
if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1))
31 
Java operations (16-sockets / 240-cores / 480-threads): 93,609 before changes → 137,370 after changes
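A stand-alone sketch of the compare-before-compare-and-swap pattern above, written with C11 atomics rather than the kernel's atomic_t API; the function and field names are illustrative.

#include <stdatomic.h>
#include <stdbool.h>

/* count semantics (borrowed from the slide): 1 = unlocked, 0 = locked. */
static bool mutex_fast_trylock(atomic_int *count)
{
    /* Plain read first: if the mutex is not free, return without issuing
     * a read-modify-write, so the cache line can stay in shared state. */
    if (atomic_load_explicit(count, memory_order_relaxed) != 1)
        return false;

    /* Only now attempt the exclusive (line-invalidating) update. */
    int expected = 1;
    return atomic_compare_exchange_strong(count, &expected, 0);
}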
XFS Small Block Size Performance 
•A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications.
•Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4
•The primary patch introduces per-filesystem I/O completion workqueues (as opposed to global workqueues)
•Allows concurrency on the workqueues – blocking on one inode does not block others on a different inode.
•These patches were back-ported to SLES 11sp3 (and by default are now part of 11sp4 and 12)
•Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s
•Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s
•Also improves 16k over-write performance as well as 4k initial-write and over-write performance.
32
Futex Hash Size and Wait Queues 
•Hash Bucket Size
−Originally 256 hash buckets for the whole system
−Now 256 * nCPUs cache-line-aligned hash buckets
•Fewer collisions and more spinlocks, leading to more parallel futex call handling.
−Distribute the table among NUMA nodes instead of a single one.
−The perfect hash size would, of course, have a one-to-one hash-bucket:futex ratio.
•Lockless Wait Queue Size
−A common misuse of futexes is to make FUTEX_WAKE calls when there are no waiters.
−In FUTEX_WAKE, there's no reason to take the hb->lock if we already know the list is empty and thus there is no one to wake up.
−Use an independent atomic counter to keep track of the list size (a stand-alone sketch follows the results below).
−This can drastically reduce contention on the hash bucket spinlock.
33 
% of system time spent spinning on hb->lock for a large database: 43.71% before changes → 0.10% after changes
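A stand-alone illustration of the lockless FUTEX_WAKE check described above; this is not the kernel futex code, and the bucket layout, names, and use of a pthread mutex in place of hb->lock are assumptions.

#include <pthread.h>
#include <stdatomic.h>

struct futex_bucket {
    atomic_int waiters;          /* number of tasks queued on this bucket */
    pthread_mutex_t lock;        /* stands in for the kernel's hb->lock */
    /* ... wait list ... */
};

static int bucket_wake(struct futex_bucket *hb, int nr_wake)
{
    /* Common misuse: FUTEX_WAKE with nobody waiting. A plain atomic read
     * lets us bail out without ever touching the contended bucket lock. */
    if (atomic_load_explicit(&hb->waiters, memory_order_acquire) == 0)
        return 0;

    pthread_mutex_lock(&hb->lock);
    /* ... walk the wait list and wake up to nr_wake waiters ... */
    pthread_mutex_unlock(&hb->lock);
    return nr_wake;              /* illustrative: real code returns the number actually woken */
}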
Futex: Larger NUMA-aware Hash Table 
[Chart: Futex Hash Table Scaling – operations per second (0 to 1,400,000) vs. number of threads (32, 64, 80, 128, 256, 512) for four configurations: baseline, aligned buckets, large NUMA-aware table, and aligned buckets plus large NUMA-aware table.]
34
3.0.101-0.15-default: Perf Improvements 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
35 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.30-default
Mutex Optimizations 
•Unlock a mutex without acquiring the wait lock
−Workloads with high amounts of mutex contention would spend significant time spinning on the mutex's internal waiter lock, which then delays the mutex from getting unlocked.
−Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters (see the sketch below).
−Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.
•Mutex slowpath optimizations
−When a lock can't be acquired and a thread enters the mutex slowpath, it puts itself on the wait list and tries one last time to acquire the mutex.
−Changed the order so that acquisition of the mutex is attempted first
−If the mutex is acquired this way, we do not have to remove the thread from the waiter list
37 
Java operations (16-sockets / 240-cores / 480-threads): 109,933 before changes → 156,912 after changes
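A stand-alone sketch of the reordered unlock path described above; the mutex layout and names are illustrative, not the kernel's struct mutex, and a pthread mutex stands in for the internal waiter lock.

#include <pthread.h>
#include <stdatomic.h>

struct my_mutex {
    atomic_int count;            /* 1 = unlocked, 0 = locked, <0 = waiters */
    pthread_mutex_t wait_lock;   /* protects the waiter list */
    /* ... waiter list ... */
};

static void my_mutex_unlock(struct my_mutex *m)
{
    /* Release the mutex word FIRST: spinners on the fast path can now
     * take the mutex even if we stall on wait_lock below. */
    atomic_store_explicit(&m->count, 1, memory_order_release);

    /* Only then take the internal lock to wake a queued waiter, if any. */
    pthread_mutex_lock(&m->wait_lock);
    /* ... pick the first waiter off the list and wake it ... */
    pthread_mutex_unlock(&m->wait_lock);
}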
Lockref 
•Reference counts are normally used to track the lifecycle of data structures. 
−A reference count of zero means the structure is unused and is free to be released 
−A positive reference count indicates how many tasks are actively referencing the structure 
−When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended. 
•The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock protected reference count. 
−Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be updated atomically while no one is holding the lock (see the sketch below).
•The VFS layer makes heavy use of reference counts for dentry operations. 
−Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock contention on the dentry reference count update. 
−The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock. 
38 
% of time spent spinning on the dentry lock (AIM-7 short workload): 83.74% before changes → 0.01% after changes
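A stand-alone sketch of the lockref idea described above, using C11 atomics; the 32-bit lock/count split and names are illustrative rather than the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

union lockref {
    uint64_t combined;               /* lock and count updated as one word */
    struct {
        uint32_t lock;               /* 0 = unlocked */
        uint32_t count;              /* reference count */
    };
};

/* Try to take a reference without touching the spinlock: only succeeds
 * while nobody holds the lock, using a single 64-bit CAS on lock+count. */
static bool lockref_get_not_locked(_Atomic uint64_t *lr)
{
    union lockref cur, next;

    cur.combined = atomic_load(lr);
    while (cur.lock == 0) {
        next = cur;
        next.count++;
        if (atomic_compare_exchange_weak(lr, &cur.combined, next.combined))
            return true;             /* count bumped, lock never taken */
        /* CAS failure reloaded 'cur'; retry unless the lock got taken. */
    }
    return false;                    /* caller falls back to taking the lock */
}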
Lockref and Dentry Benchmark Results 
~ 7x improvement in throughput (JPM)! 
39 
[Chart: AIM7 short workload, 8-socket / 80-core / 1-TB – Jobs Per Minute vs. simulated users (10 to 2000), pre-lockref changes vs. post-lockref changes.]
SELinux Bit-Map Management 
•Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.
•Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance-optimized version.
−More efficient bit-map management
•The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload
40 
System time in the mls_level_isvalid() code path: 8.95% before changes → 0.12% after changes
Hugetlb: Parallel Page Faults 
•The kernel originally serialized hugetlb page faults, handling a single fault at a time.
−Workloads with large working sets backed by hugepages (e.g. databases or KVM guests) can suffer especially painful startup times due to this.
−The serialization provides protection from spurious OOM errors under conditions of low availability of free hugepages.
−This problem is specific to hugepages because it is normal to want to use every single hugepage in the system – with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.
•Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized (see the sketch below).
−The hash key is selected depending on the mapping type.
−Because the size of the table is static, this can, in theory, still produce contention if enough hugepages are being reserved, but in practice this has not been observed.
41 
Startup time (seconds) of a 10-GB Oracle DB (Data Mining): 37.5 before changes → 25.7 after changes
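A stand-alone sketch of the mutex-table idea described above; the table size, the hash, and the use of pthread mutexes are assumptions for illustration (the kernel derives its key from the mapping, as noted on the slide).

#include <pthread.h>
#include <stdint.h>

#define FAULT_MUTEX_TABLE_SIZE 256        /* illustrative, statically sized */

/* GCC designated-range initializer, as used in kernel-style code. */
static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE] =
    { [0 ... FAULT_MUTEX_TABLE_SIZE - 1] = PTHREAD_MUTEX_INITIALIZER };

/* Pick the mutex that serializes faults on one particular hugepage
 * (identified here by its mapping and page index), so faults on
 * different hugepages can proceed in parallel. */
static pthread_mutex_t *hugetlb_fault_mutex(const void *mapping,
                                            unsigned long idx)
{
    uint64_t key = (uint64_t)(uintptr_t)mapping ^
                   (idx * 0x9e3779b97f4a7c15ULL);   /* simple mixing hash */
    return &fault_mutex_table[key % FAULT_MUTEX_TABLE_SIZE];
}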
epoll_ctl() Scaling 
•Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance.
•Further investigation identified a set of patches already developed for the upstream Linux kernel:
•Don't take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it's not needed)
•Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals
•RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates. (A minimal user-space illustration of the contended epoll_ctl() pattern follows the results below.)
42 
Java operations (16-sockets / 240-cores / 480-threads): 87,903 before changes → 165,119 after changes
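For context, this is the user-space pattern that exercises the code path above: threads concurrently adding and removing file descriptors on an epoll instance with epoll_ctl(). The pipe file descriptor is just a convenient target for the example.

#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN };

    if (epfd < 0 || pipe(pipefd) < 0) {
        perror("setup");
        return 1;
    }
    ev.data.fd = pipefd[0];

    /* Before the patches, these calls could serialize on the global
     * epmutex; many threads doing this concurrently is what contended it. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        perror("EPOLL_CTL_ADD");
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL) < 0)
        perror("EPOLL_CTL_DEL");

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return 0;
}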
3.0.101-0.30-default: Perf Improvements 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
43 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
3.0.101-0.30-default: 946,552
Performance Enhancements
SLES 11sp3 Kernel Version:
3.0.101-0.30-bigsmp
Why the “bigsmp” kernel? 
•Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)
•The User Application ABI remains the same – all applications that run on the "default" SLES 11sp3 kernel have full binary and source compatibility with the "bigsmp" SLES 11sp3 kernel.
•There was a small possibility that this kABI breakage would impact kernel drivers and modules
•Rather than risk compatibility issues at customer sites, SUSE created the "bigsmp" flavor of the SLES 11sp3 kernel, which contains these additional performance enhancements.
•The bigsmp flavor of SLES 11sp3 has its own kABI
•Requires a recompile of kernel drivers and modules
•SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.
•All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA "default" kernels. Bigsmp will be an on-going flavor of SLES 11sp3 for all platforms.
45
Mutex MCS Optimizations 
•The MCS lock is a new locking primitive inside Linux
•Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.
•Maintains a list of spinning waiters.
•When the lock is released, the unlocker changes the local variable of the next spinner.
•This change causes that spinner to stop spinning and acquire the lock.
•Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when simple CAS (compare-and-swap) calls fail.
•Fair: the lock is passed to each locker in the order in which it arrived.
•Specialized cancelable MCS locking was applied internally to kernel mutexes
•The cancelable MCS lock is a specially tailored MCS variant: when a spinner needs to reschedule, it must be able to abort its spinning in order to block.
46 
Java operations (16-sockets / 240-cores / 480-threads): 137,268 before changes → 250,981 after changes
All Mutex Lock Performance Improvements 
•Performance measurements were done with a popular Java based workload (higher number of operations is better) 
•System used: 16-sockets, 240-cores, 480-threads 
47 
Number of operations:
Baseline prior to any mutex changes: 72,089
3 non-MCS mutex changes: 137,268 (90% over baseline)
All mutex changes: 250,891 (248% over baseline, 83% over non-MCS mutex)
Per-thread VMA caching 
•A process's address space is divided among VMAs (virtual memory areas), each storing a range of addresses that share similar properties, such as permissions.
−A common operation when dealing with memory is locating (find_vma()) the VMA that contains a given address.
•Traditionally the Linux kernel caches the last used VMA.
−Avoids expensive tree lookups (which scale poorly in multi-threaded programs).
−This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).
•Replace the single-entry cache with a small, per-thread hash table (see the sketch below).
−O(1) lookups/updates, cheap to maintain, and small overhead.
−Improves poor-locality hit rates to ~99.9%.
−Improves Oracle 11g Data Mining (4k pages) hit rates from 70% to 91%.
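A stand-alone sketch of a small per-thread VMA cache as described above; the cache size, the hash, and the omission of the kernel's invalidation-by-sequence-number are all simplifications.

#include <stddef.h>
#include <stdint.h>

#define VMACACHE_SIZE 4                     /* small, fixed-size cache */

struct vma { uintptr_t start, end; /* ... other fields ... */ };

static _Thread_local struct vma *vmacache[VMACACHE_SIZE];

static inline unsigned int vmacache_hash(uintptr_t addr)
{
    return (addr >> 12) & (VMACACHE_SIZE - 1);   /* O(1): a few page-number bits */
}

/* Fast path for find_vma(): return the cached VMA if it covers 'addr'. */
static struct vma *vmacache_find(uintptr_t addr)
{
    struct vma *v = vmacache[vmacache_hash(addr)];

    if (v && v->start <= addr && addr < v->end)
        return v;                   /* hit: no tree lookup needed */
    return NULL;                    /* miss: caller walks the VMA tree */
}

static void vmacache_update(uintptr_t addr, struct vma *v)
{
    vmacache[vmacache_hash(addr)] = v;  /* remember the most recent result */
}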
48
3.0.101-0.30-bigsmp: Perf Improvements 
•Performance measurements were done with a popular Java based workload (higher number of operations is better) 
49 
Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
3.0.101-0.8-default: 6,502 (can't get valid result at 16-sockets)
3.0.101-0.15-default: 67,447
3.0.101-0.30-default: 946,552
3.0.101-0.30-bigsmp: 994,771
3.0.101-0.30-bigsmp
vs.
3.0.101-0.8-default
3.0.101-0.8-default Scaling 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
51 
Java operations, 1/2/4/8/16 sockets:
1-socket / 15-cores: 69,426
2-sockets / 30-cores: 114,617
4-sockets / 60-cores: 180,922
8-sockets / 120-cores: 139,825
16-sockets / 240-cores: 6,502 (can't get valid result at 16-sockets)
•Scaling issues at 8- and 16-sockets. Other Linux distributions are similar.
3.0.101-0.30-bigsmp Scaling 
•Performance measurements were done with a popular Java based workload (a higher number of operations is better) 
52 
Java operations, 1/2/4/8/16 sockets:
1-socket / 15-cores: 71,517
2-sockets / 30-cores: 132,996
4-sockets / 60-cores: 260,423
8-sockets / 120-cores: 512,953
16-sockets / 240-cores: 994,771
Scaling is now approximately 1.9x for each doubling of sockets!
3.0.101-0.30-bigsmp: AIM-7 alltests 
53 
~4x performance improvement!
[Chart: AIM-7 alltests workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 custom 
54 
~5x performance improvement!
[Chart: AIM-7 custom workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 high_systime 
55 
~4x performance improvement!
[Chart: AIM-7 high_systime workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
3.0.101-0.30-bigsmp: AIM-7 shared 
56 
~3.75x performance improvement!
[Chart: AIM-7 shared workload, 16-sockets / 240-cores / 480-threads / 12-TB – Jobs Per Minute vs. simulated users (10 to 2000) for 3.0.101-0.8-default, 3.0.101-0.15-default, 3.0.101-0.30-default, and 3.0.101-0.30-bigsmp.]
Performance Enhancements
SLES 12
IPC Lock with System V Semaphores 
•The global IPC spinlock is held too long in various System V semaphore paths
•When one semaphore array is doing lots of operations, scaling is poor
•Single-semaphore operations now use a finer-grained spinlock as an alternative to the coarse-grained IPC lock (see the sketch below).
•Introduced rcu_read_lock-only methods to obtain IPC objects
•Similar changes were made to System V Shared Memory and Message Queues
58 
% of time spent spinning on the IPC lock (Oracle 11gR2 Data Mining):
400 users: 21.86% before changes → 5.23% after changes
800 users: 84.35% before changes → 7.39% after changes
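A stand-alone sketch of the coarse-vs-fine locking split described above; the structures and names are illustrative (pthread mutexes stand in for the kernel spinlocks), and the real ipc/sem.c code must additionally keep the two locking modes coherent with each other.

#include <pthread.h>

struct sem {
    int value;
    pthread_mutex_t lock;            /* fine-grained, per-semaphore lock */
};

struct sem_array {
    pthread_mutex_t global_lock;     /* coarse lock for whole-array operations */
    int nsems;
    struct sem sems[];
};

/* Simple operation on a single semaphore: take only that semaphore's
 * lock, so operations on different semaphores in the same array can
 * run in parallel. */
static void sem_op_single(struct sem_array *sma, int num, int delta)
{
    struct sem *s = &sma->sems[num];

    pthread_mutex_lock(&s->lock);
    s->value += delta;
    pthread_mutex_unlock(&s->lock);
}

/* A complex operation touching several semaphores still takes the
 * array-wide lock. */
static void sem_op_multi(struct sem_array *sma)
{
    pthread_mutex_lock(&sma->global_lock);
    /* ... perform the multi-semaphore operation atomically ... */
    pthread_mutex_unlock(&sma->global_lock);
}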
SysV Semaphore Performance Improvements 
•Performance measurements were done with an Oracle 11gR2 Data Mining workload 
•System used: 8-sockets, 80-cores, 160-threads 
59 
100 users: 1,257 before changes → 2,805 after changes
400 users: 1,438 before changes → 2,665 after changes
800 users: 1,237 before changes → 2,751 after changes
World Record Performance
World record performance: HP CS900 for SAP HANA 
Key takeaways
•#1 and #2 overall max-jOPS results 
•#1 and #4 overall critical-jOPS results 
•#1 16-socket (16s) results on both max-jOPS and critical-jOPS 
•#1 8-socket (8s) results on max-jOPS 
•16s max-jOPS results 2.1X greater than Fujitsu 16s results 
•8s max-jOPS results are 2.2X greater than Sugon 8s results 
•8s max-jOPS results 1.1X greater than Fujitsu 16s results
•HP CS900 demonstrates excellent scaling from 8s to 16s 
SPECjbb2013 results (critical-jOPS / max-jOPS):
Sugon I980G10, 8-socket, Intel Xeon E7-8890 v2: 23,058 / 214,961
Fujitsu SPARC M10-4S, 16-socket, SPARC64 X: 168,127 / 308,936
Fujitsu SPARC M10-4S, 16-socket, SPARC64 X+: 198,418 / 425,348
HP ConvergedSystem 900 for SAP HANA 8s/6TB, 8-socket, Intel Xeon E7-2890 v2: 126,617 / 474,575
HP ConvergedSystem 900 for SAP HANA 16s/12TB, 16-socket, Intel Xeon E7-2890 v2: 247,581 / 888,164
HP ConvergedSystem 900 for SAP HANA, powered by SLES 11sp3, owns the top two SPECjbb2013 max-jOPS records.
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.
HP and SUSE Break 1-Million max-jOPS!!! 
SPECjbb2013 results (critical-jOPS / max-jOPS):
HP ConvergedSystem 900 for SAP HANA (8s/6TB), 8-socket, Intel Xeon E7-2890 v2: 129,876 / 524,790
HP ConvergedSystem 900 for SAP HANA (16s/12TB), 16-socket, Intel Xeon E7-2890 v2: 244,162 / 1,002,350
HP ConvergedSystem 900 for SAP HANA, powered by SLES 11sp3, holds the #1 16-socket SPECjbb2013 max-jOPS record.
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.
World record performance: HP CS900 for SAP HANA 
•The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system. 
•Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:
−TBD – provide link for new brief (Nov 2014)
−http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en (June 2014)
•Official benchmark results for HP ConvergedSystem 900 for SAP HANA on spec.org:
−TBD – provide link for new result (16s/240c/12TB) (Nov 2014)
−TBD – provide link for new result (8s/120c/6TB) (Nov 2014)
−http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB) (June 2014)
−http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB) (June 2014)
SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
Wrap-up
Acknowledgements 
•HP Linux Kernel Performance team contributing this work: 
−Davidlohr Bueso 
−Tuan Bui 
−Waiman Long 
−Jason Low 
−Scott Norton 
−Thavatchai Makphaibulchoke 
−Tom Vaden 
−Aswin Chandramouleeswaran 
65 
•SUSE Linux R&D team contributing to this work: 
−Scott Bahling 
−Matthias Eckermann 
−Mike Galbraith 
−Mel Gorman 
−Jan Kara 
−Jeff Mahoney 
−Ahmad Sadeghpour 
−Miklos Szeredi
Thank you
Back-up Slides
We developed a prototype MCS/Queued lock to see the effect on cache line traffic 
(MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word) 
MCS/Queued Lock Prototype (1-2) 
typedef struct _local_qnode {
    volatile bool waiting;
    volatile struct _local_qnode *volatile next;
} local_qnode;

static inline void
mcsfile_lock_acquire(mcsglobal_qlock *global, local_qnode_ptr me)
{
    local_qnode_ptr pred;

    me->next = NULL;
    pred = xchg(global, me);         /* atomically append ourselves at the queue tail */
    if (pred == NULL)
        return;                      /* queue was empty: lock acquired */
    me->waiting = true;
    pred->next = me;                 /* link behind our predecessor */
    while (me->waiting);             /* spin on our own (local) memory */
}

static inline void
mcsfile_lock_release(mcsglobal_qlock *global, local_qnode_ptr me)
{
    local_qnode_ptr succ;

    if (!(succ = me->next)) {
        /* No visible successor: try to swing the tail back to NULL. */
        if (cmpxchg(global, me, NULL) == me)
            return;
        do {
            succ = me->next;
        } while (!succ);             /* a successor is arriving; wait for the link */
    }
    succ->waiting = false;           /* hand the lock to the next spinner */
}
68
Replacing the files_lock spinlock with the prototype mcsfiles_lock MCS/Queued spinlock 
extern mcsglobal_qlock mcsfiles_lock;
#define file_list_lock(x) mcsfile_lock_acquire(&mcsfiles_lock, &x);
#define file_list_unlock(x) mcsfile_lock_release(&mcsfiles_lock, &x);
void file_kill(struct file *file)
{
    volatile local_qnode lq;

    if (!list_empty(&file->f_u.fu_list)) {
        file_list_lock(lq);
        list_del_init(&file->f_u.fu_list);
        file_list_unlock(lq);
    }
}

void file_move(struct file *file, struct list_head *list)
{
    volatile local_qnode lq;

    if (!list)
        return;
    file_list_lock(lq);
    list_move(&file->f_u.fu_list, list);
    file_list_unlock(lq);
}
MCS/Queued Lock Prototype (2-2) 
69
AIM7 Benchmark Suite 
•Traditional UNIX system-level benchmark (written in C).
•Multiple forks, each of which concurrently executes a common, randomly-ordered set of subtests called jobs.
•Each of the over fifty kinds of jobs exercises a particular facet of system functionality
•Disk IO operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.
•AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and random mixed reads and writes.
•An AIM7 run consists of a series of subruns with the number of tasks, N, being increased after the end of each subrun.
•Each subrun continues until each task completes the common set of jobs. The performance metric, "Jobs completed per minute", is reported for each subrun.
•The result of the entire AIM7 run is a table showing the performance metric versus the number of tasks, N.
•Reference: "Filesystem Performance and Scalability in Linux 2.4.17", 2002.
70
perf-bench futex (1-2) 
•To measure some of the changes done by the futex hashtable patchset, a set of futex microbenchmarks was added to perf-bench:
−perf bench futex [<operation> <all>] 
•Measures latency of different operations: 
−Futex hash 
−Futex wake 
−Futex requeue/wait
perf-bench futex (2-2) 
$ perf bench futex wake
# Running 'futex/wake' benchmark:
Run summary [PID 4028]: blocking on 4 threads (at futex 0x7e20f4), waking up 1 at a time.
[Run 1]: Wokeup 4 of 4 threads in 0.0280 ms
[Run 2]: Wokeup 4 of 4 threads in 0.0880 ms
[Run 3]: Wokeup 4 of 4 threads in 0.0920 ms
…
[Run 9]: Wokeup 4 of 4 threads in 0.0990 ms
[Run 10]: Wokeup 4 of 4 threads in 0.0260 ms
Wokeup 4 of 4 threads in 0.0703 ms (+-14.22%)

$ perf bench futex hash
# Running 'futex/hash' benchmark:
Run summary [PID 4069]: 4 threads, each operating on 1024 futexes for 10 secs.
[thread 0] futexes: 0x1982700 ... 0x19836fc [ 3507916 ops/sec ]
[thread 1] futexes: 0x1983920 ... 0x198491c [ 3651174 ops/sec ]
[thread 2] futexes: 0x1984ab0 ... 0x1985aac [ 3557171 ops/sec ]
[thread 3] futexes: 0x1985c40 ... 0x1986c3c [ 3597926 ops/sec ]
Averaged 3578546 operations/sec (+-0.85%), total secs = 10

Embracing Complexity in Serverless! GOTO Serverless Bengaluru
Ableton Live Suite for MacOS Crack Full Download (Latest 2025)
"Secure File Sharing Solutions on AWS".pptx
Time Tracking Features That Teams and Organizations Actually Need

SLES Performance Enhancements for Large NUMA Systems

  • 7. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention within a Socket: Increasing Core Count
    Performance degrades smoothly as more cores are involved in cache line contention.
    Execution nodes: Node 1 | Memory node: Node 1 | Sockets used: 1-socket
    Cores used | Seconds | Ops per sec per core | % decrease from 2-core | % decrease from previous
    2-cores | 1.704489 | 5,866,861 | 0.0% | 0.0%
    3-cores | 2.783121 | 3,593,088 | 38.8% | 38.8%
    4-cores | 4.012157 | 2,492,425 | 57.5% | 30.6%
    5-cores | 5.506802 | 1,815,936 | 69.0% | 27.1%
    6-cores | 7.110453 | 1,406,380 | 76.0% | 22.6%
    7-cores | 7.834159 | 1,276,461 | 78.2% | 9.2%
    8-cores | 10.054136 | 994,616 | 83.0% | 22.1%
    9-cores | 11.185041 | 894,051 | 84.8% | 10.1%
    10-cores | 13.508867 | 740,255 | 87.4% | 17.2%
    11-cores | 14.839633 | 673,871 | 88.5% | 9.0%
    12-cores | 16.490477 | 606,411 | 89.7% | 10.0%
    13-cores | 19.138960 | 522,494 | 91.1% | 13.8%
    14-cores | 20.704514 | 482,986 | 91.8% | 7.6%
  • 8. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets
    •All 30 cores in 2 nodes/sockets participate in the cache line contention:
    Execution nodes | Memory node | Sockets used | Cores used | Seconds | Ops per sec per core | % decrease from 1-socket
    Node 0 | Node 1 | 1-socket | 15-cores | 2.107396 | 474,519 | 0.0%
    Nodes 0-1 | Node 1 | 2-socket | 30-cores | 14.450938 | 69,200 | 85.4%
    Nodes 1-2 | Node 1 | 2-socket | 30-cores | 14.897306 | 67,126 | 86.0%
    Nodes 2-3 | Node 1 | 2-socket | 30-cores | 21.742537 | 45,993 | 90.4%
    •There are two interesting points here:
    1) There is a huge drop in performance when going from 15-cores on 1-socket to 30-cores on 2-sockets
    2) There is a smaller drop in performance when the lock’s memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)
  • 9. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets: Increasing Core Count
    •We can see that the huge drop in performance occurs once we add a single core from the second socket.
    •This is due to the need to go through QPI to handle the cache-to-cache traffic to resolve the cache line contention.
    •This is a significant drop in performance when going through QPI.
    •Add one core at a time, filling node/socket-0 first, then filling node/socket-1:
    Execution nodes | Memory node | Sockets used | Cores used | Seconds | Ops per sec per core | % decrease from 2-core | % decrease from previous
    Node 0 | Node 1 | 1-socket | 13-cores | 1.649242 | 606,339 | 92.7% | 9.1%
    Node 0 | Node 1 | 1-socket | 14-cores | 1.905878 | 524,693 | 93.7% | 13.5%
    Node 0 | Node 1 | 1-socket | 15-cores | 1.649242 | 482,435 | 94.2% | 8.1%
    Nodes 0-1 | Node 1 | 2-sockets | 16-cores | 1.905878 | 129,309 | 98.4% | 73.2%
    Nodes 0-1 | Node 1 | 2-sockets | 17-cores | 8.348480 | 119,782 | 98.6% | 7.4%
    Nodes 0-1 | Node 1 | 2-sockets | 18-cores | 8.264046 | 121,006 | 98.5% | -1.0%
    Nodes 0-1 | Node 1 | 2-sockets | 30-cores | 15.146260 | 66,023 | 99.2% | 8.5%
  • 10. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 2-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 2, 8, and 14 cores
    •Cores spread among 2-sockets (round-robin) vs. all cores in one socket (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 2-cores | 2-cores | 0.120395 | 8,305,993
    Nodes 0-1 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.314462 | 3,180,034
    Nodes 1-2 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.305783 | 3,270,293
    Nodes 2-3 | Node 1 | 2-sockets RR | 1-core | 2-cores | 0.453627 | 2,204,454
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.018527 | 981,810
    Nodes 0-1 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.351590 | 298,366
    Nodes 1-2 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 3.390266 | 294,962
    Nodes 2-3 | Node 1 | 2-sockets RR | 4-cores | 8-cores | 5.354243 | 186,768
    Node 1 | Node 1 | 1-socket FF | 14-cores | 14-cores | 2.067889 | 483,585
    Nodes 0-1 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.214167 | 160,923
    Nodes 1-2 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 6.275140 | 159,359
    Nodes 2-3 | Node 1 | 2-sockets RR | 7-cores | 14-cores | 9.471300 | 105,582
    1) NUMA effect is visible when memory is remote
    2) Best performance when all cores are in one socket
  • 11. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 4-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 4, 8, and 12 cores
    •Cores spread (round-robin) among 4-sockets vs. all cores in one socket (fill-first)
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
    Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 0.941517 | 1,062,116
    Nodes 0-3 | Node 1 | 4-sockets RR | 2-cores | 8-cores | 5.421381 | 184,455
    Node 1 | Node 1 | 1-socket FF | 12-cores | 12-cores | 1.794806 | 557,163
    Nodes 0-3 | Node 1 | 4-sockets RR | 3-cores | 12-cores | 8.937035 | 111,894
    Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 49.786041 | 20,086
    •Cache line contention is clearly better when all the contention is contained within a single socket.
    •For the same core count, performance degrades as more sockets are involved in the contention
  • 12. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 8-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 8, 16 and 24 cores
    •Cores spread (round-robin) among 8-sockets vs. all cores in two sockets (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
    Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
    Nodes 0-1 | Node 1 | 2-sockets FF | 16-cores | 16-cores | 8.886286 | 112,533
    Nodes 0-7 | Node 1 | 8-sockets RR | 2-cores | 16-cores | 22.296164 | 44,851
    Nodes 0-1 | Node 1 | 2-sockets FF | 24-cores | 24-cores | 12.991910 | 76,626
    Nodes 0-7 | Node 1 | 8-sockets RR | 3-cores | 24-cores | 36.197777 | 27,626
    Nodes 0-7 | Node 1 | 8-sockets FF | 15-cores | 120-cores | 172.782623 | 5,788
    •Cache line contention is clearly better when all the contention is contained within as few sockets as possible.
  • 13. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Contention across 16-Sockets: Round-Robin vs. Fill-First
    •Contention measured across 16, 32 and 64 cores
    •Cores spread (round-robin) among 16-sockets vs. all cores in 1/2/4 sockets (fill-first):
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
    Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
    Nodes 0-1 | Node 1 | 2-sockets FF | 15-cores | 30-cores | 15.706788 | 63,667
    Nodes 0-15 | Node 1 | 16-sockets RR | 2-cores | 32-cores | 53.217117 | 18,791
    Nodes 0-3 | Node 1 | 4-sockets FF | 15-cores | 60-cores | 74.909485 | 13,349
    Nodes 0-15 | Node 1 | 16-sockets RR | 4-cores | 64-cores | 109.447632 | 9,137
    Nodes 0-15 | Node 1 | 16-sockets RR | 15-cores | 240-cores | 410.881287 | 2,434
    •Cache line contention is clearly better when all the contention is contained within as few sockets as possible.
  • 14. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Inter- vs Intra-Cache Line Contention Probability
    •On a 4-socket/60-core system you have a 25% chance that any two random cores participating in the same cache line contention are on the same socket
    •On an 8-socket/120-core system this is reduced to a 12.5% chance
    •With a 16-socket/240-core system you have only a 6.25% chance
    Execution nodes | Memory node | Sockets used | Cores per socket used | Cores used | Seconds | Ops per sec per core
    Node 1 | Node 1 | 1-socket FF | 4-cores | 4-cores | 0.396550 | 2,521,750
    Nodes 0-3 | Node 1 | 4-sockets RR | 1-core | 4-cores | 1.491732 | 670,362
    Node 1 | Node 1 | 1-socket FF | 8-cores | 8-cores | 1.185326 | 843,650
    Nodes 0-7 | Node 1 | 8-sockets RR | 1-core | 8-cores | 10.609325 | 94,257
    Node 1 | Node 1 | 1-socket FF | 15-cores | 15-cores | 2.21096 | 452,292
    Nodes 0-15 | Node 1 | 16-sockets RR | 1-core | 16-cores | 22.904097 | 43,660
  • 15. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Why do we care so much?
    •Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU.
    •However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.
    •Going from 40 or 80 cores to 240 cores (16 sockets) is a major jump.
    •Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel.
    •As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system –this is a major concern when dealing with 8- and 16-socket platforms.
    •This has led us to pursue minimizing cache line contention within Linux kernel locking primitives.
  • 16. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Proof of Concept
  • 17. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Background
    The AIM7 fserver workload* scales poorly on an 8-socket/80-core NUMA platform with a 2.6-based kernel.
    * The workload was run with ramfs.
  • 18. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Analysis (1-2)
    From the perf -g output, we find most of the CPU cycles are spent in file_move() and file_kill().
    40 users (4,000 jobs):
      + 9.40% reaim reaim [.] add_int
      + 6.07% reaim libc-2.12.so [.] strncat
      …
      - 1.68% reaim [kernel.kallsyms] [k] _spin_lock
        - _spin_lock
          + 50.36% lookup_mnt
          + 7.45% __d_lookup
          + 6.71% file_move
          + 5.16% file_kill
          + 2.46% handle_pte_fault
    Proportion of file_move() = 1.68% * 6.71% = 0.11%
    Proportion of file_kill() = 1.68% * 5.16% = 0.09%
    Proportion of file_move() + file_kill() = 0.20%
    400 users (40,000 jobs):
      - 79.53% reaim [kernel.kallsyms] [k] _spin_lock
        - _spin_lock
          + 34.28% file_move
          + 34.20% file_kill
          + 19.94% lookup_mnt
      + 8.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
      + 0.86% reaim [kernel.kallsyms] [k] _spin_lock_irqsave
      + 0.63% reaim reaim [.] add_long
    Proportion of file_move() = 79.53% * 34.28% = 27.26%
    Proportion of file_kill() = 79.53% * 34.20% = 27.20%
    Proportion of file_move() + file_kill() = 54.46%
    This is significant spinlock contention!
  • 19. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Analysis (2-2)
    We use the ORC tool to monitor the coherency controller results (ORC is a platform-dependent tool from HP that reads performance counters in the XNC node controllers).
    Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
    Socket | Agent | 10 users | 40 users | 400 users
    0 | 0 | 17,341 | 36,782 | 399,670,585
    0 | 8 | 36,905 | 45,116 | 294,481,463
    1 | 0 | 0 | 0 | 49,639
    1 | 8 | 0 | 0 | 25,720
    2 | 0 | 0 | 0 | 1,889
    2 | 8 | 0 | 0 | 1,914
    3 | 0 | 0 | 0 | 3,020
    3 | 8 | 0 | 0 | 3,025
    4 | 1 | 45 | 122 | 1,237,589
    4 | 9 | 0 | 110 | 1,224,815
    5 | 1 | 0 | 0 | 26,922
    5 | 9 | 0 | 0 | 26,914
    6 | 1 | 0 | 0 | 2,753
    6 | 9 | 0 | 0 | 2,854
    7 | 1 | 0 | 0 | 6,971
    7 | 9 | 0 | 0 | 6,897
    PRETRY indicates the associated read needs to be re-issued. We can see that when users increase, PRETRY on socket 0 increases rapidly. There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock.
    PRETRY number on socket 0: 400 users = 400M + 294M = 694M
  • 20. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Removing Cache Line Contention
    •Code snippet from the 2.6 based kernel for file_move() and file_kill():
      extern spinlock_t files_lock;
      #define file_list_lock() spin_lock(&files_lock);
      #define file_list_unlock() spin_unlock(&files_lock);

      void file_move(struct file *file, struct list_head *list)
      {
          if (!list)
              return;
          file_list_lock();
          list_move(&file->f_u.fu_list, list);
          file_list_unlock();
      }

      void file_kill(struct file *file)
      {
          if (!list_empty(&file->f_u.fu_list)) {
              file_list_lock();
              list_del_init(&file->f_u.fu_list);
              file_list_unlock();
          }
      }
    •Contention on this global spinlock is the cause of all the cache line contention
    •We developed a prototype MCS/Queued spinlock to see its effect on cache line traffic
    •MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word
    •Implementation is available in the back-up slides
    •No efforts were made to make this a finer grained lock
  • 21. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Benchmark Results
    Comparing the performance of the new kernel (blue line) vs. the original kernel (red line):
    2.4x improvement in throughput with the MCS/Queued spinlock prototype!
  • 22. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Analysis (1-2)
    •The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)
    •The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.
    perf -g output of the kernel with the MCS/Queued spinlock prototype, 400 users (40,000 jobs):
      44.71% reaim [kernel.kallsyms] [k] _spin_lock
        - 60.94% -- lookup_mnt
      …
      22.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
        - 96.16% -- __mutex_lock_slowpath
      …
      1.19% reaim [kernel.kallsyms] [k] file_kill
      1.19% reaim [kernel.kallsyms] [k] file_move
    Proportion of lookup_mnt() = 27.2%
    Proportion of __mutex_lock_slowpath() = 21.3%
    Proportion of file_move() + file_kill() = 2.38%
  • 23. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Prototype Analysis (2-2)
    Coherency controller results of the kernel with the MCS/Queued spinlock
    Coherency Controller Transactions Sent to Fabric Link (PRETRY number)
    Socket | Agent | 10 users | 40 users | 400 users
    0 | 0 | 18,216 | 24,560 | 83,720,570
    0 | 8 | 37,307 | 42,307 | 43,151,386
    1 | 0 | 0 | 0 | 0
    1 | 8 | 0 | 0 | 0
    2 | 0 | 0 | 0 | 0
    2 | 8 | 0 | 0 | 0
    3 | 0 | 0 | 0 | 0
    3 | 8 | 0 | 0 | 0
    4 | 1 | 52 | 222 | 16,786
    4 | 9 | 28 | 219 | 10,068
    5 | 1 | 0 | 0 | 0
    5 | 9 | 0 | 0 | 0
    6 | 1 | 0 | 0 | 0
    6 | 9 | 0 | 0 | 0
    7 | 1 | 0 | 0 | 0
    7 | 9 | 0 | 0 | 0
    We can see that as users increase, PRETRY on socket 0 also increases –but it is significantly lower than the kernel without the MCS/Queued lock.
    The PRETRY number for socket 0: 400 users = 84M + 43M = 127M. This value is about 1/5 of the original kernel (694M).
    This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock.
  • 24. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Proof of Concept – Conclusions
    •The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm.
    •The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention.
    •However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic.
    •System time spent spinning on the lock dropped from 54% to 2%.
    •Lock algorithms can be a huge factor in the performance of large-scale systems
    •The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system
    •This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements.
  • 25. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Methodology
  • 26. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Benchmarks and Tools Used •SPECjbb2005 / SPECjbb2013 •AIM7 •Swingbench OLTP •Swingbench DSS •Ebizzy •Perf-bench •Kernbench •Hackbench •Futextest •IOzone •fio •SLOB (Silly Little Oracle Benchmark) •Customer benchmarks and apps 26 •perf •ORC/ORC2 (HP internal tool) •LiKI (HP internal tool) •lockstat •numatop •slabtop •ftrace, strace •top, sar, vmstat, powertop SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
  • 27. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Development Process
    •Identify areas to improve performance and reduce cache line contention
    •Determine if an upstream patch already fixes the problem.
    •If no upstream patches exist:
    •Develop fix
    •Validate performance improvement
    •Submit the patches upstream and work through the upstream process to get them accepted
    •Depending on upstream feedback this may be an iterative process until the changes are accepted
    •Back-port patches to the SLES kernels
    •Validate performance improvement
    •Submit back-ported patches along with performance justification to SUSE for inclusion in SLES
    •SUSE delivers a PTF kernel to HP containing the performance enhancements
    •HP performs a final performance validation of the PTF kernels, sometimes with a customer
    •Changes put into the next SLES Maintenance Update for all SUSE customers to use.
  • 28. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SUSE / HP Collaboration
    •Bugzillas were filed for each set of patches
    •Technical discussions with SUSE Performance engineers
    •Discussions on the design of the patch, the back-port of the patch, the performance validation data, kABI impacts, etc.
    •Engaged SUSE performance engineers in the analysis of the problem as needed
    •Sometimes restructured the back-ported patch to work better with SLES 11sp3 after consultation with SUSE
    •Some patches had initial review and feedback from SUSE performance engineers prior to the upstream submittal
    •The recognition of customer benefit and the flexibility of both companies really made all this possible.
  • 29. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.15-default
  • 30. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Idle Balance
    •One of the problems with significant lock contention on blocking locks (such as a mutex) is that as more processes block on the mutex there is less to run –this causes the idle balancer to take processes from a different CPU’s run queue. This in turn causes even further cache issues.
    •Ensure that we don’t attempt an idle balance operation when it takes longer to do the balancing than the time the CPU would be idle
    •We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU (see the sketch below)
    •Max-time-to-balance is decayed at a rate of about 1% per second
    •Improve the accuracy of the average CPU idle duration.
    •Previously the average CPU idle duration was over-estimated, resulting in too much idle balancing
    Java operations with 16-sockets / 240-cores / 480-threads: 18,600 before changes → 93,609 after changes
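    The cutoff logic described above can be sketched in a few lines of standalone C. This is not the scheduler code: the names (sched_domain_sim, max_balance_cost_ns, avg_idle_ns) and the exact arithmetic are assumptions made for illustration, following the description on the slide (skip newidle balancing when the worst observed balancing cost exceeds the CPU's expected idle time, and decay that worst cost by roughly 1% per second).

      /* Illustrative userspace sketch of the idle-balance cutoff; not kernel code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct sched_domain_sim {
          uint64_t max_balance_cost_ns;   /* worst observed cost of balancing this domain */
          uint64_t last_decay_ns;         /* timestamp of the last ~1%/sec decay          */
      };

      /* Skip newidle balancing when it is expected to cost more than the time
       * this CPU is expected to stay idle. */
      static bool should_idle_balance(const struct sched_domain_sim *sd, uint64_t avg_idle_ns)
      {
          return avg_idle_ns > sd->max_balance_cost_ns;
      }

      /* After a balance attempt: remember the worst cost, and decay it by ~1%
       * per elapsed second so a one-off spike does not disable balancing forever. */
      static void update_balance_cost(struct sched_domain_sim *sd, uint64_t cost_ns, uint64_t now_ns)
      {
          while (now_ns - sd->last_decay_ns > 1000000000ULL) {
              sd->max_balance_cost_ns -= sd->max_balance_cost_ns / 100;
              sd->last_decay_ns += 1000000000ULL;
          }
          if (cost_ns > sd->max_balance_cost_ns)
              sd->max_balance_cost_ns = cost_ns;
      }

      int main(void)
      {
          struct sched_domain_sim sd = { .max_balance_cost_ns = 0, .last_decay_ns = 0 };

          update_balance_cost(&sd, 150000, 0);   /* a 150us balance was observed */
          printf("avg_idle=100us -> balance? %d\n", should_idle_balance(&sd, 100000));
          printf("avg_idle=500us -> balance? %d\n", should_idle_balance(&sd, 500000));
          return 0;
      }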
  • 31. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex Atomics Reduction
    •Reduce the number of atomic operations when trying to acquire a mutex
    −Entry into the mutex lock slow path will cause 2 atomic xchg instructions to be issued
    −With the atomic decrement in the fast path, a total of 3 atomic read-modify-write instructions are issued in rapid succession
    −Causes lots of cache bouncing when many tasks are trying to acquire the mutex simultaneously
    −By using atomic_read() to check the value before calling atomic_xchg() or atomic_cmpxchg() we avoid unnecessary cache coherency traffic
    •Also known as compare-compare-and-swap (see the sketch below).
    −For example:
      if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1))
    Java operations with 16-sockets / 240-cores / 480-threads: 93,609 before changes → 137,370 after changes
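    A minimal userspace illustration of the compare-compare-and-swap idea using C11 atomics. toy_mutex_t, try_lock_cas() and try_lock_ccas() are invented names for this sketch; the kernel code differs, but the pattern of reading the value before attempting the atomic exchange is the same.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* count == 1 means unlocked, 0 means locked (mirrors the mutex fast path). */
      typedef struct { atomic_int count; } toy_mutex_t;

      /* Plain CAS attempt: always a read-modify-write, which pulls the cache line
       * into exclusive state even when the lock is visibly held by someone else. */
      static bool try_lock_cas(toy_mutex_t *m)
      {
          int expected = 1;
          return atomic_compare_exchange_strong(&m->count, &expected, 0);
      }

      /* Compare-compare-and-swap: a cheap read filters out hopeless attempts, so
       * only likely winners generate exclusive-state cache coherency traffic. */
      static bool try_lock_ccas(toy_mutex_t *m)
      {
          if (atomic_load_explicit(&m->count, memory_order_relaxed) != 1)
              return false;                 /* looks taken: don't dirty the line */
          int expected = 1;
          return atomic_compare_exchange_strong(&m->count, &expected, 0);
      }

      int main(void)
      {
          toy_mutex_t m = { .count = 1 };
          printf("ccas acquire: %d\n", try_lock_ccas(&m));   /* 1: acquired       */
          printf("cas  acquire: %d\n", try_lock_cas(&m));    /* 0: already taken  */
          return 0;
      }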
  • 32. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. XFS Small Block Size Performance
    •A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications (the write pattern is illustrated below).
    •Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4
    •The primary patch introduces per-filesystem I/O completion workqueues (as opposed to global workqueues)
    •Allows concurrency on the workqueues –blocking on one inode does not block others on a different inode.
    •These patches were back-ported to SLES 11sp3 (and by default now part of 11sp4 and 12)
    •Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s
    •Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s
    •Also improves 16k over-write performance as well as 4k initial-write and over-write performance.
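    The slow case is easy to reproduce from userspace with nothing more than POSIX calls; the sketch below issues synchronous 16k initial writes in a loop. The file name, total size and loop count are arbitrary choices for illustration, not the customer benchmark.

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
          const size_t bs = 16 * 1024;
          char *buf = malloc(bs);
          if (!buf)
              return 1;
          memset(buf, 'x', bs);

          /* O_SYNC makes every write wait for I/O completion, which is where
           * the per-filesystem completion workqueues matter. */
          int fd = open("testfile.dat", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }
          for (int i = 0; i < 1024; i++) {      /* 16 MB of 16k initial writes */
              if (write(fd, buf, bs) != (ssize_t)bs) {
                  perror("write");
                  break;
              }
          }
          close(fd);
          free(buf);
          return 0;
      }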
  • 33. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Futex Hash Size and Wait Queues
    •Hash Bucket Size
    −Originally 256 hash buckets for the whole system
    •256 * nCPUs cache-line-aligned hash buckets
    •Fewer collisions and more spinlocks leading to more parallel futex call handling.
    −Distribute the table among NUMA nodes instead of a single one.
    −The perfect hash size will of course have a one-to-one hash-bucket:futex ratio.
    •Lockless Wait Queue Size
    −A common misuse of futexes is to make FUTEX_WAKE calls when there are no waiters.
    −In FUTEX_WAKE, there’s no reason to take the hb->lock if we already know the list is empty and thus there is no one to wake up.
    −Use an independent atomic counter to keep track of the list size (see the sketch below).
    −This can drastically reduce contention on the hash bucket spinlock.
    % system time spinning on hb->lock for a large database: 43.71% before changes → 0.10% after changes
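    A rough sketch of the lockless wait-queue check, in userspace C for illustration only (struct toy_bucket and the function names are made up, and a pthread mutex stands in for the hash-bucket spinlock): waiters advertise themselves through an atomic counter, so the wake path can bail out without ever touching the lock when the counter reads zero.

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      /* Toy hash bucket: a lock plus an independent waiter counter that can be
       * read without taking the lock. */
      struct toy_bucket {
          pthread_mutex_t lock;
          atomic_int      waiters;
      };

      static void toy_wait_enqueue(struct toy_bucket *hb)
      {
          atomic_fetch_add(&hb->waiters, 1);      /* published before we block */
          pthread_mutex_lock(&hb->lock);
          /* ... add ourselves to the wait list, drop the lock, and block ... */
          pthread_mutex_unlock(&hb->lock);
      }

      static void toy_wait_dequeue(struct toy_bucket *hb)
      {
          atomic_fetch_sub(&hb->waiters, 1);
      }

      /* FUTEX_WAKE-style fast path: if nobody is queued, return without ever
       * touching hb->lock, so the no-waiter misuse generates no contention. */
      static int toy_wake(struct toy_bucket *hb)
      {
          if (atomic_load(&hb->waiters) == 0)
              return 0;
          pthread_mutex_lock(&hb->lock);
          /* ... walk the wait list and wake matching waiters ... */
          pthread_mutex_unlock(&hb->lock);
          return 1;
      }

      int main(void)
      {
          struct toy_bucket hb = { PTHREAD_MUTEX_INITIALIZER, 0 };
          printf("wake with no waiters: %d\n", toy_wake(&hb));  /* 0, lock never taken */
          toy_wait_enqueue(&hb);
          printf("wake with a waiter:   %d\n", toy_wake(&hb));  /* 1 */
          toy_wait_dequeue(&hb);
          return 0;
      }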
  • 34. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Futex: Larger NUMA-aware Hash Table
    (Chart: Futex Hash Table Scaling – operations per second at 512/256/128/80/64/32 threads for four variants: baseline, aligned buckets, large NUMA-aware table, and aligned buckets plus large NUMA-aware table.)
  • 35. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.15-default: Perf Improvements
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
  • 36. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.30-default
  • 37. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex Optimizations
    •Unlock a mutex without acquiring the wait lock
    −Workloads with high amounts of mutex contention would spend significant time spinning on the mutex’s internal waiter lock, which then delays the mutex from getting unlocked.
    −Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters.
    −Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.
    •Mutex slowpath optimizations
    −When a lock can’t be acquired and a thread enters the mutex slowpath, it puts itself on the wait list and tries one last time to acquire the mutex.
    −Changed the order and attempted the acquisition of the mutex first
    −If acquired, we do not have to remove the thread from the waiter list
    Java operations with 16-sockets / 240-cores / 480-threads: 109,933 before changes → 156,912 after changes
  • 38. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Lockref
    •Reference counts are normally used to track the lifecycle of data structures.
    −A reference count of zero means the structure is unused and is free to be released
    −A positive reference count indicates how many tasks are actively referencing the structure
    −When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended.
    •The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock-protected reference count.
    −Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be updated atomically while no one is holding the lock (see the sketch below).
    •The VFS layer makes heavy use of reference counts for dentry operations.
    −Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock contention on the dentry reference count update.
    −The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock.
    % time spinning on dentry lock (AIM-7 short workload): 83.74% before changes → 0.01% after changes
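    A stripped-down sketch of the lockref idea using a C11 64-bit atomic: the low 32 bits stand in for the spinlock and the high 32 bits for the reference count, and the count is bumped with a single cmpxchg only while the lock half reads unlocked. The type and function names are invented for this sketch; the kernel's lockref also handles the contended fallback and architecture details that are omitted here.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Toy lockref: low 32 bits = spinlock state (0 = unlocked), high 32 bits =
       * reference count, packed so both are covered by one 64-bit cmpxchg. */
      typedef struct { _Atomic uint64_t lock_count; } toy_lockref;

      static bool toy_lockref_get(toy_lockref *lr)
      {
          uint64_t old = atomic_load(&lr->lock_count);
          for (int retry = 0; retry < 100; retry++) {
              if ((uint32_t)old != 0)              /* someone holds the spinlock:   */
                  return false;                    /* caller falls back to locking  */
              uint64_t new = old + (1ULL << 32);   /* bump the count, lock stays 0  */
              if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
                  return true;                     /* count updated without the lock */
          }
          return false;
      }

      int main(void)
      {
          toy_lockref lr = { .lock_count = 0 };
          toy_lockref_get(&lr);
          toy_lockref_get(&lr);
          printf("refcount = %llu\n",
                 (unsigned long long)(atomic_load(&lr.lock_count) >> 32));
          return 0;
      }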
  • 39. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Lockref and Dentry Benchmark Results
    ~7x improvement in throughput (JPM)!
    (Chart: AIM7 short workload on 8-socket / 80-core / 1-TB – jobs per minute vs. simulated users (10 to 2000), pre-lockref changes vs. post-lockref changes.)
  • 40. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SELinux Bit-Map Management
    •Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.
    •Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance-optimized version.
    −More efficient bit-map management (see the sketch below)
    •The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload
    Change in system time for the mls_level_isvalid() code path: 8.95% before changes → 0.12% after changes
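    The flavor of that bit-map optimization can be shown with a standalone example: testing whether one bitmap is contained in another bit by bit versus a whole 64-bit word at a time. This is only an illustration of the idea, not the SELinux ebitmap code, and the map size here is arbitrary.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define MAP_WORDS 4   /* 256 bits, size chosen for illustration */

      /* Per-bit containment test, similar in spirit to the slow path. */
      static bool contains_bit_by_bit(const uint64_t *super, const uint64_t *sub)
      {
          for (int bit = 0; bit < MAP_WORDS * 64; bit++) {
              bool lo = (sub[bit / 64]   >> (bit % 64)) & 1;
              bool hi = (super[bit / 64] >> (bit % 64)) & 1;
              if (lo && !hi)
                  return false;
          }
          return true;
      }

      /* Word-at-a-time version: sub is contained in super iff (sub & ~super)
       * has no bits set in any word. */
      static bool contains_wordwise(const uint64_t *super, const uint64_t *sub)
      {
          for (int w = 0; w < MAP_WORDS; w++)
              if (sub[w] & ~super[w])
                  return false;
          return true;
      }

      int main(void)
      {
          uint64_t super[MAP_WORDS] = { ~0ULL, 0xff, 0, 1 };
          uint64_t sub[MAP_WORDS]   = { 0xf0f0, 0x0f, 0, 1 };
          printf("%d %d\n", contains_bit_by_bit(super, sub), contains_wordwise(super, sub));
          return 0;
      }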
  • 41. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Hugetlb: Parallel Page Faults
    •The kernel originally serialized hugetlb page faults, handling a single fault at a time.
    −Workloads with large working sets backed by hugepages (i.e.: databases or KVM guests) can especially suffer from painful startup times due to this.
    −Protection from spurious OOM errors under conditions of low availability of free hugepages.
    −This problem is specific to hugepages because it is normal to want to use every single hugepage in the system –with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.
    •Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized (see the sketch below).
    −The hash key is selected depending on the mapping type.
    −Because the size of the table is static, this can, in theory, still produce contention if reserving enough hugepages. But reality indicates that this is purely theoretical.
    Startup time (seconds) of a 10-GB Oracle DB (Data Mining): 37.5 before changes → 25.7 after changes
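    A userspace sketch of the table-of-mutexes idea (the table size, the hash, and names such as fault_mutex_for() are assumptions made for illustration): faults on different huge pages hash to different mutexes and proceed in parallel, while two faults on the same page still serialize.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      #define FAULT_MUTEX_TABLE_SIZE 64    /* illustrative; the real table is sized at boot */

      static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_TABLE_SIZE];

      /* Pick a mutex based on which mapping and which huge page is faulting. */
      static pthread_mutex_t *fault_mutex_for(const void *mapping, uint64_t page_index)
      {
          uint64_t key = (uint64_t)(uintptr_t)mapping ^ (page_index * 0x9e3779b97f4a7c15ULL);
          return &fault_mutex_table[key % FAULT_MUTEX_TABLE_SIZE];
      }

      int main(void)
      {
          for (int i = 0; i < FAULT_MUTEX_TABLE_SIZE; i++)
              pthread_mutex_init(&fault_mutex_table[i], NULL);

          int dummy_mapping;                /* stands in for a struct address_space */
          pthread_mutex_t *m = fault_mutex_for(&dummy_mapping, 42);
          pthread_mutex_lock(m);
          /* ... allocate/instantiate the huge page for (mapping, index 42) ... */
          pthread_mutex_unlock(m);
          printf("used slot %ld\n", (long)(m - fault_mutex_table));
          return 0;
      }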
  • 42. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. epoll_ctl() Scaling
    •Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance (the userspace side of these operations is shown below).
    •Further investigation identified a set of patches already developed for the upstream Linux kernel:
    •Don’t take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it’s not needed)
    •Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals
    •RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates
    Java operations with 16-sockets / 240-cores / 480-threads: 87,903 before changes → 165,119 after changes
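    For reference, these are the userspace operations whose kernel path the patches speed up; the example below builds the kind of simple, non-nested topology for which EPOLL_CTL_ADD no longer needs the global epmutex. The eventfd descriptor is just a convenient stand-in for the sockets a real Java workload would register.

      #include <stdio.h>
      #include <sys/epoll.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      int main(void)
      {
          int epfd = epoll_create1(0);
          int evfd = eventfd(0, 0);
          if (epfd < 0 || evfd < 0) {
              perror("setup");
              return 1;
          }

          struct epoll_event ev = { .events = EPOLLIN, .data.fd = evfd };
          if (epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev) < 0)   /* simple topology: no epmutex needed */
              perror("EPOLL_CTL_ADD");

          /* ... the workload adds/removes descriptors like this from many threads ... */

          if (epoll_ctl(epfd, EPOLL_CTL_DEL, evfd, NULL) < 0)  /* DEL path now protected by RCU */
              perror("EPOLL_CTL_DEL");

          close(evfd);
          close(epfd);
          return 0;
      }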
  • 43. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-default: Perf Improvements
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
    3.0.101-0.30-default: 946,552
  • 44. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 11sp3 Kernel Version: 3.0.101-0.30-bigsmp
  • 45. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Why the “bigsmp” kernel?
    •Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)
    •The User Application ABI remains the same –all applications that run on the “default” SLES 11sp3 kernel have full binary and source compatibility with the “bigsmp” SLES 11sp3 kernel.
    •There was a small possibility that this kABI breakage would impact kernel drivers and modules
    •Rather than risk compatibility issues at customer sites, SUSE created the “bigsmp” flavor of the SLES 11sp3 kernel which contains these additional performance enhancements.
    •The bigsmp flavor of SLES 11sp3 has its own kABI
    •Requires a recompile of kernel drivers and modules
    •SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.
    •All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA “default” kernels. Bigsmp will be an on-going flavor for SLES 11sp3 for all platforms.
  • 46. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Mutex MCS Optimizations
    •The MCS lock is a new locking primitive inside Linux
    •Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.
    •Maintains a list of spinning waiters.
    •When a lock is released the unlocker changes the local variable of the next spinner.
    •This change causes the spinner to stop spinning and acquire the lock.
    •Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when simple CAS (compare-and-swap) calls fail.
    •Fair, passing the lock to each locker in the order that the locker arrived.
    •Specialized cancelable MCS locking was applied internally to kernel mutexes
    •The cancelable MCS lock is a specially tailored lock for MCS: when needing to reschedule, we need to abort the spinning in order to block.
    Java operations with 16-sockets / 240-cores / 480-threads: 137,268 before changes → 250,981 after changes
  • 47. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. All Mutex Lock Performance Improvements
    •Performance measurements were done with a popular Java based workload (higher number of operations is better)
    •System used: 16-sockets, 240-cores, 480-threads
    Number of operations:
    Baseline prior to any mutex changes: 72,089
    3 non-MCS mutex changes: 137,268 (90% over baseline)
    All mutex changes: 250,891 (248% over baseline, 83% over non-MCS mutex)
  • 48. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Per-thread VMA caching
    •A process’s address space is divided among VMAs (virtual memory areas) –each storing a range of addresses that share similar properties, such as permissions.
    −A common operation when dealing with memory is locating (find_vma()) a VMA that contains a range of addresses.
    •Traditionally the Linux kernel will cache the last used VMA.
    −Avoids expensive tree lookups (scales poorly in multi-threaded programs).
    −This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).
    •Replace the cache with a small, per-thread hash table (see the sketch below).
    −O(1) lookups/updates, cheap to maintain and small overhead.
    −Improves poor-locality hit rates to ~99.9%.
    −Improves Oracle 11g Data Mining (4k pages) hit rates from 70% to 91%.
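    A rough userspace sketch of the per-thread cache in front of an expensive range lookup. The cache size, the hash, and the names (toy_vma, find_vma_cached) are invented for illustration; the kernel's version fronts the mm rbtree and must also be invalidated when the address space changes, which this sketch ignores.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      struct toy_vma { uintptr_t start, end; };

      #define VMA_CACHE_SIZE 4
      static _Thread_local struct toy_vma *vma_cache[VMA_CACHE_SIZE];

      static unsigned cache_slot(uintptr_t addr)
      {
          return (addr >> 12) & (VMA_CACHE_SIZE - 1);   /* a few page-number bits */
      }

      /* Linear scan stands in for the expensive tree walk. */
      static struct toy_vma *slow_find_vma(struct toy_vma *table, size_t n, uintptr_t addr)
      {
          for (size_t i = 0; i < n; i++)
              if (addr >= table[i].start && addr < table[i].end)
                  return &table[i];
          return NULL;
      }

      static struct toy_vma *find_vma_cached(struct toy_vma *table, size_t n, uintptr_t addr)
      {
          unsigned slot = cache_slot(addr);
          struct toy_vma *vma = vma_cache[slot];
          if (vma && addr >= vma->start && addr < vma->end)
              return vma;                                /* O(1) hit */
          vma = slow_find_vma(table, n, addr);
          if (vma)
              vma_cache[slot] = vma;                     /* remember for next time */
          return vma;
      }

      int main(void)
      {
          struct toy_vma vmas[2] = { { 0x1000, 0x5000 }, { 0x9000, 0xb000 } };
          printf("%p\n", (void *)find_vma_cached(vmas, 2, 0x2000));
          printf("%p\n", (void *)find_vma_cached(vmas, 2, 0x2fff));  /* served from the cache */
          return 0;
      }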
  • 49. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: Perf Improvements
    •Performance measurements were done with a popular Java based workload (higher number of operations is better)
    Java operations, 16-sockets / 240-cores / 480-threads / 12-TB:
    3.0.101-0.8-default: 6,502 (can’t get a valid result at 16-sockets)
    3.0.101-0.15-default: 67,447
    3.0.101-0.30-default: 946,552
    3.0.101-0.30-bigsmp: 994,771
  • 50. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp vs. 3.0.101-0.8-default
  • 51. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.8-default Scaling
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations at 1/2/4/8/16 sockets:
    1-socket / 15-cores: 69,426
    2-sockets / 30-cores: 114,617
    4-sockets / 60-cores: 180,922
    8-sockets / 120-cores: 139,825
    16-sockets / 240-cores: 6,502 (can’t get a valid result at 16-sockets)
    •Scaling issues at 8- and 16-sockets. Other Linux distributions are similar.
  • 52. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp Scaling
    •Performance measurements were done with a popular Java based workload (a higher number of operations is better)
    Java operations at 1/2/4/8/16 sockets:
    1-socket / 15-cores: 71,517
    2-sockets / 30-cores: 132,996
    4-sockets / 60-cores: 260,423
    8-sockets / 120-cores: 512,953
    16-sockets / 240-cores: 994,771
    Scaling is now approximately 1.9x for each doubling of sockets!
  • 53. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 alltests
    ~4x performance improvement!
    (Chart: AIM-7 alltests workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 54. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 custom
    ~5x performance improvement!
    (Chart: AIM-7 custom workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 55. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 high_systime
    ~4x performance improvement!
    (Chart: AIM-7 high_systime workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 56. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 3.0.101-0.30-bigsmp: AIM-7 shared
    ~3.75x performance improvement!
    (Chart: AIM-7 shared workload, 16-sockets / 240-cores / 480-threads / 12-TB – jobs per minute vs. simulated users (10 to 2000) for 3.0.101-0.30-bigsmp, 3.0.101-0.30-default, 3.0.101-0.15-default, and 3.0.101-0.8-default.)
  • 57. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Performance Enhancements SLES 12
  • 58. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. IPC Lock with System V Semaphores
    •The global IPC spinlock is held too long in various System V Semaphore paths
    •When one semaphore array is doing lots of operations, scaling is poor
    •Single semaphore arrays now use a finer-grained spinlock as an alternative to the coarse-grained IPC lock (see the sketch below).
    •Introduced rcu_read_lock-only methods to obtain IPC objects
    •Similar changes were made to System V Shared Memory and Message Queues
    % time spinning on the IPC lock (Oracle 11gR2 Data Mining): before changes: 21.86% (400 users), 84.35% (800 users); after changes: 5.23% (400 users), 7.39% (800 users)
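    A toy model of the finer-grained semaphore locking (the data structures and names are invented, and the kernel's actual exclusion scheme between the two paths is more subtle): operations touching a single semaphore take only that semaphore's lock, while multi-semaphore operations fall back to the array-wide lock plus the per-semaphore locks they touch.

      #include <pthread.h>
      #include <stdio.h>

      #define NSEMS 64

      /* Toy semaphore array: one coarse array-wide lock plus one lock per semaphore. */
      struct toy_sem       { pthread_mutex_t lock; int value; };
      struct toy_sem_array { pthread_mutex_t array_lock; struct toy_sem sems[NSEMS]; };

      /* Single-semaphore operation: only the per-semaphore lock is taken, so
       * operations on different semaphores in the same array run in parallel. */
      static void simple_op(struct toy_sem_array *sma, int idx, int delta)
      {
          pthread_mutex_lock(&sma->sems[idx].lock);
          sma->sems[idx].value += delta;
          pthread_mutex_unlock(&sma->sems[idx].lock);
      }

      /* Operation spanning two semaphores (assumes a != b): take the coarse lock
       * plus the per-semaphore locks involved, in index order to avoid deadlock. */
      static void complex_op(struct toy_sem_array *sma, int a, int b, int delta)
      {
          int lo = a < b ? a : b, hi = a < b ? b : a;
          pthread_mutex_lock(&sma->array_lock);
          pthread_mutex_lock(&sma->sems[lo].lock);
          pthread_mutex_lock(&sma->sems[hi].lock);
          sma->sems[a].value += delta;
          sma->sems[b].value -= delta;
          pthread_mutex_unlock(&sma->sems[hi].lock);
          pthread_mutex_unlock(&sma->sems[lo].lock);
          pthread_mutex_unlock(&sma->array_lock);
      }

      int main(void)
      {
          static struct toy_sem_array sma;
          pthread_mutex_init(&sma.array_lock, NULL);
          for (int i = 0; i < NSEMS; i++) {
              pthread_mutex_init(&sma.sems[i].lock, NULL);
              sma.sems[i].value = 0;
          }
          simple_op(&sma, 3, 1);
          complex_op(&sma, 3, 7, 2);
          printf("sem[3]=%d sem[7]=%d\n", sma.sems[3].value, sma.sems[7].value);
          return 0;
      }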
  • 59. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. SysV Semaphore Performance Improvements
    •Performance measurements were done with an Oracle 11gR2 Data Mining workload
    •System used: 8-sockets, 80-cores, 160-threads
    Before changes: 1,257 (100 users), 1,438 (400 users), 1,237 (800 users)
    After changes: 2,805 (100 users), 2,665 (400 users), 2,751 (800 users)
  • 60. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World Record Performance
  • 61. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World record performance: HP CS900 for SAP HANA
    Key takeaways
    •#1 and #2 overall max-jOPS results
    •#1 and #4 overall critical-jOPS results
    •#1 16-socket (16s) results on both max-jOPS and critical-jOPS
    •#1 8-socket (8s) results on max-jOPS
    •16s max-jOPS results 2.1X greater than Fujitsu 16s results
    •8s max-jOPS results are 2.2X greater than Sugon 8s results
    •8s max-jOPS results 1.1X greater than Fujitsu 16s results
    •HP CS900 demonstrates excellent scaling from 8s to 16s
    HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 owns the top two SPECjbb2013 max-jOPS records (critical-jOPS / max-jOPS):
    Sugon I980G10, 8-socket, Intel Xeon E7-8890 v2: 23,058 / 214,961
    Fujitsu SPARC M10-4S, 16-socket, SPARC64 X: 168,127 / 308,936
    Fujitsu SPARC M10-4S, 16-socket, SPARC64 X+: 198,418 / 425,348
    HP ConvergedSystem 900 for SAP HANA 8s/6TB, 8-socket, Intel Xeon E7-2890 v2: 126,617 / 474,575
    HP ConvergedSystem 900 for SAP HANA 16s/12TB, 16-socket, Intel Xeon E7-2890 v2: 247,581 / 888,164
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.
  • 62. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. HP and SUSE Break 1-Million max-jOPS!!!
    HP ConvergedSystem 900 for SAP HANA powered by SLES 11sp3 – #1 16-socket SPECjbb2013 max-jOPS record (critical-jOPS / max-jOPS):
    HP ConvergedSystem 900 for SAP HANA (8s/6TB), 8-socket, Intel Xeon E7-2890 v2: 129,876 / 524,790
    HP ConvergedSystem 900 for SAP HANA (16s/12TB), 16-socket, Intel Xeon E7-2890 v2: 244,162 / 1,002,350
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.
  • 63. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. World record performance: HP CS900 for SAP HANA
    •The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system.
    •Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:
    TBD – provide link for new brief (Nov 2014)
    http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en (June 2014)
    •Official benchmark results for HP ConvergedSystem 900 for SAP HANA on spec.org:
    TBD – provide link for new result (16s/240c/12TB, Nov 2014)
    TBD – provide link for new result (8s/120c/6TB, Nov 2014)
    http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB, June 2014)
    http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB, June 2014)
    SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).
  • 64. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Wrap-up
  • 65. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Acknowledgements •HP Linux Kernel Performance team contributing this work: −Davidlohr Bueso −Tuan Bui −Waiman Long −Jason Low −Scott Norton −Thavatchai Makphaibulchoke −Tom Vaden −Aswin Chandramouleeswaran 65 •SUSE Linux R&D team contributing to this work: −Scott Bahling −Matthias Eckermann −Mike Galbraith −Mel Gorman −Jan Kara −Jeff Mahoney −Ahmad Sadeghpour −Miklos Szeredi
  • 66. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Thank you
  • 67. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Back-up Slides
  • 68. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. MCS/Queued Lock Prototype (1-2)
    We developed a prototype MCS/Queued lock to see the effect on cache line traffic (MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word)

      typedef struct _local_qnode {
          volatile bool waiting;
          volatile struct _local_qnode *volatile next;
      } local_qnode;

      static inline void
      mcsfile_lock_acquire(mcsglobal_qlock *global, local_qnode_ptr me)
      {
          local_qnode_ptr pred;

          me->next = NULL;
          pred = xchg(global, me);
          if (pred == NULL)
              return;
          me->waiting = true;
          pred->next = me;
          while (me->waiting)
              ;                       /* spin on local mem */
      }

      static inline void
      mcsfile_lock_release(mcsglobal_qlock *global, local_qnode_ptr me)
      {
          local_qnode_ptr succ;

          if (!(succ = me->next)) {
              if (cmpxchg(global, me, NULL) == me)
                  return;
              do {
                  succ = me->next;
              } while (!succ);        /* wait for succ ready */
          }
          succ->waiting = false;
      }
  • 69. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. MCS/Queued Lock Prototype (2-2)
    Replacing the files_lock spinlock with the prototype mcsfiles_lock MCS/Queued spinlock

      extern mcsglobal_qlock mcsfiles_lock;
      #define file_list_lock(x) mcsfile_lock_acquire(&mcsfiles_lock, &x);
      #define file_list_unlock(x) mcsfile_lock_release(&mcsfiles_lock, &x);

      void file_kill(struct file *file)
      {
          volatile local_qnode lq;

          if (!list_empty(&file->f_u.fu_list)) {
              file_list_lock(lq);
              list_del_init(&file->f_u.fu_list);
              file_list_unlock(lq);
          }
      }

      void file_move(struct file *file, struct list_head *list)
      {
          volatile local_qnode lq;

          if (!list)
              return;
          file_list_lock(lq);
          list_move(&file->f_u.fu_list, list);
          file_list_unlock(lq);
      }
  • 70. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. AIM7 Benchmark Suite
    •Traditional UNIX system-level benchmark (written in C).
    •Multiple forks, each of which concurrently executes a common, randomly-ordered set of subtests called jobs.
    •Each of the over fifty kinds of jobs exercises a particular facet of system functionality
    •Disk IO operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.
    •AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and random mixed reads and writes.
    •An AIM7 run consists of a series of subruns with the number of tasks, N, being increased after the end of each subrun.
    •Each subrun continues until each task completes the common set of jobs. The performance metric, "Jobs completed per minute", is reported for each subrun.
    •The result of the entire AIM7 run is a table showing the performance metric versus the number of tasks, N.
    •Reference: “Filesystem Performance and Scalability in Linux 2.4.17”, 2002.
  • 71. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. perf-bench futex (1-2)
    •To measure some of the changes done by the futex hash table patchset, a set of futex microbenchmarks was added to perf-bench:
    −perf bench futex [<operation> <all>]
    •Measures latency of different operations:
    −Futex hash
    −Futex wake
    −Futex requeue/wait
  • 72. © Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. perf-bench futex (2-2)
      $ perf bench futex wake
      # Running 'futex/wake' benchmark:
      Run summary [PID 4028]: blocking on 4 threads (at futex 0x7e20f4), waking up 1 at a time.
      [Run 1]: Wokeup 4 of 4 threads in 0.0280 ms
      [Run 2]: Wokeup 4 of 4 threads in 0.0880 ms
      [Run 3]: Wokeup 4 of 4 threads in 0.0920 ms
      …
      [Run 9]: Wokeup 4 of 4 threads in 0.0990 ms
      [Run 10]: Wokeup 4 of 4 threads in 0.0260 ms
      Wokeup 4 of 4 threads in 0.0703 ms (+-14.22%)

      $ perf bench futex hash
      # Running 'futex/hash' benchmark:
      Run summary [PID 4069]: 4 threads, each operating on 1024 futexes for 10 secs.
      [thread 0] futexes: 0x1982700 ... 0x19836fc [ 3507916 ops/sec ]
      [thread 1] futexes: 0x1983920 ... 0x198491c [ 3651174 ops/sec ]
      [thread 2] futexes: 0x1984ab0 ... 0x1985aac [ 3557171 ops/sec ]
      [thread 3] futexes: 0x1985c40 ... 0x1986c3c [ 3597926 ops/sec ]
      Averaged 3578546 operations/sec (+-0.85%), total secs = 10