DBA Level 400
About Me
I’m pushing the database engine as hard as I can captain, she’s going to blow.
 An independent SQL consultant.
 A user of SQL Server since 2000.
 14+ years of SQL Server experience.
The ‘Standard’ stuff, and what I’m passionate about!
The Exercise
Squeeze every last drop of performance out of the hardware!
ostress -E -dSingletonInsert -Q"exec usp_insert" -n40
Test Environment
 SQL Server 2016 CTP 2.3
 Windows Server 2012 R2
 2 x 10-core Xeon v3 CPUs at 2.2GHz with hyper-threading enabled
 64GB DDR4 quad-channel memory
 4 x SanDisk Extreme Pro 480GB, RAID 1 (64K allocation unit size)
 ostress used for generating the concurrent workload
 Use the conventional database engine to begin with . . .
I Will Be Using Windows Performance Toolkit . . . A Lot !
 It allows CPU time to be quantified across the whole database engine.
 Not just what Microsoft deems we should see, but everything!
 The database engine equivalent of seeing the Matrix in code form ;-)
Where Everyone Starts From . . . A Monotonically Increasing Key
CREATE TABLE [dbo].[MyBigTable] (
    [c1] [bigint] IDENTITY(1, 1) NOT NULL,
    [c2] [datetime] NULL,
    [c3] [char](111) NULL,
    [c4] [int] NULL,
    [c5] [int] NULL,
    [c6] [bigint] NULL,
    CONSTRAINT [PK_BigTableSeq] PRIMARY KEY CLUSTERED ([c1] ASC)
)
[Screenshots: CPU utilization and wait stats for the baseline run; elapsed time 02:12:26]
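The deck never shows the body of usp_insert, so the following is a minimal sketch of what the singleton insert procedure driven by ostress might look like; the column values are illustrative assumptions.

CREATE PROCEDURE [dbo].[usp_insert]
AS
BEGIN
    SET NOCOUNT ON;
    -- One singleton insert, in its own auto-commit transaction, per call.
    INSERT INTO [dbo].[MyBigTable] ([c2], [c3], [c4], [c5], [c6])
    VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);
END;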
The “Last Page Problem”
[Diagram: a B-tree, HOBT_ROOT at the top; every leaf page but the right-most holds the Min end of the key range, and all concurrent inserts chase the single Max (last) page]
Overcoming The “Last Page” Problem
Key Type                   Elapsed Time (s)
SPID Offset                600
Partition + SPID Offset    616
NEWID()                    982
IDENTITY                   7946
NEWSEQUENTIALID            8170
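The deck does not spell out the winning ‘SPID Offset’ scheme, but a hypothetical illustration is a composite clustered key led by a session-derived value, so that each session appends to its own key range instead of every session chasing the same last page:

CREATE TABLE [dbo].[MyBigTableSpid] (
    [spid] [int] NOT NULL DEFAULT (@@SPID), -- session id spreads the insert points
    [c1]   [bigint] IDENTITY(1, 1) NOT NULL,
    [c2]   [datetime] NULL,
    -- remaining columns as per MyBigTable . . .
    CONSTRAINT [PK_BigTableSpid] PRIMARY KEY CLUSTERED ([spid] ASC, [c1] ASC)
)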
What are we waiting on?
Can Delayed Durability Help ?
Logging Type         Elapsed Time (s)
Delayed durability   265
Conventional         600
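Delayed durability flushes log records when the log buffer fills rather than at every commit, trading a small window of potential data loss for throughput. A minimal sketch, assuming the test database is named SingletonInsert (per the ostress -d switch):

ALTER DATABASE [SingletonInsert] SET DELAYED_DURABILITY = FORCED;
-- FORCED applies to every transaction; ALLOWED lets individual transactions
-- opt in with COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON).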
What Is Wrong In Task Manager ?
Fixing CPU Core Starvation With Trace Flag 8008
 The scheduler with the least load is now favoured over the ‘preferred’ scheduler.
 Documented in this CSS engineer’s note.
 Elapsed time has gone backwards; it is now 453 seconds! Why?
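For reference, the trace flag can be enabled globally at runtime (or with the -T8008 startup parameter):

DBCC TRACEON (8008, -1); -- assign tasks to the least-loaded scheduler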
Where Are Our CPU Cycles Going ?
How Spinlocks Work
A task on a scheduler will spin until it can acquire the spinlock it is after.
For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task’s thread to reach the head of the runnable queue.
Spinlock Backoff
We have to yield the scheduler at some stage !
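Spin and backoff activity per spinlock can be tracked from the undocumented but long-standing spinlock stats DMV; a simple sketch:

SELECT TOP (10)
       name, collisions, spins, spins_per_collision, backoffs
FROM   sys.dm_os_spinlock_stats
ORDER  BY spins DESC;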
Introducing The LOGCACHE_ACCESS Spinlock
[Diagram: threads T0 . . . Tn acquire the LOGCACHE_ACCESS spinlock (guarding the buffer offset cache line, the bit we are interested in) to allocate a slot (1-127) in the log buffer and memcpy their slot content into it; the log writer services the writer queue via an async I/O completion port and signals the thread that issued the commit. Associated wait types: LOGBUFFER, WRITELOG, LOGFLUSHQ]
Anatomy of A Modern CPU
[Diagram: a modern CPU package. Each core has an L0 uop cache, 32KB L1 instruction and data caches and a 256K unified L2 cache; cores share the L3 cache. The un-core carries the memory controller, TLB, memory bus, QPI links and power/clock circuitry]
Memory, Cache Lines and The CPU Cache
[Diagram: memory moves between RAM and the CPU cache in 64-byte cache lines; each line in the cache is identified by a tag, and each new OperationData() object maps onto one or more cache lines]
Spinlocks and Memory
[Diagram: three threads spin_acquire on the same int; the cache line holding it is transferred between cores via each socket’s shared L3 cache, and between the two CPU sockets]
What Happens If We Give The Log Writer Its Own CPU Core ?
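A sketch of the affinity mask change, assuming hyper-threading is enabled (2 x 10 cores = 40 logical processors) and the log writer is homed on the first core of socket 0; excluding that core’s two logical processors from the mask leaves the physical core to the log writer, and explains why the affinity test below runs with 38 insert threads:

ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 2 TO 39;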
We Get The Lowest Elapsed Time So Far

Configuration                                         Elapsed Time (s)
Conventional logging                                  600
Delayed durability                                    265
TF8008, delayed durability                            1193
TF8008, delayed durability, affinity mask change *    231

* With 38 threads, all other tests with 40.
Scalability With and Without A CPU Core Dedicated To The Log Writer
[Chart: Insert Rate (inserts/s, 0-600,000) against Insert Threads (2-38); series: Baseline (Batch Size=1) and Log Writer With Dedicated Core, Batch Size=1]
. . . and What About LOGCACHE_ACCESS Spins ?
[Chart: LOGCACHE_ACCESS spins (0-12,000,000,000) against thread count (2-34); series: Baseline and Log Writer with Dedicated CPU Core]
What Difference Has This Made To Where CPU Time Is Going ?
With the default CPU affinity mask: 63,166,836 ms (40 threads)
Vs.
Log writer with a dedicated CPU core: 220,168 ms (38 threads)
Optimizations That Failed To Make The Grade
 Large memory pages
Allows the translation look-aside buffer (TLB) to cover more memory for logical-to-physical memory mapping.
 Trace flag 2330
Stops spins on OPT_IDX_STATS.
 Trace flag 1118
Prevents mixed allocation extents (enabled by default in SQL Server 2016).
A Different Spinlock Is Now The Most Spin Intensive
A new spinlock is now the most spin intensive: XDESMGR, probably spinlock<109,9,1>. What does it do?
Digging Into The Call Stack To Understand Undocumented Spinlocks
xperf -on PROC_THREAD+LOADER+PROFILE -StackWalk Profile
xperf -d stackwalk.etl
1. Start trace
2. Run workload
3. Stop trace
4. Load trace into WPA
5. Locate spinlock in call stack
6. ‘Invert’ the call stack
Examining The XDESMGR Spinlock By Digging Into The Call Stack
 This serialises access to the part of the database engine that allocates and destroys transaction ids.
 How do you relieve pressure on this spinlock?
 Have multiple insert statements per transaction.
Options For Dealing With The XDESMGR Spinlock
 Relieving pressure on the LOGCACHE_ACCESS spinlock makes the XDESMGR spinlock the bottleneck.
 There are three places to go at this point:
 Increase the number of DML statements per transaction (see the sketch below).
 Shard the table across databases and instances.
 Use in-memory OLTP native transactions.
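A hedged sketch of the first option: wrapping two inserts in one explicit transaction halves the rate at which transaction ids, and hence XDESMGR acquisitions, are consumed. The procedure name is hypothetical:

CREATE PROCEDURE [dbo].[usp_insert_batch2]
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION; -- one transaction id covers two DML statements
        INSERT INTO [dbo].[MyBigTable] ([c2], [c3], [c4], [c5], [c6])
        VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);
        INSERT INTO [dbo].[MyBigTable] ([c2], [c3], [c4], [c5], [c6])
        VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);
    COMMIT TRANSACTION;
END;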
Increasing The Batch Size By Just One Makes A Big Difference !
[Chart: Insert Rate (inserts/s, 0-900,000) against thread count (2-36); series: Baseline (Batch Size=1), Log Writer With Dedicated Core Batch Size=1, Log Writer With Dedicated Core Batch Size=2]
. . . and The Difference This Makes To XDESMGR Spins
[Chart: XDESMGR spins (0-200,000,000,000) against thread count (2-36); series: Baseline (Batch Size=1), Log Writer With Dedicated Core Batch Size=1, Log Writer With Dedicated Core Batch Size=2]
Does It Matter Which NUMA Node The Insert Runs On ?
[Diagram: two 10-core CPU sockets, NUMA node 0 and NUMA node 1. Faster here . . . or faster here?]
“What’s really going to bake your noodle . . .”
8 threads on NUMA node 0: 73 s
8 threads on NUMA node 1: 125 s
What Does Windows Performance Toolkit Have To Tell Us ?
18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms
Vs.
18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms
So I Should Look At Tuning The CPU Affinity Mask ?
 Get the basics right first:
 Minimize transaction log fragmentation (both internal and external).
 Use low-latency storage.
 Avoid log-intensive operations, page splits etc . . .
 Use minimally logged operations where appropriate.
 Only look at giving the log writer a CPU core to itself when:
 All of the above has been done.
 The disk row store engine is being used.
 The workload is OLTP heavy and uses more than 12 CPU cores (6 per socket).
Hard To Solve Logging Issues
 I have to use the disk row store engine.
 My single-threaded app cannot easily be multi-threaded.
 How do I get the best possible log write performance?
 Use NUMA connection affinity to connect to the same socket as the log writer (a configuration sketch follows this list).
 Disable hyper-threading; whole cores are almost always faster than hyper-threads.
 ‘Affinitize’ the rest of the database engine away from the log writer thread’s ‘home’ CPU core.
 Go for a CPU with the best single-threaded performance available.
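NUMA connection affinity is set in SQL Server Configuration Manager rather than T-SQL; as a sketch, a NUMA node bitmask can be appended to the TCP port so that connections arriving on that port are serviced by a specific node (the port usage shown is an assumed example):

-- SQL Server Configuration Manager > SQL Server Network Configuration
-- > Protocols > TCP/IP > IP Addresses > IPAll:
-- TCP Port: 1433[0x1]
-- Connections on port 1433 are then affinitized to NUMA node 0, the node
-- hosting the log writer in this deck's setup.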
The CPU Cycle Cost Of Spinlock Cache Line Transfer
[Diagram: the CPU cycle cost of transferring the spinlock’s cache line: roughly 34 CPU cycles core-to-core on the same socket (via the shared L3) versus roughly 100 CPU cycles core-to-core on different sockets (across QPI)]
Remember, All Memory Access Is CPU Intensive
This Man Seriously Knows A Lot About Memory
 Ulrich Drepper, author of:
What Every Programmer Should Know About Memory
 From Understanding CPU Caches
“Use per CPU memory; lock thread to specific CPU”
This is our CPU affinity mask trick 
Cache Line Ping Pong
[Diagram: an eight-socket server, CPUs 0-7 interconnected through four I/O hubs]
“Cache line ping pong is deadly for performance”
The more CPU sockets and cores you have, the greater the ramifications this has for SQL Server scalability on “big boxes”.
‘Sharding’ The Database Across Instances
[Diagram: Instance A ‘affinitized’ to NUMA node 0 and Instance B ‘affinitized’ to NUMA node 1, each with its own 10-core socket and L3 cache]
 ‘Shard’ databases across instances.
 2 x LOGCACHE_ACCESS and XDESMGR spinlocks.
 Spinlock cache entries are bound by the latency of the L3 cache, not the QuickPath Interconnect.
What Can We Get From An Instance ‘Affinitized’ To One CPU Socket ?
[Chart: Insert Rate (inserts/s, 0-500,000) against thread count (1-18)]
With a Batch Size of 2, 32 Threads Achieve The Best Throughput
[Screenshot annotations: logging-related activity; latching!]
Where to now?
In Memory OLTP To The Rescue, But What Will It Give Us ?
 Only redo is written to the transaction log (durability = SCHEMA_AND_DATA). Does this relieve pressure on the LOGCACHE_ACCESS spinlock?
 Zero latching and locking.
 Native procedure compilation.
 No “last page” problem, due to IMOLTP’s use of hash buckets.
 Spinlocks will still be in play though.
Insert Scalability with A Non Natively Compiled Stored Procedure
[Chart: Insert Rate (inserts/s, 0-600,000) against thread count (1-18); series: Default Engine, IMOLTP Range Index, IMOLTP Hash Index bc=8388608, IMOLTP Hash Index bc=16777216]
What Does The BLOCKER_ENUM Spinlock Protect ?
Transaction synchronization between the default and in-memory OLTP engines ?
Where Are Our CPU Cycles Going, The Overhead Of Language Processing
Time to try native in-memory OLTP transactions and natively compiled stored procedures?
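A sketch of the memory-optimized table and natively compiled procedure behind the next set of results; the object names, bucket count and column values are assumptions, since the deck only shows the numbers:

CREATE TABLE [dbo].[MyBigTableIM] (
    [c1] [bigint] IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 16777216),
    [c2] [datetime] NULL,
    [c3] [char](111) NULL,
    [c4] [int] NULL,
    [c5] [int] NULL,
    [c6] [bigint] NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
GO

CREATE PROCEDURE [dbo].[usp_insert_native]
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    -- The whole body is compiled to machine code: no interpreter overhead.
    INSERT INTO [dbo].[MyBigTableIM] ([c2], [c3], [c4], [c5], [c6])
    VALUES (GETDATE(), 'x', 1, 2, 3);
END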
Insert Scalability with A Natively Compiled Stored Procedure
[Chart: Insert Rate (inserts/s, 0-9,000,000) against thread count (1-40); series: bucket count=8388608, bucket count=16777216, bucket count=33554432, range]
Hash Indexes Bucket Count and Balancing The Equation
Smaller bucket counts = better cache line reuse + reduced TLB thrashing + reduced hash table cache-out.
Larger bucket counts = reduced cache line reuse + increased TLB thrashing, but less hash bucket scanning for lookups.
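Whether a given bucket count sits on the right side of these trade-offs can be checked empirically: long average chains suggest too few buckets, while a sea of empty buckets suggests too many.

SELECT OBJECT_NAME(hs.[object_id]) AS table_name,
       i.[name]                    AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats AS hs
JOIN   sys.indexes AS i
  ON   i.[object_id] = hs.[object_id]
 AND   i.index_id    = hs.index_id;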
Is Our CPU Affinity Mask Trick Relevant To In Memory OLTP ?
 Default CPU affinity mask and 18 insert threads.
 A CPU core dedicated to the log writer and 18 insert threads.
Optimizations That Failed To Make The Grade
 Large memory pages
As with the default database engine, this made no difference to performance.
 Turning off adjacent cache line pre-fetching
This can degrade performance by saturating the memory bus when hyper-threading is in use, and cause cache pollution when the pre-fetched line is not used.
Takeaways
 Monotonically increasing keys do not scale with the default database engine.
 Dedicate a CPU core to the log writer to relieve pressure on the LOGCACHE_ACCESS spinlock.
 Batch DML statements together per transaction to relieve XDESMGR spinlock pressure.
 The further the LOGCACHE_ACCESS spinlock cache line has to travel, the more performance is degraded.
 Native compilation results in a performance increase of at least an order of magnitude over non-natively compiled stored procedures.
 There is a bucket count “sweet spot” for IMOLTP hash indexes, influenced by hash collisions, bucket scans and hash lookup table cache-out.
Further Reading
 Super scaling singleton inserts blog post
 Tuning The LOGCACHE_ACCESS Spinlock On A “Big Box” blog post
 Tuning The XDESMGR Spinlock On A “Big Box” blog post
Super scaling singleton inserts
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Editor's Notes
  • #15: SQL Server 2008 R2 introduced the concept of “Exponential back off”.
  • #20: The log writer is always assigned to the first CPU core of one of the CPU sockets, which is usually socket 0 (NUMA node 0). Because hyper-threading is enabled, each physical CPU core appears in the affinity mask as two logical processors, which is why two logical processors are being removed from the affinity mask. Were hyper-threading disabled, there would be a 1:1 relationship between logical processors and physical CPU cores, in which case only one logical processor would be removed from the affinity mask.
  • #22: LOGBUFFER waits occur when a task is waiting for space in the log buffer to store a log record. Consistently high values may indicate that the log devices cannot keep up with the amount of log being generated by the server. Essentially, 30 threads cause the write bandwidth of our storage to be saturated.
  • #23: The LOGCACHE_ACCESS spins for both tests are very similar; the key difference is that with the “CPU affinity mask trick” we are getting the same number of spins as with the baseline, but with superior insert throughput.
  • #24: Changing the CPU affinity mask has ensured that when the log writer needs to release the cache line associated with the LOGCACHE_ACCESS spinlock, no SQL OS scheduler-level swap-in of the log writer is required first. Not only does this cost us CPU time, but the sharing of a CPU core by the log writer and any other task means that data and instructions in the core’s L1/L2 caches may be wiped out when the other task is running.
  • #26: As is invariably the case with performance tuning, you remove one bottleneck only for a new one to appear somewhere else.
  • #35: I am assuming you are already using the lowest-latency storage available: PCIe-based flash with an NVMe driver. ‘Affinitizing’ the rest of the database engine away from the log writer thread is a grandiose way of referring to the CPU affinity mask trick.
  • #44: 134217728 corresponds to the
  • #45: 134217728 corresponds to the
  • #47: Using a natively compiled stored procedure for the insert into an in-memory table makes a tremendous difference; we can see that even with two threads and a compiled procedure the in-memory OLTP engine is beating its disk-based row store counterpart. Other takeaways include the fact that a hash index beats a range index for insert throughput, and that there is a bucket count sweet spot for the best performance.