Lockless data structures
Sandeep Joshi (DC Engines)
Critical sections...
Stack::pop()
{
  lock.acquire()
  // return top value
  // move top to top->next
  lock.release()
}
HashTable::insert(element)
{
  lock.acquire()
  // find hash bucket and insert
  lock.release()
}
Critical sections
Critical sections are like transactions. They ensure invariants on data structures
continue to hold.
Critical sections can be protected by
1. Locks (default approach)
2. Lockless - use Atomic operations (compare and swap instruction) and
load-store fences
3. Hardware transactional memory ( See Intel xbegin, xend, xabort)
Like Pune traffic
Not all data structures are easy to make lockless
Lists
● Singly-linked, doubly-linked.
● Queue, Stack, Set
Unordered : Hash table (builds on the singly-linked list solution)
Ordered : Skip list (builds on the singly-linked list), Red-Black tree (requires localized
rebalancing), AVL tree (harder due to wider rebalancing).
Lockfree versus Waitfree
Concurrency levels
1. LockFree : the overall system progresses, but individual threads may see delay.
You see retries (e.g. a “while loop” which retries if the atomic operation failed)
2. WaitFree : every operation completes in a finite number of steps (e.g. a read
on a multi-versioned data structure)
The same data structure can have some operations which are lockfree, and others
which are waitfree.
Basic weapon (for this talk)
C++ : std::atomic<T> has compare_exchange_strong(T& expectedValue, T desiredValue)
Java : AtomicReference (and other Atomic types) has compareAndSet(V expectedValue,
V desiredValue)
C (GNU builtin) : __sync_val_compare_and_swap(T* ptr, T expectedValue, T desiredValue)
bool CAS(variable, expectedVal, desiredVal) {  // executes atomically in hardware
  if (variable == expectedVal) {
    variable = desiredVal
    return true
  } else { return false }
}
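The pseudocode above maps directly onto std::atomic. A minimal sketch (the `increment` helper is illustrative, not from the talk) showing the canonical CAS retry loop:

```cpp
#include <atomic>
#include <cassert>

// Lock-free increment: retry until no other thread changed the value
// between our load and our compare-and-swap.
void increment(std::atomic<int>& v) {
    int expected = v.load();
    // on failure, compare_exchange_weak reloads 'expected' for us
    while (!v.compare_exchange_weak(expected, expected + 1)) {}
}
```

This retry-on-failure shape is exactly the "while loop" that makes an operation lock-free rather than wait-free.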
What we will cover
1. Stack
2. Queue
3. RCU
For Lists, see Herb Sutter’s talk on “Lock-free programming” at CppCon 2014.
Stack
class Stack { atomic<Node*> top }
bool Stack::push(int key) {
  Node* newNode = new Node(key)
  do {
    oldHead = top
    newNode->next = oldHead
  } while (not top.compare_exchange(oldHead, newNode))
}
int Stack::pop() {
  do {
    oldHead = top
    nextNode = oldHead->next
  } while (not top.compare_exchange(oldHead, nextNode))
  return oldHead->key
}
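The pseudocode above is a Treiber stack. A compiling sketch follows; note it side-steps memory reclamation by leaking popped nodes, because safe delete needs the ABA/hazard-pointer machinery listed later under "Not covered":

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int key;
    Node* next;
    explicit Node(int k) : key(k), next(nullptr) {}
};

class Stack {
    std::atomic<Node*> top{nullptr};
public:
    void push(int key) {
        Node* n = new Node(key);
        n->next = top.load();
        // retry: on failure, n->next is reloaded with the current top
        while (!top.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(int& key) {
        Node* old = top.load();
        // retry until we swing top from 'old' to 'old->next'
        while (old && !top.compare_exchange_weak(old, old->next)) {}
        if (!old) return false;  // stack was empty
        key = old->key;
        return true;             // 'old' is deliberately leaked (no safe delete)
    }
};
```

Both operations are lock-free: the system as a whole progresses, but a thread can lose the CAS race and retry.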
Stack
Problem : Every thread is doing read-modify-write on the same memory address
(Stack.top). The corresponding cache line keeps bouncing between CPU cores.
Solution : Find a way to match up simultaneous “push” and “pop” calls. Let the
two threads communicate without changing “Stack.top”.
Atomic exchange between 2 threads
State machine: EMPTY (value=nil) → WAITING (value=T1.val) → BUSY (value=T2.val)
1. T1 comes, sets its value, and waits.
2. T2 arrives, finds the value set. It atomically exchanges T1.val with T2.val and
changes the state to BUSY.
3. T1, who is waiting, reads T2.val, resets the state, and returns.
Use “compare and swap” to atomically exchange a value between two threads.
Define an Exchanger {
  state = empty, waiting, busy
  int value
}
Practical implementation in java.util.concurrent.Exchanger
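A C++ sketch of this state machine, assuming small non-negative values so that state and value can be packed into one 64-bit word (that packing is my simplification — it lets a single CAS update both fields, where java.util.concurrent.Exchanger uses a stamped reference):

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

enum : uint64_t { EMPTY = 0, WAITING = 1, BUSY = 2 };

struct Exchanger {
    // pack (state, value) into one word so one CAS changes both
    static uint64_t pack(uint64_t state, uint32_t value) {
        return (state << 32) | value;
    }
    std::atomic<uint64_t> slot{pack(EMPTY, 0)};

    int exchange(int myItem) {
        for (;;) {
            uint64_t cur = slot.load();
            uint64_t state = cur >> 32;
            if (state == EMPTY) {
                // T1's path: park my value and wait for a partner
                if (slot.compare_exchange_strong(cur, pack(WAITING, (uint32_t)myItem))) {
                    for (;;) {
                        uint64_t now = slot.load();
                        if ((now >> 32) == BUSY) {
                            slot.store(pack(EMPTY, 0));   // reset for reuse
                            return (int)(uint32_t)now;    // partner's item
                        }
                    }
                }
            } else if (state == WAITING) {
                // T2's path: swap my item in, take theirs, flip to BUSY
                if (slot.compare_exchange_strong(cur, pack(BUSY, (uint32_t)myItem)))
                    return (int)(uint32_t)cur;
            }
            // state == BUSY: another pair is mid-exchange; retry
        }
    }
};
```

A real implementation would also time out instead of spinning forever when no partner shows up.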
Stack + EliminationArray combination
EliminationArray is an array of Exchanger objects (E1, E2, E3, …, En) sitting
alongside stack.top.
● Every thread (push or pop) first checks in the EliminationArray for a
complementary thread.
● After a timeout, it falls back to Stack.push or Stack.pop.
What we will cover
1. Stack
2. Queue
3. RCU
Queues
Many dimensions to this problem
1. SPSC, SPMC, MPSC, MPMC (SPSC=Single Producer, Single Consumer)
2. Bounded vs unbounded
3. Blocking or nonblocking
4. Priorities, Intrusive, Ordering..
http://www.1024cores.net/home/lock-free-algorithms/queues
Queue with sentinel
● Empty queue : HEAD and TAIL both point to the Sentinel node.
● Enqueue : link the new node after TAIL.
● Dequeue : return the next node’s value and turn that node into the new
sentinel; the old sentinel is deleted.
Unbounded SPSC (*incomplete)
SPSC_Queue { atomic<Node*> Head, Tail; }
enqueue(T elem) {
  Node* newNode = new Node(elem)
  Tail->next = newNode
  Tail.store(newNode)
}
dequeue(T& returnElem) {
  if (Head->next == null) { throw Empty; }
  Node* oldHead = Head;
  returnElem = Head->next->value;
  Head.store(Head->next)
  delete oldHead;
}
Head = Tail = new Node()
First node is always the Sentinel.
Dequeue always returns the value of the next node.
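A compiling version of the slide's sketch, filling in the parts it marks "*incomplete" (the class shape and names are mine). Safe only for exactly one producer thread and one consumer thread:

```cpp
#include <atomic>
#include <cassert>

template <typename T>
class SPSCQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head;  // consumer side: current sentinel
    std::atomic<Node*> tail;  // producer side: last node
public:
    SPSCQueue() {
        Node* sentinel = new Node();  // first node is always the sentinel
        head.store(sentinel);
        tail.store(sentinel);
    }
    void enqueue(const T& elem) {     // producer only
        Node* n = new Node();
        n->value = elem;
        tail.load()->next.store(n);   // publish the node...
        tail.store(n);                // ...then advance tail
    }
    bool dequeue(T& out) {            // consumer only
        Node* sentinel = head.load();
        Node* next = sentinel->next.load();
        if (next == nullptr) return false;  // empty
        out = next->value;            // dequeue returns value of the next node
        head.store(next);             // 'next' becomes the new sentinel
        delete sentinel;              // only the consumer ever frees nodes
        return true;
    }
};
```

Because producer and consumer touch disjoint ends (tail vs head), no CAS retry loop is needed at all.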
Bounded SPSC
ProducerConsumerQueue
● atomic<int> readIndex
● Item records[size] // sits between the two indices, which helps avoid cache line sharing
● atomic<int> writeIndex
enqueue(Item newElem) {
  int freeSlot = writeIndex.load()
  if ((freeSlot + 1) % size != readIndex.load()) {
    records[freeSlot] = newElem
    writeIndex.store((freeSlot + 1) % size)
  }
}
dequeue(Item& returnElem) {
  int curSlot = readIndex.load()
  if (curSlot != writeIndex.load()) {
    returnElem = records[curSlot]
    readIndex.store((curSlot + 1) % size)
  }
}
Based on the Facebook folly library (ProducerConsumerQueue)
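The ring buffer above, as a compiling sketch (names and the template shape are mine, in the spirit of folly::ProducerConsumerQueue). One slot is kept permanently empty so that "full" and "empty" are distinguishable:

```cpp
#include <atomic>
#include <cstddef>
#include <cassert>

template <typename Item, size_t Size>
class BoundedSPSC {
    std::atomic<size_t> readIndex{0};
    Item records[Size];               // separates the indices in memory
    std::atomic<size_t> writeIndex{0};
public:
    bool enqueue(const Item& elem) {  // producer only
        size_t freeSlot = writeIndex.load();
        size_t nextSlot = (freeSlot + 1) % Size;
        if (nextSlot == readIndex.load()) return false;  // full
        records[freeSlot] = elem;
        writeIndex.store(nextSlot);   // publish after the write
        return true;
    }
    bool dequeue(Item& out) {         // consumer only
        size_t curSlot = readIndex.load();
        if (curSlot == writeIndex.load()) return false;  // empty
        out = records[curSlot];
        readIndex.store((curSlot + 1) % Size);
        return true;
    }
};
```

Usable capacity is Size - 1: with Size slots, the producer refuses the write that would make writeIndex catch up to readIndex.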
What we will cover
1. Stack
2. Queue
3. RCU
Multiple readers, one writer (with locks)
READER
1. Take Read lock
2. Safely read pointer and act
3. Release read lock
WRITER
1. Take write lock
2. Free pointer
3. Release write lock
This is the conventional approach
Multiple readers, one writer (RCU)
READER
1. Record new reader
2. Safely access the pointer
3. Inform reader finished
WRITER
1. Switch the pointer
2. Ensure all readers gone (Drain the queue in Grace period)
3. Free pointer
Multiple readers, one writer (RCU)
READER
1. Record new reader (rcu_read_lock)
2. Safely access the pointer
3. Inform reader finished (rcu_read_unlock)
WRITER
1. Switch the pointer (rcu_assign_pointer(ptr,val))
2. Ensure all readers gone (synchronize_rcu)
3. Free pointer
RCU (Read copy update)
On preemptible Linux kernels
1. Preemption is disabled for the Reader on calling “rcu_read_lock()”
2. Writer runs on every CPU core when “synchronize_rcu()” is called, to ensure all
readers have completed.
On real-time Linux kernels : Introduce two queues (current and next) to record the
Readers that were present before and after Writer started.
Userspace RCU : Same API now available for use in userspace
(https://github.com/urcu)
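A deliberately crude user-space analogue of the reader/writer protocol above, just to make the three numbered steps concrete. The shared reader counter stands in for rcu_read_lock/unlock and the drain loop for synchronize_rcu; real RCU exists precisely to avoid this shared counter (readers stay contention-free), and this naive drain can also be held up by readers that arrived after the pointer switch:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int*> gptr{new int(1)};   // the RCU-protected pointer
std::atomic<int>  active_readers{0};  // crude stand-in for per-CPU tracking

int reader() {
    active_readers.fetch_add(1);      // 1. record new reader
    int v = *gptr.load();             // 2. safely access the pointer
    active_readers.fetch_sub(1);      // 3. inform reader finished
    return v;
}

void writer(int newVal) {
    int* old = gptr.exchange(new int(newVal));  // 1. switch the pointer
    while (active_readers.load() != 0) {}       // 2. wait out the grace period
    delete old;                                 // 3. free the old copy
}
```

The key property carries over: readers never block the writer's pointer switch, and the writer only waits before freeing.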
Some tricks used in lockless programming
1. Sentinels
2. Unused bits in 64-bit pointers
3. Lazy delete
4. Two (or more) bottlenecks better than one
5. Padding to avoid false cache line sharing
Trick 1 : Sentinels
Sentinel node is pre-allocated and never deleted.
Head and tail point to Sentinel when List or
Queue is empty.
This helps because when List/Queue transitions
from empty to non-empty or vice-versa, you don’t
have to update two variables atomically which
can get tricky.
class Queue {
  Node *head, *tail;
};
head = tail = new Node(sentinel)
Trick 2 : Unused bits in pointer
Addresses on Intel x86-64 and ARM64 are limited to 48 bits. The unused higher
16 bits can be used to store a “marker” with every pointer. This allows you to use the
“compare-and-swap” instruction to atomically change “pointer + custom info”.
Facebook Folly C++ library : PackedSyncPtr and DiscriminatedPtr exploit this.
Java has AtomicMarkableReference, AtomicStampedReference.
Caveat : The number of unused bits may shrink with newer processors.
Intel also has “CMPXCHG16B” to manipulate 128-bit values.
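Hypothetical helpers showing the packing (the caveat above applies: under 57-bit addressing such as Intel LA57 there are fewer free bits, so treat 48 as an assumption):

```cpp
#include <cstdint>
#include <cassert>

// Pack a 16-bit marker into the unused upper bits of a 48-bit pointer.
inline uint64_t tag_ptr(void* p, uint16_t tag) {
    return reinterpret_cast<uint64_t>(p) | (static_cast<uint64_t>(tag) << 48);
}
inline uint16_t ptr_tag(uint64_t tagged) {
    return static_cast<uint16_t>(tagged >> 48);
}
inline void* untag_ptr(uint64_t tagged) {
    // shift left then arithmetic-shift right: drops the tag and
    // sign-extends bit 47 back to a canonical address
    return reinterpret_cast<void*>(static_cast<int64_t>(tagged << 16) >> 16);
}
```

A 64-bit CAS on the tagged word then updates pointer and marker together, which is exactly what AtomicStampedReference offers in Java.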
Trick 3 : Lazy delete
The updater sets a marker bit (“deleted = true”) on the Node.
A marked Node is skipped during traversal until it is safe to delete it.
Trick 4 : Two bottlenecks better than one
Cache line bouncing is reduced if threads can spin (i.e. do CAS) on multiple
variables instead of one, as seen in the Stack + EliminationArray example earlier.
The same applies to the WaitQueue below. Each thread adds its own node to the
WaitQueue and spins on a local variable inside its Node until woken up by its predecessor.
Wait Queue : T1’s node → T2’s node → T3’s node
Trick 5 : Padding to avoid false cache line sharing
class Queue {
  atomic<int> head;
  char cache_line_pad[CACHE_LINE_SIZE]; // e.g. 64 bytes
  atomic<int> tail; // keeps head and tail on separate cache lines
}
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
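The same padding can be expressed with alignas in modern C++ (the struct name is mine; 64 is an assumption matching the common x86-64 line size, and C++17's std::hardware_destructive_interference_size is the portable spelling):

```cpp
#include <atomic>
#include <cassert>

// alignas pushes each index onto its own cache line, so the producer
// writing tail never invalidates the consumer's line holding head.
struct PaddedQueue {
    alignas(64) std::atomic<int> head{0};
    alignas(64) std::atomic<int> tail{0};
};
```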
Locks vs Lockless
● Locks can increase context switches
● Lockless can increase cache line contention
Which option performs better depends on several factors...
Language support
Golang : Philosophy is to “share memory by communicating instead of
communicating by sharing memory”. But see the “sync/atomic” package.
Java : “volatile” variables ensure sequential consistency. See “java.util.concurrent”
and sun.misc.Unsafe.compareAndSwapObject()
C++ : std::atomic provides multiple levels of consistency
1. sequential consistency.
2. acquire, release, consume (not discussed today).
3. relaxed.
Who uses lockless ?
1. Early adopters were desktop audio drivers [1]
2. MemSQL : pervasive use of lockfree data structures
3. Couchbase : Nitro storage engine
4. DataDomain (EMC) : lockfree doubly linked list
5. Facebook Folly library
6. java.util.concurrent (Doug Lea)
7. Linux kernel (other mechanisms besides RCU)
[1] http://www.rossbencina.com/code/lockfree
Not covered
1. ABA problem and Hazard pointers
2. Weaker memory models
3. Concurrent Skip List, Hash tables, Trees
4. Underlying Memory allocation also needs to be lockfree (e.g. Streamflow)
References
1. Herlihy, et al. The Art of Multiprocessor Programming
2. McKenney, Paul. Is Parallel Programming Hard, And, If So, What Can You
Do About It?
3. http://1024cores.net
4. http://preshing.com
5. http://www.rdrop.com/~paulmck/
6. http://www.rossbencina.com/code/lockfree