jvm/java: towards lock-
free concurrency
Arvind Kalyan
Engineer at LinkedIn
agenda
intro to concurrency & memory model on jvm
reordering -> barriers -> happens-before
jdk 8 concurrency primitives
volatile -> atomics, collections, explicit locks, fork/join
trends in this area (to make all of this practical)
lock-free, STM, TM
background
for control and performance, there are sometimes valid
reasons to use locks (like a mutex) for concurrency control
in most other situations, primitive synchronization
constructs inside individual modules lead to unreliable &
incorrect programs in most non-trivial systems composed
of such modules
the best practice, in the current state of things, is to write
single-threaded programs
‘automatic’ concurrency
there are platforms that take your single
threaded program and run it concurrently —
most web servers do this, for example
on the other hand, there are times when you
really must use multiple threads
practicality
concurrency control techniques have been studied for a
while, but since 2005 they have been studied intensely* to
make them more practical for more widespread (and safer) use
simpler software techniques, and also hardware level
support for those techniques are being developed
before we see how to write safe code using these new
techniques, let’s look into some basics
* https://scholar.google.com/scholar?as_ylo=2005&q=%22software+transactional+memory%22
why concurrency control?
when dealing with multiple threads,
concurrency control/synchronization is
necessary not only to guard critical sections
from multiple threads using a mutex…
but also to ensure that the memory updates
(through mutable variables) are made visible
to all threads ‘correctly’
memory model
as a platform, jvm guarantees that ‘correctly
synchronized’ programs have a very well
defined memory behavior
let’s look into the jvm memory model which
defines those guarantees
memory model
your code manipulates memory by using variables and
objects
the memory is separated by a few layers of caches from
the cpu
on a multi-core cpu when a write happens in one cpu’s
cache, we need to make it visible to other cpus as well
and then there is the topic of re-ordering…
* http://en.wikipedia.org/wiki/Memory_barrier
memory model
to improve performance, the hardware (cpu,
caches, …) reorders memory access using its
own memory model (set of rules)* dynamically
the visibility of a value in a memory location is
further complicated by the code reordering
performed by the compiler statically
* http://en.wikipedia.org/wiki/Memory_ordering
memory model
the static and dynamic reordering strive to
ensure an ‘as-if serial’ semantics
i.e., the program appears to be executing
sequentially as per the lines in your source
code
memory model
memory reordering is transparent in single-
threaded use-cases because of that as-if-
serial guarantee
but logic quickly falls apart and causes
surprises in incorrectly synchronized multi-
threaded programs
memory model
while jvm’s OOTA safety (out of thin air)
guarantees that a thread always reads a value
written by *some* thread, and not some value
out of thin air…
with all the reordering, it’s good to have a
slightly stronger guarantee …
the need for memory barriers
in the following code, say reader is called after writer
(from different threads)

class Reordering {
  int x = 0, y = 0;

  public void writer() {
    x = 1;
    y = 2;
  }

  public void reader() {
    int r1 = y;
    int r2 = x;
    // use r1 and r2
  }
}
in reader, even if r1 == 2, r2 can be 0 or 1
synchronization is needed if we want to control the
ordering (and ensure r2 == 1) using a memory barrier
memory barrier
the jvm memory model essentially defines the
relationship between the variables in your
code
the semantics also define a partial ordering on
the memory operations so certain actions are
guaranteed to ‘happen before’ others
happens-before
happens-before is a visibility guarantee for
memory provided through synchronization
such as locking, volatiles, atomics, etc
…and for completeness, through Thread
start() & join()
Concurrency control on jvm with
JDK 8
with that background, let’s look at some
specific tools & mechanisms available on the
jvm & jdk 8..
Concurrency control on jvm with
JDK 8
volatiles
atomics
concurrent collections/data-structures
synchronizers
fork/join framework
volatiles
volatiles are typically used as state variables
across threads
writing to & reading from a volatile is like releasing
and acquiring a monitor (lock), respectively
i.e., it guarantees a happens-before relationship
not just for other volatile variables but also for
non-volatile memory
volatiles
typical use of volatiles with reader and writer called from
different threads:

class VolatileExample {
  int x = 0;
  volatile boolean v = false;

  public void writer() {
    x = 42;
    v = true;
  }

  public void reader() {
    if (v == true) {
      // uses x - guaranteed to see 42.
    }
  }
}
the happens-before guarantee in jvm memory model makes it
simpler to reason about the value in x, even though x is non-
volatile!
code: https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
volatiles
guaranteeing happens-before relationship for
non-volatile memory is a performance
overhead, so like any other synchronization
primitive, it must be used judiciously
but it greatly simplifies the program by aligning
the dynamic and static reordering with most
programmers’ expectations
atomics
atomics* extend the notion of volatiles, and support
conditional updates
being an extension to volatiles, they guarantee
happens-before relationship on memory operations
the updates are performed through a CAS cpu
instruction
* http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html
atomics
atomics/cas allow designing non-blocking
algorithms where the critical section is around
a single variable
if there is more than one variable, other forms
of synchronization are needed
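to make the conditional-update idea concrete, here is a minimal sketch (class and method names are mine) of the CAS retry loop that methods like AtomicInteger.incrementAndGet() are built on:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // classic CAS retry loop: read, compute, attempt a conditional update,
    // and retry if another thread won the race in between
    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public int get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Runnable task = () -> { for (int i = 0; i < 10_000; i++) c.increment(); };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.get()); // 20000 - no updates lost, and no locks taken
    }
}
```

no thread ever holds a lock here; a loser of the race simply retries, which is the essence of the non-blocking algorithms mentioned above.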
CAS
JDK 8 uses CAS for ‘lock-free’ operation
at a high-level, it piggy backs on a cpu
provided CAS* instruction —like lock:cmpxchg on
x86
let’s see how jvm dynamically improves the
performance of the hardware provided CAS
*CAS: http://en.wikipedia.org/wiki/Compare-and-swap
CAS/atomics
CAS in recent cpu implementations doesn’t assert the lock# signal to gain
exclusive bus access, but rather relies on efficient cache-coherence
protocols* — unless the operand straddles a cache line
even if that helps CAS to scale on many-core systems, CAS still
adds a lot to local latency, sometimes nearly halting the cpu
to address that local latency, a biased-locking* approach is used
— where uncontended usage of atomics are recompiled
dynamically to not use CAS instructions!
* more about MESI: https://courses.engr.illinois.edu/cs232/sp2009/lectures/x24.pdf

* biased locking in jvm: https://blogs.oracle.com/dave/entry/biased_locking_in_hotspot
biased-locking
the biased-locking feature in jvm extends
beyond atomics, and generalizes to different
kinds of locking (monitor entry & exit) on the
jvm
atomics
before we move on, JDK 7 also provides
‘weakCompareAndSet’ atomic api, which relaxes the
happens-before ordering guarantee
relaxing the ordering makes it very hard to reason
about the program’s execution so its use is limited to
debugging counters, etc
there are better ways of doing this ‘fast’ — which
brings us to…
adders & accumulators
if we used atomics under high contention, the biased-
locking machinery would spend too much time revoking
the bias from a thread
in these high contention situations, adders* help
gather counts by actively reducing contention, and
‘gather’ the value only when sum() or longValue() is
called
* http://download.java.net/lambda/b78/docs/api/java/util/concurrent/atomic/LongAdder.html
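a minimal sketch of the adder idiom (class name is mine): under contention each thread bumps its own internal cell, and the total is gathered only when sum() is called:

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder hits = new LongAdder();
        // two writers hammering the counter; LongAdder spreads the
        // updates over internal cells instead of CASing one variable
        Runnable task = () -> { for (int i = 0; i < 100_000; i++) hits.increment(); };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // the 'gather' happens only here, at sum()
        System.out.println(hits.sum()); // 200000
    }
}
```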
concurrent collections
the JDK also comes with a handful of lock-
free collections
these help in correctly synchronizing larger
data sets than single variables
concurrent collections
ConcurrentHashMap (CHM) uses some of
the concepts listed so far and provides a lock-
free read, and a mostly lock-free write in java 8
relies on a good hashCode to reduce
collisions, after which it reverts to using a lock
for that bin
concurrent collections
CHM — in general — allows concurrent use of
a Map, which can be pretty useful, especially
to represent shared ‘mutating’ state
CHM, together with adders for example,
enable concurrent, lock-free, histogram
generation across threads
more about CHM here, of course: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html
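the histogram idiom mentioned above could be sketched like this (class and method names are mine):

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class Histogram {
    private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // computeIfAbsent is an atomic get-or-create for the bin;
    // LongAdder keeps the per-bin increments contention-friendly
    public void record(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public long count(String key) {
        LongAdder a = counts.get(key);
        return a == null ? 0L : a.sum();
    }

    public static void main(String[] args) {
        Histogram h = new Histogram();
        for (String w : Arrays.asList("a", "b", "a", "c", "a")) {
            h.record(w);
        }
        System.out.println(h.count("a")); // 3
    }
}
```

any number of threads can call record() concurrently without an explicit lock anywhere in this code.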
synchronizers
let’s look at some synchronization primitives…
(a.k.a. ‘source of bugs’)
synchronizers
2 major categories…
coarse-grained locks are usually less
performant, but are easy to code
and, fine-grained locking has potential for
higher performance, but is more error prone
synchronized
synchronized keyword is a coarse grained locking
scheme
you acquire & release locks at method or block level,
typically holding the lock longer than needed
translates directly to jvm synchronization (intrinsic) &
hardware monitor
so its use is currently discouraged (might change in java9)
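for illustration, a minimal sketch of the coarse-grained style (class name is mine): the intrinsic lock on ‘this’ is acquired at method entry and released at exit, held for the whole method even though only one field needs guarding:

```java
public class SynchronizedCounter {
    private long count;

    // coarse-grained: every caller serializes on the same intrinsic
    // monitor for the full duration of the method
    public synchronized void increment() {
        count++;
    }

    public synchronized long get() {
        return count;
    }
}
```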
explicit locks
Locks* enable fine-grained locking
these extend intrinsic locks, and allow unconditional,
polled, timed & interruptible lock acquisition
allow ‘custom’ wait/notify queues (Condition*) on the
same lock
nice features, but …
* http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/Lock.html

* http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html
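the timed, interruptible acquisition mode mentioned above can be sketched like this (class and method names are mine); intrinsic locks offer nothing comparable:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLocking {
    private final Lock lock = new ReentrantLock();
    private int value;

    // timed acquisition: back off instead of blocking forever,
    // which is one way to sidestep deadlock with multiple locks
    public boolean tryIncrement() throws InterruptedException {
        if (!lock.tryLock(50, TimeUnit.MILLISECONDS)) {
            return false; // couldn't get the lock in time; caller can retry later
        }
        try {
            value++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    public int get() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```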
explicit locks
the developer needs to remember to release locks, so the
following style is encouraged:

Lock l = ...;
l.lock();
try {
  // access the resource protected by this lock
} finally {
  l.unlock();
}
it gets *very* complicated when we have to deal with
more than 1 lock
…source of all kinds of bugs & surprises
* http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/Lock.html
ReentrantLock
an implementation of Lock described earlier
supports a fairness policy to deal with lock
starvation — ‘fair’, not ‘fast’
there is nothing special in this lock that makes it
‘reentrant’; on the jvm all intrinsic locks are held
per-thread and are reentrant, unlike POSIX locks,
which are held per-invocation
* http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantLock.html
a note about reentrancy
reentrancy helps encapsulate locking behavior & helps write
cleaner (oop) concurrent code
in simpler cases (a single ‘resource’ used across multiple
methods) this also helps avoid deadlocks:

class A {
  synchronized void run() {
    //..
  }
}

class B extends A {
  synchronized void run() {
    super.run();
  }
}
if intrinsic locks were not reentrant on jvm, the call to
super.run() would be deadlocked
ReentrantLock
ReentrantLock (not reentrancy in general)
has some issues so it must be used with
caution:
it can cause starvation in its default (unfair) mode,
and performs poorly when fairness is enabled
StampedLock
supports optimistic reads & lock upgrades
is not reentrant — needs the stamp, so not
usable across calls to unknown methods
for internal use in thread safe components,
where you fully understand the data, objects
& methods involved
StampedLock
for very short read-only code, optimistic
reads improve throughput by reducing
contention
useful when reading multiple fields of an
object from memory without locking
must call validate() later to ensure consistency
StampedLock
along with optimistic reads, the lock upgrade
capability enables many useful idioms:

StampedLock sl = new StampedLock();
double x, y;
..
double distanceFromOrigin() { // A read-only method
  long stamp = sl.tryOptimisticRead();
  double currentX = x, currentY = y; // read without locking
  if (!sl.validate(stamp)) {
    stamp = sl.readLock(); // upgrade to read-lock if values are dirty
    try {
      currentX = x;
      currentY = y;
    } finally {
      sl.unlockRead(stamp);
    }
  }
  return Math.sqrt(currentX * currentX + currentY * currentY);
}
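the write-upgrade side can be sketched following the idiom in the StampedLock javadoc (class and field names are mine):

```java
import java.util.concurrent.locks.StampedLock;

public class Point {
    private final StampedLock sl = new StampedLock();
    private double x, y;

    // lock-upgrade idiom: start with a read lock and convert it to a
    // write lock only when a write turns out to be necessary
    void moveIfAtOrigin(double newX, double newY) {
        long stamp = sl.readLock();
        try {
            while (x == 0.0 && y == 0.0) {
                long ws = sl.tryConvertToWriteLock(stamp);
                if (ws != 0L) {          // upgrade succeeded
                    stamp = ws;
                    x = newX;
                    y = newY;
                    break;
                } else {                 // upgrade failed: drop the read lock,
                    sl.unlockRead(stamp); // take the write lock, and re-check
                    stamp = sl.writeLock();
                }
            }
        } finally {
            sl.unlock(stamp);            // releases read or write stamp alike
        }
    }

    double xValue() {
        long stamp = sl.readLock();
        try {
            return x;
        } finally {
            sl.unlockRead(stamp);
        }
    }
}
```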
fork/join
unlike regular java.lang.Thread (which is mostly
based on POSIX threads), fork/join tasks never
‘block’
for simple tasks, the overhead of constructing and/or
managing a thread is more expensive than the task
itself
programming on fork/join, in essence, allows
frameworks to optimize such tasks ‘behind the scenes’
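as a sketch of how a task is decomposed for the framework (class name and threshold are mine): split the range until chunks are small, and let the pool's work-stealing scheduler run the subtasks:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    public SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;                 // small enough: just compute
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                          // schedule left half asynchronously
        return right.compute() + left.join(); // compute right half here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // 100000
    }
}
```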
fork/join
going beyond performance, the framework does
nothing to ensure concurrency control
the framework is also only usable in the few
scenarios where the task can be easily decomposed
in a sense, this is not making it easier to create
correct (and fast) programs
lambdas & streams
framework available on jdk 8 for data-
processing workloads
looks ‘functional’ — but due to type-erasure
these aren't typed
‘look’ like anonymous inner class but are
fundamentally different from the ground-up —
enabling jvm optimizations for concurrency & gc
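a minimal sketch of the data-processing style (the numbers are arbitrary): the framework decides how to split and schedule the work on the common fork/join pool, with no explicit threads or locks in the code:

```java
import java.util.stream.LongStream;

public class StreamDemo {
    public static void main(String[] args) {
        // declarative pipeline: the runtime partitions the range and
        // merges the partial counts behind the scenes
        long evens = LongStream.rangeClosed(1, 1_000_000)
                               .parallel()
                               .filter(n -> n % 2 == 0)
                               .count();
        System.out.println(evens); // 500000
    }
}
```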
lock-free
we’ve looked at a few lock-free concepts at a
single-variable level, using CAS
and atomics, which rely on CAS
and optimizations to make CAS faster…
lock-free
but how do we write ‘real-world’ concurrent
applications using lock-free concepts?
i.e., more than just CAS?
lock-free
that brings us to software transactional
memory (STM)!
STM is to concurrency control, what garbage-
collection is to memory management
STM
brings DB transaction concept to regular
memory access
read & write ‘as-if’ there is no contention…
during commit time the system ensures sanity
under the hood
… no locks in the code!
STM
in low contention use-cases (i.e., well-
designed programs), the absence of
synchronization makes execution very fast!
even in poorly designed programs, the
absence of locks makes it easier to focus on
correctness
STM implementation
multiverse[1] is a popular jvm implementation of
STM (groovy and Scala/Akka use it in their STM)
in essence, multiverse implements multiversion
concurrency control (MVCC[2])
Clojure has a language built-in STM feature
[1] http://multiverse.codehaus.org/overview.html

[2] http://en.wikipedia.org/wiki/Multiversion_concurrency_control
STM & composability
the biggest benefit of STM is composability (software
reuse)

class Account {
  private final TxnRef<Date> lastUpdate = …;
  private final TxnInteger balance = …;

  public void incBalance(int amount, Date date) {
    atomic(new Runnable() {
      public void run() {
        balance.inc(amount);
        lastUpdate.set(date);
        if (balance.get() < 0) {
          throw new IllegalStateException("Not enough money");
        }
      }
    });
  }
}

class Teller {
  static void transfer(Account from, Account to, int amount) {
    atomic(new Runnable() {
      public void run() {
        Date date = new Date();
        from.incBalance(-amount, date);
        to.incBalance(amount, date);
      }
    });
  }
}
STM & composability
the Teller class is able to ‘compose’ over other
atomic operations without knowing their internal
details (i.e., what locks they use to synchronize)
so if to.incBalance() fails, the memory effects of
from.incBalance() are not committed and so will
never be visible to other threads!
this is a pretty big deal…
Simplicity
STM makes composing concurrent software
modules appear very trivial
in the absence of locks, it is easier to
conceptualize the code flow
the ability to code atomic operations this way
essentially nullifies the challenges typically
associated with concurrent programming
performance
as stated earlier, stm allows optimistic execution: ‘as
though’ there are no other threads running, so it
increases concurrency
STM synchronizes only when required and falls back to
slower (serialized) executions when necessary
STM performs better than explicit locks as the number
of cores increases beyond 4*
* http://en.wikipedia.org/wiki/Software_transactional_memory

* http://channel9.msdn.com/Shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Software-Transactional-Memory
more performance
apart from just software improvements, cpu
makers have started looking into hardware
support for TM
this is an emerging area, and more advances are
being made beyond TSX from Intel (introduced
with Haswell)*
* https://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell

* http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
STM & Practicality
concurrent programming is getting more
practical
stm brings the simplicity of coarse-grained locking
together with the performance of fine-grained
locking, without using locks
Summary
lock-free concurrency control techniques like
STM not only make it easier to write correct
code…
but also allow platforms (like the JVM) to make
your correct code run faster
References
Being a long slideshow with dense content,
I’ve put references on each slide so you can
read through
Reach out to me on LinkedIn if you’d like more
info or just to discuss!

More Related Content

PPTX
The Java Memory Model
PDF
Java Course 10: Threads and Concurrency
ODP
Concept of thread
PPTX
Threads and multi threading
PDF
Multithreading
PPTX
Java concurrency in practice
PPT
Multi threading
PDF
Thread
The Java Memory Model
Java Course 10: Threads and Concurrency
Concept of thread
Threads and multi threading
Multithreading
Java concurrency in practice
Multi threading
Thread

What's hot (20)

PDF
Concurrency in Java
PDF
Xilkernel
PPT
Java8 - Under the hood
ODP
Multithreading 101
PDF
Java Multithreading Using Executors Framework
PPT
Security Applications For Emulation
PDF
PPT
Efficient Memory and Thread Management in Highly Parallel Java Applications
PDF
Coherence and consistency models in multiprocessor architecture
PPT
Free FreeRTOS Course-Task Management
PPTX
Concurrency in java
PDF
Cache coherence
PPTX
Threads (operating System)
PPTX
Java concurrency - Thread pools
PPT
Multithreading models
PPTX
Cache coherence
PPTX
Networking threads
PPT
Operating System Chapter 4 Multithreaded programming
Concurrency in Java
Xilkernel
Java8 - Under the hood
Multithreading 101
Java Multithreading Using Executors Framework
Security Applications For Emulation
Efficient Memory and Thread Management in Highly Parallel Java Applications
Coherence and consistency models in multiprocessor architecture
Free FreeRTOS Course-Task Management
Concurrency in java
Cache coherence
Threads (operating System)
Java concurrency - Thread pools
Multithreading models
Cache coherence
Networking threads
Operating System Chapter 4 Multithreaded programming
Ad

Viewers also liked (20)

PDF
Lock free algorithms
PDF
50 nouvelles choses que l'on peut faire en Java 8
PDF
Memory Management in the Java HotSpot Virtual Machine
PDF
Java SE 8 for Java EE developers
PDF
Streams and collectors in action
PDF
Déploiement d'une application Java EE dans Azure
PPTX
JFokus 50 new things with java 8
PDF
Java 8 Streams and Rx Java Comparison
PPTX
Java 8 concurrency abstractions
PDF
ArrayList et LinkedList sont dans un bateau
ODP
Java Concurrency, Memory Model, and Trends
PDF
Free your lambdas
PDF
Autumn collection JavaOne 2014
PDF
50 new things you can do with java 8
PDF
Building microservices with Scala, functional domain models and Spring Boot (...
PDF
50 new things we can do with Java 8
PDF
Profiler Guided Java Performance Tuning
PDF
Java Concurrency by Example
PDF
Linked to ArrayList: the full story
PDF
Developing and deploying applications with Spring Boot and Docker (@oakjug)
Lock free algorithms
50 nouvelles choses que l'on peut faire en Java 8
Memory Management in the Java HotSpot Virtual Machine
Java SE 8 for Java EE developers
Streams and collectors in action
Déploiement d'une application Java EE dans Azure
JFokus 50 new things with java 8
Java 8 Streams and Rx Java Comparison
Java 8 concurrency abstractions
ArrayList et LinkedList sont dans un bateau
Java Concurrency, Memory Model, and Trends
Free your lambdas
Autumn collection JavaOne 2014
50 new things you can do with java 8
Building microservices with Scala, functional domain models and Spring Boot (...
50 new things we can do with Java 8
Profiler Guided Java Performance Tuning
Java Concurrency by Example
Linked to ArrayList: the full story
Developing and deploying applications with Spring Boot and Docker (@oakjug)
Ad

Similar to jvm/java - towards lock-free concurrency (20)

DOC
Concurrency Learning From Jdk Source
PPT
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
PPTX
Memory model
DOC
Wiki 2
PDF
Thread Dump Analysis
PPT
Optimizing your java applications for multi core hardware
PPT
Java programing considering performance
PPT
Java Multithreading and Concurrency
PDF
Here comes the Loom - Ya!vaConf.pdf
PPTX
Cloud Module 3 .pptx
PPTX
Multithreading and concurrency in android
PPT
The Pillars Of Concurrency
PDF
S peculative multi
PPTX
PDF
Linux Device Driver parallelism using SMP and Kernel Pre-emption
PPT
Java Threading
PPTX
Introduction to OS LEVEL Virtualization & Containers
PPT
Intro To .Net Threads
PDF
Shared memory Parallelism (NOTES)
PDF
Dosass2
Concurrency Learning From Jdk Source
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
Memory model
Wiki 2
Thread Dump Analysis
Optimizing your java applications for multi core hardware
Java programing considering performance
Java Multithreading and Concurrency
Here comes the Loom - Ya!vaConf.pdf
Cloud Module 3 .pptx
Multithreading and concurrency in android
The Pillars Of Concurrency
S peculative multi
Linux Device Driver parallelism using SMP and Kernel Pre-emption
Java Threading
Introduction to OS LEVEL Virtualization & Containers
Intro To .Net Threads
Shared memory Parallelism (NOTES)
Dosass2

Recently uploaded (20)

PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Sustainable Sites - Green Building Construction
PDF
Digital Logic Computer Design lecture notes
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX
web development for engineering and engineering
PPTX
additive manufacturing of ss316l using mig welding
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Lecture Notes Electrical Wiring System Components
PDF
PPT on Performance Review to get promotions
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
Structs to JSON How Go Powers REST APIs.pdf
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPT
Project quality management in manufacturing
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Operating System & Kernel Study Guide-1 - converted.pdf
Sustainable Sites - Green Building Construction
Digital Logic Computer Design lecture notes
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
web development for engineering and engineering
additive manufacturing of ss316l using mig welding
UNIT 4 Total Quality Management .pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Mechanical Engineering MATERIALS Selection
Lecture Notes Electrical Wiring System Components
PPT on Performance Review to get promotions
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Structs to JSON How Go Powers REST APIs.pdf
Strings in CPP - Strings in C++ are sequences of characters used to store and...
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Project quality management in manufacturing

jvm/java - towards lock-free concurrency

  • 1. jvm/java: towards lock- free concurrency Arvind Kalyan Engineer at LinkedIn
  • 2. agenda intro to concurrency & memory model on jvm reordering -> barriers -> happens-before jdk 8 concurrency primitives volatile -> atomics, collections, explicit locks, fork/join trends in this area (to make all of this practical) lock-free, STM, TM
  • 3. background for control and performance, sometimes there are valid reasons to use locks (like a mutex) for concurrency control in most other situations, primitive synchronization constructs in some modules lead to unreliable & incorrect programs in most non-trivial systems that are composed over such modules the best practice, in the current state, is to write single threaded programs
  • 4. ‘automatic’ concurrency there are platforms that take your single threaded program and run it concurrently — most web servers do this, for example on the other hand, there are times when you really must use multiple threads
  • 5. practicality concurrency control techniques have been studied for a while, but since 2005 it is being studied intensely* to make it more practical for more widespread (and safer) use simpler software techniques, and also hardware level support for those techniques are being developed before we see how to write safe code using these new techniques, let’s look into some basics * https://scholar.google.com/scholar?as_ylo=2005&q=%22software+transactional+memory%22
  • 6. why concurrency control? when dealing with multiple threads, concurrency control/synchronization is necessary not only to guard critical sections from multiple threads using a mutex… but also to ensure that the memory updates (through mutable variables) are made visible to all threads ‘correctly’
  • 7. memory model as a platform, jvm guarantees that ‘correctly synchronized’ programs have a very well defined memory behavior let’s look into the jvm memory model which defines those guarantees
  • 8. memory model your code manipulates memory by using variables and objects the memory is separated by a few layers of caches from the cpu on a multi-core cpu when a write happens in one cpu’s cache, we need to make it visible to other cpus as well and then there is the topic of re-odering… * http://en.wikipedia.org/wiki/Memory_barrier
  • 9. memory model to improve performance, the hardware (cpu, caches, …) reorders memory access using its own memory model (set of rules)* dynamically the visibility of a value in a memory location is further complicated by the code reordering performed by the compiler statically http://en.wikipedia.org/wiki/Memory_ordering
  • 10. memory model the static and dynamic reordering strive to ensure an ‘as-if serial’ semantics i.e., the program appears to be executing sequentially as per the lines in your source code
  • 11. memory model memory reordering is transparent in single- threaded use-cases because of that as-if- serial guarantee but logic quickly falls apart and causes surprises in incorrectly synchronized multi- threaded programs
  • 12. memory model while jvm’s OOTA safety (out of thin air) guarantees that a thread always reads a value written by *some* thread, and not some value out of thin air… with all the reordering, it’s good to have a slightly stronger guarantee …
  • 13. the need for memory barriers in the following code, say reader is called after writer (from different threads)
 class Reordering {
 int x = 0, y = 0;
 public void writer() {
 x = 1;
 y = 2;
 }
 public void reader() {
 int r1 = y;
 int r2 = x;
 // use r1 and r2
 }
 } in reader, even if r1 == 2, r2 can be 0 or 1 synchronization is needed if we want to control the ordering (and ensure r2 == 1) using a memory barrier
  • 14. memory barrier the jvm memory model essentially defines the relationship between the variables in your code the semantics also define a partial ordering on the memory operations so certain actions are guaranteed to ‘happen before’ others
  • 15. happens-before happens-before is a visibility guarantee for memory provided through synchronization such as locking, volatiles, atomics, etc …and for completeness, through Thread start() & join()
  • 16. Concurrency control on jvm with JDK 8 with that background, let’s look at some specific tools & mechanisms available on the jvm & jdk 8..
  • 17. Concurrency control on jvm with JDK 8 volatiles atomics concurrent collections/data-structures synchronizers fork/join framework
  • 18. volatiles volatiles are typically used as a state variables across threads writing to & reading from a volatile is like releasing and acquiring a monitor (lock), respectively i.e., it guarantees a happens-before relationship not just with other volatile but also non-volatile memory
  • 19. volatiles typical use of volatiles with reader and writer called from different threads:
 class VolatileExample {
 int x = 0;
 volatile boolean v = false;
 public void writer() {
 x = 42;
 v = true;
 }
 public void reader() {
 if (v == true) {
 //uses x - guaranteed to see 42.
 }
 }
 } the happens-before guarantee in jvm memory model makes it simpler to reason about the value in x, even though x is non- volatile! code: https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
  • 20. volatiles guaranteeing happens-before relationship for non-volatile memory is a performance overhead, so like any other synchronization primitive, it must be used judiciously but, it greatly simplifies the program and by aligning the dynamic and static reordering with most programmers’ expectations
  • 21. atomics atomics* extend the notion of volatiles, and support conditional updates being an extension to volatiles, they guarantee happens-before relationship on memory operations the updates are performed through a CAS cpu instruction * http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html
  • 22. atomics atomics/cas allow designing non-blocking algorithms where the critical section is around a single variable if there is more than one variable, other forms of synchronization is needed
  • 23. CAS JDK 8 uses CAS for ‘lock-free’ operation at a high-level, it piggy backs on a cpu provided CAS* instruction —like lock:cmpxchg on x86 let’s see how jvm dynamically improves the performance of the hardware provided CAS *CAS: http://en.wikipedia.org/wiki/Compare-and-swap
  • 24. CAS/atomics CAS in recent cpu implementations don’t assert the lock# to gain exclusive bus access, but rather rely on efficient cache-coherence protocols* — unless the memory address is not cache-line aligned even if that helps CAS to scale on many-core systems, CAS still adds a lot to local latency, sometimes nearly halting the cpu to address that local latency, a biased-locking* approach is used — where uncontended usage of atomics are recompiled dynamically to not use CAS instructions! * more about MESI: https://courses.engr.illinois.edu/cs232/sp2009/lectures/x24.pdf
 * biased locking in jvm: https://blogs.oracle.com/dave/entry/biased_locking_in_hotspot
  • 25. biased-locking the biased-locking feature in jvm extends beyond atomics, and generalizes to different kinds of locking (monitor entry & exit) on the jvm
  • 26. atomics before we move on, JDK 7 also provides ‘weakCompareAndSet’ atomic api, which relaxes the happens-before ordering guarantee relaxing the ordering makes it very hard to reason about the program’s execution so its use is limited to debugging counters, etc there are better ways of doing this ‘fast’ — which brings us to…
  • 27. adders & accumulators under high contention, the biased locking would be spending too much time in lock revocation from a thread if we used atomics in these high contention situations, adders* help gather counts by actively reducing contention, and ‘gather’ the value only when sum() or longValue() is called * http://download.java.net/lambda/b78/docs/api/java/util/concurrent/atomic/LongAdder.html
  • 28. concurrent collections the JDK also comes with a handful of lock- free collections these help in correctly synchronizing larger data sets than single variables
  • 29. concurrent collections ConcurrentHashMap (CHM) uses some of the concepts listed so far and provides a lock-free read, and a mostly lock-free write in java 8 relies on a good hashCode to reduce collisions, after which it reverts to using a lock for that bin
  • 30. concurrent collections CHM — in general — allows concurrent use of a Map which can be pretty useful especially to represent a shared ‘mutating’ state, and such CHM, together with adders for example, enables concurrent, lock-free, histogram generation across threads more about CHM here, of course: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html
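the CHM-plus-adders histogram mentioned above can be sketched like this (the `Histogram` class is my own illustration): `computeIfAbsent` installs each bin atomically, and `LongAdder` absorbs the contended increments

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class Histogram {
    // Build a word-count histogram across threads: CHM handles per-bin
    // synchronization, LongAdder absorbs contended increments on hot bins.
    static ConcurrentHashMap<String, LongAdder> count(List<String> words) {
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        words.parallelStream().forEach(w ->
                counts.computeIfAbsent(w, k -> new LongAdder()).increment());
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b");
        System.out.println(count(words).get("a").sum()); // 3
    }
}
```

no explicit lock appears anywhere, yet the result is correct under concurrent updates — the combination the slides describe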
  • 31. synchronizers let’s look at some synchronization primitives… (a.k.a. ‘source of bugs’)
  • 32. synchronizers 2 major categories… coarse-grained locks are usually less performant, but are easy to code and, fine-grained locking has potential for higher performance, but is more error prone
  • 33. synchronized the synchronized keyword is a coarse-grained locking scheme you acquire & release locks at method or block level, typically holding the lock longer than needed it translates directly to jvm (intrinsic) synchronization & the hardware monitor, so its use is currently discouraged (might change in java9)
  • 34. explicit locks Locks* enables fine-grained locking these extend intrinsic locks, and allow unconditional, polled, timed & interruptible lock acquisition allow ‘custom’ wait/notify queues (Condition*) on the same lock nice features, but … * http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/Lock.html
 * http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html
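the timed-acquisition feature mentioned above can be sketched as follows (the `TimedLock` wrapper is my own illustration, not from the Lock api): instead of blocking forever, the caller bounds the wait and can back off

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLock {
    private final Lock lock = new ReentrantLock();

    // Timed acquisition: give up instead of blocking forever.
    // Returns false if the lock could not be acquired within the timeout.
    public boolean updateIfAvailable(Runnable update) throws InterruptedException {
        if (!lock.tryLock(100, TimeUnit.MILLISECONDS)) {
            return false; // caller can back off, retry, or report failure
        }
        try {
            update.run();
            return true;
        } finally {
            lock.unlock(); // always released, even if update throws
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedLock t = new TimedLock();
        System.out.println(t.updateIfAvailable(() -> {})); // true (uncontended)
    }
}
```

none of this is possible with the synchronized keyword, which is the main argument for explicit locks despite their added ceremony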
  • 35. explicit locks developer needs to remember to release locks, so following style is encouraged:
 Lock l = ...;
 l.lock();
 try {
 // access the resource protected by this lock
 } finally {
 l.unlock();
}

it gets *very* complicated when we have to deal with more than one lock …source of all kinds of bugs & surprises * http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/Lock.html
  • 36. ReentrantLock an implementation of Lock described earlier supports a fairness policy to deal with lock starvation — ‘fair’, not ‘fast’ there is nothing special in this lock to make it ‘reentrant’; all intrinsic locks on the jvm are per-thread and reentrant, unlike default POSIX mutexes * http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantLock.html
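a small sketch showing both points at once — the fairness constructor flag and per-thread reentrancy (the demo class is mine, for illustration):

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReentrancyDemo {
    static int doubleAcquireHoldCount() {
        // Pass true for the 'fair' policy: FIFO hand-off to waiting threads
        // ('fair', not 'fast' — throughput drops under fairness).
        ReentrantLock lock = new ReentrantLock(true);
        lock.lock();
        lock.lock(); // the same thread may re-acquire: hold count goes to 2
        int holds = lock.getHoldCount();
        lock.unlock();
        lock.unlock(); // one unlock per lock(), or the lock stays held
        return holds;
    }

    public static void main(String[] args) {
        System.out.println(doubleAcquireHoldCount()); // 2
    }
}
```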
  • 37. a note about reentrancy reentrancy helps encapsulate locking behavior & helps write cleaner (oop) concurrent code in simpler cases (using single ‘resource’ but multiple methods) this also helps avoid deadlocks:
 class A {
 synchronized void run(){
 //..
 }
 }
 class B extends A {
 synchronized void run() {
super.run();
 }
}

if intrinsic locks were not reentrant on the jvm, the call to super.run() would be deadlocked
  • 38. ReentrantLock ReentrantLock (not reentrancy in general) has some issues so it must be used with caution: the default non-fair mode can cause starvation, and the fair mode performs poorly
  • 39. StampedLock supports optimistic reads & lock upgrades is not reentrant — needs the stamp, so not usable across calls to unknown methods for internal use in thread safe components, where you fully understand the data, objects & methods involved
  • 40. StampedLock for very short read-only code, optimistic reads improve throughput by reducing contention useful when reading multiple fields of an object from memory without locking must call validate() later to ensure consistency
  • 41. StampedLock along with optimistic reads, the lock upgrade capability enables many useful idioms:
 StampedLock sl = new StampedLock();
 double x, y;
 ..
 double distanceFromOrigin() { // A read-only method
 long stamp = sl.tryOptimisticRead();
 double currentX = x, currentY = y; // read without locking
 if (!sl.validate(stamp)) {
stamp = sl.readLock(); // fall back to a read-lock if a write intervened
 try {
 currentX = x;
 currentY = y;
 } finally {
 sl.unlockRead(stamp);
 }
 }
 return Math.sqrt(currentX * currentX + currentY * currentY);
 }
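the lock-upgrade direction works too; a sketch of the conversion idiom, adapted from the pattern shown in the StampedLock javadoc (the `Point` class here is my own framing): take a cheap read lock, and convert to a write lock only if an update is actually needed

```java
import java.util.concurrent.locks.StampedLock;

public class Point {
    private final StampedLock sl = new StampedLock();
    private double x, y;

    // Upgrade idiom: start under a read lock, convert to a write lock
    // only when the guarded condition says an update is required.
    void moveIfAtOrigin(double newX, double newY) {
        long stamp = sl.readLock();
        try {
            while (x == 0.0 && y == 0.0) {
                long ws = sl.tryConvertToWriteLock(stamp);
                if (ws != 0L) {          // conversion succeeded
                    stamp = ws;
                    x = newX;
                    y = newY;
                    break;
                } else {                  // contention: release read, take write, re-check
                    sl.unlockRead(stamp);
                    stamp = sl.writeLock();
                }
            }
        } finally {
            sl.unlock(stamp); // unlock() releases whichever mode the stamp holds
        }
    }

    double getX() {
        long s = sl.readLock();
        try { return x; } finally { sl.unlockRead(s); }
    }
}
```

the re-check after re-acquiring as a writer is essential: another thread may have moved the point while the read lock was released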
  • 42. fork/join unlike regular java.lang.Thread (which are mostly based on POSIX threads), fork/join tasks never ‘block’ for simple tasks, the overhead of constructing and/or managing a thread is more expensive than the task itself programming on fork/join, in essence, allows frameworks to optimize such tasks ‘behind the scenes’
  • 43. fork/join going beyond performance, the framework does nothing to ensure concurrency control the framework is also only usable in a few scenarios where the task can be easily decomposed in a sense, this is not making it easier to create correct (and fast) programs
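a minimal sketch of the decompose-then-join style the framework expects (the `SumTask` class is my illustration): split a range until it is small, fork one half, compute the other inline, and join

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: sum directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;                // split the range in half
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                              // schedule the left half
        return right.compute() + left.join();     // compute right inline, then join
    }

    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] ones = new long[10_000];
        java.util.Arrays.fill(ones, 1L);
        System.out.println(parallelSum(ones)); // 10000
    }
}
```

note the worker thread never idles at join(): it helps run queued subtasks instead, which is what the ‘never block’ claim above refers to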
  • 44. lambdas & streams framework available on jdk 8 for data-processing workloads looks ‘functional’ — but due to type-erasure the generic types are gone at runtime lambdas ‘look’ like anonymous inner classes but are fundamentally different from the ground up (compiled via invokedynamic) — enabling jvm optimizations for concurrency & gc
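the style above can be sketched in a few lines (the demo class is mine): the pipeline declares *what* to compute, and `.parallel()` lets the runtime decide how to split the work across the common fork/join pool

```java
import java.util.stream.LongStream;

public class StreamDemo {
    // Declarative pipeline: no explicit threads, no locks; the runtime
    // partitions the range across the common ForkJoinPool.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(x -> x * x)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100)); // 338350
    }
}
```

because the lambdas are side-effect free, the same code is correct whether it runs sequentially or in parallel — the concurrency control problem never arises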
  • 45. lock-free we’ve looked at a few lock-free concepts at the single-variable level: CAS, atomics, and the jvm optimizations that make CAS faster…
  • 46. lock-free but how do we write ‘real-world’ concurrent applications using lock-free concepts? i.e., more than just CAS?
  • 47. lock-free that brings us to software transactional memory (STM)! STM is to concurrency control, what garbage- collection is to memory management
  • 48. STM brings DB transaction concept to regular memory access read & write ‘as-if’ there is no contention… during commit time the system ensures sanity under the hood … no locks in the code!
  • 49. STM in low contention use-cases (i.e., well- designed programs), the absence of synchronization makes execution very fast! even in poorly designed programs, the absence of locks makes it easier to focus on correctness
  • 50. STM implementation multiverse[1] is a popular jvm implementation of STM (groovy and Scala/Akka use it in their STM) in essence, multiverse implements multiversion concurrency control (MVCC[2]) Clojure has a language built-in STM feature [1] http://multiverse.codehaus.org/overview.html 
 [2] http://en.wikipedia.org/wiki/Multiversion_concurrency_control
  • 51. STM & composability the biggest benefit of STM is composability (software reuse)
 class Account {
 private final TxnRef<Date> lastUpdate = …;
 private final TxnInteger balance = …;
 public void incBalance(int amount, Date date){
 atomic(new Runnable() {
 public void run(){
 balance.inc(amount);
 lastUpdate.set(date);
 if(balance.get() < 0) {
 throw new IllegalStateException("Not enough money");
 }
 }
 });
 }
 }
 class Teller {
static void transfer(Account from, Account to, int amount) {
 atomic(new Runnable() {
 public void run() {
 Date date = new Date();
 from.incBalance(-amount, date);
 to.incBalance(amount, date);
 }
 });
 }
 }
  • 52. STM & composability the Teller class is able to ‘compose’ over other atomic operations without knowing their internal details (i.e., what locks they use to synchronize) so if to.incBalance() fails, the memory effects of from.incBalance() are not committed so will never be visible to other threads! this is a pretty big deal…
  • 53. Simplicity STM makes composing concurrent software modules trivial in the absence of locks, it is easier to reason about the code flow the ability to code atomic operations this way sidesteps most of the challenges typically associated with concurrent programming
  • 54. performance as stated earlier, stm allows optimistic execution: ‘as though’ there are no other threads running, so it increases concurrency STM synchronizes only when required and falls back to slower (serialized) executions when necessary STM performs better than explicit locks as the number of cores increase beyond 4* * http://en.wikipedia.org/wiki/Software_transactional_memory
 http://channel9.msdn.com/Shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Software-Transactional-Memory
  • 55. more performance apart from just software improvements, cpu makers have started looking into hardware support for TM this is an emerging area and more advances are being made, apart from Haswell, and TSX from Intel * https://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell
 * http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
  • 56. STM & Practicality concurrent programming is getting more practical stm brings the benefits of fine-grained locking to coarse-grained locking without using locks
  • 57. Summary lock-free concurrency control techniques like STM not only make it easier to write correct code… but also allow platforms (like the JVM) to make your correct code run faster
  • 58. References Being a long slideshow with dense content, I’ve put references on each slide so you can read through Reach out to me on LinkedIn if you’d like more info or just to discuss!