Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
Typical first year for a new cluster
— Jeff Dean, Google (LADIS 2009)
• ~5 racks out of 30 go wonky (50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~3 router failures (have to immediately pull traffic for an hour)
Reliable networks are a myth
• GC pause
• Process crash
• Scheduling delays
• Network maintenance
• Faulty equipment
Network
[Diagram: five nodes (n1–n5) communicating over the network]
Network partition
[Diagram: a partition splits n1–n5 into two groups that can no longer exchange messages]
Messages can be lost, delayed, reordered and duplicated
[Diagram: message timelines between n1 and n2 showing a dropped message, a delayed message, a duplicated message, and reordered messages]
CAP recap
• Consistency (Linearizability): A total order on all operations such
that each operation looks as if it were completed at a single instant.
• Availability: Every request received by a non-failing node in the
system must result in a response.
• Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
Have you planned for these?
During and after a partition, both availability and consistency suffer:
• Errors
• Connection timeouts
• Hung requests (read timeouts)
• Stale results
• Dirty results
• Data lost forever!
Jepsen: Testing systems under stress
• Network partitions
• Random process crashes
• Slow networks
• Clock skew
http://github.com/aphyr/jepsen
Anatomy of a Jepsen test
• Automated DB setup
• Test definitions a.k.a. Client
• Partition types a.k.a. Nemesis
• Scheduler of operations (client & nemesis)
• History of operations
• Consistency checker
The DB setup and client are data store specific (Mongo/Solr/Elastic); the scheduler, history, and checker are provided by Jepsen.
[Diagram: clients c1–c3 issue operations against datastore nodes n1–n3; each operation's outcome (OK, failed, or indeterminate) is recorded in the history]
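Putting the pieces together: the sketch below is a conceptual, datastore-agnostic outline of how a Jepsen-style test interleaves client and nemesis operations and records every outcome in a history for the checker. This is not Jepsen's actual Clojure API; the names (Client, Nemesis, run_test) are illustrative assumptions.

```python
import time

# Hypothetical outline of a Jepsen-style test loop.
# Real Jepsen is written in Clojure; names here are illustrative only.
def run_test(clients, nemesis, checker, duration_s=200):
    history = []  # [(timestamp, op, result)]
    start = time.time()
    while time.time() - start < duration_s:
        # The nemesis decides whether to injure or heal the cluster.
        nemesis_op = nemesis.maybe_act()
        if nemesis_op:
            history.append((time.time(), nemesis_op, "info"))

        # Each client attempts one operation; outcomes are recorded
        # even when the result is unknown (e.g. a timeout).
        for client in clients:
            op = client.next_op()
            try:
                result = client.invoke(op)
                history.append((time.time(), op, result))        # acknowledged
            except TimeoutError:
                history.append((time.time(), op, "indeterminate"))
            except Exception:
                history.append((time.time(), op, "failed"))
        time.sleep(1)

    nemesis.heal()                    # finish with a healthy cluster
    return checker.check(history)     # verify consistency afterwards
```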
nem·e·sis (noun): the inescapable agent of someone’s downfall
Nemesis
• partition-random-node: isolates a single, randomly chosen node from the rest of the cluster
• kill-random-node: kills the process on a randomly chosen node
• clock-scrambler: skews the system clock on a node
• partition-halves: splits the cluster into two fixed halves
• partition-random-halves: splits the cluster into two randomly chosen halves
• bridge: splits the cluster into two halves that can both reach a single ‘bridge’ node (n3 in the diagram) but not each other
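On Linux, partition nemeses are typically implemented by dropping packets with iptables. The following is a minimal, hypothetical sketch of a partition-halves-style nemesis; the hostnames, the run() helper, and the assumption of passwordless SSH are all illustrative, not Jepsen's actual implementation.

```python
import subprocess

NODES = ["n1", "n2", "n3", "n4", "n5"]   # assumed hostnames

def run(node, cmd):
    # Assumes passwordless SSH access to each node as root.
    subprocess.run(["ssh", f"root@{node}", cmd], check=True)

def partition_halves(nodes=NODES):
    """Split the cluster into two halves that cannot exchange packets."""
    half1, half2 = nodes[: len(nodes) // 2], nodes[len(nodes) // 2 :]
    for a in half1:
        for b in half2:
            # Drop traffic in both directions between the halves.
            run(a, f"iptables -A INPUT -s {b} -j DROP")
            run(b, f"iptables -A INPUT -s {a} -j DROP")

def heal(nodes=NODES):
    """Flush the DROP rules and restore full connectivity."""
    for n in nodes:
        run(n, "iptables -F INPUT")
```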
A set of integers: cas-set-client
• S = {1, 2, 3, 4, 5, …}
• Stored as a single document containing all the integers
• Update using compare-and-set
• Multiple clients try to update concurrently
• Create and restore partitions
• Finally, read the set of integers and verify consistency
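A minimal sketch of such a client, assuming a generic store object exposing read() and compare_and_set(); that interface is a stand-in for the Solr/Elastic/Mongo specific clients, not a real library API.

```python
import time

def add_with_cas(store, value, retries=10):
    """Add `value` to the shared set using compare-and-set, retrying on conflict."""
    for _ in range(retries):
        current = store.read()                   # e.g. {1, 2, 5}
        proposed = current | {value}
        # Succeeds only if nobody else modified the set in the meantime.
        if store.compare_and_set(expected=current, new=proposed):
            return True
        time.sleep(0.1)                          # back off and retry
    return False                                  # give up: a known-failed op
```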
Compare and Set client
[Diagram: two clients, Client 1 and Client 2, concurrently issue compare-and-set operations over time (t=0, t=1, …, t=x). cas({}, 1) yields {1} and cas(1, 2) yields {1, 2}; the concurrent attempts cas(1, 3) and cas(2, 4) fail (X), while the retried cas(2, 5) succeeds and yields {1, 2, 5}]
History = [(t, op, result)]
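Given such a history, checking the final read is straightforward: every acknowledged add must be present, no failed add may be present, and indeterminate (timed-out) adds may go either way. A rough sketch, assuming each history entry records the integer that was added and whether the add was acknowledged ("ok"), rejected ("fail"), or timed out ("indeterminate"):

```python
def check_set(history, final_read):
    """history: [(t, value, result)]; final_read: the set returned by the last read."""
    acknowledged = {v for (t, v, res) in history if res == "ok"}
    failed       = {v for (t, v, res) in history if res == "fail"}
    # Indeterminate (timed-out) adds may legitimately be present or absent.
    lost  = acknowledged - final_read   # acknowledged but missing => data loss
    dirty = final_read & failed         # present although the store said it failed
    return {"valid": not lost and not dirty, "lost": lost, "dirty": dirty}
```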
Solr
• Search server built on Lucene
• Lucene index + transaction log
• Optimistic concurrency, linearizable CAS ops
• Synchronous replication to all ‘live’ nodes
• ZooKeeper for ‘consensus’
• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
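Solr's optimistic concurrency works through the _version_ field: an update carrying a stale _version_ is rejected with HTTP 409. A minimal CAS-style sketch over HTTP, assuming a collection named 'jepsen', a multivalued int field 'values_is', and a single document with id '0' holding the set; all of those names are assumptions.

```python
import requests

SOLR = "http://localhost:8983/solr/jepsen"   # assumed collection URL

def cas_add(value):
    # Real-time get returns the current document along with its _version_.
    doc = requests.get(f"{SOLR}/get", params={"id": "0"}).json()["doc"]
    values = set(doc.get("values_is", [])) | {value}
    # Re-submit with the old _version_; Solr rejects the update (HTTP 409)
    # if the document was modified concurrently.
    update = [{"id": "0", "values_is": sorted(values), "_version_": doc["_version_"]}]
    resp = requests.post(f"{SOLR}/update?commit=true", json=update)
    return resp.status_code != 409
```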
Add an integer every second; partition the network every 30 seconds, for 200 seconds.
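Expressed with the helpers sketched earlier (add_with_cas, partition_halves, heal), that schedule might look roughly like this; the timings come from the slide, everything else is assumed.

```python
import time

def run_schedule(store, duration_s=200, partition_every_s=30):
    history, value = [], 0
    partitioned, last_toggle = False, 0.0
    start = time.time()
    while time.time() - start < duration_s:
        now = time.time() - start
        if now - last_toggle >= partition_every_s:
            # Alternate between cutting the network and healing it.
            (heal if partitioned else partition_halves)()
            partitioned, last_toggle = not partitioned, now
        ok = add_with_cas(store, value)
        history.append((now, value, "ok" if ok else "fail"))
        value += 1
        time.sleep(1)
    heal()
    return history
```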
Solr - Are we safe?
• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)
• Some writes ‘hang’ for a long time during a partition. Timeouts are essential. (unexpected)
• Final reads under CAS are consistent, but we haven’t proved linearizability (good!)
• Loss of availability for writes in the minority partition (expected)
• No data loss (yet!), which is great!
Solr - Bugs, bugs & bugs
• SOLR-6530: Commits under network partition can put any node into
‘down’ state.
• SOLR-6583: Resuming connection with ZK causes log replay
• SOLR-6511: Request threads hang under network partition
• SOLR-7636: A flaky cluster status API - times out during partitions
• SOLR-7109: Indexing threads stuck under network partition can mark
leader as down
Elastic
• Search server built on Lucene
• It has a Lucene index and a transaction log
• Consistent single doc reads, writes & updates
• Eventually consistent search but a flush/commit should ensure that
changes are visible
Elastic
• Optimistic concurrency control a.k.a. CAS linearizability
• Synchronous acknowledgement from a majority of nodes
• “Instantaneous” promotion under a partition
• Homegrown ‘ZenDisco’ consensus
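In the Elasticsearch 1.x line under test, CAS-style updates use document versioning: an index request that specifies a stale version is rejected with a version conflict (HTTP 409). A rough sketch over HTTP, assuming an index 'jepsen', type 'sets', and document id '0'; those names are assumptions.

```python
import requests

ES = "http://localhost:9200/jepsen/sets/0"   # assumed index/type/id

def cas_add(value):
    doc = requests.get(ES).json()            # includes _version and _source
    values = set(doc["_source"].get("values", [])) | {value}
    # Re-index with the version we read; Elasticsearch returns HTTP 409
    # (a version conflict) if the document changed underneath us.
    resp = requests.put(ES, params={"version": doc["_version"]},
                        json={"values": sorted(values)})
    return resp.status_code != 409
```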
Elastic - Are we safe?
• “Instantaneous” promotion is not: 90-second timeouts to elect a new primary (worse in <1.5.0)
• Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0.
Better in 1.5.0, only 22/897 lost.
• Isolated primaries: 209/947 updates lost
• Repeated pauses (simulating GC): 200/2143 updates lost
• Getting better but not quite there. Good documentation on
resiliency problems.
MongoDB
• Document-oriented database
• Replica set has a single primary which accepts writes
• Primary asynchronously replicates writes to secondaries
• Replicas decide among themselves when to promote/demote primaries
• Applies to 2.4.3 and 2.6.7
MongoDB
• Claims atomic writes per document and consistent reads
• But strict consistency only when reading from primaries
• Eventual consistency when reading from secondaries
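With a driver such as PyMongo these settings are explicit: the write concern controls how many replicas must acknowledge a write, and the read preference controls whether reads may hit secondaries. A minimal sketch, where hostnames, the replica set name, and the collection name are assumptions:

```python
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://n1,n2,n3/?replicaSet=jepsen")  # assumed hosts

# Writes wait for a majority of nodes; reads go to the primary.
strict = client.test.get_collection(
    "registers",
    write_concern=WriteConcern(w="majority"),
    read_preference=ReadPreference.PRIMARY,
)

# Reads from secondaries are only eventually consistent.
stale_ok = client.test.get_collection(
    "registers", read_preference=ReadPreference.SECONDARY_PREFERRED
)
```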
MongoDB - Are we safe?
Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
MongoDB - Are we really safe?
• Inconsistent reads are possible even with majority write concern
• Read-uncommitted isolation
• A minority partition will allow both stale reads and dirty reads
Conclusion
• Network communication is flaky! Plan for it.
• Hacker News driven development (HDD) is not a good way of choosing data stores!
• Test the guarantees of your data stores.
• Help me find more Solr bugs!
References
• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen
• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
• Jepsen on GitHub: https://github.com/aphyr/jepsen
• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
Solr/Lucene Meetup on 25th July 2015
Venue: Target Corporation, Manyata Embassy Business Park
Time: 9:30am to 1pm
Talks:
Crux of eCommerce Search and Relevancy
Creating Search Analytics Dashboards
Sign up at http://meetu.ps/2KnJHM
Thank you
shalin@apache.org
@shalinmangar