Call me maybe: Jepsen and flaky networks
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
Typical first year for a new cluster
— Jeff Dean, Google (LADIS 2009)
• ~5 racks out of 30 go wonky (50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~3 router failures (have to immediately pull traffic for an hour)
Reliable networks are a myth
• GC pause
• Process crash
• Scheduling delays
• Network maintenance
• Faulty equipment
Network
[Diagram: five nodes (n1–n5) communicating over the network]
Network partition
[Diagram: a partition splits n1–n5 into two groups that can no longer exchange messages]
Messages can be lost, delayed, reordered and duplicated
[Diagram: message timelines between n1 and n2 showing a dropped message, a delayed message, a duplicated message, and reordered messages]
CAP recap
• Consistency (Linearizability): A total order on all operations such
that each operation looks as if it were completed at a single instant.
• Availability: Every request received by a non-failing node in the
system must result in a response.
• Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
Have you planned for these?
During and after a partition, both availability and consistency suffer:
• Errors
• Connection timeouts
• Hung requests (read timeouts)
• Stale results
• Dirty results
• Data lost forever!
Jepsen: Testing systems under stress
• Network partitions
• Random process crashes
• Slow networks
• Clock skew
http://github.com/aphyr/jepsen
Anatomy of a Jepsen test
• Automated DB setup
• Test definitions a.k.a. Client
• Partition types a.k.a. Nemesis
• Scheduler of operations (client & nemesis)
• History of operations
• Consistency checker
The DB setup and client are data store specific (Mongo/Solr/Elastic); the scheduler, history, and checker are provided by Jepsen.
[Diagram: clients c1–c3 issue operations against datastore nodes n1–n3; each operation's outcome (OK, failed, or indeterminate) is recorded in the history]
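Putting the pieces together: the sketch below is a conceptual, datastore-agnostic outline of how a Jepsen-style test interleaves client and nemesis operations and records every outcome in a history for the checker. This is not Jepsen's actual Clojure API; the names (Client, Nemesis, run_test) are illustrative assumptions.

```python
import time

# Hypothetical outline of a Jepsen-style test loop.
# Real Jepsen is written in Clojure; names here are illustrative only.
def run_test(clients, nemesis, checker, duration_s=200):
    history = []  # [(timestamp, op, result)]
    start = time.time()
    while time.time() - start < duration_s:
        # The nemesis decides whether to injure or heal the cluster.
        nemesis_op = nemesis.maybe_act()
        if nemesis_op:
            history.append((time.time(), nemesis_op, "info"))

        # Each client attempts one operation; outcomes are recorded
        # even when the result is unknown (e.g. a timeout).
        for client in clients:
            op = client.next_op()
            try:
                result = client.invoke(op)
                history.append((time.time(), op, result))        # acknowledged
            except TimeoutError:
                history.append((time.time(), op, "indeterminate"))
            except Exception:
                history.append((time.time(), op, "failed"))
        time.sleep(1)

    nemesis.heal()                    # finish with a healthy cluster
    return checker.check(history)     # verify consistency afterwards
```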
nem·e·sis (noun): the inescapable agent of someone’s downfall
Nemesis
• partition-random-node: isolates a single, randomly chosen node from the rest of the cluster
• kill-random-node: kills the process on a randomly chosen node
• clock-scrambler: skews the system clock on a node
• partition-halves: splits the cluster into two fixed halves
• partition-random-halves: splits the cluster into two randomly chosen halves
• bridge: splits the cluster into two halves that can both reach a single ‘bridge’ node (n3 in the diagram) but not each other
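On Linux, partition nemeses are typically implemented by dropping packets with iptables. The following is a minimal, hypothetical sketch of a partition-halves-style nemesis; the hostnames, the run() helper, and the assumption of passwordless SSH are all illustrative, not Jepsen's actual implementation.

```python
import subprocess

NODES = ["n1", "n2", "n3", "n4", "n5"]   # assumed hostnames

def run(node, cmd):
    # Assumes passwordless SSH access to each node as root.
    subprocess.run(["ssh", f"root@{node}", cmd], check=True)

def partition_halves(nodes=NODES):
    """Split the cluster into two halves that cannot exchange packets."""
    half1, half2 = nodes[: len(nodes) // 2], nodes[len(nodes) // 2 :]
    for a in half1:
        for b in half2:
            # Drop traffic in both directions between the halves.
            run(a, f"iptables -A INPUT -s {b} -j DROP")
            run(b, f"iptables -A INPUT -s {a} -j DROP")

def heal(nodes=NODES):
    """Flush the DROP rules and restore full connectivity."""
    for n in nodes:
        run(n, "iptables -F INPUT")
```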
A set of integers: cas-set-client
• S = {1, 2, 3, 4, 5, …}
• Stored as a single document containing all the integers
• Update using compare-and-set
• Multiple clients try to update concurrently
• Create and restore partitions
• Finally, read the set of integers and verify consistency
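A minimal sketch of such a client, assuming a generic store object exposing read() and compare_and_set(); that interface is a stand-in for the Solr/Elastic/Mongo specific clients, not a real library API.

```python
import time

def add_with_cas(store, value, retries=10):
    """Add `value` to the shared set using compare-and-set, retrying on conflict."""
    for _ in range(retries):
        current = store.read()                   # e.g. {1, 2, 5}
        proposed = current | {value}
        # Succeeds only if nobody else modified the set in the meantime.
        if store.compare_and_set(expected=current, new=proposed):
            return True
        time.sleep(0.1)                          # back off and retry
    return False                                  # give up: a known-failed op
```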
Compare and Set client
[Diagram: two clients, Client 1 and Client 2, concurrently issue compare-and-set operations over time (t=0, t=1, …, t=x). cas({}, 1) yields {1} and cas(1, 2) yields {1, 2}; the concurrent attempts cas(1, 3) and cas(2, 4) fail (X), while the retried cas(2, 5) succeeds and yields {1, 2, 5}]
History = [(t, op, result)]
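Given such a history, checking the final read is straightforward: every acknowledged add must be present, no failed add may be present, and indeterminate (timed-out) adds may go either way. A rough sketch, assuming each history entry records the integer that was added and whether the add was acknowledged ("ok"), rejected ("fail"), or timed out ("indeterminate"):

```python
def check_set(history, final_read):
    """history: [(t, value, result)]; final_read: the set returned by the last read."""
    acknowledged = {v for (t, v, res) in history if res == "ok"}
    failed       = {v for (t, v, res) in history if res == "fail"}
    # Indeterminate (timed-out) adds may legitimately be present or absent.
    lost  = acknowledged - final_read   # acknowledged but missing => data loss
    dirty = final_read & failed         # present although the store said it failed
    return {"valid": not lost and not dirty, "lost": lost, "dirty": dirty}
```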
Solr
• Search server built on Lucene
• Lucene index + transaction log
• Optimistic concurrency, linearizable CAS ops
• Synchronous replication to all ‘live’ nodes
• ZooKeeper for ‘consensus’
• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
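Solr's optimistic concurrency works through the _version_ field: an update carrying a stale _version_ is rejected with HTTP 409. A minimal CAS-style sketch over HTTP, assuming a collection named 'jepsen', a multivalued int field 'values_is', and a single document with id '0' holding the set; all of those names are assumptions.

```python
import requests

SOLR = "http://localhost:8983/solr/jepsen"   # assumed collection URL

def cas_add(value):
    # Real-time get returns the current document along with its _version_.
    doc = requests.get(f"{SOLR}/get", params={"id": "0"}).json()["doc"]
    values = set(doc.get("values_is", [])) | {value}
    # Re-submit with the old _version_; Solr rejects the update (HTTP 409)
    # if the document was modified concurrently.
    update = [{"id": "0", "values_is": sorted(values), "_version_": doc["_version_"]}]
    resp = requests.post(f"{SOLR}/update?commit=true", json=update)
    return resp.status_code != 409
```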
Add an integer every second; partition the network every 30 seconds, for 200 seconds.
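Expressed with the helpers sketched earlier (add_with_cas, partition_halves, heal), that schedule might look roughly like this; the timings come from the slide, everything else is assumed.

```python
import time

def run_schedule(store, duration_s=200, partition_every_s=30):
    history, value = [], 0
    partitioned, last_toggle = False, 0.0
    start = time.time()
    while time.time() - start < duration_s:
        now = time.time() - start
        if now - last_toggle >= partition_every_s:
            # Alternate between cutting the network and healing it.
            (heal if partitioned else partition_halves)()
            partitioned, last_toggle = not partitioned, now
        ok = add_with_cas(store, value)
        history.append((now, value, "ok" if ok else "fail"))
        value += 1
        time.sleep(1)
    heal()
    return history
```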
Solr - Are we safe?
• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)
• Some writes ‘hang’ for a long time during a partition. Timeouts are essential. (unexpected)
• Final reads under CAS are consistent, but we haven’t proved linearizability (good!)
• Loss of availability for writes in the minority partition (expected)
• No data loss (yet!), which is great!
Solr - Bugs, bugs & bugs
• SOLR-6530: Commits under network partition can put any node into
‘down’ state.
• SOLR-6583: Resuming connection with ZK causes log replay
• SOLR-6511: Request threads hang under network partition
• SOLR-7636: A flaky cluster status API - times out during partitions
• SOLR-7109: Indexing threads stuck under network partition can mark
leader as down
Elastic
• Search server built on Lucene
• It has a Lucene index and a transaction log
• Consistent single doc reads, writes & updates
• Eventually consistent search but a flush/commit should ensure that
changes are visible
Elastic
• Optimistic concurrency control a.k.a. CAS linearizability
• Synchronous acknowledgement from a majority of nodes
• “Instantaneous” promotion under a partition
• Homegrown ‘ZenDisco’ consensus
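In the Elasticsearch 1.x line under test, CAS-style updates use document versioning: an index request that specifies a stale version is rejected with a version conflict (HTTP 409). A rough sketch over HTTP, assuming an index 'jepsen', type 'sets', and document id '0'; those names are assumptions.

```python
import requests

ES = "http://localhost:9200/jepsen/sets/0"   # assumed index/type/id

def cas_add(value):
    doc = requests.get(ES).json()            # includes _version and _source
    values = set(doc["_source"].get("values", [])) | {value}
    # Re-index with the version we read; Elasticsearch returns HTTP 409
    # (a version conflict) if the document changed underneath us.
    resp = requests.put(ES, params={"version": doc["_version"]},
                        json={"values": sorted(values)})
    return resp.status_code != 409
```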
Elastic - Are we safe?
• “Instantaneous” promotion is not: 90-second timeouts to elect a new primary (worse in <1.5.0)
• Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0.
Better in 1.5.0, only 22/897 lost.
• Isolated primaries: 209/947 updates lost
• Repeated pauses (simulating GC): 200/2143 updates lost
• Getting better but not quite there. Good documentation on
resiliency problems.
MongoDB
• Document-oriented database
• Replica set has a single primary which accepts writes
• Primary asynchronously replicates writes to secondaries
• Replicas decide among themselves when to promote/demote primaries
• Applies to 2.4.3 and 2.6.7
MongoDB
• Claims atomic writes per document and consistent reads
• But strict consistency only when reading from primaries
• Eventual consistency when reading from secondaries
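With a driver such as PyMongo these settings are explicit: the write concern controls how many replicas must acknowledge a write, and the read preference controls whether reads may hit secondaries. A minimal sketch, where hostnames, the replica set name, and the collection name are assumptions:

```python
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://n1,n2,n3/?replicaSet=jepsen")  # assumed hosts

# Writes wait for a majority of nodes; reads go to the primary.
strict = client.test.get_collection(
    "registers",
    write_concern=WriteConcern(w="majority"),
    read_preference=ReadPreference.PRIMARY,
)

# Reads from secondaries are only eventually consistent.
stale_ok = client.test.get_collection(
    "registers", read_preference=ReadPreference.SECONDARY_PREFERRED
)
```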
MongoDB - Are we safe?
Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
MongoDB - Are we really safe?
• Inconsistent reads are possible even with majority write concern
• Read-uncommitted isolation
• A minority partition will allow both stale reads and dirty reads
Conclusion
• Network communication is flaky! Plan for it.
• Hacker News driven development (HDD) is not a good way of choosing data stores!
• Test the guarantees of your data stores.
• Help me find more Solr bugs!
References
• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen
• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
• Jepsen on GitHub: https://github.com/aphyr/jepsen
• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
Solr/Lucene Meetup on 25th July 2015
Venue: Target Corporation, Manyata Embassy Business Park
Time: 9:30am to 1pm
Talks:
Crux of eCommerce Search and Relevancy
Creating Search Analytics Dashboards
Sign up at http://meetu.ps/2KnJHM
Thank you
shalin@apache.org
@shalinmangar