The importance of databases at scale and the growth of event-driven architectures, including Internet of Things environments, will mean greater reliance on distributed systems that are robust and able to maintain transactional records efficiently and accurately.
A distributed database’s system of record may be compromised by network partitions, partial failures or clock skew, any of which could cause earlier data to overwrite later input, divergent copies of stored data to be saved to backup, or stale data to be read. For financial and IoT systems in particular, this could have a significant impact in production. Distributed systems are hard, yo.
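To make the clock-skew case concrete, here is a minimal Python sketch of how earlier data can overwrite later input. It assumes a "last write wins" register that trusts the timestamps reported by each node; the class and function names (`LastWriteWinsRegister`, `node_clock`) and the skew values are hypothetical and are not taken from any particular database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Versioned:
    value: str
    timestamp: float  # wall-clock time reported by the writing node


class LastWriteWinsRegister:
    """Hypothetical register that keeps the write with the highest timestamp."""

    def __init__(self) -> None:
        self.current: Optional[Versioned] = None

    def write(self, value: str, timestamp: float) -> None:
        # Accept the write only if its timestamp is newer than the stored one.
        if self.current is None or timestamp > self.current.timestamp:
            self.current = Versioned(value, timestamp)


def node_clock(true_time: float, skew: float) -> float:
    """Hypothetical node clock: real time plus a fixed per-node skew."""
    return true_time + skew


register = LastWriteWinsRegister()

# Node A's clock runs 5 seconds fast; node B's clock is accurate.
register.write("balance=100", node_clock(true_time=10.0, skew=5.0))  # stamped 15.0
register.write("balance=250", node_clock(true_time=12.0, skew=0.0))  # stamped 12.0

# The write that happened later in real time is silently discarded.
print(register.current.value)  # prints "balance=100"
```

The second write is acknowledged without error, which is what makes this kind of loss hard to notice without a test that compares the stored data against the order in which writes actually happened.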
Until recently, testing the robustness of distributed systems has relied on unit tests and distributed tests. But unit tests tend to focus on tracing one process and are used internally as part of maintaining a code base and in continuous delivery. Distributed tests can observe database clusters but tend to be written as a series of steps that check the robustness of a number of typical database activities, like writing data, adding a node to a cluster, and then checking that the data writes are maintained.
Other approaches like Netflix’s Chaos Monkey, which can introduce random database failures, are intended to help in-house developers to ensure they are building performant, reliable systems at scale that can maintain functionality if one component mysteriously falls over.
But that still leaves the need for tests that provide insight into system-level control. That’s where Jepsen comes in.
“Jepsen is a project to analyze distributed systems under stress and to understand if concurrency problems are going to manifest, not only in applications but in databases and queues and so forth,” said creator Kyle Kingsbury. “We want to verify that the safety properties of a distributed system hold up, given concurrency, non-determinism, and partial failure.”
The software creates a database instance and then sets up a number of tests that run through example operations that might typically be performed on a distributed database at scale. Then verification logs are checked and reports created to ensure that the results of those operations have kept the data records in sync.
The Jepsen project has three components:
- The Jepsen Clojure library
- Tests for a range of distributed systems
- The analyses written up by Kingsbury on the various tests conducted so far.
Since late 2015, most analyses have been commissioned by database vendors hiring Kingsbury to carry out independent testing, although the open source project documentation is sufficient for independent use. Both DataStax and Bloomberg, for example, have used Jepsen independently of Kingsbury’s involvement.
“Jepsen is now moving to be a more general purpose tool,” said Kingsbury. “But documentation is still pretty sparse, and the API is still a bit rusty,” he conceded. Kingsbury might be being modest here. A quick review of the GitHub repo shows a walkthrough series of tutorials explaining how to set up a Jepsen environment (including how to launch a Jepsen cluster from AWS), how to write Jepsen tests, setting up and tearing down the database instance, writing a client, introducing failures into the system, and analyzing the results.
Kingsbury said much of his work using Jepsen with clients has centered on the database vendors themselves who want to analyze the reliability of their products. “But in some companies, I help with internal testing,” Kingsbury said. “Engineers may have an internal product they want to test, either with general purpose test design or end-to-end analysis.”
Kingsbury said the starting point is to help engineers formalize the invariants they want to use in the tests, such as ensuring strong transactional requirements. “It is really suited to any case where you have a distributed system and want to verify the reliability of the system. You want to conduct unit and integration tests, and then use Jepsen for the cluster as a whole.”
In particular, a Jepsen test draws on the principles of three types of system testing:
- Black box systems testing: Systems are tested without knowing the actual data calculations behind their operations. Errors observed in production can be recorded, for example, but the correct value of the data isn’t provided.
- Failure mode testing: Jepsen aims to test the reliability of data updates after events such as faulty networks, unsynchronized clocks and partial failures, focusing on how a database system behaves under strain rather than on healthy cluster behavior.
- Generative testing: Tests are conducted by running random operations and logging a history of the results. This can then be checked against the engineering team’s model to establish how well the system operates.
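The three ideas above can be sketched together in a few lines of Python. This is not Jepsen's API (Jepsen itself is a Clojure library); it is a toy illustration under stated assumptions: `ReplicatedRegister` is a hypothetical two-replica stand-in for a database, the "partition" is a boolean flag, and the model being checked is simply that every read returns the most recent acknowledged write.

```python
import random


class ReplicatedRegister:
    """Toy two-replica register; a hypothetical stand-in for a real cluster."""

    def __init__(self) -> None:
        self.replicas = [None, None]
        self.partitioned = False

    def write(self, node: int, value: int) -> None:
        self.replicas[node] = value
        if not self.partitioned:
            # Replication only succeeds while the network is healthy.
            self.replicas[1 - node] = value

    def read(self, node: int):
        return self.replicas[node]


def run_test(seed: int, steps: int, inject_partition: bool) -> list:
    """Apply random operations to the register and return the history."""
    rng = random.Random(seed)
    db = ReplicatedRegister()
    history = []
    for step in range(steps):
        if inject_partition and step == steps // 2:
            db.partitioned = True  # failure mode: cut the replicas apart
        node = rng.randrange(2)
        if rng.random() < 0.5:
            db.write(node, step)  # the step number doubles as a unique value
            history.append(("write", node, step))
        else:
            history.append(("read", node, db.read(node)))
    return history


def check(history: list) -> list:
    """Black-box check: compare the recorded history against the model."""
    expected = None
    violations = []
    for index, (op, node, value) in enumerate(history):
        if op == "write":
            expected = value
        elif value != expected:
            violations.append((index, node, value, expected))
    return violations


healthy = check(run_test(seed=1, steps=200, inject_partition=False))
faulty = check(run_test(seed=1, steps=200, inject_partition=True))
print(len(healthy), "violations without faults")
print(len(faulty), "violations with a partition")
```

The checker never looks inside `ReplicatedRegister`; it only sees the history of operations and results, which is what lets the same check run unchanged with or without the injected fault. A real test would also have to handle concurrent clients and operations whose outcome is unknown, which this single-threaded sketch leaves out.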
From having conducted over 22 tests so far on Aerospike, Cassandra, CockroachDB, Kafka, MongoDB, Redis, Riak, and others, Kingsbury shared some thoughts on database systems he has analyzed:
- CockroachDB: “A lot of vendors I have worked with have a great emphasis on safety and are very concerned with stale reads. VoltDB, for example, made a significant change that made their reads much slower in order to ensure safety. In the same vein, CockroachDB does that too, but they make some interesting assumptions around needing clocks synchronized. In cloud environments, you could have up to a three second wait time for reading. CockroachDB is guessing there will be a significant future interest in clock synchronization, and you see that in the market with Google recently releasing the Truetime API for Spanner.”
- Elasticsearch: “I would not use this as a system of record, so you would put your data in S3 or Postgres and have a replication tool so that it reiterates the data. This is how you want to design robust systems. You don’t write data directly to Elasticsearch, you write to the primary database and then have a tool that continually writes to a secondary DB. The biggest issue will be shard rebalancing because if Elasticsearch is built in one shard, it can collapse under a high load. A lot of people are probably losing data and they don’t know it.”
- etcd: “My loose impression is that if you are going to go with the next gen consensus systems, I know Consul has taken down large scale systems, at least in terms of outages. Zookeeper is the best choice, it has been around for a long time and it has rounded out the edge use cases. But the etcd UI is much nicer to work with, which might be a deal closer for some.”
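The design Kingsbury describes for Elasticsearch, writing to a primary database and continually replicating into a secondary one, might look roughly like the following Python sketch. Everything here is an assumption for illustration: `SystemOfRecord` and `SearchIndex` are hypothetical in-memory classes standing in for something like Postgres and Elasticsearch, and no real client library is used.

```python
class SystemOfRecord:
    """Stand-in for the primary store; the only place writes go."""

    def __init__(self) -> None:
        self.rows = {}
        self.change_log = []  # ordered (doc_id, text) changes

    def write(self, doc_id: str, text: str) -> None:
        self.rows[doc_id] = text
        self.change_log.append((doc_id, text))


class SearchIndex:
    """Stand-in for the secondary store; holds derived data only."""

    def __init__(self) -> None:
        self.docs = {}
        self.applied = 0  # position reached in the primary's change log

    def sync(self, primary: SystemOfRecord) -> None:
        """Replay changes not yet applied; safe to call repeatedly."""
        for doc_id, text in primary.change_log[self.applied:]:
            self.docs[doc_id] = text
        self.applied = len(primary.change_log)

    def search(self, word: str) -> list:
        return sorted(d for d, text in self.docs.items() if word in text.split())


primary = SystemOfRecord()
primary.write("a", "network partition test")
primary.write("b", "clock skew test")

index = SearchIndex()
index.sync(primary)
before = index.search("test")

# Simulate losing the index entirely, then rebuild it from the primary.
index = SearchIndex()
index.sync(primary)
after = index.search("test")
print(before == after)  # prints True
```

Because the index is only ever derived from the primary's change log, losing it costs a rebuild rather than data.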
While the official Jepsen open source project elements are written in Clojure, others are adapting the library to their own languages, with both Bloomberg and software engineer Milan Simonovic using Python, for example.
Joel Knighton from DataStax, who tested Apache Cassandra with Jepsen last year, wrote that Jepsen encouraged well-defined models and invariants when preparing systems testing.
“This encourages unambiguous design and communication, and it also makes the purpose of the test clear. Second, Jepsen enables the tester to easily test these invariants in realistic conditions through test composability; the same core test can be run concurrently with node crashes, network partitions, or other pathological failures. Lastly, by embracing generative testing, it acknowledges the difficulty of thoroughly testing distributed systems using deterministic, hand-selected examples,” Knighton wrote.
Kingsbury will be keynoting on Jepsen at the Software Architecture Summit in Berlin, being held 18-20 September. His keynote talk at Scala Days earlier this year is available online.
Feature image: Screen capture from Kyle Kingsbury’s keynote talk on Jepsen at Scala Days earlier this year.