A new open source database called Noms aims to do for published data what Git does for source code.
It’s from former Google engineers Aaron Boodman and Rafael Weinstein, who formed the company Attic Labs. They set out to build a database with versioning, deduplication, and synchronization built into its core. The beta version is out now.
“One of the things we really loved about Git and were inspired by was that it took this problem of synchronization that has been so hard to solve (it’s only been solved a number of times and only with great effort, only by, like, Microsoft with hundreds of people on it) and solved it in this really elegant, simple way,” said Boodman.
He explained that while other databases focus on storing and processing huge amounts of data, Noms, written in Go, focuses on moving data between people and organizations and on collaborating on that data, which today relies on systems he calls “completely archaic.”
He points to the data published publicly by news organizations, governments, sports leagues and the scientific community, as well as business reports that need to be shared privately.
“This is usually done in one of two ways: static snapshots, static CSV files or JSON files, which are really easy for the publisher, but really hard for the consumer. Or you publish data by an API, which is a lot easier for the consumer to use, but it’s really, really hard for the provider. The provider has to build [the API], secure it, maintain it. And if you look at the data shared publicly, it’s just full of errors. It’s really messy. There’s formatting errors, factual errors, spelling errors,” he said.
You can’t query a subset of a CSV file without putting it into another system, he notes in a blog post.
“Because there are so many errors, you see a proliferation of different formats, different versions of the same dataset that are only slightly different. If you want to use this public data, it’s really hard to know which one is the correct one. Why did this change? Who changed it? When did they change it?
“We looked at this problem and saw it’s really similar to what software engineers have, and we thought Git was just a really elegant way to look at this problem. We thought the same ideas for solving source code could be used for data,” he said.
Understanding what’s changed
Attic Labs just landed $8.1M in a Series A funding round, led by Greylock Partners, to address the problem.
Noms won’t replace the monolithic MySQL, Mongo or other databases that enterprises use, but will make it easier to write the software to consume and understand the data and how it’s changed.
“It will be more efficient, so you don’t have to download the data over and over again,” he said.
“It has atomic commits, you can program APIs, so you can use it as an application database. We don’t have a sophisticated query language right now, but the design supports a query language and if you have one, you can explore the data in a structured way,” he said.
“You can see history. When you make a commit to Noms, you’re not overriding data, you’re adding data. You end up with all the changes over time that have happened to your database. You can compute detailed differences and understand how the data has changed. You can compute summaries of differences if the data set is really big. And just like Git, you can synchronize efficiently, so you can move the data from one instance of the database to another. Any instance of the database can work disconnected from another, build their own history of changes, then they can come together and merge those changes.”
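The append-only behavior Boodman describes can be illustrated with a small sketch. This is not the Noms API: the `History` class, its `commit` and `log` methods, and the use of SHA-256 over JSON are all assumptions made for illustration. Each commit stores a value plus a pointer to its parent, so earlier states stay available.

```python
import hashlib
import json


class History:
    """Toy append-only commit log (hypothetical, not the Noms API)."""

    def __init__(self):
        self.commits = {}  # commit id -> commit record
        self.head = None   # id of the most recent commit

    def commit(self, value):
        # A commit records the value and a pointer to its parent,
        # so earlier states are added to, never overwritten.
        record = {"parent": self.head, "value": value}
        encoded = json.dumps(record, sort_keys=True).encode("utf-8")
        commit_id = hashlib.sha256(encoded).hexdigest()
        self.commits[commit_id] = record
        self.head = commit_id
        return commit_id

    def log(self):
        # Walk parent pointers from the head back to the first commit.
        current = self.head
        while current is not None:
            yield current, self.commits[current]["value"]
            current = self.commits[current]["parent"]


history = History()
first = history.commit({"population": 100})
second = history.commit({"population": 105})
values = [value for _, value in history.log()]  # newest first
```

After the second commit, the first value is still reachable by its id, which is the property that makes history and diffs possible.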
Noms combines the basic properties of structured databases, such as efficient queries, updates, and range scans, with concepts from Git to create a decentralized database based on principles such as:
- Any peer’s state is as valid as any other.
- All commits of the database are retained and available at any time.
- Any peer is free to move forward independently of communication from any other—while retaining the ability to reconcile changes at some point in the future.
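The third principle, reconciling changes made independently, is commonly handled with a three-way merge against a common ancestor. The article does not describe how Noms merges, so the following is only a generic sketch over flat dictionaries; the function name `three_way_merge` and the conflict-list return value are assumptions.

```python
def three_way_merge(base, ours, theirs):
    """Merge two flat dicts that diverged from a common ancestor.

    Generic three-way merge for illustration; not Noms' algorithm.
    Returns (merged dict, list of conflicting keys).
    """
    merged, conflicts = {}, []
    missing = object()  # sentinel meaning "key is absent"
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:
            result = o  # both peers agree (or both deleted the key)
        elif o == b:
            result = t  # only the other peer changed it
        elif t == b:
            result = o  # only this peer changed it
        else:
            conflicts.append(key)  # both changed it, differently
            continue
        if result is not missing:
            merged[key] = result
    return merged, conflicts


base = {"a": 1, "b": 2, "c": 3}
ours = {"a": 10, "b": 2, "c": 3}    # this peer edited "a"
theirs = {"a": 1, "b": 2, "d": 4}   # the other peer deleted "c", added "d"
merged, conflicts = three_way_merge(base, ours, theirs)
```

Here the two peers touched different keys, so the merge succeeds without conflicts; a key changed differently on both sides would be reported rather than silently resolved.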
Noms models data as a directed acyclic graph of nodes or, more specifically, a single large Merkle DAG. Noms builds on the ideas of previous systems to reconcile differences, synchronize state, and build indexes over large-scale, structured data.
Noms is a typed system that will “chunk” gigantic blobs, sets, lists, and maps into workable-size parts for storage, searching and updating.
“A critical invariant of Noms is that the same value will be represented by the same graph, having the same chunk boundaries, regardless of what past sequence of logical mutations resulted in the value. This is the essence of content-addressing and it is what makes deduplication, efficient sync, indexing, and other features of Noms possible,” according to a tech overview.
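A minimal sketch of that invariant, assuming a canonical JSON encoding and SHA-256 as the hash (neither is stated in the article, and `ChunkStore` is a hypothetical name): two values built through different sequences of mutations end up at the same address, so the store keeps only one copy.

```python
import hashlib
import json


class ChunkStore:
    """Toy content-addressed store: a chunk's key is the hash of its bytes."""

    def __init__(self):
        self.chunks = {}

    def put(self, value):
        # Canonical encoding (sorted keys) so equal values encode identically.
        data = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
        address = hashlib.sha256(data).hexdigest()
        self.chunks[address] = data  # storing the same value again changes nothing
        return address

    def get(self, address):
        return json.loads(self.chunks[address])


store = ChunkStore()

# The same logical value, reached through two different mutation histories.
a = {}
a["name"] = "Ada"
a["year"] = 1815

b = {"year": 0}
b["name"] = "Ada"
b["year"] = 1815

addr_a = store.put(a)
addr_b = store.put(b)  # same address as addr_a, so nothing new is stored
```

Because the address depends only on the final value, deduplication falls out of the storage scheme rather than needing a separate pass.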
Rather than using B-Trees, it introduces what it calls “Prolly Trees” where the number of values stored in each node is determined by a probabilistic average rather than an enforced upper and lower bound, and the set of values in each node is determined by the output of a rolling hash function over the values.
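The chunking idea can be sketched as follows. This is an illustration only, not Noms' implementation: the window size, the target chunk size, and the use of SHA-256 recomputed over a small trailing window (standing in for a true incremental rolling hash) are all assumptions. A chunk ends wherever the hash of the most recent items hits a chosen pattern, so boundaries depend on content rather than position.

```python
import hashlib

WINDOW = 3     # number of trailing items that feed the hash (assumed)
AVG_CHUNK = 8  # a boundary fires with probability of roughly 1/AVG_CHUNK (assumed)


def is_boundary(window_items):
    # Hash the trailing window; a real rolling hash would update incrementally.
    digest = hashlib.sha256("\x00".join(window_items).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % AVG_CHUNK == 0


def chunk(items):
    """Split a sorted list of strings into content-defined chunks."""
    chunks, current = [], []
    for i, item in enumerate(items):
        current.append(item)
        if is_boundary(items[max(0, i - WINDOW + 1): i + 1]):
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks


items = [f"key{n:04d}" for n in range(200)]
edited = items[:100] + ["inserted"] + items[100:]

before = chunk(items)
after = chunk(edited)

# Chunks of the original that no longer exist after the insertion.
rewritten = set(before) - set(after)
```

Since each boundary decision looks only at a short window, an insertion can disturb only the chunks near it; every other chunk is byte-for-byte the same as before and can be shared between the two versions.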
Prolly Trees are sorted, either by type or by hash, making efficient indexes. Ordered search trees also make it possible to implement set operations like union and intersect efficiently on them. The team plans to build out these capabilities into a general query system.
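To see why ordering helps, consider two sorted, duplicate-free lists. Union and intersection can each be computed in a single pass by advancing two cursors, as in the sketch below. It works on plain Python lists rather than trees, and the function names are illustrative, not part of Noms.

```python
def intersect_sorted(left, right):
    """Intersection of two ascending, duplicate-free lists in one pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left[i] cannot appear later in right
        else:
            j += 1  # right[j] cannot appear later in left
    return out


def union_sorted(left, right):
    """Union of two ascending, duplicate-free lists in one pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # whatever remains on either side
    out.extend(right[j:])
    return out


common = intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8])
combined = union_sorted([1, 3, 5, 7], [3, 4, 5, 8])
```

Both functions run in time proportional to the combined length of the inputs, with no hashing or re-sorting needed.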
While also setting out use cases for versioning and collaboration, Boodman described synchronization this way:
“The way it’s done now, you publish a static hunk of CSV files or JSON files. It’s extremely laborious for developers to consume this data. They have to write a parser; the data is untyped (it doesn’t say it’s a date or a stream or whatever); you just have to know from some external communication what the data’s supposed to be. So you do all this work to parse the data and import it into your system, all for one snapshot [of a point in time] of your data,” Boodman said.
“But, usually, it’s changing. Then you have to download all the data again. You have to look at it and say, ‘OK, what’s changed?’,” Boodman continued. “Developers spend a lot of time complaining about writing this kind of code, and nobody on their team wants to do it. So the source code gets really far out of date because they keep delaying the inevitable cost of having to re-import the data.”
There are two ways Noms can deal with this. In the first, users import the first week’s snapshot into Noms, then the second week’s snapshot, and Noms reports the difference between them, no matter how large or complicated they are.
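What such a difference report contains can be shown with a naive comparison of two snapshots. This sketch assumes each snapshot has already been parsed into a dictionary keyed by a record identifier, and it scans every record; the article does not describe how Noms computes diffs internally, and the names here are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping a record key to its row."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed


week1 = {
    "NYC": {"population": 8400000},
    "LA": {"population": 3900000},
    "SF": {"population": 850000},
}
week2 = {
    "NYC": {"population": 8400000},
    "LA": {"population": 3950000},   # corrected figure
    "CHI": {"population": 2700000},  # newly published record
}

added, removed, changed = diff_snapshots(week1, week2)
```

The consumer can then apply only the three affected records instead of re-importing the whole file.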
“If the upstream producer of the data is using Noms, too, they can publish their data in Noms format,” Boodman said. “Then the downstream consumer only has to do one commit: Noms pull. It pulls the latest revision of the data. It only pulls the changes. What changed? When did it change? It’s a very detailed report, so you can take those differences and apply them.”
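Pulling only the changes follows naturally when data is stored as hash-addressed chunks that reference one another. The sketch below is not the Noms protocol: the `put` and `pull` helpers, the chunk layout, and the in-memory dictionaries standing in for the publisher and consumer databases are all assumptions. The consumer walks down from the new root and stops at any chunk it already holds.

```python
import hashlib
import json


def put(store, data, children=()):
    """Store a chunk under the hash of its content and child addresses."""
    chunk = {"data": data, "children": list(children)}
    encoded = json.dumps(chunk, sort_keys=True).encode("utf-8")
    address = hashlib.sha256(encoded).hexdigest()
    store[address] = chunk
    return address


def pull(source, dest, root):
    """Copy into dest every chunk reachable from root that dest lacks.

    Assumes that if dest holds a chunk it also holds that chunk's
    descendants, which is true when dest is only ever filled by
    completed pulls.
    """
    transferred = 0
    stack = [root]
    while stack:
        address = stack.pop()
        if address in dest:
            continue  # already present, so skip this whole subtree
        chunk = source[address]
        dest[address] = chunk
        transferred += 1
        stack.extend(chunk["children"])
    return transferred


publisher, consumer = {}, {}

leaf_a = put(publisher, "rows 0-99")
leaf_b = put(publisher, "rows 100-199")
root_v1 = put(publisher, "v1", [leaf_a, leaf_b])
first_pull = pull(publisher, consumer, root_v1)   # copies all three chunks

leaf_b2 = put(publisher, "rows 100-199 (corrected)")
root_v2 = put(publisher, "v2", [leaf_a, leaf_b2])
second_pull = pull(publisher, consumer, root_v2)  # copies only the new root and leaf
```

The unchanged leaf is never sent again, which is the efficiency Boodman describes for downstream consumers.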