Data Artisans, the company behind the open source project Apache Flink, has taken on a vexing problem with stream processing — that it’s limited to transactions on one row at a time. It has introduced Streaming Ledger, a library that sits on top of Flink providing serializable distributed ACID transactions across multiple tables, rows, and streams in high-performance systems.
It grew out of the needs of financial services customers, needs that could not be adequately addressed by current stream-processing technology, including Flink, according to data Artisans co-founder Fabian Hueske. However, the company foresees a wide range of use cases for the technology.
“These applications had some kind of transactional behavior where you would have to access and update multiple rows in the streaming application. Stream processing so far has been good at offering consistency for single rows, similar to key-value stores. But our customers wanted transactions more similar to relational database systems — accessing and updating multiple rows in a consistent way, which is also not possible with key-value stores,” he said.
He uses the example of moving money from one account to another, where one account likely sits on one node in the data center and the other account on another node. To make this transaction work, you’d need a database or optimistic concurrency control; it hasn’t really been possible with stream processors.
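The failure mode is easy to sketch. Assuming a hypothetical in-memory balance table (this is an illustration, not Streaming Ledger code), the following Java fragment shows how two independent single-row updates, without a transaction around them, can lose money if processing fails between the debit and the credit:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: with only single-row guarantees, a transfer is
// two independent updates, and a failure between them corrupts the totals.
public class NaiveTransfer {
    static void transfer(Map<String, Long> balances, String from, String to,
                         long amount, boolean crashBetweenWrites) {
        balances.merge(from, -amount, Long::sum);  // debit: first single-row write
        if (crashBetweenWrites) return;            // simulated failure mid-transfer
        balances.merge(to, amount, Long::sum);     // credit: second single-row write
    }

    public static void main(String[] args) {
        Map<String, Long> balances = new HashMap<>();
        balances.put("A", 100L);
        balances.put("B", 0L);
        transfer(balances, "A", "B", 30L, true);   // crash after the debit
        // A was debited but B was never credited: 30 units simply vanished.
        System.out.println(balances);
    }
}
```

An atomic multi-row transaction would either apply both writes or neither, which is exactly the guarantee Streaming Ledger is meant to add to stream processing.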
Streaming Ledger offers consistency and high throughput to large-scale applications like inventory management, pricing, billing, supply-demand matching and logistics without requiring an underlying relational database. The company predicts the technology will bring a whole new class of applications to a data-streaming architecture.
“Data-intensive, real-time applications such as fraud detection, machine learning, and real-time trade pricing can now be upgraded to the streaming era effortlessly,” data Artisans Chief Technology Officer Stephan Ewen wrote in a blog post.
Parallel, but on Single Thread
Flink is known for its ability to handle very high-volume streams with low latency and exactly-once results. Enterprise customers including Alibaba, Netflix, Uber, Microsoft and ING use it at massive scale. Analyst Janakiram MSV previously outlined how it stacks up against Storm, Spark and other Apache streaming projects.
Streaming Ledger maintains high throughput and performance, Hueske said, referring to a test outlined in this white paper, in which it was benchmarked at 2 million transactions per second across globally distributed data centers.
The core building blocks of Streaming Ledger are:
- Tables — maintain state in one or more key/value stores. They can be held purely in memory or in RocksDB. Table keys and values can be of any type (structured row, object, blob, etc.).
- Transaction Event Streams — The ledger can handle multiple event streams. The events carry the “parameters” of the transaction — information about what data is to be modified and how.
- Transaction Functions contain complex business logic, implemented in Java/Scala. They’re described as “powerful stored procedures that execute directly in the system that manages the data, rather than pulling the data out and pushing updates back in.”
- Optional Result Streams can be used to declare success or failure or to send updates to a mirror of the tables in an external system.
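How the blocks fit together can be shown with a deliberately simplified sketch. The class and method names below are hypothetical, not the actual Streaming Ledger SDK: an in-memory table, transfer events carrying the transaction parameters, a transaction function that reads and writes multiple rows, and a result stream recording success or failure:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the four building blocks (not the real SDK API).
public class MiniLedger {
    // Transaction event: carries the "parameters" of the transaction.
    public record TransferEvent(String from, String to, long amount) {}

    final Map<String, Long> accounts = new HashMap<>();  // table (in-memory state)
    final List<String> results = new ArrayList<>();      // result stream

    // Transaction function: reads and writes multiple rows as one unit.
    void process(TransferEvent e) {
        long src = accounts.getOrDefault(e.from(), 0L);
        if (src < e.amount()) {
            results.add("REJECTED " + e.from());  // report failure, no state changed
            return;
        }
        accounts.put(e.from(), src - e.amount());          // debit source row
        accounts.merge(e.to(), e.amount(), Long::sum);     // credit target row
        results.add("OK " + e.from() + "->" + e.to());     // report success
    }
}
```

In the real system the transaction function is registered up front and executed by the ledger runtime, and the result stream can also feed an external mirror of the tables.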
Streaming Ledger requires users to declare transaction functions up front, similar to registering a stored procedure in a database system, Hueske said.
In his example, “The event stream would provide data such as which account to take money out of, which account to move it to, and the amount. The transaction function would receive these events, read some of the state managed in the application, subtract the amount from the source account, add it to the target account and update both. So this transaction function would perform two reads and two writes, and all the reads and writes would be done in a consistent fashion. Many transactions would be done in parallel, but the transaction function would ensure they were done in a consistent and serializable fashion,” he explained.
Streaming Ledger does not use distributed locks or multiversion concurrency control, instead relying on the conflict-free scheduling of state access and lightweight logical clocks to achieve high throughput.
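A toy version of that idea can be sketched as follows. This is not data Artisans’ actual algorithm, only an illustration of the principle: tag every state access with a logical timestamp, and apply each key’s accesses strictly in timestamp order. Accesses to different keys never conflict, so the per-key queues can run in parallel, lock-free, and still produce the same result as a single-threaded serial execution:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch (NOT the actual Streaming Ledger scheduler): logical clocks plus
// conflict-free, per-key scheduling instead of distributed locks or MVCC.
public class LogicalClockScheduler {
    public record Op(long timestamp, String key, long delta) {}

    static Map<String, Long> parallelByKey(List<Op> ops) {
        Map<String, Long> state = new ConcurrentHashMap<>();
        Map<String, List<Op>> byKey = new HashMap<>();
        for (Op op : ops) {
            byKey.computeIfAbsent(op.key(), k -> new ArrayList<>()).add(op);
        }
        // Each key's partition runs independently, in logical-clock order.
        byKey.values().parallelStream().forEach(queue -> {
            queue.sort(Comparator.comparingLong(Op::timestamp));
            for (Op op : queue) state.merge(op.key(), op.delta(), Long::sum);
        });
        return state;
    }

    // Reference: one thread applying every access in global timestamp order.
    static Map<String, Long> serial(List<Op> ops) {
        Map<String, Long> state = new HashMap<>();
        ops.stream().sorted(Comparator.comparingLong(Op::timestamp))
           .forEach(op -> state.merge(op.key(), op.delta(), Long::sum));
        return state;
    }
}
```

Because the parallel run is guaranteed to match the serial one, the system can claim serializability without ever taking a lock on a shared row.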
“It acts as if there were a single thread processing all these transactions when they’re actually done in parallel to achieve better performance,” Hueske said. Streaming Ledger is not part of the open source Flink project, but a feature of data Artisans’ enterprise River Edition. The company, however, has posted Maven modules on GitHub for the SDK needed to define a streaming ledger application, a simple serial runner for experimenting with the SDK, and some example programs. The white paper also includes example code.
Feature image via Pixabay.