Apache Kafka + Spark + Database = Real-Time Trinity


Messaging System

Real-time pipelines often begin with data capture and use a messaging system to ensure every record makes it to the right place. Data might come from logging information, sensor data, financial market streams, mobile applications or other sources. It will be written to file systems, object stores and databases.

Apache Kafka is one example of such a messaging system. According to the Apache Kafka website: “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.” It acts as a broker between producers and consumers of message streams: producers publish messages to the log, and consumers read from it.

Because it is a distributed system, Kafka can scale the number of producers and consumers by adding servers or instances to the cluster. Kafka’s effective use of memory, combined with the commit log to disk, provides great performance for real-time pipelines plus durability in the event of server failure.
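The commit-log idea can be illustrated with a small sketch. This is a toy in-memory model, not Kafka's client API: the `CommitLog` class, its `append` and `poll` methods, and the consumer names are all hypothetical, chosen only to show that producers append to an ordered log while each consumer tracks its own read position.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy in-memory stand-in for one topic partition (hypothetical, not Kafka's API)."""
    records: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next offset to read

    def append(self, message):
        """Producer side: append a message and return its offset in the log."""
        self.records.append(message)
        return len(self.records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: read from this consumer's own position, then advance it."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for reading in ("temp=21", "temp=22", "temp=23"):
    log.append(reading)

first_batch = log.poll("dashboard", max_records=2)   # reads offsets 0 and 1
second_batch = log.poll("dashboard", max_records=2)  # reads offset 2
archive_batch = log.poll("archiver")                 # a separate consumer starts at offset 0
```

Because the log keeps every record and each consumer only advances its own offset, a new consumer can be added later and still read the full stream. A real Kafka deployment adds partitioning, replication and disk persistence on top of this basic shape.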

With message ingest and propagation in place, it is time to move to the next tier of transformation.

Transformation Tier

The transformation tier allows data to be manipulated, enriched and analyzed before it is persisted for use by an application. Today, Apache Spark is one of the most popular transformation tiers.

Spark is also a distributed, memory-optimized system, and therefore a natural complement to Kafka. Spark includes a streaming library and a rich set of programming interfaces that make data processing and transformation easier.

With Spark you can ingest data from Kafka, filter that stream down to a smaller data set, run enrichment operations to augment data, and then push that refined data set to a persistent data store.
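The ingest, filter, enrich and persist steps described above can be sketched in plain Python. This is not Spark code: the function names, the event fields (`device`, `value`) and the `device_locations` lookup are assumptions made for illustration, and a list stands in for both the Kafka stream and the persistent data store.

```python
def filter_valid(events):
    """Filter step: keep only events that carry a numeric reading."""
    for event in events:
        if isinstance(event.get("value"), (int, float)):
            yield event


def enrich(events, device_locations):
    """Enrichment step: add a location looked up from reference data."""
    for event in events:
        location = device_locations.get(event["device"], "unknown")
        yield {**event, "location": location}


def run_pipeline(events, device_locations, sink):
    """Ingest -> filter -> enrich -> persist; `sink` stands in for the datastore."""
    for event in enrich(filter_valid(events), device_locations):
        sink.append(event)
    return sink


incoming = [
    {"device": "a1", "value": 20.5},
    {"device": "b2", "value": None},  # dropped by the filter step
    {"device": "c3", "value": 19.0},  # no reference data for this device
]
stored = run_pipeline(incoming, {"a1": "warehouse"}, sink=[])
```

In a Spark job the same stages would typically be expressed as transformations on a distributed stream, with the final step writing to the database rather than appending to a list.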

Spark’s support for a wide range of operators to facilitate data transformation allows broad functionality within a single system, and makes it an ideal part of real-time pipelines. However, Spark does not include a storage engine and therefore works well when paired with a persistent datastore, such as an operational database.

Operational Database

Real-time data streams provide the most value when analysis spans both real-time and historical data. To do that, data must be persisted beyond the streaming stages of messaging and transformation into a permanent datastore. While that could be an unstructured system like the Hadoop Distributed File System or Amazon S3, neither of those solutions provides instant analytics.

An in-memory operational database, however, provides persistence for real-time and historical data as well as the ability to query both together.

As one example, MemSQL, an in-memory database for transactions and analytics that our company developed, can ingest and persist data from Spark. This allows applications to be built on an operational database using the freshest data. Now with familiar SQL, the lingua franca for data programming, enterprises can quickly and easily build real-time applications and run analytics.

Easily Remove Duplicates With an Operational Database

One of the challenges with streaming pipelines is the appearance of duplicates. In an effort to guarantee no data loss, messages and data points will sometimes be propagated multiple times. This can adversely affect capacity and complicate analytics.

With an operational database and a mutable dataset, the pipeline can use operations such as:

INSERT ... ON DUPLICATE KEY UPDATE

which will add a new record, or update the existing record that has the same key, so a message delivered more than once does not produce a duplicate row.
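The effect of an upsert on redelivered messages can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in, so the syntax is SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` (available in SQLite 3.24 and later) rather than the `ON DUPLICATE KEY UPDATE` form named above. The `readings` table, its columns and the `upsert` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (event_id TEXT PRIMARY KEY, value REAL, deliveries INTEGER)"
)


def upsert(conn, event_id, value):
    """Insert a new row, or update the existing row that has the same key."""
    conn.execute(
        """
        INSERT INTO readings (event_id, value, deliveries) VALUES (?, ?, 1)
        ON CONFLICT(event_id) DO UPDATE SET
            value = excluded.value,
            deliveries = deliveries + 1
        """,
        (event_id, value),
    )


# Event "e1" arrives twice, as can happen with at-least-once delivery.
for event_id, value in [("e1", 20.5), ("e2", 19.0), ("e1", 20.5)]:
    upsert(conn, event_id, value)

rows = conn.execute(
    "SELECT event_id, value, deliveries FROM readings ORDER BY event_id"
).fetchall()
```

Three messages were delivered, but the table holds only two rows. The `deliveries` counter is included only to make the repeated delivery visible; it is not required for deduplication.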

Completing the Real-Time Trinity

Companies such as Pinterest have seen the power of this software infrastructure combination, showcasing the results at this year’s Strata + Hadoop World.

The real-time data pipeline also enables advanced analytics from application data. Data from the operational database can go to Spark for advanced analytics, and be persisted back to the operational database for application use.

Economics make this workflow accessible. As each piece is a distributed system, users can deploy on commodity hardware or in the cloud, using the most cost-effective infrastructure. Performance scales with the number of nodes for flexibility. Entirely new deployment methods are emerging too, like Mesosphere, to simplify the deployment of multiple distributed systems.

It is hard to see our appetites for immediate information abating. Fortunately, cost-effective architectures like the real-time trinity support this demand today.

Eric Frenkiel is the founder and CEO of MemSQL, a distributed in-memory database.

