
March 30, 2022 by Phu Nguyen

Update Apache Kafka: A Primer


Why Kafka?

Before the introduction of Apache Kafka, Message Oriented Middleware (MOM) such as Apache Qpid, RabbitMQ, Microsoft Message Queue, and IBM MQ Series was used for exchanging messages across various components. While these products are good at implementing the publisher/subscriber pattern (Pub/Sub), they are not specifically designed for dealing with large streams of data originating from thousands of publishers. Most MOM products have a broker that exposes the Advanced Message Queuing Protocol (AMQP) for asynchronous communication.

Kafka is designed from the ground up to deal with millions of firehose-style events generated in rapid succession. It offers low-latency, “at-least-once” delivery of messages to consumers. Kafka also supports retention of data for offline consumers, which means that the data can be processed either in real time or in offline mode.

The fundamental difference between Message Oriented Middleware and Kafka is that Kafka clients never receive messages automatically. They have to explicitly ask for messages when they are ready to handle them.

Expanding further on the persistence and retention, Kafka is designed to be a distributed commit log. Much like relational databases, it can provide a durable record of all transactions that can be played back to recover the state of a system. The key thing to understand is that the data is stored durably in an order that can be read deterministically. Due to the distributed design, Kafka provides redundancy, which ensures high availability of data even when one of the servers faces disruption.
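The commit-log idea can be illustrated with a minimal in-memory sketch. This is not Kafka's storage engine; the CommitLog class, the replay function, and the account records are hypothetical names chosen for illustration. It only shows that an append-only, ordered record of changes can be read back deterministically to rebuild state.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Hypothetical append-only log; the list index plays the role of an offset."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read_from(self, offset):
        # Existing records are never modified, so reads are deterministic.
        return self.records[offset:]


def replay(log):
    """Rebuild account balances by applying every record in log order."""
    balances = {}
    for account, delta in log.read_from(0):
        balances[account] = balances.get(account, 0) + delta
    return balances


log = CommitLog()
log.append(("alice", 100))
log.append(("bob", 50))
log.append(("alice", -30))
state = replay(log)
```

Replaying the same log a second time produces the same state, which is the property the paragraph above relies on for recovery.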

This architecture makes Kafka the gateway for all things data. Multiple event sources can concurrently send data to a Kafka cluster, which reliably delivers it to multiple destinations.

Key Concepts and Terminology

Apache Kafka uses a slightly different nomenclature than the traditional Pub/Sub systems. Let’s explore the terminology to understand it better.

Message — In Kafka, messages represent the fundamental unit of data. Each message is a key/value pair. Irrespective of the data type, Kafka always converts messages into byte arrays.
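As a rough illustration of a key/value message reduced to byte arrays, the sketch below encodes a string key and a dictionary value. The UTF-8 and JSON encodings, the serialize and deserialize helpers, and the sample sensor reading are assumptions made for this example; real clients let you choose the serializer.

```python
import json


def serialize(key, value):
    """Turn a (str, dict) message into a pair of byte arrays."""
    key_bytes = key.encode("utf-8")
    value_bytes = json.dumps(value, sort_keys=True).encode("utf-8")
    return key_bytes, value_bytes


def deserialize(key_bytes, value_bytes):
    """Reverse of serialize: decode the bytes back into a (str, dict) pair."""
    return key_bytes.decode("utf-8"), json.loads(value_bytes.decode("utf-8"))


key_bytes, value_bytes = serialize("sensor-42", {"temp_c": 21.5})
key, value = deserialize(key_bytes, value_bytes)
```

The messaging layer only ever sees the two byte arrays; interpreting them is left to the producer and the consumer.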

Producers — Producers map to the publishers or writers of the Pub/Sub architecture. They are the sources that generate the messages ingested into the system. In the context of Kafka, the producer is often referred to as the client. It is important to note that this client is the source of data and is not to be confused with the consumer.

Consumers — Consumers are the subscribers or readers that receive the data. They sit at the other side of the Kafka infrastructure. Unlike subscribers in MOM, Kafka consumers are stateful: they are responsible for remembering their cursor position, which is called an offset. The consumer is also a client of the Kafka cluster. Each consumer may belong to a consumer group, which is introduced later in this section.
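A consumer that pulls messages and tracks its own offset might look like the following sketch. PullConsumer and its poll method are hypothetical and work on a plain Python list rather than a real broker; the point is only that nothing is delivered until the consumer asks, and that the consumer remembers where it left off.

```python
class PullConsumer:
    """Hypothetical consumer that remembers its own position in a log."""

    def __init__(self, log):
        self.log = log    # any list-like sequence of messages
        self.offset = 0   # index of the next message to read

    def poll(self, max_records=2):
        # Nothing is pushed to the consumer; it asks for the next batch.
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = ["m0", "m1", "m2"]
consumer = PullConsumer(log)
first = consumer.poll()
log.append("m3")           # a producer writes while the consumer is idle
second = consumer.poll()
third = consumer.poll()    # no new messages, so the batch is empty
```

Because the offset lives with the consumer, a message written while the consumer was busy is simply picked up on the next poll.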


Topics — Topics represent the logical collection of messages that belong to a group. They are similar to the topics in MOM. The data sent by the producers is stored in topics. Consumers subscribe to a specific topic that they are interested in.

Partition — Partitions are a concept specific to Apache Kafka that is not seen in traditional message queuing systems. Each topic is split into one or more partitions. While sending data, producers typically don’t specify the partition, but consumers are aware of the available partitions. Kafka may use the message key to automatically group messages with the same key into one partition. This scheme enables Kafka to scale the messaging infrastructure dynamically.
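Key-based grouping can be sketched as hashing the key and taking the result modulo the number of partitions. The choose_partition function and the device keys below are hypothetical. Kafka's Java client uses a murmur2 hash for keyed messages; CRC32 is used here only as a stand-in so the sketch stays within the Python standard library.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # CRC32 is a stand-in hash (an assumption of this sketch), not Kafka's own.
    return zlib.crc32(key) % num_partitions


num_partitions = 3
partitions = {p: [] for p in range(num_partitions)}
messages = [(b"device-1", "a"), (b"device-2", "b"), (b"device-1", "c")]
for key, value in messages:
    partitions[choose_partition(key, num_partitions)].append((key, value))
```

Since the hash depends only on the key, every message for device-1 lands in the same partition and keeps its relative order there.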

Partitions are redundantly distributed across the Kafka cluster. Messages are written to one partition, and that partition is copied to replicas maintained on different brokers within the cluster. The number of copies is determined by the topic’s replication factor.

The concept of partitions and consumer groups allows horizontal scalability of the system.

Consumer Groups — As mentioned earlier, consumers belong to a consumer group, which is typically associated with a topic. Each consumer within the group is mapped to one or more partitions of the topic. Kafka guarantees that a message is read by only a single consumer in the group. Kafka also ensures that, between them, the consumers of a group receive all the messages belonging to the topic.

Each consumer reads from a partition while tracking the offset. If a consumer that belongs to a specific consumer group goes offline, Kafka can reassign its partitions to a remaining consumer in the group. Similarly, when a new consumer joins the group, Kafka rebalances the association of partitions with the available consumers.
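The effect of rebalancing can be approximated with a simple round-robin assignment that is recomputed whenever group membership changes. The assign function and the consumer names are hypothetical, and Kafka's real assignment strategies are configurable and more involved; this sketch only shows that each partition has exactly one owner in the group at any time.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round-robin; one owner per partition."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {c: [] for c in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
before = assign(partitions, ["c1", "c2", "c3"])
after_failure = assign(partitions, ["c1", "c3"])      # c2 went offline
after_join = assign(partitions, ["c1", "c3", "c4"])   # c4 joined the group
```

In this sketch the whole assignment is recomputed on every change, which is the simplest way to keep all partitions covered.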

It is possible for multiple consumer groups to subscribe to the same topic. For example, in the IoT use case, a consumer group might receive messages for real-time processing through an Apache Storm cluster. A different consumer group may also receive messages from the same topic for storing them in HBase for batch processing.
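Independent consumption by several groups comes down to keeping one offset per group. In the sketch below, the group names "realtime" and "batch", the poll helper, and the sample readings are all hypothetical, and a single list stands in for a one-partition topic.

```python
log = ["reading-0", "reading-1", "reading-2"]
offsets = {"realtime": 0, "batch": 0}  # one cursor per consumer group


def poll(group, max_records):
    """Return the next batch for a group and advance only that group's offset."""
    start = offsets[group]
    batch = log[start:start + max_records]
    offsets[group] = start + len(batch)
    return batch


realtime_batch = poll("realtime", 3)  # the real-time group is fully caught up
batch_first = poll("batch", 1)        # the batch group reads at its own pace
```

Both groups see the same messages from the start of the topic, and progress made by one group does not affect the other.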


Broker — Each Kafka instance belonging to a cluster is called a broker. Its primary responsibility is to receive messages from producers, assign offsets, and finally commit the messages to disk. Depending on the underlying hardware, each broker can handle thousands of partitions and millions of messages per second.

The partitions in a topic may be distributed across multiple brokers. This redundancy ensures the high availability of messages.

Cluster — A collection of Kafka brokers forms the cluster. One of the brokers in the cluster is designated as the controller, which is responsible for handling administrative operations as well as assigning partitions to the other brokers. The controller also keeps track of broker failures.

The Role of ZooKeeper

Most contemporary distributed orchestration systems such as Kubernetes and Swarm rely on a distributed key/value store for maintaining the global state of the cluster. Consul, etcd, and even Redis are used for service discovery and cluster state management. Apache Kafka was designed long before these lightweight services were built.

Like most other Java-based distributed systems such as Apache Hadoop, Kafka uses Apache ZooKeeper as the distributed configuration store. ZooKeeper forms the backbone of the Kafka cluster and continuously monitors the health of the brokers. When new brokers are added to the cluster, ZooKeeper starts utilizing them by creating topics and partitions on them.

Apart from cluster management, initial versions of Kafka used ZooKeeper for storing the partition and offset information for each consumer. Starting from 0.10, that information has moved to an internal Kafka topic.

In the next article of this series, we will build an end-to-end IoT application that utilizes Apache Kafka for real-time data processing. Stay tuned.

Source: InApps.net
