March 30, 2022 by Anh Hoang

Apache Kudu: A Fast Columnar Data Store for Hadoop

Key Summary

This InApps.net article, published in 2022, introduces Apache Kudu, a columnar data store designed to bridge the gap between HDFS and HBase in the Hadoop ecosystem. Authored with an informative, technical tone, it aligns with InApps Technology’s mission to explore data science and software development trends, offering an accessible guide for leveraging Kudu in mixed-workload environments.

Key Points:

Context: Developed by Cloudera and donated to the Apache Software Foundation, Kudu became a top-level project in 2016. It provides a storage layer for Hadoop that supports both real-time streaming and batch analytics, unlike HDFS (batch-focused) or HBase (real-time-focused).
Core Insight: Kudu combines HDFS’s analytic performance with HBase’s real-time capabilities, enabling low-latency writes, random reads, updates, and deletes, simplifying complex systems for users with mixed workloads.
Key Features:
- Mixed Workload Support: Handles streaming ingestion, random access, and analytics, addressing needs for data corrections and real-time processing.
- Integration with Impala: Offers a unified SQL layer via Impala, with compatibility for SparkSQL and Apache Drill, supporting diverse analytic engines.
- Performance and Scalability: Optimized for modern hardware, Kudu provides fast columnar storage and querying, reducing the need for complex architectures.
Use Case: Rajan Chandras, director of data architecture and strategy at NYU Langone Medical Center, highlights Kudu/Impala as a potential alternative to the Hive/MapReduce/HDFS stack for advanced analytics.
Development Status: Pre-1.0 when the original interview took place in 2016, and used mainly by cutting-edge organizations compiling binaries from source. The 1.0 release, targeted for September 2016, aimed for enhanced stability, high availability, and enterprise-grade reliability.
Outcome: Apache Kudu enables developers and analysts to handle mixed workloads efficiently, combining real-time and batch processing in a single platform, streamlining Hadoop-based data workflows.

This article reflects InApps.net’s focus on innovative data science and software development, providing an inclusive, practical overview of Apache Kudu’s role in enhancing Hadoop’s flexibility for modern data applications.

The African antelope Kudu has vertical stripes, symbolic of the columnar data store in the Apache Kudu project.

Cloudera began working on Kudu in late 2012 to bridge the gap between the Hadoop Distributed File System (HDFS) and the HBase Hadoop database, and to take advantage of newer hardware. It donated Kudu and its accompanying query engine, Impala, to the Apache Software Foundation late last year; Kudu was recently named a top-level project.

“The idea was to build a storage system for the Hadoop ecosystem that was for mixed workloads,” explained Todd Lipcon, vice president of Apache Kudu and a software engineer at Cloudera.

HDFS is good for analytics where you’re batch loading large amounts of data, but you typically don’t update the data. It’s just a batch of transactions that took place yesterday or weblogs or something like that, he said.

Meanwhile, HBase basically took the other side of this trade-off. It’s much more real-time. It’s really good for streaming. You can write data quickly with low latency. You can randomly read that data. You can look up an individual record. You have the ability to update as well.

“When we found HBase was becoming really popular, we found that some users had a mix of the two. They weren’t just doing this online random access, they also had some analytics. Conversely, other customers started with analytics, then said, ‘Hey wait. With this analytics workload, it’d be really nice to start ingesting it in a streaming fashion instead of just batch. Occasionally I have data corrections – updates or deletes.’ So we had a lot of people in the community and customers at Cloudera who were kind of between a rock and a hard place,” he said.

They needed the capabilities of HBase but liked the performance of HDFS for analytics, and they were building really complicated systems to get both.

[Figure: kudu_vs_parquet]

“They would keep some amount of the data in HBase, like the most recent data, then have these background processes to export that data into HDFS, then have to synchronize the two systems to keep them up to date with the data. All that complexity was hampering people’s ability to adopt the ecosystem because you no longer had one storage system. You had to manage two, you had to learn about two, they had different APIs and you’re building a lot of this extra tooling to keep the data synchronized between the two.

“We decided to try to build a happy medium for the two use cases. And even if the happy medium probably won’t be optimal for either of the use cases, at least it will be one system that can do both use cases reasonably well. Of course, when you build a specialized system, it will be better than a general one. So there are specific use cases where HDFS will be more relevant and many use cases where HBase will be relevant.

“So basically, some people are willing to take some tradeoff on performance for simplicity of deployment and simplicity of application development,” he said.

Sort of Familiar

Cloudera stated its initial design goals for Kudu as:

- Strong performance for both scan and random access.
- High CPU efficiency and IO efficiency.
- The ability to update data in place.
- The ability to support active-active replicated clusters that span multiple data centers in different parts of the world.

Kudu is a storage system for tables of structured data. Its tables look like those in SQL relational databases, each with a primary key made up of one or more columns. The primary key enforces uniqueness and acts as an index for efficient updates and deletes.
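
To make the primary-key idea concrete, here is a minimal in-memory sketch in Python. It is not Kudu's API or its on-disk format; the class `ToyColumnarTable` and all of its method names are hypothetical, invented only for illustration. It assumes one list per column (so a scan touches a single column) and a dictionary that maps each primary-key value to a row position (so inserts can reject duplicates and updates and deletes can find a row directly).

```python
class ToyColumnarTable:
    """Illustrative in-memory table. Hypothetical; not Kudu's API or storage format."""

    def __init__(self, key_column, columns):
        if key_column not in columns:
            raise ValueError("key column must be one of the table's columns")
        self.key_column = key_column
        # One list per column, so scanning a column reads only that list.
        self.columns = {name: [] for name in columns}
        # Primary-key index: key value -> row position in every column list.
        self.index = {}

    def insert(self, row):
        key = row[self.key_column]
        if key in self.index:
            # The primary key enforces uniqueness.
            raise KeyError(f"duplicate primary key: {key!r}")
        self.index[key] = len(self.columns[self.key_column])
        for name, values in self.columns.items():
            values.append(row.get(name))

    def update(self, key, changes):
        pos = self.index[key]  # direct lookup through the key index
        for name, value in changes.items():
            if name == self.key_column:
                raise ValueError("primary key cannot be changed")
            self.columns[name][pos] = value  # update in place

    def delete(self, key):
        pos = self.index.pop(key)
        last = len(self.columns[self.key_column]) - 1
        # Move the last row into the freed slot, then drop the last slot.
        for values in self.columns.values():
            values[pos] = values[last]
            values.pop()
        if pos != last:
            moved_key = self.columns[self.key_column][pos]
            self.index[moved_key] = pos

    def get(self, key):
        pos = self.index[key]
        return {name: values[pos] for name, values in self.columns.items()}

    def scan(self, column):
        return list(self.columns[column])


table = ToyColumnarTable("id", ["id", "host", "hits"])
table.insert({"id": 1, "host": "a.example", "hits": 10})
table.insert({"id": 2, "host": "b.example", "hits": 7})
table.update(2, {"hits": 8})    # random access by primary key
total_hits = sum(table.scan("hits"))  # analytic scan over one column
```

The same structure serves both access patterns the article describes: `get` and `update` go through the key index, while `scan` reads a single column without touching the others. A real system adds persistence, encoding, and concurrency control, none of which is modeled here.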

Tables are made up of logical subsets of data called tablets, similar to partitions in relational database systems. Kudu replicates these tablets across multiple commodity hardware nodes using the Raft consensus algorithm, which ensures that every write is persisted by at least two nodes before the system responds to the client request, protecting against data loss from a machine failure.
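
The arithmetic behind tablets and majority acknowledgement can be sketched in a few lines of Python. This is a simplified illustration, not Kudu's implementation and not a full Raft protocol (there is no leader election or log here). The function names are hypothetical, hashing the key to pick a tablet is just one possible partitioning scheme, and the three-replica setup is an assumption chosen because its majority is two, which matches the "at least two nodes" figure above.

```python
import zlib


def tablet_for(key, num_tablets):
    """Pick a tablet for a primary-key value by hashing it (assumed scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_tablets


def majority(num_replicas):
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def write_is_acknowledged(replica_acks):
    """Return True once a majority of replicas report the write as persisted.

    replica_acks is a list of booleans, one per replica of the tablet.
    """
    return sum(replica_acks) >= majority(len(replica_acks))


# Assumed setup: 4 tablets, 3 replicas per tablet.
tablet = tablet_for("user-42", num_tablets=4)
ok_with_one_failure = write_is_acknowledged([True, True, False])
ok_with_two_failures = write_is_acknowledged([True, False, False])
```

With three replicas, a write can be acknowledged while one node is down, but not while two are, which is why a single machine failure does not lose acknowledged data in this model.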

There are Java, C++, and Python APIs for “NoSQL”-style access to individual rows. The same APIs can also be used for batch access in machine learning or analytics workloads.

You can stream real-time data in using the Java client, and then process it immediately using Spark, Impala, or MapReduce, and transparently join Kudu tables with data stored in HDFS or HBase.

Rajan Chandras, director of data architecture and strategy at NYU Langone Medical Center, has called Kudu/Impala potential game changers as a full-fledged alternative to the Hive/MapReduce/HDFS stack.

1.0 Coming Soon

While Kudu integrates well with Impala, the two are not tightly coupled, Lipcon says.

“You can choose to use SQL or Impala, which is the one Cloudera has been focusing on, but you can also choose to use SparkSQL. The Kudu project has people working on Apache Drill. We’re happy to integrate with any SQL engine – the more the better. For our success, we want to work with as many SQL engines and analytic engines as possible.”

The software is still in the pre-1.0 release phase, though a number of organizations are already using Kudu in production. Those users are pretty tightly integrated with the development community, though. These are the types of cutting-edge users who compile their own Kudu binaries from source.

“Our goal, of course, is to make it a generally usable thing for any enterprise,” Lipcon said. “That’s the 1.0 milestone we’re marching toward, and we’re hoping that will happen sometime in September. It will include more quality assurance and a couple more features that are important for stability and reliability of the system. Fixing some bugs, improving high availability, knowing: Can you really run this thing 24/7 365 and have really great uptime?”

Feature Image: “n130_w1150” by Biodiversity Heritage Library, licensed under CC BY-SA 2.0.

Source: InApps.net

Anh Hoang

Anh Hoang is Head of SEO Optimization at InApps Technology, ensuring that the message and research of InApps Technology reach the most people possible while adhering to our strict journalistic standards of excellence and integrity.