Dr. Tong Zhang
Dr. Tong Zhang is a well-established researcher with significant contributions to the areas of data storage systems and VLSI signal processing. Dr. Zhang is a co-founder and Chief Scientist of ScaleFlux, responsible for developing key techniques and algorithms for Computational Storage products and exploring their optimal use in mainstream application domains such as databases. He is currently a Professor at Rensselaer Polytechnic Institute (RPI). His current and past research spans a wide range of areas, including databases, filesystems, solid-state and magnetic data storage devices and systems, digital signal processing and communication, error correction coding, VLSI architectures, and computer architecture.

In-storage transparent compression is providing enterprises with benefits far beyond what one might expect. By transparently building compression directly into storage drives, enterprises experience enhanced performance at a lower cost with zero adoption barrier — but the story doesn’t end there. For those willing to go a step further and adapt their data management software, there is substantial untapped potential for further performance gains.

Before we delve into what’s possible, let’s take a look at how this technology came about, and the more well-known benefits.

Before transparent compression, storage drives acted only as physical devices to store data. There was no processing, compression, or decompression: If you stored two terabytes (TB) of data on a device, it would physically take up 2TB worth of space. This is how today's solid-state drives (SSDs) work. Only computational storage drives (CSDs) are capable of compressing (and decompressing) data so that it takes up less physical space. For example, if a user writes 2TB of data onto a CSD, it may only take up 500GB of physical space after being compressed.
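To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The 4:1 compression ratio is a hypothetical assumption for illustration only; real ratios depend entirely on the data being stored.

```python
def physical_footprint(logical_bytes: float, compression_ratio: float) -> float:
    """Physical media consumed by a transparently compressing drive,
    given logical (user-visible) data and a logical:physical ratio."""
    return logical_bytes / compression_ratio

TB = 1024 ** 4
# 2 TB of user data at a hypothetical 4:1 ratio -> ~0.5 TB on media
print(physical_footprint(2 * TB, 4.0) / TB)  # 0.5
```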


In-storage transparent compression is the natural starting point for commercializing CSDs, for two main reasons.

Key Perks of In-Storage Transparent Compression

Firstly, there’s no adoption barrier: Users simply plug-and-play to experience lower cost and higher performance. This is possible because in-storage transparent compression doesn’t demand any changes to an existing storage I/O software stack (e.g. filesystem, block layer, or driver) or I/O interface protocols (e.g. NVMe, SATA). This ensures its seamless integration and deployment into existing infrastructure without requiring user applications to change a single line of code.

Secondly, it offloads work that host processors handle inefficiently. Despite its very wide applicability, lossless data compression (e.g. the well-known LZ77 algorithm and its variants such as lz4, zlib, and zstd) involves a significant amount of random data access that inevitably causes high CPU/GPU cache miss rates, and hence low hardware utilization efficiency. For this reason, it is highly desirable to relieve the host CPU/GPU of executing lossless compression by moving it into the drive.
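To get a feel for that host-side cost, here is a minimal sketch that times Python’s standard-library zlib on an arbitrary, moderately compressible buffer. Absolute throughput will vary by machine and by data, but single-core software compression is typically far slower than the bandwidth of a modern NVMe drive.

```python
import time
import zlib

# Build a moderately compressible buffer of roughly 64 MB (repeated text).
chunk = b"transparent compression offloads work from the host CPU. "
data = chunk * (64 * 1024 * 1024 // len(chunk))

start = time.perf_counter()
compressed = zlib.compress(data, 6)  # zlib level 6, a common default
elapsed = time.perf_counter() - start

mb = len(data) / 2**20
print(f"{mb:.0f} MB -> {len(compressed) / 2**20:.2f} MB "
      f"in {elapsed:.2f} s ({mb / elapsed:.0f} MB/s on one core)")
```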

What’s Possible: Massive Potential for Innovation

At first glance, one might think that these are the full benefits of in-storage transparent compression, but this is far from the complete picture. Beyond the obvious cost and IOPS perks, transparent compression opens a door to exciting system-level innovation opportunities that remain largely unexplored.

This lack of exploration is largely because, for the past four decades, data management software has been built under a fixed-size block I/O constraint. The arrival of virtually variable-size block I/O (enabled by in-storage transparent compression) ushers in unique opportunities to rethink the design and implementation of data management software.

Case Study: PostgreSQL

Let’s examine how in-storage transparent compression benefits one of the most widely used relational databases: PostgreSQL.

PostgreSQL stores each table as a collection of fixed-size pages (8KB by default) and indexes them with B-trees. It realizes multiversion concurrency control (MVCC) by keeping all row versions in the tablespace. So, instead of directly updating a row within a page, PostgreSQL first stores the new row version at a new location and relies on a background vacuum process to reclaim the space occupied by dead row versions.


As a result, the transactions per second (TPS) performance of update operations strongly depends on whether PostgreSQL can store the new row version in the same page as the old row version: if it can, the write stays cheap (and may qualify as a heap-only tuple, or HOT, update that skips index maintenance entirely); if the page is full, the new version spills to another page and every index on the table must be updated as well.
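This version churn is easy to observe directly: each update writes a new physical tuple, whose location (page number, slot) PostgreSQL exposes through the system column ctid. Below is a minimal sketch using psycopg2, with a placeholder connection string and a hypothetical demo table.

```python
import psycopg2  # assumes a reachable PostgreSQL instance

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS demo")
cur.execute("CREATE TABLE demo (id int PRIMARY KEY, payload text)")
cur.execute("INSERT INTO demo VALUES (1, 'v1')")

for version in ("v2", "v3"):
    cur.execute("UPDATE demo SET payload = %s WHERE id = 1", (version,))
    # ctid is (page number, tuple slot); it moves on every update because
    # PostgreSQL writes a new row version instead of overwriting in place.
    cur.execute("SELECT ctid FROM demo WHERE id = 1")
    print(version, cur.fetchone()[0])
```

As long as the new version lands in the same page, only the slot component of ctid changes; once the page fills up, the page number changes too.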

When PostgreSQL runs on normal SSDs or HDDs, each page always consumes 8KB of physical storage space, regardless of how full it is with user data. This creates a clear TPS performance versus storage cost trade-off, configured through the fill factor (a user-adjustable PostgreSQL parameter that caps how full each page may be packed).
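For reference, the fill factor is set per table (indexes have a separate setting). A small sketch, reusing the hypothetical demo table and placeholder connection from the previous example:

```python
import psycopg2  # assumes a reachable PostgreSQL instance

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Reserve 30% of every 8 KB heap page for future row versions.  On a
# conventional SSD or HDD each page still occupies a full 8 KB on media,
# so this trades roughly 30% extra physical space for update headroom.
cur.execute("ALTER TABLE demo SET (fillfactor = 70)")
cur.execute("VACUUM FULL demo")  # rewrite existing pages under the new setting
```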

In-storage transparent compression enables virtually variable-size block I/O behind a standard 4KB block I/O interface. In other words, when PostgreSQL runs on storage drives with built-in transparent compression, how full a page is with user data directly determines how much physical storage the page consumes.

This relieves PostgreSQL of the TPS performance versus storage cost trade-off: As we reduce the fill factor to leave more space in each page for future updates, physical storage usage does not grow proportionally. By enabling virtually variable-size block I/O, in-storage transparent compression allows PostgreSQL to drastically reduce the fill factor, and therefore improve TPS performance, without inflating the physical storage cost.
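A rough model makes the effect concrete. On a conventional drive, every 8KB page costs a full 8KB of media; behind transparent compression, a page that is only a fraction full costs roughly that fraction of 8KB (further divided by the data’s own compressibility), because zero-filled free space compresses away to almost nothing. The ratios below are illustrative assumptions, not measurements.

```python
PAGE = 8 * 1024  # PostgreSQL default page size (bytes)
GIB = 2 ** 30

def physical_gib(data_bytes: float, fillfactor: float, csd: bool,
                 data_ratio: float = 2.0) -> float:
    """Rough footprint model (illustrative assumptions, not measurements).

    fillfactor: fraction of each 8 KB page holding user data (0..1).
    csd: True for a drive with in-storage transparent compression.
    data_ratio: assumed compressibility of the user data itself.
    """
    n_pages = data_bytes / (fillfactor * PAGE)  # lower fill -> more pages
    if not csd:
        return n_pages * PAGE / GIB  # every page costs a full 8 KB
    # Only the filled fraction of each page costs media; it compresses at
    # data_ratio, and the zeroed free space compresses to almost nothing.
    return n_pages * (fillfactor * PAGE) / data_ratio / GIB

for ff in (1.0, 0.7, 0.5):
    print(f"fillfactor {ff:.0%}: "
          f"plain SSD {physical_gib(100 * GIB, ff, False):6.1f} GiB, "
          f"CSD {physical_gib(100 * GIB, ff, True):6.1f} GiB")
```

In this toy model, halving the fill factor doubles the footprint on a plain SSD but leaves the CSD footprint flat, which is exactly the decoupling described above.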

This is just a single example of how in-storage transparent compression can benefit one database, but many more possibilities are waiting to be explored.

With the advent of in-storage transparent compression, there’s never been a more exciting time to experiment with what’s possible when it comes to databases. Aside from the cost and IOPS benefits, users can now pursue performance gains that were simply out of reach before.

Feature image via Pixabay.