Álvaro Hernández

“Is Kubernetes ready for stateful workloads?” is the first question that pops up when decision-makers consider deploying databases on Kubernetes. For years the answer was “don’t do it,” and for good reasons: Kubernetes was initially designed to orchestrate stateless workloads. But the technology has matured, and it is time to reconsider running data on Kubernetes.

There are three important technical aspects to be considered:

  • Kubernetes maturity
  • Kubernetes stateful capabilities
  • Availability and performance characteristics of running databases in containers

How Mature Is Kubernetes?

Assessing the maturity of any technology isn’t straightforward, but there are solid signals to go by. Kubernetes is a Cloud Native Computing Foundation (CNCF) graduated project, meaning that the technology has the “adoption, a healthy rate of changes, and committers from multiple organizations.” The CNCF’s 2020 Survey Report shows that “91% of respondents using containers report using Kubernetes, 83% of them in production.”

Since November 2017, the reputed analyst firm Thoughtworks has considered Kubernetes a mature technology that companies should adopt, explaining that “it has become the default solution for most of our clients when deploying containers into a cluster of machines.”

[Figure: a bell curve showing the adoption of technology, from the CNCF.]

Chart courtesy of the Cloud Native Computing Foundation.

Is Kubernetes Stateful Ready?

Kubernetes’ stateful capabilities are often doubted, and a first-generation stateful feature named PetSet is (partially) to blame. It was replaced by the current stateful technology in Kubernetes: StatefulSets. Generally available (“GA”) since Kubernetes 1.9, released in December 2017, StatefulSets are used today across countless solutions that provide persistent, non-ephemeral storage for Kubernetes containers. This is what makes it possible to deploy Vitess and other cloud native databases on Kubernetes.


Most notably, StatefulSets mount PersistentVolumes (“PVs”) into the containers. These PVs are generally provided by storage external to the Kubernetes node, either in the form of networked drives or software-defined storage solutions like OpenEBS. In essence, the storage used by Kubernetes in the cloud is the same EBS volumes you use on AWS or the same Persistent Disks you use on GCP, so we can expect the same level of maturity.
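To make the relationship between a StatefulSet and its PVs concrete, here is a minimal sketch in Python that builds a StatefulSet manifest as a plain dictionary and prints it as JSON, a format kubectl accepts alongside YAML. The workload name, image, mount path, and storage size are placeholders chosen for illustration, not values from any real deployment.

```python
import json


def build_statefulset(name: str, image: str, replicas: int, storage: str) -> dict:
    """Return an apps/v1 StatefulSet manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Headless Service that gives every pod a stable network identity.
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Mounts the per-pod volume declared below (matched by name).
                        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/data"}],
                    }],
                },
            },
            # One PersistentVolumeClaim is created per replica from this template,
            # and it stays bound to the same pod identity across restarts.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


# Placeholder values, for illustration only.
manifest = build_statefulset("demo-db", "example.org/demo-db:1.0", replicas=3, storage="10Gi")
print(json.dumps(manifest, indent=2))
```

The part that matters is volumeClaimTemplates: each replica gets its own claim, and whatever storage class backs that claim (a networked drive, or a software-defined solution) is where the data actually lives.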

Performance of Running Data on K8s

Surely, database performance suffers in Kubernetes, doesn’t it? Containers are wrongly perceived as “lightweight virtual machines.” Rather, they are extremely thin layers of abstraction wrapping the filesystem, process, and networking spaces provided by the Linux kernel. There might be some overhead if you use only ephemeral container storage for the data, but the overhead is negligible if you use external PV storage.

And what about the ephemeral nature of containers? Wouldn’t this affect high availability? Since containers are just “wrappers” around a process, their lifetime is tied to that of the process. In other words, containers will be as stable as the database process running inside of them.

Running Databases on Kubernetes Revolutionizes the Way You Run Databases

There are obvious advantages to running databases on Kubernetes: the simplicity of deployment, having the whole stack managed by the same orchestration tool, auto-healing, and automatic reprovisioning of failed containers leading to higher availability. For example, if one of the nodes running a database fails, Kubernetes will automatically self-heal, rescheduling the workload on another node. In cooperation with the database management software, it may elect a previously existing replica as the new database primary and re-initialize the new node as a new replica, all automatically. But there are other, more important, reasons why you want to run databases in Kubernetes.
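The failover sequence described above can be sketched as a small, self-contained simulation. It is an illustration only: the Instance class, the replayed_lsn field, and the rule of promoting the most up-to-date healthy replica are assumptions made for this example, not the behavior of Kubernetes or of any particular database operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    role: str               # "primary" or "replica"
    healthy: bool = True
    replayed_lsn: int = 0   # hypothetical replication position


def fail_over(cluster: List[Instance]) -> List[Instance]:
    """Promote a replica if the primary is down, then rejoin the old primary as a replica."""
    primary = next((i for i in cluster if i.role == "primary"), None)
    if primary is not None and primary.healthy:
        return cluster  # nothing to do

    candidates = [i for i in cluster if i.role == "replica" and i.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available to promote")

    # Assumption: prefer the replica that has replayed the most data.
    new_primary = max(candidates, key=lambda i: i.replayed_lsn)
    new_primary.role = "primary"

    if primary is not None:
        # Simulates the workload being rescheduled on another node and
        # re-initialized from the new primary as a fresh replica.
        primary.role = "replica"
        primary.healthy = True
        primary.replayed_lsn = new_primary.replayed_lsn
    return cluster


cluster = [
    Instance("db-0", "primary", healthy=False, replayed_lsn=120),
    Instance("db-1", "replica", replayed_lsn=118),
    Instance("db-2", "replica", replayed_lsn=120),
]
fail_over(cluster)
print([(i.name, i.role) for i in cluster])
```

In this run db-2 is promoted because it is the most up-to-date healthy replica, and db-0 comes back as a replica. A real system also has to fence the old primary and agree on the election across nodes, which is exactly the kind of logic the database management software contributes.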

Most companies want to operate databases as a DBaaS (“Database-as-a-Service”): to self-provision a self-healing database that includes backups and monitoring. While this is offered by most cloud providers, doing it yourself with Kubernetes can save significant costs and offer additional capabilities, such as multicloud and cloud portability.


These capabilities are made available via Kubernetes Operators. Operators are application-specific extensions to Kubernetes that encode deployment and operations automation while exposing simple interfaces to the users. Advanced database Kubernetes operators bring, among others, the following benefits:

  • A declarative approach to deployments and updates, making it 100% GitOps friendly and perfect for any company using CI/CD. Operators define CRDs (Custom Resource Definitions): high-level objects, typically interfaced as simple YAML files, that make it possible to deploy and manage complex database architectures in a simple manner.
  • Automation of “Day 2 Operations”: deployment, high availability, backups, and monitoring; patching, vacuuming, bloat removal, reindexing, etc. Operators can encode these operations into CRDs, so that a YAML file is all it takes to have them performed automatically. One example of this approach is StackGres (which I founded), an advanced operator to run Postgres on Kubernetes, which fully automates all the operations mentioned above.
  • Externalization of database functionality to well-known third-party Kubernetes components, like the Envoy proxy; Prometheus and Grafana for monitoring; or Cert Manager for SSL certificate management. Database operators may rely on these components to offload database functionality, which reduces the cognitive load on users, who are already familiar with these tools, while providing more advanced functionality.
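The declarative approach behind operators boils down to a reconciliation loop: compare the state the user declared with the state that is actually observed, and derive the actions that close the gap. The sketch below illustrates that idea in a few lines of Python. The field names (version, instances, backups) are hypothetical and do not correspond to the CRD of StackGres or any other operator.

```python
from typing import List


def reconcile(desired: dict, observed: dict) -> List[str]:
    """Return the actions needed to move the observed state to the desired one."""
    actions = []

    if observed.get("version") != desired["version"]:
        actions.append(f"upgrade to version {desired['version']}")

    diff = desired["instances"] - observed.get("instances", 0)
    if diff > 0:
        actions.append(f"add {diff} replica(s)")
    elif diff < 0:
        actions.append(f"remove {-diff} replica(s)")

    if desired.get("backups") != observed.get("backups"):
        actions.append(f"set backup schedule to {desired['backups']!r}")

    return actions


# What the user declares (in practice, the spec of a custom resource kept in Git).
desired = {"version": "14", "instances": 3, "backups": "0 3 * * *"}
# What the operator currently finds in the cluster.
observed = {"version": "13", "instances": 1, "backups": None}

actions = reconcile(desired, observed)
for action in actions:
    print(action)
```

Once observed matches desired, reconcile returns an empty list and nothing happens. That idempotence is what makes the model a good fit for GitOps: the file in the repository is the single source of truth, and the operator keeps converging toward it.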

Running databases on Kubernetes is not only the future but also the present, as shown by leading companies such as Goldman Sachs, Zalando, and Flipkart. As with any technology, careful and objective evaluation should be performed before deploying production workloads.

Unsurprisingly, the Data on Kubernetes 2021 report found that 90% of the responding companies believe that Kubernetes is ready for stateful workloads. A large majority of these organizations (70%) run stateful workloads in production, with databases topping the list. Those running 75% or more of their production workloads on Kubernetes report impressive productivity gains of 2x or more!

Considering all the advantages that running databases on Kubernetes offers, companies should make sure to consider it. Running data on Kubernetes was the last frontier on the way to a fully orchestrated infrastructure, and I believe that this shift will unleash considerable value for businesses.


Photo by Guille Álvarez on Unsplash.