Service Level Objectives in Kubernetes – InApps 2022


SLOs in a Nutshell

William Morgan

William is the co-founder and CEO of Buoyant, the creator of the open source service mesh project Linkerd. Prior to Buoyant, he was an infrastructure engineer at Twitter, where he helped move Twitter from a failing monolithic Ruby on Rails app to a highly distributed, fault-tolerant microservice architecture. He was a software engineer at Powerset, Microsoft, and Adap.tv, a research scientist at MITRE, and holds an MS in computer science from Stanford University.

At its most basic level, an SLO is simply a metric, a goal for that metric, and a time period. For instance: “the success rate for service A must be at least 99.7% over the past 30 days.” The metric is known as the “service level indicator” (SLI) and the goal is the “objective.”

The output of an SLO is the error budget, which is a measure of how the metric is doing relative to the goal over that time period. For example, if your SLO is defined as 99% successful over a 30-day period, and the success rate over that period is 99.75%, then 0.25% of requests failed against an allowance of 1%. A quarter of the allowance has been spent, so your remaining error budget is 75%.
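The arithmetic behind that example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `error_budget_remaining` is hypothetical, and it assumes the SLI is a ratio of good events to total events (such as a success rate) expressed between 0 and 1.

```python
def error_budget_remaining(objective: float, observed: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is exactly used up,
    and a negative value means the objective has been violated.
    Both arguments are ratios between 0 and 1 (an assumption of this sketch).
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    allowed_failure = 1.0 - objective   # e.g. 1% for a 99% objective
    actual_failure = 1.0 - observed     # e.g. 0.25% for a 99.75% success rate
    return 1.0 - actual_failure / allowed_failure


# The example from the text: 99% objective, 99.75% observed.
budget = error_budget_remaining(objective=0.99, observed=0.9975)
print(f"{budget:.0%} of the error budget remains")
```

A result below zero corresponds to the "below 0" case: more failures occurred than the objective allows.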

The error budget is a measure of how much leeway is remaining before the objective is violated. For a service owner, the error budget represents a way to quantify the amount of risk they can incur — an indicator of whether you should hold off on new deployments until things cool off, for example.

But for a platform owner, the error budget acts as something else: a kind of context-free judgment of the health of the service. If the error budget for an SLO is 100% and steady, then we know things are going well for that service. If it’s close to 0 (or below 0!) and dropping, then we know things are going poorly. It doesn’t matter what the underlying metric is, what the application does, or how it performed last month — the error budget is a universal number.
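To illustrate why a universal number is useful, here is a small sketch of how a platform owner might triage workloads purely by remaining error budget. The workload names and figures are invented for illustration, and `rank_by_error_budget` is a hypothetical helper, not part of any tool mentioned here.

```python
from typing import Dict, List, Tuple


def rank_by_error_budget(
    slos: Dict[str, Tuple[float, float]]
) -> List[Tuple[str, float]]:
    """Sort workloads from least to most remaining error budget.

    `slos` maps a workload name to (objective, observed), both ratios
    between 0 and 1. No knowledge of what each workload does is needed.
    """
    ranked = []
    for name, (objective, observed) in slos.items():
        remaining = 1.0 - (1.0 - observed) / (1.0 - objective)
        ranked.append((name, remaining))
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical workloads with very different objectives.
workloads = {
    "search": (0.99, 0.9975),
    "checkout": (0.999, 0.9985),
    "thumbnails": (0.95, 0.99),
}

for name, remaining in rank_by_error_budget(workloads):
    print(f"{name:12s} {remaining:+.0%}")
```

In this made-up data, `checkout` sorts first because it has overspent its budget, even though its raw success rate is the highest of the three. That is the sense in which the error budget is context-free.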

This universal, context-free quality of the error budget is what makes SLOs so valuable on a Kubernetes platform.

SLO compliance, SLIs, and error budget for a Kubernetes workload (Dive dashboard).

SLOs for Kubernetes Platform Owners

The Kubernetes platform owner may be responsible for hundreds or thousands of applications running across tens or hundreds of Kubernetes clusters. And they may understand none of them. (Arguably, this lack of understanding is the mark of a healthy platform!)

The Challenges of Using SLOs

Despite their many benefits, implementing SLOs for a Kubernetes platform can be difficult. As a first challenge, consistent SLOs require consistent metrics: what are the success rates, latencies, and so on of your Kubernetes workloads at any point in time? Next, you must formulate the SLOs with appropriate SLIs, objectives, and time periods: what is the “right” parameterization of SLOs that you want to track? Finally, you must actually compute the error budgets. While the math is simple, selecting the correct metrics data points from the correct workloads during the correct time periods can be non-trivial, especially when services and workloads change over time.
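As a rough sketch of the "correct data points during the correct time period" problem, the following Python computes a success rate over a trailing window from timestamped request counts. The sample format (timestamp, successful requests, total requests) and the function name are assumptions made for this example; real metrics pipelines store and aggregate data differently.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# (timestamp, successful requests, total requests): a hypothetical sample shape.
Sample = Tuple[datetime, int, int]


def windowed_success_rate(
    samples: Iterable[Sample], now: datetime, window: timedelta
) -> Optional[float]:
    """Success rate over the trailing window, or None if there was no traffic."""
    start = now - window
    good = 0
    total = 0
    for timestamp, ok, count in samples:
        # Only samples inside (start, now] contribute to the SLI.
        if start < timestamp <= now:
            good += ok
            total += count
    if total == 0:
        return None
    return good / total


samples = [
    (datetime(2020, 9, 1), 0, 100),       # outside the 30-day window, ignored
    (datetime(2020, 10, 15), 990, 1000),
    (datetime(2020, 10, 30), 998, 1000),
]
rate = windowed_success_rate(samples, datetime(2020, 11, 1), timedelta(days=30))
print(rate)
```

Weighting by request count rather than averaging per-sample rates keeps quiet periods from counting as much as busy ones, which is one reasonable choice among several.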

For the metrics challenge, at least, there are some simple options. A service mesh like the open source CNCF project Linkerd can provide a consistent and uniform layer of metrics for all HTTP and gRPC services on your Kubernetes clusters, without requiring any configuration.

Formulating the SLOs on top of these metrics is the next step. Here, there is a spectrum of options, ranging from “get all stakeholders in a meeting and hammer it out from first principles” to “just use the current metric value as the objective and see what happens.” Tooling here can help immensely, especially with the latter approach, by providing suggestions based on historical data.
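One way to picture the "use the current metric value" end of that spectrum is a heuristic that derives a starting objective from past measurements. The sketch below is purely hypothetical: it takes the worst success rate seen across past periods and rounds it down to a tenth of a percent. It does not describe how any particular tool generates its suggestions.

```python
import math
from typing import Sequence


def suggest_objective(history: Sequence[float], decimals: int = 3) -> float:
    """Suggest an objective the service would have met in every past period.

    `history` holds past success rates as ratios between 0 and 1.
    The worst value is rounded down so the suggestion is never stricter
    than what the service has actually delivered.
    """
    if not history:
        raise ValueError("history must contain at least one measurement")
    scale = 10 ** decimals
    return math.floor(min(history) * scale) / scale


# Hypothetical success rates from three past 30-day periods.
past_rates = [0.9991, 0.9975, 0.9983]
print(suggest_objective(past_rates))
```

A suggestion like this is only a starting point. Whether it is the right objective still depends on what users of the service actually need.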

The final step is computing the error budget. The Kubernetes ecosystem provides good options here in the form of open source tools like Prometheus and Grafana. With Linkerd metrics in place, for example, SLOs can be expressed as Prometheus queries and error budgets plotted as Grafana dashboards. Alternatively, hosted tools like Dive can make use of these same Linkerd metrics and allow you to set up and track SLOs with the click of a button, across arbitrary numbers of clusters and workloads.
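As a sketch of what "SLOs expressed as Prometheus queries" might look like, the Python below assembles PromQL strings for a success-rate SLI and its remaining error budget. It assumes the Linkerd proxy's `response_total` counter with `classification`, `namespace`, `deployment`, and `direction` labels; check the metric and label names against your own Linkerd and Prometheus setup before relying on them. The function names and the `emojivoto`/`web` values are illustrative only.

```python
def success_rate_query(deployment: str, namespace: str, window: str = "30d") -> str:
    """Build a PromQL expression for inbound success rate over `window`.

    Assumes Linkerd-style labels on `response_total`; adjust to your setup.
    """
    selector = (
        f'namespace="{namespace}", deployment="{deployment}", direction="inbound"'
    )
    good = f'sum(rate(response_total{{classification="success", {selector}}}[{window}]))'
    total = f"sum(rate(response_total{{{selector}}}[{window}]))"
    return f"{good} / {total}"


def error_budget_query(
    deployment: str, namespace: str, objective: float, window: str = "30d"
) -> str:
    """Build a PromQL expression for the remaining error budget (1 = untouched)."""
    sli = success_rate_query(deployment, namespace, window)
    return f"1 - (1 - ({sli})) / (1 - {objective})"


print(error_budget_query("web", "emojivoto", objective=0.99))
```

The resulting string could be pasted into a Grafana panel or a Prometheus recording rule. Long range windows such as `30d` can be expensive to evaluate, so precomputing shorter windows is a common alternative.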

No matter which approach you take, adopting SLOs can play a vital role in helping platform owners understand the state of their applications in a way that’s both uniform and context-free, which means they can prioritize their efforts and ensure that both the applications — and the platform on which they run — remain reliable.

To learn more about Kubernetes and other cloud native technologies, consider coming to KubeCon + CloudNativeCon North America 2020, Nov. 17-20, virtually.

The Cloud Native Computing Foundation is a sponsor of InApps.


Source: InApps.net
