Red Hat Code Sleuths Uncover Mysterious Bug in Registry Service

Red Hat sponsored this post.

Alex Handy

Alex is a technical marketing manager at Red Hat. In his previous life, he cut his teeth covering the launch of the first iMac, before embarking upon a 20-plus-year career as a technology journalist. His work has appeared in Wired, The Atlanta Journal-Constitution and the Austin American-Statesman.

After updating to OpenShift 4.3.19, Quay.io experienced intermittent service interruptions. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery.

You’ve heard stories, but if you’re lucky, you’ve never experienced it. The bug is below you. It’s above you. It’s in the walls. It’s listening to us right now.

Troubleshooting and debugging are time-honored traditions of the methodical and systematic elimination of possibilities. But what happens if you cannot rule out a portion of the stack because your team does not have deep knowledge of it? Or worse yet, what if one of the layers of your stack is closed source software?

What if, horror of horrors, your stack is entirely open source and the bug is down in one of those layers? In Kubernetes? In Linux? Can your teams even begin to comprehend tracking down that type of bug? Can they even eliminate it as a possibility without reading hundreds of pages of code and documentation?

Growing to 10-Digit Scale

Red Hat’s Quay.io is a very large hosted service. There’s been a lot of news recently about the business of hosting container images at scale for enterprise cloud users, and Quay.io has quietly been performing that function since 2013 and growing steadily. In the month of August 2020 alone, Quay.io served 1 billion container pulls and had 100% uptime.

Back in 2014, when Quay.io was acquired by CoreOS, a decision was made to build an App Registry into the service. This predated the modern methods of cloud native artifact bundling that we use today in Kubernetes, with solutions like OCI, but the functionality was nonetheless included in Quay’s codebase. Because this feature wasn’t what most users adopted Quay.io to do, it wasn’t heavily used, and so it didn’t get much engineering scrutiny.

App Registry is a lesser-known feature of Quay.io that allows objects like Helm charts and containers with rich metadata to be stored. While most Quay.io customers don’t use this feature, Red Hat OpenShift is a large user. The OperatorHub within OpenShift uses App Registry to host all of its Operators.

Every OpenShift 4 cluster uses the embedded OperatorHub to serve a catalog of available Operators, to install them, and to provide updates to Operators that are already installed. As OpenShift 4 adoption has increased, so has the number of clusters globally. Each one of those clusters needs to download Operator content to run the embedded OperatorHub, using the App Registry inside Quay.io as a backend.

The Outage

Fast forward to this summer and Quay.io is processing over one billion image requests per month, a rate of over 1.5 million per hour. It’s a large scale data distribution and retention service depended upon by enterprises around the globe. It’s also hosted on Red Hat OpenShift, an open hybrid cloud platform for container-based IT teams around the world.

After the update from OpenShift 4.3.18 to OpenShift 4.3.19, Quay.io’s database froze and the service was intermittently disrupted. During these periods, users experienced a range of outcomes, including slow container image access times and an inability to retrieve container images. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery as their very own Inspector Lynley.

But the culprit has already been mentioned: the app registry. Turns out it had become the way internal teams at Red Hat were building Kubernetes Operators. The code behind app registry had never been pushed to work at this scale, and thus, the entire system suffered because of it.

We’re not here to discuss the end results: they’re almost boring compared to the giant bug hunt which ensued, and which shows just how CSI-style procedural such a search can get when Red Hat is involved.

Instead, we’re here to discuss that bug hunt: the twists and turns, the insane breadth of possibilities, and the methods used to track it down. In the weeks after the crash, Red Hat employees working on Quay, OpenShift, the Linux kernel and all manner of other systems attempted to eliminate possibilities and identify the exact culprit.

William Dettelback is an engineering manager on the Quay engineering team. When it came to the Quay.io outage, the first thing he saw was the Red Hat SRE team, run by Jay Ferrandini and Jonathan Beakley, isolate the changes that had taken place between the service functioning properly and its newly degraded state.

Dettelback says it’s important to have this type of monitoring and performance measurement in place to start; otherwise, when things go sideways, you cannot actually tell. Without a baseline of system behavior, pinpointing when exactly the problem started is nigh impossible.
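To make the point about baselines concrete, here is a minimal sketch of how a recorded baseline lets you pinpoint when behavior changed. It is not Red Hat's monitoring code: the function name, the three-sigma threshold, and the "sustained for three samples" rule are all assumptions chosen for illustration.

```python
import statistics


def first_sustained_deviation(samples, baseline_size, threshold_sigmas=3.0, sustain=3):
    """Return the index where samples first stay above the baseline limit.

    The first `baseline_size` samples are treated as known-good behavior.
    A deviation counts only if `sustain` consecutive samples exceed
    mean + threshold_sigmas * stdev of that baseline, which filters out
    one-off spikes. Returns None if no sustained deviation is found.
    """
    baseline = samples[:baseline_size]
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    limit = mean + threshold_sigmas * stdev

    run = 0  # length of the current streak of out-of-range samples
    for i in range(baseline_size, len(samples)):
        if samples[i] > limit:
            run += 1
            if run == sustain:
                return i - sustain + 1  # index where the streak began
        else:
            run = 0
    return None


# Hypothetical request latencies in milliseconds, one per minute.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 260, 270, 280]
print(first_sustained_deviation(latencies_ms, baseline_size=5))
```

The isolated spike at index 6 is ignored because the next sample returns to normal; only the run that starts at index 8 is reported. Without the first five known-good samples there would be no limit to compare against at all.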

A Mile Wide, an Inch Deep

Fortunately, the number of changes across the systems involved was minimal. Unfortunately, they went deep. The OpenShift 4.3.18 to OpenShift 4.3.19 upgrade included not only OpenShift updates, but also some updates to the fundamental Linux systems and kernel used to power containers.

That’s because the OpenShift platform is not just some PaaS, or some framework, or even simply some implementation of Kubernetes. Instead, it is a harmonizing of thousands of open source projects, from the very bottom at the Linux kernel all the way up to the support for serverless applications running on top with Knative. Red Hat engineers have first-hand expertise across the entire open source stack.

In OpenShift 4, the Linux operating system is delivered as a feature of the platform through Red Hat Enterprise Linux CoreOS. Each instance of this OS is provisioned and updated by Kubernetes itself, using the Kubernetes declarative API machine controllers as part of the OpenShift installer. The entire stack embraces the concepts of fully immutable infrastructure.

Red Hat engineers were able to quickly narrow down what had changed in the kernel to just a few networking packages. It turned out that those amounted to only a few commits’ worth of changes, and Dettelback said the team was able to learn this fact in a day, rather than spending their time researching the vagaries of the Linux kernel.

Stephen Cuppett, director of engineering for Red Hat OpenShift, said that the Quay team, the OpenShift teams, and the Linux teams all tried to root out possible causes, narrowing the problem space as quickly as possible. But that wasn’t as easy as it sounded, as the problem only manifested at tremendous scale, making replication difficult in the lab.

Compounding matters, the Telemeter service, a remote debugging data stream, had been experiencing network-based outages after the 4.3.18 to 4.3.19 update, so both the Quay.io and Telemeter teams were initially convinced that they were tracking down the same bug.

“As a macro-level service failure,” said Dettelback, “there were a lot of avenues to chase down. We had application things to chase down, infrastructure things to chase down, we had OpenShift things, then the RHEL side of things for this. We knew we had a small number of deltas we were dealing with. After quite a bit of investigation, we figured out that the telemeter issues were networking related, but [it was] not the same issue Quay saw.”

This is when proper logging of performance metrics became important. When the outage occurred, the clusters’ performance metrics were captured and saved using a synthetic benchmark test against a smaller version of Quay in the staging environment. Since the bug was nearly impossible to reproduce in the lab, this data would be a lifeline to figuring out the cause. The team couldn’t simply spin the updated version of Quay.io back up and wait for it to fail again, as that would interrupt services for users who had built critical systems based on Quay.

Thus, the data from the initial issue conditions was critical to troubleshooting. Said Dettelback, “We found Quay on OpenShift 4.3.18 versus 4.3.19 behaved very differently at that breaking point. That was the clue. We knew 4.3.19 wasn’t the smoking gun, but it was the thing we were concerned about. It didn’t explain why we went down, but we knew when we had to do the upgrade, [that] we had to be careful.”

The Usual Suspects

At first, the backup system was suspected as the cause, as the database had been running backup calls prior to the outage. That turned out not to be the case, however, which closed off entire avenues of possibility and narrowed the list of suspects.

Unfortunately, the initial list of suspects was as long as one in an Agatha Christie novel. Cuppett said, “We have different teams at all levels of the stacks, so none of my folks had to investigate all of them. It could have been a very protracted path. It’s complicated, this crosses skills boundaries. From Web services on Quay in Python, to Kubernetes in Go, to the Linux Kernel in C, and then there’s networking… These are all different teams that have multiple engineers at Red Hat.”

That means, said Cuppett, “We had gone wide across the different layers with multiple teams. That way, when one team found conclusive evidence, other teams could quickly abandon other costly and deep paths of investigation. And there were plenty of wrong roads to choose from, so narrowing it down across the many teams helped to prevent any one team from wasting their time, or blocking the other investigations.”

In the end, the problem stemmed from the increased demand on the App Registry in Quay, a feature that had never been tested at that scale and that was seeing unexpectedly heavy usage from development teams over time. That underlying App Registry code has since been optimized and the teams using those features are also being accommodated in other ways, reducing demand.

Said Dettelback, “The correct solution was multiple factors: it wasn’t one thing that took down Quay.io, it was a lot of traffic on a fairly vulnerable portion of Quay’s codebase that wasn’t designed to take the load it was taking. At a technical level, DNS resolution was slower on 4.3.19, but the way we determined that was that the team was able to build a reproducer in Python.”
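The article does not show the team's Python reproducer, so the following is only a hedged sketch of what timing DNS resolution in Python can look like. The function names and the injectable `resolver` parameter are assumptions made for illustration; the only real API used is the standard library's `socket.getaddrinfo`.

```python
import math
import socket
import statistics
import time


def time_dns_lookups(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Return a list of per-lookup latencies, in seconds, for `hostname`.

    `resolver` defaults to socket.getaddrinfo and can be swapped for a
    stub, which keeps the function testable without network access.
    """
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, None)
        except socket.gaierror:
            # A failed lookup still took time, so it is recorded too.
            pass
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies as median, nearest-rank 95th percentile, and max."""
    ordered = sorted(latencies)
    p95_rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_rank - 1],
        "max": ordered[-1],
    }


# Example use on two clusters being compared (hostname is a placeholder):
#   print(summarize(time_dns_lookups("registry.example.internal")))
```

Running the same script from a pod on each OpenShift version and comparing the summaries is one way a small, repeatable test could expose a difference that is hard to see in aggregate production metrics. Note that operating system or resolver caching can mask slow lookups, so results from a sketch like this should be read with that in mind.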

The Abyss, Avoided

This could have been an endless dive into every avenue of possible issues. While the idea of possibly coming up against a bug that’s in the Linux kernel might sound like being knighted as a new open source warrior, is that really what your developers should be spending their time on if they’ve never touched the kernel before? This is the expertise of Red Hat engineers, and their work on issues like this is one of the advantages of Red Hat support. And if your teams really do encounter a kernel bug and want to take on the challenge of fixing it, we’ll help them do just that. We love bringing new contributors to open source!

“The kernel was one of those things where it looked very likely. ‘Oh, there was a kernel change! That could have had an upward effect on the stack.’ But we ruled that out very quickly,” said Dettelback.

So what’s the long-term solution? “I’d say the long-term solution is not the removal of app registry (that’s a tactical fix), but continuing to strengthen our cross-team collaboration across SRE, OCP and RHEL so we can fix these sorts of things faster. Because we have experts across the value chain and we work in an open manner, it’s easy to get the right people looking at the problem when you suspect it may be in their backyard. If we were a closed source shop or a less open organization, it would have been nearly impossible to get the collaboration and insight into chasing down what was going on when quay.io went down,” said Dettelback.

Feature image via Pixabay.

Source: InApps.net
