6 Scary Outage Stories from CTOs – InApps 2022

March 30, 2022 by Phu Nguyen


Charity Majors, CTO of Honeycomb


“Push notifications are down!”

“No, they aren’t.”

“No really, people are complaining — push is down.”

“Push can’t possibly be down. Our pushes are in a queue, and I am receiving pushes.”

“It’s been five days, and push is STILL down. People are filing all kinds of tasks.”

… so I reluctantly started poking around. All our push metrics looked relatively normal, every test push I sent was promptly delivered. Yet the support team was right — people had been steadily complaining for five full days about pushes not succeeding. What on earth could it be?

These were Android push notifications, and Android devices needed to hold a socket open to the server to subscribe to push notifications. We had tens of millions of Android devices, so we ran the push notification service in an autoscaling group. To load-balance connections across the group, we used round-robin DNS, and to increase capacity we would simply increase the size of the ASG [auto-scaling group]. Eventually, we figured out that the complaints had begun right around the last time we increased the size of the ASG, so that was a good clue. Another clue was that all the people complaining seemed to be in Eastern Europe. We asked a few of them to run a verbose trace, and that’s when we learned that the DNS record was coming back as … missing?

It turned out that when we increased the size of the ASG, the round-robin DNS record exceeded the maximum size of a DNS response over UDP. Normally this is no big deal; the protocol says the client should fall back to TCP in that case. And it did, for almost everyone, except for users behind one major router in Romania. We delegated DNS for that record from Route 53 to a small local Python DNS server that let us return a random subset of four Android push notification servers, and everything was fine again.
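To make the size problem concrete, here is a rough Python sketch that estimates how large a round-robin A-record response gets and picks a random subset of servers, as the story describes. The hostname, the IP addresses, and the function names are illustrative assumptions, not the actual server from the story. The 512-byte figure is the classic DNS-over-UDP limit without EDNS0, and the estimate assumes each answer uses a compressed name pointer.

```python
import random

DNS_HEADER_BYTES = 12     # fixed-size DNS message header
CLASSIC_UDP_LIMIT = 512   # classic DNS-over-UDP message limit (no EDNS0)
# One A record with a compressed name:
# name pointer (2) + type (2) + class (2) + TTL (4) + rdlength (2) + IPv4 (4)
A_RECORD_BYTES = 16


def question_bytes(hostname: str) -> int:
    """Size of the question section for one hostname."""
    labels = hostname.rstrip(".").split(".")
    # Each label costs its length plus one length byte, then a zero terminator,
    # then 2 bytes of QTYPE and 2 bytes of QCLASS.
    return sum(len(label) + 1 for label in labels) + 1 + 4


def response_bytes(hostname: str, record_count: int) -> int:
    """Approximate size of a response carrying record_count A records."""
    return DNS_HEADER_BYTES + question_bytes(hostname) + record_count * A_RECORD_BYTES


def max_a_records(hostname: str, limit: int = CLASSIC_UDP_LIMIT) -> int:
    """How many A records fit before the response exceeds the limit."""
    return (limit - DNS_HEADER_BYTES - question_bytes(hostname)) // A_RECORD_BYTES


def pick_subset(servers, k: int = 4):
    """Return a random subset of at most k servers, keeping responses small."""
    return random.sample(servers, min(k, len(servers)))


if __name__ == "__main__":
    host = "push.example.com"                        # hypothetical hostname
    fleet = [f"10.0.0.{i}" for i in range(1, 61)]    # hypothetical fleet of 60 hosts
    print(max_a_records(host), "records fit in a classic UDP response")
    print(response_bytes(host, len(fleet)), "bytes if all hosts are returned")
    print(pick_subset(fleet))
```

Returning a small random subset keeps every response well under the limit while still spreading connections across the fleet over many lookups.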

Matthew Fornaciari, CTO of Gremlin


The outage occurred on a Friday afternoon, just as we were about to head out to Halloween Happy Hour. The page came in that we were serving exclusively 500s — a bad, bad experience for customers. After some digging, we realized that our hosts had filled up their disks, and we started failing because we couldn’t write logs (also scary because we were flying blind).

We ended up refreshing the hosts, implementing log rotation to prevent that from happening in the future, and creating an alarm to warn us if we were ever getting close again. But the most interesting thing we did was to have one of our engineers write a new Gremlin for our platform: the disk Gremlin, so that we could proactively exercise the fixes and make sure we never failed that way again. Then we automated that test, and it still lingers, running randomly in our production environment to this very day.
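The "alarm before the disk fills" idea can be sketched in a few lines of Python. This is only an illustration under assumed thresholds (80% to warn, 95% to page); the function names are hypothetical, and a real setup would normally feed the number into a monitoring system rather than print it.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alarm(fraction: float, warn_at: float = 0.80, page_at: float = 0.95) -> str:
    """Map a usage fraction to an alarm level (thresholds are assumptions)."""
    if fraction >= page_at:
        return "page"
    if fraction >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    fraction = used_fraction("/")
    print(f"{fraction:.0%} used -> {disk_alarm(fraction)}")
```

Warning well before the disk is full matters here because, as the story notes, a full disk also stops the logs you would need to diagnose the failure.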

Liran Haimovitch, CTO of Rookout


Remember that urban legend about a server going down every day, at the same specific hour? And after weeks of investigations, someone looked at the security camera footage… and found out that the maid was disconnecting the server to connect the vacuum cleaner! Well, we all know that the Gremlin in the closet isn’t always as scary or mysterious as we initially think 🙂

Recently, we experienced something similar.

Several times a week, we’d been seeing the backend’s latency metrics going through the roof. And each time we investigated it, we noticed one of the tables getting locked and queries kept timing out all over. We wondered: Is one of our customers redeploying their application non-stop? The main suspect was a complex query which fetches the list of all our customers’ servers’ information, so they’ll be able to choose which of them they would like to debug. We started optimizing that query and saw huge improvements, yet those latency spikes kept happening.

Then a couple of weeks ago, while attending the weekly “Customer Success Briefing,” the latency spike was happening again and it hit me like a brick. I noticed a query that we barely used, from our application’s back office, that was really slow because we never prioritized fixing it (it was scarcely used). Apparently, our customer success manager had been collecting the data for the meeting, and every time the query didn’t return fast enough, he just kept hitting refresh and retrying. That rarely used query was locking up our database and challenging our customer success manager’s sanity! Looking back at the data, we confirmed that all of the latency peaks were aligned with Customer Success briefings. Eventually, after about 20 minutes of optimizing that query, everything returned to normal.

Daniel “Spoons” Spoonhower, CTO of Lightstep


It was a clear, sunny day in San Francisco. I was working at a small internet company, when suddenly our app stopped loading for me. Not just one view, but the whole app. Hard reload, but no luck. I looked around and my teammates were also confused; the app wasn’t working for them either. Our users weren’t complaining (yet?) but we started digging in anyway. No deployments had happened yet that day, no infrastructure had changed; yet it was broken consistently across OS types and browsers. What could have changed?

We found some errors in a critical (but boring-and-hadn’t-changed-in-forever) API call, without which the app wouldn’t load. But why were the errors only happening for people that worked at the company? And why now? It turned out that for internal users, the API returned some extra data…extra data that had been slowly growing over the last few weeks, until it had finally exceeded the request’s maximum payload size that afternoon.
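A slowly growing payload like this one can be caught early by comparing its serialized size against the limit and flagging it before it crosses over. The Python sketch below is illustrative only: the 1 MB limit, the 80% warning ratio, and the function names are assumptions, since the story does not say what the actual limit was or where it was enforced.

```python
import json

MAX_PAYLOAD_BYTES = 1_000_000  # hypothetical limit; the real one is not given


def payload_bytes(payload) -> int:
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def payload_status(payload, limit: int = MAX_PAYLOAD_BYTES, warn_ratio: float = 0.8) -> str:
    """Classify a payload as ok, near_limit, or over_limit."""
    size = payload_bytes(payload)
    if size > limit:
        return "over_limit"
    if size >= warn_ratio * limit:
        return "near_limit"
    return "ok"


if __name__ == "__main__":
    # Hypothetical internal-only extra data that grows over time.
    extra = {"internal_flags": ["flag-%d" % i for i in range(1000)]}
    print(payload_bytes(extra), payload_status(extra))
```

Emitting the "near_limit" state as a metric would have turned a sudden outage into a slow, visible trend.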

Lee Liu, CTO of LogDNA

The AddTrust Root Certificate Authority (CA) we relied on expired at roughly 4 a.m. Pacific Time on Saturday morning, May 30, 2020.

At the time, we were transitioning some of our infrastructure to Let’s Encrypt, a nonprofit certificate authority, as part of our move to Kubernetes. Legacy Syslog clients required AddTrust/UserTrust/Comodo. We run our own SaaS environment in addition to a number of worldwide environments for a major cloud partner. In our SaaS environment, a single certificate chain is used everywhere, including our ingestion endpoint, Syslog endpoint, and web app. We thought we were ready for this root certificate expiry… we were not.

Quick primer on certificate chains: All certificate-based security relies on chains of trust. Browsers and operating systems ship with these trust stores of root certificates.

LogDNA Chain: AddTrust Root CA (expired May 30) -> UserTrust CA -> Sectigo -> *.logdna.com

Modern browsers allow: UserTrust CA -> Sectigo -> *.logdna.com

UserTrust CA itself is also part of root trust stores of many browsers, so even if AddTrust is expired, it’s ignored since the chain leading up to the UserTrust CA is still valid.

Or so we thought.

It turns out that older legacy systems see only the full LogDNA chain, which they consider invalid if any of its four certificates has expired. They also don’t recognize UserTrust as a trusted root certificate.
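The difference between the two behaviors can be modeled with a toy Python sketch. It is a deliberate simplification: real validators also verify signatures and build paths, and the `Cert` class, the function names, and every expiry date except AddTrust's May 30, 2020 are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Cert:
    name: str
    not_after: datetime


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


# Chain as sent by the server, leaf first and root last.
CHAIN = [
    Cert("*.logdna.com", utc(2021, 1, 1)),       # hypothetical expiry
    Cert("Sectigo", utc(2030, 1, 1)),            # hypothetical expiry
    Cert("UserTrust CA", utc(2030, 1, 1)),       # hypothetical expiry
    Cert("AddTrust Root CA", utc(2020, 5, 30)),  # date from the story; time of day simplified
]


def modern_accepts(chain, trust_store, now) -> bool:
    """Walk up from the leaf and stop at the first trusted, unexpired certificate."""
    for cert in chain:
        if cert.not_after <= now:
            return False
        if cert.name in trust_store:
            return True
    return False


def legacy_accepts(chain, trust_store, now) -> bool:
    """Require every certificate sent to be unexpired and the last one to be trusted."""
    if any(cert.not_after <= now for cert in chain):
        return False
    return chain[-1].name in trust_store


if __name__ == "__main__":
    after_expiry = utc(2020, 5, 31)
    print(modern_accepts(CHAIN, {"UserTrust CA", "AddTrust Root CA"}, after_expiry))
    print(legacy_accepts(CHAIN, {"AddTrust Root CA"}, after_expiry))
```

In this model the modern client never looks at the expired AddTrust certificate because it finds a trusted anchor earlier in the chain, while the legacy client rejects the whole chain.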

All of the support tickets we received mentioned that our v1 agent was no longer sending logs to our ingestion endpoints, but our v2 agent and other modern implementations of REST API-based clients were all working fine.

We erroneously started working on an update to our v1 agent. Ironically, our CI/CD provider also had an outage of their own due to the same AddTrust Root CA expiration, which further complicated our rollout of that agent. Once we realized that the issue was with the actual certificate chain and how older legacy systems behaved with that chain, we quickly rectified it by switching in a new certificate chain based on Let’s Encrypt.

Tina Huang, CTO of Transposit


Full-on site outages are horrible — but they don’t make your skin crawl the same way that the random, unpredictable failures really can. I was working on the mobile web version of Twitter, and we got requests that, for some random unlucky campers, caused a scary error page whenever they visited the site. For everyone else, the sky was blue and the birds were chirping. But now and again, someone else would get hit. And, once they were hit, they were stuck in a pit of despair, unable to read any tweets from their phone.

Slowly, as the number of these tarnished accounts increased, the 500s started creeping up to critical levels. We were able to see that the new library we were using failed to parse session cookies with a specific character. So every time you logged back in, you were rolling the dice on getting bit by this pesky bug, and you couldn’t be cured without the wizardly powers to reset your cookies on a phone. Eventually, we fixed the bug in the library, and everyone was able to go back to reading their tweets… which, as we know, can be a very scary thing on its own!

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Honeycomb.io.

Source: InApps.net

