Last week an anonymous young programmer logged into a Reddit forum for computer science career questions, and condensed his first week on the job into an evocative story for our times: “Accidentally destroyed production database on first day of a job, and was told to leave, on top of this i was told by the CTO that they need to get legal involved, how screwed am i?”
According to the post, the proud young programmer had received his college degree, then packed up for a cross-country move to a shiny new job as a junior software developer. At a company with over 100 employees — and over 40 developers — it was his first non-internship position, and you can imagine that hopes were high for that first day on the job.
“Unfortunately i screwed up badly,” he wrote.
With nowhere else to go, the developer had turned to Reddit’s “CS Career Questions” forum. The poor developer described himself as “confused” and “terrified.” He wondered if he could hold on to the job. He then did what probably any stressed-out young developer would do: he went out for a drink.
Waking up, presumably a little hungover, he logged in to Reddit to discover that his post was on its way to over 23,000 upvotes, 4,500 comments, and countless private messages. His story had quickly become the forum’s top vote-getter of all time, earning seven times as many upvotes as its next-most-popular thread.
It was a stunning display of support, and the universal response seemed to be a collective expression of sympathy from the entire industry.
There are two types of ops people: those who have fucked up production, and those who are about to. @petecheslock #monitorama
— (╯°□°）╯︵ ┻━┻ sdoɹǝǝq (@beerops) May 23, 2017
Using the handle YorickPeterse — and the title of Database (removal) Specialist — Yorick later shared some additional perspective. “For a lot of companies, something doesn’t matter until it becomes a problem, which is unfortunate… I personally think the startup culture reinforces this: it’s more important to build an MVP, sell sell sell, etc than it is to build something sustainable.”
Yorick’s comment prompted an inspiring example of how the geek community responds to honest admissions of mistakes. “That was you!?!!! Kudos to you guys for being so open about it. It was a blast following along with your recovery efforts on youtube. Keep up the good fight!”
The thread even drew a response from someone involved in the last high-profile data deletion SNAFU:
“Hi, guy here who accidentally nuked GitLab.com’s database earlier this year…” read another response. “This is not your fault…there are so many red flags from the company side of things…” One of the red flags was the fact that the company’s backups weren’t working, indicating they weren’t tested, which he calls the “same problem we ran into with GitLab, at least that’s working now.” (GitLab did have a backup, but it was six hours behind.)
One Redditor with the handle CoffeeSippingBastard, who identified himself as a systems engineer, reminded the junior developer of the time an Amazon maintenance process deleted load-balancing data from a production server: “Last I remember — guy is still there.”
This prompted an interesting analysis of Amazon’s response. “They wrote a COE (correction of errors report) that detailed why this happened (using 5 whys to get to the true ‘bottom’ of each cause), wrote up specific immediate actions, and included lessons learned (like never make direct changes in prod anywhere without a second set of eyes approving your change through the CM process)… That report is sent out in draft form to nearly the entire company for review and comment. And they do comment. A lot. Questioning things is a cultural habit they have. For all that’s wrong with Amazon, the best part was when someone fucked up, the team and the company focused only on how we make it never happen again.
“A human mistake was a collective failure, not an individual one.”
Another comment added, “The logic at my firm is, unless you are a colossal repeat fuck up…why fire the guy who knows the most about what broke? Firing the dude doesn’t un-break your process. He gets to create a process document so it doesn’t happen again now.”
The same sympathetic sentiment was expressed again and again. “You did the right thing,” wrote another Reddit user. “You made a mistake. You admitted what it was and that it was you.”
A chief technology officer commented: “This just screams incompetency on their part. This isn’t your fault…but you should in some way feel lucky you aren’t there anymore. Sounds like you would have learned some really bad habits there.”
When the tech news site The Register ran a poll asking who should be fired, a whopping 99 percent of respondents exonerated the junior developer, with 43 percent voting that no one should be fired: “It’s a wake-up call for everyone.”
Another 10 percent wanted to fire whoever did the maintenance on the backups, and 45 percent voted to fire the CTO.
The top response on Hacker News also had no sympathy for the company. “Sorry, but if a junior dev can blow away your prod database system by running a script on his _local_ dev environment while following your documentation, you have no one to blame but yourself. Why is your prod database even reachable from his local env? What does the rest of your security look like? Swiss cheese I bet.”
Or, as someone else put it, “What was their tech stack, Jenga?”
Feature image: Jenga promotional image on Amazon.