Ludicity

I Accidentally Saved Half A Million Dollars

I saved my company half a million dollars in about five minutes. This is more money than I've made for my employers over the course of my entire career because this industry is a sham. I clicked about five buttons.

Let's talk about what happened and why it's a disgrace that it was even possible.

I. Background

Let's start with some background, because it is fucking wild that an inefficiency that took me five minutes to solve in a GUI configuration panel was allowed to persist. We cancelled someone's contract the week before I did this. Someone lost their job because no one could get their act together long enough to click the button I told them to click.

A few years ago, this company decided that it wanted to create an analytics platform, following the decision to become more "data driven". They hired some incredibly talented people to make this happen, and then like five times as many idiots.

At the time this was happening, I had just graduated and joined the organization as a data scientist. We, of course, did not do any data science, because the organization did not require any data science to be done - what they actually needed to do was fire most of the staff in every team, leaving behind the two people who actually had good domain knowledge, then allow them to collaborate with good engineering teams to build sensible processes and systems. Instead, they hired a bunch of Big Firm Consultants. You can see where this is going already.

Nonetheless, at the time I was young and took the organization at its word. Executives would tell us constantly how excited they were for us to roll out new AI initiatives (then tell us there was no time, so could we please get that report to them in a spreadsheet), and I'd ask for some sort of compute to perform some machine learning, or even set up data pipelines.

It never worked. Instead, we were told that we just had to wait for the Advanced Analytics Platform (AAP) to be deployed. You see, it's December, and it's launching in January.

Then in January I was told to be patient, it was coming in March.

In June, I was told it had been put on hold due to Covid - this was a very convenient excuse because they had absolutely fucked the whole project up already, but it bought some valuable time. By the next December, I had left the organization and the AAP was still nowhere to be seen.

We skip ahead three years. The AAP is finally ready to launch. It turns out none of the features I needed were ever planned, so I guess they were just lying to me before I left.

Four engineers leave the company in the same week, and I speak with the directors because I know they need a real engineer and they can't find one. I'm a substantially less experienced engineer than many of the readers here, but suffice it to say that I can read documentation without panicking, which is considered S-tier in this country. My conditions - a big pile of money and they had to put me on the AAP team because they're the only team that gets actual toys to play with.

II. It Fucking Sucks

It's an insane dumpster fire spiderweb of technical debt and it's only like one week old. Here are some fun details.

I get a friend of mine hired (big fan of nepotism), and he finds, on day one, a file in the project's repository that deletes prod using our CI/CD pipelines if it is ever moved into the wrong folder. It comes complete with the key and password required for an admin account. It was produced by the former lead engineer, who has moved on to a new role before his sins catch up with him.

The entire thing is stitched together by spreadsheets that are parsed by Python, dropped into S3, parsed by Lambdas into more S3, the S3 files are picked up by MongoDB, then MongoDB records are passed by another Lambda into S3, the S3 files are pulled into Snowflake via Snowpipe, the new Snowflake data is pivoted by a JavaScript stored procedure into a relational format... and that's how you edit someone's database access. That whole process is to upload like a 2KB CSV to a database that has people's database roles in it.

This is considered more auditable.

Everything is transformed into a CSV because the security team demanded something that could undergo easy scanning for malicious content, then they never deployed the scanning tool, so we have all the downsides of the CSVs and none of the upsides.

Every Lambda function, the backbone of all the ETL pipelines, starts with counter = 1 because one of the early iterations used to use a counter and people have just been copying that line over and over. Senior data engineers have been copying that line over and over.

The test suites in the CI/CD pipelines have been failing for months, because someone during debugging chose to use the Linux tee command to log any errors to both stdout and a file at the same time, but tee successfully executing was overwriting the error code from the failing tests.
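
For anyone who hasn't been bitten by this: a shell pipeline reports the exit status of its last command, so a failing test runner piped into tee looks like a success. Here is a minimal sketch of the effect, assuming bash is on the PATH. The `false` command is a stand-in for the failing test suite and `run_pipeline` is a hypothetical helper; neither comes from the actual pipeline.

```python
import subprocess
import tempfile
from pathlib import Path


def run_pipeline(script: str, log_path: Path) -> int:
    """Run a bash snippet and return the exit status a CI runner would see.

    The log path is passed as "$1" so it never needs quoting inside the script.
    """
    result = subprocess.run(
        ["bash", "-c", script, "_", str(log_path)],
        capture_output=True,
    )
    return result.returncode


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "tests.log"

    # `false` always exits 1, like a failing test run. tee exits 0,
    # and the pipeline reports tee's status, so the failure is hidden.
    masked = run_pipeline('false | tee "$1"', log)

    # With pipefail, the pipeline reports the failing command's status instead.
    fixed = run_pipeline('set -o pipefail; false | tee "$1"', log)

print(f"without pipefail: {masked}, with pipefail: {fixed}")
```

The first status is 0 and the second is non-zero, which is the difference between months of green builds and a pipeline that actually fails when the tests do.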

To get access to the password for any API we need to hit, you search for something like service-password in an AWS service, which returns the value... service-password (as in, literally all the values are the same as the keys), then you use that to look up the actual password in a completely different service. No one knows why we do this.

The script that generates configuration files for our pipelines starts with 600 lines of comments, because senior engineers have been commenting the lines out in case they're needed later. The lines are just setting the same variables to different values, and they're all on GitHub anyway.

This is at an organization that some percentage of readers will recognize on sheer brand strength if they're in my country.

I'm not even getting started, but we have to stop for now because I am going to catch fire. These details are important because now you understand the kind of operational incompetence that allows you to waste so much money on processing <1TB of data per day that it dwarfs your team's salary.

III. The Budget

The next thing to realize is that this platform never really had a chance of making any money for the organization. They do a little accounting trick (read: lying) which I'll talk about in another post that makes it seem like they've had huge wins, but really this is just many times more expensive than our previous operational model.

The deal is that we pretend the whole team is doing something or other, and we stay within budget because the organization can't afford to spend infinite money on this social fiction. However, the budget for our database costs was being drastically overrun. I'm not sure what the original estimate was, but I think it was something like 200K for a year of operations, and we were now close to a million dollars.

Some quick facts:

  1. We use Snowflake as our database, which charges you based on the size of the computer you use to run your queries.
  2. You only pay for computers while they're on.
  3. We probably run a few thousand queries per week, mostly developers experimenting with little tweaks for PowerBI reports that no one reads, and on average they take about 2 seconds to run.
  4. The computers are set to idle for 10 minutes after every query.

I noticed this about a month into joining the team, and suggested we uh... don't have the computers run for like two orders of magnitude longer than they need to for every query. I literally can't remember what was said, there was some Agile bullshit about doing a discovery piece, then it just never happened.
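
To put a rough number on "two orders of magnitude", here is a back-of-the-envelope sketch. It assumes the worst case, where each query arrives on its own and the warehouse then idles out the full timeout before suspending. The 2-second query and 10-minute timeout come from the list above. The 60-second alternative timeout is purely hypothetical, and the 60-second minimum charge per resume reflects my understanding of Snowflake's per-second billing rather than anything in our contract.

```python
def billed_seconds(
    query_s: float,
    idle_timeout_s: float,
    min_billed_s: float = 60.0,  # assumed minimum charge each time a warehouse resumes
) -> float:
    """Seconds billed for one isolated query: run time plus the idle wait,
    never less than the minimum charge."""
    return max(min_billed_s, query_s + idle_timeout_s)


QUERY_S = 2.0            # average query duration, from the facts above
CURRENT_TIMEOUT_S = 600  # warehouses idle for 10 minutes after every query
SHORTER_TIMEOUT_S = 60   # hypothetical alternative setting, not the value actually chosen

current = billed_seconds(QUERY_S, CURRENT_TIMEOUT_S)
shorter = billed_seconds(QUERY_S, SHORTER_TIMEOUT_S)

print(f"billed per isolated query now: {current:.0f}s")
print(f"billed with a shorter timeout: {shorter:.0f}s")
print(f"billed time vs. useful work:   {current / QUERY_S:.0f}x")
```

Real usage is burstier than this, since queries that land while the warehouse is still awake share the idle window, so treat the ratio as an upper bound per query rather than a forecast of the bill.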

IV. Just Doing The Fucking Thing

Anyway, months later, they finally give me a card that says "Discovery: Optimise Costs". Now I have to optimize costs so that I have something to say at the next standup, and fortunately I know just the thing! I'll test my hypothesis that this is all a sick joke, and I'm going to push the button that I secretly think should obviously have been pushed.

We've got a new guy on another team who seems excellent, so I ask management if I can give him admin credentials since we need competent people. They say no. I flick him some lower-level database credentials, which I technically wasn't told not to do since they aren't admin credentials, and he sanity-checks that the change would save money. At 4PM on the last day of the week, I ping a chat full of good engineers and no managers to make sure I'm not about to nuke everything, then just do it.

V. Chaos Reigns

I return to work the following Monday. I suspected that this would save a bunch of money, and guess what, our projected bill dropped from a million to half a million dollars, and everyone is losing their fucking minds.

My team has spun this as a huge cost saving, when really we just applied a fire extinguisher to the pile of money that we had set alight.

Other teams are attacking my team, insisting that it can't be a coincidence that the one new guy joined exactly as we did this, and how was it possible we didn't know how to generate that kind of saving without his help? They are saying this because it makes them seem higher status and their teams only produce money in the land where you lie all day, but it is a fair question.

While my managers are very happy, they quietly suggest it may be unwise to roll out the changes to all the computers (I only did a few to be safe) because it would oversaturate the department to hear about us all day. And invite unwelcome questions. The subtext is that if we do this all slowly enough, it might seem like it took a lot of effort instead of just clicking buttons that I said had to be clicked almost a year ago.

I am asked to write some PowerPoints, which include phrases like "a careful statistical analysis of user usage patterns indicated an opportunity to more effectively allocate resources", implying that nothing was wrong, we just needed to collect more data before deciding not to let the expensive machines idle all day.

Every day, I dread someone asking me to explain what the change was, because I will have to fucking yeet some managers I like under a bus, but they can't resist talking about the change non-stop because it is the closest some of them will ever get to impacting the bottom line. And many of them are actually decent managers, it's just that this whole department, like many departments, is some sort of weird political PsyOp to get executives promoted. It's cosplaying as a real business and the board thinks the costume is convincing.

VI. The Aftermath and Takeaways

By identifying a handful of good engineers and going totally rogue, we outperformed the entire department pretty effortlessly. The competent people are there, just made totally impotent by the organization, and I'm still convinced that this place is probably better than the median organization.

I ask management for a 30K raise after saving 500K and my message is still unread. I suspect I will eventually receive either nothing or 5K.

I have even more meetings now because everyone wants to talk about how we saved the money. I had to make a PowerPoint. Kill me.

I would have been better off not doing anything. Let that be a lesson to you. Do you hear me? I applied myself for five minutes against my own better judgement, had the greatest success of my career, and have immediately been punished for it. Learn from my mistakes, I beg of you.