How I Broke Production (And Got Promoted)

A couple of weeks back, I celebrated 4 years at Sambla Group, which makes this a good time to reflect on some of the experiences that shaped my path.

It was 5th April 2024, after 4 pm... The after-work has started, and everybody is walking around the office with beer cans and wine glasses, trying to finish the week in a relaxed and rewarding way.

At the same time, I'm trying to figure out why a certain Lambda function creates multiple redirects. It is a task submitted by the SEO team: something we need to fix, but it could wait until Monday.

Finally, I get a working version, but I realize we are still referencing a different version in the CloudFormation template.

Easy, I'll create a change set, update the parameter, and I'm done - redirects are working fine, a tough week is conquered, and I can indulge in a beer with my teammates.

BUT, while updating the parameter, I made a change that was not supposed to be released on Friday at 5 pm - so as a responsible developer, I decided to delete my change set.

OOPS! I deleted the CloudFormation stack.

The site is down. There is only a white screen where our main brand's site used to be: our biggest and most important site, gone at peak time, at the highest traffic.

It is not a bug to fix. I can't roll back. No backup for these kinds of things. The whole stack and related AWS resources are gone!

Slack is burning! Senior colleagues are approaching, and the vibe is like in the jungle: they are trying not to spook me while still making clear how urgently this needs to be fixed.

And If You Thought That Was a Plot Twist, You Were Wrong

At the time, we are running 8+ sites in 3+ markets, all on the same tech stack. Each site has 3 CloudFormation stacks: one for the CMS (WordPress on ECS, based on an ECR image), one for the CMS pipeline, and one for the frontend serving a Gatsby application.

Each of those 8+ sites has its own CloudFormation template for the frontend. There had been work to unify them into a single public-web template, but it was not completed - only the recently migrated sites used the new one.

Our main sites in the Swedish market? Running on 2 different, very, very old and specific templates that were risky even to update, not to mention replace.

So I'm doing the obvious. I'm using the latest CloudFormation template to spin up the deleted stack. And if you haven't worked with CloudFormation templates, it is like driving a truck uphill, with a cliff on one side and a mountain on the other. It is slow, it can stop unexpectedly, and it has to fit the road perfectly; otherwise, you are going down. One letter in the wrong place can put you back to square one.

With the support of one of our teammates and my manager, who was containing the Slack fire, and after tweaking several parameters, some IAM roles and permissions on the resources, the site was up and running after a fairly short time.

Wearing the I BROKE PRODUCTION hat after the incident

I earned my "I BROKE PRODUCTION" hat, and the beer I planned to have before everything went south.

What Did I Learn?

That AWS needs a better UI (joke's on me - now they have one!). There is now an alert when you press the delete button on a stack, asking whether you really want to delete the resources. Don't worry, I checked that functionality on staging.
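The console prompt is one guard. For anyone who prefers a guard that does not depend on reading a dialog at 5 pm on a Friday, here is a minimal Python sketch; it is not what we ran at the time. It assumes a boto3 CloudFormation client is passed in, and the stack and change set names in the usage note are hypothetical. The first helper turns on termination protection, so that a delete-stack request is rejected until protection is switched off again. The second deletes a single change set and nothing else.

```python
from typing import Any, Protocol


class CloudFormationClient(Protocol):
    """The subset of the boto3 CloudFormation client used in this sketch."""

    def update_termination_protection(
        self, *, EnableTerminationProtection: bool, StackName: str
    ) -> Any: ...

    def delete_change_set(self, *, ChangeSetName: str, StackName: str) -> Any: ...


def protect_stack(cfn: CloudFormationClient, stack_name: str) -> None:
    """Enable termination protection so the stack cannot be deleted by accident."""
    cfn.update_termination_protection(
        EnableTerminationProtection=True,
        StackName=stack_name,
    )


def discard_change_set(
    cfn: CloudFormationClient, stack_name: str, change_set_name: str
) -> None:
    """Delete one change set only; this helper has no code path that deletes a stack."""
    cfn.delete_change_set(ChangeSetName=change_set_name, StackName=stack_name)
```

With boto3 installed and credentials configured, usage would look like `cfn = boto3.client("cloudformation")`, then `protect_stack(cfn, "frontend-prod")` once per production stack, and `discard_change_set(cfn, "frontend-prod", "fix-redirects")` when a change set has to go. Passing the client in as a parameter also keeps the helpers easy to try against a fake client before pointing them at a real account.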

As the site came back up, I realized that production scars create edge - your thinking process sharpens. If you are overtaking, DO NOT BRAKE - press the pedal and finish it. When something important is broken, you need to repair it immediately, and it is not done until it works again - there is no time and space to leave it for some other time.

It is so rewarding to have a supportive team, rather than a team that blames you. Production is messy, never perfect - school rules don't apply there. Breaking down a problem into smaller pieces is a technique that is superior to any other problem-solving skill you can name. In other words, breaking down a problem under pressure is the only methodology that can help you get over the finish line...

What I Learned About AWS

DO NOT DELETE A PRODUCTION CLOUDFORMATION STACK
Keep your CloudFormation templates up-to-date
Keep your stacks simple
Review and reduce resources your application needs (remove old Lambdas, make sure you don't have old Roles piling up, clean up certificates and secrets, etc.) - ensure your stack is neat
Always have failover systems
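One way to make the first few points checkable is to scan a template for resources that would vanish together with the stack. The sketch below is only an illustration under stated assumptions: it takes a template that has already been parsed into a Python dict, and the set of resource types it treats as worth keeping is my own pick, not a list from our templates. It relies on the CloudFormation DeletionPolicy attribute, where Retain or Snapshot means the resource (or a snapshot of it) survives a stack delete.

```python
from typing import Any

# Resource types treated as "must survive a stack delete" in this sketch.
# This selection is an assumption; adjust it to what your stack really owns.
STATEFUL_TYPES = frozenset({
    "AWS::S3::Bucket",
    "AWS::RDS::DBInstance",
    "AWS::EFS::FileSystem",
    "AWS::ECR::Repository",
})

# DeletionPolicy values that keep the resource or its data after a stack delete.
SAFE_POLICIES = frozenset({"Retain", "Snapshot"})


def unprotected_resources(
    template: dict[str, Any],
    stateful_types: frozenset[str] = STATEFUL_TYPES,
) -> list[str]:
    """Return logical IDs of stateful resources without a safe DeletionPolicy."""
    missing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in stateful_types:
            continue  # stateless resources can simply be recreated
        if resource.get("DeletionPolicy") not in SAFE_POLICIES:
            missing.append(logical_id)
    return sorted(missing)
```

Run against a template with a bucket that has no DeletionPolicy and an ECR repository marked Retain, the function returns only the bucket's logical ID. A check like this fits naturally into a pipeline step, so that a template cannot be deployed while a stateful resource is still one wrong click away from being gone.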

Incidents Shape Developers

Although I caused the incident, it was resolved quickly, and with improvements. Of course, at the time, I was feeling terrible - because I deleted the stack, because we had downtime, because I created stress for my colleagues, and because that incident hurt the organization I'm a part of.

However, this was not my first production incident in my professional life, and it was not the last. But this one was different - my perspective on incidents shifted permanently.

Incidents, when managed properly, can increase a developer's maturity. Under pressure, they craft the skills that enable sight beyond the wall of issues. They also strengthen organizational capabilities - each post-mortem encapsulates knowledge and unlocks understanding that was not available before.

Realizing all of that, and utilizing the craftsmanship forged in hardship, I earned my current title. It is one I will proudly carry until the next problem to solve and the next opportunity to rise above a critical situation.

The Recipe for a Happy Developer's Life

That being said, my recipe for a happy developer's life is: first find a good team, solve problems & build, fail and learn, try again, grab a beer, wear the "I broke production" hat (rarely, but at least sometimes), and always make sure to show understanding for people and systems, and to share knowledge and support - the more you share, the more you will gain in return!

Have your own production horror stories? I'd love to hear them. Sometimes the best lessons come from the worst moments. Get in touch.

Nikola Lalovic
