A nice-to-have feature and 37 hours of downtime
Failure used to be my biggest fear. For years, I struggled to deal with it. Ever since I was a kid, I have strived to be the best. At school, I could not accept any score other than the highest. I put a lot of effort into avoiding failure. Yet I missed the beauty of failing: learning. Later in life, I realized that nothing teaches us more than our own mistakes.
Failure is never an end state; it’s a step towards success.
How you perceive failures and what you do with them ultimately define how far you can go in your journey. As a Product Owner, I have had so many fuck-ups that I cannot recall them all, but each of them taught me something special.
In this post, I will share the worst fuck-up I faced during my career and what I learned from it. Hopefully, it will help you on your journey as well.
A Nice-to-Have Feature and 37 Hours of Downtime
You didn’t read that incorrectly. A nice-to-have feature led to the longest downtime I’ve ever experienced. The story started like this: our team had reached its highest performance, and consequently, the business was booming. Our growth rate was aggressive, averaging 20% per month. Our confidence level was high; we believed no challenge was too big for us.
Our product started as a greenfield project, and due to its success, it was time to scale it up. The business was an online shop, and we wanted to get more sellers to advertise their products on our platform. We already had working interfaces with multiple sellers, but acquiring more sellers also meant more problems.
Back then, as we grew, the share of canceled orders increased from 1% to 5%. This number was unacceptable: customers were unhappy, and our reputation was at risk. I had to understand the root cause of the problem, which turned out to be outdated product inventory. Many of our sellers advertised their products on multiple platforms, so the stock we showed could already be sold elsewhere. That’s why having accurate inventory information was complex yet critical for our business.
When Confidence Becomes the Enemy
To solve our problem with canceled orders, we had to speed up our inventory process. However, that was not as simple as we imagined, due to the number of interfaces we had. Yet we found a meaningful solution: priority queues. A few sellers accounted for more than 80% of our orders. Therefore, we decided to give those high-volume sellers their own priority queues, so their inventory updates would be processed first.
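To make the idea concrete, here is a minimal sketch of how such a priority queue could look. It is not our actual implementation; the class name, the seller IDs, and the two priority levels are all assumptions made up for illustration.

```python
import heapq
from itertools import count

# Hypothetical priority levels: a lower number is processed first.
HIGH, NORMAL = 0, 1


class InventoryUpdateQueue:
    """Illustrative queue that serves high-volume sellers before everyone else."""

    def __init__(self, priority_sellers):
        self._priority_sellers = set(priority_sellers)
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order within a priority level.
        self._order = count()

    def push(self, seller_id, update):
        priority = HIGH if seller_id in self._priority_sellers else NORMAL
        heapq.heappush(self._heap, (priority, next(self._order), seller_id, update))

    def pop(self):
        _, _, seller_id, update = heapq.heappop(self._heap)
        return seller_id, update

    def __len__(self):
        return len(self._heap)


# Made-up example data: one high-volume seller and one small seller.
queue = InventoryUpdateQueue(priority_sellers={"big-seller"})
queue.push("small-seller", {"sku": "A1", "stock": 3})
queue.push("big-seller", {"sku": "B7", "stock": 0})
queue.push("big-seller", {"sku": "B8", "stock": 12})

# Drain the queue: both big-seller updates come out before the small seller's.
processed = [queue.pop() for _ in range(len(queue))]
```

The updates from the high-volume seller jump ahead of the small seller’s update even though they arrived later, which is exactly the effect we were after for the sellers behind most of our orders.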
I was the Product Owner, and my mistake was pushing a nice-to-have feature together with the priority queues. I could not imagine what was about to happen. As a team, we saw no risk in adding an image compression process to speed up the queue. It looked like low-hanging fruit. We processed more than two hundred thousand pictures a day; speeding up this process would be beneficial in the future, though we had no problem with it at that moment. Anyway, we took it into our Sprint.
Well, it turned out to be a horrible idea to squeeze this nice-to-have image compression into our process. As Angela Merkel said in her commencement speech at Harvard:
“It’s not because you can do something that you should do it.”
A Small Change Created a Nightmare
Friday was the last day of the Sprint, and we had reached our Sprint Goal. We were excited about solving the problem with the canceled orders. During the Sprint Review, the stakeholders were pleased that we had a solution ready to go live. They tried to convince us to deploy right after the Sprint Review, but we never deployed anything on Fridays. The reason is quite obvious: nobody wanted to put the weekend in jeopardy. We agreed to deploy on Monday.
On Monday, I arrived at the office at around 08:30 a.m. I was looking forward to monitoring the outcome of our previous Sprint. A couple of minutes later, the developers arrived and proceeded with the release. Before 09:00 a.m., the new process was live. So we went to grab a coffee, as we would need a few hours to notice any difference in the interfaces. At least, that’s what I thought.
As I returned to my desk, I noticed I had more messages than usual. Most of them were like this: “David, could you check why this product doesn’t have an image?” At that moment, I thought, “It can’t be that we caused a problem with the images through the compression process. We tested that thoroughly.” I looked at some products, and I couldn’t find any problem. But suddenly, the couple of products without images grew to thousands, and then to all products. At 10:00 a.m., not a single product had a picture, and I was sweating and starting to panic.
The image compression process is one of the few decisions I regret in life. This nice-to-have process blocked the whole queue, which overloaded our servers and made our database unreachable. We were swimming in shit.
A nice-to-have feature removed all products from our shop, because our system had a rule that took products without images offline. As Product Owner, I knew I had fucked up, and this time it was a massive problem, as customers couldn’t buy anything. Still, I was hoping we could get the pictures back in a couple of minutes. Unfortunately, I was wrong again.
Although we identified the root cause of the problem within minutes, the pressure didn’t help us figure out how to tackle the situation. The CEO was already shouting at us, “What the hell is going on? We are losing money and trust every minute you don’t get these products back.”
Overcoming the Challenge
My mission was challenging; I had to hold back the angry stakeholders. I had to be honest with them; I said, “We know the size of the problem, and we know we fucked up. But we need your patience to let us find a solution for it. Coming to the room every fifteen minutes will block us from fixing the issue. I will keep you updated on our progress.” They accepted that, and we continued to work on the problem.
At around 06:00 p.m., we decided to restore the database and accept a loss of data. Our backups were hourly, so we had to restore the 08:00 a.m. version, the last one taken before the release. The restore took a few hours. At 09:00 p.m., our database was back. But now we had to process six hundred thousand images to make the products buyable again. We started the process in the hope that it would finish by the next morning.
Well, once again, we were surprised. We had made another foolish mistake: the job was not running, and not a single image was processed during the night. When we noticed this on Tuesday morning, we started the job immediately. We had another long and stressful day, but this time with a light at the end of the tunnel. At around 10:00 p.m., all products were back, and sales started popping up.
My Learnings
The story I’ve just shared is painful. When I remember it, I get nervous and start sweating, because I can still recall how stressful and embarrassing those 37 hours were. Yet, looking on the bright side, I learned a lot from this massive failure.
Difference between can and should: in the product world, we can do many things. But it’s not because we can do something that we should. Learning how to prioritize what we should do is vital for Product Owners.
Slow down: when you walk too fast, you miss many opportunities and eventually create problems. I learned that we should start small, implement one change, measure the outcome, and then take the next step. Most of my failures are related to bundling many changes together and then being unable to understand where the problem is coming from.
Don’t solve an absent problem: sometimes we believe we can do something now to avoid a potential problem in the future. I realized that trying to solve problems we don’t have ensures we create problems we don’t expect.