This is an article from early 2020. The world has seen many changes since then, but I believe the topic is as relevant as ever. I figure many of you haven’t seen this before. I hope you enjoy it.
Everyone knows that technical debt is bad. Still, many choose to ignore it in favor of new features that are expected to bring value. Often the potential value of the new features is considered more important than the costs of technical debt. Here’s an account of a team that made this choice and came to regret it. The names are altered.
The big day is near
For team Tidal Wave, the big moment is approaching fast. Their new Payment Router system survived the litmus test with a limited number of carefully selected clients. This Sprint will make the product available for the complete international internal payment flow of the bank. We’re talking about millions of payments totalling billions and billions of dollars.
It will be huge.
One small thing
But there’s this small thing that happened, occupying Claire from the Development Team and Jesse, the Product Owner. Claire is worried:
“Puneet thinks he found the cause of the issue in production. It has to do with the shortcut we chose to apply instead of a rigorous, time-consuming alternative. It doesn’t work properly. Fixing it will take about half the capacity of our current Sprint.”
“We really shouldn’t, Claire. I don’t want to spend all this time on this. The first analysis tells us that we can easily work around it. I propose that we ensure that we all know the workaround and do the fixing later. We have some very important new stuff in the backlog that requires our full attention.”
“It’s only a first assessment, Jesse. We’re not 100% sure that we found the cause.”
“Well, Puneet did the analysis and he knows the code inside and out. He is certain. I trust him fully on this.”
“That’s true. I agree a workaround will suffice. However, we should not forget to add it to the list of ‘known issues’.”
“And I will also put it on the Product Backlog because it needs to be repaired at one point.”
The big day
The canteen is packed because there’s a party. It’s the day of the big launch of Payment Router. The most important product launch of the year. Even C-level is here. Champagne is flowing, though team Tidal Wave sticks to non-alcoholic drinks.
Maria, the CTO, delivers a short speech in which she praises the team and then pushes a red button. With that, she symbolically starts the processing. Immediately, the monitoring screens show a steep rise in the flow, to around 50 times the volume of the litmus test. And the flow remains stable. A big roar fills the room. What a success!
The first night, 11.30 PM
Puneet just arrived at his apartment, ready to have a good night’s sleep. It was a fantastic day; everything went perfectly.
Then his phone rings. It’s the help desk. He’s not surprised. This 24/7 functionality is new, so everyone needs to get used to how it works and the types of messages that it generates. This is why not one but two people from the team are on standby this week.
“Hey, Puneet. I see a flat-liner on the monitoring screen. There doesn’t appear to be a standard scenario for this situation. It looks like the processing has stopped.”
“Hmmmm. That’s odd. No, it’s impossible. We will dive into it and keep you updated.”
After a quick analysis, he knows he has to call Claire to tackle the topic.
“Hi, Claire. Do you remember that we had that tiny issue in production a few weeks ago? There’s more to the error than I initially thought. I believe that we need to change the code to fix the issue. And we need to do it fast. I think we can pull this off in an hour.”
An hour later, Claire and Puneet have deployed the fix. But the problem persists. There’s still a flat-liner, so nothing is processed. Puneet mumbles:
“This is bad. Really bad. The problem is more complicated than I initially thought. We need to take a deep dive into the topic.”
Claire sighs and says, “Looks like this is going to be an all-nighter, Puneet.”
It’s 2 AM and the phone rings. It’s Maria, the CTO.
“How are you doing with fixing the issue? Our clients are going crazy. Do you know how much impact this disruption has?”
“I am fully aware of it”, says Puneet. “Our clients don’t receive their money and we’re talking about millions of dollars. Trust me, we are on top of this. Nothing is as important for us right now as solving this issue.”
“If you need anything, just tell me. I’ll make sure you get it.”
“Thank you, Maria. It would be great if you could shield us from people interrupting our work with calls and messages.”
“OK. I’ll arrange that they have to deal with me instead. Good luck!”
The more Claire and Puneet dig into the code, the more they are at a loss for what is causing the issue. On top of that they are getting really tired.
They are going to have to pass the issue along to the team members at the office who should be in by now. It’s 9 AM. Time flies when you are having fun. Thirty minutes later the hand-over is completed. Exhausted, both fall asleep almost immediately.
Next day, 3 PM
Puneet is back at the office to understand the situation and prepare himself for the next night.
Jesse debriefs him: “We managed to get the Payment Router up and running again at 11 AM. With a band-aid. We should be OK now. We will now focus on the permanent fix.”
When Puneet hears the details of the band-aid solution, he doesn’t know what to think. On the one hand, he’s happy that the system works again. On the other hand, the workaround appears to be rather shaky.
“Have you checked if the payments are processed fast enough? I’m worried that our high-priority customers will receive their payments with a delay exceeding our Service Level Agreements.”
Fifteen minutes later the team can only confirm that Puneet is right. There’s an enormous delay in the processing of the payments. Everything piles up. This is an unsustainable situation.
They decide to create a band-aid on top of the band-aid. They also decide to work in shifts. John, Carlos and Alice will work from 8 AM to 9 PM. Puneet, Karthik and Claire will do the 8 PM to 9 AM shift. With this they should be able to survive the first week, which should be long enough to resolve the issues.
1 month after the launch, 10 PM
It’s a five-minute walk from the tram station to the office. For Puneet, it’s the road to hell. He has come to hate the red bricks of the building, and the company logo on top makes him nauseous. It’s been a month of working 90 hours a week, always on his toes. And the months before the launch weren’t a walk in the park either. They’ve taken their toll. But he knows the team can’t do without him. The Payment Router is still far from stable.
There’s also this feeling of being guilty of this mess. If only he had done a more thorough analysis. If only he had asked for someone’s help then.
Claire sees Puneet approaching the office. He’s late… again. Karthik will not be working at all because his wife is having her appendix removed. Claire is disappointed in the whole team. No one from the day shift had the decency to await Puneet’s arrival with her. It feels like she’s alone, sticking a finger in the dike to prevent a disastrous flood. Because during the night shifts, everything is on her shoulders until Karthik returns. Puneet is a mere shadow of himself. Part of her blames Puneet for his wrong analysis. Another part of her knows she could have stopped the team from deciding to ignore this technical debt.
3 months after the launch
The Payment Router is still active. And the situation has improved considerably. In the first week after the launch, the team had around 20 priority 1 incidents per 24 hours (of which 6 were in the middle of the night). Now that is down to one priority 1 incident per day.
The team managed to save the bank’s reputation. But it was a tightrope walk. Another large outage like the one on the day of the official launch would have sealed the fate of the Payment Router and a large part of the bank’s credibility.
Ignoring technical debt had cost the bank millions of dollars. Extreme efforts to repair the situation saved the bank from losing hundreds of millions.
You may be interested in how team Tidal Wave is doing. Well, two months after the launch, around the time that the situation had improved, Puneet called in sick with burnout symptoms.
Claire threatened to leave the company if she had to continue working on the Payment Router team. She moved to a different team.
No one wants to take the place of Puneet and Claire. The Payment Router team has a bad reputation. This is why the team has two vacancies. Externals who are oblivious to what happened will have to step in.
The remaining four do the best they can to continue improving the stability of the system. Their only reason to remain on the team is their sense of duty.
The lesson
Technical debt can be a silent killer. Prioritizing new features over technical debt is taking a huge risk. Every time you make this choice, you consciously choose to ignore quality issues.
Be wiser than team Tidal Wave. Be relentless when attacking technical debt.
This story is just too hard. I can completely relate to it. For 5 months I have been supporting a team that has been trying for years to make it clear to our customer's PO that the software has its back to the wall with technical debt. A month ago, together with the software architect, I managed to scare the customer so much that the team can now work on it undisturbed.