Southwest Airlines and Technical Debt
We have all heard the news about the 2022 Christmas cold front and the devastation it has led to across the country. One company in particular was hit hard: Southwest Airlines. Thousands of flights have been delayed or canceled, and thousands of people have been stranded across the country. This is what Gene Kim calls The Downward Spiral of Technical Debt.
Apparently these problems have been brewing for a long time. “We’ve been having these issues for the past 20 months,” Captain Casey Murray, the president of the Southwest Airlines Pilots Association, told CNN. “We’ve seen these sorts of meltdowns occur on a much more regular basis and it really just has to do with outdated processes and outdated IT.”
It still amazes me that in 2022 we have product owners and IT leaders who do not understand the need to handle technical debt on a regular basis. Marty Cagan wrote a great book called Inspired, in which he describes how eBay nearly went out of business three times because it was choking on technical debt.
The question is: how do companies learn to manage technical debt efficiently? The answer is not as difficult as you might think. Implementing the solution, however, will require brave leaders who are capable of creating the right culture.
There is No Root Cause
First, we have to create a blameless culture. John Allspaw blogged about how there is No Root Cause and about the flaws of the 5 Whys. You might ask yourself what this has to do with addressing technical debt. Once you read the article, the answer is simple. A culture where engineers feel empowered to talk about what went wrong, and why, when outages occur goes a long way toward producing stories and backlog items that improve the resiliency of your company's IT infrastructure. Companies that create a duck-and-cover culture, on the other hand, will not be so lucky.
20% of Sprint Time Goes to Technical Debt Reduction
This is one of the most basic practices for reducing technical debt, and unfortunately I see it being abused a lot. This isn’t 20% for tech management to create side or pet projects. It is 20% at minimum for the team to fix issues with their code, rewrite functions and classes, update their testing framework, etc. Nor is it 20% to do DevOps. It is for the team to decide how to make their platform better and more resilient. The product team still needs to prioritize non-functional stories as part of every sprint. The 20% is specifically for the engineering team to reduce technical debt.
Deploy Often and In Small Batch Sizes
This is really where DevOps comes into play. It amazes me that companies in 2022 still batch up multiple stories and deploy them all at once at the end of the sprint or, even worse, at the end of the quarter. There needs to be a focus on pipeline thinking. Every feature, once it’s developed, should be delivered to production as fast as possible. You might be thinking “Our customers cannot handle that much change at once,” and that could be 100% true! This is when companies should separate code deployments from feature releases. The business needs to be in full control of how and when features are released to customers, and engineering needs to be in full control of how often code makes it into production. Every application and every service should be built around this concept. I wrote a blog post years ago about The Three Ways and The Hello World App, and that post is just as relevant today as it was in 2013 when I wrote it.
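To make the split between deploying and releasing concrete, here is a minimal sketch of a feature flag. Everything in it is hypothetical and for illustration only: the FeatureFlags class, the checkout_total_cents function, and the "holiday_discount" flag are names I made up, and a real system would more likely read flags from a shared store or a flag service than from memory.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; the business side flips these."""
    released: Set[str] = field(default_factory=set)

    def release(self, name: str) -> None:
        self.released.add(name)

    def roll_back(self, name: str) -> None:
        self.released.discard(name)

    def is_released(self, name: str) -> bool:
        return name in self.released


def checkout_total_cents(prices_cents: List[int], flags: FeatureFlags) -> int:
    """Both code paths are already deployed; the flag decides which one runs."""
    subtotal = sum(prices_cents)
    if flags.is_released("holiday_discount"):
        # New behavior: 10% off, computed in integer cents.
        return subtotal * 90 // 100
    # Existing behavior stays the default until the feature is released.
    return subtotal


if __name__ == "__main__":
    flags = FeatureFlags()
    print(checkout_total_cents([1000, 2000], flags))  # feature deployed, not released
    flags.release("holiday_discount")
    print(checkout_total_cents([1000, 2000], flags))  # feature released
```

The point of the design is that engineering can ship the discount code to production on any day, while the decision to turn it on, or to turn it back off, is a flag change rather than a deployment.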
Increase Situational Awareness
The final concept is increasing situational awareness. Years ago I was the director of infrastructure for Aetna’s consumer business. We had an engineering team doing front-end development work for Aetna’s new consumer business platform. The team was struggling because issues would arise with the backend platforms they were developing against, and it would take hours to debug a problem and figure out which service or set of services was causing it. My team and I built a simple monitoring solution that performed a health check on every backend service, validated the JSON schema of its response, and measured its latency. We put up monitors all around the office displaying the dashboard for this simple monitoring service. When an issue arose, everyone in the office knew which backend service was serving up a new JSON schema, sending an incorrect response code, or just down. This led to greater efficiency for the engineering teams and improved communication with the backend team. We could easily notify them that the ID Card Service was down, that the Claim Service was down, etc.
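The three checks described above (is the service up, is the response code right, does the JSON still have the expected shape) plus a latency measurement can be sketched in a few lines. This is not the code we ran at Aetna. It is an illustrative sketch under stated assumptions: check_service, HealthResult, and the fetch callable are hypothetical names, the "schema" check is a simplified required-field-and-type comparison rather than a full JSON Schema validator, and fetch stands in for whatever HTTP client you would actually use.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class HealthResult:
    service: str
    status: str        # "ok", "down", "bad_status", or "schema_changed"
    latency_ms: float


def check_service(
    name: str,
    fetch: Callable[[], Tuple[int, str]],   # hypothetical: returns (status code, body)
    expected_fields: Dict[str, type],       # simplified stand-in for a JSON schema
    expected_status: int = 200,
) -> HealthResult:
    start = time.perf_counter()
    try:
        status_code, body = fetch()
    except Exception:
        # Any failure to get a response at all is reported as "down".
        return HealthResult(name, "down", (time.perf_counter() - start) * 1000)
    latency_ms = (time.perf_counter() - start) * 1000

    if status_code != expected_status:
        return HealthResult(name, "bad_status", latency_ms)

    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return HealthResult(name, "schema_changed", latency_ms)
    if not isinstance(payload, dict):
        return HealthResult(name, "schema_changed", latency_ms)

    # Every expected field must be present and have the expected type.
    for field_name, field_type in expected_fields.items():
        if not isinstance(payload.get(field_name), field_type):
            return HealthResult(name, "schema_changed", latency_ms)
    return HealthResult(name, "ok", latency_ms)


if __name__ == "__main__":
    # Fake fetcher standing in for a real HTTP call to a backend service.
    result = check_service(
        "ID Card Service",
        lambda: (200, '{"memberId": "123", "cardUrl": "https://example.invalid/card"}'),
        {"memberId": str, "cardUrl": str},
    )
    print(result.service, result.status, round(result.latency_ms, 2))
```

Run something like this on a schedule for every backend service and put the results on a dashboard, and the question of which service broke is answered before anyone starts debugging.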
Anyway, I hope that those of you suffering from the Southwest Airlines debacle find a resolution soon. The rest of us should use this as a learning opportunity. Is our platform or service suffering from too much technical debt? Are we about to be the next headline?