From "Technical debt" to "Technical decision"
Date Updated: Thursday, April 18, 2019
Technical debt is a term that reflects the:
"implied cost of additional rework caused by choosing an easy solution now instead of using a better approach that would take longer"
[Definition of the term "technical debt" (plus, some background information and an "explanation"). Techopedia. Retrieved August 11, 2016.]
I would like to present a different approach or point of view on "technical debt":
There is no such thing as "technical debt".
At every stage, we (development managers, leads and developers) decide (or should decide!) what and how to develop based on return-on-investment (ROI) considerations. I would even go further and say it is reckless and unprofessional to "overdevelop" when it is not needed.
Following that, I would change the question from:
"How do we deal with technical debt?"
to:
"How do we make the right decision on what to develop?"
And I would take it a step further and ask: "How do we make the right decisions for investing in any non-customer-facing development?" This also covers the decision making when facing the dilemma of which bugs to solve, what engineering tools to develop, and more.
Now that we've changed the question, let's look at some different types of non-customer-facing development and see what considerations and heuristics can help us make the right investment decisions. I'll do that by sharing real-life dilemmas and the solutions I chose to address them.
We don't have any tests
You have a legacy feature that has been out in the market for a long time. A new developer looks at the code and tests and discovers that there is almost no test coverage for many scenarios and none at all for the feature's core logic. The developer states that you must assign someone today before this blows up in your face.
Pretty scary. What would you decide? Well, here's how I would assess this.
- The feature has been out in the market for quite a long time so I would look at the bug/incident history of that feature. If we didn't have many bugs or regressions introduced, then I wouldn't invest in writing tests for the feature now. I would instead adopt an approach to add tests when a significant bug is found. And yes, I am not fond of adding tests for every bug fix: test code also has a maintenance cost.
- Another point that would affect my decision is our release cycle. If we expose new versions gradually to the market before being full GA, and we are monitoring that version's health during that gradual exposure, then I know I have a safety belt to catch regressions in that feature and can be less worried about missing coverage.
Should we develop Engineering tooling?
Every week your developers complain about needing to check in the final bits of code to several different release branches, and some check-ins require different tools and processes. But there might be a solution: developing a tool that would automatically handle check-ins and manage your releases. Your team estimates that it will take 5 days to build the tool. They urge you to add this task to the backlog and prioritize it.
Hmm, you don't want to be a thoughtless manager who has them do manual work that they could potentially automate easily. What would you decide? Here is how I would assess this.
- First, I bet the work won't end in 5 days. A POC will be ready in 5 days, but then it will have bugs, adjustments, timeout issues, etc. There will likely be additional systems, processes and tools that we discover also need to be integrated; some of those systems will either not have the required API exposed or the integration will be too complex, so that step will end up staying manual. (I had to vent about this: it's not that I underestimate the developers, it's something I've learned from painful past experience!)
- Having said that, I would calculate how much time this tool will save. In this case, we do this special release check-in to several branches once a month. The process includes setting up a machine per branch and then running a set of commands that take considerable time. My estimate is that the tool would save about half a day of developer time each month. So, if developing the tool does indeed take 40 hours (=~5 days), and the gain is about 3 hours a month, then the tool will only start paying off more than a year from now (40 hours / 3 hours per month is roughly 13 months).
- This is the time to share this calculation with the team and brainstorm on more ways to solve the problem with a focus on achieving a better ROI.
In our case, we ended up identifying a solution that took only 1 day to develop: keeping a set of dedicated machines, one per branch, so that all set-up work was eliminated. We also added notifications to our release scripts so that the developer is notified when a release completes and does not need to "busy poll" (this was revealed as an additional issue during our brainstorming!). When we returned to our ROI calculation, we saved 2.5 hours a month for a cost of 1 day of investment! And more importantly, the developers were super happy and supportive.
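The two ROI calculations above can be sketched as a simple payback-period computation. This is a minimal illustration rather than a tool we actually used: the function name `payback_months` is hypothetical, and the 8-hour working day is an assumption that matches the "40 hours (=~5 days)" figure in the text.

```python
def payback_months(build_hours: float, hours_saved_per_month: float) -> float:
    """Months until the cumulative time saved equals the time spent building."""
    if hours_saved_per_month <= 0:
        raise ValueError("monthly savings must be positive")
    return build_hours / hours_saved_per_month


HOURS_PER_DAY = 8  # assumption: one development day is 8 hours

# Option 1: the release tool, estimated at 5 days, saving about 3 hours a month.
original_tool = payback_months(5 * HOURS_PER_DAY, 3)

# Option 2: dedicated machines plus notifications, 1 day, saving 2.5 hours a month.
dedicated_machines = payback_months(1 * HOURS_PER_DAY, 2.5)

print(f"Release tool pays off after {original_tool:.1f} months")            # 13.3
print(f"Dedicated machines pay off after {dedicated_machines:.1f} months")  # 3.2
```

The calculation ignores the likely overrun of the 5-day estimate discussed above, which would only push the first option's payback further out.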
Any product has bugs, sometimes many bugs. Bugs that are reported by customers, surfaced by monitoring systems and health reports, bugs that we find ourselves, and more. Any development manager knows that an entire development team can be locked down for a long time if they decide to try to solve them all.
So how do we decide which bugs to solve, which not to work on, which to prioritize over others and when to solve them? Here are some guiding principles I use.
- Be able to measure, for every bug, the number of users/sessions that hit it, and solve the top *X* (a bar that is decided by Engineering and Product together).
- For newly developed features I lower the bar and fix more bugs as the code is fresh (the developers just wrote it!) and so the fix cost should be lower. I also want to ensure any new feature is high quality and often I'll get a higher ROI because the feature is new and potentially more valuable than a legacy feature.
- Be able to understand the real effect of each bug. You can have bugs with a very high hit rate that have no effect on user experience: for example, some exceptions in code, or an error that is retried transparently to the user but spikes in your telemetry. Knowing the real effect on user experience will enable you to prioritize better.
- Favor solving new bugs that started showing up recently, as those are often caused by new code that was just submitted. You can also often identify the offending piece of code more easily, as it's recent and fresh in the system and in people's minds.
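The principles above can be combined into a rough ranking. The sketch below is only one possible way to do it: the `Bug` fields, the 1.5 boost factors and the sample data are all assumptions made up for illustration, and the real bar *X* and weights would be agreed between Engineering and Product.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    bug_id: str
    sessions_hit: int          # sessions that hit the bug in the measured window
    user_impact: float         # 0.0 = invisible to the user, 1.0 = blocks the user
    in_new_feature: bool       # fresh code, so the fix should be cheaper
    recently_introduced: bool  # started showing up after a recent check-in


def priority(bug: Bug) -> float:
    """Hit count weighted by real user impact, boosted for fresh code (assumed weights)."""
    score = bug.sessions_hit * bug.user_impact
    if bug.in_new_feature:
        score *= 1.5  # assumed boost: lower bar for newly developed features
    if bug.recently_introduced:
        score *= 1.5  # assumed boost: offending code is easier to find
    return score


def top_bugs(bugs: list[Bug], x: int) -> list[Bug]:
    """Return the top x bugs by priority, highest first."""
    return sorted(bugs, key=priority, reverse=True)[:x]


# Hypothetical sample data.
bugs = [
    Bug("retry-noise", 50_000, 0.0, False, False),  # spikes in telemetry only
    Bug("legacy-crash", 1_200, 1.0, False, False),
    Bug("new-feature-glitch", 900, 0.8, True, True),
]

for bug in top_bugs(bugs, 2):
    print(bug.bug_id)
```

Note how the bug with by far the highest hit rate drops out of the top two because it has no effect on user experience.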
When to work on Non-Functional Requirements (NFR)
Another example is deciding when to work on Non-Functional Requirements, or NFRs. An NFR is any requirement that is not a feature, per se. The definition varies greatly across the industry but sometimes includes (rightly or wrongly) requirements such as Accessibility, High Availability, Disaster Recovery, Performance, Security, etc.
This type of engineering work should be addressed and prioritized like any normal feature. More than that, the bar to reach (System SLA, performance numbers, etc.) should be defined by Product like any other feature definition.
The decision to invest in refactoring existing code is based in large part on three things. The first: whether the current design blocks or places a large constraint on our ability to develop the next required feature or capability. The second: whether that specific code is buggy or unstable, to the point where fixing bugs in it is often too risky because the fixes frequently cause regressions. The third: code cleanup, such as deleting "dead code", updating code to remove deprecated features, etc.
Here too we should try to assess the ROI, weighing the effort and risk of refactoring against the potential gain from it, specifically the ease of developing new features and the ease of future maintenance. In most cases, this assessment is not an "all or nothing" scenario. Indeed, sometimes we can attach some of the refactoring work to a different development effort, which reduces the refactoring effort and therefore the cost significantly.
Here are some examples of refactoring considerations and options:
- See if you can build an "adapter" layer on top of legacy code that will expose it in a more suitable format and interface and hence hide or "screen out" the legacy code and its current complexity.
- Raise all the options for minimizing the amount of code to be refactored. See if you can break it into parts and make a separate decision on each.
- Deleting "dead code" or deprecated functions: have guidelines that require developers to remove "dead code", or update code to use new functions, in the areas they are already changing as part of their other development work. This should reduce the cost of removal, as the developer is already working on that code and is hopefully familiar with it.
- Understand what test coverage and other validations you have in place for the code to be refactored. Sometimes you'll find that refactoring requires a significant additional validation effort.
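The "adapter" option from the list above might look roughly like the following. This is a minimal sketch: `legacy_calc`, its status-code convention and `PricingAdapter` are hypothetical stand-ins invented for illustration, not code from a real system.

```python
# Hypothetical stand-in for legacy code that we prefer not to touch.
def legacy_calc(mode, a, b, flags):
    """Legacy convention: returns (status_code, value); status 0 means success."""
    if mode == "ADD":
        return (0, a + b)
    return (1, None)


class PricingAdapter:
    """Narrow interface that hides the legacy calling convention from new code."""

    def total(self, price: int, shipping: int) -> int:
        # Translate the clean call into the legacy call...
        status, value = legacy_calc("ADD", price, shipping, flags=0)
        # ...and translate the legacy status code into an exception.
        if status != 0:
            raise RuntimeError(f"legacy_calc failed with status {status}")
        return value


# New code depends only on the adapter, never on legacy_calc directly.
print(PricingAdapter().total(price=100, shipping=15))  # 115
```

With new features written against the adapter, the legacy code behind it can be refactored later, in parts, or not at all, depending on what the ROI assessment says.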
One of the necessary skills for a great engineering manager is the ability to analyze and decide what needs to be developed, when, and how. Asking the right questions, raising creative ideas, making sure you have the right infrastructure in place to support those decisions, and being ROI- and data-driven rather than pressure-driven all help you make the right decision (hopefully most of the time!). Together they have a huge impact on a development team's ability to develop fast and efficiently.
Penina Weiss, Microsoft for Startups CTO in Israel.