Putting bugs under control

What’s that bug - Angela DiTerlizzi & Brendan Wenzel

Bugs are bad, especially in software engaged in serious business such as saving lives. A few reasons why:

  • Quality medical devices have few known bugs, all of low severity. Open bugs therefore prevent you from shipping. You need to be able to ship at the end of every iteration to your customers (to get feedback) or, more realistically considering integration and product registration, to a system integration team.
  • Bug backlogs are inventories and as such are a form of waste. Lean Software Development advocates keeping inventories low, since a bug left open incurs additional costs:
    • Bugs in the software might provoke very complex system bugs when mixed with hardware and bioware. These bugs take an awful lot of time to investigate and often block the entire project plan. You don’t want this to happen because of a bug the software team already knew about and could have fixed a long time ago.
    • Working out workarounds and teaching them (through documentation or face-to-face) takes time.
    • Bugs are always more expensive to fix later, when knowledge has faded from people’s heads or left with them.
    • Bug backlog engineering (prioritization, endless reviews, risk analysis…) typically ties up several experts at once. That is a tremendous waste of energy when the bug backlog is large.
    • Bug duplicates imply wasteful investigations. A closed bug has no duplicates.
  • Bugs in medical devices have the potential to harm people. As such, they must be regarded with horror and dealt with accordingly. The best way is to fix them ASAP. Don’t give them a chance to slip through your processes.
  • I’ve seen projects with a huge bug count (close to 1000) a couple of times. You know what? They never recovered. Their bug count stayed around 1000 forever. Maybe because of the cost of all this waste, maybe because it spread a sense of poor quality and failure in everybody’s hearts.
Bug count graph
Bug count graph of a real-life project. After a quick phase of exponential growth, bug count stayed in the six hundreds. In spite of two heroic campaigns of bug fixing, the end was inevitable: the flat part of the curve on the right is the clinical death of the project (brutal end, no production, millions lost).

 

Moral of the story: never get high on bug count, or your feet may never touch the ground again.

Guy swallowing bugs
What happens when you let bugs free…

Conclusion: a good bug is a bug killed. Bug count should be close to zero.

I won’t write about bug detection here, only about what you do with bugs once you know about them. Let’s assume you already have a good testing system in place.

 

Bug count threshold and two-phase iteration

So how do we actually manage known bug count? Simple. Set a threshold. Respect it.

  • Set the threshold at the start of the project. Write it down in your Project Plan and have everybody sign it. You’ll still be able to change it, but it’s motivating to give it some official existence.
  • Recommended values for the threshold:
    • More than zero (or you might seriously delay shipping for minor issues)
    • Fewer than a couple of dozen. The maximum threshold will depend on the size of the team and its ability to fix bugs. I suggest that the maximum bug threshold not exceed what your team can fix in a few days when bug fixing is its sole focus.
    • Split limits by bug severity.
    • For example, typical thresholds I use: 0 blocking bugs, 3 majors, 20 minors.
  • Bug count evolution is easier to understand inside the two-phase iteration framework. During construction, you take risks, you build, you refactor: bug count goes up. During stabilization, you stop taking risks, you fix bugs: bug count goes down. There will be a delay due to the lengthy manual testing processes: you will discover the real extent of the bug count some time after the bugs are introduced in your code.
Iteration 0 regulation
Total bug count (orange curve) is below the threshold at the end of iteration 0 (inside the green circle)

 

  • What’s important is what you do when the threshold is not respected. My advice: don’t deliver. Hold the version back until more bugs are fixed. You can’t release into the wild a version that will waste precious integration time or harm patients. You would be ashamed of it. Take the blame for the delay. Put bug count on your information radiators so that everybody gets used to the fact that it’s important. When you have trouble respecting the threshold, talk about it with the people around you and in your team retrospectives. It’s serious. Find solutions.

 

Iteration 1 regulation
Bug threshold is not respected at the end of stabilization of iteration 1 (red circle). An extra stabilization phase is added until the quality criterion is met (new green circle).
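The "don't deliver above the threshold" rule can be sketched as a simple release gate. This is only an illustrative sketch: the severity names and limits come from the example thresholds mentioned earlier (0 blocking, 3 major, 20 minor), while the function names `exceeded_severities` and `can_deliver` and the dictionary layout are assumptions made for this example, not part of any particular bug tracker.

```python
# Example thresholds from the text: 0 blocking, 3 major, 20 minor.
# Adjust these to whatever your own Project Plan states.
THRESHOLDS = {"blocking": 0, "major": 3, "minor": 20}


def exceeded_severities(open_bugs, thresholds=THRESHOLDS):
    """Return the severities whose open-bug count is above its threshold.

    `open_bugs` maps a severity name to the number of known open bugs.
    A severity missing from `open_bugs` is treated as having zero bugs.
    """
    return [
        severity
        for severity, limit in thresholds.items()
        if open_bugs.get(severity, 0) > limit
    ]


def can_deliver(open_bugs, thresholds=THRESHOLDS):
    """A version may be delivered only if no severity exceeds its threshold."""
    return not exceeded_severities(open_bugs, thresholds)


if __name__ == "__main__":
    end_of_stabilization = {"blocking": 1, "major": 2, "minor": 25}
    if can_deliver(end_of_stabilization):
        print("Deliver the version.")
    else:
        over = ", ".join(exceeded_severities(end_of_stabilization))
        print(f"Hold the version back: over threshold for {over}.")
```

A count exactly at the limit still passes, since the threshold is treated here as the highest acceptable value. If your team reads the threshold as a strict upper bound, change the comparison accordingly.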

 

  • What’s also important is what happens to the iteration that follows the one that went wrong. Here’s where the two-phase iteration comes in handy. If iteration N has too many bugs and Stabilization phase N takes 3 more days, Construction phase N+1 will be 3 days shorter. It means that a few user stories will have to be removed from iteration N+1. It also means that since iteration N+1 is smaller, it should be a little easier to get right, so Stabilization N+1 should run more smoothly. There is an automatic short-term regulation effect in the two-phase iteration framework.

 

Iteration 2 regulation
Iteration 2 has a shorter construction, with fewer features, refactorings and bugs created than usual.
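The short-term regulation is plain arithmetic: whatever extra time Stabilization N consumed is taken out of Construction N+1. A minimal sketch, where the function name `next_construction_days` and the 15-day planned construction are assumptions chosen for illustration:

```python
def next_construction_days(planned_construction_days, stabilization_overrun_days):
    """Length of Construction N+1 after Stabilization N ran over.

    The overrun of Stabilization N is subtracted from the planned length of
    Construction N+1. The result is never negative: if the overrun eats the
    whole construction phase, no construction happens in iteration N+1.
    """
    return max(0, planned_construction_days - stabilization_overrun_days)


if __name__ == "__main__":
    # Assumed plan: 15 days of construction per iteration.
    # Stabilization N took 3 more days than planned.
    print(next_construction_days(15, 3))  # 12 days left for Construction N+1
```

The user stories that no longer fit in the shortened construction phase are the ones to remove from iteration N+1.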

 

 

  • In the long run, if you run into this situation regularly, consider increasing the proportion of the Stabilization phase. That is the beauty of the two-phase iteration: it also embeds a long-term regulation system. Take an extreme example: 1 day of construction, 29 days of stabilization. Plenty of time to fix bugs and get the doc right, no? It should not be a problem. This means that there is a proportion between construction and stabilization durations that lets you finish iterations with bugs below the threshold and documentation in good shape. Your job is to find that proportion.
  • This regulation system is vital to any project. What do you do when your car engine gets hot and spits steam? You slow down. The same goes for a team. If a project’s pace is so fast that quality gets out of control, you must slow down. Remember the agile belief that quality is not negotiable? Now show your true colors. Negotiate time.
  • By the way, it is quite logical for a project to slow down after a while, as maintenance effort increases. You might expect the Stabilization phase to grow over time.
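Finding the right proportion is a matter of judgment, but one possible starting point can be sketched in code. The heuristic below is purely an assumption made for illustration, not a rule from this article: it moves the average recent stabilization overrun (rounded up) from construction to stabilization, keeps the total iteration length constant, and always leaves at least one day of construction. The function name `adjusted_split` is hypothetical.

```python
import math


def adjusted_split(construction_days, stabilization_days, recent_overruns):
    """Suggest a new (construction, stabilization) split for future iterations.

    `recent_overruns` lists how many extra stabilization days each of the last
    few iterations needed. Assumed heuristic: shift the average overrun,
    rounded up, from construction to stabilization. The total iteration
    length is unchanged and construction never drops below one day.
    """
    if not recent_overruns:
        return construction_days, stabilization_days
    shift = math.ceil(sum(recent_overruns) / len(recent_overruns))
    shift = max(0, min(shift, construction_days - 1))
    return construction_days - shift, stabilization_days + shift


if __name__ == "__main__":
    # Assumed plan: 15 days construction, 5 days stabilization.
    # The last three iterations overran stabilization by 3, 2 and 4 days.
    print(adjusted_split(15, 5, [3, 2, 4]))  # (12, 8)
```

Treat the suggested split as input for a retrospective rather than as an automatic decision: the team still has to agree on which user stories move out of the shortened construction phase.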

I’ve used these techniques on projects of a respectable size (several years, several dozen people, several thousand bugs created and fixed) and they have proved to work well: known bug count never stayed above the threshold for long.
