Fix The Bug, Not The Symptom

I'm a big believer that a large part of the art of successful debugging is attitude. If you adopt the attitude that you can fix the problem and you're not going to let this piece of technology make you look feeble and unworthy, you'll usually win.

There's more to write about that another time. Today I want to write a bit more about the flip side of this: being a responsible debugger and making sure you are fixing the real problem, not merely making the symptoms go away.
Learn A Lesson From The Villagers

Imagine that you live in a village on the edge of some grassy plains in Africa in the 19th century. Every now and again, a lion comes through the fence, enters a hut and eats all the meat inside. You decide to "fix" this problem. It seems to be a choice between two options:

1. Put a stronger door on your hut and maybe hang a slab of meat on your neighbours' huts as an extra incentive for the lion to go and visit them instead of you.
2. Look for all the places where the fence line can be breached and fix them.

Why is it that the 19th-century African villager will pick the right solution (you picked #2, right?) and the 21st-century software developer will often go for something akin to solution #1? Is it just because "screw your buddy" is much more obvious in this scenario than in the software band-aid patching approach?
Be Doubtful. Be Reluctant. Work Hard.

Tracking down the place where a bug appears is only the first step. That isn't the place where you necessarily need to fix the code. It's the place where the problems have finally mounted up to such an extent that the software falls over. It's the lion coming through the door of the hut when the problem was really the lion getting into the village in the first place.

When you're debugging a problem, as a practical matter, doubt that the place where you're seeing the problem is the root cause. Maybe, once in a while, the problem is something obvious caused by a bad piece of logic a few lines earlier. More often, though, you're going to find that the current method or function was passed bad data by something else, and the problems accumulated until the camel's back finally broke.

The data might not necessarily be bad in the obvious sense. It might be perfectly valid data in another part of the program that you simply weren't expecting to handle at the point where the bug appeared. Sure, you could adjust the final location to handle this new data type, but why do that? Be very reluctant to broaden the interface you allow as input to a function: it just increases the number of code paths you have to worry about. Why is the unexpected data type getting here? Is there a broken assumption somewhere else? Is it newly added code that didn't respect the (possibly unadvertised or implicit) interface? It might be a legitimate oversight, but don't make that your first approach to a fix. That's a band-aid solution. You're fixing the symptom.
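To make that concrete, here's a minimal, entirely hypothetical sketch in Python (the function and field names are invented for illustration). The band-aid widens the interface of the function at the crash site; the proper fix keeps the interface narrow and stops the bad data where it is created:

    def total_owed(invoices):
        # Crash site: a TypeError is raised here when an invoice's
        # amount is None.
        return sum(inv["amount"] for inv in invoices)

    # Band-aid: broaden what the function accepts so the symptom vanishes.
    def total_owed_bandaid(invoices):
        # Every future reader now has to wonder when an amount can
        # legitimately be None.
        return sum(inv["amount"] or 0 for inv in invoices)

    # Root-cause fix: ask why an amount was ever None. If the loader
    # upstream silently produced one for malformed rows, fix it there.
    def load_invoice(row):
        if row.get("amount") is None:
            raise ValueError(f"invoice {row.get('id')}: missing amount")
        return {"id": row["id"], "amount": float(row["amount"])}

Note that the band-aid version now has to handle a value that should never have existed, while the real fix preserves the original, narrow interface.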

I will claim that for any reasonably large and stable body of code, there is usually a coherent set of design choices floating through the implementation. They might not be your choices and they might vary from component to component (particularly if the code has been built up over years and years), but there is going to be some logic to things. Try to understand this implicit logic. Work hard to ferret out the reasoning behind the interfaces and information flow. At some level, a "proper" bug fix feels very right. It feels like you've fixed it in the right place.

Applying a fix that makes the symptom disappear is lazy development. Maybe it's the right fix, but you'd better be able to explain why the problem should be fixed there. Why is that place the cause of the problem? Does the design back you up?

If you write a quick change to the code, run it and think to yourself "cool, that seems to have worked," it should make you feel uncomfortable. Why are you surprised that it worked? Lack of understanding of the root problem, perhaps?
Reviewing Proposed Fixes

Donald Knuth wrote somewhere (maybe in The TeXbook?) that debugging was easier if you were already in a bad mood. The logic is that it's easier to be prepared to rip something apart when you already want to do exactly that.

This might sound abstract, but on some level it's very true. I find my most successful periods of reviewing patches or designs, in my own code just as much as other people's, are when I walk in dying to find a way to reject the proposed solution. If it holds up despite my best efforts to tear it down, it's probably a reasonable fix. This is, of course, the software development equivalent of the null hypothesis in statistics: you have to disprove the opposite claim to have your version accepted.

The one-line fix that applies to the precise spot of the traceback in a bug report is a prime candidate for this treatment. Without a lot of hard data, but with the benefit of about 15 years of professional experience, I will claim that most of those fixes are band-aids, particularly if they're in an area of the code that is executed with some frequency. Why hasn't this problem been noticed before? What special conditions caused the pile-up of problems that made it noticeable only now? Again, working hard and being reluctant to apply the fix that makes the symptom disappear makes for stronger code.
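As a hypothetical illustration of the pattern (the traceback and both fixes are invented), suppose a bug report shows a KeyError and the attached patch simply guards the offending line:

    def display_name(record):
        # The reported traceback points at the next line: KeyError: 'owner'.
        # The attached one-line patch replaces record['owner'] with
        # record.get('owner', ''), which makes the report go away but never
        # asks which code path built a record without an owner.
        return f"{record['title']} ({record['owner']})"

    def make_record(title, owner):
        # Asking the reviewer's questions leads here instead: enforce the
        # invariant where records are created, so every consumer can rely
        # on the 'owner' key being present.
        if not owner:
            raise ValueError("records must have an owner")
        return {"title": title, "owner": owner}

The reviewer's job is to ask whether the guard is justified by the design or whether it merely silences the traceback.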
Surely This Doesn't Happen In Practice?

I'd been planning to write this entry for a little while now. I spend a lot of time reading and reviewing bug reports and proposed fixes. I see more than a few messages on mailing lists asking why ticket number XYZ hasn't been fixed despite having three patches attached that make the problem go away.

Symptom removal happens. Unless you recognise it that way, and are prepared to look for the disease in every patch, it's easy to overlook. Particularly in Open Source code. We (Open Source developers) encourage community submissions. This pays off in the form of lots of bug reports, feature suggestions and patches to fix problems. The hidden downside is that a lot of patches are written by somebody trying to get real work done. They want the fastest fix possible so they can get on with their real work, not devote a lot of time to this piece of code that they downloaded from somewhere. Under those circumstances, it's easy to forget to look at the bigger picture. It's not a black mark against the person writing the patch; all contributions are welcome, and if we don't take a particular patch, it causes no harm. It does explain why an existing patch might not necessarily be the right solution.

So, yes, the problem exists in practice.

Coincidentally, the issue was brought to light recently when a couple of articles appeared on IT websites pointing out that Microsoft's Vista had degraded network performance when playing an mp3 file. That sounded odd on first reading. Microsoft's statement explaining why it was expected (for some value of "expected" that means "we thought it would be acceptable") raised eyebrows around the world.

Robert Love has written a good deconstruction of why this explanation is a model case of symptom hiding rather than bug fixing. You can see from Robert's write-up that a detrimental effect was made to go away instead of the root cause being investigated: why does using the network take up so much CPU? As noted in that article, Linux (amongst many others) has already solved this problem. It's hardly new technology any longer.

Clearly the solution isn't having more funds, more available developers or more process in place. Eternal vigilance is the last line of defence here. How you choose to get into that mode, whether with my "I'm going to assume this is busted until I can prove otherwise" approach or some other way, is entirely up to you.