Monday, September 12, 2011

Birth of a Bug

This is a a story of a bug. It was a simple bug. Like many simple problems, on the surface it had a simple cause. Also like many simple problems, in reality it had many contributing factors. Before I tell you about the bug, let me tell you about the application.

let people configure which emails they getAt my place of employment we have various electronic forms that need to be signed by people. For example, if you want to purchase something, there are people who need to approve that purchase - probably your supervisor plus whoever has signature authority for the project you are billing the purchase to. To keep the process flowing people get emails letting them know when they have a form to sign. As it turns out, not everyone wants these emails, and sometimes people only want the emails under certain circumstances. So I wrote an app to let people configure which emails they get.

I deployed this app the other day after work, and promptly the next morning people started using it to turn off email notifications. Shortly thereafter they started sending in complaints that they were still receiving email.

Why didn't it work? How did such a fundamental bug get into the application?

Before answering that question, I'll tell you what the bug was. One of the features of the configuration is that people could turn off email notification, but add a financial exception. As an example, some people only want to get email notifications for requests that are for more than $10,000. Somewhere in the user's NotificationPreferences object is a line that looks similar to:
boolean checkFinancialOverride(Double amount) {
 return (amount >= mMoney);
}
where amount is the amount the form is for ($53.24 stapler), and mMoney is the threshold the user set ($10,000). There's a catch though - what is mMoney if the user didn't specify a threshold? Turns out that there are two possible values, depending on how the value got set. If they never have set any preferences at all, this value is NULL, but if they have set a preference for configuration, but never set a threshold, this value gets set to 0. This means that we do NOT want to send a financial override if the mMoney value is either NULL or 0. The actual code looked like:
boolean checkFinancialOverride(Double amount) {
  return mMoney != null && amount != null && mMoney > 0 && amount >= mMoney;
}
Don't bother looking for the bug in that code, as the bug isn't there. At least not yet. It didn't show up until deployment day. While the bug was simple, there were a number of "best practices" that were violated that led to the birth of this bug.

Don't Make Last Minute Changes
The plan was to deploy the code at 5PM. At 3PM, I had a coworker who suggested that "amount" was a better name for the database column for the threshold than "money" because "amount" is consistent with other projects. (and, of course, "mAmount" would be the Java variable containing the threshold value.)

Anytime you make a change, the risk of breaking something is non-zero. Make the change at the last minute and you won't have time to exhaustively test it. As you have figured out, we broke the code. Then in my "smoke testing" I only chose examples that didn't trigger the new bug. Which is how I unknowingly release code with this bug.

This wasn't a critical user bug. Heck, this wasn't anything that was going to affect the user in any way, shape, or form. I never should've agreed to make this change.

Beware of Semantic Code Changes
When making this last minute name change from "money" to "amount" we split the work up between me and the developer who recommended the change, even though all the code involving this had been written by me. This means that he was changing code that he wasn't familiar with. Here is what the updated code looked like:
boolean checkFinancialOverride(Double amount) {
  return mAmount != null && amount != null && amount >= mAmount;
}
As you can see, the condition for checking if "mAmount > 0" got dropped. Why was it dropped? I haven't asked, so I don't know. I don't know if it was accidental or intentional. Either way, there are related best practices that should have been followed.

Understand Your Changes
Ever run across code in a working system that just can't be right? Before just assuming that the bug is being hidden by other functionality and going ahead and fixing the code, make sure you understand exactly what is happening. If you have access to the original developer (in this case I was just across the hall, a slightly raised voice away), ask why the code is the way it is. It might be a bug, but you might also not be understanding it.

Leverage Refactoring Tools
If you are just renaming a variable, let your IDE rename it for you. This is the type of change that can almost always be done automatically by tools. Use those tools.

Check Your Changes Before Committing
This is something I am a big enough believer in that I may turn it into its own blog post at some point. After making code changes, no matter how large or how trivial, run diff on the code before checking it in. Verify that every change that you are committing is one that you mean. This serves as a simple and quick code review, as well as ensuring that you don't accidentally commit left in debugging statements. Not to mention catching those places where you accidentally inserted or deleted code because of careless clicks.

Run Your Unit Tests
After making any changes, no matter how trivial, make sure you rerun your unit tests. Especially before deploying. You do have unit tests, don't you? Ok, I'll admit it, we had no unit tests on this project, and this is exactly the type of bug that would've gotten caught by unit tests.

Write Understandable Code
While you can follow all the best practices under the sun, you can't force your coworkers to do the same. The best thing you can do to defend against that is to write code that is not only correct but understandable and obvious. Let's look at that query one more time:
boolean checkFinancialOverride(Double amount) {
  return mMoney != null && amount != null && mMoney > 0 && amount >= mMoney;
}
Checking for null is fairly obvious as it prevents NullPointerExceptions when the autounboxing happens. The comparison "amount >= mMoney" is the comparison that is needed. But what's with the "mMoney > 0"? If I hadn't explained it above, would you have guessed why it is here? Maybe, maybe not, either way I would argue that this is a piece of non-obvious code. I will offer up two possible suggestions for how this code could be made more obvious. The first is to add a comment:
boolean checkFinancialOverride(Double amount) {
  // if mMoney <= 0, it is not a valid value and should be ignored
  return mMoney != null && amount != null && mMoney > 0 && amount >= mMoney;
}
But since I think comments are often a sign of confusing code, my other suggestion involves changing the code:
boolean checkFinancialOverride(Double amount) {
  return isThresholdValid() && amount != null && amount >= mMoney;
}

boolean isThresholdValid() {
  return mMoney != null && mMoney > 0
}

Conclusion
Write clean code. Don't make last minute changes. Understand your changes. Write unit tests and use them. These are the steps to avoid unexpected and avoidable bugs.

No comments: