Monday, June 20, 2011

Testing in Production

A co-worker of mine has the picture to the right posted on her door. Last week we put that into practice. We deployed a new internal app that depended on two other components, both of which were also new. On Thursday at 4:30 PM an email went out telling people they could use it. By 4:45 our server had crashed. Thursday was a late night, but we eventually got the system to a state where it could run without crashing the server. On Friday morning (when it really got used), we had to scramble to fix some bugs that were less obvious (at least to the end user) but just as critical.

So what happened?
The short answer is, we didn't do enough testing. Every bug that we've found (so far) arguably should've been found in testing. All in all, it was a pretty embarrassing experience, especially since part of the system was our first internal Rails app, which I had pushed hard to get us to use. However, in hindsight, if I had it to do all over again, I think I'd do the same thing. Why?

Obviously, it's not good to have such a big failure. Plus, I never like to *have* to work late. However, no decision happens in a vacuum; every one has to be compared to the alternative. We probably could've found most or all of the bugs with another week of testing. So which was better for the Lab? That we test for another week, with the opportunity cost that two developers can't be working on other projects? Or that we deploy a poorly tested app, give a few users a poor experience, and have a couple of people work one long day on a Thursday?

The fact of the matter is, 15 minutes of real user testing found bugs that would've taken at least a week of developer testing to uncover. In terms of people-hours, it was much more efficient to get some real use than to continue testing. As Jeff Atwood says, release your buggy software. So what was the negative? In this particular situation it was an internal app, so there were no customers to lose or investors to turn away. The only negative is that a few people probably think less of me and my division. On the plus side, getting the app in front of people helped define requirements better than any amount of up-front analysis. Also, the majority of this code will be reused in future apps, so this shakeout should make those releases higher quality, letting us regain some of that lost reputation.
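To put some rough numbers on it (the user count here is my guess; all I can say for sure is that it was a few people): two developers testing for another week is about 2 × 5 days × 8 hours = 80 person-hours. Even a dozen users spending 15 minutes each is only about 3 person-hours. Add one long Thursday for a couple of us fixing what they found, and the real-use route still costs a fraction of the extra testing, while surfacing bugs that week might have missed anyway.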

So what's the take-home message? Well, I am still embarrassed by how many problems we had with this release. However, given that there were no real negative consequences, it was probably in everyone's best interest for us to release (potentially) buggy software rather than delay it for more exhaustive testing. Releasing finds missing requirements and shows how the code is really used. No amount of testing can do that.

In general, I try to have this blog be a voice for higher-quality software and for not subjecting users to a bad experience. However, at times it might be better to give a few users a temporary bad experience so that the majority can have a good experience sooner.

2 comments:

Anonymous said...

Wow, I'm surprised to hear these words from you of all people. Didn't you used to work for a software testing company? Wasn't that company always trying to tell their clients not to release buggy code and get into the discipline of testing before releasing? How will you be able to face your former coworkers when news of this posting gets out?
-Alan

Michael Haddox-Schatz said...

While I know you are just trying to get my goat, you're right. I do feel like I have failed that philosophy. (Though I was in the research division and so never told clients what to do, so at least I'm not being hypocritical. :) ) The point I am trying to make is that testing, like everything else, has an opportunity cost that needs to be considered. (Though I am not claiming that we made that consideration up front...)