Knowing the User and Their Unique Environment
As I was working on the "Repeating the Unrepeatable Bug" article for Better Software magazine, I noticed consistent patterns in the cases where I had managed to find a repeatable case for a so-called "unrepeatable" bug. One pattern that surprised me was how often I do user profiling. Often, one tester or end-user sees a so-called unrepeatable bug more frequently than others, and a lot of my investigative work in these cases involves trying to get inside that end-user's head (often a tester's) to emulate their actions. I have learned to spend time with the person to get a better perspective on not only their actions and environment, but also their ideas and motivations. The resulting user profiles fuel ideas for exploratory testing sessions to track down difficult bugs.
Recently I was assigned the task of tracking down a so-called unrepeatable bug. Several people with different skill levels had worked on it with no success. With a little time and work, I was able to get a repeatable case. Afterwards, when I did a personal retrospective on the assignment, I realized that I was creating a profile of the tester who had come across the “unrepeatable” cases that the rest of the dev team did not see. Until that point, I hadn’t realized to what extent I was modeling the tester/user when I was working on repeating “unrepeatable” bugs. My exploratory testing for this task went something like this.
I developed a model of the tester's behaviour through observation and some pair testing sessions. Then I started working on the problem and could see the failure, but only sporadically. One thing I noticed was that this tester did installations differently from others. I also noted which builds they were using, and that there was more of a time delay between their actions than with other testers (they often left tasks mid-stream to go to meetings or work on other tasks). Knowing this, I used the same builds and the same installation steps as the tester, and figured out that part of the problem had to do with a Greenwich Mean Time (GMT) offset that was set incorrectly on the embedded device we were testing. Upon installation, the system time was set behind our Mountain Time offset, so the device's clock was effectively back in time. This caused the system to reboot in order to reset the time (known behaviour, working properly). But, as the resulting error message told me, there was also a kernel panic on the device. With this knowledge, I could repeat the bug roughly two times out of five, but it still wasn't consistent.
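To make the time-offset failure mode concrete, here is a minimal sketch in Python, using hypothetical values and function names (the real logic lived in the device firmware, which isn't shown in the original), of the kind of "clock moved backwards, reboot to reset it" check described above:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical illustration: the device derives its local time from UTC plus a
# configured GMT offset. If that offset is set incorrectly, the derived system
# time can land behind the last known good time after an install.
MOUNTAIN_OFFSET = timedelta(hours=-7)        # the correct local offset in this story
MISCONFIGURED_OFFSET = timedelta(hours=-9)   # a hypothetical bad offset on the device

def derived_system_time(utc_now, offset):
    """Local time as the device would compute it from UTC and its configured offset."""
    return utc_now + offset

def needs_time_reset_reboot(last_known_good, current):
    """The known, intended behaviour: reboot to reset the clock if time moved backwards."""
    return current < last_known_good

utc_now = datetime.now(timezone.utc).replace(tzinfo=None)
last_known_good = derived_system_time(utc_now, MOUNTAIN_OFFSET)
after_install = derived_system_time(utc_now, MISCONFIGURED_OFFSET)

if needs_time_reset_reboot(last_known_good, after_install):
    print("System time is behind the last known good time: rebooting to reset the clock")
```

In this sketch the reboot is the correct, intended behaviour; the bug in the story came from what else was happening while that reboot fired.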
I spent time in that tester's work environment to see if there was something else I was missing. I discovered that their test device had connections that weren't fully seated, and that they had stacked the embedded device on top of both a router and a power supply. This caused the device to rock gently back and forth whenever they typed. So I went back to my desk, unseated the cables so that they barely made a connection, and, while installing a new firmware build, tapped my desk with my knee to simulate the rocking. Presto! Every time I did this with the same build that this tester had been using, the bug appeared.
Next, I collaborated with a developer. He went from "that can't happen" to "uh oh, I didn't check whether the system time is back in time *and* the connection to the device is down during installation, so the error is never trapped." The time offset and the flaky connection were causing two related "unrepeatable" bugs. This sounds like a simple correlation from the user's perspective, but it wasn't from a code perspective: the two areas of code were completely unrelated, and the link between them wasn't obvious when testing at the code level.
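For illustration only, here is a minimal sketch of the kind of guard the developer described. The names are hypothetical (`device`, `system_time()`, `transfer()`, and `InstallError` are my assumptions, not the project's actual code); the point is that the installer has to trap both a clock that has rolled backwards and a connection that drops mid-install, rather than assuming the two never happen together:

```python
class InstallError(Exception):
    """Raised when a firmware install cannot complete safely."""

def install_firmware(device, build, last_known_good_time):
    """Sketch of an installer that traps both failure conditions instead of
    assuming they never occur together."""
    # Trap the clock-rollback case up front, rather than letting the
    # time-reset reboot race with an interrupted transfer.
    if device.system_time() < last_known_good_time:
        raise InstallError("system time is behind the last known good time; "
                           "reset the clock before installing")
    try:
        device.transfer(build)
    except ConnectionError as exc:
        # A half-seated cable (or any dropped link) mid-transfer must be
        # reported, not silently ignored.
        raise InstallError("connection to the device was lost during installation") from exc
```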
The developer thought I was insane when he saw me rocking my desk with my knee while typing to repeat the bug. But when I repeated the bugs every time and explained my rationale, he chuckled and said it now made perfect sense. I walked him through my detective work, and how I had seen the device rocking out of the corner of my eye when I typed at the other tester's desk. I followed the classic conjecture/refutation model of testing: I observed the behaviour, set up an experiment to emulate the conditions, and tried to refute my proposition. When the evidence supported my proposition, I had something tangible the developer could use to repeat the bug himself. We moved forward and were able to get a fix in place.
Sometimes we look to the code for sources of bugs and forget about the user. When one user out of many finds a problem, and that problem isn't obvious in the source code, we dismiss it as user error. Sometimes my job as an exploratory tester is to track down the idiosyncrasies of a particular user who has uncovered something the rest of us can't repeat. Often there is a kind of chaos-theory effect at the user interface: only that particular user has the unique recipe needed to cause the failure. Repeating the failure accurately requires not only having the right version of the source code and having the test system deployed in the right way, but also knowing what that particular user was doing at that particular time. In this case I had all three, yet emulating an environment I had assumed was the same as mine was still tricky. The small differences in the test environment, coupled with slightly different usage by the tester, made all the difference between repeating the bug and not being able to repeat it. Each detail was subtle on its own, but put together, the nuances amplified one another until the application was faced with something it couldn't handle. Simply testing the way we always had, even in the tester's environment, didn't help us. Putting all the pieces together yielded the result we needed.
Note: Thanks to this blog post by Pragmatic Dave Thomas, this has become known as the “Knee Testing” story.