Load Testing Your Web Infrastructure: Please Be Careful. Part 4

Earlier, we looked at different ways that load testing can go wrong if you aren’t informed or don’t know what you’re doing. In part 1, we talked about a well-meaning person who inadvertently created meaningless tests. In part 2, we saw the disastrous effects of someone with a little knowledge creating a mess. In part 3, we read about what can happen to a network if you unleash load tests while other people are working. In this section, we will talk a bit about some of the underlying math we need to use with load and performance testing. (On second thought, “underlying” is a bit misleading as a term. It is actually foundational, but it’s also lots of fun. It’s fun even for math phobics, as long as you get help from time to time.)

NOTE: I am simplifying the math descriptions here for brevity. If you are a stats expert, please don’t be offended by my glossing over the details. The point here is to provide a basic amount of information so people get the gist of it.

What? We Need Math?

It’s one thing to generate load and point out potential issues, but the real key to performance and load testing is an understanding of probability and statistics. A lot of problems are uncovered through basic statistical analysis, and reports on this testing are also used to help with forecasting, service commitments and purchase decisions. Communicating anything useful and actionable about performance requires stats and probability knowledge and skill. It’s important to highlight that generating load and successfully taxing a test system is the easy part of load and performance testing. The hard part, and the time consuming part is to figure out what the results data is telling us, or not telling us. This requires a working knowledge of statistics, including:

  • Averages
  • Means, Medians, Modes
  • Standard deviation
  • Confidence intervals
  • Distribution types: normal vs uniform
  • Statistical significance, equivalence, and outliers
  • Percentiles
  • Probability

It’s also important to have a good knowledge of elementary math:

  • Addition and Multiplication
  • Exponentiation
  • Combinatorics

You don’t need deep expertise in these concepts, but a working knowledge is important, as well as the ability to work with these concepts in popular productivity or math tools.

It’s one thing to manage the math, it is quite another to communicate what the math means to stakeholders clearly, honestly, and with context. It’s also important to be able to explain the limitations of what your math work has revealed.

While I’m not an expert in probability and statistics, I had worked at conferences and workshops with performance testing luminaries Scott Barber and Ben Simo. I once spent hours in a conference hotel lounge with Ben Simo as he dumped game pieces on the table and would ask me to observe and describe what I saw. Little did I know that this data visualization practice would help me track down a nasty performance bug months later. I also took online courses, attended other workshops and talks, and tried out various tools. Once I was comfortable with generating suitable levels of load, working with the numbers started to take precedence in my work.

Basic Math and Exponentiation

Performance and load testing requires dealing with large numbers, and calculating and observing the effects of addition and multiples. While this sounds simple, it can be deceptively complex.

At its simplest, generating load against a test server requires generating multiple simulated users, which in turn requires counting and observing. For example, if you generate 10 simulated users with a testing tool, you need to observe your test environment and see what effect that has on it. Does the machine work harder? What do CPU usage, I/O and other measurable aspects look like? For most systems, ten is a small number and may not even register, so what happens if you simulate 100 users? Furthermore, can the network infrastructure you are using handle that much load, or will it limit traffic in unintended ways?

Once you are absolutely sure that yes, your 100 simulated users are exercising the test server more or less like 100 real users would exercise your production server, now you can start to add on more. What happens with the 101st user? Nothing much? Ok, let’s add more and observe. The trick here is to find the point where unintended behaviors start to occur when you add that nth user to the tests. The temptation is to think of this as a linear graph, where nth amount of load will add n amount of server utilization, but that isn’t how this tends to work. What often happens is the nth user causes a surge in server activity, which looks like a geometric graph, or a hockey stick shape effect. Adding that nth user causes I/O to go out of control, or CPU utilization to stay at 100%, or memory usage to get used up, etc. In other words, that nth test user causes the system to get overwhelmed, rather than increment resource usage the way all the previous ones did. This forces us to move from thinking about addition and multiplication, or simple product calculations, and start looking at exponentiation.
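To make the hockey stick shape concrete, here is a minimal sketch using a textbook single-queue model (M/M/1), where mean response time is 1 / (service rate - arrival rate). This is not a model of any system described here, and the service rate of 100 requests per second is a made-up number. It only illustrates why each extra unit of load costs far more near saturation than it did at low load.

```python
def mean_response_time(arrival_rate, service_rate):
    """Mean time in system (seconds) for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        # At or past saturation the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical: one server can complete 100 requests per second.
SERVICE_RATE = 100.0

for load in (10, 50, 80, 90, 95, 99):
    millis = mean_response_time(load, SERVICE_RATE) * 1000
    print(f"{load:>3} req/s -> {millis:8.1f} ms mean response time")
```

Going from 10 to 50 requests per second roughly doubles the response time in this model, while going from 95 to 99 multiplies it by five. Real systems have many more moving parts, but the shape of the curve is the thing to watch for.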

Exponentiation in simplest terms deals with the rapid increase of numbers. This can occur in distributed systems for a lot of different reasons. There can be a massive influx of users for unpredictable reasons, there can be massive increases in utilization of hardware components, there can be data that grows unbearably large quickly … the possibilities are numerous. In other words, something unexpected happens, and suddenly there are huge numbers that are impacting things, and we get called in because these rapid increases upset the status quo, making things worse. This is a complicated topic with lots of discrete math concepts, but it is fun and rewarding to study, as long as you aren’t learning during a production outage.

Even simple product based calculations can be tricky, especially when small numbers can lead to large numbers. Without some thought and analysis, this can lead to poor results. Our brains struggle with large numbers (hence the need to create computers in the first place), and our shorthand for dealing with them can get us in trouble.

How many servers do we need??!??!??!!

One project I worked on required a backend overhaul due to the addition of a suite of mobile apps. The mobile apps used the existing server infrastructure differently than the legacy suite of web apps, and there were some nasty load-related surprises. Trouble was, these surprises were major bugs that required architectural changes in the code base, as well as the server hardware. There was little appetite to address those issues due to cost, and politics, so they were deferred for a later release. In the short term, that meant that they had to severely curtail the estimates of simultaneous users per server with the addition of mobile app usage. (Note, when I say severe, I mean severe, as in a factor of 10 reduction of users.) The thinking was to get a couple of friendly existing customers to take on the mobile app product as beta testers, and then slowly roll on more organizations as the existing code base and infrastructure was updated. Trouble was, some of the sales people weren’t on board with this, because they wanted the potentially lucrative sales and commissions for that now, not months in the future. One salesperson returned from a trip with a friendly, major customer, who had signed up for an early release of our mobile app suite. There was great rejoicing. However …

One of the most important things I do when I take on performance and load testing projects is to read all the published claims about the system. That includes the README files, the release notes, website and other pubs, blog posts, and most importantly, any contracts with user and performance commitments and SLAs (service level agreements.) I asked for the contract that the sales people had signed with the customer, and I was horrified. They agreed to an enormous number of licensed users, starting modestly, but increasing at 3 month intervals over two years. The numbers didn’t look too bad at a glance, but when you factored in that they committed to doubling, tripling, quadrupling, etc over time, it was cause for concern. The lead architect and I spent a few minutes calculating what these commitments looked like in server requirements, and the numbers were insane. If we were to support that number of users without substantial work and massive performance increases, it would require thousands of web servers to support the commitment of one customer.

Getting to the bottom of this required a bit of digging.

It turned out that the lead sales person who had signed the agreement said he had approached QA for information about how many simultaneous users we could support on the test server. He then went to IT and asked how much more powerful the production server was. Since they said it was at least 10x more powerful, he took the QA quote, and multiplied it by 10. He then massaged the numbers to increase to the extreme level to sweeten the sales offer, assuming a massive increase in performance every six months for two years. Of course when he talked to QA and IT, he did not make it clear what he needed the numbers for. We had to explain that you can’t take raw numbers that a server can sustain for a short period of time before crashing, and then multiply it and assume some sort of “half Moore’s law” for the product.
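The kind of back-of-the-envelope check the lead architect and I did can be sketched in a few lines. Every number below is hypothetical (the real contract figures are not reproduced here), and I am assuming for illustration that the commitment doubles each quarter. The point is to compare the servers you would need using the optimistic "lab number times ten" capacity against a sustainable capacity that is ten times lower.

```python
import math


def servers_needed(users, users_per_server):
    """Whole servers required to carry a number of simultaneous users."""
    return math.ceil(users / users_per_server)


# All values are hypothetical, for illustration only.
STARTING_USERS = 2_000        # committed users in the first quarter
GROWTH_PER_QUARTER = 2        # assumed: commitment doubles every 3 months
QUARTERS = 8                  # 3-month intervals over two years
ASSUMED_CAPACITY = 1_000      # users per server, short-lived lab peak x 10
SUSTAINABLE_CAPACITY = 100    # users per server after a factor of 10 cut

for quarter in range(QUARTERS):
    users = STARTING_USERS * GROWTH_PER_QUARTER ** quarter
    optimistic = servers_needed(users, ASSUMED_CAPACITY)
    realistic = servers_needed(users, SUSTAINABLE_CAPACITY)
    print(f"Q{quarter + 1}: {users:>7} users, "
          f"{optimistic:>4} servers assumed vs {realistic:>5} realistic")
```

Small starting numbers hide the problem. It is the compounding over eight quarters, combined with an overstated per-server capacity, that produces server counts in the thousands.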

In the end, legal and senior managers had to approach the customer and try to salvage the sale. They were able to renegotiate the contract SLA into something achievable and sensible. It wasn’t pretty, and the company lost money, but they thankfully didn’t lose the customer. It could have been a serious outcome though, with lawsuits and other potentially calamitous outcomes.

Calculating and Communicating Probability and Statistics

The real fun of performance and load testing for me is in the various ways we can use math to uncover important problems. It can also get a bit messy, since we aren’t dealing in absolutes, but in likelihoods. There is some experience involved in how to manage the uncertainty, and that comes with risk. Taking some calculated risks with the math you use can help your clients greatly reduce the risk in the operations of their systems. I used to really enjoy that uncertainty, using mathematical tools, observation and background knowledge to help inform recommendations, and seeing those ideas pay off in better customer service. The only downside is that when you have in-depth work in this area, you will yell at your computer screen when you see polling data, media articles or marketing campaigns that get it wrong either purposefully to manipulate, or due to a lack of research.

What metrics can we publish?

One system I was brought in to test was being updated to support a significantly higher number of mobile users. They needed to publish some of their user metrics, especially within contracts that required licenses. They wanted to provide a safe number of simultaneous users for customers who were hosting their solution themselves, so they would know what to expect and plan accordingly. This sounds straightforward, but from a statistics perspective, it adds a lot of complication and time to our work. It is one thing to find problems to fix, and to anticipate what you need for your own systems; it is another to make commitments about that to others. For example, if you have too much traffic on your own system, you can quietly add more capacity and no one needs to know. If a customer who hosts your solution is budgeting for servers, they need to have specifics. Also, if they end up with more traffic than they can handle, you might be on the hook, depending on what claims you have made in your SLA.

Company leadership understood what I needed and were willing to provide everything, including a safe test network. What I had to do was determine safe, but enticing metrics that marketing could use to publish in advertising, and sales could use in service level agreements for contracts. The key was, how many simultaneous users could they safely advertise, and commit to supporting legally? The way forward with this task involved a lot of simulation, and a lot of math.

I started by analyzing their legacy product and their website traffic metrics. Unfortunately, the data seemed to be off somehow. When I asked for more information, it turned out that the data I wanted was from two different sources. To make up for that, IT had been asked to add the two datasets together, and divide by two, providing a sort of average. Unfortunately, this isn’t the way to approach this kind of data. When you are dealing with two separate, but related sets of data, it is sometimes called bivariate data. The reason for this was a bit complicated, but imagine that you could get a dataset for web browsers only, and then a dataset for operating systems only. You can use some deduction on this data to get a better sense of the reality of the metrics. For example, if you are seeing lots of Safari browsers, then you know you are dealing with Apple devices. But if you are seeing Chrome browsers, many of them will be on Android devices, though they can also be on Apple and other operating systems. The “averaged” data provided earlier skewed the results in unintended ways because it didn’t account for those proportions.

To cope with the bivariate data, I reviewed Chi-Square analysis from university statistics, and read up on how to analyze bivariate data accurately. I use spreadsheets a lot, so I found some YouTube videos on built-in analysis I could use there. Fortunately, while I was struggling with my calculations, a programmer who had worked with complex statistical systems was sent my way. He happily took over the task and used a more suitable approach. The numbers he generated looked much more realistic. With a bit of research we were able to find the proportions of mobile operating systems and web browsers, and our analysis revealed something similar in these metrics.
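For readers who have not seen a Chi-Square test of independence, here is a minimal sketch of the statistic computed by hand. It is not the approach the programmer ended up using (I am not describing that here), and the counts in the table are invented. Rows are operating systems, columns are browsers, and each cell is a number of observed sessions.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if OS and browser were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical session counts: rows = (iOS, Android), cols = (Safari, Chrome).
sessions = [
    [90, 10],
    [0, 100],
]

print(f"chi-square = {chi_square_statistic(sessions):.2f}")
```

A 2x2 table has one degree of freedom, and the usual 5% critical value for one degree of freedom is about 3.84. A statistic far above that says browser and operating system are strongly related, which is exactly why adding the two datasets and dividing by two misleads.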

Phew. Our first math problem was out of the way. However, this had implications for our testing. We had to repeat certain tests to increase our confidence in our analysis. I’m simplifying for the sake of brevity here, but essentially, we needed to figure out a realistic sample size, and calculate our margin of error, or confidence interval. It got a bit complex, and meant we had to have a production snapshot available for a few days and did nothing but re-run subsets of our load tests on it, and analyzed results based on our prior calculations.
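As a rough illustration of the sample size and margin of error questions, the sketch below uses the normal approximation with z = 1.96 for a 95% confidence level. That approximation is an assumption on my part for the example. With only a handful of runs a t-distribution would be more appropriate, and the timing values are invented.

```python
import math
import statistics


def margin_of_error(samples, z=1.96):
    """Approximate margin of error for the mean of the samples."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


def runs_needed(stdev, target_margin, z=1.96):
    """Approximate number of runs to get the margin down to target_margin."""
    return math.ceil((z * stdev / target_margin) ** 2)


# Hypothetical page load times in milliseconds from five test runs.
load_times_ms = [100, 110, 90, 105, 95]

mean = statistics.mean(load_times_ms)
moe = margin_of_error(load_times_ms)
print(f"mean = {mean:.1f} ms, 95% interval is roughly +/- {moe:.1f} ms")
print(f"runs for +/- 10 ms when stdev is 50 ms: {runs_needed(50, 10)}")
```

The second function shows why re-running tests gets expensive. Halving the margin of error takes about four times as many runs.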

Next, we analyzed the new system that would support much more mobile traffic. What might change now that we had better mobile support? Would the proportions of OS/web browser remain the same, only increase in amounts, or would traffic behaviour change completely? Since most people like to use their mobile devices first, we felt that it could have a much larger impact than just increasing the same traffic as the legacy system. The behaviour and type of traffic could change significantly. This was a prediction, or a hypothesis, and we needed to research published metrics of mobile usage when web sites became more mobile friendly to help bolster that prediction.

While we were researching and adapting our tests to better reflect production data, I was extremely fortunate to be on-site during a system outage. I was able to view errors, request snapshots of server logs, server utilization and other metrics, and anything related to data. What are the queues doing? Are there problematic processes, tables filling up, etc.? Also, we were able to gather hardware and network infrastructure information. After the initial problem solving to get the system back up, failure point analysis and bug reports, we were able to pore over the data to get a picture of the weak points in the existing system. This also required some math, since server utilization and other metrics have different formulas. One type of hardware might use one set of metrics, while another might use something that sounds similar, but uses different calculations. In other words, a “one” might be a great measure for one type, while another might use a percentage, like “97% utilization”. Furthermore, “97% utilization” might be a good metric for one service, but a red flag for server CPU usage. Monitoring a web server vs monitoring an RDBMS vs network activity can also be very different. Also, different applications can behave differently, utilizing different infrastructure and services depending on their unique needs and client load. Context and an understanding of what tools to use and what the metrics mean is vital.

We identified problem areas in the existing system, and then created conditions in the test environment to reproduce this at lighter levels of user load. Then, we used real mobile devices with different OS and web browser combinations and captured their traffic information so we could add those into our load tests. We then used simulated mobile clients to analyze the system and observed how and where the increased mobile clients would impact the servers. Next, we figured out how to artificially create some of these unique conditions in key areas of the system. For example, we created tools to eat up machine memory, or to cause database queries to slow down or even hang. We tried to determine how an influx of mobile users might use the system differently, and created tests based on typical user scenarios mobile users would be interested in. We also determined peaks, such as peak usage by number of simultaneous users, as well as peak usage with regards to system utilization. This is important, since a lot of simultaneous users reading a marketing release is easier to support than fewer users who are taxing the system using applications. From there, we got a good sense of how the system behaved under heavier load vs. lighter load. Once we had a suite of tests that had a good mix of mobile and PC users, doing simple things and more complex things, we were able to simulate our projected system behavior, once it was released into the wild. We could also force conditions that could be problematic, so we could determine outcomes with various combinations of things going wrong on the back end. For example, what happens if an influx of mobile users all do the most taxing thing that could be done to the system, from a user workflow perspective? In other words, we were modeling expected server behavior based on both web and mobile application usage.

Finally, we worked on what areas we were going to measure. Management had asked for the greatest number of simultaneous users that the system would support, but this is a bit too vague. It is one thing to measure how many users can connect to the home page, versus how many users can use the supported apps, versus a combination of browsing, lightweight processing and apps that require heavy processing. Furthermore, while a server might be able to handle many users without crashing, if the performance is poor, people will get frustrated. Similarly, a server may handle a certain level of traffic for a period of time, and then stop performing adequately, either by slowing down considerably, hanging or crashing, etc. Or, a server may manage many multiple users, but it may become unreliable, also negatively impacting their user experience. To determine what to measure, we needed to utilize the following related testing approaches:

  • Load testing
  • Stress testing
  • Duration testing
  • Performance testing

Load testing is about generating a number of simulated users, and analyzing the system. Stress testing involves simulating enough traffic to push the server to its limits, or to failure, in order to learn limitations, what behavior to be aware of in production, etc. Duration testing involves load testing over time. Finally, performance testing is all about the measurement. It’s one thing to survive load, stress and testing over a duration, but qualitatively, how is the performance? What measures can we use to signify “good”, or “adequate”, or “poor” performance? We decided to measure average connection times to the website, and the duration of completing the most common tasks in the mobile apps. That meant we did the typical web measure of simultaneous users and page load times, but we also timed how long it would take to do important things. That said, we needed to be wary of averaging these values too quickly, since it is important to find outliers and identify their underlying causes. Once we had a reasonable sample size, we performed calculations such as standard deviation, in addition to spotting outliers, repeating the conditions that caused them, and verifying when they were eliminated. For example, one issue we ran into was a nasty database table that required a lot of processing time to read, write, update, etc, and that could impact the load times at seemingly random points in user workflows. Once we found a fix, a subset of time delays on certain pages was eliminated.
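One simple way to flag outliers before averaging is to look for samples that sit more than a couple of standard deviations from the mean. The sketch below does that with invented timings. The threshold of 2.0 is an assumption for the example, not a rule, and other methods may suit skewed data better.

```python
import statistics


def find_outliers(samples, threshold=2.0):
    """Return samples more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) > threshold * stdev]


# Hypothetical task durations in seconds; one run hit a slow database table.
durations_s = [1.1, 1.2, 1.0, 1.3, 1.1, 1.2, 9.8, 1.0, 1.2, 1.1]

print("mean with outlier:", round(statistics.mean(durations_s), 2))
print("outliers:", find_outliers(durations_s))
```

Here a single slow run pulls the mean up to about two seconds when nine of the ten runs took little more than one. Reporting only the average would hide both the typical experience and the problem worth investigating.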

Next, we analyzed mean, median and mode for each of our measurement points. Mode is one of my favorites for analysis, because it shows the frequency of a result, which can look different when graphed than a mean or median. A mode can show a cluster of points at an unexpected part of a graph, which is a sign that there is a performance problem that needs to be addressed. Once averages of our data are calculated, based on sample sizes that are sufficient, I then use one of my secret weapons: percentiles. Percentiles can be used in several ways with performance testing. A percentile takes a portion of the results, which you can then analyze as a subset of your full set of data. For example, with the 90th percentile, you eliminate the top ten percent of your result set, and look at the remaining 90%. I have found a lot of performance issues in systems using percentiles to analyze and visualize data that weren’t apparent when using the full data set. This works because the top results can skew the overall results, pulling the graph into an area beyond the mode, for example. There are several ways you can use percentiles to find patterns and problems that are shown in test data, but this is one I use a lot to troubleshoot. I often use the 80th, 85th and 90th percentiles in various ways to find unexpected results in the data. Those three work really well for me to find problems that get flattened out when using the 100th. Percentiles are used in other ways in performance testing, but this is a potent analysis tool when you are finding problems.
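The percentile trimming described above can be sketched with Python's standard statistics module. The load times are invented, and the choice of the "inclusive" quantile method is an assumption for the example. Tools differ in how they interpolate percentiles, so small samples can give slightly different cut-offs elsewhere.

```python
import statistics


def trim_above_percentile(samples, percentile):
    """Keep only the samples at or below the given percentile (1 to 99)."""
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    return [x for x in samples if x <= threshold]


# Hypothetical page load times in milliseconds.
load_times_ms = [200, 210, 210, 220, 230, 210, 240, 250, 1900, 2100]

print("mean:  ", statistics.mean(load_times_ms))
print("median:", statistics.median(load_times_ms))
print("mode:  ", statistics.mode(load_times_ms))

trimmed = trim_above_percentile(load_times_ms, 80)
print("mean at or below the 80th percentile:", statistics.mean(trimmed))
```

In this made-up data the mean is more than double the median and the mode, which is the sort of gap that tells you a few slow results are dominating the average. Trimming at the 80th percentile shows what the bulk of users actually experienced, and the values that were cut off are the ones to go and investigate.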

Once the system is tuned, anomalies have been discovered and reduced, and the response times fit a normal distribution that is consistent with the mean, median, mode, etc., we are ready to measure and communicate metrics. First, we need to create a sample set of test results that is reasonably trustworthy. We don’t necessarily need to have a great deal of rigour with these calculations (such as formal statistical significance), but we need to run the tests enough times to have confidence in them. For example, running the tests once is not enough for a sample set of data. On your project, running them 100 times with the same build, the same equipment and conditions, etc. might be large enough. Or, you may need to run them a thousand times. In general, the larger the sample size, the better, but diminishing returns can kick in too. This requires some experience and judgment. Other projects may budget for the time and expense to do an auditable, full set of statistical calculations. I will use percentiles here again, but rather than using them to look for problems, I am using them to assess the validity of the set of test results we are working with. If I find something surprising, then there is either a bug we didn’t encounter, a server misconfiguration, or a problem with the tool or test environment itself. Once we are happy with the sample set data, we can start capturing metrics and generating reports. (Reporting results could take up several blog posts to cover, so I will just touch on it.)

Determining server performance metrics that we want to commit to isn’t an exact science. Our test environment is rarely identical to a production environment, and no matter what we do to distribute simulated test users, etc, we aren’t completely emulating real world conditions. As a result of the statistical calculations, and analyzing the probabilities of events occurring, we tend to deal with percentages. “We guarantee a 99% up time” is a common one we see in marketing materials. They don’t say “100%”, because there are so many factors beyond their control that might temporarily cause down time. Server up time is a pretty simple metric to measure and communicate, whereas performance is even less exact. For example, in testing, 90% of users may experience page load times of a certain average, or falling in a certain range, 90% of the time. Furthermore, the metrics we publish to brag about versus the numbers we are legally required to meet might look very different. For example, we may find that a certain type of server configuration is adequate for performance targets, using a certain number of users. An aggressive approach might be to publicize one particular set of data that is attractive. We reached that level once, so we will tell the world we can do it. When it comes to SLAs though, we will likely be much more conservative. In some cases, an average is determined, and then some breathing room is built into those metrics by diminishing them, just in case of some events in production that weren’t apparent in test.
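Two bits of arithmetic from this paragraph are worth seeing as numbers: how much down time an up time percentage actually permits, and what "breathing room" does to a measured figure. The sketch below assumes a 365-day year, and the 25% headroom and the measured user count are invented values, not recommendations.

```python
import math

HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of down time permitted by an up time percentage over a period."""
    return period_hours * (100 - uptime_percent) / 100


def conservative_commitment(measured_users, headroom=0.25):
    """Reduce a measured user count by a headroom fraction before committing."""
    return math.floor(measured_users * (1 - headroom))


for uptime in (99, 99.9):
    hours = allowed_downtime_hours(uptime)
    print(f"{uptime}% up time allows about {hours:.1f} hours down per year")

# Hypothetical: tests sustained 1000 simultaneous users on one configuration.
print("users to commit to in an SLA:", conservative_commitment(1000))
```

A 99% guarantee sounds strict, but it leaves room for more than three and a half days of down time a year. That is a useful reminder that percentages need to be translated into something concrete before anyone signs them.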

Communicating and reporting results requires skill and experience. Figuring out what is useful to measure, and how to accurately analyze and interpret those measurements, is part of the picture, but communicating what that means, what the limitations are, and providing advice on how to proceed is much more difficult. It’s one thing to do the math, and it’s altogether another to do something useful and helpful with it.

Lies, Damned Lies and Statistics

One of the great side effects of load and performance testing is how formerly intermittent bugs start to become repeatable. This is due to high volume test automation, one of the most powerful and useful test automation approaches you can use. While it is often unintended, adding load starts to cause problems to bubble up. This is so common, I always recommend teams schedule time around their load and performance testing efforts to deal with the inevitable issues that crop up. This is a good thing, because it helps improve the overall system and the end user experience with your software. In the short term though, it can be frustrating and might threaten schedules. These problems tend to require time and effort to fix, so while testers get excited, project managers start to get nervous.

One performance testing project I worked on had a particularly nasty “unrepeatable bug.” Once in a while, a tester using one of the web apps would experience a crash. This crash would also cause the test web server to hang, requiring a manual restart. No one was able to repeat it, so it was put into the state where bugs go to get forgotten, otherwise known as: “We’ll monitor it.” One day, the QA team installed a major new build. The team was getting ready to release a new version of the software with some new features and important bug fixes included. We started to run our automated tests, and testers began to work through their daily tasks. Suddenly, there was the familiar crash, and the required server restart. We had four test servers at the time, with one dedicated to our load and performance testing, with the other three available for other testing work. The testers moved on to a new server as the frozen one was restarted, and then the bug happened again, a tester saw a crash report, and the server froze up. Now there were two. Once again, a server froze up, and the testers were all on one test server. It crashed, and so did the load testing server. “That’s odd.” At one point, we had all four test servers requiring a restart at the same time, and this was causing serious productivity issues in QA, not to mention the implications for the new release. We raised the issue with the product and project managers, and started to analyze it.

The testers all kept track of what they were doing when they saw the bug, but we quickly set that aside. There was a factor in the system that wasn’t observable through the UI that was the likely culprit. We started to monitor the servers, turned up logging to get more information, and when a crash occurred, we tried to investigate every component of the web infrastructure on that server. We used low level load testing traffic on each of the servers to cause the bug to occur even more frequently. It took a couple of days, but we realized there was a strange race condition, where two services were utilized at exactly the same time. In the previous version of the software, this happened infrequently, but now, it was happening a lot. But, at least we had a repeatable case, and with the aid of our automated tests for load testing, we could repeat it on command, within five minutes. That gave developers the opportunity to run their debugging tools and track the issue down so they could fix it.

Trouble was, the fix was not an easy one, and was extremely political. To fix the problem required some major architectural rework, and re-opened a major debate on the development team. There had been bitter disagreement on a particular direction, and the one that was chosen was not popular. Now that the unpopular architectural decision was shown to be problematic, the issue blew up. There were heated arguments, lots of negative back channel chatter, polarization over possible solution ideas. All of this caused a lot of hurt feelings and resentment on the team. Some minor server setting tweaks were proposed, and each of them helped reduce the frequency of the bug somewhat, but didn’t reduce it enough. The team now had a choice: proceed with the release as-is, delay the release to try to find a temporary fix to reduce the occurrence more, or put the release on hold until the rework could be done to remove the problem for good. I was tasked with coming up with an impact assessment to help management determine a course of action. Here is what we observed, so I recorded it:

“Intermittently, a catastrophic bug causes a web server to crash, requiring a reboot. This means that once the bug occurs, the server is not available for users until it has been restarted. It doesn’t corrupt data, but it deletes the work that the user was currently working on, so they have to start over. The user will see a crash message, and once they refresh and connect to a new server, they have to log in again, and start over. In the meantime, there are fewer servers available, which means that at times, some users are unable to connect until someone else logs off. We found that on average, one in five users who connected to the server would come across this bug. This is a high probability issue, and it affects more than just the person who triggers the crash, the server is now unavailable for anyone until there is IT intervention. It costs time and money, not to mention the extreme frustration of the users who experience this. With self-hosted equipment, there is time required by IT to go and reboot the server, often several times a day. With cloud-hosted infrastructure, moving to new servers could cause expenses to increase significantly.”

Unfortunately, the people with political power did not want to fix the problem, they wanted to release. They took my 1 in 5 occurrence metrics and reframed it. While it wasn’t technically a lie, they greatly minimized the impact of the bug. This is what they told senior management:

“There is a severe bug that QA have found a repeatable case for, but it is going to hold up the release to fix it. The bug only happens 20 percent of the time!”

They also heavily implied that it was happening more frequently in the test environment because the QA team were abusing the system to find more bugs. Technically, we were using load testing tools to generate very light levels of load, but they didn’t say that. Instead, they said, “You know how QA are, and they are also running load testing!!!” which made it sound like it would happen more frequently in test than in production. However, we were extremely worried about how often it could occur in production, with thousands of users, instead of the 15 testers and light load we were generating in the lab. Senior management decided to move ahead with the release as it was, and take a risk on the bug not occurring at all, or occurring infrequently. Why did they do this?

A 1 in 5 chance of something occurring is quite high. So is 20%, but twenty percent sounds smaller. If you use that figure without context, and your attitude is to make it seem small and insignificant, people will generally interpret it according to how you spin it. A 1 in 5 chance of the bug occurring in production could mean that 200 people out of the first 1000 experience this bug. It wasn’t uncommon for client sites to have dozens or even hundreds of simultaneous users, and our servers would peak at 1000 simultaneous users at times. If you think of 200 people seeing this crash, and then many people having to log in to a new server and start over, until license or server capacity was filled, with the system being unavailable for everyone after them, it starts to look more serious. However, the political players decided to just say “It has a 20% chance of occurring.”
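To put the two framings side by side, here is a minimal sketch of the arithmetic. It assumes each user session independently has a 1 in 5 chance of triggering the crash, which is a simplification I am making for illustration and not something we measured. The function names are hypothetical.

```python
def expected_affected(users, p):
    """Expected number of users who hit the bug, assuming independent sessions."""
    return users * p


def chance_of_at_least_one(users, p):
    """Probability that at least one of `users` sessions triggers the crash."""
    return 1 - (1 - p) ** users


if __name__ == "__main__":
    p = 0.20  # the "only 20 percent" figure, which is the same thing as 1 in 5
    for users in (15, 100, 1000):
        print(
            users,
            expected_affected(users, p),
            round(chance_of_at_least_one(users, p), 4),
        )
```

Under that assumption, 1000 users means roughly 200 crashes, and even with only our 15 testers the chance of at least one crash is about 96 percent. The number is the same whether you call it 1 in 5 or 20%; only the impression changes.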

The product management lead approached me and asked for a second opinion. I had to tread carefully because of the political implications of what they had been told, but I explained that even a 20% chance is sky high. For a bug like this, we could risk a 0.02% (zero point zero two percent) chance. Even a 2 percent chance would result in outages that would anger our customer base. By comparison, if you were gambling in Vegas, you’d take a 20% chance all day long. Those are wonderful odds if you are gaming, but terrible odds for a server crash. To hedge their bets, I advised that they create and rehearse a rollback strategy in case the new release was as bad as we expected it to be. Thankfully, the team followed that advice, because the release was a disaster. Every client site had no access at all by mid-morning, which meant that our IT and customer support teams were busy 24 hours a day, dealing with extremely angry people. The release was rolled back, the difficult architectural change was implemented, and the bug disappeared. It was weeks of effort, but if they had decided to wait on their release, they would have been much better off than unleashing something so unstable to the public. They lost a lot of money, they lost face publicly, and they lost some customers. They also lost months of time on their product roadmaps, since everything ground to a halt to address the customer anger and problems, and then efforts were split between support and fixing the problem.

The most expensive combination was this bug on cloud-hosted deployments of the system, which in some cases caused a huge increase in hosting bills. When you couple a frequently occurring server outage and a wish to fix the problem quickly with an extremely easy way to add more servers, you can quickly end up over your hosting limit and incur costs. As you might imagine, there were some extremely angry customers whose IT teams fell into the “just add more” trap to try to minimize the problem.

What went wrong? Someone decided to use metrics to try to spin a narrative that was counter to reality. This happens all the time in the world! It is almost always done by people who want to minimize the problems highlighted by scientific rigour, or to try to maximize public support for unpopular policy. Or it is used by people trying to sell you something. The phrase “lies, damned lies, and statistics” captures how metrics can be used to spin a narrative. It’s important to question narratives, especially if they lack context. What can go wrong? Who wins and who loses when a particular course of action is taken? Are methodologies with weaknesses and strengths explained, or are they glossed over? Is the person presenting the data a relevant authority, or are they just a good talker? What happens if you scale up the numbers (if they are small), or scale down the numbers if they are large? Does the message change? These are all important questions to ask yourself when you are shown data that is supposed to convince you of something. The math lesson here is that how you communicate metrics is important. Spin can blunt a serious issue, and problem minimizers can win out if they are clever, albeit dishonest, communicators.

In the Part 2 story, we saw what a load testing tool can do when it is used by someone who doesn’t have the right knowledge and skill about the tool and underlying systems. However, you also need to understand the environment where you would need to use the tool. Creating and using test environments that are optimized for load and performance testing is a must. If you use these tools on a regular network, you will likely disrupt everyone else at the office, causing lost productivity and extra work for IT staff. The last thing you want to do is try them out at home, and end up blacklisted by your ISP (internet service provider).

Bye Bye Network!

After a while, I was an old hand at load and performance testing. To bolster my hands-on experience, I attended workshops on how to overcome technical restrictions, how to accurately analyze the data and find problems others would miss, how to write reports and describe risk and problems, and I was adept with a handful of tools. I started to get hired for performance and load testing gigs, and under the right circumstances, I had some rewarding and fun projects. I worked with a lot of talented people with vastly different skills, and learned from each of them.

Since I had a lot of retail and telco experience, a work friend asked me to come in to help him with a large retail system that was going through an upgrade. One of my tasks was to provide load testing help, since they were upgrading all the software and hardware for their back end system. I was given a lot of freedom to choose the tools, to interview everyone I could about any backend system issues, how to simulate credit card processing, etc. I was given a lot of freedom to research and design exactly what they needed. However, I was not given a test network to run the tests, so I never used any load. I verified my load tests would work with only one user.

To find potential areas of concern, we set up monitoring at several key areas on the system, and I had test results output in a format we could utilize with statistical analysis software. We also monitored server utilization, and recommended moving some processes around to better utilize the system. We learned a lot, but I wasn’t ready to unleash full load testing capabilities without a dedicated test network. There was no way I wanted to use this on the corporate network, even though we knew it would only run against our internal test system. I knew from experience that we could overload the internal network and cause problems for others. My friend, the dev manager, ignored my concerns. He was confident that the internal network would handle the extra traffic, since the IT admins had shown him that it was perpetually under-utilized.

Despite my objections, the dev manager insisted I run the load tests on the regular internal network. To start, he wanted to run the tests with 1000 simultaneous users, but I suggested we try something smaller. I wanted to try 10, he insisted we try 100. Still objecting, I hit the “Enter” key on my machine to start the tests. Immediately, a collective howl started to swell across the entire floor of the office. Then people started calling out that they had no network access. The dev manager and the IT manager ran to the server room, and when they unlocked it, all we could see in the dark room was a sea of blinking red and yellow lights. Clearly, my load tests had overwhelmed the entire network, and every piece of hardware was in an error state. No one in the office was able to do work until all of the equipment was restarted. It took about a half hour to get the network up and running again, and the first thing my friend said was: “TRY IT AGAIN!!!!” He insisted the network outage was coincidental.

I refused to run the tests again, and made him tap the button on my machine. No sooner had his hand lifted from my keyboard than the collective howl swelled again. The IT admin opened the server room door, and again, it was all blinky lights, and no network access for the company. It was remarkable how quickly the network was getting overwhelmed. Technically, the dev manager and IT team felt it was impossible, but they agreed not to run the tests again until we had investigated the source of the problem. Furthermore, permission and a budget for a test network specifically for load and performance testing were immediately approved by stakeholders.

It turned out that it was an extraordinary event that caused the outage, but it was something that would have happened in production without us catching it internally first. In simple terms, the network cards on the new servers had been set by default to broadcast to each other when under load, to try to load balance. This was a new feature that looked good on paper. However, there was already a load balancing system in place, so this was redundant, and harmful. In effect, the servers spammed each other because they were all under load, and the traffic increased exponentially. Machine one would find itself under too much load, so it would message machine two to get it to process the excess. Unfortunately, machine two was also under extreme load and was also messaging machine one, who was messaging machine two for help, as were machines three and four, messaging each other over and over and over with more and more messages.

To visualize what they were trying to process and the traffic they created themselves, imagine a geometric or hockey stick curve on a graph, or an infinite series in mathematics. The load tests were already creating a huge amount of traffic, but the servers themselves were generating more network traffic at an exponential rate. This traffic generation behavior instantly overwhelmed every component in the corporate network. We quickly turned off that setting in the network cards of the test servers, and then waited for a test network we could safely run the tests on.
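For a rough feel of that curve, here is a toy model. It is not how the network cards actually behaved (I am glossing over those details). It simply assumes that every help request received by an already overloaded server makes that server send a fresh request to each of its peers in the next round. The function name and the round-based timing are hypothetical.

```python
def broadcast_storm(servers, rounds):
    """Messages sent per round when every overloaded server asks every peer for help.

    Toy assumption: each help request that arrives at an overloaded server
    triggers a new request from that server to each of its peers.
    """
    peers = servers - 1
    sent = servers * peers  # round 1: every server messages every peer once
    history = []
    for _ in range(rounds):
        history.append(sent)
        sent *= peers  # every message received spawns `peers` new messages
    return history


if __name__ == "__main__":
    # Four servers, as in the story, over five rounds of messaging.
    print(broadcast_storm(4, 5))
```

With four servers the message count triples every round in this model, and that is before you add the traffic from the load test itself. That is the hockey stick.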

The next time we ran the tests, I had several managers breathing down my neck, but the server outages they caused did not cause any network outages. There was no collective howl, no server room full of blinky error lights. We all breathed a sigh of relief, and we went on a find and fix cycle for a few weeks to get the systems ready for a production launch. We were able to ship with a lot of confidence due to this work, and the load tests were part of pre-production tests for years after that launch.

This was a relatively small company, and the impact was fairly low. The entire development team and IT team sat together, and the infrastructure was in a server room on the same floor as the office. We were able to deal with the outages quickly, and the incident became a part of office lore, brought up when a laugh was needed. It wasn’t without political fallout though, since it was disruptive and problematic. Now imagine if this was a larger company, with IT departments in another location, servers at a hosting provider or on the cloud, etc. There could be considerable downtime, and increased costs with hosting providers, etc. While this situation was more lighthearted due to friendships and a tight knit office environment, it could have been extremely serious.

In the Part 1 story, time, money and effort were wasted. This story is much more serious. Load and performance testing tools can be simple to get started with, but that simplicity belies a good deal of complexity. In other words, a little knowledge can be a dangerous thing. While a tool may look simple, as if there isn’t a lot going on, it has a lot of power and can unleash mayhem on a system. To simulate adequate load, the tools are generating a lot of traffic, which can have unintended consequences unless you know what you’re doing. Using record/playback can be handy when someone has skill and understanding of what they are doing, but when used by someone who is unskilled, it can unleash absolute misery. Just because you can use a tool and generate load doesn’t mean that you should.

A Complete Clusterfuck

A year after the Part 1 story, I was brought in to work with some Agile teams that were helping an overwhelmed IT department. Load and performance testing were brought up, but since I had been down that road before, I explained the work and potential pitfalls to stakeholders. They agreed we should treat it as a separate project, and use a cross functional team. However, a high powered consultancy had brought in a team who were desperate to show their mettle. They were skilled, they had a great reputation for turning projects around, but they were extremely arrogant. I was pulled into a meeting with sneering programmers who mocked my experience and concerns about load testing without analysis and careful planning. After my treatment in the meeting, my manager told me to decline further invitations, and let them “sink or swim.”

I didn’t hear much about what they were doing for a few weeks, but then one day a concerned executive assistant called the CTO. The CTO called the IT manager, who in-turn called the people who were on my team. I was on a small cross-functional team that worked on development projects, but we would get pulled into helping fix any difficult production issues. The problem was that the CEO couldn’t access their work email. After rolling our eyes and asking if they had forgotten their password, we realized that webmail access for the entire company was down. The lead IT Admin and I sat next to each other, and he provided me with a play-by-play of what he was doing. He found that the webmail service was hanging, so restarted it. Webmail briefly came up again, but the service started to hang again. Then more reports came in of poor performance on the corporate network, and some services becoming unavailable. He had to restart the mail servers, which in a large organization is not a simple task. It requires communication to all staff, timing warnings over a few minutes, doing the restart, communicating and monitoring. Similarly, certain areas of the network seemed to be under some sort of attack. Was it a security breach? Did someone have a virus or trojan horse?

Eventually, we tracked down the excess traffic to a particular machine, and it was one of the staff consultants from the arrogant consultancy. The IT Admin blocked his IP from the network, and we went to management to figure out what to do next. We wandered over and initiated a chat with a now angry group of consultants who were furious that one of their team members had lost network access. After a brief explanation, and a query as to why they were nuking our network, they admitted they had tasked one of their junior consultants with researching load testing tools. He had downloaded an open source tool, recorded HTTP traffic, played it back, and then kept adding more simultaneous users. There were several problems here, and senior management were furious. The consultant was kicked off the project and escorted out the door, and the consultancy was warned that they were in breach of contract. They had ignored several directives that they had pledged to follow when they signed the contract. As time went on, more problems than the CEO not being able to access webmail started to emerge.

Internally, there were formal complaints to the IT team about a lack of access and downtime. IT was in violation of their commitments for network and tool availability, and management had to spend time mollifying angry managers in other groups. You have to imagine what can happen in an internal network when someone starts generating hundreds of simultaneous requests over and over. Devices get saturated and stop functioning, others go into error mode, and everything slows to a crawl. IT technicians need to identify areas of the network that need intervention, and try to remotely restart services. In some cases, they had to physically go and restart network infrastructure manually. This resulted in thousands of dollars worth of lost time that day.

Remember when I said that if you record traffic for a load testing scenario, it will capture ALL the protocol level traffic on your machine? It turns out that this programmer didn’t know that or think of that. Later that day, the consultancy found out that they were locked out of their corporate messaging system. This is a core tool for a company that has most of its employees distributed at various customer sites. The load test against our system included all the instant message traffic that occurred while he was recording the scenario. They were without their system for days, while they negotiated with the vendor and tried to explain why one of their employees had essentially executed a denial of service attack. They were able to reinstate their corporate account, but that employee was banned from using it.

A few weeks went by, and an IT Manager came storming into our development area with a credit card bill. There were several thousand dollars worth of mystery expenses on it. It turned out that the day of the tests, he had given the consultancy his corporate credit card number “to run a few tests”, and assumed that they would let him know what they had done, and he would call to cancel them. The day of the load test disaster, the credit card company called to let him know they had frozen his account, but he assured them it was ok, people were running a few tests. By the time he had approached the staff consultants, the load testing had been stopped. Unfortunately, no one thought to connect the dots and tell him how his corporate card had been used. Thankfully, the credit card company found the problem and shut down his card, but the damage was done. He had to get a new corporate card, and it took time to dispute the payments and get them refunded. It took time, energy, and other managers had to use their cards on his behalf.

In the end, the consultancy lost their MSA with the company, and they lost credibility due to one person ruining it for everyone else. Unfortunately, a consultancy with people who weren’t as skilled was hired instead, but they were much nicer to deal with. Internally in IT we had hoped the prior consultancy would work out, because they had the skills and experience to deliver. Due to their arrogance, we all lost out. Furthermore, IT lost credibility with the business for allowing a consultant to wreak that much havoc. Because of the sudden, repeated excess traffic from that location, even our corporate ISP had flagged us, which required some finessing and promises that it would not happen again. If we suggested a vendor, stories about this ridiculous situation would be recalled, and we would get stuck with less ideal providers that other groups chose for us. All of this, plus thousands of dollars of costs, not to mention all the staff work to clean up the mess, was caused because someone without the knowledge and skills used a tool they didn’t understand and ran it on our network. Depending on who retells this story, it can even sound amusing, but it was extremely serious. This person downloaded an unauthorized tool, against the client’s corporate policy, recorded some HTTP traffic, then ran this over and over with various sizes of payloads. A few hours of playing around with something they didn’t understand had extremely serious effects.

Now that I am on the product management side of software projects, I don’t deal with testing approaches in my day-to-day work very much. I get info about product quality criteria, quality goals and metrics, information on testing status and quality, or show stoppers that require attention. Unless I want to dig deeper, I don’t hear much about the actual testing work. Once in a while though, something big pops up on to my radar, usually because there is a threat to a product release, or there is a political issue at play. In those moments, my background as a software tester comes in handy.

Recently, my testing experience was called into action, because of project controversy about load testing.

There were some problems with a retail system in production, and poor performance was blamed. The tech team did not have the expertise or budget for load testing, and were instead pushing the sales team to take responsibility for that testing. The sales team didn’t have any technically minded people on their team, so they approached marketing. The marketing team has people with more technical skills, so a manager decided to take on that responsibility. They asked the team for volunteers to research load testing, try it out, and report back to the technical team. I happened to overhear this, and began waving my arms like the famous robot from Lost in Space who would warn about impending danger by saying: “Danger, Will Robinson!” This is out of character for me, since I prefer to let the team make technical decisions, and rarely weigh in, so people were shocked by my reaction. I will relay to you what I said to them.

Load testing is an important testing technique, but it needs to be done by people with specialized skills who know exactly what they are doing. It also needs to have test environments, accounts, permissions and third party relationships taken into account.

Load testing is a great way to not only find performance issues with your website or backend servers, it will also cause intermittent bugs to pop up with greater frequency. Problems you might miss with regular use will suddenly appear while under load, due to the high volume of tests that are run during a short period of time. High volume automated testing is extremely effective, and one of my favorite approaches to test automation. To do it correctly and to get utility requires work, environment setup, as well as knowledge and skill. Done well, performance bottlenecks are identified and addressed, intermittent bugs are found and fixed, and a good test environment and test suite helps mitigate risks going forward when there are pushes to production. However, when done poorly, load testing can have dangerous results. Here are some cautionary stories.

The simplest load testing tools involve setting up a recorder on your device to capture the traffic to and from the website you are testing. You start the recorder, execute a workflow test, turn off the recorder, and then use that recorded session for creating load. The load testing tool generates a certain number of unique sessions, and replays that test at the protocol level. In other words, it generates multiple tests, simulating several simultaneous users using the website. However, lots of systems get suspicious of a lot of hits coming from a particular device, and protect against that. Furthermore, internal networks aren’t designed for one machine to broadcast a huge volume of data. If you are working from home, your ISP will get suspicious if you are doing this from your account, fearing that your devices are being used for a Denial of Service attack. Payment processors are especially wary of large amounts of traffic as well. So if you use this method, you need to completely understand the system and the environments where you are performing the tests.

Part 1: Expensive Meaningless Tests

Early in my career, I was working with a popular ecommerce system. They were successful with managing load, but felt their approach was too reactive and possibly a bit expensive. If they could do load and performance testing within the organization rather than deal with complaints and outages, they could also improve customer experience. I was busy with other projects, and I had never worked with load testing tools before. Since I was a senior tester, I was asked to oversee the work by a consultant who was a well known specialist, who also worked for a tool vendor that sold load and performance testing tools. To be completely honest, I was busy, I trusted their expertise, and I didn’t pay a lot of attention to what they were doing. One day, they scheduled a meeting with me, and provided an overview. It all looked impressive, there were charts and graphs, and the consultant had a flashy presentation. They then showed me their load tests, and highlighted that they had found “tons of errors”. He said that his two weeks of work had demonstrated that we clearly needed to buy the tool he was selling. “Look at all the important errors it revealed!”

My heart sank. All they had done was record one scenario on the ecommerce system, and then played that back with various amounts of simultaneous users. They were wise enough not to saturate the local network, so they kept the numbers small, but their tests were all useless because they had no idea or curiosity about how the system actually worked. The first problem was that retail systems don’t have an endless supply of goods. Setting up test environments means you set up fake goods, or copies of production inventories that don’t actually result in a real life sale. To make them realistic, you don’t have an infinite number of widgets, unless you need that for a particular test. These tests didn’t take that into account, and the “important errors” his hard work had revealed with the tool were just standard errors about missing inventory. In other words, there were ten test books for sale, and he was trying to buy the 11th, 12th, 13th books. If he had been a real user using a website, the unavailable inventory messages would have been displayed more clearly. Because he was getting errors from the protocol level, they weren’t as pretty. A two minute chat with an IT person or programmer would have set him straight, but he didn’t look into it. He copied the messages and put them in his report, treating them as bugs, rather than the system working just fine, due to his error.
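Here is a small sketch of why those “important errors” were inevitable. It replays one “buy a book” scenario against a fixed test inventory. The function name and the result labels are hypothetical, and a real retail backend is far more involved than a counter.

```python
def replay_purchases(stock, simulated_users):
    """Replay the same 'buy one book' scenario once per simulated user."""
    results = {"ok": 0, "out_of_stock": 0}
    for _ in range(simulated_users):
        if stock > 0:
            stock -= 1
            results["ok"] += 1
        else:
            # The system is behaving correctly here; this is not a bug.
            results["out_of_stock"] += 1
    return results


if __name__ == "__main__":
    # Ten test books for sale, thirteen simulated buyers.
    print(replay_purchases(stock=10, simulated_users=13))
```

Every simulated user past the tenth gets an out-of-stock response. The more users you add, the more of these “errors” you collect, and none of them say anything about performance.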

Next, they were using a test credit card number that was provided to us by the payment processor. There are lots of rules around usage of these test numbers, and he was completely oblivious to these rules. In his days of so-called analysis of our system, he had not explored this at all. That meant that our test credit card numbers were getting rejected. This was the source of some of the other “important errors” he had found, but not investigated. This was so egregious to me, I had to stop the meeting and talk to our IT accountant who managed our test credit card. My fears were confirmed – these load tests resulted in our test credit card numbers getting flagged due to suspicious activity. That meant none of us could test using the credit card, and we had to have a meeting explaining ourselves and apologizing to get them reinstated.

I got dragged into developing my own load and performance testing skills because of this. The consultant went back to the office, and I inherited these terrible tests. What I found was that while the load testing tool looked impressive, it had this terrible proprietary programming language that created unmaintainable code. While it had impressive charts and graphs, they were extremely basic and could actually mask important problems. Recording HTTP(S) traffic and playing it back could be fraught with peril, because the recorder is going to pick up ALL the HTTP traffic on your machine, including your instant messages, webmail, other websites that are open, and 3rd party services such as a weather plugin or stock ticker. Also, you need a protected test network that prevents you from causing problems and interfering with everyone else’s work. Then, you need to look at your backend and see what is possible. In my case, I worked with the team to create new load test products on the website, but the backend retail system only allowed a maximum of 9999, since it maxed out with a 4-digit integer. We also had to create a system to simulate credit card processing, since the payment processor wasn’t going to allow thousands of test purchases hitting their machine. Furthermore, our servers had DDoS protection, and would flag machines that were hitting them with lots of simultaneous requests and deny access, so we had to distribute tests across multiple machines. (These issues were all a bit more technical than I am recording here, but this should give you an idea.)
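One way to guard against the recorder problem is to filter a recording down to the system under test before anything is replayed. The sketch below assumes the recording can be exported as a plain list of URLs, which depends entirely on your tool. The host names and the function name are made up for illustration.

```python
from urllib.parse import urlsplit


def keep_target_traffic(recorded_urls, allowed_hosts):
    """Split recorded URLs into those aimed at the test system and everything else."""
    allowed = {host.lower() for host in allowed_hosts}
    kept, dropped = [], []
    for url in recorded_urls:
        host = (urlsplit(url).hostname or "").lower()
        (kept if host in allowed else dropped).append(url)
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical recording: one request to the test shop, two strays.
    recording = [
        "https://shop.test.example/cart/add?sku=book-1",
        "https://im.example.com/poll",
        "https://weather.example.net/widget.json",
    ]
    kept, dropped = keep_target_traffic(recording, ["shop.test.example"])
    print("replay:", kept)
    print("never replay:", dropped)
```

Reviewing the dropped list is as useful as the filter itself, because it shows what else was talking on your machine while you recorded.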

How much time do you think it took to create the environment for load tests, and then to create good load tests that would actually work?

If you answered: “weeks” with several people working on the testing project, then you are in the ballpark.

We also abandoned the expensive load testing tool, mostly due to it using a vendorscript instead of a real programming language. We used one that was based on the same language the development team used, so I would have support, and other people could maintain the tests over time. It was a bit rudimentary, but we were able to identify problem areas for performance, and address those in production. A happy side effect was the load tests caused intermittent issues that we had missed before to become repeatable cases that could be fixed. It was a lot of work, but it was the start of something useful. The tests were useful, the results were helpful, and we had tests that could be understood, maintained and run by multiple people in the organization.

I was fortunate in this case to be able to work with a great team that was finally empowered to do the right thing for the organization. We were also fortunate in our software architecture and design. We spent the time early on to create something maintainable, with simple tests. As a result, our testing framework was used for years before it required major updates.

Recently, I wrote about Using Storytelling Games in Software Testing, and pointed you to a paper by Martin Jansson and Greger Nolmark. Now I want to give you some tips on creating great storytelling for your testing projects.

First of all, check out Cem Kaner’s work on Scenario Testing: An Introduction to Scenario Testing. I want you to pay special attention to the CHAT (cultural, historical activity theory) model that he talks about. For more on CHAT and testing, read this paper: Putting the Context in Context-Driven Testing (an Application of Cultural Historical Activity Theory). Pay special attention to the descriptions of networks of activity, and tensions. These are vital to help construct variations and different forces within our storytelling. Both of these pieces are foundational and worth the effort to dig into.

Now, I want you to read Hans Buwalda’s article on Soap Opera Testing. This is a nice variation on scenario testing. Buwalda uses television soap operas as inspiration for story arcs, for structure, and for variation. Remember, there are lots of variations on a theme in testing, as well as real life! Further to that, look into testing tours. Cem Kaner has a blog post with a link or two to help get some background info: Testing tours: Research for Best Practices?.

Soap Opera tests, Testing Tours and Test Scenarios are a great place to start creating good testing stories.

Next, read up on personas in user experience work. Jenny Cham has a really nice description, with lots of helpful links on creating personas here: Creating design personas. Remember to explore her links in this blog, she has great advice here. I wrote a position paper about using UX personas in testing years ago (I will have to dig it up, there’s a dead link) in this blog post. Elisabeth Hendrickson introduced me to this idea, but she recommended using extreme personas such as cartoon characters. I prefer the standard UX methods pioneered by people like Alan Cooper, but the cartoon or other characters are a great place to start, especially if you feel stuck. Personas are a great way to start developing characters for your story that are relevant. What are their motivations when they use our software? What are their fears? What are their cares and worries and distractions?

Next, I want you to read this piece on telling a great story by a famous author: Kurt Vonnegut at the Blackboard. (I am getting to the gamification side of this project, and I asked Andrzej Marczewski for good references on storytelling in games, and this was the first link he sent me. Thanks Andrzej!) Notice the different options for structuring a good story. In testing, we can use different ones for the same scenario, if we think about activity patterns, tensions, characters, and variations during real life product use. Several versions of one story will yield different kinds of important information and observations. Vonnegut provides a simple framework for story creation that we can easily adapt and apply.

Finally, I want you to look at storytelling in games. Andrzej talks about it here: I want to experience games not just play them. Notice that within the context of a well-designed game, he has a sense of cause and effect: decisions made here can impact things in other areas of the game. That’s just like real life, and it is important to add dimensions to storytelling in games for testing. Variation and dimensions have different effects in a system, and they are rewarding to exercise. Now read this piece on Gamasutra: The Designer’s Notebook: Three Problems for Interactive Storytellers, Resolved by Ernest Adams. The points about character amnesia, internal consistency and narrative flow are pure gold for testers. We often arrive into a system without really knowing what is going on, especially at first. However, our customers are also starting from scratch when they use our app for the first time. These problems are areas we should also address when creating stories to test around.

There is also a lot of really useful information here: Environmental Storytelling: Creating Immersive 3D Worlds Using Lessons Learned from the Theme Park Industry by Don Carson, especially with regard to environmental conditions being so important to incorporate (particularly for you mobile testers!) and the idea of an all-encompassing world, rather than one linear story.

Andrzej also recommends reading Uncle Computer, Tell Me A Story, and Story Structure 104: The Juicy Details.

As testers, we can incorporate more than a linear scenario into our work. We can add so much more depth to our test approach using stories and worlds. Story development in games is incredibly similar to the storytelling we need to do in testing. There is a lot to be learned about creating virtual worlds and stories within them to help change our perspective, explore variations and make important discoveries about the software and systems we test. We can leverage these various works that have been provided to us to create something new and powerful.

Some final points to put this all together:

  • Combine the elements from each of the areas I asked you to study above to create a great story, or even better, sets of stories
  • Use structure to create real life conditions: different people, motivations, different environmental conditions, and change.
  • Add plot twists, surprises and ulterior motives, and look for unintended consequences in systems and people
  • Don’t stop at one scenario – create variations on a theme, and change the setting, or the entire world you have created to help change your perspective
  • Introduce different characters – are they interrupting? Helping?
  • Create a beginning, middle and an end
  • Move beyond all happy endings – also try to leave things unresolved, or end on a bad note

I have compiled several foundational concepts to help influence your storytelling. How you combine them to create something useful is up to you and your team. You have an opportunity to create rich perspectives to kickstart your testing efforts.

Happy storytelling!

Test Quests – Gamification Applied to Software Test Execution

I decided to analyze a game feature, the “quest”, which is used in popular video games, particularly MMORPGs. Quests have some compelling aspects for structuring testing activities. Jane McGonigal’s book “Reality is Broken” provided me with a solid analysis of quests, and how they can be adapted to real life activities. Working from her example of a quest (ch. 3, p. 56), I created a basic test quest format:

  1. Goal statement (what we intend to accomplish with our testing work)
  2. Why the goal matters (why are we testing this?)
  3. Where to go in the application (what technique or approach are we using to test?)
  4. Guidance (not detailed steps, but enough to help. Bonus points for using video or other rich media examples.)
  5. Proof of completion (how do you know when you are finished?)
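To make the format concrete, here is a minimal sketch of how a team might record a quest as data. This is only one possible shape, not a prescribed tool: the `Quest` class, its field names, and the example checkout quest are all hypothetical, invented for illustration. The only part taken from the format above is the five-part structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Quest:
    """A test quest with one field per part of the five-part format (names are hypothetical)."""
    goal: str                    # 1. what we intend to accomplish with our testing
    why_it_matters: str          # 2. why we are testing this
    where_to_go: str             # 3. area of the application, technique or approach
    guidance: List[str] = field(default_factory=list)             # 4. hints, not detailed steps
    proof_of_completion: List[str] = field(default_factory=list)  # 5. what shows we are finished
    evidence: Dict[str, str] = field(default_factory=dict)        # proof item -> note or link

    def record(self, proof_item: str, reference: str) -> None:
        """Attach evidence (a session sheet, a video, a note) to one proof item."""
        if proof_item not in self.proof_of_completion:
            raise ValueError(f"unknown proof item: {proof_item!r}")
        self.evidence[proof_item] = reference

    def is_complete(self) -> bool:
        """A quest is done when every proof item has some evidence recorded."""
        return bool(self.proof_of_completion) and all(
            item in self.evidence for item in self.proof_of_completion
        )


# Hypothetical example quest; the details are invented for illustration.
quest = Quest(
    goal="Find out how checkout behaves when payment is interrupted",
    why_it_matters="Interrupted payments risk double charges",
    where_to_go="Purchasing engine, explored through user scenarios",
    guidance=["Vary network conditions", "Try switching devices mid-purchase"],
    proof_of_completion=["session sheets reviewed", "bug report videos attached"],
)
quest.record("session sheets reviewed", "debrief held with the team")
print(quest.is_complete())
```

Note that the sketch does not say who completes a quest or how. It only tracks whether the agreed proof exists, which leaves room for multiple ways to satisfy the same quest.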

A quest is larger than a single testing mission (or a test case), but is smaller than a test plan. It’s a way we can organize testing tasks to help provide a sense of completion and interest, but in areas that require exploration and creativity. Just like in a video game, there are multiple ways to satisfy a quest. Once we have fulfilled a quest, which might take days or hours, depending on how it is created, we can move on to another one. It’s another way of organizing people, with the added bonus of leveraging years of game design success. Furthermore, modern technology involves a lot of collaboration between people in different locations, using different technology to reach a common goal, and we need to adapt testing to meet that. Testing a mobile app in your lab, one tester at a time, won’t really provide useful testing for an app that requires real-time communication and collaboration for people all over the world. MMOs do a fabulous job of getting people to work hard and co-ordinate activities in a virtual world, and people have fun doing it. I decided to apply it to testing.

Where do quests fit? Think in terms of a hierarchy of activities:

  • test strategy and plan
  • risks that are mitigated through testing
  • different models of coverage that map to risk mitigation
  • test quests
  • sessions, tours, tasks
  • feedback and reporting

A good test approach will have more than one model of coverage (check I SLICED UP FUN for 12 mobile coverage models), and under each model of coverage, there will be multiple quests. Sometimes quests will be repeated when regressions are required.

So why add this structure?

One area I have worked on over the years is using structure and guidance to help manage exploratory testing efforts. In the past, test case management systems provided some measure of coverage and oversight, but they have little in the way of intrinsic value for testers. People get tired of repeating the same tests over and over, but management loves the metrics they provide, even though they are incredibly easy to cheat. Furthermore, from a tester’s perspective there is an extrinsic reward that is inherent in the design of the tools, and they are easy to use. There is also a sense of completion: once I have run through X number of test cases, I feel like I have accomplished something.

With exploratory testing, the rewards are more intrinsic. The approach can be more fulfilling; I personally feel like I am approaching testing in a more effective way, and I can spend my time on high value activities. However, it is harder to measure coverage, and it is more difficult to direct people in areas where coverage is required without adding some guidance. There have been a lot of different approaches to adding structure to exploratory testing over the years to find a balance. Test quests are another approach to adding structure and finding that balance between the intrinsic rewards of pure exploratory testing, and the extrinsic rewards of scripted testing. This is an idea to provide a blend.

As many of you have heard me argue over the years, test cases and test case management systems are merely one form of guidance; there are others. In the exploratory testing community, you will see coverage outlines, checklists, mind maps, charter lists, session sheets, media such as video demonstrations, and all sorts of alternatives. When it comes to managing exploratory testing, one of the first places we start is session-based test management (SBTM). This approach helps us focus testing in particular areas, and provides a reviewable result, which makes our auditors and stakeholders happy. I’ve used it a lot over the years.

I’ve also used Bach’s General Functionality and Stability Procedure for over a decade to help organize exploratory testing. Through experience, unique projects and contexts, I have adapted it and moved away from the orthodoxy where I saw fit. However, when I started analyzing why people on my teams have fun with testing, SBTM and Bach’s General Functionality and Stability Procedure were big reasons why. Even though I often use a much more lightweight version of SBTM than he has created, people appreciate the structure. The General Functionality and Stability Procedure is a great example of guidance for analysis, exploration, and great things to do as testers.

The other side of fun on the teams I work on is related to humour, collaboration and technology. We often come up with nicknames, and divide up testing into teams and hold contests. Who can come up with the best test approach? Who recorded the best bug report video? Who found the most difficult-to-find bug last week? What team has the most pop culture references in their work? Testing is filled with laughter, excitement and learning, and some good plain old fashioned silly fun. We communicate constantly using technology to help stay up to speed on changes and progress, and often other team members want to get in on the action. Sometimes, it’s hard to get the coders to code, the product owners to product own, and the managers to manage, because everyone wants in on the fun. In the midst of this fun is incredibly valuable testing. Stakeholders are blown away by the productivity of testing, the volume of useful information produced, the quality of bugs, and the detailed, useful information from bug reports to status reports and quality criteria that is produced. While there is laughter and fun, there is hard work going on. I learned why this is so effective reading Jane McGonigal’s work.

In Reality is Broken, Jane McGonigal describes Alternate Reality Games (ARGs). These are real life activities that are gamified – they have a game-like structure applied to them. She mentions Chore Wars, and how gamifying something as mundane as household chores can turn it into a fun activity. She mentions that since cleaning the bathroom is a high value activity in the game, she and her husband have to work hard to try to clean it before the other does. McGonigal explains that since there is a choice, and meaning attached to the task, people choose to do it under the mechanism of the game. It’s no longer that awful thing no one wants to do because it is unpleasant. When framed within a game context, it is a highly sought after quest or task to complete. You get points in the game, you get bragging rights, you get intrinsic rewards as well as the extrinsic clean bathroom. Amazing.

If we apply that to testing, how about using lessons from ARGs to gamify things like regression testing, or test data creation, or other maintenance tasks we don’t like doing? One way we can do this is to sprinkle these tasks within quests. You can only complete the quest by finishing up one of these less desirable tasks.

In Reality is Broken, McGonigal defines a game as having four traits: a goal, rules, a feedback system, and voluntary participation (p. 21). Working backwards, in exploratory testing, a lot of what we do is voluntary because testers have some degree of freedom to make decisions about what they are going to test, even if it is within narrow parameters of coverage. Furthermore, we can choose a different model of coverage to reach a goal. For example, I was working with an e-commerce testing team who were bored to death of testing the purchasing engine because they were following the same set of functional test scripts. To help them be more effective and to enjoy what they were doing, I introduced a new model of coverage to test the purchasing engine: user scenarios. Suddenly, they were engaged and interested and found bugs they had previously missed. I then helped them develop more models of coverage so that they could change their perspective and test the same thing, but with variation to keep them engaged and interested while still satisfying coverage requirements. As humans, we need to mix things up. Previously, they had no choice – they were told to execute the tests in the test case management system, and that was the end of it.

Feedback systems are often linked to bug reporting systems in testing. But I like to go beyond that. Bring in other people to test with you in pairs, trios or whatever combination to bring more ideas to the table. This isn’t duplicated testing, but a redoubling of brain power and effort. I also utilize instant messaging, IRC, and big visible charts to help encourage feedback across functional areas of teams.

Rules in testing are often related to what is dictated to us by managers, developers, and tradition. It boggles my mind how many so-called Agile programmers will demand their testers work in un-Agile ways, expecting them to create test plans, test cases and use test case management systems. When I ask the programmers if they would like to work that way, they usually say no. Well guess what, not many other homo sapiens like to work that way either. I prefer to have rules around approach. We have identified risks, and models of coverage to mitigate those risks, and we use people, tools and automation to help us reach our goals. Rather than count test cases and bugs, we rate our team on our ability to get great coverage and information that helps stakeholders make quality-related decisions.

Finally, a goal in testing needs to be project-specific. If you want to fail, you just copy what you did last time on your test project. The problem with that is you are unaware of any new risks or changes and you’ll likely be blind to them. Every project has a goal, and a way we can measure whether we did the right sort of work to help reach that goal. Rather than “run the regression tests, automate as many as possible, and if there is time, do other testing”, we have something specific that helps ensure we aren’t doing busy work, but we’re creating value.

When it comes to quests, they can have this format as well. A goal, a feedback system, rules or parameters on where to test, and voluntary participation. As long as all the quests are fulfilled for a project, it doesn’t matter who did them.

It turns out that with my application of SBTM, Bach’s General Functionality and Stability Procedure, plus some zany fun and utilizing technology to help socialize, report and record information, I was right next door to gamification. Using gamification as a guide, I hope to provide tools for others who also want to make testing effective and fun. A test quest is one option to try. Consider using avatars, fun names and anything that resonates with your team members to help make the activity more fun. Also consider rewards for difficult quests and tasks such as a free meal, public kudos, or time off in lieu. Get creative and use as much or as little from the video game world as you like.

Some of my goals with test quests are:

  • Enough structure to provide guidance to testers so they know where to focus efforts
  • Not so much structure (like scripted test cases) that personal choice, creativity and exploration are discouraged or forbidden
  • Guidance and structure are lightweight so that they don’t become a maintenance burden like our scripted regression test cases become (both manual and automated)
  • Testers get a sense of purpose and meaning in their work, and a sense of completion by finishing a set of tasks in a quest
  • Utilize tools (automated tests, automated tasks, simulators, high volume test automation, monitoring and reporting) to help boost the power of the testers and be more efficient and effective, and to do things no human could do on their own
  • Encourage collaboration and sharing information so that testers can provide feedback to other project team members on the quality of the products, but also get feedback on their own work and approaches
  • Encourage test teams to use multiple models of coverage (changing perspectives, using different testing techniques and tools) on a project instead of thinking of coverage as a singular thing
  • Utilize an effective gaming structure to augment reality and encourage people to have fun working hard at testing activities

I am encouraging testing teams to use this as a structure for organizing test execution to help make testing more engaging and fun. Feel free to add as many (or few) elements from video game quests as you see fit, and alter to match the unique personalities and goals of the people on your team. Or, study them and analyze how you organize your testing work for you and your teams. Does your structure encourage people to have fun and work hard at accomplishing something great? If not, you might learn something from how others have managed to get people to work hard in games.

Happy questing!

Applying Gamification to Software Testing

I wrote an article for Better Software magazine this month called “Software Testing is a Game”, available here in PDF format. I wrote about using gamification as an approach to analyze and help make software testing more engaging. I encouraged readers to apply some ideas from gamification to their own testing efforts. Now, why would I do a thing like that? And what do I mean by using game mechanics when we are testing? Games are all well and good, and I may enjoy them, but we are talking about serious work here, why would we make it look like a game?

Let me give you a bit of background information.

I was working with my friends Monroe Thomas and David McFadzean on product strategy when they started bringing up my gamification design ideas. I use gamification in mobile app design to help apps be more engaging for users. That doesn’t mean that I make an app look like a game, it means I use ideas from games to help make the app more interesting and easier to use. However, we weren’t talking about mobile apps, so I was a bit surprised. They pointed out that the same concepts that make gamification work in mobile apps apply to other apps. After all, David and I even wrote an article about using gaming when creating software processes. Why couldn’t I use those ideas in a product strategy meeting for something else?

Good point.

In fact, they even urged me to look at some of my other prior app designs. They felt I would find gamification-style aspects in those as well, because I always worry about making apps more engaging. Once I started thinking about the implications of what they were saying, an entire new world of possibility opened up. I felt like they had just kicked open a big door of perception for me.

But wait a minute. What is this business about games? Well, the thing with gamification is that when I use those tools correctly in an app, you don’t know it is there. I don’t put childish badges and leaderboards in a productivity app and then say: “Look! gamification at work!” for example. Andrzej Marczewski describes gamification mechanics in terms we can relate to in his blog Game Mechanics in Gamification as: Desired Behavior, Motivation and Supporters.

Andrzej uses a game format to illustrate his point, but it should be obvious that these three themes are not limited to games. Where game designers shine, and where policy wonks and enterprise or productivity designers tend to fail is in the structure around desired behavior. Too often, we just expect people to excel in a work place environment with little support. Games on the other hand tickle our emotions, they captivate us, and they encourage us to work hard at solving problems and reaching goals.

Framing something like software testing in terms of gaming, and borrowing some of their ideas and mechanics, applying them and experimenting can be incredibly worthwhile. After all, as I state in the article, it is difficult to get people involved in software testing, and as technology becomes more pervasive and more enmeshed in our every day lives, it has more potential to do harm. We need new people and new ideas and new approaches, and I want to figure out how to make it more engaging for people. Why can’t effective testing be fun?

It can.

If you work on a team with me, you will notice that there is a lot of laughter, a lot of collaboration, a lot of discovery and learning. And everyone tests from time to time. Sometimes, it can be difficult to get the coders to code, the designers to design and the managers to manage, because everyone wants to test. Why is that? Well, gamification can help provide a structure to analyze what we do and learn why some things are fun and help us work hard, while others cause us to avoid them.

Speaking of analyzing something from a gamification perspective, remember in the Better Software article how I described several aspects from gaming and asked you to apply it to your testing work? Prior to writing the article, I did exactly that with a product I designed called Session Tester. Aaron West and I developed a tool to help testers capture information while using an approach called Session-Based Testing. We had high hopes for the project, but after several setbacks, it’s now dormant. However, a back of the napkin analysis of the tool using a gamification approach was incredibly useful. This is what we came up with, using game concepts from Michael Wilson’s “Gamification: You’re Doing it Wrong!” presentation:

  1. Guidelines and Behaviors:
    Context and rules around the tool were hit and miss. The tool enforces the basic form of session-based testing which helps people learn how to approach testing from this perspective. People are required to fill in the minimum information to create a session sheet. There are strategy ideas readily at hand, and the elements are easily added by using tags. The tool was helpful to teach beginners the basic form of SBT, but we didn’t enforce the original SBTM rules as set out by James and Jon Bach. This hurt the tool’s effectiveness. While we value the ability for people to modify and adapt, we should have started with the known rules and then provided the ability to adapt, rather than design it from an adapted view. This caused confusion and controversy.
  2. Strategies and Tasks:
    Elisabeth Hendrickson’s ET Heuristics Cheatsheet is provided in the tool to help people think about strategy, and there are oblique strategies to help create test ideas using the Prime Me! button. There could be more resources added to help with strategy, and in fact a lot of the strategy work can be done outside of the tool. We could have done more feature-wise to help with strategy. Tasks can be pre-planned outside of the tool, or done on the fly and recorded with the @tasks tag, which is saved in session sheets. We could also have done more to support tasks.
  3. Risks and Rewards:
    There is a risk that you don’t have a productive session, or your session sheet is woefully inadequate. The timer was a good motivator since you run the risk of running out of time, so there was a bit of a game there with trying to beat the clock and have a focused, productive session. I designed that to be analogous to the “red bar green bar game” used in Test Driven Development tools. There is a reward inherent in getting your mission completed and having a good session sheet you can be proud to share, but it is completely intrinsic. You are also rewarded a bit with the Prime Me! button to help you get a new idea, or break a creativity log jam. We could have done a lot more to help people plan and manage risks, and add features to reward testers for using a good assortment of tags, or a peer-reference or reward system for great testing. The full bar showing once time has run out helps tickle an intrinsic reward of completion. As a tester, I did all I could in that session, and now I can move on to other things.
  4. Skill and Chance Events:
    Skilled testers often like to record what they discover, to have the freedom to investigate areas of high value, and take pride in having a varied approach to their testing. However, there is no extrinsic reward for completion of session sheets. Sheets with more tags having a higher score might have been a good option to add, to help people learn how to improve what they record. Outside of discovering bugs, chance events are brought in by the Prime Me! button. Like rolling dice, people can click the button until an oblique strategy jiggles their brain in a different direction. The Prime Me! button is the most popular feature of the tool and is still demonstrated at testing conferences by people like Jon Bach. People find it fun and useful.
  5. Cheating and Compliance:
    Cheating: Any team that uses a test case management system will see a high degree of cheating. People just get tired of the regression tests they run over and over and start clicking pass or fail to show progress. Those systems are very easy to cheat, but a session-based approach is much more difficult to cheat, because you have to show a description of a testing session. However, there is nothing to prevent people from saving an empty session sheet. I have seen this happen on overworked teams, and it wasn’t discovered for weeks. We could possibly have looked at flagging incomplete or blank session sheets in the system so there is visibility on them /prior/ to an audit, or encourage people to do something about it within the tool. Compliance was a big miss because we altered the original SBTM rules, which caused a lot of controversy and prevented more widespread adoption. We should have enforced the original rules by supporting the Bach SBTM format first, then added the ability to adapt it instead of approaching it from the other direction.
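
Two of the ideas above, scoring sheets by the tags they use and flagging blank sheets before an audit, are simple enough to sketch. The code below is not how Session Tester worked and does not read its real file format. It assumes a hypothetical plain-text sheet where a section starts with a line such as `@tasks`, and all function names are made up for illustration.

```python
import re
from typing import Dict, List

# Assumed (hypothetical) sheet format: a section starts with "@tag", optionally
# followed by text on the same line; later lines belong to that section.
TAG_LINE = re.compile(r"^@(\w+)\s*(.*)$")


def parse_sheet(text: str) -> Dict[str, List[str]]:
    """Group the non-empty lines of a sheet under the tag that precedes them."""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
            if match.group(2):
                sections[current].append(match.group(2))
        elif line and current is not None:
            sections[current].append(line)
    return sections


def tag_score(text: str) -> int:
    """Score a sheet by how many distinct tags actually have content."""
    return sum(1 for lines in parse_sheet(text).values() if lines)


def flag_blank_sheets(sheets: Dict[str, str]) -> List[str]:
    """Return the names of sheets that have no content under any tag."""
    return sorted(name for name, text in sheets.items() if tag_score(text) == 0)


# Invented example sheets, keyed by a made-up file name.
sheets = {
    "checkout.txt": "@mission Explore checkout\n@tasks\ntry an expired card\n",
    "login.txt": "@mission\n@tasks\n",
    "empty.txt": "",
}
print(flag_blank_sheets(sheets))
```

A score like this is easy to game by typing filler under every tag, so it is better suited to giving testers visibility and a nudge than to serving as a management metric.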

It’s interesting to note that the aspects that made this tool popular and engaging can also be viewed in terms of gaming mechanics. A couple of them were there by design, but the others were just there because I was trying to make the app more engaging. However, if we had used this gamification structure during design of the tool, we would have had different results, and arguably a better tool, because it provides a more thorough structure. Areas of fun such as the Prime Me! button, and trying to automate some of the processes of SBTM helped make the experience more enjoyable for our users.

However, if you didn’t look at the tool from a gaming perspective, you wouldn’t notice that there are game mechanics at play within it. This is an example of using a gamification approach that goes beyond superficial leaderboards and rewards, and I encourage you to try it not only with your testing tools, but your processes and practices in testing. Use it as a system to analyze: What is working well? Where are you lacking? It’s a useful, systematic approach.

That analysis doesn’t look like a childish game, does it? Bottom line: if you aren’t a gamer, you probably won’t notice the gaming aspects I bring into testing process and tools. If you are a gamer, you’ll notice the parallels right away, and will hopefully appreciate them. For both groups, hopefully gamification will be one tool we can use to help make testing more engaging and fun.

Software Testing Training and Gaming

If you spend time at conferences, or hire a well-known testing consultant to provide some training for your company, it’s likely that one or more of them have used game mechanics as teaching tools. In fact, they probably used them on you. You may not be aware that they did, but they used gaming mechanics to help you learn something important.

James Bach is famous for using magic tricks and puzzle solving as teaching tools. When I spent time with James learning about how to be a more effective trainer, he told me that magic tricks are great teaching tools because we all love to be fooled. When we are fooled by something, we are entertained, and our mind is primed for learning about what we missed during the trick. That is an ideal state for the introduction of new ideas. If you spend any time with James or any of his adherents at a conference or peer workshop, you will likely be inundated with puzzles to solve. There is always a testing lesson to be learned at the end, and it is a novel way of helping people learn through solving a tangible problem. If you love to solve puzzles and learn about testing, you’ll enjoy these experiences.

Dorothy Graham has a board game that she developed for testing tutorials. It’s a traditional style game that she created as a training aid, and Dot loves to deliver this course. The tutorial attendees have a lot of fun, and they learn some important lessons, but Dot admits she may even have more fun than they do. Dot loves training, and the game takes the entertainment value of learning up a few notches. I’ve taught next door to Dot and heard attendees as they play the game and learn with her, and I’ve seen their smiling faces during breaks and after the course. There is something inherently positive about using a real, physical game, designed for a specific purpose (and fun) in this way.

Fiona Charles and Michael Bolton also created a board game for a software development game workshop they facilitated in 2006. Fiona says:  “Our experience with the game highlighted the power of games and simulations in teaching: their ability to teach the participants (and the teachers) more than was consciously intended.”

Ben Simo uses a variation on a board game. I’m not going to give it away, since it’s highly effective, but he used it on me when I was moving from a dabbler in performance and load testing to working on some serious projects. Ben is an experienced and talented performance tester, and he has taught a lot of people how to do the job well. Ben spent hours with me using pieces from a board game, and posing problems for me and having me work on solving them. It was highly interactive, was chock full of performance testing analysis lessons, and we enjoyed working together on it. He would set up the scenario, enhanced by the board game, and I would work on approaches to solve it. I had about 15 pages of notes from this game play activity to take back and apply to my work on Monday. After playing this training game with Ben, I had much more confidence and I was able to spot far more performance anomaly patterns than I had prior to working with him. (We worked through this in a hotel lounge, and we got a lot of weird looks. We didn’t care, we were having fun! Besides, channeling Ralph Wiggum: I was “learnding”!)

James Lyndsay developed a fascinating course on exploratory testing, and with it, simple “black box test machines” that he developed in Flash to aid in experiential learning. These machines have no text on them, and they are difficult to start using, because there are no outward signs of what they are for. This is done on purpose, and each machine helps each class participant experience the lesson through their own exploration and discovery. This is one of my favorite game-like experiences in a testing training course. The machine exercises remind me of a puzzle adventure game. One of my favorites of this type of game is Myst. You have to explore and go off of your observations and clues to figure out what to do, and the possibilities for application and experience are wide open. James managed to create four incredibly simple programs that can replicate this sort of game experience during training. Simply brilliant.

Those of you who follow Jerry Weinberg, or the many consultants who have been influenced by him have likely worked through simulations during a workshop or tutorial. Much like an RPG (role playing game), attendees are organized around different goals, roles, activities and tasks to create an improvised simulation of a real-life problem. This involves drawing on improvisation, your “pretending” skills and applying your problem solving techniques in a different context than a work context. Many people report having very positive experiences and “aha!” moments when learning from these sorts of activities.

Another theme in Jerry’s work with people is physical activity. Jerry gets people to move around, and he can influence the mood of the room by adding physical activity to a workshop. In the book The Gift of Time, Fiona Charles shares a poignant story about Jerry using a movement activity to calm down a room full of people during a workshop when they first learned about the events of September 11. Michael Bolton has told me several stories of how Jerry changes the learning dynamic by getting people to move and work in different parts of the room, or grouping people and having them move and work with others in creative combinations. Movement is a huge part of many games, especially sports and outdoor activities, and it gets different parts of our brain working. If you couple movement with learning concepts, it brings together more of your senses to help with concept retention. It is also associated with good health, a sense of well-being and fun.

(Speaking of experiential learning, pretty much everyone I have mentioned here, including me (and a lot more trainers you have heard of), has been influenced either directly or indirectly by Jerry Weinberg’s work on experiential learning. He even has a series of books on the topic on Leanpub. The first one: Experiential Learning: Beginning, the second: Experiential Learning: Inventing, and the third: Experiential Learning: Simulation.)

There are other examples of trainers using game structures in software testing, and I’ve probably missed some obvious ones. (I haven’t even told you about the ones I use, but that doesn’t matter.) These are some good examples off the top of my head that demonstrate the use of game mechanics in teaching.

I wanted to point out that each of them uses game mechanics to teach serious lessons. While people may have fun, they come away with real-world skills that they can apply to their work as soon as they are back in the office.

Don’t be turned off by the term “game” when it comes to serious business – if you look at gaming with an open mind, you’ll see that it is all around us, being used in effective ways.

Did I miss any good examples of games in software testing training? Please add them in the comments.

Edit: I just discovered an interesting post on games and learning on the blog: Software Testing Games – Do They Help?