Dillon Hot Topics article

HOT Topics!
Publication of the Human Oriented Technology Lab
Carleton University

Usability Testing: Myths, Misconceptions and Misuses

In this article, I identify and try to straighten out some common misconceptions about usability testing. These are my opinions, based on my interpretation of what usability testing is and how it works. I hope these comments are useful. My approach is to state a common misunderstanding and then discuss it.

Usability testing is user-centred design.
Contrary to a popular misconception, usability testing is not user-centred design. It can be one part of a user-centred design process. It is possible to do user-centred design (UCD) without usability testing and to do usability testing without a user-centred design process. Briefly, user-centred design is an iterative UI design process that includes understanding users and their tasks, setting usability goals, designing parts of the user-interface to meet those goals, building prototypes of parts of the UI, and evaluating the user interface against the goals. If the user-interface is not adequate, the UI is redesigned. Usability testing occurs at the evaluation stage of this process.

There are many types of usability testing procedures.
We have a problem with the term "usability test." Usability testing is just one of many techniques that serve as a basis for evaluating the UI in a user-centred approach. Other techniques for evaluating a UI include inspection methods such as heuristic evaluations, expert reviews and cognitive walkthroughs. Unfortunately, the term "usability testing" is often used in a generic as well as a specific sense. In the generic sense, it is used as a descriptive term to represent all types of UI evaluation. In the specific sense, it is used to denote a type of evaluation procedure concerned with users performing tasks. In my opinion, a single term meaning evaluation in general and also a specific type of evaluation can cause unnecessary confusion. For the sake of consistency and simplicity, it makes sense to reserve the term for a specific type of evaluation procedure based mainly on user performance.
The idea behind usability testing is to have people perform the tasks for which the product was designed. If they can't do the tasks or if they have difficulty performing the tasks, the UI is not adequate and should be redesigned. Confusion can be avoided if we use usability evaluation for the generic term and reserve usability testing for the specific evaluation method based on user performance.

Evaluation isn't necessary in an iterative user-interface design process.
Without the evaluation step, user-centred design won't necessarily result in a better product. Without an evaluation, "thrashing" -- making changes that don't improve the UI, perhaps even making it worse -- often occurs. Without evaluation, designers are left to hope that the new design is better than the previous version. If you don't evaluate, how will you know if you are improving or thrashing? To instill some objective measurement of effectiveness, usability evaluation is used as part of an integrated UCD approach. Usability testing is a good way to evaluate.

You don't have to measure user performance.
Usability testing measures actual performance by actual users. It often involves building prototypes (simple or very sophisticated) of parts of the user interface, having representative users perform representative tasks and seeing if the appropriate users can perform the tasks. The key concept is that users perform tasks. In other techniques such as the inspection methods, it is not performance, but someone's opinion of how users might perform that is offered as evidence that the UI is acceptable or not. The other techniques don't have specific tasks as such so the evaluations are broader, more general. This distinction between performance and opinion about performance is crucial. Opinions are, by definition, subjective. Whether a sample of users can accomplish what they want or not is objective. Under many circumstances, but not all, it is more useful to find out if users can do what they want to do rather than asking someone, even an HCI expert, if they think that users will be able to do what they want to do.

Usability test results will help you determine what users want.
We frequently encounter people who believe that you can discover user requirements from a usability test. There are a number of problems with this misconception, but the most important is that expecting to discover user wants and requirements after you've already designed is inefficient and will not work. Finding out what the requirements are after you design is a bad idea. Usability testing is an excellent evaluation technique, but it is a poor user-needs assessment technique.

A usability test tells if an application is useful.
Usability tests tell you about usability not usefulness. Usability and usefulness (utility) are related, but different things. An application can be incredibly easy to use but of no use whatever and vice versa. During the user-needs assessment stage of the development process, there are very good ways to find out if an application or feature is useful. Usability testing is a poor way to assess utility.

A beta test is a usability test.
Beta tests and usability tests are entirely different processes with entirely different purposes. A beta test asks users to do whatever they want with an application.
A usability test has users perform specific tasks. In addition, a beta test comes far too late in the development process to address fundamental usability issues.

You can do a usability test in a focus group.
A focus group is not a usability test; in fact, it isn't a test at all. A focus group can be very effective at getting opinions. It has nothing whatever to say about performance. The focus group process is about as subjective as any process can be. To our dismay, we have even seen people ask focus group participants to design a UI. Designing a good UI is a lot of work for a trained team of UI experts, usually working over a fairly long period of time, carefully considering all requirements, constraints and opportunities. I fail to see how a few users can come up with a good design in a short focus group session. There are far better ways to do design and evaluate designs and there are far better uses for focus groups.

Iterative means one usability test.
Just about everyone who knows anything about user-centred design will tell you that the process is iterative. Most products are not usability tested at all, and for those that are tested, my sense is that the most common number of tests is probably one. Is one usability test good enough?
Many people have found that it is relatively easy to determine what doesn't work with usability testing, but it is very difficult to fix the problems. Many people try to tweak the design to fix isolated problems. Having an existing design leads to a form of fixation that makes it difficult to even conceptualize redesigns and makes it extremely difficult to change the design direction completely. Because of this psychological barrier, there is no guarantee that a redesign solves the problem unless the new design is retested. One usability test is better than none, but the process is most effective when redesigns are retested.

A usability test tells you how to fix the problems you detect.
Results of usability tests do not tell how to fix any problems that are detected. Usability testing is not a substitute for good design. The specialized role of usability testing in the user-centred design process is to find out what works and doesn't work. There is nothing in the usability testing process that will tell you how to redesign. If you expect to find out how to redesign based on what users say and do, you will be disappointed. User comments and suggestions are made in isolation without knowledge of all the issues that go into UI design decisions. Redesign based on user comments and suggestions from the limited context of the tasks they perform in a usability test is a shot in the dark.

Almost any usability test is better than no usability test.
It is often said that "almost any usability test is better than no usability test", most recently in the November-December 2002 issue of Interactions (Stewart, 2002). One of the things our students find out when they start doing usability tests is that it is really easy to do a usability test and really hard to do a good usability test. They find out the hard way that it is really easy to do a bad usability test. Do you believe that a bad usability test is better than no usability test?
There will be some useful and some not so useful conclusions reached about the UI in a bad test. Presumably, some UI aspects would be redesigned, at some cost, and some would be approved as they are. But how do you tell the accurate results from the inaccurate results? How do you tell which of the changes demanded by the usability test are necessary and would be worth the time, people and money to redesign and which would be an unnecessary waste of time and resources? How do you tell which parts of the UIs that passed the usability test are indeed acceptable and which should be redesigned? Any possible good that might come from a bad usability test is lost if you don't know which results to believe. Basing UI design decisions on bad usability tests is worse than no usability test because you think you know something, but you don't.

Anybody can do a usability test.
Anybody with appropriate training can do usability testing. Well, perhaps not everybody, but a lot of people. Without training on how to do usability testing, chances are that even the most well-intended person will mess it up. A lot of people read a book about usability testing and think they know how to do it. Nothing could be further from the truth. Usability testing is a highly learned skill that, like all skills, requires practice -- and practice without feedback doesn't result in improvement. Many people think they are doing good usability testing, but they have no way of knowing if what they are doing is good or not, and they often repeat the same mistakes. In the last few years, we have seen a lot of highly skilled people, primarily very knowledgeable market researchers and excellent graphics designers, who read a book about usability testing and proceed to do bad usability tests. Anyone can do a usability test, but only highly skilled usability testers will get valid and useful results.

Don't test until the UI is complete.
Just about everyone who knows anything about user-centred design can recite the advantages of testing as early as possible. Unfortunately, in our experience, the reality is that most development teams wait far too long to test. There are many reasons for this, but I will address three of the primary reasons. First, there is a belief that it is more efficient to get as much UI design done as possible before testing. Most of the time, this argument will postpone testing far too long. Second, there is a belief that a sophisticated, comprehensive, bug-free prototype is necessary before you can test, so time is wasted waiting to get the last bugs out. Third, human nature being what it is, pride gets in the way: designers delay until they have the best design they can produce, and they believe that any problems found are a sign of weakness. Many people know this is not desirable, but they do it anyway. There are artificial barriers to testing early, but they can all be overcome.

You need a usability lab to run a usability test.
For most products, you don't normally need special facilities to do usability testing. You can do a lot without special rooms with one-way mirrors and without videotaping. Most of the time, all you really need is a quiet room. Putting money into competent testers is preferable to acquiring sophisticated test facilities. But definitely don't shy away from usability testing just because you don't have specialized facilities.

You have to test a large number of users.
There is confusion about the number of users required in a usability test because number of users depends on two separate issues, not one. One issue is the number of users required to get valid data on any specific UI issue. The other is the number required to test all possible issues. To understand these two issues, let's review the current misunderstanding about the appropriate number of users to test.
Until recently, the belief was that only a small number of representative users, often as few as 4 or 5, are necessary to draw valid conclusions from a usability test. Thousands of successful usability tests have been performed using very small numbers of users. Then Spool and Schroeder (2001) told us that the number required is far more than 5. Some people are saying that the number required to test web sites may be in the thousands (e.g., Kangas, 2000). How do we account for the discrepancy?
The test by Spool and Schroeder asked users to purchase whatever they wanted on large commercial web sites. There are many, many possible paths depending on what different users wanted to purchase, hence the recommendation for more users to cover all possible paths. In a recent Carleton Honours Thesis, Siavash Solati (2003) asked users to purchase items at large online book stores. 19 of 20 users found one book in less than 3 minutes, most in less than 1 minute. But 19 of 20 users failed to find another book in 5 minutes. Clearly if you tested only one of these books, you would draw dramatically different conclusions depending on which book (path) was followed. The claim that you need more users follows from the larger number of paths that could be followed.
There are two interesting points from Solati's thesis. First, only a small number of users, as few as 4-5, would have demonstrated that one path was acceptable and an equally small number would have demonstrated that the other path was unacceptable. Let's separate the number of users required to address a particular issue, which can be quite small, from the number required to do a comprehensive search of all possible aspects of a UI, e.g., all paths on a web site, which could be enormous. To test a targeted UI issue or a small set of selected issues, 4 or 5 users will do just fine. But the number required to address all aspects of a UI, e.g., every path that could be followed on a large e-commerce site, would be unacceptably large.
The second important point from Solati's thesis is that he looked at the paths for the two books and immediately saw why one path was successful and the other wasn't. It is now possible to formulate a general guideline that can be applied to improve a large number of paths. The result is not that you are fine-tuning a single path but that you are gathering information applicable to path construction in general. If you can figure out why a path is unsuccessful and generalize the reason to other paths, the number of paths that have to be tested drops dramatically.
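The claim that 4 or 5 users are enough for a single targeted path can be checked with a little binomial arithmetic. The sketch below is not from the thesis; it treats the observed rates quoted above (19 of 20 succeeding on one book, 1 of 20 on the other) as if they were the true population rates, which is an assumption, and the function name prob_at_least is purely illustrative.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability that at least k of n independent users succeed,
    when each user succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Success rates taken from the figures quoted above (19 of 20 and 1 of 20),
# treated as true population rates -- an assumption made for illustration.
easy_path = 19 / 20
hard_path = 1 / 20

n_users = 5
clear_pass = prob_at_least(4, n_users, easy_path)      # at least 4 of 5 succeed
clear_fail = 1 - prob_at_least(2, n_users, hard_path)  # at most 1 of 5 succeeds

print(f"P(at least 4 of {n_users} succeed on the easy path) = {clear_pass:.3f}")
print(f"P(at most 1 of {n_users} succeeds on the hard path) = {clear_fail:.3f}")
```

Under these assumptions, a five-user test would very probably give a clear-cut answer on either path taken alone. The calculation says nothing about how many paths a site has, which is the separate coverage question.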

Usability testing is a good way to test large numbers of navigation paths.
It requires only a small number of users to evaluate a potential UI problem or a given path on a web site, but how can you cover all possible problems and how can you ever test enough users to cover all paths on a web site? You can't, of course. For some large e-commerce sites, the number of users needed to cover all possible purchases could be in the tens or hundreds of thousands. (Incidentally, the inspection methods don't do any better with this problem.) Usability testing to find out if all paths are good is not a practical approach. On the other hand, usability testing is an excellent way to determine if a small number of potentially problematic paths are acceptable or not.

More usability tests are required for web sites than for desktop applications.
This can be true, but it doesn't have to be true. Among many other things, the application of UI standards when designing UIs limits the number of things that have to be tested. Desktop environment standards are under the control of a few large, powerful companies. Standards force consistency which means that many aspects of the UI have been used many times before and require no further testing. The constraints imposed by UI standards make it practical to do targeted usability testing with small numbers of users.
As everyone knows, there are no UI standards for the web. However, de facto standards are evolving. We know a lot from practice and research about what navigation techniques work and don't work. Sticking with approaches that are known to work can reduce the amount of usability testing. Furthermore, rather than trying to custom build the navigation for each web site with heavy reliance on usability testing to get it right, the HCI research community is doing research on UI design techniques that will allow us to optimize web UI success "off the shelf."
The problem is that many companies want their web sites to be different. In the anarchy that is the web, web site design involves a tension between requirements for branding, projecting a unique corporate image, etc. vs. ease of use. One question that site designers will have to address is whether it is worthwhile to sacrifice usability in favour of UI customization to accomplish all the other things that web sites are often expected to accomplish.
Those who have been around for a while will remember that the same difficult choices existed 20 years ago when standards for desktop UIs were first imposed. Most application developers were faced with the sacrifice of some presumed competitive advantage by customizing the UI vs. the advantages of being consistent with others. We all know how that debate was resolved -- to the relief of users. Do you want your customers to be impressed with your unique site or do you want them to be able to find and buy your products? Some companies undoubtedly will go for customization, but they can expect to put a lot more usability testing (more work, money and users) into their site with less certainty in the result.

You have to test everything.
With usability tests, you can't, and it isn't necessary to, test everything. A major problem related to number of users is deciding what to test. You want to test only important things that may cause problems. Don't test things that you already know will work. Because a usability test targets specific issues, the chances of identifying problems that are not targeted are slim. How would a person performing tasks on particular aspects of the UI identify other problems unless they stumbled on them by accident?
A fundamental problem for a good usability test is how to determine what to test and, of equal importance, what not to test. When our students first start usability testing, they often go to developers and stakeholders and ask them what they want tested. The reply is usually a blank stare, an assertion that it is your job to do usability testing not mine, or a not-so-subtle suggestion to go away and leave me alone. It is definitely possible to find out what the controversial, difficult, important or key issues are, but it will take some skills to dig beyond the superficial first impressions. A test of unimportant issues is of little value and wastes important resources like time, money and users' efforts.

If usability testing is about performance, why include rating scales and comments?
A usability test measures user performance, but users are also asked to offer comments as well as opinions about satisfaction, ease of use, etc. Yes, rating scales and comments are subjective, but they come from users experiencing the UI not from someone guessing if users will be satisfied or not. However, comments and rating scales must be handled intelligently. It is not unusual to find developers who hear something from a user and treat it as a definitive requirement. Just about everyone who has done usability testing has some sort of scale questions and has found that users frequently rate tasks as easy or the UI as satisfactory even though they struggled, perhaps even failed, to accomplish an assigned task. When there is disagreement between user performance and user opinion, performance should (almost) always take precedence over opinion. On the other hand, someone who knows what she is doing can find useful information in rating scales and comments to supplement performance measures if that information is handled intelligently and flexibly. For example, if subjective ratings don't match the performance, especially when performance is good but the participant assigns a poor rating, the rating gives the tester an opportunity to ask questions. The participant's answers might contain useful information that wouldn't otherwise surface.
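As a rough illustration of using ratings to prompt follow-up questions, the sketch below flags task results where performance and opinion disagree. Everything in it is hypothetical: the TaskResult record, the 1-to-5 ease scale and the thresholds are assumptions chosen for the example, not part of any particular test protocol.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    participant: str
    task: str
    completed: bool   # did the participant accomplish the task?
    ease_rating: int  # assumed scale: 1 = very difficult ... 5 = very easy

def follow_up_needed(result: TaskResult,
                     easy_threshold: int = 4,
                     hard_threshold: int = 2) -> Optional[str]:
    """Return a reason to ask the participant follow-up questions,
    or None when performance and rating agree."""
    if not result.completed and result.ease_rating >= easy_threshold:
        return "failed but rated easy"
    if result.completed and result.ease_rating <= hard_threshold:
        return "succeeded but rated difficult"
    return None

# Made-up results, for illustration only.
results = [
    TaskResult("P1", "find book", completed=False, ease_rating=5),
    TaskResult("P2", "find book", completed=True, ease_rating=1),
    TaskResult("P3", "find book", completed=True, ease_rating=5),
]

for r in results:
    reason = follow_up_needed(r)
    if reason:
        print(f"{r.participant} / {r.task}: {reason} -- ask why")
```

The flag does not decide anything about the UI. Performance still takes precedence; the mismatch only marks where a question to the participant might surface something useful.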

Let users decide what they want to do.
You decide what users have to do -- users decide how they do it. It is terribly inefficient to let users decide what tasks to perform. Left on their own, users will spend a lot of time on aspects of the UI that you already know work well. On the other hand, it is hit or miss that they will test the things you want them to test. If you want something tested, the only way to make sure it gets tested is to create a task that addresses that issue.

Usability Testing: Myths, Misconceptions and Misuses (Part 2)

In Part 1 of this series, I identified and discussed some common misconceptions about usability testing. In Part 2, I continue my discussion of widely misunderstood aspects of usability testing. These are my opinions, based on my experiences and my interpretation of what usability testing is and how it works. My approach is to state a common misunderstanding and then discuss it.

It doesn't matter who you test
Getting appropriate users is essential for the success of usability testing. Many usability testers use convenience samples -- whoever they can get their hands on -- that may or may not be representative of the population of users for whom the product is targeted. Often the assumption is that the UI is good if "normal" people can do the tasks. If your sample of people performing the tasks is not representative of those who will use the product, the conclusions about the UI may not be valid. For example, the fact that the programmer, engineer or secretary down the hall can perform the tasks may or may not mean that the doctors who will use the medical application can perform the task. The fact that the engineer, programmer or secretary cannot perform the tasks may or may not mean that the doctors with their specialized knowledge cannot perform the tasks. If the fact that unrepresentative users can or cannot perform appropriate tasks tells you little about the adequacy of the UI to the users for which the product is intended, what will you learn from the usability test?

You must have a sample that precisely reflects the anticipated users.
While you definitely want appropriate users, you usually don't have to be overly precise. When selecting users, consider only those user characteristics that will have an impact on the UI design. My experience is that it is difficult enough getting appropriate users without imposing unnecessary requirements. For each situation, decide what user characteristics are relevant to the UI. User requirements that are relevant for one application may be irrelevant for another. Decide if you really care what gender, age, amount of education, etc. your users have. Select users based on this decision.

It is easy to set up scenarios and tasks
Usability test tasks and scenarios must be carefully chosen. How tasks are stated will affect the results obtained. I have seen numerous examples of tasks that tell the user not only what to do but how to do it. A common problem is to set up tasks that use the UI's terminology rather than the users' terminology, with the task description often using the very words that cause confusion. This problem is especially serious when the UI uses technological terms rather than domain language and the usability task also uses the technological terms. Another common problem is to literally tell users what steps to take to accomplish the task ("Go to the XXX menu and ..."). I have also seen tasks phrased in such a way that users can't perform the task because they can't figure out what they are supposed to do. If the user can't perform the task, it is crucial to know that the UI, not the task wording, is the problem.

It is very difficult to come up with tasks without knowing how the UI works and it is often difficult to find out how the UI works. A superficial understanding when setting up tasks and scenarios will provide superficial results. In most of the usability tests we have performed, in spite of our best efforts, there have been times when we thought we understood how the application "worked", but we didn't. When this happens, users will provide the information that allows you to identify that the problem lies with your understanding of the UI and the task rather than the UI itself. The best approach is to include revised tasks based on improved understanding in subsequent usability tests. What is unacceptable is to fail to realize shortcomings of the task while testing. Even worse is to try to cover up the shortcomings of the task.

Usability testers and developers know what tasks to include
Users perform tasks. A usability tester or a developer guessing how users will work and what they will do, and then building a series of tasks based on those hunches, isn't likely to provide useful information about the UI. How does the tester know if tasks are appropriate? The tasks must be appropriate if the conclusions are to be valid and useful. Representative tasks must be based on a proper user-needs assessment. Even a rigorous usability test with representative users won't be very useful if it is based on ill-conceived tasks.

It's essential to measure time it takes to perform tasks
Time to perform a task can be a useful measure for some things, but often it is not necessary. One difficulty is deciding what amount of time to perform a task is acceptable. Another problem is that measuring time to perform and asking users to "think aloud" are contradictory goals. If a person provides comments, that time is counted as part of the response time. The response time is meaningless because it includes time spent passing information to you about user thoughts. Often, the "think aloud" information is more valuable than time to perform a task. If both are wanted, they should be collected either on different tasks or with different users working under different instructions.

All results from a usability test are equally good.
Some usability test results are conclusive and some are inconclusive. If 5 of 5 users make the same mistake on a task, most usability testers would probably agree that they have a problem with the part of the UI being evaluated by that task. If 5 of 5 users quickly and easily perform a task without negative comments, most testers will probably agree that the UI being evaluated by that task is OK. But what if 4 of 5, 3 of 5, or even 1 of 5 have a problem with the task?

When faced with inconclusive data, there are a number of alternatives. One possibility is to run more users to try to resolve the ambiguity. Another is to rethink the task and design different or additional tasks to address the issue in more detail. A third possibility is to use other techniques, including inspection methods, to gain better understanding. If differences exist in spite of your best efforts to resolve the ambiguity, the design team has learned that a single UI will probably not be appropriate and the UI design should reflect that diversity, possibly in terms of tailorability.
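One way to see why split results from five users are hard to interpret is to put a rough confidence interval around the observed proportion. The sketch below uses the Wilson score interval, which is my choice for illustration and not a method the article prescribes; it also assumes the participants behave like an independent sample of the intended users.

```python
from math import sqrt

def wilson_interval(problems, users, z=1.96):
    """Approximate 95% Wilson score interval for the proportion of users
    in the population who would have a problem with the task."""
    p = problems / users
    denom = 1 + z**2 / users
    centre = (p + z**2 / (2 * users)) / denom
    half = z * sqrt(p * (1 - p) / users + z**2 / (4 * users**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)

for problems in (5, 4, 3, 1):
    low, high = wilson_interval(problems, 5)
    print(f"{problems} of 5 users had a problem: "
          f"plausible population rate {low:.0%} to {high:.0%}")
```

With so few users the intervals are wide, so a split result is compatible with both a widespread problem and a minor one. That is consistent with the alternatives above: run more users, or design more targeted tasks, before drawing a conclusion.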

Usability results are objective
Results from a well-designed and run usability test are objective, but there is usually a subjective, professional judgment aspect to evaluating those results. Results for any usability test of some aspect of the user interface have to be considered in the context of other aspects of the UI. On a number of occasions, we have rejected UI elements for which we obtained extremely positive results (excellent performance, highest possible satisfaction ratings, no errors) because they were obtained in isolation and we knew they introduced inconsistencies with other aspects of the UI. Professional interpretation of individual task results must always occur within the framework of the whole UI.

You don't need usability goals
Setting usability goals is an essential part of targeted usability testing within a user-centred design approach. Many usability testers do not set usability goals or they determine usability goals after seeing the results. A common reason for differences in reports of problems found with usability testing is differences in usability goals.

The value of usability goals goes far beyond helping to make decisions about usability test results. Usability goals that are set early in a project serve to drive the UI development effort. Knowing that a usability test will be required -- and what it takes to pass -- serves to give direction to the development effort.

Usability goals are either met or they are not met
Usability goals are guidelines. They should be used flexibly. Often, failure to meet usability goals will result in redesign of the aspect of the UI being evaluated, but it may also result in gathering additional user-needs information or even a change in the usability goals to something more realistic.

Automated data logging is a good substitute for usability testing
There is a belief that automatically logging what users do (clicks, scrolling, typing, etc.), often with very large numbers of users on the web, provides much of the information obtained in a usability test. A number of firms will either record keystrokes, mouse movements and links clicked, or sell you software that will allow you to do it yourself. One of the first usability studies we did over two decades ago recorded keystrokes as users navigated through a predecessor of the web called Telidon. We found then, and a lot of people have found since, that it is often difficult to make sense of actions if you don't know what users are trying to do. Is a click on a link an error or what the user wanted to do? Was the user really lost or just exploring related pages?

An event of considerable interest in e-commerce is abandoning a shopping cart. The high incidence of shopping cart abandonment is often cited as a reflection of user-interface problems. There will be some patterns when you can conclude that a UI problem caused the person to abandon the purchase, but there will be many more when you just don't know. Without knowing the intentions of users, it is extremely difficult to make sense of their behaviours. Automated logging of user actions may be of value, but there are many times when it is not.

Unlike automated keystroke logging systems that don't know intentions, a good usability tester will always know the context and, when in doubt, will follow up what users do with questions about why they did it. How to capture intention information with automated logging is a major research challenge.

We can find out which usability evaluation method is "best"
After addressing this problem for a number of years, I am convinced that the issue is not worth pursuing. A series of studies have compared the results obtained with different evaluation techniques in an attempt to proclaim which one is best. Best is often defined in terms of the number of UI problems and the "importance" of the UI problems identified.

These studies are asking the wrong question. "Which is best" is not very useful. A more useful question is what are the relative advantages of different techniques under different circumstances. Inspection methods such as heuristic evaluation, expert reviews and cognitive walkthroughs can and should do a better job than usability testing in terms of the number of problems identified. One of the primary strengths of inspection methods is breadth of coverage. For example, a heuristic evaluation typically covers all screens systematically considering all heuristics. On the other hand, one of the main strengths of usability testing is precision in exploring specific issues that were identified in advance to be a potential problem. For targeted issues, usability testing is more appropriate than the inspection methods. But usability testing, by its nature, is of little value at detecting problems that were not targeted in advance. If a task doesn't address some part of the UI, there is no reason for users to even look at it.

These are my opinions about some common usability test misunderstandings. I'll address more misunderstandings in "Usability testing: Myths, misconceptions and misuses - Part 2" in a future issue of HOT Topics. If you have thoughts on these usability testing issues or other usability test issues, send them along and I can include them. You can reach me at dick_dillon@carleton.ca

References

Kangas, S. (2002) Is 5000 users enough? http://www.netconversions.com/research.htm.
Nielsen, J. (2000) Why you only need to test with 5 users. http://useit.com/alertbox/20000319.html
Spool, J. and Schroeder, W. (2001), Testing Web sites: Five users is nowhere near enough, CHI 2001.
Stewart, T. (2002). How to cope with success. Interactions: New visions of human-computer interaction, IX.6, 17-21.
