Design And Evaluation
Designing any sort of computer-mediated device that ordinary people find effective and pleasant for everyday use has proven to be surprisingly difficult. The evidence for this observation comes from the myriad problems cited above in this report and at the workshop organized by the steering committee, from systematic empirical studies cited in this chapter, and from anecdote: the frequent complaints of ordinary people required to use the currently most common public-oriented application, telephone-based voice response menu systems, as well as the complaints of more sophisticated users of the World Wide Web about complexities and frustrations that have led as many to abandon the on-line life as to join it. (Consideration of the experiences and needs of people without specific special needs, referred to here as "ordinary" people, is an important complement to the discussion of those with special needs (see Chapter 2) in developing ideas for research to support interfaces that work for more, if not most, of the population.)
It is, of course, possible that the greater power, utility, and desirability of computer-based functions as compared to traditional mass-market technologies (e.g., television, telephony) mean that greater difficulty of use is inevitable, worth a high price in human effort and inconvenience, and solvable only by increased education, with its concomitant risk of leaving out those with insufficient time, resources, or ability. An alternate view, however, is that it should be possible to use the power of the new technologies not only to do more and better things but also to do most of them at least as easily, if not more so. Much of the burden of introducing new information technologies to the public can be removed or relieved by better design of the functions and interfaces with which most people will deal.
The steering committee assumes that it is often or usually possible to design more widely useful functions and to make them easier to use through design activities specifically aimed at these goals. Proof of the existence of this opportunity is readily available, beginning with popular knowledge of such consumer devices as cars and television sets, which were very complex initially but became, from the user's perspective, less so through sequences of adjustments over time. The Handbook of Human-Computer Interaction (Helander, 1988) contains many examples of prohibitively difficult systems made very much easier and more effective by redesign, and many more recent examples are reviewed by Nielsen (1993) and Landauer (1995). Some of these successes are reviewed in more detail below in this chapter. To set the stage, one is mentioned here that involves comparatively simple store-and-forward technology (as opposed to more complex multimedia, hypermedia, or collaboration support), a case that has particular relevance to much of the expected use in the every-citizen interface (ECI) environment.
Gould et al. (1987a) designed an electronic message system for the 1984 Olympics in Los Angeles. The system was to be used by athletes, coaches, families, and members of the press from all corners of the globe. The original design was done by a very experienced team at IBM's T.J. Watson Research Center. When first tested in mock-up with representatives of its intended user population, it was virtually impossible to operate effectively. By the time an extensive program of iterative user testing and redesign was finished, more than 250 changes in the interface, the system-user dialogue, and the functionality were found to be necessary or advantageous. The final system was widely used without any special training by an extremely diverse population. Another example comes from the digital libraries context and relates to the Cypress on-line database of some 13,000 color images and associated metadata from the Film Library of the California Department of Water Resources (Van House, 1996). Iterative usability testing led to improvements for two groups of users, a group from inside the film library and a more diverse and less expert group of outsiders. Both direct user suggestions and ideas based on observing users' difficulties gave rise to design changes that were implemented incrementally.
A central research challenge lies in better design and evaluation for ordinary use by ordinary users and, more basically, in understanding how to accomplish these goals. The future is not out there to be discovered: it has to be invented and designed. The scientific challenge is to understand much better than we do now (1) why computer use is difficult when it is, (2) how to design for easier and more effective use and how to ensure that a design achieves it, and (3) how to teach effectively both schoolchildren and those past school age to take advantage of what there is to use (a complex topic outside the scope of this report).
Available research and expert opinion point to at least three reasons why many computer-mediated tools (including, especially, communications systems) are currently difficult or ineffective for use by a large part of the population: (1) complexity and power of computer-mediated tools, (2) emphasis on users with unusual abilities, and (3) sophistication of designers and their discipline.
Complexity and Power of Tools
Computer-mediated tools, as compared with traditional technologies, can be extremely powerful and complex, doing a vast array of different things with enormous speed. Of course, this is their advantage and appeal, but it is also their temptation. It means that a communications facility such as e-mail can be designed not only to let a user send an asynchronous text message to another subscriber but also to send multiple messages, create mailing lists, respond automatically, forward, save, retrieve, edit, cut and paste, attach files, create vacation messages, fax, and so on. If the design is not handled extremely well, users will have to learn how to negotiate this vast array of options: to know about the options and how to operate them if they want to use them, at least how to ignore them if they do not, and always somehow to choose whether and what. The situation can become analogous to providing the cockpit control panel of an airliner for use by its passengers to turn on their reading lights. The consequences in computing range from the proliferation of features in software products, to observations that most amateur spreadsheets contain serious errors and that employee hand-holding costs as much as hardware for business personal computer users, to additional but seldom-used features on standard computer keyboards.1 The concept of multimodal interfaces that would accommodate alternative approaches to input and/or output, discussed in Chapters 2 and 3, will introduce considerable complexity into the technology development process without adding any new functional features.
Great power and complexity also bring the opportunity to make very costly errors. Pressing the wrong key on an ordinary telephone touch-tone pad leads at worst to a wrong number. With a computer-mediated system it can, and often does, lead to hours of lost work or inadvertently sending, for example, a "take me off this mailing list" message to 300
people, many of whom also wish they were not on it. Laboratory studies have found repeatedly that the majority of user time spent with popular applications such as word processors (which will be incorporated into many ECI applications) or spreadsheets is occupied with recovery from errors (see, for example, Card et al., 1983). This is one of two reasons why computer-mediated activities create very much more variability in task completion times than do traditional technologies (Egan, 1988); see below for the second. Contemporary discussions in the business and personal press about the "futz factor" (extra time and effort spent adjusting various aspects of a computer-based system) attest to continuing problems resulting from increased complexity and power. The irony is that in some cases (e.g., early cellular phones, personal computer software), a significant amount of complexity appears to derive from software and sometimes hardware added with the intention of "enhancing" usability.2
Emphasis on Users with Unusual Abilities
Computer-mediated tools emphasize individual differences in ability more than do traditional technologies. Egan (1988) reviewed a large number of studies of individual differences in the time taken and errors made in using common computer applications. In every case in which comparisons could be made, the variability among different people was much greater when they used computers rather than precomputer approaches to doing the same sorts of operations. An approximate summary of the data from these studies is that while most traditional tasks, such as operating a conventional cash register, calculating a sum (manually), or running around the block, will take about twice as long for the slowest of 20 randomly chosen people as for the fastest, in computer-mediated tasks the range is never that small; typically it is around 4 or 5 to 1, and it may be as high as 20 to 1, even among well-trained and experienced users such as professional programmers.
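The contrast in spread can be illustrated with a small simulation. This is a sketch only, not an analysis of the actual study data: the lognormal spread parameters below were chosen simply to reproduce slowest-to-fastest ratios of roughly 2:1 and 4-5:1 among 20 people.

```python
import random
import statistics

def slowest_to_fastest(sigma, n_people=20):
    """Ratio of slowest to fastest completion time among n_people,
    with times drawn from a lognormal distribution of spread sigma."""
    times = [random.lognormvariate(0.0, sigma) for _ in range(n_people)]
    return max(times) / min(times)

random.seed(42)
trials = 1000
# Illustrative spreads: a small sigma mimics traditional tasks (~2:1),
# a larger sigma mimics computer-mediated tasks (~4-5:1).
traditional = statistics.mean(slowest_to_fastest(0.19) for _ in range(trials))
computer = statistics.mean(slowest_to_fastest(0.42) for _ in range(trials))
print(f"traditional ~{traditional:.1f}:1, computer ~{computer:.1f}:1")
```

The point of the sketch is that a modest widening of the underlying individual-difference distribution produces a dramatically larger gap between the fastest and slowest members of any randomly chosen group.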
In several instances a good portion of the greater between-individual differences in computerized tasks has been traced to measurable differences in cognitive abilities. In the aggregate, workshop participants commented, such differences contribute to observations about the concentration of computer use among teenage males; they also contribute to reports in the business press about the frustrations of "information overload."3 Egan (1988) and Landauer (1995) reviewed studies in which measures of spatial memory, logical reasoning, and verbal fluency, as well as age and interest in mathematics and things mechanical, show greater than two-to-one differences between the highest and lowest quarter of the sampled potential user populations (see Figures 4.1 and 4.2 for examples). The participants in the studies illustrated were mostly noncareer middle-class
suburban women with little or no computer experience, fairly representative of the average citizens one might expect to be future network users, although not of their range. How significant a problem is this? One guess comes from studies of the efficiency gains expected for computer applications to common business tasks. From the sparse available data, Landauer (1995) estimated that computer augmentation speeds work by 30 percent on average (with large variations). Combining this with the individual-difference estimates and a normal probability distribution suggests that about a third of the population would usually be better off without computer help as now provided because they do not possess the basic abilities prerequisite to its effective use. This is without consideration of the part of the population ordinarily designated as disabled or disadvantaged.
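The back-of-the-envelope character of this estimate can be made explicit with a short calculation. The standard deviation below is an assumed value chosen to reflect the reported individual-difference range, not a figure taken from the report, and the normality assumption itself is only a convenience.

```python
from statistics import NormalDist

# Sketch of the combination described in the text (assumed parameters,
# not the authors' exact calculation): suppose individual gains from
# computer augmentation are roughly normal with a mean of +30 percent
# and, given the large individual differences, a standard deviation of
# about 70 percentage points.
gains = NormalDist(mu=0.30, sigma=0.70)
worse_off = gains.cdf(0.0)  # fraction of users with a negative gain
print(f"Estimated fraction better off without computer help: {worse_off:.0%}")
```

Under these assumptions, about a third of the distribution falls below zero gain, which is the arithmetic behind the statement in the text.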
While education and training can usually reduce individual differences, there are two reasons why computer-mediated tasks may be less susceptible to this solution. One is the aforementioned vastly greater complexity usually offered: the much larger variety of different functions available and of alternate means for achieving the same effects (e.g., five or more ways to cut and paste in most recent text editors). This variety often means that it can take longer to acquire high skill, akin on a smaller scale to the greater difficulty of learning to fly a jetliner than to drive a car. It also often means that some users will find better ways to operate their system than will others. This is not because there are large differences in which method serves which person best; such "treatment-aptitude interactions," despite widespread folk belief in their existence, have virtually never been found in carefully controlled studies. It happens merely through chance variations in which operations users learn first, make habitual, and thus allow to become dominant over other possibilities that it thereafter takes excessive time to find and retrain for.
The second reason that computer training is less helpful than training for earlier technologies is the much more rapid and challenging changes in the technology itself. The basic automobile, typewriter, and telephone have not changed significantly from the user's perspective in almost half a century, and changes from their very beginning have been few, slow, relatively minor, and learnable without help (nonuse of extra features, such as the clocks on video cassette recorders or cruise control on cars, does not tend to be associated with an inability to use these devices for their essential functions). By contrast, every new model of a personal computing software package, even from the same manufacturer, comes with many new features and functions, new menu arrangements with new labels, and a large instruction book (and built-in help system). Such enhancements can affect even basic tasks. And every few years another new computer-based technology is offered. Thus, there is simply not the time available to consider yearlong high school courses for each computer
technology every citizen might want to use (this year e-mail, next year the World Wide Web) as there was for typewriters and accounting machines, or 7-year apprenticeships, as there were for steam shovels and looms; the systems would be obsolete and gone before expertise was gained. The result is that high-functionality computational systems are never completely learned, nor is their power fully exploited, and the primary learning strategy is learning on demand (Fischer, 1991). The challenge is to design so as to exploit the potential power for ease of learning and use as well as for increased functionality. Discontent with proliferating features contributed to mid-1990s experiments with so-called network computers, which have fewer features than conventional personal computers, as well as to periodic articles in the business press about the persistently high costs of owning and using personal computers.4
Several members of the steering committee and reviewers of a draft of this report wondered whether the low efficiency gains and large individual differences found in studies in the 1980s may have been overcome by technological advances in the 1990s. Although market statistics attest to growing use of information technologies, the sparse empirical evaluations of these issues in earlier periods appear to have become no more common in the past 5 years. While it was not possible to mount a systematic search for empirical evidence on trends in usability, the consensus of the usability engineers on the steering committee and among workshop participants was that things have not in general improved. For the most part, technological advances, particularly in software, have increased complexity; and, while some vendors are doing more usability testing, increased competition to be first to market with new features has brought a growing tendency to omit the kinds of early and iterative design and evaluation activities these experts think are essential to ensure ease of learning and use. In addition, what testing is done often generates results that vendors hold closely in the interests of gaining or preserving proprietary competitive advantages.
Market forces alone are unlikely to yield interfaces for every citizen because the rapid pace of the commercial market fosters an emphasis on sales performance as an indirect measure of value or effectiveness rather than direct presale evaluation of interface quality. At the workshop, Dennis Egan suggested several reasons for the lack of attention to interface evaluation by the industry. First, industrial research groups have reoriented themselves toward identifying near-term profit-making products and services, not performing longer-term research to evaluate new interface concepts and technologies and usually not publishing helpfully detailed results when they do. Second, the acceleration of product life cycles-particularly software-leaves little time for interface evaluation studies. Third, information technology products may succeed despite
having inferior user interfaces by supporting highly desired functions, reaching the market before their competition and becoming de facto standards. Workshop participants from industry and report reviewers emphasized the commercial dependence on marketplace Darwinism, noting that vendors seem to find fielding their best guesses in products more cost effective than added precommercialization testing. Some went further to suggest that the World Wide Web had provided a mechanism for harnessing market cycles, noting that some vendors are using Web sites for beta testing of products and for eliciting feedback from those users (mostly sophisticated and eager "early adopters" unrepresentative of the average citizen) who opt to try the products. The constant release of beta versions of software over the Web represents a limited kind of software evaluation and user involvement on a massive scale; some of these releases are now reviewed, sometimes even on the basis of modest empirical tests, in trade publications, and usability and other design experts are dedicated by some vendors to some releases. Several companies are using this mechanism for iterative design. However, work by Hartson et al. (1996) suggests that methods for using "the Web as a usability lab" effectively, while promising, are in their infancy and face a number of problems that will be resolved only by considerable research. For example, this approach will require significant innovation in system instrumentation and user sampling techniques because, as outlined later, the untutored opinions of programmers and other power users are usually of little value for detecting the functionality and usability problems that are important for ordinary people (Nielsen, 1993). Tracking such efforts in broad-based user involvement and assessing their effectiveness might provide a productive starting place for research on large-scale participatory design and evaluation methods.
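The system instrumentation that "Web as a usability lab" research would require can be sketched in miniature: timestamped interface events collected on the client for later upload and mining. All class, method, and event names below are hypothetical illustrations, not an existing API.

```python
import json
from typing import Callable, List

class UsageLogger:
    """Minimal sketch of remote usability instrumentation: record
    timestamped UI events so that task durations and error counts can
    be mined later. Names here are illustrative only."""

    def __init__(self, clock: Callable[[], float]):
        self.clock = clock            # injectable clock, for testability
        self.events: List[dict] = []

    def record(self, kind: str, detail: str = "") -> None:
        self.events.append({"t": self.clock(), "kind": kind, "detail": detail})

    def summary(self) -> dict:
        if not self.events:
            return {"duration": 0.0, "errors": 0, "events": 0}
        ts = [e["t"] for e in self.events]
        errors = sum(1 for e in self.events if e["kind"] == "error")
        return {"duration": max(ts) - min(ts),
                "errors": errors,
                "events": len(self.events)}

    def export(self) -> str:
        return json.dumps(self.events)  # payload a site could upload

# Simulated session, driven by a fake clock for reproducibility.
fake_time = iter([0.0, 4.2, 9.5, 12.0])
log = UsageLogger(clock=lambda: next(fake_time))
log.record("click", "search-button")
log.record("error", "empty query")
log.record("click", "search-button")
log.record("task-complete", "found address")
print(log.summary())
```

Even this toy version hints at the open research problems the text mentions: deciding which events to capture, and how to sample users so that the logs reflect ordinary people rather than eager early adopters.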
Sophistication of Computer Hardware and Software Designers and Their Discipline
Most of the people involved in the design and implementation of functions and interfaces for computer applications are themselves sophisticated computer users. Feature requests and inventions come primarily from experienced users and are supplemented and implemented by programmers. The situation is unlike that in other consumer-oriented technologies in two important respects. As noted earlier, computers offer and usually provide a larger range of functions and controls and therefore almost always greater complexity in the choices and actions required of the user. Hence, expertise with a computer technology can often play a much greater role in its use. Computer technology started as an aid for a highly technical portion of the population-scientists, engineers, and
mathematicians, most of whom were capable of designing at least some software themselves, and often did.5 To a considerable extent, computer designers have designed for their own use and for that of people like them; the design of computer applications is still primarily in the hands of programmers and other software specialists, albeit now as leaders of large teams and abetted by marketers, physical designers, and managers. Perspectives of other kinds of people-the differently abled, those with low levels of income and education, those resistant to technology merely for its own sake, and so on-have been represented most commonly by proxy or surrogate, if at all.
Not only are software specialists typically more experienced with the technology, but they are also, in general, quite different from the average user in the characteristics and abilities currently needed to deal effectively with computers: youth, mechanical and mathematical interests, good spatial memory, verbal fluency, and logical ability. They also tend to be less socially and pragmatically oriented in personality (Tognazzini, 1992). Although they may attempt to incorporate models of user behavior, behavioral scientists at the workshop noted that in practice designers tend to assume model users: users whose behavior poses fewer problems than is actually experienced. As a result, it is extremely difficult for today's computer-based systems designers to have good intuitions about what will and will not be easy and gratifying for all citizens. This situation is illustrated by a press account of a Microsoft consumer product team's visits to five families for 3 hours each, reporting surprise about and better understanding of presumably ordinary households (Pitta, 1995).
The rise of the World Wide Web and experimentation with it by a widening range of people provide many illustrations of the challenge to designers. In a recent e-mail discussion on the topic, it was mentioned that ordinary citizens might experience difficulties in finding e-mail and Web addresses, to which a well-meaning expert replied that there were three uniform resource locators (URLs) on the Web that could be searched and that at least one of them would usually locate a person's address. It seems unlikely that this procedure would be very appealing or effective for most citizens who are not already frequent and accomplished users; the suggestion is consistent with an expert interface rather than an every-citizen interface. True, the availability of these searchable databases means that the possibility of eventually providing a good directory service for every citizen exists, but the necessary next steps have yet to be taken. The anecdote suggests that this may be a larger task than it appears, since the difference between usually and always locating a person's address may affect how broad a segment of the population finds the service desirable and how much the Internet can contribute to a truly national information infrastructure (NII). The trends toward supporting do-it-yourself
activity that extends to assembly and customization of software systems from components and modules by users appear likely, in the near term, to exacerbate the challenge of serving more of the population. (One of the steering committee members has been told by a major manufacturer that less than 10 percent of office workers ever change the factory configuration of their ergonomically adjustable chairs.)
In short, evidence discussed at the workshop indicates that an organized design and development process that ensures that the needs and abilities of potential average citizen users will be well taken account of has not yet become standard practice in software to nearly the extent that it has in the manufacture of most other mass-market products. Workshop discussions among technical experts and social scientists knowledgeable about specific population segments attested to the diversity of needs, reactions, and other qualities within the population as well as the uneven appreciation for that diversity.
The Possibility Of Easier-To-Use, More Effective Systems
There is ample evidence that computer systems with highly useful functions can be designed and built for easy, pleasant, and effective use by every citizen. Figures 4.1 and 4.2 give two examples. In both cases a function that could be used at an adequate level by only a minority of people was redesigned so that everyone could use it well. Moreover, improving usability for the less capable users did not penalize the more capable. These cases and others like them show that paying attention to the needs of novice users can often be accomplished without undesirable tradeoffs for expert users. Indeed, it is commonly the case that redesigns that help occasional users are even more helpful for frequent users; for example, effective free-form queries such as those provided by Excite and Latent Semantic Indexing allow both novices and the most sophisticated systems analysts to search the Web more easily and effectively and with fewer frustrating, time-consuming errors than are common with Structured Query Language (SQL) or Boolean search formats.
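The difference between strict Boolean queries and free-form ranked retrieval can be made concrete with a toy example. The crude bag-of-words cosine ranking below is only a stand-in for richer methods such as Latent Semantic Indexing, and the documents and query are invented.

```python
import math
from collections import Counter

docs = {
    "d1": "easy email for elderly users",
    "d2": "spreadsheet error recovery study",
    "d3": "email training classes for employees",
}

def boolean_and(query: str) -> list:
    """Strict Boolean AND: a document matches only if every term appears."""
    terms = query.split()
    return [d for d, text in docs.items()
            if all(t in text.split() for t in terms)]

def ranked(query: str) -> list:
    """Free-form ranking by cosine similarity of word counts, a crude
    stand-in for approaches like Latent Semantic Indexing."""
    q = Counter(query.split())
    def cosine(text):
        v = Counter(text.split())
        dot = sum(q[w] * v[w] for w in q)
        return dot / (math.sqrt(sum(c * c for c in q.values())) *
                      math.sqrt(sum(c * c for c in v.values())))
    return sorted(docs, key=lambda d: cosine(docs[d]), reverse=True)

query = "email help for elderly people"
print(boolean_and(query))  # strict matching finds nothing
print(ranked(query)[0])    # ranking still surfaces the best partial match
```

The Boolean query fails outright because no document contains every term, while the ranked query degrades gracefully, which is one reason free-form search is more forgiving for nonexpert users.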
Two additional examples of success in improving usability through redesign are instructive. In one case, an e-mail system was redesigned for simple text message interchange, e-mail's most popular use. The system was always on (no log-on was required), like a telephone, and had a screen that said "To," a simple backspace and retype editor, a button labeled "Send," and a printer that printed only when a message arrived. A group of elderly women-a segment of the population shown by data and demographics to be especially technophobic and less likely to succeed at computing-learned the system after about 30 minutes of training
and used it eagerly. (The same e-mail system was preferred by several high-level executives of a telecommunications research company, all technical Ph.D.s who had easy access to a much more powerful system.) By contrast, today's typical e-mail systems are usually introduced to business employees in full-day training classes. The second example involves hypertext. In the majority of experiments evaluating how well people can find information in the same large book, manual, or encyclopedia using traditional print versions and on-line hypertext versions, people did significantly better with paper (see, for example, Gould and Grischkowsky, 1984; Gould et al., 1987b).6 But in a few cases, people using the hypertext systems have greatly outperformed those using the old technology (see Landauer, 1995, for a review). The difference has been attributable to the design of the hypertext system and especially to the methods by which the design was done.
When conflict between ensuring usability for relative novices and providing power for the highly trained is unavoidable, perhaps because
of entrenched development and marketing techniques or the inherent sophistication of some applications, two complementary approaches are possible. One is to provide differing levels of functionality for different users. Several usability specialists at the workshop reported routinely advising designers to provide as many functions, features, and options as will be useful, feasible, and in demand by experts, but to "hide" them from users who want only basic functions: for example, by retaining the simplicity of short menus that emphasize only the best general functions and offering the option of selecting an "advanced functions" button for access to special features. The second approach is, of course, to increase the sophistication of users through education, training, and access to good guides and manuals (e.g., "training wheels" and "minimal manual" techniques, and scaffolded and staged advancement). Future computing functions of use to many citizens may well require fundamental understanding of concepts and operations that are not now taught in school: iterative and simulation-based problem solving, for example. Some research has been done on speeding the acquisition of such expertise, but the basic issue of what knowledge and skills are most important to teach itself warrants assessment;7 the history of computing machines and the difference between bytes and bits or RAM and ROM (random access memory and read-only memory), as often taught in "introductory" computer classes, may not even be among them.
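The "hide advanced functions" advice amounts to a simple filtering rule over a feature list, as in the sketch below. All feature names are illustrative, not drawn from any particular product.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Feature:
    label: str
    advanced: bool = False

# A hypothetical e-mail menu; names are illustrative only.
FEATURES = [
    Feature("Send"),
    Feature("Reply"),
    Feature("Forward"),
    Feature("Create mailing list", advanced=True),
    Feature("Auto-respond rules", advanced=True),
    Feature("Fax gateway", advanced=True),
]

def visible_menu(features: List[Feature], show_advanced: bool) -> List[str]:
    """Short menu for everyone; the full list appears only on request,
    e.g. behind an "advanced functions" button."""
    items = [f.label for f in features if show_advanced or not f.advanced]
    if not show_advanced:
        items.append("Advanced functions...")
    return items

print(visible_menu(FEATURES, show_advanced=False))
print(visible_menu(FEATURES, show_advanced=True))
```

The design choice is that the default path stays short and predictable for basic users, while experts pay only one extra selection to reach the full feature set.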
Existence Of Effective Design Methods
It is by now well established that iterative test-and-redesign methods almost always result in substantial gains in usefulness and usability. Landauer (1995) summarizes a large number of published reports of comparisons of task performance efficiency before and after a redesign based on some kind of empirical evaluation of the use of a computer-based system. The modal gain is around 50 percent, and new methods and new evidence of their success appear regularly (see, e.g., Sawyer et al., 1996).
Unfortunately, there is almost no discernible commonality to the methods used in the studies surveyed, other than that they all empirically evaluated the performance of users trying to do what the system was designed to help them do and that in all cases the evaluation was formative rather than summative; that is, it was done early enough, and with a view to guiding improvements, rather than just certifying adequacy. Sometimes the evaluation was done by systematic experiments in laboratory settings, sometimes by careful examination of the interface and dialogue by two or more experienced usability experts, sometimes by detailed examination of usage logs, sometimes by analysis of videotapes of users working, and sometimes by informal observation of users or by asking users to talk about what they were doing. In one telling example, as recounted at the workshop by John Thomas, researchers at NYNEX used a simulation model to estimate and measure the work efficiency of a new graphical user interface intended to improve the efficiency of a computer system for use by thousands of employees. The simulation model, which emulates the perceptual and motor time demands of well-practiced repetitive tasks, predicted a decrease in efficiency. Field results after the system was deployed without revision confirmed the prediction; at the same time, a revision in the laboratory taking into account the discovered flaws reversed the unfavorable result (Gray et al., 1992).
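Simulation models of this kind build a time prediction from cataloged operator times. The sketch below is in the spirit of the keystroke-level model: the operator values follow Card, Moran, and Newell (1983), but the two task sequences are invented for illustration and are not the NYNEX interfaces.

```python
# Keystroke-level model sketch. Operator times from Card, Moran, and
# Newell (1983): K = keystroke, P = point with a mouse, H = move a hand
# between keyboard and mouse, M = mental preparation.
OPERATOR_SECONDS = {"K": 0.28, "P": 1.10, "H": 0.40, "M": 1.35}

def predicted_time(sequence: str) -> float:
    """Sum of operator times for a sequence like 'MKKKK'."""
    return sum(OPERATOR_SECONDS[op] for op in sequence)

# Invented task sequences for one hypothetical repetitive step:
keyboard_design = "M" + "K" * 8            # recall a command, type 8 keys
gui_design = "MPK" + "MPK" + "H" + "KKK"   # find/click twice, rehome, type

t_kbd = predicted_time(keyboard_design)
t_gui = predicted_time(gui_design)
print(f"keyboard: {t_kbd:.2f}s, GUI: {t_gui:.2f}s")
```

For well-practiced repetitive work, the extra pointing and hand movements can make a graphical design slower than a keyboard one, which is exactly the kind of counterintuitive prediction the NYNEX model made and the field data confirmed.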
While there are known ways to ensure improvements, there are two very serious outstanding problems in the design process. The first is that these methods are not applied often enough, well enough, or early enough in design and development cycles to help most systems. In part this results from a persistent myth among technical practitioners and managers
that iterative empirical usability engineering methods are too slow and expensive to bring to bear in today's competitive environment. In fact, as has been shown repeatedly, usefulness/usability testing almost always shortens delivery times by removing the need to develop unneeded features and to rework designs expensively. Moreover, if done right, it actually takes only about 4 to 12 hours per week during development, a trivial addition in most projects. In addition, even faster and more economical feedback techniques are still evolving (Virzi et al., 1996) and would undoubtedly make even more progress with more than their present meager support. Indeed, this is the second part of the problem: too little is known about which methods work best for which purposes, which are fastest, most accurate, or most cost effective, and what new methods might be even better for some purposes. Workshop participant Dennis Egan, of Bellcore, put it this way:
Some people have the view that evaluation studies do not matter, that any interface that has to be systematically evaluated and whose obvious superiority does not hit you between the eyes must not be worth much. The idea is to focus on entirely new interface designs and concepts, the "home run" rather than the incremental improvement. Clearly, we need people creating totally novel interface concepts, seeking breakthrough technology. But there are many instances where a creator's intuition about the use of novel technology has proved very wrong. We need to understand how best the new (and older) technology can be harnessed to support work, aid in instruction and training, and provide entertainment. The skeptics and scientists among us often are left wondering whether a new technology really serves well for a particular task or job.
The challenge is compounded with the advance of such innovations as three-dimensional interaction interfaces, which can involve different tasks and metrics than two-dimensional interfaces. Comprehensive usefulness and usability evaluation of entire three-dimensional user interfaces, for example, remains to be undertaken (Herndon et al., 1994).
Too Little Use Of Known Methods
While more empirical evaluation of usefulness and usability is being done, especially by large producers on major products, the quantity and intensity of such research is still very slight relative, for example, to the mechanical testing of auto engines, wind tunnel testing of airframes, cycle speed and reliability testing of computer chips, or code correctness and performance testing of computer programs. Often, lessons learned in empirical tests, especially when some features are found to be undesirable
and simplification is called for, are ignored in favor of feature-list-driven marketing considerations. Often, testing is postponed until the end of development, when it is both in danger of being omitted entirely and too late to inform revision. Some of this project's industry participants reported that recent reductions in product development times resulting from market pressures have made timely test-and-fix efforts even less common than before.
Evaluation needs to be done very early and at all stages of development-indeed, starting even before invention and development, at the needs analysis, wants identification, and ideas-about-what-to-build stages. A recent study suggests that paper mockups of screens and interaction dialogues are often as effective as full-scale prototype testing (Sawyer et al., 1996). What will it take to bring about much more empirical evaluation of whether all citizens can easily use what is provided to do the things they want to do? Ideas proposed at the workshop ranged from encouraging broader education of software engineers (which might be facilitated by federal agency sponsorship of summer schools, workshops, and fellowships) to the creation of a new professional category of computer system designer, separating the design activity from the building activity much as the structural architect job is separated from that of the general contractor, plumber, or carpenter. Professional societies would have a role or at least a position on possible mechanisms. Another suggested approach was to get appropriate evaluation activities better specified in development process standards and their certifications. Consistent with the project objective to identify areas for research were discussions of the need for more interdisciplinary research and suggestions for combining or better linking research on concepts with evaluation of implementations. One idea voiced by several workshop participants was that government (at different levels) might play a leading role by establishing requirements for usability and usefulness testing in procurement processes and by exploring and demonstrating every-citizen usability assurance in the design of public-service-oriented systems offered over the NII.8 Comments by federal observers at the workshop suggested that the basis for some government activity can already be seen in activities under the aegis of the U.S. 
Department of Defense (CAPTEC program) and the General Services Administration as well as specific agency projects involving kiosk installations.
Appropriate incentives for better design of what are becoming mass-market products are complicated by the inherent need to balance economic and social interests. At the workshop, industry participants emphasized that additional evaluation, design, and features all have costs. Controversies related to regulatory interventions (e.g., requirements for certain product features) under the aegis of the Occupational Safety and
Health Administration and the Americans with Disabilities Act, for example, illustrate this tension. They raise the question of whether research can both make technology more usable and make the associated evaluation less expensive. This is one reason why some workshop participants called for research to develop better tools to support design and evaluation.
One of the issues energetically argued at the workshop was the problem of assessing the potential value of a new technology before it exists. Some participants worried that attempts to do so stifle creativity because people cannot be expected to imagine with any accuracy whether they would like or find useful an entirely novel way of doing things. Examples are rife: Alexander Graham Bell did not think the telephone would be useful for interpersonal communication but only for entertainment broadcasting. Early IBM leaders thought a handful of computers would fill the nation's needs. Xerography met long and strong investor resistance because few could see a need. Until recently, large numbers of people who had not tried it, including most telecommunications company executives and marketers, could not see any benefit in e-mail. These examples illustrate that evaluation calls for doing more than simply describing a novel technology and asking people whether they would want it.
Other workshop participants, including especially social psychologists and human factors specialists, were of the opinion that, while such assessments are more difficult and less sure than, for example, comparative evaluation of existing and widely used technologies, methods already exist that can give good early hints. Some methods are associated with scholarly research; some are associated with market research (much of which draws on social science techniques, such as conjoint analysis-using analysis of statistical variance to explore trade-offs-and constant sum point assignment-to assess priorities).
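As a concrete illustration of one such market research method, the sketch below shows the core arithmetic of a conjoint analysis in miniature. The attributes, levels, and ratings are invented for illustration; a real study would use a carefully constructed (often fractional factorial) design and regression-based estimation rather than this toy full-factorial averaging.

```python
from statistics import mean

def part_worths(profiles, attr_index):
    """Mean rating at each attribute level minus the grand mean
    (a valid estimate only for a balanced, full-factorial design)."""
    grand = mean(r for *_, r in profiles)
    levels = {}
    for row in profiles:
        levels.setdefault(row[attr_index], []).append(row[-1])
    return {lvl: mean(rs) - grand for lvl, rs in levels.items()}

# Hypothetical ratings (1-10) of four service profiles: (speed, price, rating)
profiles = [
    ("fast", "low",  9),
    ("fast", "high", 6),
    ("slow", "low",  5),
    ("slow", "high", 2),
]

speed = part_worths(profiles, 0)   # {'fast': 2.0, 'slow': -2.0}
price = part_worths(profiles, 1)   # {'low': 1.5, 'high': -1.5}
```

The range of part-worths within an attribute (4 points for speed versus 3 for price here) is what analysts read as its relative importance in the trade-off.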
An illustration of the technology-forecasting dilemma is video telephony. On the strength of the technical promise, this technology has been reinvented and tried repeatedly, in approximately 10-year cycles. But from the earliest laboratory and field trials, through many succeeding ones with better and better (and often, because the trials were not constrained by cost, quite excellent) technology, it has been found repeatedly that most people do not choose to use these facilities. Trials accompanied by usage data, observation, and surveys repeatedly find that people-from executives to researchers, engineers, and middle managers, to homemakers-usually prefer not to be seen when conveying a short message. While nonverbal cues can convey information, the amount and validity are much less than is popularly believed. Moreover, what is conveyed is not necessarily relevant to the function at hand and may even detract from it. Studies suggest, for instance, that obscuring information about gender or social status can result in more-and more egalitarian-participation in
on-line group activities. On the other hand, the added value of a visual channel for communication is often not for provision of affective cues but rather to enable references to or manipulation of shared visual artifacts (such as photographs, engineering drawings, and proposed budgets). The likely value of these facilities for ordinary citizens and for special populations is among the questions that need answering. These and other examples indicate that, while it is sometimes possible to forecast what technologies will emerge soon in the marketplace with some confidence, more than intuitive hunches are needed to predict whether, how, and why they will be widely used.
Among the methods cited as reasonably effective for predicting the value of a new technology were the following:
Task analysis. Broadly conceived, task analysis means observing people in real-life situations to see what they are trying to accomplish, how well they are succeeding, and what is preventing success, and asking questions about goals, frustrations, and important incidents. This may occur in the context of ethnography, an anthropological approach involving observation of people in their natural environments, or by designers involving themselves as natural participants in the task in its natural setting, sometimes called contextual inquiry. Applied in 1870, a task analysis approach might have discovered that people spent a good deal of time writing letters in order to pass small bits of news or to obtain short answers: reports of sickness, requests for prices, dinner invitations, social maintenance greetings. An analyst might have gone on to count the number of occasions in which circumstances arose that would be well served by different means of communication if they were available and, if clever, would have noted that people often took long periods out of demanding days to walk miles to merely chat with friends, that occasionally runners were sent in all possible haste to fetch midwives or relatives. From such observations would come at least an educated guess that a faster means of interpersonal communication would be useful and desirable. Then the analyst might ask people to imagine talking at a distance, but would do so carefully and with informed interpretation.
Task analysis is implicit in the development of domain-oriented design environments (Fischer, 1994a), which (1) reduce the learning problem by supporting human problem-domain interaction by bringing tasks to the forefront, (2) allow users to learn more about functionality in the context of a task at hand (thereby supporting the integration of working and learning), and (3) assist users through an agent-based critiquing mechanism (Fischer et al., 1991) to reflect and improve the designed artifacts. The use of these systems of domain-oriented design environments has been demonstrated in a variety of different domains, ranging
from kitchen design to voice dialogue design, multimedia design, and lunar habitat design.
Performance analysis, which goes a step finer. The word "analysis" is important here and has a meaning not unlike its use in chemistry. Performance analysis involves observation and experimentation to find out what people can do easily and what they can do only with difficulty or not at all. The research that raised the low ends of the curves in Figures 4.1 and 4.2 was of this kind. For example, in the database query research, experimental analysis revealed that the majority of the population cannot form logical expressions to describe an object of search even though they can recognize with high accuracy the things they want (Landauer, 1995). The invention that flowed directly from this was a recognition-based query method (a truth table of possible data).
Special forms of focus group discussions. In these, an analyst takes participants step by step through scenarios of increasingly futuristic technologies, at each step eliciting discussion of possible uses, values, and defects. One variant, known as "laddering," involves interviewing people about what they are familiar with in order to get them to think about newer concepts (Reynolds and Gutman, 1988). The result is not necessarily a prediction of utility and market success but the raising of important and often unanticipated issues.
Mock-ups, Wizard-of-Oz experiments, and rapid prototypes. Very often new interfaces, sometimes new functionality, can be given an informative initial exploration by a mock-up using paper, slides, or video and either walking people through scenarios of using the proposed screens and functions, seeing what they understand, and listening to what they say (statements such as, "What's this for?" or "I wish I could do X") or asking them, where "them" is both ordinary potential users and usability experts, what they think of what the system does and has to offer (Lund and Tschirgi, 1991). At the workshop, Bruce Tognazzini showed the "Starfire" vision video produced by Sun Microsystems. Similar to Apple Computer's "Knowledge Navigator" video, "Starfire" gives an enactment of a futuristic scenario in which an anticipated or imaginary technology is being used by people. These videos were designed for various purposes; the status of "Starfire" as an engineering project output illustrates the potential value of such vehicles to inform design (e.g., panels of people, both ordinary and expert, could discuss and critique them; see Tognazzini, 1996).
In the Wizard of Oz technique, a human, often assisted by software, performs some of the functions of an imagined system before it can be built. For example, Gould et al. (1987a) used a human to transcribe spoken input to emulate a much better automatic voice dictation transcription machine than could be built at the time (or now) in order to
study how useful and desirable it would be and to understand some of the design alternatives and parameters that eventually should be implemented. The technique is only good for functions that humans with computer help can emulate, but with ingenuity this can cover a considerable range.
Finally, there are rapid prototypes, perhaps a more familiar idea. Essentially, the technique is to build a system that does only part of what the invention will call for in actual deployment, using much more easily constructed software and, perhaps, much more expensive hardware so that something like the intended technology can be tried with real users before its design (or attempted production) is settled. One especially appropriate target for the discussion here might be advanced voice recognition, graphics, multimedia, or virtual reality interfaces. Early usefulness and usability trials with approximations to such systems using human participants could reveal what functions and features are and are not promising for every-citizen use before great effort and expense are devoted to their realization. For example, at the workshop, Robert Virzi described how GTE progressed from a 1990 case study of how people shop-identifying unmet needs for information about how to find vendors or special sales and difficulties in communication-through design of a service for which the underlying network technology was not adequate at the time to the 1995-1996 Superpages effort to provide on-line national yellow pages, intended to be a basis for broader services supporting communications and information exchange among vendors and consumers. This anecdote illustrates the blending and balancing of studies of user preferences with judgments about whether and how to address them with available technologies.
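The recognition-based query method mentioned under performance analysis can be sketched as follows. The attribute names and the sample record are invented for illustration; the point is that the user checks off rows in an enumerated table of value combinations rather than composing a boolean expression.

```python
from itertools import product

# Hypothetical attributes for searching saved messages (names invented)
attributes = {"from_boss": [True, False], "has_attachment": [True, False]}
names = list(attributes)

# Enumerate every value combination: the user recognizes and checks off
# the rows wanted instead of writing "from_boss AND NOT has_attachment".
table = [dict(zip(names, combo)) for combo in product(*attributes.values())]

# Suppose the user checks the first two rows (everything from the boss,
# with or without an attachment).
selected = table[:2]

def matches(record, selected_rows, names):
    """A record satisfies the query if it agrees with some checked row."""
    return any(all(record[n] == row[n] for n in names) for row in selected_rows)

msg = {"from_boss": True, "has_attachment": False, "subject": "budget"}
print(matches(msg, selected, names))  # True: row (True, False) was checked
```

The design choice is exactly the one the Landauer finding suggests: recognition of concrete alternatives replaces the construction of logical expressions that most people cannot reliably form.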
Beyond Individual Interfaces: Computers As Social Machines
Earlier research on interfaces tended to presuppose a view of human-computer interaction as an exchange involving a single individual performing an independent action by means of a computer. This view, perhaps influenced by the "input-process-output" paradigm, has strongly influenced the design and evaluation of user interfaces. Not surprisingly, that orientation yielded a substantial body of information about the significance of individual differences in ability and prior experience for ease of use and judged and measured usefulness of alternative system designs as well as guides for improved research and evaluation (see above). Now, however, advances in information and communications technologies have created a much more complicated context for the design and evaluation of interfaces for everyday use.
The move toward network access, distributed architectures, and open systems has made this medium a social one: computers provide a means for taking part in social activities, doing collaborative work tasks, engaging in educational pursuits with other learners and teachers, doing commerce, playing group games, contributing to and benefiting from shared information stores, and so on (see Chapters 2 and 5). The good news is that these kinds of distributed architectures and tools come much closer to reflecting the ways people learn, work, socialize, and more generally participate in the varied activities that comprise everyday life. The bad news is that "social" computing adds another layer of complexity to the already difficult design and evaluation issues just summarized. Whether interconnected computers are viewed as media for enhancing interactions among individuals in physical communities or for forming information spaces in which people, representations of people (e.g., avatars), and intelligent agents interact in a built environment (or both), it is likely that new or significantly improved design and evaluation methods will be needed to make such interchange accessible to everyday citizens.
Although user-oriented research in this field is relatively new, it is worth reviewing the main findings that can be expected to figure importantly in the design and evaluation of interfaces for computer-based collaborative activities. The term "groupware" was coined to refer to computer-based technologies that primarily target groups as users (Box 4.1).
Roles, Relationships, and Boundaries
As some of the definitions in Box 4.1 indicate, groupware must accommodate many individuals, not all of whom have the same roles (in contrast to individually oriented research, in which humans interacting with a computer application were assumed to be engaging in the same functions). Typically, roles are captured by rules that attempt to express relationships, permissions, and limits. For instance, in collaborative learning environments the teacher role may be accompanied by some options (e.g., annotating student submissions) that students do not have, or students may be allowed to see and comment on one another's term papers, but only after their own papers have been submitted. In health care, more complicated scenarios are envisioned involving, for instance, the roles of patients, health maintenance organizations, expert systems provided by drug companies, and pharmacists (see position paper by Michael Traynor on-line at http://www2.nas.edu/CSTBWEB).
Social applications of the sort citizens might use to support routine aspects of daily life like education and health care complicate underlying system design issues: How much social knowledge of roles and relationships is appropriately incorporated into application development, and
BOX 4.1 Some Definitions of "Groupware"
Specialized computer aids designed for use by collaborative work groups-Robert Johnson
Software applications that are designed to support ... groups-especially software that recognizes the different roles the users of the application have-Jonathan Grudin
The class of applications arising from the merging of computer plus communications technology. These systems support ... users engaged in a common task and provide an interface to a shared environment-Clarence Ellis
Computer-based tools that can be used by work groups to facilitate the exchange and sharing of information (including user adaptations of individual tools to support group functionality)-Christine Bullen and John Bennett
Loose-bundled collection of multifunction tools in an interactive system ... that are susceptible to use by all the members of a group, plus the user practices that incorporate them into day-to-day service to accomplish group tasks-Tora Bikson
NOTE: The term "groupware," while variously defined, refers to current efforts to make distributed computer technology meet the needs of multiperson groups engaged in varied work or social interactions.
It is important to underscore that nothing about groupware as a social medium invalidates the prescriptions set out earlier for improving the design of novel technologies. Instead, the inference to be drawn is that taking into account the social nature of most citizens' everyday activities and the range of actors and contexts involved makes advance use of task analysis, performance analysis, focus group discussion, and rapid prototyping both more important and more difficult.
Interdependency and Critical Mass
A second differentiating factor is that, unlike computer applications for individual use, groupware not only presupposes but also requires a multiplicity of users for its functionality to be experienced and evaluated. An example is provided in an early study by Grudin (1988) of why collaborative applications fail. He targeted electronic scheduling as the application of interest because market research had consistently found managers and professionals saying that it would be highly desirable to let their computers handle the tedious task of arranging appointments and meetings automatically, once provided with information about when relevant individuals were available; but when such applications were installed in organizations, they were rarely if ever used.
The reason for the discrepancy between anticipated and actual use is instructive and turns mainly on the interdependent nature of social applications. The payoff from programs that schedule meetings or appointments depends on the proportion of potentially relevant users who in fact use them; if a "critical mass" of people whose calendars are affected do not use the program, others derive no benefit from entering and updating their own schedule information and so they soon stop. The scheduling program thus imposed user costs (in the form of added tasks) without generating the expected benefits. In this way it differs from independent technologies that yield individual benefits to those who adopt them, regardless of whether there are other users.9
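The critical-mass dynamic Grudin describes can be made concrete with a small threshold model (the population and thresholds below are invented for illustration): each person keeps using the scheduler only if the fraction of colleagues already using it meets their personal benefit threshold, and adoption is iterated to a fixed point.

```python
def equilibrium_adoption(thresholds, initial_fraction):
    """Iterate a simple threshold model of interdependent adoption:
    a person uses the shared scheduler only if the current fraction
    of users is at least their personal benefit threshold."""
    frac = initial_fraction
    for _ in range(100):  # converge to a fixed point
        new = sum(t <= frac for t in thresholds) / len(thresholds)
        if new == frac:
            break
        frac = new
    return frac

# Hypothetical office of 10: two enthusiasts, eight who need half the
# office on board before the shared calendar pays off for them.
thresholds = [0.0, 0.0] + [0.5] * 8

print(equilibrium_adoption(thresholds, 0.3))  # below critical mass -> 0.2
print(equilibrium_adoption(thresholds, 0.6))  # above critical mass -> 1.0
```

The two runs show the qualitative point in the text: the same application either collapses to its few enthusiasts or saturates the office, depending entirely on whether initial use reaches critical mass.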
The same interdependency characterizes shared discretionary databases (Markus and Connolly, 1990) and computer-based communications systems (Anderson et al., 1996). More generally, while the design and development costs are borne up front, the value of interdependent applications is apparent only later, after a substantial portion of the intended user community engages in their actual use. Besides necessitating a more careful understanding of the contexts of use in the design of groupware applications, these considerations underlie the "extreme difficulty" of evaluating them (Grudin, 1988). They also illustrate how broader usability involves not only the user interface per se but also the social context and the overall service and what these imply for interfaces.
Although the more recent growth in sales and apparent use of groupware products such as Lotus Notes suggests that progress has been made, the evolving NII also raises the prospect of far larger numbers of people interacting than has been experienced to date. Sheer numbers, plus variations between groups of people who interact on a sustained basis and those who come together on an ad hoc basis, and variations on what people will do with such facilities beyond the current business environment of their uses, are among the emerging technical challenges that may affect interface design.
Iterative Test and Redesign for Social Applications
Earlier sections of this chapter made it clear that, while there are problems with these methods, iterative test and redesign procedures are effective known ways to improve system usefulness and usability. This thesis holds true for interdependent as well as independent applications, but the deployment of effective test-and-redesign procedures is more complicated for social applications.
In the first place, whatever methods are chosen, they will need to involve trial users playing the interdependent roles relevant to the application in sufficient number and over sufficient time to exercise, assess, and redesign the varied functions that the application is supposed to support. This consideration by itself suggests that iterative test and redesign of groupware may take longer and cost more than the same methods applied to independent applications if sufficient ingenuity is not brought to bear on the evaluation methods, for example, by embedding usefulness and usability analysis in the instrumentation of experimental designs offered over the World Wide Web.
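One form such ingenuity might take is having the trial system itself collect the evaluation data. The sketch below is a minimal, hypothetical illustration (the class and action names are invented): user actions are timestamped as they occur, so task times and abandonment points can be mined after a networked trial rather than observed live.

```python
import time

class UsageLogger:
    """Minimal sketch of embedded usability instrumentation: record
    timestamped user actions during a networked trial for later analysis."""

    def __init__(self, clock=time.time):
        self.clock = clock      # injectable clock, useful for testing/replay
        self.events = []        # list of (timestamp, user, action) tuples

    def log(self, user, action):
        self.events.append((self.clock(), user, action))

    def task_time(self, user, start_action, end_action):
        """Elapsed time between a user's first start and first end action."""
        first = {}
        for t, u, a in self.events:
            if u == user and a not in first:
                first[a] = t
        return first[end_action] - first[start_action]
```

In a real trial the log would be shipped to a server and joined with survey or outcome data; the point is simply that usefulness and usability measurement can ride along with deployment instead of requiring a separate laboratory session.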
Second, getting good answers to design questions depends on having both realism and control in user trials, and there can sometimes be a tension between them (cf. Mason and Edwards, 1988). Many human-computer interaction studies, for example, have relied on laboratory experiments that presented a computer-based task to individual subjects and measured their performance. While strong on experimental control, such studies are typically weak in realism. In particular, they do not account for the influence of the varied contexts (environmental and social) within which computer-based tasks are usually situated, and they measure performance by means of variables that assume noninterdependency among users. On the other hand, achieving realism means having an adequate scope (enough users and uses) and a realistic time frame as well as an appropriate task context for judging an application's usability and usefulness. Typically, such realism is achieved at the expense of control, in a one-shot field trial (not infrequently, the user group is the development department, as noted above); outcomes do not generalize beyond the unique case. Further, the tensions between realism and control are heightened when social applications are the evaluation focus.
The techniques outlined above for assessing the likely value of a new technology improve evaluation in part by trying to join realism and control, and their extension to social applications is promising (e.g., Olson and Olson, 1995). In addition, better-designed laboratory studies (e.g., Kiesler et al., 1996) and field experiments (e.g., Bikson and Eveland, 1990), as well as innovative approaches to research on the implementation and use of computer-based media over time in nondirected real-world cases
(e.g., HomeNet (Kraut et al., 1996); Blacksburg (Va.) Electronic Village project (http://duke.bev.net)), are helping to address the need for both realism and control in evaluating social applications. However, as the size of user groups increases-as reference to "every citizen" suggests-some participants were not certain about how well any of these approaches would scale up. (These concerns surface above in discussions of social-interest applications of the NII.) Several workshop participants noted that once one moves beyond a focus on personal computers as the access device and considers all manner of devices-telephones, television remote controls, and so on, as well as embedded systems-the problems and opportunities add up to a very large set.
Inherent Unpredictability of Use
For reasons both practical and theoretical, predicting the performance of social applications in real-world use on the basis of prior research is inherently difficult. Practically speaking, cut-and-try or design-and-fix methods-those most likely to yield accurate results-are least likely to be employed for social applications because of the time frame and scope of uses they entail, as suggested above.
In theory as well, groupware uses are hard to anticipate because they are embedded in a social system that exerts effects quite independently of the technology. For instance, the social system of work had a significant influence on the automatic scheduling applications studied by Grudin (1988; see above). Managers, who most often called meetings, were most likely to have secretaries who kept their on-line calendars up to date and handled "workarounds" by phone when others had not put their schedules on-line; so managers benefited from the application but experienced none of its burdens. Lacking secretaries, professional users experienced all of the burdens but few of the benefits and soon gave up on it (Grudin, 1988). There had been several task analysis studies of the problems and promise of schedulers, and many had predicted just the problems Grudin cites, but they were ignored by, or unknown to, the proponent designers.
As Markus and Connolly (1990) have pointed out, managers sometimes solve these kinds of problems simply by mandating the use of an application. In turn, however, clever professionals respond by gaming the system so that what appears in the on-line calendar is what is most convenient or most socially desirable, regardless of the actual status of the individual's time commitments. Similar results have been reported for use of shared databases by Patriotta (1996). These outcomes, reflecting interventions by the social system, are even more removed from expectations based on untested designer intuitions. It should be emphasized that
social "reinventions" of technology are not necessarily negative; on the contrary, research literature provides a great many instances of user-based improvements (e.g., Bikson, 1996; Orlikowski, 1996). The point, rather, is that unpredictability inevitably characterizes the use of groupware because of the reciprocal adaptations of the technology and the social context in which it is situated.
Implementation as a Critical Success Factor
Implementation, construed as the complex series of decisions and actions by which a new technology is incorporated into an existing context of use, assumes critical importance as a success factor for groupware given the reciprocal influence of social and technical sources of effect cited above. During implementation, the new technology must be adapted to work in particular user settings even as users must learn new behaviors and change old ones to take advantage of it. At the workshop, Sara Kiesler cautioned that experiences related to the performance of specific tasks (e.g., by telephone operators) will not necessarily generalize to the larger NII. Specific tasks tend to be tightly delimited and jobs of the performers in typical studies depend on their use of the system; in the NII, in contrast, there is a huge variety of tasks, a huge variety of users, and the users have more choice in what they do and how. Walter Feurzeig, of BBN, argued that it is nevertheless difficult to consider user interfaces independent of specific activities. Sara Czaja, for example, drew from her work in medical trauma settings to emphasize that real experience in real contexts is necessary to understand interface needs at, for example, physician workstations. Help features of the system and user training as well as modifications of the application and changes in users' behavior, for example, affect the course of implementation.
Current research on work group computing corroborates the conclusion that the effectiveness of the implementation process itself has a substantial impact on the usability and usefulness of social applications somewhat independently of their design features (Mankin et al., 1996; see also the literature reviews in Bikson and Eveland, 1990). The vital role of implementation also emerges as a salient factor in the life of new civic networks, according to their administrators (see Anderson et al., 1995).
Nonetheless, evaluation efforts frequently target technology design as it bears on specific functions, leaving implementation processes and related features (e.g., help screens, on-line tutorials, user manuals) out of account in attempting to predict use. Further, although it is clear that many desirable changes in social technologies cannot be anticipated before their deployment in specific user settings, these applications are not usually designed with a view toward ease of modification either by end
users or by service providers who maintain end-user systems. On the contrary, desires on the part of end users or those who provide information technology assistance are usually regarded with suspicion by designers and developers (Ciborra, 1992).
Given the significant variation in uses, users, and user contexts represented by everyday citizens, along with serious questions about how their NII-based interactions can be supported, such implementation issues merit considerable attention.
Directions for Improvement
For reasons like those reviewed here, it is manifest that systems intended for use by communicating social groups-including large populations-raise many kinds of questions that individual applications do not. The design and evaluation techniques appropriate for individual applications need to be extended or supplemented with approaches more suitable for the envisioned NII environment. While there is not a large body of empirical work on which to draw for this purpose, research on computer-supported cooperative work and technologies for collaboration yields suggestive directions for improvement. Some promising approaches are summarized below.
Involve Representative Users in Substantive Design and Evaluation Activity Early and Often
Participatory design is difficult to arrange, as noted above, and so more likely to be slighted. The goal is to understand how interfaces to connected communities may prove more than skin deep, how they may affect how we locate and remain aware of one another and find shared information, as well as how we understand, enact, and track our roles in group activities, recover from errors, merge our work with others, and so on.
An illustrative example comes from an exploration of how new technologies could assist wildlife habitat development by the U.S. Forest Service. To support wildlife habitat protection, forest service teams needed an interface to varied databases (e.g., about soil, vegetation, water quality, forest wildlife) that would permit different experts literally to overlay their views of a geographic territory on a shared map, create and manipulate jointly devised scenarios, and observe the results. The design of such an application required the participation of users with specialized domain expertise from its inception to its evaluation in field trials. NII-based applications envisioned for ordinary use (see Chapter 2) are no less complex and are similarly likely to require participatory design with representative users; offering lifelong learning, continuing education, or targeted
training, for instance, or delivering selected health services on-line, are cases in point.
In these and other social applications, methods for design and evaluation that discover and fix problems before they are widely promulgated are especially important. Many workshop participants believe these needs are particularly acute in areas, such as education and health care, that are now being eagerly promoted and anticipated for NII applications. One obvious approach is to conduct field trials with smaller than universal, but still representative, population samples; this procedure is as yet seldom followed. Often, as workshop participants noted, experts, both system designers and such specialists as speech or occupational therapists, may play the role of representative users; sometimes a think-aloud approach is used in which users comment on their experiences as they use a system. A related question is simply how to design and evaluate with the full range of the population in view, rather than drawing on the educated middle-class citizens who have constituted the potential or actual computer user samples typically studied in the past.
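The "smaller than universal, but still representative" sampling idea can be made concrete with proportional stratified sampling, in which each population stratum contributes to the trial panel in proportion to its size. The sketch below is illustrative only; the strata (age group crossed with computer experience) and the population counts are hypothetical, not drawn from the report:

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a sample in which each stratum appears in (nearly) the same
    proportion as in the full population (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        # Each stratum gets a share of the sample proportional to its size.
        share = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample

# Hypothetical population frame: (age group, computer experience).
population = ([("18-34", "novice")] * 300 + [("35-64", "novice")] * 450 +
              [("65+", "novice")] * 150 + [("18-34", "expert")] * 100)

# A 50-person trial panel mirroring the population's composition.
panel = stratified_sample(population, lambda p: p, sample_size=50)
```

Rounding each stratum's share can leave the panel a person or two off the target size in general; real survey designs correct for this and often deliberately oversample small strata of special interest, such as users with disabilities.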
Expand the Repertoire of Research Methods to Be More Inclusive and Innovative
There is a pressing need for social-psychological, sociological, and organizational research into how innovation, development, and implementation processes should be arranged and managed so that the goal of every-citizen utility is effectively pursued. Issues like those raised above clearly require techniques for research with large populations, for instance, by survey methods or perhaps sampled observations; as yet there is little experience in the use of these techniques for design and evaluation of large networked social applications. In discussing the prospects for instrumenting various systems, an interesting opportunity broached by participants was to use the Internet itself to conduct experiments and surveys, to record usage data (in anonymized ways) stratified by user categories and applications, and to assess the properties of emerging social networks (for examples, see Eveland et al., 1994; Huberman, 1996; Eveland and Bikson, 1987; Dubrovsky et al., 1991; Finholt et al., 1991; and Kraut et al., 1996). Practical issues may relate to protection of user privacy and to the nature of actual user populations (e.g., early adopters of the Internet may not be representative). Thus, consideration of how to get back good information is itself a research issue. Trials and assessments of the suitability of these and other design-and-evaluation techniques for large and widely varying populations would be very worthwhile.
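One way to "record usage data (in anonymized ways) stratified by user categories and applications," as the participants suggested, is to replace user identifiers with a one-way salted hash before anything is aggregated or stored. The sketch below is a hypothetical illustration; the salt, category names, and event format are invented for the example, not taken from any cited study:

```python
import hashlib
from collections import Counter

SALT = b"rotate-this-secret"  # hypothetical deployment secret, never logged

def anonymize(user_id):
    """One-way salted hash: repeat visits by the same user stay linkable,
    but the stored record cannot be traced back to a named person."""
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:16]

def tally(usage_events):
    """Aggregate (user_id, category, application) events into per-stratum
    counts; raw identifiers are discarded before anything is retained."""
    event_counts = Counter()
    users_seen = {}
    for user_id, category, app in usage_events:
        stratum = (category, app)
        event_counts[stratum] += 1
        users_seen.setdefault(stratum, set()).add(anonymize(user_id))
    return {s: (event_counts[s], len(users)) for s, users in users_seen.items()}

events = [("alice", "novice", "email"), ("bob", "novice", "email"),
          ("alice", "novice", "email"), ("carol", "expert", "web")]
summary = tally(events)
# summary[("novice", "email")] == (3, 2): three events from two distinct users
```

A salted hash is only a first step; genuinely protecting privacy in large-scale instrumentation also requires care about salt management, aggregation thresholds, and data retention, which is part of why the committee treats "how to get back good information" as a research issue in itself.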
Consider Ways to Minimize the Separation of Design and Evaluation from Implementation and Use
On the one hand, new computer-based technologies continue to emerge in the market at an incredibly rapid pace, and this trend will only be accelerated by population-wide access to the NII. On the other hand, recommendations to use methods for research with representative population samples to ensure the usefulness and usability of social applications before their implementation and use seemingly entail a much more leisurely pace for innovation. This dilemma suggests that it might be worthwhile to reconceptualize as concurrent or overlapping processes the traditional linear sequence from design, iterative trials, and redesign to implementation, use, and inevitable user "reinvention."
This suggestion draws, in part, on the concurrent engineering model; in bringing together the designing and building stages of technology development, it reduced the total time involved while enabling designers and engineers to learn more from one another in the course of coproduction. It also builds on rapid prototyping approaches that draw no sharp boundaries between prototype trials with representative users, field pilot projects, and early-stage implementation processes (e.g., Seybold, 1994; Mankin et al., 1996). Finally, it takes into account the infeasibility of "getting it right the first time" as a guiding principle for NII applications. As virtually every study of communicating social applications has shown, these technologies are invariably modified in use in ways that respond to user contexts, changes in skills or task demands, and changes in the suite of applications with which they must be integrated. That is, the application should not be viewed as "finished" or static just because it has left the developer's world (Bikson, 1996).
New back-end technologies (e.g., client-server architectures, middleware) make it possible to keep the infrastructure or platform in place while delivering, updating, and supporting new tools and applications in user environments over a network. This is the principle behind new efforts to conduct product beta-tests via the Web, as noted earlier. Given the desirability of involving greater numbers of representative users in application design and evaluation as well as field trials and implementation, and given the capability of networked systems to enable both the provision of usable prototypes and the collection of user feedback, it would be desirable to explore options for leaving applications intentionally underdesigned, to be adaptively developed as they are implemented in contexts of use (see Box 4.2).
BOX 4.2 Toward Informed Participation
Technology that genuinely supports informed participation will be inherently democratic and adaptable. It will allow us to take advantage of our social diversity and not force us to conform to the limits of our limited foresight.
The philosophical model for understanding knowledge acquisition and the communication of information holds at least three primary lessons for anyone designing or deploying information systems for groups of people, as follows:
Focus more on relationships than things. Information technology can and should change relationships among people; that is where its chief value lies. Information technology that changes the nature of relationships can change the fundamental features of a given complex system.
Honor "emergent behavior." The new theories of complex adaptive systems hold that the adaptability of any system greatly depends on the "genetic variance," or pluralism of competing models, within it. Therefore, information technology should allow the emergence of competing agents (or models or schema) and enhance their interrelationships.
Underdesign systems in order to let new truths emerge. It is a mistake to set forth some a priori notion of truth or to try to design in totality (which requires an infinite intelligence in any case). Rather, one should underdesign a system in order to assist the emergence of new ideas.
The brilliant logic of an underdesigned information system is well illustrated by the constitutional and cultural principles espoused by Thomas Jefferson, one of the preeminent information architects of all time.
SOURCE: Brown et al. (1994).
Consider the Prospect of Research-based Principles for Design
Regardless of the perspective taken, the bottom line is that what we know now about evaluation and design methods is not good enough to meet the challenges presented by every-citizen applications in an NII context.
Although there are good methods and techniques available for evaluating ideas and systems for individuals at all stages of development and providing tests of usability and guidance for design, none of the workshop participants thought that evaluation methodology was a solved problem. Although a few comparative studies have been made of some of the different methods in use-user testing, heuristic evaluation, cognitive walkthroughs, scenario analysis, ordinary and video ethnography-these studies have not reached any unequivocal conclusions; indeed, there is active controversy about their relative advantages. This is an area in
which more, and more systematic, research would almost certainly have great impact. Some of the current evaluation methods are orders of magnitude more expensive in terms of time and money than others, often prohibiting their use and often inhibiting the use of any evaluation, yet we do not know for sure whether they reliably produce better, or even different, information or result in better or different products. Such research should, of course, also be aimed at finding better methods. In particular, research is needed on what kinds of evaluation give not just summative quality estimates but also useful formative guidance that leads to better design.
These kinds of problems and uncertainties about evaluation techniques and methods lead naturally to reawakened interest in the prospect of research-based principles for design. It has often been hoped by the scientists and technologists involved, and perhaps even more often by their managers, that the design of useful and usable interfaces could be based on theory, engineering principles, and models rather than sheer cut-and-try and creativity. There have been some modest successes along this line. As mentioned earlier, there are models of the perceptual-motor processes involved in operating an interactive device that can predict the times required with useful accuracy. So far, these have had their greatest utility in the design of computer-based work tools where large numbers of people will do the same operations large numbers of times so that small savings in time will add up to large savings in money. In addition, there are some models and means of analyzing and simulating the cognitive operations of users of complex computer-based systems that are often capable of yielding important insights for design or redesign (e.g., Kieras and Polson, 1985; Kitajima and Polson, 1996; Olson and Olson, 1995; Carroll, 1990). And there are a dozen or so basic principles from experimental, perceptual, and cognitive psychology that can be put to work on occasion by insightful experts. However, for everyday guidance about the design of everyday interfaces and functions for every citizen, current science and engineering theory are of little help. One reason is that both the human and the potential computer-based agents involved, and especially their combination, are extremely complex dynamic systems of the sort that are not often reducible to practical closed-form models. 
They appear to be more like the phenomenon of turbulence that plagues airframe design or the chaos that confronts weather prediction than they are like the design of circuits; they are matters in which test-and-try is unavoidable. It is often mystifying to usability professionals that testing is resisted as strongly as it is and that calls for doing principle-based design are so frequent in this arena, when practitioners and managers concerned with other complex dynamic systems (even electronic circuits and software) can easily see the need and strongly support empirical methods.
This hope of avoiding test-and-fix methods is astonishingly persistent. For example, there is a myth in circulation that the Macintosh interface, which for certain basic functions has demonstrated large usability advantages over its predecessors, was accomplished without user testing. The truth could not be more different. At Apple Computer, the Macintosh interface was developed originally for the Lisa computer, building, in turn, on the highly structured design and testing process for the Xerox Star system. During its development, it was subjected to an exemplary application of formative evaluation, involving nearly daily user testing and redesign. Moreover, the graphical user interface (GUI) components of the Macintosh interface can be and have been combined in ways that do not produce superior results, while some old-style command-based applications that have been iteratively designed are just as usable as the comparable Macintosh-style GUI applications (see Landauer, 1995, for a review and examples).
While research on both the fundamental science of human abilities and performance and the engineering principles for better usability certainly could be highly worthwhile in the long run if adequately pressed, progress to date has been slow, and a principle-based approach probably cannot be counted on to underwrite the design of effective every-citizen interfaces in the near term. On the other hand, many of the scientists who have worked on these problems believe that attempting to understand the issues involved in the interaction of people with computer-based intellectual tools and with one another through these tools offers an excellent laboratory for studying human cognition. The problems posed, and the nature of the response of the world to what a human does, can be controlled much better in this environment than, say, in a classroom, and yet are much more realistically complex than in the traditional psychological laboratory. Moreover, the end-result test, making interactions among and between humans and computers go better, requires not just piecemeal modeling but also complete understanding, an especially useful criterion in studying human cognition and communication that can take so many new forms and functions. Thus, more support (of which there is currently very little) of basic human-computer interaction research, especially at the level of the cognitive and social processes involved, could be quite valuable as science.
In concluding this discussion, the steering committee notes that some technologists, economists, and others have expressed the belief that problems of usefulness and usability are sufficiently solved by market competition and that, in particular, most earlier problems with user productivity
have been overcome. There is indeed some anecdotal evidence that large software producers are paying more attention to these matters, and with good effect. For example, a report from Microsoft (Sullivan, 1996) describes iterative user-interface design efforts for Windows 95 that followed prescriptions for interface development suggested by recent research (e.g., Nielsen, 1993; Landauer, 1995; Sawyer et al., 1996). As prior research has found, user test results showed a gain of approximately 50 percent in user task performance efficiency as a direct result of usability engineering activities.
Several lessons can be taken from this and recent, similar reports. First is the encouraging sign that assessment-driven design is being applied to significant projects and that it is working. A more cautionary lesson, however, is the authors' report of how narrowly the Microsoft project escaped neglect of assessment on several occasions, and how serious the consequences would have been. In moving the interface design from that of the immensely popular Windows 3.1 and 3.11, the team reported, it had originally believed that, because the previous interface was so well evolved and so successful in the marketplace, only small evolutionary changes based on known flaws, user complaints, and bug reports would be needed. However, early direct user tests and observations "surprised" the team into a realization that many critical problems could be solved only by a complete redesign and that many opportunities existed for significant innovative improvements that market response had not suggested. By the time the product was delivered, hundreds of flaws deemed worth remedy had been found and several provably important innovations were incorporated. Throughout the development, the team continued to be surprised both by how poorly features and functions previously thought good actually performed and by how poorly newly proposed fixes often turned out on actual test.
The point here is that the prior interface from the same source, the most "advanced" Windows project, was still, in the mid-1990s, very far from optimized and there was still room for dramatic improvement based on explicit assessment-driven usability engineering. The fact that computer hardware has become much faster and more capacious (and software commensurately larger and more highly featured) does not in the least ensure that usefulness and usability of applications have improved; indeed, the effect is often the opposite. Thus, it seems certain that there will continue to be opportunities for major improvements in the design of interfaces for some time to come, especially in the many new and so far very sparsely evaluated mass network-based applications for social activities.
Meanwhile, another complementary question needs to be answered. Windows 95 got the evaluation attention it needed, but no one knows
how many other products are or are not profiting from formative evaluation. One bit of suggestive evidence comes from informal analysis of the same publication in which the Windows 95 results were reported, the Proceedings of the 1996 ACM Conference on Human Factors in Computing Systems, CHI96. This is the major organ in which work on interface development and research is first published. Among 67 articles in the 1996 issue, of which over half describe newly developed or modified interface designs, only one of every six articles reports any kind of serious user testing or observation. This small proportion is not significantly different from the numbers reported for relevant publications in the 1980s (Nielsen and Levy, 1993). Thus, it appears that progress toward better interfaces still has plenty of scope for greater application of this well-established methodology. Also of interest, about one-sixth of the papers at CHI96 were directed toward network interface applications, and another sixth were about research on general interface components that might be used in the future: the kind of science research toward principled design many workshop participants thought should be better encouraged.
As mentioned above, it could be hypothesized that greatly increased beta testing made possible by World Wide Web dissemination of software has reduced the need for explicit evaluation. There may be some truth in this hypothesis in that many (but far from all) of the flaws and remedies discovered in usability engineering efforts come from trial user comments. On the other hand, as mentioned, World Wide Web beta testing is suspect as a usability design methodology because it gets information primarily from relatively expert, relatively heavy early-adopter users, those willing and able to try faulty versions (the average untested application interface has 40 flaws, according to Nielsen and Levy, 1993) of unproved things, people who are certainly unrepresentative of the target audience of this report. In addition, the Web has produced an explosion of new software that is often the result of extremely rushed, frequently amateurish, design efforts. Indeed, some usability experts think that much current Web-based software, and most home pages, have reintroduced long-recognized, serious design flaws (e.g., untyped hypertext links, missing escape and backout capabilities, and lengthy processes and downloads about which users are not warned) and that Web dissemination may have promulgated and institutionalized more avoidable problems than it has fixed. Requiring the public to weed through the technology, by involuntary exposure to a welter of bad applications, does not seem a desirable strategy for rapidly bringing every citizen happily on-line. Research is needed to determine whether, in fact rather than impression, recent trends in software development, such as World Wide Web beta testing and increasing speed of development cycles, are making things better or worse.
1. According to Cynthia Crossen in the Wall Street Journal (1996, pp. B1, B11) "Not even computer industry executives can explain the illogic of the modern keyboard ... a device jerry-built from technology as old as 1867 and as new as this year. Because there has never been an overarching plan or design, [it] defies common sense. Its terminology is inscrutable (alt, ctrl, esc, home), and the simplest tasks require memorizing keystroke combinations that have no intuitive basis."
2. Today's elegant cellular phone interfaces emerged after a period of what some observers deem excessive feature creep. See Virzi et al. (1996).
3. A Reuters business information survey of 1,300 managers reported complaints about stress associated with an excess of information, fostered by information technology (King, 1996, p. 4).
4. See, for example, Munk (1996). She reports estimates that 27 percent ($3,510) of the $13,000 annual cost of a networked personal computer goes for providing technical support to the user, and writes, "There's a Parkinson's Law in effect here: computer software grows to fill the expanded hardware. This is not to say that all the new software isn't useful; it often is. But not everybody needs it. For mundane uses, the older software may, paradoxically, be more efficient" (p. 280).
5. In addition to instances of software for scientific and engineering applications, current popular examples, such as the World Wide Web and assorted approaches to electronic publishing, derived from efforts of technical users to design systems to meet their own needs.
6. Gould et al. (1987b) notes that equivalent reading speed for screens and for paper depends on high-resolution antialiased fonts, an element of output display (see Chapter 3).
7. A meaningful approach to computer literacy, including essential concepts and skills, is the focus of an anticipated Computer Science and Telecommunications Board project.
8. The Telecommunications and Information Infrastructure Access Program (TIIAP), run by the National Telecommunications and Information Administration, funds diverse public-interest (including government services-related, educational, library, and other) information infrastructure projects that would form a natural platform for evaluation if funding were sufficient. See O'Hara (1996, p. 6).
9. For independent innovations, "early adopters" were regarded as having a competitive advantage over those still using older technologies; for interdependent innovations, early adopters do not achieve full benefits from the new technology until the late adopters come on board (Rogers, 1983).