Evaluation of Synthetic Environment Systems
Many questions arise about design, performance, usability, and cost-effectiveness as a new system progresses from inspiration to realization. A carefully chosen program of evaluation studies, conducted throughout the development and operational life of a system, can help provide the answers to such questions, as well as reduce development time and minimize the need for expensive design changes. To the extent that synthetic environment (SE) systems are based on a family of new technologies configured in new ways to perform new functions, the need for evaluation studies becomes especially important. In this chapter we first outline a variety of approaches to evaluation and identify some key issues in the evaluation of systems in general, whether or not they are SE-based. We then comment on special issues in the evaluation of SEs, including the current tendency to ignore, or at least minimize, the evaluation problem.
GENERAL ISSUES IN SYSTEM EVALUATION
There are many practical reasons why evaluation studies should be conducted. At the outset of development, they can be used to refine the requirements for a system and to compare design concepts. (For simplicity, and unless stated otherwise, we use the term system here to refer to components and subsystems of SE systems as well as to complete SE systems.) Once a design has been chosen, evaluation studies can be used to diagnose problems and suggest alternative approaches. If appropriate, the results of these formative evaluations can also be used to facilitate
communication with a sponsor or a customer. To be useful, the findings from such formative studies must be timely; this may require that scientific rigor and precision be traded for speed. Accordingly, rapid prototyping and simulations are often used to provide representations of the elements to be examined in formative evaluations.
Once a system has been developed, a summative evaluation can be used to measure the capability of the system to fulfill its intended function, to compare its performance with that of alternative systems, and to assess its acceptance by intended users. For example, the training effectiveness of a virtual environment (VE) training system might be compared with that of a conventional training simulator. Quantitative measures of training performance and training transfer, together with pooled expert ratings and judgments about the user-friendliness of the system, would all have a role in such a summative evaluation. Taken together, formative and summative evaluations provide a critique of a system over its entire life-cycle. Finally, evaluation studies can be used to estimate the cost-effectiveness of an SE system in performing a particular application function. The results of these studies can then be used to inform policy decisions about investment and production.
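One widely used quantitative measure of training transfer in the training-research literature is the transfer effectiveness ratio (TER), which expresses the savings in real-task training produced by each unit of simulator training. The sketch below is illustrative only; the function name and the trial counts are our own, and a real evaluation would also report confidence intervals over groups of trainees.

```python
def transfer_effectiveness_ratio(control_trials, experimental_trials,
                                 simulator_trials):
    """Transfer effectiveness ratio (TER).

    control_trials:      real-task trials to criterion with no simulator training
    experimental_trials: real-task trials to criterion after simulator training
    simulator_trials:    simulator trials invested by the experimental group

    A positive TER indicates positive transfer; larger is better.
    """
    return (control_trials - experimental_trials) / simulator_trials


# Illustrative numbers: the control group needs 20 real-task trials to
# reach criterion, the simulator-trained group needs only 14 after 10
# simulator trials, so each simulator trial saved 0.6 real-task trials.
ter = transfer_effectiveness_ratio(20, 14, 10)
print(ter)  # 0.6
```

A TER near zero would suggest the VE trainer adds little beyond real-task practice, which is exactly the kind of finding a summative evaluation is meant to surface before an investment decision.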
The specific type of evaluation to be conducted will depend, of course, on the characteristics of the system to be evaluated as well as the purpose of the evaluation. Evaluations vary along at least five dimensions.

One dimension, mentioned in the preceding paragraph, concerns the extent to which the purpose of the evaluation is to guide system development or to determine various characteristics of a system that has already been developed (e.g., related to overall performance, cost-effectiveness).

A second dimension concerns the amount and type of empirical work involved. The empirical component of the evaluation can be restricted to pure observation (of how the system performs, how the user behaves, how the market reacts); it can involve surveys in which system users and other individuals affected by the system are asked questions; or it can involve highly structured and controlled scientific experiments. Under certain conditions, evaluations can be conducted without any empirical work at all and be based solely on theoretical analyses, for example, when appropriate models are available for describing all relevant components and performance can be reliably estimated solely by human or machine computation.

A third dimension concerns the extent to which the item being evaluated constitutes a whole system or just one of the components in the system of interest. As one would expect, in most cases the evaluation of a system component is much simpler than the evaluation of the whole system (particularly when the whole system involves a human operator).

A fourth dimension concerns the extent to which the evaluation is analytic, in the sense of providing information on how the performance of the whole system is related to the performance of various components and subsystems. Obviously, analytic evaluations play an exceedingly important role in guiding the development of improved systems.

A fifth dimension concerns the distinction between static and dynamic tests: whereas static evaluation methods focus on nonperformance attributes of a system, dynamic methods focus on performance attributes. General background on types and methods of evaluation can be found in Meister (1985).
SPECIAL ISSUES IN SE EVALUATION
As discussed throughout this book, the creation of SE systems draws on previous work in, and provides research and development challenges to, a wide variety of established disciplines, including computer science, electrical and mechanical engineering, sensorimotor psychophysics, cognitive psychology, and human factors. In each discipline, the requirements associated with creating cost-effective SE technology raise new questions that call for evaluations within the context of research and development. In general, evaluation studies of SEs and SE technology are needed to help ensure that: (1) the perceptual and cognitive capabilities and limitations of human beings, as well as the needs of the specific tasks under consideration, are being used as driving criteria for system design; (2) hardware and software deliver SEs in a cost-effective manner; and (3) SE applications represent a significantly better way of doing old things or of doing new things that have never before been possible.
Despite the clear need for evaluation in the SE area, the types and amounts of evaluation currently taking place in this area are rather limited (a brief review of the limited work on performance measurement in teleoperator systems is provided in Chapter 9; three highly experimental studies on VE are described in Chapter 12). This is undoubtedly due, at least in part, to the high level of enthusiasm that exists about what the technology is likely to be able to accomplish, as well as the belief among many individuals in the SE field that no special evaluation efforts are required. According to this belief, the informal evaluations that take place more or less automatically as one is developing a system and the evaluation evidenced by the degree of acceptance in the marketplace (i.e., a system is good or bad according to whether it is used or not used) are sufficient. Although these forms of evaluation are necessary, the committee does not agree that they are sufficient; the cost-effectiveness of the research and development is likely to be significantly increased if the task of evaluation is taken more seriously. Although many existing evaluation tools can be adapted for use with SEs, a variety of new tools will need to be developed in order to evaluate the unique properties of this technology. In the following paragraphs, we comment briefly on some of the
considerations that are relevant to the design of an analytic evaluation of an SE system. We discuss evaluation of SE system characteristics and issues that arise in observing and measuring human behavior in SE systems.
Evaluation of System Characteristics
Perhaps the first general evaluation task consists of measuring the physical characteristics of the SE system and considering how these characteristics relate to those of the prospective human user. Thus, for example, the characteristics of the displays and controls (dynamic ranges, resolutions, time lags, distortions, internally generated noises) should be measured and compared with the sensorimotor capabilities of humans as determined from psychophysical studies. Ideally, some kind of metric of physical fidelity should be developed that takes account not only of the fidelity that exists in all the relevant sensorimotor channels, but also of the extent to which this fidelity falls short of the maximally useful fidelity and of the implications of this shortfall for overall system performance. Another portion of the evaluation effort should focus on an analysis of the task to be performed by the SE system and an examination of how well the system is designed to perform this task. Such an evaluation would take account of physical fidelity and how such fidelity is expected to influence performance on various task components, as well as more central issues, such as the degree to which the SE system has been designed to anticipate the user's intention by examination of, and extrapolation from, the control signals. A further set of related issues can perhaps be best grouped under the heading of "cognitive fidelity." Such issues arise in connection with interface design and the interaction metaphors employed in this design, as well as the structure and function of the machine (computer or telerobot) to which the human is interfaced. The classic notion of stimulus-response compatibility is a pale and restricted version of the cognitive fidelity factors that require consideration in an analytic evaluation of an SE system.
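As a minimal sketch of what such a physical-fidelity shortfall metric might look like, the function below expresses each measured channel characteristic as a fractional shortfall relative to an assumed "maximally useful" requirement. All channel names and threshold values in the example are hypothetical placeholders chosen for illustration, not established human-factors limits.

```python
def fidelity_shortfall(measured, required, higher_is_better=True):
    """Fractional shortfall of a measured channel characteristic
    relative to its maximally useful value (0.0 means no shortfall)."""
    if higher_is_better:
        return max(0.0, (required - measured) / required)
    return max(0.0, (measured - required) / required)


# Hypothetical display measurements vs. assumed requirements.
channels = {
    # name: (measured, required, higher_is_better)
    "update_rate_hz":    (20.0, 60.0, True),
    "latency_ms":        (90.0, 50.0, False),
    "field_of_view_deg": (80.0, 60.0, True),
}
for name, (m, r, hib) in channels.items():
    print(name, round(fidelity_shortfall(m, r, hib), 3))
```

A composite fidelity metric of the kind the chapter calls for would then have to weight these per-channel shortfalls by their expected impact on performance of the particular task, which is where the task analysis enters.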
Observation and Measurement of Human Behavior in SEs
In addition to evaluating basic system characteristics, it is important to evaluate overall system performance and how well the system performs the tasks for which it was designed. Less obvious, but also of considerable importance, is the need to examine the behavior of the human operator in the SE system. The most obvious method for accomplishing this involves storing all the signals that occur as part of the SE operation (i.e., all the display signals and all the control signals flowing in
and out of the interface) and then studying this set of signals. In order to make such a procedure meaningful and efficient, however, procedures must be developed for filtering and transforming this mass of data in a manner that addresses well-defined evaluation questions. One such question, for example, might relate to how the user's behavior (as measured by the relation of the control signals to the previous display signals) compares with the behavior that would have been generated under the same circumstances by some model operator (e.g., an operator that is ideal according to some well-defined criterion).
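A sketch of such a model-operator comparison is given below, assuming a deliberately simple model (pure gain acting on the display signal after a fixed reaction delay); a real analysis would use a validated model of the human operator, such as McRuer's crossover model, and the signal values here are illustrative.

```python
import math


def model_operator(display_signal, gain=1.0, delay_steps=2):
    """Hypothetical 'ideal' operator: pure gain applied to the display
    signal after a fixed reaction delay (a deliberately simple model)."""
    out = [0.0] * len(display_signal)
    for t in range(delay_steps, len(display_signal)):
        out[t] = gain * display_signal[t - delay_steps]
    return out


def rms_deviation(user_control, model_control):
    """Root-mean-square difference between the logged user control
    signals and what the model operator would have produced."""
    n = len(user_control)
    return math.sqrt(sum((u - m) ** 2
                         for u, m in zip(user_control, model_control)) / n)


# Illustrative comparison: display step input, model responds with
# gain 2.0 after one sample of delay; the user's logged control is
# then scored against the model's output.
display = [0.0, 1.0, 1.0, 1.0]
ideal = model_operator(display, gain=2.0, delay_steps=1)
print(rms_deviation([0.0, 0.0, 1.0, 2.0], ideal))
```

The point of the sketch is the structure of the question, not the particular model: given stored display and control records, how far does observed behavior deviate from behavior that is ideal by some well-defined criterion?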
Beyond examining stored SE records, evaluation of human behavior in the SE can make use of supplementary information derived from external observations and measurements. Such information could be obtained, for example, from direct observation of the subject by the evaluator, from video or audio recordings or physiological measurements of the subject, or from questioning the subject after termination of the SE experience.
Further information can be obtained by performing experiments during the SE experience. For example, the evaluator can intentionally degrade the system in some fashion to probe the effects of degradations that are likely to occur during field use. Similarly, the evaluator can introduce special signals into the displays and observe the subject's actions in response to these special signals (e.g., to test alertness).
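For the alertness-probe idea, a simple scorer might pair each injected probe with the first response that follows it within a deadline and classify unanswered probes as misses. The timestamps and deadline below are illustrative, and a real analysis would also guard against a single response being credited to more than one probe.

```python
def score_alertness_probes(probe_times, response_times, max_rt=1.5):
    """Reaction time to each probe, or None for a missed probe.

    A response counts for a probe if it falls within max_rt seconds of
    the probe onset. (Simplification: a single response could, in
    principle, be credited to more than one probe here.)
    """
    responses = sorted(response_times)
    results = []
    for p in probe_times:
        rt = next((r - p for r in responses if p <= r <= p + max_rt), None)
        results.append(rt)
    return results


# Probes injected at t = 10, 20, and 30 s; the subject responds at
# 10.4 s and 30.9 s, so the second probe is missed.
print(score_alertness_probes([10.0, 20.0, 30.0], [10.4, 30.9]))
```

Run under both nominal and intentionally degraded conditions, the distribution of such reaction times and miss rates gives a direct measure of how system degradations affect the operator.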
Additional supplementary information can be obtained by having the evaluator enter the SE in which the subject is operating and making observations and measurements of subject behavior from within the SE. This can be done passively and unobtrusively (in the sense that the subject's environment remains identical to that which would have existed if the evaluator had remained outside) or the evaluator can intentionally interact with the subject or the subject's environment. In general, observations and experiments can be performed either from outside the SE or from inside the SE.
Among the special features to consider when designing an SE evaluation program are those related to measurements of the sensation of presence and the sopite syndrome (a syndrome of drowsiness and lethargy associated with motion environments). Although it has not yet been demonstrated that the sense of presence is an important variable for predicting objective performance (indeed, it has not yet even been adequately defined), it seems likely that interest in this variable will continue. It is also clear that the extent to which an SE system elicits the sopite syndrome is of major importance. Thus, it is important to consider measuring both of these variables in any comprehensive SE system evaluation. (Information on both the sense of presence and the sopite syndrome is available in recent issues of Presence.)
Special considerations arise in connection with assessing system usability and acceptance. For example, test subjects should be chosen so that they truly constitute a representative sample of the anticipated population of users. In addition, because many envisioned SE applications involve multiple simultaneous users communicating and working together on a common task, attention must be given to appropriate sampling of groups of users as well as to observation and measurement of relevant group processes.
Finally, because of the immersive character of SEs, special attention in SE evaluation must be given to possible negative long-term psychological and social effects. Illustrative questions in this category include the following: To what extent, if any, will individuals begin to confuse occurrences in SEs with occurrences in the real world? How will an individual's self-image be influenced by spending large amounts of time in SEs that seriously transform the individual's interactions with the environment? What impact will widespread use of networked SEs have on various types of social institutions? Fundamental psychosocial questions of this type are not likely to be addressed adequately (if at all) by the developers of SE technology. However, it is important that they be seriously addressed by some group.
In general and as a consequence of the many special features associated with SE evaluation, as well as the current tendency to ignore evaluation in the SE field, the committee believes that it would be extremely useful to develop a special evaluation tool kit for this field. Such a tool kit could serve to educate people in the field, to provide a more or less standardized set of evaluation tools for the field, and eventually to help provide a cumulative and shareable database that would constitute both a current snapshot of accomplishments in the field and a guide for future research and development.