Meeting the every-citizen interface (ECI) criteria described in Chapter 2 will require advances in a number of technology areas. Some involve advances in basic underlying display and interface technologies (higher-resolution visual displays, three-dimensional displays, better voice recognition, better tactile displays, and so on). Others involve advances in our understanding of how to best match these input/output technologies to the sensory, motor, and cognitive capabilities of different users in different and changing environments carrying out a wide variety of tasks. But the new interfaces will need to do more than just physically couple the user to the devices. To meet these visions, the interfaces must have the ability to assist, facilitate, and collaborate with the user in accomplishing tasks.
Subsequent chapters address interface design (the creation of
interfaces that make the best possible use of these human-machine
communications technologies) and system attributes that lie beneath the veneer
of the interface, such as system intelligence and software support for
collaborative activities. This chapter examines the current state and
prospective advances in technology areas related directly to communication
between a person and a system: hardware and software for input (to
the system) and output (to a human). The emphasis is on technical
advances that, if implemented in well-designed systems (as stressed in Chapter 4), hold the potential to expand accessibility and usability to many
more people than at present. The discussion includes a cluster of speech
input/output technologies; natural language understanding (including
restricted languages with limited vocabularies); keyboard input; gesture
recognition and machine vision; auditory and touch-based output; interfaces
that combine multiple modes of input and output; and visual displays,
including immersive or virtual reality systems. Because the ECI challenge
involves connecting to the information infrastructure, rather than just
to stand-alone systems, this chapter reviews the current status of and
research challenges for interfaces for systems in large-scale national
networks. The chapter ends with the steering committee's conclusions,
based on workshop discussions and other inputs, about the research
priorities to advance these technologies and our understanding of how to use
them to support every citizen.
FRAMING THE INPUT/OUTPUT DISCUSSION: LAYERS
OF COMMUNICATION
The interface is the means by which a user communicates with a system, whether to get it to perform some function or computation directly (e.g., compute a trajectory, change a word in a text file, display a video); to find and deliver information (e.g., getting a paper from the Web or information from a database); or to provide ways of interacting with other people (e.g., participate in a chat group, send e-mail, jointly edit a document). As a communications vehicle, interfaces can be assessed and compared in terms of three key dimensions: (1) the language(s) they use, (2) the ways in which they allow users to say things in the language(s), and (3) the surface(s) or device(s) used to produce output (or register input) expressions of the language. The design and implementation of an interface entail choosing (or designing) the language for communication, specifying the ways in which users may express "statements" of that language (e.g., by typing words or by pointing at icons), and selecting device(s) that allow communication to be realized: the input/output devices.
Box 3.1 gives some examples of choices at each of these levels.
Although the selection and integration of input/output devices will
generally involve hardware concerns (e.g., choices among keyboard,
mouse, drawing surfaces, sensor-equipped apparel), decisions about the
language definition and means of expression affect interpretation processes that
are largely treated in software. The rest of this section briefly describes
each of the dimensions and then examines how they can be used to
characterize some currently standard interface choices; the remainder of the
chapter provides an examination of the state of the art.
Language Contrasts and Continuum
There are two language classes of interest in the design of interfaces: natural languages (e.g., English, Spanish, Japanese) and artificial languages (e.g., programming languages, such as C++, Java, Prolog; database query languages, such as SQL; mathematical languages, such as logic; command languages, such as those provided by the UNIX C shell). Natural languages are derived evolutionarily; they typically have unrestricted and complex syntax and semantics (assignment of meaning to symbols and to the structures built from those symbols). Artificial languages are created by computer scientists or mathematicians to meet certain design and functional criteria; the syntax is typically tightly constrained and designed to minimize semantic complexity and ambiguity.
Because an artificial language has a language definition, construction of an interpreter for the language is a more straightforward task than construction of a system for interpreting sentences in a natural language. The grammar of a programming language is given; defining a grammar for English (or any other natural language) remains a challenging task (though there are now several extensive grammars used in computational systems). Furthermore, the interactions between syntax and semantics can be tightly controlled in an artificial language (because people design them) but can be quite complex in a natural language.1,2
Natural languages are thus more difficult to process. However, they allow for a wider range of expression and as a result are more powerful (and more "natural"). It is likely that the expressivity of natural languages and the ways it allows for incompleteness and indirectness may matter more to their being easy to use than the fact that people already "know them." For example, the phrase, "the letter to Aunt Jenny I wrote last March," may be a more natural way to identify a letter in one's files than trying to recall the file name, identify a particular icon, or grep (a UNIX search command) for a certain string that must be in the letter. The complex requests that may arise in seeking information from on-line databases provide another example of the advantages of complex languages near the natural language end of this dimension. Constraint specifications that are natural to users (e.g., "display the protein structures having more than 40 percent alpha helix") are both diverse and rich in structure, whereas menu- or form-based paradigms cannot readily cover the space of possible queries. Although natural language processing remains a challenging long-range problem in artificial intelligence (as discussed under "Natural Language Processing" below in this chapter), progress continues to be made, and better understanding of the ways in which it makes communication easier may be used to inform the design of more restricted languages.
However, the fact that restricted languages have limitations is not, per se, a shortcoming for their use in ECIs. Limiting the range of language in using a system can (if done right) promote correct interpretation by the system by limiting ambiguity and allowing more effective communication. For instance, the use of domain- and task-specific restricted languages for certain applications of speech recognition systems has produced results, allowing people to use speech to communicate when they cannot see (either because they are limited by the communication device being used, such as the telephone, or because of physical impairment). Radiologists' workstations, for example, allow the use of speech as the primary means of inputting reports on X-rays or other radiographic tests. Direct manipulation languages may be ideal if there is a close match to what the user wants to do (and hence is able to "say"), that is, if the user's needs are anticipated and the user will not need to program or alter what the system does; they can be a robust means of control that limits the risk of system crashes from misdirected user actions.
In short, the design of an interface along the language
dimension entails choices of syntax (which may be simple or complex) and
semantics (which can be simple or complex either in itself or in how it relates
to syntax). More complex languages typically allow the user to say
more but make it harder for the system to figure out what a person means.
Expression Contrasts
A natural language sentence can be spoken, written, typed, gestured, or selected from a menu. An artificial language statement also can be spoken, written, typed, gestured, or selected from a menu.
Language expression can take many forms, generally differentiated as being more or less continuous or involving selection from a set of options (e.g., a menu). Speaking can involve isolated words or continuous speech recognition. Writing can involve handwriting or typing; drawing can be free form or can use prespecified options. Gesturing, independently or to manipulate objects, can be free form, can involve a full-scale natural language (e.g., American Sign Language), or can involve a more restricted set of prespecified options (e.g., pointing, dragging, stretching, rotating). Virtual reality and other visualization techniques represent a multimedia form of expression that may involve speech, gesture, direct manipulation, and haptic and other elements.
Thus, the different ways of saying things in a language may also be divided into two structural categories (free form and structured) and several different realization categories: typing, speaking, pointing. Free-form expression is usually more difficult to process than structured expression. For example, a sentence in natural language can be spoken "free form" (this is what we usually think of with natural language), or it might be specified by picking one word at a time out of a structured menu.3 In the structured form the system can control what the user gets to choose to "say" next, and so it is much easier for a system to interpret and handle. Within a given form, some means of realization may be easier to handle than others (e.g., correctly typed words are easier to interpret than handwritten words; freehand drawings are more difficult than structured CAD/CAM (computer-aided design/computer-aided manufacturing) diagrams). It is also important to note that more structured systems may be preferable for certain applications, such as those involving completion of forms (Oviatt et al., 1994).
Menu/icon systems thus provide an alternative way of
expressing command-like languages. They have underlying languages,
typically very much like command languages. The commands (natural
language verb equivalents) are often menu items (e.g., "select," "edit"); the
parameters (natural language noun equivalents) are icons (or open files);
and the statements (natural language sentence equivalents) are sequences
of select "nouns" and "verbs." The menus and icons provide the
structure within which a user can say something in the language.
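The idea that menus provide the structure within which a user can "say" something can be illustrated with a small sketch. It is only an illustration under stated assumptions: the grammar table, the state names, and the functions choices and interpret are hypothetical, and the verb-then-noun ordering is chosen for simplicity (many real menu/icon systems select the noun first).

```python
# Hypothetical restricted command language: each state maps a menu
# selection to the state that follows it.
GRAMMAR = {
    "start": {"open": "object", "delete": "object", "print": "object"},  # "verbs"
    "object": {"letter": "end", "report": "end"},                        # "nouns"
}

def choices(state):
    """Return the menu items offered in a state (none once a statement is complete)."""
    return sorted(GRAMMAR.get(state, {}))

def interpret(selections):
    """Follow a sequence of selections; return True if it forms a complete statement.

    Raises ValueError for any selection the menu would not have offered,
    which is how the structured form rules out uninterpretable input.
    """
    state = "start"
    for item in selections:
        if item not in GRAMMAR.get(state, {}):
            raise ValueError(f"{item!r} is not offered in state {state!r}")
        state = GRAMMAR[state][item]
    return state == "end"

print(choices("start"))
print(interpret(["open", "letter"]))
```

Because the system only ever offers the selections listed for the current state, every completed sequence is interpretable by construction; the cost is that the user cannot say anything the table does not anticipate.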
Devices
The hardware realization of communication can take many
forms; common ones include microphones and speakers, keyboards and
mice, drawing pads, touch-sensitive screens, light pens, and push buttons.
The choice of device interacts with the choice of medium: display,
film/videotape, speaker/audiotape, and so on. There may also be interactions
between expression and device (an obvious example is the connection
between pointing device (mouse, trackball, joystick) and pull-down
menus or icons). On the other hand, it is also possible to relax some of
these associations to allow for alternative surfaces (e.g., keyboard
alternatives to pointers, aural alternatives to visual outputs). Producing interfaces
for every citizen will entail providing for alternative input/output
devices for a given language-expression combination; it might also call for
alternative approaches to expression.
Comparisons Among Graphical User Interfaces,
Natural Language, and Speech
The language-expression-device framework can be used to gain perspective on current standard interface types and on the research opportunities and challenges presented by ECIs. For example, it makes clear that natural language processing and speech recognition (and other technologies that may be associated colloquially) introduce different issues and different tradeoffs. A speech-based interface such as AT&T's long-distance voice recognition system, which can recognize phrases such as "collect call" and "calling card,"4 can combine a restricted language with speech as a means of expression. As this example illustrates, neither speech recognition with unlimited vocabulary nor complete/comprehensive language understanding is necessary to provide natural language-like input to a system within a restricted domain and task. Similarly, it is possible to improve restricted language interfaces by applying principles from natural language communication.
Current graphical user interface/menu/icon systems tightly constrain what one can say, both by starting with a very constrained language and by having a structured way in which one can express things in that language. They are at the opposite end of both the language and the expression spectrum from natural languages. It is thus clear why they are easier to process, but also why they are more constraining (Cohen and Oviatt, 1994).
Ongoing efforts to develop speech interfaces for Web browsers provide a concrete example of the importance of understanding the different tradeoffs of each of these dimensions. Choosing speech on the expression layer rather than pointing and clicking would lead to being able to "speak" the icons and hyperlinks that are designed for keyboard and mouse. Although this may suffice in certain settings (substituting one modality for another can be useful in hands-free contexts and for those with physical limitations), it does not necessarily expand a system's capabilities or lead to new paradigms for interactions. An alternative approach would be to explore how spoken language technology can expand the user's ability to obtain the desired information easily and quickly from the Web, leading to a different, probably more expressive, language. From this perspective, speech would augment rather than replace the mouse and keyboard, and a user would be able to choose among many interface language-expression options to achieve a task in the most natural and efficient manner.
Natural language interaction is particularly appropriate when the information space is broad and diverse or when the user's request contains complex constraints. Both of these situations occur frequently on the Web. For example, finding a specific home page or document now requires remembering a universal resource locator, searching through the Web for a pointer to the desired document, or using one of the keyword search engines available. Current interfaces present the user with a fixed set of choices at any point, of which one is to be selected. Only by stepping through the offered choices and conforming to the prescribed organization of the Web can users reach the documents they desire. The multitude of indexes and meta-indexes on the Web is testimony to the reality and magnitude of the problem. The power of a human/natural language in this situation is that it allows the user to specify what information or document is desired (e.g., "Show me the White House home page," "Will it rain tomorrow in Seattle?" or "What is the ZIP code for Orlando, Florida?") without having to know where and how the information is stored. A natural language, regardless of whether it is expressed using speech, typing, or handwriting, offers a user significantly more power in expressing constraints, thereby freeing the user from having to adhere to a rigid, preconceived indexing and command hierarchy.
In examining the state of the art of various input/output
technologies, it is important to recognize that no single choice is right for
all interfaces. In fact, one of the major challenges of interface design may
be designing a language that is powerful enough for a user to say
what needs to be said yet is as constrained as possible, so that
processing is easier and misinterpretation is less
likely. In looking at input/output options, it will
be useful to keep in mind where various options fall on one or another
of these scales and the tradeoffs implicit in choosing a given option.
TECHNOLOGIES FOR COMMUNICATING WITH SYSTEMS
Humans modulate energy in many ways. Recognizing that fact allows for exploration of a rich set of alternatives and complements (at any time, a user-chosen subset of controls and displays) that a focus on simplicity of interface design as the primary goal can obscure. Current direct manipulation interfaces with two-dimensional display and mouse input make use, minimally, of one arm with two fingers and a thumb and one eye, about what is used to control a television remote. It was considered a stroke of genius, of course, to reduce all computer interactions to this simple set as a transition mechanism to enable people to learn to use computers without much training. There are no longer any reasons (including cost) to remain stuck in this transition mode. We need to develop a fuller coupling of human and computer, with attention to flexibility of input and output.
In some interactive situations, for example, all a computer or information appliance needs for input is a modulated signal that it can use to direct rendered data to the user's eyes, ears, and skin. Over 200 different transducers have been used to date with people having disabilities. In work with severely disabled children, David Warner, of Syracuse University, has developed a suite of sensors to let kids control computer displays with muscle twitches, eye movement, facial expressions, voice, or whatever signal they can manage to modulate. The results evoke profound emotion in patients, doctors, and observers and demonstrate the value of research on human capabilities to modulate energy in real time, the sensors that can transduce those energies, and effective ways to render the information affected by such interactions.
The state of the art in a range of technologies for communicating
with systems is reviewed below. Also addressed are the device and
expression layers of the model described in the previous section and summarized
in Box 3.1. The choice of language (natural, restricted, or direct
manipulation) influences but does not dictate the technologies discussed here.
The exception is the subsection, "Natural Language Processing,"
which also encompasses the language layer of the model and discusses
how choices along a spectrum from fully natural languages to relatively
restricted languages influence the performance of various
expression modes, particularly speech input.
Speech Synthesis
Text-to-speech systems, or speech synthesizers, take unrestricted text as input and produce a synthetic spoken version of that text as output. Most current commercial synthesizers exhibit a high degree of intelligibility, but none sound truly natural. The major barriers to naturalness are deficiencies of text normalization, intonational assignment, and synthesized voice quality. Synthetic female and children's speech is generally less acceptable than synthetic adult male speech, probably because it has been studied less (Roe and Wilpon, 1994).
In the course of transforming text into speech, all text-to-speech systems must do the following:
While most systems permit some form of user control over various parameters at many of these stages, to fine-tune system defaults, documentation and tools for such control are usually lacking, and most users lack the requisite background to produce satisfying results.
Particularly for concatenative synthesizers, it is difficult and time consuming to produce new voices, since each voice requires that a new set of concatenative units be recorded and segmented. While most research groups are developing tools in an attempt to automate this process (often by using automatic speech recognition systems to produce a first-pass segmentation), none have succeeded in eliminating the need for laborious hand correction of the database. There have also been efforts in recent years to automate the production of other components of synthesis, to facilitate the production of synthesizers in many languages from a single architecture.
We know that synthetic speech should sound better. It is not
clear, exactly, how to decide what is better: More natural and more
human-like? More intelligible? More intelligible at normal talking speeds or
at high speeds? Speech is usually used for conversational modes of
interaction. When speech is being used for presenting a Web page, for
example, there is additional information that needs to be provided: Which
words form links? Which words are italicized? How is this information
presented most effectively? How should the system handle words that
have different pronunciations in different parts of the country or
for different individuals?
Speech Input/Recognition
The full integration of voice as an input medium, if achievable, could alleviate many of the known limitations of existing human-machine interfaces. People with poor or no literacy skills, people whose hands are busy, and people suffering from cumulative trauma disorders associated with typing and pointing (or seeking to avoid them) could all benefit from spoken communication with systems. While the capabilities envisioned in such a system are well beyond the state of the art in both speech recognition and language understanding at present, the technology has advanced sufficiently to allow very simple voice-based applications to emerge (see below).
Speech recognition research has made significant progress in the past 15 years (Roe and Wilpon, 1994; Cole and Hirschman, 1995; Cole et al., 1996). The gains have come from the convergence of several technologies: higher-accuracy continuous speech recognition based on better speech modeling techniques, better recognition search strategies that reduce the time needed for high-accuracy recognition, and increased power of audio-capable, off-the-shelf workstations. As a result of these advances, real-time, speaker-independent, continuous speech recognition, with vocabularies of a few thousand words, is now possible in software on regular workstations.
In terms of recognition performance, word error rates have dropped by more than an order of magnitude in the past decade and are expected to continue to fall with further research. These improvements have come about as a result of technical as well as programmatic innovations. Technically, there have been advances in two areas. First, a paradigm shift from rule-based to model-based methods has taken place. In particular, probabilistic hidden Markov models (HMMs) have proven to be an excellent method of modeling phonemes in various contexts. This model-based paradigm, with its ability to estimate model parameters automatically from training data, has shown its power and versatility in being applied to various languages with the same software. Second, the use of statistical grammars, which estimate the probability of two- and three-word sequences, has been instrumental in improving recognition accuracy, especially for large-vocabulary tasks. These simple statistical grammars have, so far, proven to be superior to traditional rule-based grammars for speech recognition purposes.
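As an illustration of what a statistical grammar estimates, the following minimal sketch computes two-word (bigram) probabilities from a toy corpus. It is not the method of any particular recognizer: the corpus, the function names, the sentence boundary markers, and the add-alpha smoothing are all assumptions made for the example, and three-word (trigram) models extend the same counting idea.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and word pairs over tokenized sentences with boundary markers."""
    contexts, pairs = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])            # every token that has a successor
        pairs.update(zip(tokens, tokens[1:]))   # adjacent word pairs
    return contexts, pairs

def bigram_prob(contexts, pairs, prev, word, vocab_size, alpha=1.0):
    """Estimate P(word | prev); add-alpha smoothing keeps unseen pairs above zero."""
    return (pairs[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)

# Toy corpus (an assumption, loosely in the spirit of a travel information task).
corpus = [
    "show me flights to boston".split(),
    "show me fares to denver".split(),
    "list flights to boston".split(),
]
contexts, pairs = train_bigram(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}

p_seen = bigram_prob(contexts, pairs, "to", "boston", len(vocab))
p_unseen = bigram_prob(contexts, pairs, "to", "flights", len(vocab))
print(p_seen, p_unseen)
```

A recognizer can use such probabilities to prefer word sequences that were common in the training data, which is one reason these models need no hand-written rules and can be retrained for a new language or task from data alone.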
Programmatically, the collection and dissemination of standard, common training and test corpora worldwide, the sponsorship of common evaluations, and the dissemination at workshops of information about competing methods have all ensured very rapid progress in the technology. This programmatic approach was pioneered by the Defense Advanced Research Projects Agency (DARPA), which continues to sponsor common evaluations and initiated the establishment of the Linguistic Data Consortium, which has been in charge of the collection and dissemination of common corpora. A similar approach is now being taken in Europe.
Word error rates for speaker-independent continuous speech recognition vary a great deal, depending on the difficulty of the task: from less than 0.3 percent for connected digits, to 3 percent for a 2,500-word travel information task, to 10 percent for articles read from the Wall Street Journal, to 27 percent for transcription of broadcast news programs, to 40 percent for conversational speech over the telephone. Although word error rates in the laboratory can be quite small for some tasks, error rates can increase by a factor of four or more when the same systems are used in the field. This increase has various causes: heavy accents, ambient noise, different microphones, hesitations and restarts, and straying from the system's vocabulary.
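For readers unfamiliar with the measure, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer's output, divided by the number of words in the reference. The sketch below shows that calculation; the function name and the example sentences are illustrative assumptions, and real evaluations also normalize text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the minimum edits between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

# One deletion ("the") and one substitution ("please" -> "place") in 4 words.
print(word_error_rate("call the operator please", "call operator place"))  # 0.5
```

Because insertions count as errors, the rate can exceed 100 percent when the recognizer outputs many spurious words.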
Speech recognition has begun to enter the mainstream of everyday life, chiefly through telephone-based applications (Margulies, 1995). The most visible of these applications involve directory assistance services, such as the recognition of a few words (e.g., the digits and words such as "operator," "yes/no," "collect") or recognition of the names of cities in a particular area code. Speaker-independent recognition of over-the-phone digit strings (more difficult than single-digit recognition) has been deployed since 1990.6 Other applications include voice-activated dialing (especially useful for cellular phones), personal assistant services (to manage one's telephone at work), and call router applications (where the caller says the person's full name instead of dialing). Other less prevalent applications include obtaining stock and mutual fund quotes by voice, simple banking services, and bill payment by telephone.7
Other operational applications of speech recognition include air traffic control training, dictation, and Internet access. Large-vocabulary dictation systems capable of recognizing discrete speech are available on the market and have been used for years. For continuous speech there are systems that are capable of recognizing a few thousand words in real time; at least one of these systems is now being marketed for the dictation of radiology reports. Systems for using voice for Internet access have recently been announced.
Simply making speech recognition available with machines, however, does not necessarily make it immediately useful; it will have to be interfaced properly with the other modalities so that it appears seamless to the user (Martin et al., 1996). (Several vendors have been shipping speech recognition capabilities with personal computers, but there is little evidence of wide usage.) Optimism for general use of speech technologies comes from the facts that performance levels are continuing to improve and that many applications do not require large vocabulary sizes. However, applications must be designed to take into account the fact that recognition errors will occur, either by allowing the user to correct errors or by designing additional error correction mechanisms, such as proper inclusion of human-machine dialogue capabilities. These include the ability to deal with issues such as how to phrase a system prompt, how to determine if a recognition error has occurred, and how to engage in conversational repair if such a determination is made. Other speech integration issues include habitability (the ability of a user to stay within the system's vocabulary most of the time), portability (the ease with which a speech recognition system can be ported to a new domain), and user experience (different users, depending on their experience, may require different types of interaction).
Looking into the future of the national information infrastructure (NII), speech recognition could have many applications, such as command and control, information access and retrieval, training and education, e-mail and memo dictation, and voice mail transcription. The current state of the art in speech recognition can support these applications at various levels of performance, some quite well (e.g., command and control) and others not well at all (e.g., voice mail transcription). Functions that perform information access, such as making an airline reservation, may require the use of a certain level of language understanding technology. The state of the art in that field only allows for the simplest of such applications at this time (see "Natural Language Processing" below).
Despite significant progress in speech recognition technology in
the past decade, the fact remains that machine performance may still not
be good enough for many applications. As a barometer of how
much progress we may need for certain advanced applications,
experiments have shown that human speech recognition performance is still at least
an order of magnitude better than that of machines. One optimistic
note, however, is that commercialization of the technology is proceeding
very vigorously, so that advances in the laboratory can be expected
to appear on the market with a delay of only a few years.
Speaker Verification
A related but quite different technology is speaker verification. There has been much concern about private and secure communications over the Internet, especially for business information and financial transactions. Although encryption methods will be used more and more to protect digital data, it will still be necessary to make a more positive identification of customers for certain types of transactions. Speaker verification technology can be used to help provide additional security.
In an initial enrollment phase, each user is enrolled in a system by providing samples of his or her voice. System performance improves over time as the user supplies more voice samples. Using those voice samples, the system creates a model for the voice of each user. Then, when in operation, the system prompts the user to say a (random) phrase and, using the stored model of the user with the claimed identity, computes the likelihood that the speech came from that person. The user is then either accepted or rejected.
The performance of a speaker verification system is often measured by the Equal Error Rate (EER), which is the operating point in a system where the false rejection rate is equal to the false acceptance rate. In the laboratory, an EER of less than 0.5 percent can be achieved. Performance typically degrades to an EER of 2 to 4 percent in the field.
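The Equal Error Rate can be estimated from two sets of verification scores by sweeping the acceptance threshold until the two error rates meet. The following is a minimal sketch under stated assumptions: the scores are invented, higher scores are taken to mean a better match to the claimed identity, and the function name is hypothetical. With finite data the two rates may never be exactly equal, so the sketch reports the threshold at which they are closest.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the point where FRR and FAR are closest.

    A claim is accepted when its score is at or above the threshold.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: a true speaker scores below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor scores at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]

# Invented scores for illustration only.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

In a deployed system the threshold need not sit at the equal-error point: raising it trades more false rejections for fewer false acceptances, which may suit higher-security applications.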
While the current state of the art may be sufficient for low-security applications, it would not, by itself, be adequate for high-security applications. However, if combined with other security measures, such as use of a PIN (Personal Identification Number), speaker verification can provide the added desired security for many applications of interest.
For users with physical disabilities who would like to have
voice-only access to devices and systems, speaker verification could be of great
benefit. It should be noted, however, that there are a significant number
of people who are unable to speak clearly or reliably. For those
people, alternate means of verification will be necessary if they are to use
systems that rely on voice verification.
Alternate Keying/Typing Approaches: Strategies and Accelerators
As speech recognition becomes accurate and reliable, it will play a much larger role in future interface systems than it does today. It will not, however, ever completely replace keyboard or keypad input to systems or make it obsolete. Keying information into systems will continue to be a quiet, accurate, noise-immune (and, for some applications, faster) means of inputting data or commands. Furthermore, even as the performance of natural language understanding improves, free-form typing of natural language will remain a viable alternative to spoken input to such systems.
Today, keypads and keyboards range from systems that are as small as a wristwatch and are operated with a pen tip, to large, wall-sized keyboards operated with a light pen. Common keyboards are operated by using all 10 fingers, which push keys one at a time. Other keyboards have been developed that are chordic in nature and involve the pressing of multiple keys simultaneously. Many of these do not require the user to ever remove his or her hand.
In addition to pressing discrete keys, data can also be input using gestures. Finger spelling is one technique. Today, there are gloves that allow the wearer to spell out the desired characters using finger-spelling gestures. Techniques are also being explored that use cameras to take data via both finger spelling and sign language.
Handwriting is another common method for entering alpha-numeric data. There are techniques for recognizing letters formed in the standard way, as well as techniques (such as "Graffiti") that increase the accuracy of handwritten characters by having the user write with letters that are similar to, but different from, the standard characters people are familiar with.
To increase the rate of data entry, a number of abbreviation and prediction techniques have been developed. Abbreviation techniques allow an individual to type a smaller set of letters that stands for a longer word or phrase; the abbreviation can resemble the target (such as "abv" for "abbreviation") or be completely arbitrary (such as "T1" for "please call home"). Prediction techniques look at what a person has typed and try to guess what the next word or words will be. Prediction techniques are less useful for people who can enter data quickly, since the time spent looking at the system's guesses may slow them down to the point where it is faster to just enter the data. However, for individuals who have to enter data very slowly or who have difficulty spelling (e.g., because of a learning disability or cognitive impairment, or because they are writing in a second language), systems that can guess words correctly can significantly increase the rate of communication. If a system always guesses consistently (e.g., when "t" is typed, it guesses "the"; when "th" is typed, it guesses "there"), the user can begin by using it as a prediction aid but very quickly switch over to using it for abbreviation expansion (e.g., the user types "t" and then the confirm button because the user knows that the system will have guessed "the"). Ironically, systems that monitor the context and change their guesses to better match it prevent an individual from getting into the faster abbreviation expansion mode. If systems could predict whole sentences or phrases, their utility would increase; however, this is usually possible only for stereotypic communication (Vanderheiden et al., 1986).
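The consistent-guessing behavior described above can be sketched in a few lines. This is a minimal illustration, not a description of any fielded system: the word list is invented, the name `make_guesser` is hypothetical, and the rule that a word already offered (and passed over) at a shorter prefix is not offered again is an assumption chosen so that "t" yields "the" and "th" yields "there," as in the example.

```python
from collections import Counter


def make_guesser(corpus_words):
    """Build a deterministic word guesser from a list of sample words."""
    counts = Counter(corpus_words)
    # Fixed ranking: most frequent first, ties broken alphabetically,
    # so the same keystrokes always produce the same guess.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))

    def guess(typed):
        """Return the guess shown after `typed`, or None if nothing matches."""
        passed_over = set()
        current = None
        for end in range(1, len(typed) + 1):
            prefix = typed[:end]
            # Assumption: a guess the user typed past is not offered again.
            current = next(
                (w for w in ranked
                 if w.startswith(prefix) and w not in passed_over),
                None,
            )
            if current is not None:
                passed_over.add(current)
        return current

    return guess


# Invented sample text standing in for a user's vocabulary.
sample = "the the the there there then cat".split()
guess = make_guesser(sample)
print(guess("t"), guess("th"))
```

Because the ranking never changes, a user can learn that "t" followed by the confirm key always produces "the" and stop reading the guesses altogether. A context-sensitive predictor would give up exactly this property.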
In some respects this area is one of the more thoroughly
researched. However, it is not clear what the best techniques are for
combining keyboard input with speech, virtual reality, and gestural
input systems. What is the best way to use a minimalist keyboard with
a voice response system, either in a key-in/voice-out paradigm or to
help handle error correction in voice recognition systems? Also, there
are currently no good mechanisms for providing keyboard-based input
when people are walking or moving about in virtual reality-like
environments.
Natural Language Processing8
Natural language, whether spoken, written, or even signed, is at the heart of human communication. It is key to interaction between humans and is the medium for much of the vast amount of information stored in books, newspapers, scientific journals, audio and video tapes, and now Web pages. As a means of interaction with computers, it requires no special training on the part of users, but it remains uncommon because of the difficulties in supporting it technically. To date, there have been a number of successful commercial applications of natural language processing, including grammar- and style-checking programs; text indexing and retrieval systems, particularly for the named-entity task;9 database query products that utilize natural language as input, which are being marketed for targeted applications; abstracting software (for summarizing blocks of text), which has been introduced commercially; and machine-aided translation programs. Access to the NII could be made easier and more productive if people could interact with a computer using natural language and if the computer could better retrieve, summarize, and understand the wealth of linguistic information at its disposal.10
Over the years, natural language processing (NLP) has focused primarily on three tasks: (1) database access, from typed or spoken queries; (2) information extraction, or the generation of formatted summaries from texts such as newspaper stories, military messages, and Web pages; and (3) machine translation of typed or spoken utterances from one language to another. The challenge of NLP is to build systems that can distinguish in the input language as many significantly different meanings as are relevant to the applications of interest; to interpret correctly as large a variety of linguistic expressions of these meanings as would naturally occur; and to do so in as many task settings as possible, with the computational resources available.
Until recently, most NLP systems shared the same gross architecture, roughly analogous to that of programming language compilers: a syntactic analyzer, or parser, to identify the lexical category of the words of the input sentence11 and their hierarchical organization into phrases and clauses; a semantic analyzer, to construct a representation of the meaning of the input sentence, generally independent of the specific task or application domain; and, finally, a domain- and task-specific mapping from the semantic interpretation to a representation suitable to the task at hand, such as a database query for query systems, a filled template for information extraction systems, or an input into a language generation module for a machine translation system.12 In the current practice, several hundred rules may need to be hand-coded for a new application, even in a limited domain.13
In the early 1990s, NLP took several new directions, largely at the instigation of a succession of DARPA program managers. First, after years of working in parallel, researchers in speech recognition and NLP were encouraged to construct integrated speech understanding systems, for which the chosen task was to answer spoken queries to databases (e.g., of air travel information). Second, information extraction was made a major task of interest. Finally, the performance of NLP and speech understanding systems was to be systematically evaluated.
It was thus necessary to reject the then-prevailing assumption that the NLP system needed to understand only syntactically and semantically well-formed utterances or that the entire content of an utterance or text needed to be understood. Spoken language systems had to deal with the inevitable recognition errors of even the best speech recognition systems as well as queries such as "Boston San Francisco after 8 a.m." and "I'd like to go to Boston, ah, to Atlanta, tomorrow." Systems were designed that tolerated not understanding some parts of the utterance, combining partial analyses of other parts, and explicitly correcting certain forms of disfluencies. Even with such difficult input, it now became possible to actually improve the accuracy of even the best speech recognition programs by applying syntactic and semantic constraints, at least in limited domains.
Systematic evaluation of NLP systems is not possible without the collection of large corpora of linguistic data, both raw and annotated (for example, with correct transcriptions and correct answers to spoken queries).14 Although the rule-based paradigm that has dominated computational linguistics so far has produced a few large-scale systems that have been reused over several different projects (e.g., the CORE Language Engine at SRI), it has been difficult to share large grammars, lexicons, and semantic rules across sites, which makes it hard to build on previous results.
The domain specificity of rule-based NLP systems suggests that it would be attractive to be able to automatically train an NLP system, as is done with the hidden Markov models used in speech recognition. Significant effort is being devoted to this direction. The results are promising but still not comparable to what is routinely achievable with rule-based systems. Some of the problems are the amount of training data required, the difficulty of obtaining such data in a wide range of domains, and the cost of annotating the input data with the correct task-specific semantic representation. The annotation problem is exacerbated by the fact that it is much more difficult to get human annotators to agree on correct semantic annotations than on transcriptions of spoken utterances.
Many researchers believe that for some time yet the most effective strategy for the development of NLP systems in new domains will be hybrid systems, based on a core of hand-coded rules but tuned to a domain by automatic training methods. Domain-specific corpora can be used, for example, to assign probabilities to the rules, providing a mechanism by which probabilities can be assigned to rule-based interpretations. This approach, used by most of the currently best-performing systems, can be seen as a way of adapting a set of general rules to a particular domain. Farther down the road are ways of circumventing the data and annotation requirements of fully automatic training methods by dynamically adapting to one domain a system developed in another.
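One simple way to picture how a domain corpus can attach probabilities to hand-coded rules is relative-frequency estimation: count how often each rule is applied in annotated domain data and normalize by the rule's left-hand side. The sketch below is illustrative only. The text does not say which estimation method the best-performing systems use, and the rule names, counts, and function names here are invented.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate of P(rule | left-hand side)."""
    rule_counts = Counter(observed_rules)
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


def score_interpretation(rules_used, probabilities):
    """Probability of an interpretation: product of its rule probabilities."""
    score = 1.0
    for rule in rules_used:
        # A rule never seen in the domain corpus gets probability zero here;
        # a practical system would smooth this instead.
        score *= probabilities.get(rule, 0.0)
    return score


# Invented rule applications from a hypothetical annotated travel corpus.
observed = (
    [("QUERY", "FLIGHT_REQUEST")] * 3
    + [("QUERY", "FARE_REQUEST")] * 1
    + [("CITY", "boston")] * 2
    + [("CITY", "atlanta")] * 2
)
probabilities = estimate_rule_probabilities(observed)
flight_to_boston = [("QUERY", "FLIGHT_REQUEST"), ("CITY", "boston")]
print(score_interpretation(flight_to_boston, probabilities))
```

The hand-coded rules stay the same from domain to domain; only the counts change, which is what makes this form of tuning cheaper than writing a new rule set.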
NLP systems vary widely, from those that perform full and deep understanding of an utterance in narrowly construed domains to those that perform partial and shallow understanding of very wide domains. Query systems tend to be at the full and deep understanding end, and information extraction systems at the partial and shallow end.
Several systems have been implemented to answer queries in the DARPA-sponsored Airline Travel Information Service task (DARPA, 1995b), where the user asks for information about flights and schedules using speech. The utterance error rate, measured as the percentage of queries for which the system gives the wrong answer, is currently about 6 percent for spoken input and 4 percent for the corresponding text input.
The standards for evaluation of information extraction systems are set by the DARPA-sponsored Message Understanding Conferences (MUCs). For the "named entity" application, where the system must find all named organizations, locations, persons, dates, times, monetary amounts, and percentages, the error rates are below 5 percent. For the "scenario template" application, where the system extracts complex relationships in well-defined domains (such as joint ventures) in an open source (such as the Wall Street Journal), the error rate for finding the correct elements of the templates is about 45 percent.
In the area of machine translation, the most significant advances continue to occur in Europe. Recent work in the United States using texts written with an eye toward translation also shows promise (Carbonell, 1992). Several speech-to-speech translation systems in limited domains, combining speech understanding, machine translation, and speech generation, have been demonstrated.
Still in their infancy are systems with which a human can conduct
a coherent dialogue in service of a complex and extended task. Early
examples include the TRAINS system (in use at the University of
Rochester), which allows a human to control a system that plans the transport
of materials, and the CommandTalk system, which provides a spoken
interface to a large military simulator. The approach of Sadek and
co-workers, at France Telecom (Bretier and Sadek, 1996; Bretier et al., 1995),
offers compelling evidence that spoken language systems can have
sophisticated models of dialogue and can benefit from them. Future systems
will need to allow for a variety of speech acts (e.g., requests, assertions,
questions, rejections) and contain dialogue models that make it possible
to link the different phrases used to refer to the same entities and
events in the discourse. Such coreference resolution has been the
subject of much research, and systems using it are being evaluated in
the MUC benchmarks.
Gesture Recognition
Gesture input can come in many forms from a variety of devices (e.g., mouse, pen, data glove). Its role is to convey information (e.g., identify, make reference to, explain, shift focus) in a manner similar to the other more studied forms of language. Gesture replaces the click of the mouse (the mouse's only word) with a wide range of commands. It eliminates the myriad objects on the screen intended to let the user communicate his or her desires. Rather than having to find the word "duplicate" and click on it, the user can make simpler movements involving only the hand. For example, at the workshop, Bruce Tognazzini's "Starfire" video showed a user separating her fingers to indicate a desire to duplicate an object (leave it here and move it). Gesture can relieve problems of repetitive stress by varying the user's movements, thereby lowering the repetition of any particular action.
Rimé and Schiaratura (1991) characterize several classes of gesture. Symbolic gestures are conventional, context-independent, and typically unambiguous expressions (e.g., "OK" or peace sign). In contrast, deictic gestures are pointers to entities, analogous to natural language deixis (e.g., "this not that"). Iconic gestures are used to display objects, spatial relations, and actions (e.g., illustrating the orientation of two cars at an accident). Finally, pantomimic gestures display an invisible object or tool (e.g., making a fist and moving to indicate a hammer). Gestural languages exist as well. These include sign languages and signing systems for use in environments where alternative communication is difficult. Early experience with glove interfaces indicates that some users have difficulty remembering the gesture equivalents to commands (Herndon et al., 1994).
Gesture recognition plays a role in immersive environments such
as virtual reality or simulation environments. It also should find
widespread application in helping to give directions to computers or
computerized agents. Pointing and gesturing with the hand or with other
objects are natural communication behaviors and will likely form an
important component in a natural intuitive interface. In addition, for
individuals who are deaf and who communicate primarily through gestural
languages (such as American Sign Language), machine recognition of American
Sign Language gestures is the equivalent of speech recognition for those of
us who can speak.
Machine Vision and Passive Input
Machine vision is likely to play a number of roles in future interface systems. Primary roles are likely to be:
Experience with text and image recognition provides a number of
insights relevant to future interface development, especially in the
context of aiding individuals with physical disabilities. In particular,
systems that are difficult to use by blind people would pose the same problems
to people who can see but who are trying to access information
aurally because their vision is otherwise occupied. Similar problems may arise
as well for intelligent agents.
Text Recognition
Today, there are powerful tools for turning images of text into
electronic text (such as ASCII). Optical character recognition (OCR) is
quite good and is improving daily. Driven by a desire to turn warehouses
of printed documents into electronic searchable form, companies have
been and are making steady advances. Some OCR programs will convert
documents into electronic text that is compatible with particular word
processing packages, preserving the text layout, emphasis, font, and so on.
The problem with OCR is that it is not 100 percent accurate. When it makes
a mistake, however, the error is usually no longer a single wrong character
(since word look-up is used to improve accuracy). As a result, when an error
is made, it is often a legal (but wrong) word. Thus, it is often impossible
to look at a document and figure out exactly what it did say; some sentences
may not be accurate (or even make sense). One company gets around this
by pasting a picture of any words the system is not sure about into the
text where the unknown word would go. This works well for sighted
persons, allows human editors to easily fix the mistakes, and preserves
the image for later processing by a more powerful image recognizer. It
does not help blind users much except that they are not misled by a
wrong word and can ask a sighted person for help if they cannot figure
something out. (Most helpful would be to have an OCR system include
its guess as to the letters of a word in question as hidden text, which a
person who is blind could call up to assist in guessing the word.) Highly
stylized or embellished characters or words are not recognizable. Text that
is wrapped around, tied in knots, or arranged on the page or laid out in
an unusual way may be difficult to interpret even if available in
electronic text. This is a separate problem from image recognition, though.
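The two ideas discussed above, pasting in a picture of any word the recognizer is unsure about and carrying the recognizer's guess along as hidden text, can be sketched as follows. This is a hedged illustration: the `RecognizedWord` record, the confidence scale, the 0.8 threshold, and the bracketed placeholder format are all assumptions, not the output format of any actual OCR product.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # the recognizer's best guess
    confidence: float  # 0.0 to 1.0, from a hypothetical OCR engine
    image_ref: str     # identifier of the cropped word image


def render(words, threshold=0.8):
    """Keep confident words as text; replace doubtful ones with their
    image plus the guess as hidden text a screen reader could expose."""
    parts = []
    for word in words:
        if word.confidence >= threshold:
            parts.append(word.text)
        else:
            parts.append(f"[image:{word.image_ref} guess:{word.text}]")
    return " ".join(parts)


# Invented recognizer output; "modem" may really be "modern".
page = [
    RecognizedWord("The", 0.99, "w1"),
    RecognizedWord("modem", 0.55, "w2"),
    RecognizedWord("interface", 0.97, "w3"),
]
print(render(page))
```

Keeping both the image and the guess lets a sighted editor correct the word visually while still giving a blind reader something to work from.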
Image Recognition
Despite great strides by the military, weather, intelligence, and other communities, image interpretation remains quite specialized and focused on looking for particular features. The ability to identify and describe arbitrary images is still beyond us. However, advances in artificial intelligence, neural networks, and image processing in combination with large data banks of image information may make it possible in the future to provide verbal interpretation or description for many types of information. A major impetus comes from the desire to make image information searchable by computers. The combination of a tactile representation with feature or texture information presented aurally may provide the best early access to graphic information by users who are blind or cannot use their sight.
Some images, such as pie charts and line graphs, can be
recognized easily and turned into raw data or a text description. Standard
software has been available for some time that will take a scanned image of a
chart and provide a spreadsheet of the data represented in the chart.
Other images, such as electronic schematic diagrams, could be recognized
but are difficult to describe. A house plan illustrates the kind of diagram
that may be describable in general terms and would benefit from combining
a verbal description with a tactile representation for those who cannot
see to deal with this type of information.
Visual Displays
Visual display progress begins with the screen design (graphics, layouts, icons, metaphors, widget sets, animation, color, fisheye views, overviews, zooming) and other aspects of how information is visualized. The human eye can see far more than current computer displays can show. The bandwidth of our visual channel is many orders of magnitude greater than that of the other senses: ~1 gigabit/second. It has a dynamic range of 10^13 to 1 (10 trillion to 1). No human-made sensor or graphics display has this dynamic range. The eye/brain can detect very small displacements at very low rates of motion and sees change up to a rate of about 50 times a second. The eye has a very focused view that is optimized for perceiving movement. Humans cannot see clearly outside an ~5-degree cone of foveal vision and cannot see behind them.
State-of-the-art visualization systems (as of 1996) can create images with a complexity of approximately 4,000 polygons at 50 Hz per eye. Modern graphics engines also filter the image to remove sampling artifacts on polygon edges and, more importantly, textures. Partial transparency is also possible, which allows fog and atmospheric contrast attenuation in a natural-looking way. Occlusion (called "hidden surface removal" in graphics) is provided, as is perspective transformation of vertices. Smooth shading in hardware is also common now.
Thus, the images look rather good in real time, although the
scene complexity is limited to several thousand polygons and the resolution
to 1,280 × 1,024. Typical computer-aided design constructions or
animated graphics for television commercials involve scenes with millions of
polygons; these are not rendered in real time. Magazine illustrations
are rendered at resolutions in excess of 4,000 × 3,000. Thus, the imagery
used in real-time systems is portrayed at rather less than optimal
resolution, often well below the effective visual acuity required to
drive a car. In addition, there are better ways of rendering scenes, as when
the physics of light is more accurately simulated, but these techniques are
not currently achievable in real time. A six-order-of-magnitude increase
in computer speed and graphics generation would be easy to absorb;
a teraflop personal computer would be rather desirable, therefore, but
is probably 10 years off.
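As a rough check on the figures quoted in this discussion, the short calculation below works out the polygon throughput implied by 4,000 polygons at 50 Hz per eye and compares the pixel count of a 1,280 × 1,024 real-time display with that of a 4,000 × 3,000 magazine illustration. It uses only numbers from the text; the variable names are illustrative.

```python
# Figures quoted in the text for real-time systems (as of 1996).
REALTIME_POLYGONS = 4_000
FRAME_RATE_HZ = 50
EYES = 2

# Polygons that must be drawn each second for a stereo pair.
polygons_per_second = REALTIME_POLYGONS * FRAME_RATE_HZ * EYES

# Pixel counts for the real-time display and for print illustration.
realtime_pixels = 1_280 * 1_024
print_pixels = 4_000 * 3_000
pixel_ratio = print_pixels / realtime_pixels

print(f"{polygons_per_second:,} polygons per second for both eyes")
print(f"print resolution has about {pixel_ratio:.1f} times as many pixels")
```

Even before scene complexity is considered, print-quality resolution alone asks for roughly an order of magnitude more pixels per frame than the real-time systems described here deliver.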
Visual Input/Output Hardware
The computer industry provides a range of display devices, from small embedded liquid-crystal displays (LCDs) in personal digital assistants (PDAs) and navigational devices to large cathode-ray tubes (CRTs) and projectors. Clearly, desirable goals are lower cost, power consumption, latency, weight, and both much larger and much smaller screens. Current commercial CRTs achieve up to 2,048 × 2,048 pixels at great cost. Projectors can do ~1,900 × 1,200 displays. It is possible to tessellate projectors at will to achieve arbitrarily higher resolution (Woodward, 1993) and/or brightness (e.g., video walls shown at trade shows and conventions). Screens with >5,000-pixel resolution are desirable. Durability could be improved, especially for portable units.15 Some increase in the capability of television sets to handle computer output, which may be furthered by recent industry-based standards negotiations for advanced television (sometimes referred to as high-definition television), is expected to help lower costs.16 How, when, and where to trade off the generality of personal computers against other qualities that may inhere in more specialized or cheaper devices is an issue for which there may be no one answer.
Hollywood and science fiction have described most of the conceivable, highly futuristic display devices: direct retinal projection, direct cerebral input, Holodecks, and so on. Less futuristic displays still have a long way to go to enable natural-appearing virtual reality (VR). Liquid crystal displays do not have the resolution and low weight needed for acceptable head-mounted displays to be built; users of currently available head-mounted displays are effectively legally blind given the lack of acuity offered. Projected VR displays are usable, although they are large and are not portable.
The acceptance of VR is also hindered by the extreme cost of the high-end graphics engines required to render realistic scenes in real time, the enormous computing power needed to simulate meaningful situations, and the nonlinearity and/or short range of tracking devices. Given that the powerful graphics hardware in the $200 Nintendo 64 game is just incremental steps from supporting the stereo graphics needed for VR, it is clear that the barriers are now in building consumer-level tracking gear and some kind of rugged stereo glasses, at least in the home game context. Once these barriers are overcome, VR will be open for wider application.
High-resolution visual input devices are becoming available to nonprofessionals, allowing them to produce their own visual content. Digital snapshot cameras and scanners, for example, have become available at high-end consumer levels. These devices, while costly, are reasonable in quality and are a great aid to people creating visual materials for the NII.17 Compositing and nonlinear editing software assist greatly as well. Similarly, two-dimensional illustration and three-dimensional animation software make extraordinary graphics achievable by the motivated and talented citizen. The cost of such software will continue to come down as the market widens, and the availability of more memory, processing, graphics power, and disk space will make results more achievable and usable.
As a future goal that defines a conceptual outer limit for input
and output, one might choose the Holodeck from the movie
Star Trek, a device that apparently stores and replays the molecular reconstruction
information from the transporter that beams people up and down. In
The Physics of Star Trek, physicist Lawrence Krauss (1995, pp. 76-77) works out
the information needed to store the molecular dynamics of a single
human body: 10^31 bytes, some
10^16 times the storage needed for all the books
ever written. Krauss points out the other difficulties in
transporter/Holodeck reconstruction as well.
Auditory Displays
The ear collects sound waves and encodes the spatial characteristics of the sound source into temporal and spectral attributes. Intensity difference and temporal/phase difference in sound reaching the two ears provide mechanisms for horizontal (left to right) sound localization. The ear gets information from the whole space via movement in time.
Hearing individual components of sound requires frequency identification. The ear acts like a series of narrowly tuned filters. Sound cues can be used to catch attention with localization, indicate near or far positions with reverberation, indicate collisions and other events, and even portray abstract events such as change over time. Low-frequency sound can vibrate the user's body to somewhat simulate physical displacement.
Speakers and headphones as output devices for synthesized sound match the ears well, unlike the case with visual displays. However, which sounds to create as part of the human-computer interface is much less well understood than for the visual case.
About 50 million instructions per second are required for each synthesized sound source. Computing reverberation off six surfaces for four sound sources might easily require a billion-instruction-per-second computer, one that is within today's range but is rarely dedicated to audio synthesis in practice. Audio sampling and playback are far simpler and are most often used for primitive cues such as beacons and alarms.
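The billion-instruction estimate can be reproduced with simple arithmetic under one assumption that the text does not state: that each first-order reflection off a surface costs about as much as one additional synthesized source, on top of the direct sound. The sketch below shows that calculation; the per-path cost model is illustrative only.

```python
MIPS_PER_SOURCE = 50  # figure quoted in the text
SOURCES = 4
SURFACES = 6

# Assumption: each source contributes one direct path plus one
# first-order reflection per surface, each costing MIPS_PER_SOURCE.
paths = SOURCES * (1 + SURFACES)
total_mips = paths * MIPS_PER_SOURCE

print(f"{paths} sound paths -> about {total_mips:,} MIPS")
```

Under this assumption the total comes to somewhat more than a billion instructions per second, consistent with the order of magnitude given above. Higher-order reflections would raise the cost further.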
Thus, the barriers to good matching to human hearing have to do
with computing the right sound and getting it to each ear in a properly
weighted way. Although in many ways producing sound by computer is
simpler than displaying imagery, many orders of magnitude more research
and development have been devoted to graphics than sound synthesis.
Haptic and Tactile Displays18
Human touch is achieved by the parallel operation of many sensor systems in the body (Kandel and Schwartz, 1981). The hand alone has 19 bones, 19 joints, and 20 muscles with 22 degrees of freedom and many classes of receptors and nerve endings in the joints, skin, tendons, and muscles. The hand can squeeze, stroke, grasp, and press; it can also feel texture, shape, softness, and temperature.
The fingerpad has hairless ridged skin enclosing soft tissues made of fat in a semiliquid state. Fingers can glide over a surface without losing contact or grab an object to manipulate it. Computed output and input of human touch (called "haptics") is currently very primitive compared to graphics and sound. Haptic tasks are of two types: exploration and manipulation. Exploration involves the extraction of object properties such as shape and surface texture, mass, and solidity. Manipulation concerns modification of the environment, from watch repair to using a sledge hammer.
Kinesthetic information (e.g., limb posture, finger position), conveyed by receptors in the tendons and muscles and by neural signals from motor commands, communicates a sense of position. Joint rotations of a fraction of a degree can be perceived. Other nerve endings signal skin temperature, mechanical and thermal pain, chemical pain, and itch.
Responses range from fast spinal reflex to slow deliberate conscious action. Experiments on lifting objects show that slipping is counteracted in 70 milliseconds. Humans can perceive a 2-micrometer-high single dot on a glass plate and a 6-micrometer-high grating, using different types of receptors (Kalawsky, 1993). Tactile and kinesthetic perception extends into the kilohertz range (Shimoga, 1993). Tactile interfaces aim to reproduce sensations arising from contact with textures and edges but do not support the ability to modify the underlying model.
Haptic interfaces are high-performance mechanical devices that support bidirectional input and output of displacement and forces. They measure positions, contact forces, and time derivatives and output new forces and positions (Burdea, 1996). Output to the skin can be point, multipoint, patterned, and time-varying. Consider David Warner, who makes his rounds in a "cyberwear" buzz suit that captures information from his patients' monitors, communicating it with bar charts tingling his arms, pulse rates sent down to his fingertips, and test results whispered in his ears, yet allowing him to maintain critical eye contact with his patients (http://www.pulsar.org).
There are many parallels and differences between haptics and visual (computer graphics) interfaces. The history of computing technology over the past 30 to 40 years is dominated by the exponential growth in computing power enabled by semiconductor technology. Most of this new computing power has supported enriched high-bandwidth user interfaces. Haptics is a sensory/motor interaction modality that is just now being exploited in the quest for seamless interaction with computers. Haptics can be qualitatively different from graphics and audio input/output because it is bidirectional. The computer model both delivers information to the human and is modified by the human during the haptic interaction. Another way to look at this difference is to note that, unlike graphics or audio output, physical energy flows in both directions between the user and the computer through a haptic display device.
In 1996 three distinct market segments emerged for haptic technology: low-end (2 degrees of freedom (DOF), entertainment); mid-range (3 DOF, visualization and training); and high-end (6 DOF, advanced computer-aided engineering). The lesson of video games has been to optimize for real-time feedback and feel. The joysticks or other interfaces for video games are very carefully handled so that they feel continuous. The obviously cheap joystick on the Nintendo 64 game is very smooth, such that a 2-year-old has no problem with it. Such smoothness is necessary for the device to be a good extension of a person's hand motion, since halting response changes the dynamics, causing one to overcompensate, slow down, etc.
A video game joystick with haptic feedback, the "Force FX," is now on the market from CH Products (Vista, Calif.) using technology licensed from Immersion Corp. It is currently supported by about 10 video game software vendors. Other joystick vendors are readying haptic feedback joysticks for this low-priced, high-volume market. In April 1996, Microsoft bought Exos, Inc. (Cambridge, Mass.) to acquire its haptic interaction software interface.
Haptic interaction will play a major role in all simulation-based training involving manual skill (Buttolo et al., 1995). For example, force feedback devices for surgical training are already in the initial stages of commercialization by such companies as Boston Dynamics (Cambridge, Mass.), Immersion Corp. (Palo Alto, Calif.), SensAble Devices (Cambridge, Mass.), and HT Medical (Rockville, Md.).
Advanced CAD users at major industrial corporations such as Boeing (McNeely, 1993) and Ford (Buttolo et al., 1995) are actively funding internal and external research and development in haptic technologies to solve critical bottlenecks they have identified in their computer-aided product development processes.
These are the first signs of a new and broad-based high-technology industry with great potential for U.S. leadership. Research (as discussed below) is necessary to foster and accelerate the development of these and other emerging areas into full-fledged industries.
A number of science and technology issues arise in the haptics and tactile display arena. Haptics is attracting the attention of a growing number of researchers because of the many fascinating problems that must be solved to realize the vision of a rich set of haptic-enabled applications. Because haptic interaction intimately involves high-performance computing, advanced mechanical engineering, and human psychophysics and biomechanics, there are pressing needs for interdisciplinary collaborations as well as basic disciplinary advances. These key areas include the following:
Some of the applications of haptics that are practical today may seem arcane and specialized. This was also true for the first applications of computer graphics in the 1960s. Emerging applications today are the ones with the most urgent need for haptic interaction. Below are some examples of what may become possible:
The first of these examples is technically possible today; the second is not. There are critical computational and mechatronic challenges that will be crucial to successful implementation of ever-more realistic haptic interfaces.
Because haptics is such a basic human interaction mode for so
many activities, there is little doubt that, as the technology matures, new
and unforeseen applications and a substantial new industry will develop
to give people the ability to physically interact with computational models.
Once user interfaces are as responsive as musical instruments, for
example, virtuosity is more achievable. As in music, there will always be a
phase appropriate to contemplation (such as composing/programming) and
a phase for playing/exploring. The consumer will do more of the latter,
of course. Better feedback continuously delivered appears to take less
prediction. Being able to predict is what expertise is mostly about in a
technical/scientific world, and we want systems to be usable by nonexperts,
hence the need for real-time interactions with as much multisensory realism as
is helpful in each circumstance. Research is necessary now to provide
the intellectual capital upon which such an industry can be based.
Tactile Displays for Low- or No-Vision Environments or Users
Tactile displays can help add realism to multisensory virtual reality environments. For people who are blind, however, tactile displays are much more important for the provision of information that would be provided visually to those who can see. For people who are deaf and blind and who cannot use auditory displays or synthetic speech, tactile displays are the principal display form.
Vibration has been used for adding realism to movies and virtual reality environments and also as a signaling technique for people with hearing impairments. It can be used for alarm clocks or doorbells, but is limited in the information it can present even when different frequencies are used for different signals. Vibration can also be used effectively in combination with other tactile displays to provide supplemental information. For example, vibratory information can be used in combination with Braille to indicate text that is highlighted, italicized, or underlined, or to indicate text that is a hyperlink on a hypertext page.
Vibrotactile displays provide a higher-bandwidth channel. With a vibrotactile display, small pins are vibrated up and down to stimulate tactile sensors. The most widespread use of this technique is the Optacon (OPtical to TActile CONverter), which has 144 pins (6 × 24) in its array (100 pins on the Optacon II). The tactile array is usually used in conjunction with a small handheld camera but can also be connected directly to a computer to provide a tactile image around a mouse or other pointing device on the screen.
Electrocutaneous displays have also been explored as a way to create solid-state tactile arrays. Arrays have been constructed for use on the abdomen, back, forearm, and, most recently, the fingertip. Resolution for these displays is much lower than for vibrotactile displays.
Raised-line drawings have long been "king of the hill" for displaying tactile information. The principal problem has been finding an inexpensive and fast way to generate them "on the fly." Wax jet printers showed the greatest potential (especially for high resolution), but none are currently available. For lower resolution, there is a paper onto which one can photocopy an image and which is then processed with heat, causing it to swell wherever there are black lines. Printers that create embossed Braille pages can also be programmed to create tactile images that consist of raised dots. The resolution of these is lower still (the best having a resolution of about 12 dots per inch), but the raised-dot form of the graphics actually has some advantages for tactile perception.
Braille is a system for representing alphanumeric characters tactilely. The system consists of six dots in a pattern two wide by three high. To represent the full ASCII character set, an eight-pin Braille cell (two by four) was developed. Braille is most commonly thought of as being printed or embossed, where paper is punched upward to form Braille cells or characters as raised dots on the page.19 There are also dynamic Braille displays, where cells having (typically) 8 pins that can be raised or lowered independently are arranged in lines of 12 to 20 or more cells on small portable devices and 20 to 40 cells on desktop displays. A few 80-cell displays have been developed, but they are quite expensive and large. By raising or lowering the pins, a line of Braille can be dynamically changed, rather like a single line of text.
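The difference between six-dot and eight-dot cells can be made concrete with a small sketch. It uses the Unicode Braille Patterns block (starting at U+2800) as a convenient encoding, which is an assumption of this example and is not mentioned in the text; the function and table names are hypothetical. Each dot is treated as one bit, so a six-dot cell has 64 possible patterns (too few for the 128 ASCII codes) and an eight-dot cell has 256.

```python
# Illustrative sketch only. The Unicode Braille Patterns block is an
# assumption of this example; dot n of a cell maps to bit n-1 of the offset.

BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block


def braille_cell(dots):
    """Return the Unicode character for a cell with the given raised dots (1-8)."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        mask |= 1 << (dot - 1)
    return chr(BRAILLE_BASE + mask)


# Number of distinct patterns, counting the blank cell.
SIX_DOT_PATTERNS = 2 ** 6    # 64
EIGHT_DOT_PATTERNS = 2 ** 8  # 256

# A few letters in standard literary Braille (dots 1-3 run down the left
# column of the cell, dots 4-6 down the right column).
LETTERS = {"a": (1,), "b": (1, 2), "c": (1, 4)}


def to_braille(text):
    """Translate text made up of the letters in LETTERS into Braille cells."""
    return "".join(braille_cell(LETTERS[ch]) for ch in text)


if __name__ == "__main__":
    print(to_braille("cab"))
    print(SIX_DOT_PATTERNS, EIGHT_DOT_PATTERNS)
```

A dynamic Braille display could be modeled the same way: one such bitmask per cell, with a line of 20 to 40 cells refreshed as the text changes.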
Virtual Page Displays. Because of the difficulties creating full-page tactile displays, a number of people have tried techniques to create a "virtual" full-page display. One example was the Systems 3 prototype, where an Optacon tactile array was placed on a mouse-like puck on a graphics tablet. As the person moved the puck around on the tablet, he or she felt a vibrating image of the screen that corresponded to that location on the tablet. The same technique has been tried with a dynamic Braille display. The resolution, of course, is much lower. In neither case did the tactile recognition approach that of raised lines.
Full-Page Displays. Some attempts have been made to create full-page Braille-resolution displays. The greatest difficulty has been in trying to create something with that many moving parts that is still reliable and inexpensive. More recently, some interesting strategies using ferro-electric liquids and other materials have been tried. In each case the objective was to create a system that involves the minimum number of moving parts and yet provides a very high-resolution tactile display.
Ideal Displays. A dream of the blindness community has been the development of a large plate of hard material that would provide a high-resolution solid-state tactile display. It would be addressable like a liquid-crystal display, with instant response, very high resolution, and variable height. It would be low cost, lightweight, and rugged. Finally, it would be best if it could easily track the position of fingers on the display, so that the tactile display could be easily coupled with voice and other audio to allow parallel presentation of tactile and auditory information for the area of the display currently being touched.
An even better solution, both for blind people and for virtual
reality applications, would be a glove that somehow provided both full
tactile sensation over the palm and fingertips and force feedback. Elements
of this have been demonstrated, but nothing approaching full tactile
sensation or any free-field force feedback.
INTEGRATING INPUT/OUTPUT TECHNOLOGIES
Filling out the range of technologies for people to communicate
with systems (filling in the research and development gaps in the
preceding section) is only part of the input/output requirement for ECIs.
Integration of these technologies into systems that use multiple
communications modalities simultaneously (multimodal systems) can improve
people's performance. (These ideas are discussed in more detail in Chapter 6.)
Integration can also ensure that at least one mechanism is available
for every person and situation, independent of temporary and/or
permanent constraints on their physical and cognitive abilities. Virtual reality
involves the integration of multiple input and output technologies into
an immersive experience that, ideally, will permit people to interact
with systems as naturally as they do with real-world places and objects.
Multimodal Interfaces
People effortlessly integrate information gathered across modalities during conversational interactions. Facial cues and gestures are combined with speech and situational cues, such as objects and events in the environment, to communicate meaning. Almost 100 years of research in experimental psychology attests to our remarkable abilities to bring all knowledge to bear during human communication.
The ability to integrate information across modalities is essential
for accurate and robust comprehension of language by machines and to
enable machines to communicate effectively with people. In noisy
environments, when speech is difficult to understand, facial cues provide
both redundant and complementary information that dramatically
improves recognition performance over either modality alone. To improve
recognition in noisy environments, researchers must discover effective
procedures to recognize and combine speech and facial cues. Similarly,
textual information may be transmitted more effectively under some
conditions by turning the text into natural-sounding speech, produced by an
animated "talking head" with appropriate facial movements. While a
great deal of excellent research is being undertaken in the laboratory,
research in this area has not yet reached the stage where commercial
applications have appeared, and fundamental problems remain to be solved. In
particular, basic research is needed into the science of understanding
how humans use multiple modalities.
Ability-Independent Interfaces
Standard mass-market products are still largely designed with single interfaces (e.g., they are designed to work with a keyboard only or with a touchscreen only). There are systems designed to work with keyboard or mouse, and some cross-modality efforts have been made (e.g., systems that support both keyboard and speech input). Usually, though, these multiple input systems are accomplished by having a second input technique simulate input on the first (for example, having the speech interface create simulated keystrokes or mouse clicks) rather than having the systems designed from the beginning to accommodate alternate interface modalities. This approach is usually the result of companies deciding to add voice or pen support (or support for some other input technique) to their applications after the applications have been architected. This generates both compatibility problems and very complicated user configuration and programming problems.
A similar problem exists with media, materials, databases, or educational programs designed to be used in a visual-only presentation format. Companies (and users) run into problems when the materials need to be presented aurally. For example, systems designed for visual viewing often need to be reengineered if the data are going to be presented over a phone-based information system.
The area where the greatest cross-modality interface research has been carried out has been the disability access area. Strategies for creating audiovisual materials that also include time-synchronized text (e.g., captions) as well as audio descriptions of the visual information have been developed. Interestingly, although closed captioning was added to television sets for people who are deaf, it is used much more in noisy bars, by people learning to read a new language, by children, and by people who have muted their television sets. The captions are also useful for institutions wishing to index or search audiovisual files, and they allow "agent" software to comprehend and work with the audio materials.
In the area of public information systems, such as public kiosks, interfaces are now being developed that are flexible enough to accommodate individuals with an extremely wide range of type, degree, and combination of disabilities. These systems are set up so that the standard touchscreen interface supports variations that allow individuals with different disabilities to use them. Extremely wide variation in human sensory motor abilities can be accommodated without changing the user interface for people without disabilities.
For example, by providing a "touch and hear" feature, a kiosk can be made usable by individuals who cannot read or by those who have low vision. Holding down a switch would cause the touchscreen to become inactive (e.g., touching buttons on the screen would cause no action). However, any buttons or text that were touched would be read aloud to the user. Releasing the switch would reactivate the screen. A "touch and confirm" mode would allow individuals with moderate to severe physical disabilities to use the kiosk by having it accept only input that is followed by a confirmation. An option that provides a listing of the items (e.g., text and buttons) down the left edge of the screen can be combined with the talking "touch and confirm" mode to allow individuals who are completely blind to easily and accurately use a kiosk. The use of captions for audiovisual materials on kiosks can allow individuals who have hearing impairments to access a kiosk (as well as anyone else trying to use a kiosk in a noisy mall). Finally, by sending the information on the pop-up list out through the computer's Infrared Data Association (IrDA) port, it is possible for individuals who are completely paralyzed or deaf and blind to access and use a kiosk via their personal assistive technologies.
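The "touch and hear" and "touch and confirm" behaviors described above amount to a small state machine, sketched below. This is a toy model under stated assumptions, not the design of any deployed kiosk: the class name, method names, and the use of plain lists in place of a speech synthesizer and an action dispatcher are all hypothetical.

```python
class Kiosk:
    """Toy model of a touchscreen kiosk with two accessibility modes.

    Hypothetical sketch: 'spoken' stands in for a speech synthesizer and
    'activated' for whatever action a button would normally trigger.
    """

    def __init__(self, confirm_mode=False):
        self.confirm_mode = confirm_mode  # "touch and confirm" enabled?
        self.switch_held = False          # "touch and hear" switch state
        self.pending = None               # selection awaiting confirmation
        self.spoken = []
        self.activated = []

    def press_switch(self):
        self.switch_held = True

    def release_switch(self):
        self.switch_held = False

    def touch(self, label):
        if self.switch_held:
            # Screen is inactive: read the item aloud, trigger nothing.
            self.spoken.append(label)
        elif self.confirm_mode:
            # Hold the selection until the user confirms it.
            self.pending = label
        else:
            self.activated.append(label)

    def confirm(self):
        if self.pending is not None:
            self.activated.append(self.pending)
            self.pending = None


if __name__ == "__main__":
    kiosk = Kiosk()
    kiosk.press_switch()
    kiosk.touch("Directions")   # spoken only
    kiosk.release_switch()
    kiosk.touch("Directions")   # activated
    print(kiosk.spoken, kiosk.activated)
```

Note that the default path through `touch` is unchanged when neither mode is in use, which mirrors the point that these features need not alter the interface experienced by other users.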
All of these features can be added to a standard multimedia touch-screen kiosk without adding any hardware beyond a single switch and without altering the interface experienced by individuals who do not have disabilities. By adding interface enhancements such as these, it is possible to create a single public kiosk that looks and operates like any traditional touchscreen kiosk but is also accessible and usable by individuals who cannot read, who have low vision, who are blind, who are hearing impaired, who are deaf, who have physical disabilities, who are paralyzed, or who are deaf and blind. Kiosks with flexible user-configurable interfaces have been distributed in Minnesota (including the Mall of America), Washington State, and other states.
These and similar techniques have been implemented in other environments as well. Since the 1980s, Apple Computer has had options built into its human interface to make it more useful to people with functional limitations (look in any Macintosh control panel for EasyAccess). IBM has them built into its hardware and software (AccessDos and OS/2), and UNIX has both options in its human interface and modifications in its underlying structure to support connection to specialized interfaces. Windows 95 has over a dozen adjustments and variations built into its human interface to allow it to be used by individuals with a very wide range of disabilities or environmental limitations, including those with difficulty hearing, seeing, physically operating a keyboard, and operating a mouse from the keyboard.
As we move into more immersive environments and
environments that are utilizing a greater percentage of an individual's different
sensory and motor systems simultaneously (e.g., VR, multimedia), identifying
and developing cross-modal interface techniques will become increasingly
challenging. In the techniques developed to date, however, building
interfaces that allow for cross-modality interaction has generally made for
more robust and flexible interfaces and systems that can better adapt to
new interface technologies as they emerge (e.g., allowing WIMP
(windows, icons, menus, pointers) systems to accommodate a verbal interface).
Virtual Reality and Artificial Immersive Environments
The past 10 years have brought nearly a complete changeover from command line to WIMP interfaces as the dominant every-citizen's connection to computation. This happened because hardware (memory, display chips) became cheap enough to be engineered into every product. It also happened because the first step to the office of the future required replacing the typewriter with the laser printer, an event neatly handled with the "desktop metaphor" and word processing/spreadsheet software. However, the NII implies a complex of technologies relevant to far more than office work, which is a practical reason not to expect it to be accessed by every citizen with mice and windows interfaces alone (van Dam, 1997).20 Another transition is now at hand, one that potentially liberates the interface from the desktop, one that presents information more like objects in a shopping mall than printing on a pile of paper. The virtual shopping mall (or museum) is the next likely application metaphor; the parking lots will be unneeded, of course, as will attention to the laws of physics when inappropriate, but as in three-dimensional user interfaces generally, the metaphor can help in teaching users how to operate in a synthetic environment. Such a metaphor helps also to avoid the constraints that may derive from metaphors linked to one class of activity (e.g., desks and white-collar work) at a time when researchers should think about the needs and challenges posed by all kinds of people.
At SIGGRAPH 96, the major conference for computer graphics and interactive techniques, full-quality, real-time, interactive, three-dimensional, textured flight simulation was presented as the next desirable feature in every product. This visual capability, usually augmented with sound and multidimensional interactive controls, presents information as landscapes and friendly/hostile objects with which the user interacts at high speed. Visual representations of users, known as avatars, are one trend that has been recognized in the popular press. Typing is not usually required or desirable. The world portrayed is spatially three dimensional and it continues well beyond the boundaries of the display device. In this context, input and output devices with more than 2 degrees of freedom are being developed to support true direct manipulation of objects, as opposed to the indirect control provided by two- and three-dimensional widgets, and user interfaces appear to require support for many degrees of freedom,21 higher-bandwidth input and output, real-time response, continuous response and feedback, probabilistic input, and multiple simultaneous input and output streams from multiple users (Herndon et al., 1994). Note that virtual reality also expands on the challenges posed by speech synthesis to include synthesis of arbitrary sounds, a problem that is hampered by the lack of available sound samples analogous to the voice samples used in voice synthesis.
Economic factors will pace the broader accessibility of
technologies that are currently priced out of the reach of every citizen, such as
high-end virtual reality. Virtual reality technology, deriving from 30 years
of government and industry funding, will see its cost plummet as
development is amortized over millions of chip sets, allowing it to come into
the mainstream. Initially, the software for these new chips will be crafted
and optimized by legions of video game programmers driven by
teenage mass-market consumption of the best play and graphics attainable.
Coupled with the development of relatively cheap wide-angle
immersive displays and hundredfold increases in computing power, personal
access to data will come through navigation of complex artificial spaces.
However, providing the every-citizen interface to this shared information
infrastructure will need some help on the design front.
Ten- to 20-Year Challenges for Virtual Reality Systems
Very little cognitive neuroscience and perceptual physiology is understood, much less applied, by human interface developers. The Decade of the Brain is well into its second half now; a flood of information will be available to alert practitioners in the computing community that will be of great use in designing the every-citizen interface. Teams of sensory psychologists, industrial designers, electrical engineers, computer scientists, and marketing experts need to explore, together, the needs of governance, commerce, education, and entertainment. The neuroplasticity of children's cognitive development when they are computationally immersed early in life is barely acknowledged, much less understood.
THE COMMUNICATIONS INFRASTRUCTURE
Because ECIs must work in a networked environment, interface design involves choices that depend on the performance of network access and network-based services and features. What ramifications does connection to networks have for ECIs? This question is relevant because a user interface for any networked application is much more than the immediate set of controls, transducers, and displays that face the user. It is the entire experience that the user has, including the following:
To understand how networking affects user interfaces, consider the two most common interface paradigms for networked applications: speech (telephony) and the "point and click" Web browser. These are so widely accepted and accessible to all kinds of people that they can already be regarded as "almost" every-citizen user interfaces. Research to extend the functionality and performance of these interfaces, without complicating their most common applications, would further NII accessibility for ordinary people.
Speech, understood here to describe information exchange with other people and machines more than an immediate interface with a device, is a natural and popular medium for most people. It is remarkably robust under varying conditions, including a wide range of communications facilities. The rise of Internet telephony and other voice and video-oriented Internet services reinforces the impression that voice will always be a leading paradigm. Voice also illustrates that the difference between a curiosity such as today's Internet telephony and a widely used and expected service depends significantly on performance:22 Technological advances in the Internet, such as IPv6 (Internet Protocol version 6) and routers with quality-of-service features, together with increased capacity and better management of the performance of Internet facilities, are likely to result in much better performance for voice-based applications in the early twenty-first century.
Research that would help make the NII as a whole more usable includes making Internet-based information resources as accessible as possible from a telephone; improving the delay performance and other aspects of voice quality in the Internet and data networks generally; implementing voice interfaces in embedded systems as well as computers; and furthering a comfortable integration of voice and data services, as in computer-controlled telephony, integrated voice mail/e-mail, and data-augmented telephony.
The "point and click" Web browser reflects basic human
behavior, apparent in any child in a toy store who points to something and
says (click!) "I want that." Because of the familiarity of this paradigm,
people all over the world use Web browsers. For reaching information
and people, a Web browser is actually far more standard than
telephony, which has different dial tones and service measurement systems in
different countries. Research issues include multimedia extensions
(including clicking with a spoken "I want that"), adaptation to the increasing skill
of a user in features such as multiple windows and navigation speed,
and adapting to a variety of devices and communication resources that
will offer more or less processing power and communications performance.
The Network Hierarchy and
How It Affects User End-to-End Performance
Among the elements of communications infrastructure that affect performance, the access network is one of several (including networking in the local area of the user and networking within the public network) that have considerable influence. Access network bandwidth is an important parameter affecting performance.
Local-Area Communications
Physical communications networking can be categorized as an inter-working of three networking levels: local, access, and core (or "wide area"). Almost any network-based activity of a residential user is likely to use all three.
Local area networks (LANs) are on the end-user's premises, such as a house, apartment or office building, or university campus. Ethernet, the most widely deployed LAN technology, is already appearing in homes for computer access to cable-based data access systems such as Time-Warner's RoadRunner, Com21's access system, and @Home's access system. It could be in millions of American homes by the year 2000. In general, the 10-megabit-per-second (Mbps) Ethernet is the favored communications interface for connecting personal computers and computing devices to set-top boxes and other network interface devices being developed for high-speed subscriber access networks. A properly engineered shared-bandwidth architecture such as Ethernet allows multiple devices to have the high "burst rate" capability needed for good performance, such as fast transfer of an image, with only rare degradation from congestion. It is "always on," allowing devices always to be connected and ready to satisfy user needs immediately, as opposed to a tedious connection setup.
A residence will be able to simultaneously operate not only several human-oriented user interfaces in personal computers, heating/cooling and appliance controls, light switches, communicating pocket calendars and watches, and so on, but also user interfaces used by such devices as furnaces, garage doors, and washing machines. The introduction of IPv6 in the next decade will create an extremely large pool of Internet addresses, allowing each human being in the world to own hundreds or thousands of them. This development will foster the interconnection of a wide range of devices with embedded systems, a phenomenon that underscores the concern not to cast the NII or ECI challenges in overly personal computer-centric terms.
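A quick calculation gives a sense of scale for the IPv6 address pool. IPv6 addresses are 128 bits long and IPv4 addresses are 32 bits long; the world population figure below is an assumed round number, not taken from the text, and the calculation ignores how address space is actually allocated and reserved.

```python
# Back-of-the-envelope sketch. WORLD_POPULATION is an assumed round figure,
# and real allocation policies leave far fewer usable addresses than this.

IPV4_ADDRESS_BITS = 32
IPV6_ADDRESS_BITS = 128
WORLD_POPULATION = 6_000_000_000  # assumption for illustration

ipv4_total = 2 ** IPV4_ADDRESS_BITS
ipv6_total = 2 ** IPV6_ADDRESS_BITS

ipv4_per_person = ipv4_total / WORLD_POPULATION    # less than one each
ipv6_per_person = ipv6_total // WORLD_POPULATION   # integer division

if __name__ == "__main__":
    print(f"IPv4 addresses per person: {ipv4_per_person:.2f}")
    print(f"IPv6 addresses per person: {ipv6_per_person:.3e}")
```

Under these assumptions the raw IPv6 space works out to vastly more than hundreds or thousands of addresses per person, so the estimate in the text is, if anything, conservative.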
Local networking is not necessarily restricted to one shared wired facility such as Ethernet, which is beginning in the home at 10 Mbps but will likely evolve to "Fast Ethernet" commercial versions or to ATM (asynchronous transfer mode) connection-oriented communications, at 100 Mbps and higher. It can include wireless local networking, generalizations of the cordless phone to cordless personal computers and other devices, with burst rates of at least several megabits per second. Local networking is likely to include assigned (not shared) digital channels in various media for such applications as video programming and other stream or bulk uses, at aggregate data rates of hundreds of megabits per second.
How much bandwidth is enough? Assuming "always connected" and good performance from the other network elements to be described, 10 Mbps symmetric should be adequate for almost all processor-based applications including fast response image transfers (a 5-megabit image in 0.5 second) and high-quality MPEG-2 or H.323 (conferencing/video-phone) video at 4 Mbps. For streaming media such as video, additional requirements of reserved capacity and minimal queuing delay may be needed, requirements for which ATM is well suited. ATM breaks traffic into uniformly sized "cells" that can be efficiently switched and reassembled with specified quality-of-service guarantees. Forecasts of how soon ATM will be available directly at consumer communicating devices vary, but there is likely to be significant availability in 5 to 10 years. For future applications with very complex immersive environments, multiple high-definition video streams, or other bandwidth-intensive needs, Fast Ethernet and ATM should suffice. Additional transmission facilities for program distribution could use these or other technologies.
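The bandwidth figures in this paragraph can be checked with simple arithmetic. The sketch below is idealized: it assumes the full link rate is available to one transfer and ignores protocol overhead, contention, and latency. The helper names are hypothetical, and decimal units (1 megabit = 1,000,000 bits, 1 byte = 8 bits) are assumed.

```python
# Idealized sketch: no protocol overhead, no sharing, no latency.

MBPS = 1_000_000  # bits per second in one megabit per second (decimal units)


def transfer_seconds(payload_bits, link_bps):
    """Time to move a payload over a link at its full nominal rate."""
    return payload_bits / link_bps


def streams_supported(link_bps, stream_bps):
    """Whole number of constant-rate streams that fit in the link."""
    return link_bps // stream_bps


LINK = 10 * MBPS                # the 10-Mbps Ethernet discussed in the text
FIVE_MEGABITS = 5 * MBPS
FIVE_MEGABYTES = 5 * 8 * MBPS   # 5 megabytes expressed in bits

if __name__ == "__main__":
    print(transfer_seconds(FIVE_MEGABITS, LINK))    # 0.5
    print(transfer_seconds(FIVE_MEGABYTES, LINK))   # 4.0
    print(streams_supported(LINK, 4 * MBPS))        # 2
```

The 0.5-second figure corresponds to a 5-megabit payload; a 5-megabyte payload would take about 4 seconds at the same rate. Two 4-Mbps video streams fit within 10 Mbps with some capacity left over.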
Both shared-bandwidth networks such as Ethernet and
dedicated high-capacity channels could reside in the same physical medium,
which might be fiber, coax, or twisted-pair. The cost of a LAN has been
falling steadily, with Ethernet cards for personal computers well below $100.
The cost of wiring a new house or apartment building with cable
for Ethernet is low, but the cost could be substantial for rewiring an
old residence. Wireless networking, to bypass the wiring problem, is
available now, and it may be priced comparably to Ethernet, for
comparable capacity, in 4 to 5 years.
Access Communications
The access network is the set of transmission facilities, control features, and network-based services that sits between a user's premises and the core public network. The twisted-pair subscriber line running from a telephone office to a user's residence is part of the telephone access network, for example. There are four basic paradigms offered (and in development): telephone company services via the twisted-pair subscriber line, cable company services via a coaxial cable (coax) feed, wireless access via higher-powered cellular mobile or lower-powered PCS (personal communications services), and direct broadcast satellite service. There are additional paradigms, such as terrestrial microwave, that are of secondary importance compared with these four. The access network has long been regarded as a performance bottleneck. The telephone channel, restricted to 3-kHz (kilohertz) bandwidth (and data rates of about 30 kbps for reliable transmission) by filters and transmission systems designed for voice, presents both bandwidth limitations and connection delays that seriously degrade performance.
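One way to see why a 3-kHz voice channel tops out near 30 kbps is the Shannon capacity formula, C = B log2(1 + S/N), which the text does not invoke; it is brought in here only as an illustration. The 30-dB signal-to-noise ratio below is an assumed value for a telephone line, not a figure from the text.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Upper bound on error-free data rate for a noisy band-limited channel."""
    snr_linear = 10 ** (snr_db / 10)  # convert decibels to a power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    # 3 kHz of bandwidth at an ASSUMED 30 dB signal-to-noise ratio.
    capacity = shannon_capacity_bps(3_000, 30)
    print(f"{capacity / 1000:.1f} kbps")
```

Under that assumption the bound comes out close to the "about 30 kbps" cited above, which suggests the limit is set by the filtered voice channel itself and not by modem design.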
"Access" can be a confusing term. An Internet service provider offers access service to the Internet and some access facilities such as TCP/IP software, but may not provide the physical pipe into the home. For the moment, the discussion is restricted to access networks that include the physical transmission facilities but returns later to Internet service provider facilities because they have a critical influence on the performance of Web browsers and other Internet-oriented user interfaces.
Twisted Pair-based Telephone Company Services. The first paradigm, access via a twisted-pair subscriber line, is advancing with ISDN (integrated services digital network), ADSL (asymmetric digital subscriber line), VDSL (very-high-speed digital subscriber line), and HDSL (high-bit-rate digital subscriber line).23
Cable-based Access Services. A local cable television (CATV) service company maintains a cable distribution system that is still largely dedicated to broadcasting video programming. The coaxial cable network, now actually combining optical fiber trunks with coaxial branches and "drops" to subscribers, is a "tree and branch" architecture well suited to broadcast and not so well suited to upstream communications from the user. It is not well suited to upstream communications because of noise aggregation problems from many drops and branches coming together and because the capacity of the cable, however large, is being shared with bandwidth-hungry downstream video services and by a great many subscribers.
Nevertheless, the cable industry has succeeded in evolving a promising HFC (hybrid fiber coax) network architecture that can service both video distribution and interactive communications needs.24 The HFC system provides digital channels with signals produced by cable modems, for which a downstream channel may generate a 30-Mbps signal within a 6-MHz bandwidth. Instead of one analog video signal, this digital transmission can carry seven or eight high-quality MPEG-2 digital video signals or one digital HDTV (high-definition television) signal plus two MPEG-2 ordinary digital video signals. More important for the NII, the digital capacity can be used for an arbitrary mix of signals, supporting medical imaging, language instruction, software downloads, and an infinite array of other applications. A cable system could typically implement up to a few dozen such 30-Mbps channels plus 80 old-fashioned analog channels for subscribers who have not yet purchased the digital TV sets expected to hit the U.S. market in 1998.
Upstream capacity shared among many subscribers is much more constrained. Standards are being developed that will allow a user to share with neighbors a 1.5-Mbps upstream channel (one of about 20 such channels serving a group of 125 to 500 subscribers). Other modem designs allow a pool of users to share a 10-Mbps upstream channel, mimicking the behavior of Ethernet. Here, just as with ADSL, the operator is betting that traffic will be asymmetric and that the user will not have a performance complaint even though the upstream bandwidth is not especially generous.
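To make the upstream constraint concrete, the following sketch computes the average share per subscriber from the figures quoted above (about 20 channels of 1.5 Mbps serving 125 to 500 subscribers). It assumes, pessimistically, that every subscriber transmits at once and that capacity is split evenly; real traffic is bursty, so an individual user would often see more. The function name is hypothetical.

```python
def upstream_share_kbps(channels: int, channel_mbps: float, subscribers: int) -> float:
    """Average upstream rate per subscriber if all transmit simultaneously."""
    total_kbps = channels * channel_mbps * 1000
    return total_kbps / subscribers


for group_size in (125, 500):
    share = upstream_share_kbps(channels=20, channel_mbps=1.5, subscribers=group_size)
    print(f"{group_size} subscribers: {share:.0f} kbps each")
```

Under this worst-case assumption the share falls between 60 and 240 kbps per subscriber, far below the 30-Mbps downstream channels, which is the asymmetry the operator is betting on.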
Above this physical channel level, the cable industry's model usually includes IP services with the same "always on" flavor that professionals enjoy at work. This is an important performance advantage for cable access, supporting broadcast information services that flash the latest bulletin on a computer or TV screen, quick Internet telephony perhaps by touching a miniature picture in the screen directory, and immediate linking to a distant Web site (contingent on performance being good farther upstream). If the service, including getting started and customer premise setup,25 is done well, the popular conception of Internet service as difficult to get started and unreliable after that could change radically, and the Web browser could indeed become a universal user interface.
Wireless Access Services: Location Transparency and Consciousness and Power/Bandwidth Tradeoffs. Wireless access, currently in cellular mobile networks and soon in PCS networks, supports mobility of persons, devices, and services. It makes possible carrying wearable or pocket devices, doing computing in a car (perhaps with a "heads-up" display on the windshield, used only when it is safe to do so, of course), reading documents and messages on an electronic "infopad" at meetings, and sending "electronic postcards" from digital cameras and camcorders. The new and large unlicensed NII Supernet spectrum authorized by the Federal Communications Commission, in the relatively high 5-GHz band, will give a large boost to interactive multimedia services when mass-produced, low-cost radio modems become a reality. That could happen within 3 to 4 years.
Wireless access can support both location transparency, in which the user's application appears the same regardless of location, and location consciousness, in which the application finds and exploits local resources and can offer location-dependent services, such as giving directions to the nearest drugstore. These two features are not incompatible, and both contribute to the utility and usability of a user interface.
Because of the power constraints imposed by small portable devices, including but not limited to pocket telephones, medical monitoring and alerting devices, communicating digital cameras and camcorders, communicating watches, communicating pocket calendars, and even some laptop computers, it is important for the quality of the user interface that the wireless access system offer appropriate tradeoffs between communications and processing resources. One way this is realized is to concentrate the power of the portable device on display functions, such as a bright sharp display, and leave media processing (such as MPEG and videoconferencing digital video coding/decoding) to processors accessed through the wireless network. However, this balance of function may imply an unacceptable cost for the substantial communications capacity to carry the unencoded video information. Another issue is how to minimize power use on portable systems that are always listening. Further research will be required to identify a reasonable balance between processing and communications power in the system.
The microcellular PCS and Supernet networks are well suited to this need, aiming for burst transmission rates of 25 Mbps or more in small (perhaps 300-meter-wide) microcells. This compares very favorably with present-day telephony-oriented cellular mobile networks, where modems may provide up to about 20 kbps communications rate. Higher rates are possible in the digital cellular mobile systems becoming widely deployed now, but probably not more than 256 kbps, still far below microcellular networks.
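The practical meaning of these rates for a user interface can be seen by comparing transfer times. The sketch below uses the three rates quoted above and an assumed 1-megabyte file (a size chosen for illustration, not taken from the text); it ignores protocol overhead and treats the microcellular burst rate as if it were sustained.

```python
# Rates quoted in the text, in bits per second.
RATES_BPS = {
    "present-day cellular modem": 20_000,
    "digital cellular (upper estimate)": 256_000,
    "microcellular burst": 25_000_000,
}

FILE_SIZE_BYTES = 1_000_000  # assumed file size, for illustration only


def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Idealized transfer time: payload bits divided by link rate."""
    return size_bytes * 8 / rate_bps


for name, rate in RATES_BPS.items():
    print(f"{name}: {transfer_seconds(FILE_SIZE_BYTES, rate):.2f} s")
```

Under these assumptions the same file takes about 400 seconds, 31 seconds, and a third of a second, respectively.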
The low Earth-orbiting satellite (LEO) systems planned for personal communications from anywhere in the world, which will compete to some extent with terrestrial microcellular PCS systems, could offer the significant user interface advantage of having exactly the same user interface anywhere in the world. This would remove a major anxiety for many users.
Direct Broadcast Satellite Distribution Services. Satellite services could augment wired facilities to improve the performance of the user interface. In particular, downloading of large information files to proxy servers in nearby network offices or in the end-user's equipment itself would reduce the delays of access to information in distant servers. There are cache memories in Web browsers that save Web HTML objects requested by users because there is a high probability that the objects will be requested again, but a proxy server does something else. It caches information when it has been requested by one user, under the assumption that if the material is popular other users may request it as well. This has the effect of improving response time considerably for those users and offering added possibilities for customization. There are many important research questions in selecting material for proxy servers, updating strategies, customization for users, and integrating the satellite facility smoothly with the wired network.
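The difference between a browser cache and a proxy server can be sketched in a few lines. This is a minimal illustration, not a description of any deployed system: the class names, the page path, and the user names are hypothetical, and the sketch deliberately omits the selection, updating, and customization strategies identified above as open research questions.

```python
class OriginServer:
    """Stand-in for a distant server; counts how often it is contacted."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages
        self.fetch_count = 0

    def fetch(self, url: str) -> str:
        self.fetch_count += 1
        return self.pages[url]


class ProxyCache:
    """Shared cache: one user's request populates it for all later users."""

    def __init__(self, origin: OriginServer):
        self.origin = origin
        self.store: dict[str, str] = {}

    def get(self, url: str) -> str:
        if url not in self.store:  # only a miss goes to the distant server
            self.store[url] = self.origin.fetch(url)
        return self.store[url]


origin = OriginServer({"/news": "<html>headline</html>"})
proxy = ProxyCache(origin)
for user in ("alice", "bob", "carol"):  # hypothetical users behind one proxy
    proxy.get("/news")
print(origin.fetch_count)
```

Three users request the same page, but the distant server is contacted only once; the second and third users are served from the nearby proxy.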
Direct broadcast satellite service in the NII would also include the present function of distributing video programming directly to user TVs.
In the future it is possible that continual magazine-style broadcasting
of video information clips, captured and displayed immediately by
user devices rather than retrieved from cache servers, also will be part of
the nation's information infrastructure. This would offer the
freshest-possible material, supporting, for example, a customized user
information service in which information is updated even as the reader observes it.
Core Network Communications: QoS, Interoperability, and Value-Added Services
The core network interconnects access networks. It aggregates traffic, and is, or should be, designed to provide differing quality-of-service (QoS) treatment for different classes of traffic.26 Continuous voice and video media should enjoy minimum delay, and data files should be transferred with minimum error rate. ATM is already widely deployed in the core network. Research and development on QoS control is already extensive, and further work, on topics such as renegotiation of offered capacity and dynamic user control over QoS, would improve the performance of future user interfaces. For example, a user with several applications running could trade QoS among them, improving video resolution, say, at the expense of the rate of transfer of a new software module being downloaded in the background.
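The idea of a user trading QoS among applications can be sketched as a reallocation of a fixed access capacity. The function, the application names, and the rates below are hypothetical and chosen only for illustration; an actual system would renegotiate with the network rather than edit a local table.

```python
def trade_qos(allocations: dict[str, float], donor: str,
              recipient: str, kbps: float) -> dict[str, float]:
    """Move capacity from one application to another; the total is unchanged."""
    if kbps < 0 or kbps > allocations[donor]:
        raise ValueError("donor cannot give up that much capacity")
    updated = dict(allocations)  # leave the caller's table untouched
    updated[donor] -= kbps
    updated[recipient] += kbps
    return updated


# Assumed starting point: video and a background download share 1,500 kbps.
allocations = {"video": 1000.0, "software_download": 500.0}
after = trade_qos(allocations, donor="software_download",
                  recipient="video", kbps=300.0)
print(after)
```

The video stream gains resolution headroom while the background download slows, and the total offered capacity stays the same.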
There are additional services that either the core network or the
access network, or indeed parties other than the network operators,
can provide to enhance user interface performance. For example, a
multiparty desktop audio/video conference can be displayed on one
user's screen as a custom combination of pictures of the other participants
with a corresponding spatial distribution of their voices. This can be
done either in the user equipment by processing multiple audio/video
information streams all coming to that user or by a processing service in
the network (or offered by a third party) called a "multimedia bridge"
that creates the customized display for the user and supplies that user
with only a single audio/video information stream. If access bandwidth is at
a premium, the network-based bridging service provides a
high-quality user interface at a minimum cost in access bandwidth.
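The access-bandwidth saving from a multimedia bridge can be estimated as follows. The sketch assumes a five-party conference and 384 kbps per audio/video stream, both hypothetical values, and further assumes that the bridged composite stream needs no more capacity than a single participant's stream.

```python
def downstream_kbps(participants: int, stream_kbps: float, bridged: bool) -> float:
    """Downstream access capacity one user needs to see all other participants."""
    incoming_streams = 1 if bridged else participants - 1
    return incoming_streams * stream_kbps


# Assumed conference: 5 participants, 384 kbps per stream (illustrative values).
without_bridge = downstream_kbps(5, 384, bridged=False)
with_bridge = downstream_kbps(5, 384, bridged=True)
print(without_bridge, with_bridge)
```

Without the bridge the requirement grows with every added participant; with it, the requirement stays at one stream regardless of conference size.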
Performance Impacts of Internet Services Architecture
The Internet, which utilizes all of the communications hierarchy outlined above, is considered by many to be the heart of the NII. As the multimedia Internet evolves and assumes much of the quality control (and charge for service!) functionality of the telephone network, this is likely to become true by definition. The Internet is defined by use of IP, which carries packets from one kind of network to another without the application having to directly control any services in those networks.
Although they do not, in general, provide the access transmission facilities, Internet service providers do supply other access facilities that have a large influence on the performance of user interfaces. These include at least the following:
It would require a lengthy report to describe how each of these affects user interface performance. Suffice it to say that a major objective in providing good service is the avoidance of server congestion: proxy Web servers give users the impression of fast response from uncongested access to a nearby server, when in fact the originating server is far away and highly congested. Fast response time is, as emphasized earlier, an important measure of good performance of the user interface.
We might also reemphasize the importance of being "always connected" to Internet access for applications such as receiving timely information from "push" servers (such as the fast-developing customized current information services producing ever-changing displays in screen savers), immediate delivery of e-mail, fast receipt and initiation of real-time audio/video calls, and participating in the on-line work of a distributed group. An always-connected transmission access facility is required, of course, which must be matched by similar facilities27 for the Internet service provider.
As with providers of wireless access services, Internet service
providers will soon be required to support mobility services, such as
locating and characterizing nomadic users. There are significant research
questions in coordinating Internet routing and service-class support
policies with the movement of individuals, in transferring customer profiles
for Internet services, and in other aspects of mobility support.
Software Architecture: Distributed Object Environments and Transportable Software
Management of a mobility environment, particularly location transparency and location consciousness, is complex, and further research is needed. Distributed object environments, a software structure being used more and more in communications as well as applications systems, have a large potential to help resolve the complexities and improve performance.28 For example, the global availability of a distributed object environment would make abstract service objects available in a consistent format everywhere, with those objects translating user needs into instructions to local systems.
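One way to picture an abstract service object is as a single interface that is the same everywhere (location transparency) while each local implementation answers from local resources (location consciousness). The sketch below is an illustration under stated assumptions, not a description of any particular distributed object system: the class names, the registry, the place names, and the listings are all hypothetical.

```python
from abc import ABC, abstractmethod


class LocatorService(ABC):
    """Abstract service object: the same interface in every location."""

    @abstractmethod
    def nearest(self, category: str) -> str:
        ...


class LocalLocator(LocatorService):
    """Local implementation that translates the request into local data."""

    def __init__(self, listings: dict[str, str]):
        self._listings = listings

    def nearest(self, category: str) -> str:
        return self._listings.get(category, "none nearby")


# Hypothetical registry mapping a user's current location to a local service.
REGISTRY: dict[str, LocatorService] = {
    "springfield": LocalLocator({"drugstore": "Main St. Pharmacy"}),
    "riverton": LocalLocator({"drugstore": "Bridge Rd. Chemist"}),
}


def find_nearest(user_location: str, category: str) -> str:
    """The application makes the same call wherever the user happens to be."""
    return REGISTRY[user_location].nearest(category)


print(find_nearest("springfield", "drugstore"))
print(find_nearest("riverton", "drugstore"))
```

The application code is identical in both places; only the object bound to the user's location differs.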
Transportable software is another important object-oriented technology that proceeds from a different assumption, that a common "virtual machine" (a special operating system on top of the real one) can be created on different platforms, so that software in "applets" (and applications) can be moved around from one machine to another.29 Java is a widely accepted language a