
3—
Input/Output Technologies:
Current Status And Research Needs

Meeting the every-citizen interface (ECI) criteria described in Chapter 2 will require advances in a number of technology areas. Some involve advances in basic underlying display and interface technologies (higher-resolution visual displays, three-dimensional displays, better voice recognition, better tactile displays, and so on). Others involve advances in our understanding of how to best match these input/output technologies to the sensory, motor, and cognitive capabilities of different users in different and changing environments carrying out a wide variety of tasks. But the new interfaces will need to do more than just physically couple the user to the devices. To meet these visions, the interfaces must have the ability to assist, facilitate, and collaborate with the user in accomplishing tasks.

Subsequent chapters address interface design-the creation of interfaces that make the best-possible use of these human-machine communications technologies-and system attributes that lie beneath the veneer of the interface, such as system intelligence and software support for collaborative activities. This chapter examines the current state and prospective advances in technology areas related directly to communication between a person and a system-hardware and software for input (to the system) and output (to a human). The emphasis is on technical advances that, if implemented in well-designed systems (as stressed in Chapter 4), hold the potential to expand accessibility and usability to many more people than at present. The discussion includes a cluster of speech input/output technologies; natural language understanding (including restricted languages with limited vocabularies); keyboard input; gesture recognition
and machine vision; auditory and touch-based output; interfaces that combine multiple modes of input and output; and visual displays, including immersive or virtual reality systems. Because the ECI challenge involves connecting to the information infrastructure, rather than just to stand-alone systems, this chapter reviews the current status of and research challenges for interfaces for systems in large-scale national networks. The chapter ends with the steering committee's conclusions, based on workshop discussions and other inputs, about the research priorities to advance these technologies and our understanding of how to use them to support every citizen.

Framing The Input/Output Discussion-Layers Of Communication

The interface is the means by which a user communicates with a system, whether to get it to perform some function or computation directly (e.g., compute a trajectory, change a word in a text file, display a video); to find and deliver information (e.g., getting a paper from the Web or information from a database); or to provide ways of interacting with other people (e.g., participate in a chat group, send e-mail, jointly edit a document). As a communications vehicle, interfaces can be assessed and compared in terms of three key dimensions: (1) the language(s) they use, (2) the ways in which they allow users to say things in the language(s), and (3) the surface(s) or device(s) used to produce output (or register input) expressions of the language.

The design and implementation of an interface entail choosing (or designing) the language for communication, specifying the ways in which users may express "statements" of that language (e.g., by typing words or by pointing at icons), and selecting device(s) that allow communication to be realized-the input/output devices. Box 3.1 gives some examples of choices at each of these levels. Although the selection and integration of input/output devices will generally involve hardware concerns (e.g., choices among keyboard, mouse, drawing surfaces, sensor-equipped apparel), decisions about the language definition and means of expression affect interpretation processes that are largely treated in software. The rest of this section briefly describes each of the dimensions and then examines how they can be used to characterize some currently standard interface choices; the remainder of the chapter provides an examination of the state of the art.

BOX 3.1 Layers of Communications

1. Language Layer
• Natural language: complex syntax, complex semantics (whatever a human can say)
• Restricted verbal language (e.g., operating systems command language, air traffic control language): limited syntax, constrained semantics
• Direct manipulation languages: objects are "noun-like," get "verb equivalents" from manipulations (e.g., drag file X to Trash means "erase X"; drag message onto Outgoing Mailbox means "send message"; draw circle around object Y and click means "I'm referring to Y, so I can say something about it.")

2. Expression Layer
Most of these types of realization can be used to express statements in most of the above types of languages. For instance, one can speak or write natural language; one can say or write a restricted language, such as a command-line interface; and one can say or write/draw a direct manipulation language.
• Speaking: continuous speech recognition, isolated-word speech recognition
• Writing: typing on a keyboard, handwriting
• Drawing
• Gesturing (American Sign Language provides an example of gesture as the realization (expression layer choice) for a full-scale natural language.)
• Pick-from-set: various forms of menus
• Pointing, clicking, dragging
• Various three-dimensional manipulations-stretching, rotating, etc.
• Manipulations within a virtual reality environment-same range of speech, gesture, point, click, drag, etc., as above, but with three dimensions and broader field of view
• Manipulation unique to virtual reality environment-locomotion (flying through/over things as a means of manipulating them or at least looking at them)

3. Devices
Hardware mechanisms (and associated device-specific software) that provide a way to express a statement. Again, more than one technology at this layer can be used to implement items at the layer above.
• Keyboards (many different kinds of typing)
• Microphones
• Light pen/drawing pads, touch-sensitive screens, whiteboards
• Video display screen and mouse
• Video display screen and keypad (e.g., automated teller machine)
• Touch-sensitive screen (touch with pen; touch with finger)
• Telephone (audible menu with keypad and/or speech input)
• Push-button interface, with different button for each choice (like big buttons on an appliance)
• Joystick
• Virtual reality input gear-glove, helmet, suit, etc.; also body position detectors

Language Contrasts and Continuum

There are two language classes of interest in the design of interfaces: natural languages (e.g., English, Spanish, Japanese) and artificial languages
(e.g., programming languages, such as C++, Java, Prolog; database query languages, such as SQL; mathematical languages, such as logic; command languages, such as cshell provides). Natural languages are derived evolutionarily; they typically have unrestricted and complex syntax and semantics (assignment of meaning to symbols and to the structures built from those symbols). Artificial languages are created by computer scientists or mathematicians to meet certain design and functional criteria; the syntax is typically tightly constrained and designed to minimize semantic complexity and ambiguity. Because an artificial language has a language definition, construction of an interpreter for the language is a more straightforward task than construction of a system for interpreting sentences in a natural language. The grammar of a programming language is given; defining a grammar for English (or any other natural language) remains a challenging task (though there are now several extensive grammars used in computational systems). Furthermore, the interactions between syntax and semantics can be tightly controlled in an artificial language (because people design them) but can be quite complex in a natural language.1,2

Natural languages are thus more difficult to process. However, they allow for a wider range of expression and as a result are more powerful (and more "natural"). It is likely that the expressivity of natural languages and the ways it allows for incompleteness and indirectness may matter more to their being easy to use than the fact that people already "know them." For example, the phrase, "the letter to Aunt Jenny I wrote last March," may be a more natural way to identify a letter in one's files than trying to recall the file name, identify a particular icon, or grep (a UNIX search command) for a certain string that must be in the letter. The complex requests that may arise in seeking information from on-line databases provide another example of the advantages of complex languages near the natural language end of this dimension. Constraint specifications that are natural to users (e.g., "display the protein structures having more than 40 percent alpha helix") are both diverse and rich in structure, whereas menu- or form-based paradigms cannot readily cover the space of possible queries.

Although natural language processing remains a challenging long-range problem in artificial intelligence (as discussed under "Natural Language Processing" below in this chapter), progress continues to be made, and better understanding of the ways in which it makes communication easier may be used to inform the design of more restricted languages. However, the fact that restricted languages have limitations is not, per se, a shortcoming for their use in ECIs. Limiting the range of language in using a system can (if done right) promote correct interpretation by the system by limiting ambiguity and allowing more effective communication.
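
As a rough illustration of this tradeoff, the Python sketch below shows how small the complete definition of a restricted language can be and why a system can interpret it without ambiguity; the verbs, object types, and constraint markers are invented for illustration and are not drawn from any system described in this chapter.

    # A minimal sketch of a restricted command language: tiny vocabulary,
    # rigid word order, so every well-formed utterance has one interpretation.
    # The vocabulary and grammar here are hypothetical examples.

    VERBS = {"show", "delete", "print"}          # allowed actions
    OBJECTS = {"letter", "report", "message"}    # allowed object types
    FIELDS = {"to", "from", "about", "dated"}    # allowed constraint markers

    def parse(utterance: str):
        """Parse 'VERB OBJECT (FIELD value)*' into a command structure,
        or raise ValueError so the system can prompt for a rephrase."""
        words = utterance.lower().split()
        if len(words) < 2 or words[0] not in VERBS or words[1] not in OBJECTS:
            raise ValueError("utterance is outside the restricted language")
        command = {"action": words[0], "object": words[1], "constraints": {}}
        i = 2
        while i < len(words):
            if words[i] not in FIELDS or i + 1 >= len(words):
                raise ValueError("unexpected token: " + words[i])
            command["constraints"][words[i]] = words[i + 1]
            i += 2
        return command

    # "show letter to jenny dated march" -> one unambiguous interpretation
    print(parse("show letter to jenny dated march"))

The point is not this particular grammar but its size: because the whole language definition fits in a few lines, misinterpretation is unlikely, which is exactly the property a full natural language front end cannot offer.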

For instance, the use of domain- and task-specific restricted languages for certain applications of speech recognition systems has produced results, allowing people to use speech to communicate when they cannot see (either because they are limited by the communication device being used, such as the telephone, or because of physical impairment). Radiologists' workstations, for example, allow the use of speech as the primary means of inputting reports on X-rays or other radiographic tests. Direct manipulation languages may be ideal if there is a close match to what the user wants to do (and hence is able to "say"), that is, if the user's needs are anticipated and the user will not need to program or alter what the system does; they can be a robust means of control that limits the risk of system crashes from misdirected user actions.

In short, the design of an interface along the language dimension entails choices of syntax (which may be simple or complex) and semantics (which can be simple or complex either in itself or in how it relates to syntax). More complex languages typically allow the user to say more but make it harder for the system to figure out what a person means.

Expression Contrasts

A natural language sentence can be spoken, written, typed, gestured, or selected from a menu. An artificial language statement also can be spoken, written, typed, gestured, or selected from a menu. Language expression can take many forms, generally differentiated as being more or less continuous or involving selection from a set of options (e.g., a menu). Speaking can involve isolated words or continuous speech recognition. Writing can involve handwriting or typing; drawing can be free form or can use prespecified options. Gesturing-independently or to manipulate objects-can be free form, can involve a full-scale natural language (e.g., American Sign Language), or can involve a more restricted set of prespecified options (e.g., pointing, dragging, stretching, rotating). Virtual reality and other visualization techniques represent a multimedia form of expression that may involve speech, gesture, direct manipulation, and haptic and other elements.

Thus, the different ways of saying things in a language may also be divided into two structural categories-free form and structured-and several different realization categories: typing, speaking, pointing. Free-form expression is usually more difficult to process than structured expression. For example, a sentence in natural language can be spoken "free form" (this is what we usually think of with natural language), or it might be specified by picking one word at a time out of a structured menu.3 In the structured form the system can control what the user gets to choose to "say" next, and so it is much easier for a system to interpret
and handle. Within a given form, some means of realization may be easier to handle than others (e.g., correctly typed words are easier to interpret than handwritten words; freehand drawings are more difficult than structured CAD/CAM (computer-aided design/computer-aided manufacturing) diagrams). It is also important to note that more structured systems may be preferable for certain applications, such as those involving completion of forms (Oviatt et al., 1994).

Menu/icon systems thus provide an alternative way of expressing command-like languages. They have underlying languages, typically very much like command languages. The commands (natural language verb equivalents) are often menu items (e.g., "select," "edit"); the parameters (natural language noun equivalents) are icons (or open files); and the statements (natural language sentence equivalents) are sequences of select "nouns" and "verbs." The menus and icons provide the structure within which a user can say something in the language.

Devices

The hardware realization of communication can take many forms; common ones include microphones and speakers, keyboards and mice, drawing pads, touch-sensitive screens, light pens, and push buttons. The choice of device interacts with the choice of medium: display, film/videotape, speaker/audiotape, and so on. There may also be interactions between expression and device (an obvious example is the connection between pointing device (mouse, trackball, joystick) and pull-down menus or icons). On the other hand, it is also possible to relax some of these associations to allow for alternative surfaces (e.g., keyboard alternatives to pointers, aural alternatives to visual outputs). Producing interfaces for every citizen will entail providing for alternative input/output devices for a given language-expression combination; it might also call for alternative approaches to expression.

Comparisons Among Graphical User Interfaces, Natural Language, and Speech

The language-expression-device framework can be used to gain perspective on current standard interface types and on the research opportunities and challenges presented by ECIs. For example, it makes clear that natural language processing and speech recognition (and other technologies that may be associated colloquially) introduce different issues and different tradeoffs. A speech-based interface such as AT&T's long-distance voice recognition system, which can recognize phrases such as "collect call" and "calling card,"4 can combine a restricted language with
speech as a means of expression. As this example illustrates, neither speech recognition with unlimited vocabulary nor complete/comprehensive language understanding is necessary to provide natural language-like input to a system within a restricted domain and task. Similarly, it is possible to improve restricted language interfaces by applying principles from natural language communication.

Current graphical user interface/menu/icon systems tightly constrain what one can say, both by starting with a very constrained language and by having a structured way in which one can express things in that language. They are at the opposite end of both the language and the expression spectrum from natural languages. It is thus clear why they are easier to process, but also why they are more constraining (Cohen and Oviatt, 1994).

Ongoing efforts to develop speech interfaces for Web browsers provide a concrete example of the importance of understanding the different tradeoffs of each of these dimensions. Choosing speech on the expression layer rather than pointing and clicking would lead to being able to "speak" the icons and hyperlinks that are designed for keyboard and mouse. Although this may suffice in certain settings-replacing one modality for another can be useful in hands-free contexts and for those with physical limitations-it does not necessarily expand a system's capabilities or lead to new paradigms for interactions. An alternative approach would be to explore how spoken language technology can expand the user's ability to obtain the desired information easily and quickly from the Web, leading to a different, probably more expressive, language. From this perspective, speech would augment rather than replace the mouse and keyboard, and a user would be able to choose among many interface language-expression options to achieve a task in the most natural and efficient manner.

Natural language interaction is particularly appropriate when the information space is broad and diverse or when the user's request contains complex constraints. Both of these situations occur frequently on the Web. For example, finding a specific home page or document now requires remembering a uniform resource locator, searching through the Web for a pointer to the desired document, or using one of the keyword search engines available. Current interfaces present the user with a fixed set of choices at any point, of which one is to be selected. Only by stepping through the offered choices and conforming to the prescribed organization of the Web can users reach the documents they desire. The multitude of indexes and meta-indexes on the Web is testimony to the reality and magnitude of the problem. The power of a human/natural language in this situation is that it allows the user to specify what information or document is desired (e.g., "Show me the White House home page," "Will it rain tomorrow in Seattle?" or "What is the ZIP code for Orlando, Florida?") without having to know where and how the information is stored. A natural language, regardless of whether it is expressed using speech, typing, or handwriting, offers a user significantly more power in expressing constraints, thereby freeing the user from having to adhere to a rigid, preconceived indexing and command hierarchy.
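
To make the contrast concrete, the small Python sketch below shows the kind of mapping a natural language front end must perform: turning a free-form request into the structured constraints that an index or database actually needs. The intent names and patterns are invented for illustration and do not describe any system discussed in this chapter.

    import re

    # Hypothetical intent patterns: each maps a family of free-form requests
    # onto a structured query that a back-end index could execute.
    PATTERNS = [
        (re.compile(r"show me the (?P<topic>.+) home page", re.I),
         lambda m: {"service": "web_search", "type": "home_page",
                    "topic": m["topic"]}),
        (re.compile(r"will it rain (?P<when>\w+) in (?P<city>.+)", re.I),
         lambda m: {"service": "weather", "condition": "rain",
                    "when": m["when"], "city": m["city"].rstrip("?")}),
        (re.compile(r"what is the zip code for (?P<place>.+)", re.I),
         lambda m: {"service": "postal_lookup", "place": m["place"].rstrip("?")}),
    ]

    def interpret(request: str):
        """Map a free-form request to structured constraints, if a pattern matches."""
        for pattern, build in PATTERNS:
            match = pattern.search(request)
            if match:
                return build(match)
        return {"service": "keyword_search", "terms": request.split()}

    print(interpret("Show me the White House home page"))
    print(interpret("Will it rain tomorrow in Seattle?"))

The point of the sketch is the shape of the problem rather than the method: the user states constraints ("rain," "tomorrow," "Seattle"), and the system, not the user, is responsible for knowing where and how that information is stored.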

In examining the state of the art of various input/output technologies, it is important to recognize that no single choice is right for all interfaces. In fact, one of the major challenges of interface design may be designing a language that is powerful enough for a user to say what needs to be said, yet constrained enough to make processing easier and misinterpretation less likely. In looking at input/output options, it will be useful to keep in mind where various options fall on one or another of these scales and the tradeoffs implicit in choosing a given option.

Technologies For Communicating With Systems

Humans modulate energy in many ways. Recognizing that fact allows for exploration of a rich set of alternatives and complements-at any time, a user-chosen subset of controls and displays-that a focus on simplicity of interface design as the primary goal can obscure. Current direct manipulation interfaces with two-dimensional display and mouse input make use, minimally, of one arm with two fingers and a thumb and one eye-about what is used to control a television remote. It was considered a stroke of genius, of course, to reduce all computer interactions to this simple set as a transition mechanism to enable people to learn to use computers without much training. There are no longer any reasons (including cost) to remain stuck in this transition mode. We need to develop a fuller coupling of human and computer, with attention to flexibility of input and output.

In some interactive situations, for example, all a computer or information appliance needs for input is a modulated signal that it can use to direct rendered data to the user's eyes, ears, and skin. Over 200 different transducers have been used to date with people having disabilities. In work with severely disabled children, David Warner, of Syracuse University, has developed a suite of sensors to let kids control computer displays with muscle twitches, eye movement, facial expressions, voice, or whatever signal they can manage to modulate. The results evoke profound emotion in patients, doctors, and observers and demonstrate the value of research on human capabilities to modulate energy in real time, the sensors that can transduce those energies, and effective ways to render the information affected by such interactions.
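
The point that any modulated signal can drive an interface is easier to picture with a sketch. The Python fragment below uses hypothetical thresholds and event names (it is not a description of Warner's system) to show the minimal contract: any signal the user can modulate, sampled over time, can be turned into discrete control events.

    # A minimal sketch: turn any modulated one-dimensional signal (muscle twitch,
    # eye movement, breath pressure...) into discrete "select" events.
    # Threshold, refractory period, and event names are illustrative only.

    def signal_to_events(samples, threshold=0.6, refractory=3):
        """Emit a 'select' event each time the signal rises above threshold,
        then ignore further crossings for `refractory` samples (debouncing)."""
        events, cooldown = [], 0
        for t, value in enumerate(samples):
            if cooldown > 0:
                cooldown -= 1
            elif value >= threshold:
                events.append(("select", t))
                cooldown = refractory
        return events

    # A simulated signal with two deliberate "twitches":
    trace = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.1, 0.7, 0.2, 0.1]
    print(signal_to_events(trace))   # -> [('select', 2), ('select', 7)]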

The state of the art in a range of technologies for communicating with systems is reviewed below. Also addressed are the device and expression layers of the model described in the previous section and summarized in Box 3.1. The choice of language-natural, restricted, or direct manipulation-influences but does not dictate the technologies discussed here. The exception is the subsection, "Natural Language Processing," which also encompasses the language layer of the model and discusses how choices along a spectrum from fully natural languages to relatively restricted languages influence the performance of various expression modes, particularly speech input.

Speech Synthesis

Text-to-speech systems, or speech synthesizers, take unrestricted text as input and produce a synthetic spoken version of that text as output. Most current commercial synthesizers exhibit a high degree of intelligibility, but none sound truly natural. The major barriers to naturalness are deficiencies of text normalization, intonational assignment, and synthesized voice quality. Female speech and children's speech are generally less acceptable than adult male synthetic speech, probably because they have been studied less (Roe and Wilpon, 1994).

In the course of transforming text into speech, all text-to-speech systems must do the following:

• Identify words and determine their pronunciations;
• Decide how such items as abbreviations and numbers should be pronounced (text normalization);
• Determine which words should be made prominent in the output, where pauses should be inserted, and what the overall shape of the intonational contour should be (intonation assignment);
• Compute appropriate durations and amplitudes for each of the words that will be synthesized;
• Determine how the overall intonational contour will be realized for the text to be synthesized;
• Identify which acoustic elements will be used to produce the spoken text (for concatenative synthesizers) or to retrieve the sequences of appropriate parameters to generate synthetic elements (for formant synthesizers);5 and
• Synthesize the utterance from the specifications and/or acoustic elements identified.

While most systems permit some form of user control over various parameters at many of these stages, to fine-tune system defaults, documentation and tools for such control are usually lacking, and most users lack the requisite background to produce satisfying results.
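
One way to visualize the stages listed above is as a pipeline of functions. The Python skeleton below is only a sketch under that reading: every function body is a stand-in, not an account of how any particular synthesizer works.

    # Skeleton of a text-to-speech pipeline following the stages listed above.
    # Each step is a placeholder; real systems put years of work behind each one.

    def normalize(text):            # e.g., expand abbreviations and numbers
        return text.lower().split()

    def pronounce(words):           # look up or derive a phoneme string per word
        return [(w, list(w)) for w in words]          # stand-in "phonemes"

    def assign_intonation(words):   # prominence, pauses, overall contour
        return {"prominent": words[:1], "pauses": [], "contour": "declarative"}

    def timing_and_amplitude(prons):
        return [{"word": w, "duration_ms": 80 * len(p), "amplitude": 1.0}
                for w, p in prons]

    def select_units(prons):        # concatenative: pick stored units;
        return [p for _, p in prons]                  # formant: pick parameter tracks

    def synthesize(units, prosody, intonation):
        return b"..."               # waveform samples would be produced here

    def text_to_speech(text):
        words = normalize(text)
        prons = pronounce(words)
        intonation = assign_intonation(words)
        prosody = timing_and_amplitude(prons)
        units = select_units(prons)
        return synthesize(units, prosody, intonation)

    audio = text_to_speech("Speech synthesis has seven broad stages.")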

Particularly for concatenative synthesizers, it is difficult and time consuming to produce new voices, since each voice requires that a new set of concatenative units be recorded and segmented. While most research groups are developing tools in an attempt to automate this process (often by using automatic speech recognition systems to produce a first-pass segmentation), none have succeeded in eliminating the need for laborious hand correction of the database. There have also been efforts in recent years to automate the production of other components of synthesis, to facilitate the production of synthesizers in many languages from a single architecture.

We know that synthetic speech should sound better. It is not clear, exactly, how to decide what is better: More natural and more human-like? More intelligible? More intelligible at normal talking speeds or at high speeds? Speech is usually used for conversational modes of interaction. When speech is being used for presenting a Web page, for example, there is additional information that needs to be provided: Which words form links? Which words are italicized? How is this information presented most effectively? How should words be dealt with that have multiple different pronunciations in different parts of the country or to different individuals?

Speech Input/Recognition

The full integration of voice as an input medium, if achievable, could alleviate many of the known limitations of existing human-machine interfaces. People with poor or no literacy skills, people whose hands are busy, people suffering from cumulative trauma disorders associated with typing and pointing (or seeking to avoid them)-could all benefit from spoken communication with systems. While the capabilities envisioned in such a system are well beyond the state of the art in both speech recognition and language understanding at present, the technology has advanced sufficiently to allow very simple voice-based applications to emerge (see below).

Speech recognition research has made significant progress in the past 15 years (Roe and Wilpon, 1994; Cole and Hirschman, 1995; Cole et al., 1996). The gains have come from the convergence of several technologies: higher-accuracy continuous speech recognition based on better speech modeling techniques, better recognition search strategies that reduce the time needed for high-accuracy recognition, and increased power of audio-capable, off-the-shelf workstations. As a result of these advances, real-time, speaker-independent, continuous speech recognition, with vocabularies
of a few thousand words, is now possible in software on regular workstations. In terms of recognition performance, word error rates have dropped by more than an order of magnitude in the past decade and are expected to continue to fall with further research.

These improvements have come about as a result of technical as well as programmatic innovations. Technically, there have been advances in two areas. First, a paradigm shift from rule-based to model-based methods has taken place. In particular, probabilistic hidden Markov models (HMM) have proven to be an excellent method of modeling phonemes in various contexts. This model-based paradigm, with its ability to estimate model parameters automatically from training data, has shown its power and versatility by applying the technology to various languages, using the same software. Second, the use of statistical grammars, which estimate the probability of two- and three-word sequences, has been instrumental in improving recognition accuracy, especially for large-vocabulary tasks. These simple statistical grammars have, so far, proven to be superior to traditional rule-based grammars for speech recognition purposes.

Programmatically, the collection and dissemination of standard, common training and test corpora worldwide, the sponsorship of common evaluations, and the dissemination at workshops of information about competing methods have all ensured very rapid progress in the technology. This programmatic approach was pioneered by the Defense Advanced Research Projects Agency (DARPA), which continues to sponsor common evaluations and initiated the establishment of the Linguistic Data Consortium, which has been in charge of the collection and dissemination of common corpora. A similar approach is now being taken in Europe.

Word error rates for speaker-independent continuous speech recognition vary a great deal, depending on the difficulty of the task: from less than 0.3 percent for connected digits, to 3 percent for a 2,500-word travel information task, to 10 percent for articles read from the Wall Street Journal, to 27 percent for transcription of broadcast news programs, to 40 percent for conversational speech over the telephone. Although word error rates in the laboratory can be quite small for some tasks, error rates can increase by a factor of four or more when the same systems are used in the field. This increase has various causes: heavy accents, ambient noise, different microphones, hesitations and restarts, and straying from the system's vocabulary.

Speech recognition has begun to enter the mainstream of everyday life, chiefly through telephone-based applications (Margulies, 1995). The most visible of these applications involve directory assistance services, such as the recognition of a few words (e.g., the digits and words such as "operator," "yes/no," "collect") or recognition of the names of cities in a

Local-Area Communications

Physical communications networking can be categorized as an interworking of three networking levels: local, access, and core (or "wide area"). Almost any network-based activity of a residential user is likely to use all three.

Local area networks (LANs) are on the end-user's premises, such as a house, apartment or office building, or university campus. Ethernet, the most widely deployed LAN technology, is already appearing in homes for computer access to cable-based data access systems such as TimeWarner's RoadRunner, Com21's access system, and @Home's access system. It could be in millions of American homes by the year 2000. In general, the 10-megabit-per-second (Mbps) Ethernet is the favored communications interface for connecting personal computers and computing devices to set-top boxes and other network interface devices being developed for high-speed subscriber access networks. A properly engineered shared-bandwidth architecture such as Ethernet allows multiple devices to have the high "burst rate" capability needed for good performance, such as fast transfer of an image, with only rare degradation from congestion. It is "always on," allowing devices always to be connected and ready to satisfy user needs immediately, as opposed to a tedious connection setup.

A residence will be able to simultaneously operate not only several human-oriented user interfaces in personal computers, heating/cooling and appliance controls, light switches, communicating pocket calendars and watches, and so on, but also user interfaces used by such devices as furnaces, garage doors, and washing machines. The introduction of IPv6 in the next decade will create an extremely large pool of Internet addresses, allowing each human being in the world to own hundreds or thousands of them. This development will foster the interconnection of a wide range of devices with embedded systems, a phenomenon that underscores the concern not to cast the NII or ECI challenges in overly personal computer-centric terms.

Local networking is not necessarily restricted to one shared wired facility such as Ethernet, which is beginning in the home at 10 Mbps but will likely evolve to "Fast Ethernet" commercial versions or to ATM (asynchronous transfer mode) connection-oriented communications, at 100 Mbps and higher. It can include wireless local networking, generalizations of the cordless phone to cordless personal computers and other devices, with burst rates of at least several megabits per second. Local networking is likely to include assigned (not shared) digital channels in various media for such applications as video programming and other stream or bulk uses, at aggregate data rates of hundreds of megabits per second.
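
The burst-rate point above can be quantified with some back-of-the-envelope arithmetic; the file and stream sizes in the Python sketch below are illustrative assumptions, not figures from this report.

    # Illustrative transfer-time arithmetic for a shared LAN.
    # Sizes and rates are assumptions chosen only to show the shape of the tradeoff.

    def transfer_seconds(size_megabytes, link_mbps):
        """Time to move a file of size_megabytes over a link of link_mbps."""
        bits = size_megabytes * 8_000_000          # 1 megabyte ~ 8 million bits here
        return bits / (link_mbps * 1_000_000)

    for rate in (10, 100):                         # Ethernet, "Fast Ethernet"
        print(rate, "Mbps:", transfer_seconds(1, rate), "s for a 1 MB image")

    # A continuous 4 Mbps video stream, by contrast, never releases the channel:
    video_mbps = 4
    print("A 4 Mbps stream leaves", 10 - video_mbps, "Mbps of a 10 Mbps LAN for bursts")

What matters to the user is how long each transfer occupies the shared channel: short, fast bursts coexist well, while continuous streams consume capacity for their whole duration.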

How much bandwidth is enough? Assuming "always connected" and good performance from the other network elements to be described, 10 Mbps symmetric should be adequate for almost all processor-based applications including fast response image transfers (a 5-megabyte image in 0.5 second) and high-quality MPEG-2 or H.323 (conferencing/videophone) video at 4 Mbps. For streaming media such as video, additional requirements of reserved capacity and minimal queuing delay may be needed, requirements for which ATM is well suited. ATM breaks traffic into uniformly sized "cells" that can be efficiently switched and reassembled with specified quality-of-service guarantees. Forecasts of how soon ATM will be available directly at consumer communicating devices vary, but there is likely to be significant availability in 5 to 10 years. For future applications with very complex immersive environments, multiple high-definition video streams, or other bandwidth-intensive needs, fast Ethernet and ATM should suffice. Additional transmission facilities for program distribution could use these or other technologies. Both shared-bandwidth networks such as Ethernet and dedicated high-capacity channels could reside in the same physical medium, which might be fiber, coax, or twisted-pair.

The cost of a LAN has been falling steadily, with Ethernet cards for personal computers well below $100. The cost of wiring a new house or apartment building with cable for Ethernet is low, but the cost could be substantial for rewiring an old residence. Wireless networking, to bypass the wiring problem, is available now, and it may be priced comparably to Ethernet, for comparable capacity, in 4 to 5 years.

Access Communications

The access network is the set of transmission facilities, control features, and network-based services that sits between a user's premises and the core public network. The twisted-pair subscriber line running from a telephone office to a user's residence is part of the telephone access network, for example. There are four basic paradigms offered (and in development): telephone company services via the twisted-pair subscriber line, cable company services via a coaxial cable (coax) feed, wireless access via higher-powered cellular mobile or lower-powered PCS (personal communications services), and direct broadcast satellite service. There are additional paradigms, such as terrestrial microwave, that are of secondary importance compared with these four.

The access network has long been regarded as a performance bottleneck. The telephone channel, restricted to 3-kHz (kilohertz) bandwidth (and data rates of about 30 kbps for reliable transmission) by filters and transmission systems designed for
voice, presents both bandwidth limitations and connection delays that seriously degrade performance.

"Access" can be a confusing term. An Internet service provider offers access service to the Internet and some access facilities such as TCP/IP software, but may not provide the physical pipe into the home. For the moment, the discussion is restricted to access networks that include the physical transmission facilities but returns later to Internet service provider facilities because they have a critical influence on the performance of Web browsers and other Internet-oriented user interfaces.

Twisted Pair-based Telephone Company Services. The first paradigm, access via a twisted-pair subscriber line, is advancing with ISDN (integrated services digital network), ADSL (asymmetric digital subscriber line), VDSL (very-high-speed digital subscriber line), and HDSL (high-speed digital subscriber line).23

Cable-based Access Services. A local cable television (CATV) service company maintains a cable distribution system that is still largely dedicated to broadcasting video programming. The coaxial cable network, now actually combining optical fiber trunks with coaxial branches and "drops" to subscribers, is a "tree and branch" architecture well suited to broadcast and not so well suited to upstream communications from the user. It is not well suited to upstream communications because of noise aggregation problems from many drops and branches coming together and because the capacity of the cable, however large, is being shared with bandwidth-hungry downstream video services and by a great many subscribers. Nevertheless, the cable industry has succeeded in evolving a promising HFC (hybrid fiber coax) network architecture that can service both video distribution and interactive communications needs.24

The HFC system provides digital channels with signals produced by cable modems, for which a downstream channel may generate a 30-Mbps signal within a 6-MHz bandwidth. Instead of one analog video signal, this digital transmission can carry seven or eight high-quality MPEG-2 digital video signals or one digital HDTV (high-definition television) signal plus two MPEG-2 ordinary digital video signals. More important for the NII, the digital capacity can be used for an arbitrary mix of signals, supporting medical imaging, language instruction, software downloads, and an infinite array of other applications. A cable system could typically implement up to a few dozen such 30-Mbps channels plus 80 old-fashioned analog channels for subscribers who have not yet purchased the digital TV sets expected to hit the U.S. market in 1998.
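
The downstream arithmetic in the preceding paragraph can be made concrete with a small worked example; the per-stream rates below are illustrative assumptions chosen to be consistent with the figures quoted above.

    # Worked example: how a 30 Mbps downstream cable channel (in 6 MHz) divides up.
    # Per-stream rates below are illustrative assumptions.

    CHANNEL_MBPS = 30
    MPEG2_MBPS = 4          # a "high-quality" MPEG-2 stream, roughly
    HDTV_MBPS = 19          # a digital HDTV stream, roughly

    print(CHANNEL_MBPS // MPEG2_MBPS, "MPEG-2 streams per channel")          # ~7
    print((CHANNEL_MBPS - HDTV_MBPS) // MPEG2_MBPS,
          "MPEG-2 streams alongside one HDTV stream")                        # ~2

    # With a few dozen such channels, the aggregate downstream capacity is large:
    print(36 * CHANNEL_MBPS, "Mbps across 36 digital channels")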

Upstream capacity shared among many subscribers is much more constrained. Standards are being developed that will allow a user to share with neighbors a 1.5-Mbps upstream channel (one of about 20 such channels serving a group of 125 to 500 subscribers). Other modem designs allow a pool of users to share a 10-Mbps upstream channel, mimicking the behavior of Ethernet. Here, just as with ADSL, the operator is betting that traffic will be asymmetric and that the user will not have a performance complaint even though the upstream bandwidth is not especially generous.

Above this physical channel level, the cable industry's model usually includes IP services with the same "always on" flavor that professionals enjoy at work. This is an important performance advantage for cable access, supporting broadcast information services that flash the latest bulletin on a computer or TV screen, quick Internet telephony perhaps by touching a miniature picture in the screen directory, and immediate linking to a distant Web site (contingent on performance being good farther upstream). If the service, including getting started and customer premise setup,25 is done well, the popular conception of Internet service as difficult to get started and unreliable after that could change radically, and the Web browser could indeed become a universal user interface.

Wireless Access Services: Location Transparency and Consciousness and Power/Bandwidth Tradeoffs. Wireless access, currently in cellular mobile networks and soon in PCS networks, supports mobility of persons, devices, and services. It makes possible carrying wearable or pocket devices, doing computing in a car (perhaps with a "heads-up" display on the windshield-used only when it is safe to do so, of course), reading documents and messages on an electronic "infopad" at meetings, and sending "electronic postcards" from digital cameras and camcorders. The new and large unlicensed NII Supernet spectrum authorized by the Federal Communications Commission, in the relatively high 5-GHz band, will give a large boost to interactive multimedia services when mass-produced, low-cost radio modems become a reality. That could happen within 3 to 4 years.

Wireless access can support both location transparency, in which the user's application appears the same regardless of location, and location consciousness, in which the application finds and exploits local resources and can offer location-dependent services, such as giving directions to the nearest drugstore. These two features are not incompatible, and both contribute to the utility and usability of a user interface.

Because of the power constraints imposed by small portable devices, including but not limited to pocket telephones, medical monitoring and alerting devices, communicating digital cameras and camcorders, communicating watches, communicating pocket calendars, and even some
laptop computers, it is important for the quality of the user interface that the wireless access system offer appropriate tradeoffs between communications and processing resources. One way this is realized is to concentrate the power of the portable device on display functions, such as a bright sharp display, and leave media processing (such as MPEG and videoconferencing digital video coding/decoding) to processors accessed through the wireless network. However, this balance of function may imply an unacceptable cost for the substantial communications capacity to carry the unencoded video information. Another issue is how to minimize power use on portable systems that are always listening. Further research will be required to identify a reasonable balance between processing and communications power in the system.

The microcellular PCS and Supernet networks are well suited to this need, aiming for burst transmission rates of 25 Mbps or more in small (perhaps 300-meter-wide) microcells. This compares very favorably with present-day telephony-oriented cellular mobile networks, where modems may provide up to about 20 kbps communications rate. Higher rates are possible in the digital cellular mobile systems becoming widely deployed now, but probably not more than 256 kbps, still far below microcellular networks.

The low Earth-orbiting satellite (LEO) systems planned for personal communications from anywhere in the world, which will compete to some extent with terrestrial microcellular PCS systems, could offer the significant user interface advantage of having exactly the same user interface anywhere in the world. This would remove a major anxiety for many users.

Direct Broadcast Satellite Distribution Services. Satellite services could augment wired facilities to improve the performance of the user interface. In particular, downloading of large information files to proxy servers in nearby network offices or in the end-user's equipment itself would reduce the delays of access to information in distant servers. There are cache memories in Web browsers that save Web HTML objects requested by users because there is a high probability that the objects will be requested again, but a proxy server does something else. It caches information when it has been requested by one user, under the assumption that if the material is popular other users may request it as well. This has the effect of improving response time considerably for those users and offering added possibilities for customization. There are many important research questions in selecting material for proxy servers, updating strategies, customization for users, and integrating the satellite facility smoothly with the wired network.

Direct broadcast satellite service in the NII would also include its
present function of distributing video programming directly to user TVs. In the future it is possible that continual magazine-style broadcasting of video information clips, captured and displayed immediately by user devices rather than retrieved from cache servers, also will be part of the nation's information infrastructure. This would offer the freshest-possible material, supporting, for example, a customized user information service in which information is updated even as the reader observes it.

Core Network Communications: QoS, Interoperability, and Value-Added Services

The core network interconnects access networks. It aggregates traffic, and is, or should be, designed to provide differing quality-of-service (QoS) treatment for different classes of traffic.26 Continuous voice and video media should enjoy minimum delay, and data files should be transferred with minimum error rate. ATM is already widely deployed in the core network. Research and development on QoS control is already extensive, and further work, on topics such as renegotiation of offered capacity and dynamic user control over QoS, would improve the performance of future user interfaces. For example, a user with several applications running could trade QoS among them, improving video resolution, for example, at the expense of the rate of transfer of a new software module being downloaded in the background.

There are additional services that either the core network or the access network, or indeed parties other than the network operators, can provide to enhance user interface performance. For example, a multiparty desktop audio/video conference can be displayed on one user's screen as a custom combination of pictures of the other participants with a corresponding spatial distribution of their voices. This can be done either in the user equipment by processing multiple audio/video information streams all coming to that user or by a processing service in the network (or offered by a third party) called a "multimedia bridge" that creates the customized display for the user and supplies that user with only a single audio/video information stream. If access bandwidth is at a premium, the network-based bridging service provides a high-quality user interface at a minimum cost in access bandwidth.
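
The idea of a user trading QoS among concurrent applications, mentioned above, can be sketched as a simple budget allocation. The Python fragment below is only an illustration of the concept; the application names, priorities, and rates are invented.

    # A toy sketch of user-directed QoS trading across a fixed access budget.
    # Applications, priorities, and rates are illustrative assumptions only.

    def allocate(budget_mbps, apps):
        """Give each app its minimum rate, then spend what is left in priority order."""
        allocation = {name: minimum for name, (minimum, maximum, _) in apps.items()}
        remaining = budget_mbps - sum(allocation.values())
        for name, (minimum, maximum, _) in sorted(apps.items(),
                                                  key=lambda kv: kv[1][2]):
            extra = min(maximum - minimum, max(remaining, 0))
            allocation[name] += extra
            remaining -= extra
        return allocation

    apps = {                        # name: (min Mbps, max Mbps, priority: lower = first)
        "videophone": (1.0, 4.0, 0),
        "background_download": (0.2, 6.0, 1),
    }
    print(allocate(6.0, apps))
    # Raising the videophone's maximum (better resolution) leaves less for the download.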

Performance Impacts of Internet Services Architecture

The Internet, which utilizes all of the communications hierarchy outlined above, is considered by many to be the heart of the NII. As the multimedia Internet evolves and assumes much of the quality control (and charge for service!) functionality of the telephone network, this is likely to become true by definition. The Internet is defined by use of IP, which carries packets from one kind of network to another without the application having to directly control any services in those networks.

Although they do not, in general, provide the access transmission facilities, Internet service providers do supply other access facilities that have a large influence on the performance of user interfaces. These include at least the following:

• Adequate modem pools and fast log-on for dial-up service;
• Direct low-level packet interconnection to the Internet, as well as higher-level services such as e-mail, UseNet servers, domain name servers, and proxy Web servers;
• Gateway services between Internet telephony and public network telephony (evolving in the near future to multimedia real-time communications); and
• Documentation and instruction for use of browser applications, e-mail, and various Internet services and resources.

It would require a lengthy report to describe how each of these affects user interface performance. Suffice it to say that a major objective in providing good service is the avoidance of server congestion, by means of the use of proxy Web servers to give users the impression of fast response from uncongested access to a nearby server, when in fact the originating server is far away and highly congested. Fast response time is, as emphasized earlier, an important measure of good performance of the user interface.

We might also reemphasize the importance of being "always connected" to Internet access for applications such as receiving timely information from "push" servers (such as the fast-developing customized current information services producing ever-changing displays in screen savers), immediate delivery of e-mail, fast receipt and initiation of real-time audio/video calls, and participating in the on-line work of a distributed group. An always-connected transmission access facility is required, of course, which must be matched by similar facilities27 for the Internet service provider.

As with providers of wireless access services, Internet service providers will soon be required to support mobility services, such as locating and characterizing nomadic users. There are significant research questions in coordinating Internet routing and service-class support policies with the movement of individuals, in transferring customer profiles for Internet services, and in other aspects of mobility support.
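
A proxy Web server of the kind described above is, at its core, a shared cache. The Python sketch below is a deliberately simplified, hypothetical design rather than a description of any particular product; it shows the basic behavior that matters for response time: the first request for a popular object is slow, and later requests from any user behind the proxy are served locally.

    from collections import OrderedDict

    class ProxyCache:
        """A minimal shared proxy cache with least-recently-used eviction."""

        def __init__(self, fetch_from_origin, capacity=1000):
            self.fetch_from_origin = fetch_from_origin   # slow path to the far server
            self.capacity = capacity
            self.store = OrderedDict()                   # url -> object, in LRU order

        def get(self, url):
            if url in self.store:
                self.store.move_to_end(url)              # fast path: served nearby
                return self.store[url]
            obj = self.fetch_from_origin(url)            # slow path: distant, maybe congested
            self.store[url] = obj
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)           # evict the least recently used
            return obj

    # One user's request warms the cache for everyone behind the same proxy:
    proxy = ProxyCache(fetch_from_origin=lambda url: "<html for " + url + ">")
    proxy.get("http://www.example.org/")                 # fetched from the origin
    proxy.get("http://www.example.org/")                 # served from the proxy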

Software Architecture: Distributed Object Environments and Transportable Software

Management of a mobility environment, particularly location transparency and location consciousness, is complex, and further research is needed. Distributed object environments-a software structure being used more and more in communications as well as applications systems-have a large potential to help resolve the complexities and improve performance.28 For example, the global availability of a distributed object environment would make abstract service objects available in a consistent format everywhere, with those objects translating user needs into instructions to local systems.

Transportable software is another important object-oriented technology that proceeds from a different assumption, that a common "virtual machine" (a special operating system on top of the real one) can be created on different platforms, so that software in "applets" (and applications) can be moved around from one machine to another.29 Java is a widely accepted language and virtual machine structure. Web browsers now commonly implement the Java virtual machine, allowing application applets to be downloaded from Web sites and executed in the user's computer. This facilitates animated displays and other features in the user interface, with much better performance than if the software executed in the Web server and large quantities of display information had to be transmitted to the user's browser. It also facilitates customization of the Web browser user interface for users with special needs and constraints.

Transportable software also has great potential for "programmable networks" in which communications protocols and services are not fixed but can be changed on user request by sending the appropriate applets to network elements, such as switches, where they execute. This, too, can improve performance where alternative protocols are better matched to applications needs, making the user interface more responsive and pleasant to use.

Notes

1. See Gunter (1992), Semantics of Programming Languages, for more extensive discussion.

2. For example, the two sentences below differ only in a single word, but the resulting structure of the preferred interpretation is significantly different (Frazier and Fodor, 1978; Shieber, 1983, gives a computational model that elegantly handles this particular psycholinguistic feature). In the first sentence, "on the rack" modifies "positioned," whereas in the second, it modifies "dress":
Susan positioned the dress on the rack.
Susan wanted the dress on the rack.

3. Texas Instruments had an early natural language system that did this.

4. This example was discussed by John Thomas, of NYNEX, at the workshop.

5. Concatenative synthesizers achieve synthesis by concatenating small segments (e.g., diphones) of stored digitized speech. Formant synthesizers use a rule-based approach to achieve synthesis, by specifying acoustic parameters that characterize a digital filter and how these parameters change as different phonemes are sequenced.

6. Personal communication, John C. Thomas, NYNEX, December 12, 1996.

7. A system introduced by IBM in 1996 for voice recognition software was designed to enable radiologists to dictate reports into a personal computer. Recognizing 2,000 words and requiring some training, its support for conversational discourse, in a context where certain technical phrases may be used frequently, was contrasted in the press to the need to pause after individual words in older commercial software (Zuckerman, 1996).

8. Candace Sidner, of Lotus Development, and Raymond Perrault, of SRI, contributed much of the content of this subsection.

9. Indexing and retrieval constitute a growing application area, especially with the increased desire to organize and access large amounts of data, much of which is available as text.

10. This section concentrates on the state of the art of complete end-to-end natural language processing systems and does not describe research in individual areas. The steering committee notes that there has been significant progress, ranging from new grammatical formalisms to approaches to lexical semantics to dialogue models.

11. There is much promising research on syntactic models, such as the TAG (tree-adjoining grammars) work (see Joshi et al., 1981, 1983, 1995; Shieber, 1983), which are computationally tractable syntactic formalisms with greater power than context-free grammars, and on lexical semantics.

12. Although space prevents including detailed references here, the interested reader is directed in particular to the recent years' conference proceedings of the Association for Computational Linguistics, the European Association for Computational Linguistics, the international meeting in Computational Linguistics (COLING), the DARPA Spoken Language and MUC workshops, and the journals Artificial Intelligence, Computational Linguistics, and Machine Translation.

13. For applications involving database query, or for more sophisticated command and control, the mapping between the sequence of words and their meaning can be very complicated indeed. DARPA has funded applications-oriented research in language understanding (Roe and Wilpon, 1994; Cole et al., 1996) in the context of database query, where the user requests the answer to a query by typing or uttering the query. In most language understanding systems to date, a set of syntactic and/or semantic rules is applied to the query to obtain its meaning, which is then used to retrieve the answer. If the query refers to information obtained in previous queries, another set of rules that deal with discourse is used to disambiguate the meaning. Pragmatic information about the specific application is often encoded in the rules as well. Even for a simple application like retrieval of air travel information, hundreds of linguistic rules are hand coded by computational linguists. Many of these rules must be rewritten for each new application.

14. The Linguistic Data Consortium at the University of Pennsylvania, which is sponsored by government and industry, now makes much of this data available, from different sources, for different tasks, and in different languages.

15. Note that portable devices raise the larger issue of data durability: portable devices may be easier to lose or break, which raises questions about ease of backup for the data they contain.

16. Much of the cross-industry disagreement revolved around interlacing, a technique that has long been used in television to increase resolution and that takes advantage
Page 119 of the extremely high line-to-line and temporal coherence of images produced by television cameras. Computer output, especially text and graphics, tends to be hard edged and to flicker badly when displayed on interlace monitors. Although one can convert interlaced broadcast TV to noninterlaced at the receiver end easily enough, there is a cost issue that affects the likelihood of flooding the market with the cheapest sets possible, hence affecting penetration and return on investment. The computer industry (hardware, software, and netware), of course, wants the low-end TVs of the future to handle digital output in a reasonable format; the television industry wants a 16 × 9 interlaced format (which is really a 32 × 9 format non-interlaced). 17. The Web is, of course, a great source for visual input. Copyright concepts of fair use and royalties will necessarily adapt, as they will for text, and audio quotations, samples, and outright theft. 18. Blake Hannaford, of the University of Washington, contributed much of the content of this subsection. 19. In fact, the graphics produced are not Braille but simply dot graphics printed on a Braille printer with the same resolution or dot spacing as Braille. This is a common technique, but it produces relatively low resolution graphics. 20. The WIMP interface will not serve this future, though elements will be involved (keypads, pointing, etc.). In its current form it is arguably dangerous to people susceptible to repetitive stress disorders, unusable by a large segment of the population with disabilities, and far too simple for navigation in complex spaces. 21. As noted in Herndon et al. (1994), a slider or dial for volume control has 1 degree of freedom; the mouse for picking, drawing, or two-dimensional location has 2 DOF, a 6D mouse or head-tracker for docking or view control has 6 DOF, a glove or face device for hand/face animation can have 16 or more DOF, and a body suit for whole-body animation can have over 100 DOF. 22. Many users of today's Internet telephony services experience a long delay, sometimes of the order of a second, in transmission, actually due more to buffering in the user's computer to smooth out arriving packets. 23. "Basic rate" ISDN, providing an aggregate 144-kbps symmetric service to a subscriber, suffered from a too-long development, unattractive rate structure, and general ambivalence on the part of telephone operators, but is now widely available and popular for Internet and "work at home" access needs. The usual access rate is 128-kbps symmetric from tying together 2 64-kbps channels provided within the 144-kbps aggregate service. From the user's point of view, ISDN still suffers from the need to set up a connection, although setup is usually quite fast, and from per-minute charges even for local calls. ADSL, now focused on the generally asymmetric traffic requirements of computer communications sessions, offers 1.5 to 6 Mbps downstream (network to subscriber) and up to 384 kbps upstream (from the subscriber). A subscriber's ADSL service has the potential to be always connected, permanently linked, for example, through a router in the telephone office into a high-speed data network. It is not yet clear that telephone companies have the "always-connected" paradigm in mind. Telephone companies have wavered in their commitment to ADSL, so it is a very tentative forecast that ADSL service at acceptable cost will be available to millions of telephone subscribers in 5 years. 
Although ADSL could vastly improve the performance of multimedia user interfaces, it should be recognized-and this will hold for the other broadband access mechanisms as well-that contention for capacity on networks upstream, and congestion at servers, may also seriously constrain performance. HDSL, which provides symmetric capacity of 1.5 Mbps and up and usually is designed to work over two twisted pair lines, is not generally associated with residential users but could quickly overtake ADSL if households begin to generate high-capacity traffic.

OCR for page 71
VDSL, at rates of 25 Mbps or higher, requires a distribution point closer to the subscriber than a present-day telephone office. Its potential penetration is difficult to predict and depends a great deal on the success and competitive implications of cable-based data services.

24. Cable interactive access services are just beginning to be commercially available. It is a fairly safe prediction that by mid-1999 millions of cable subscribers will be offered this service.

25. It is a challenge to the cable industry to make subscription and service provisioning simple and fast, and some standards interoperability questions discussed later, such as "plug and play" of digital set-top boxes, remain to be resolved.

26. The conventional "best-effort" IP service does not require any special capabilities from the core network, but the new QoS-conscious IP services and, of course, ATM do. The core network must deploy technologies such as edge switches and access multiplexers that aggregate traffic arriving under various communications protocols, and must closely control QoS parameters for multiswitch routings.

27. For the modem-based ISPs this implies higher rates, but the cable model may allow "always-on" capability without major increases in hardware investment.

28. CORBA (Common Object Request Broker Architecture), standardized by the Object Management Group, is a leading candidate for a universally accepted architecture, although there are other distributed object systems proposed by major software vendors, such as Microsoft's ActiveX.

29. Transportable software and object broker systems such as CORBA are complementary more than competitive. CORBA provides important object location and management services and facilitates use of existing applications software by wrapping applications (written in whatever computer language) in CORBA objects with standard IDL interfaces. The Java virtual machine requires new applications, all in the Java language, and applets may not execute as efficiently as software written for the underlying operating system, but it facilitates the movement of executable software, with appropriate security constraints, with the benefits outlined above. There are many examples now of CORBA-based systems in which CORBA objects are invoked by transportable Java applets.