Copyright © 2009. National Academy of Sciences. All rights reserved.
3
Input/Output Technologies:
Current Status And Research Needs
Meeting the every-citizen interface (ECI) criteria described in
Chapter 2 will require advances in a number of technology areas.
Some involve advances in basic underlying display and interface
technologies (higher-resolution visual displays, three-dimensional
displays, better voice recognition, better tactile displays, and so
on). Others involve advances in our understanding of how to best
match these input/output technologies to the sensory, motor, and
cognitive capabilities of different users in different and changing
environments carrying out a wide variety of tasks. But the new
interfaces will need to do more than just physically couple the
user to the devices. To meet these visions, the interfaces must
have the ability to assist, facilitate, and collaborate with the
user in accomplishing tasks.
Subsequent chapters address interface design (the creation of
interfaces that make the best possible use of these human-machine
communications technologies) and system attributes that lie beneath
the veneer of the interface, such as system intelligence and
software support for collaborative activities. This chapter
examines the current state and prospective advances in technology
areas related directly to communication between a person and a
system: hardware and software for input (to the system) and output
(to a human). The emphasis is on technical advances that, if
implemented in well-designed systems (as stressed in Chapter 4),
hold the potential to expand accessibility and usability to many
more people than at present. The discussion includes a cluster of
speech input/output technologies; natural language understanding
(including restricted languages with limited vocabularies);
keyboard input; gesture recognition
and machine vision; auditory and touch-based output; interfaces
that combine multiple modes of input and output; and visual
displays, including immersive or virtual reality systems. Because
the ECI challenge involves connecting to the information
infrastructure, rather than just to stand-alone systems, this
chapter reviews the current status of and research challenges for
interfaces for systems in large-scale national networks. The
chapter ends with the steering committee's conclusions, based on
workshop discussions and other inputs, about the research
priorities to advance these technologies and our understanding of
how to use them to support every citizen.
Framing The Input/Output
Discussion: Layers Of Communication
The interface is the means by which a user communicates with a
system, whether to get it to perform some function or computation
directly (e.g., compute a trajectory, change a word in a text file,
display a video); to find and deliver information (e.g., getting a
paper from the Web or information from a database); or to provide
ways of interacting with other people (e.g., participate in a chat
group, send e-mail, jointly edit a document). As a communications
vehicle, interfaces can be assessed and compared in terms of three
key dimensions: (1) the language(s) they use, (2) the ways in which
they allow users to say things in the language(s), and (3) the
surface(s) or device(s) used to produce output (or register input)
expressions of the language. The design and implementation of an
interface entail choosing (or designing) the language for
communication, specifying the ways in which users may express
"statements" of that language (e.g., by typing words or by
pointing at icons), and selecting device(s) that allow
communication to be realized: the input/output devices.
Box 3.1 gives some examples of choices at each of these levels.
Although the selection and integration of input/output devices will
generally involve hardware concerns (e.g., choices among keyboard,
mouse, drawing surfaces, sensor-equipped apparel), decisions about
the language definition and means of expression affect
interpretation processes that are largely treated in software. The
rest of this section briefly describes each of the dimensions and
then examines how they can be used to characterize some currently
standard interface choices; the remainder of the chapter provides
an examination of the state of the art.
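One way to make the three-layer framing concrete is to treat an
interface as a record holding one choice per layer. The sketch below
is illustrative only: the class, field, and function names are
hypothetical, and the example values are taken from Box 3.1.

```python
from dataclasses import dataclass
from enum import Enum


class Language(Enum):
    """Language-layer choices named in Box 3.1."""
    NATURAL = "natural language"
    RESTRICTED = "restricted verbal language"
    DIRECT_MANIPULATION = "direct manipulation language"


@dataclass(frozen=True)
class InterfaceChoice:
    """Hypothetical record: one choice at each of the three layers."""
    language: Language
    expression: str  # e.g., "speaking", "pick-from-set menu"
    device: str      # e.g., "telephone", "video display screen and keypad"


def describe(choice: InterfaceChoice) -> str:
    """Summarize an interface in terms of the three layers."""
    return (f"{choice.language.value}, expressed by {choice.expression}, "
            f"via {choice.device}")


# An automated teller machine, characterized along the three dimensions.
teller_machine = InterfaceChoice(
    Language.RESTRICTED, "pick-from-set menu", "video display screen and keypad"
)
print(describe(teller_machine))
```

Keeping the layers as separate fields reflects the point that a choice
at one layer constrains, but does not dictate, the choices at the
others.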
Language Contrasts and Continuum
There are two language classes of interest in the design of
interfaces: natural languages (e.g., English, Spanish, Japanese)
and artificial languages
BOX 3.1 Layers of Communications
1. Language Layer
• Natural language: complex syntax, complex semantics (whatever a
human can say)
• Restricted verbal language (e.g., operating systems command
language, air traffic control language): limited syntax,
constrained semantics
• Direct manipulation languages: objects are "noun-like," get "verb
equivalents" from manipulations (e.g., drag file X to Trash means
"erase X"; drag message onto Outgoing Mailbox means "send message";
draw circle around object Y and click means "I'm referring to Y, so
I can say something about it.")
2. Expression Layer
Most of these types of realization can be used to express
statements in most of the above types of languages. For instance,
one can speak or write natural language; one can say or write a
restricted language, such as a command-line interface; and one can
say or write/draw a direct manipulation language.
• Speaking: continuous speech recognition, isolated-word speech
recognition
• Writing: typing on a keyboard, handwriting
• Drawing
• Gesturing (American Sign Language provides an example of gesture
as the realization (expression layer choice) for a full-scale
natural language.)
• Pick-from-set: various forms of menus
• Pointing, clicking, dragging
• Various three-dimensional manipulations: stretching, rotating,
etc.
• Manipulations within a virtual reality environment: same range of
speech, gesture, point, click, drag, etc., as above, but with three
dimensions and broader field of view
• Manipulation unique to virtual reality environment: locomotion
(flying through/over things as a means of manipulating them or at
least looking at them)
3. Devices
Hardware mechanisms (and associated device-specific software) that
provide a way to express a statement. Again, more than one
technology at this layer can be used to implement items at the
layer above.
• Keyboards (many different kinds of typing)
• Microphones
• Light pen/drawing pads, touch-sensitive screens, whiteboards
• Video display screen and mouse
• Video display screen and keypad (e.g., automated teller machine)
• Touch-sensitive screen (touch with pen; touch with finger)
• Telephone (audible menu with keypad and/or speech input)
• Push-button interface, with different button for each choice
(like big buttons on an appliance)
• Joystick
• Virtual reality input gear: glove, helmet, suit, etc.; also body
position detectors
(e.g., programming languages, such as C++, Java, Prolog;
database query languages, such as SQL; mathematical languages, such
as logic; command languages, such as those the C shell provides). Natural
languages are derived evolutionarily; they typically have
unrestricted and complex syntax and semantics (assignment of
meaning to symbols and to the structures built from those symbols).
Artificial languages are created by computer scientists or
mathematicians to meet certain design and functional criteria; the
syntax is typically tightly constrained and designed to minimize
semantic complexity and ambiguity.
Because an artificial language has a language definition,
construction of an interpreter for the language is a more
straightforward task than construction of a system for interpreting
sentences in a natural language. The grammar of a programming
language is given; defining a grammar for English (or any other
natural language) remains a challenging task (though there are now
several extensive grammars used in computational systems).
Furthermore, the interactions between syntax and semantics can be
tightly controlled in an artificial language (because people design
them) but can be quite complex in a natural language.1,2
Natural languages are thus more difficult to process. However,
they allow for a wider range of expression and as a result are more
powerful (and more "natural"). It is likely that the expressivity
of natural languages and the ways it allows for incompleteness and
indirectness may matter more to their being easy to use than the
fact that people already "know them." For example, the phrase, "the
letter to Aunt Jenny I wrote last March," may be a more natural way
to identify a letter in one's files than trying to recall the file
name, identify a particular icon, or grep (a UNIX search command)
for a certain string that must be in the letter. The complex
requests that may arise in seeking information from on-line
databases provide another example of the advantages of complex
languages near the natural language end of this dimension.
Constraint specifications that are natural to users (e.g., "display
the protein structures having more than 40 percent alpha helix")
are both diverse and rich in structure, whereas menu- or form-based
paradigms cannot readily cover the space of possible queries.
Although natural language processing remains a challenging
long-range problem in artificial intelligence (as discussed under
"Natural Language Processing" below in this chapter), progress
continues to be made, and better understanding of the ways in which
it makes communication easier may be used to inform the design of
more restricted languages.
However, the fact that restricted languages have limitations is
not, per se, a shortcoming for their use in ECIs. Limiting the
range of language in using a system can (if done right) promote
correct interpretation by the system by limiting ambiguity and
allowing more effective communication.
For instance, the use of domain- and task-specific restricted
languages for certain applications of speech recognition systems
has produced results, allowing people to use speech to communicate
when they cannot see (either because they are limited by the
communication device being used, such as the telephone, or because
of physical impairment). Radiologists' workstations, for example,
allow the use of speech as the primary means of inputting reports
on X-rays or other radiographic tests. Direct manipulation
languages may be ideal if there is a close match to what the user
wants to do (and hence is able to "say"), that is, if the user's
needs are anticipated and the user will not need to program or
alter what the system does; they can be a robust means of control
that limits the risk of system crashes from misdirected user
actions.
In short, the design of an interface along the language
dimension entails choices of syntax (which may be simple or
complex) and semantics (which can be simple or complex either in
itself or in how it relates to syntax). More complex languages
typically allow the user to say more but make it harder for the
system to figure out what a person means.
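The tradeoff can be illustrated with a toy restricted language. The
sketch below assumes a hypothetical two-word command language (one
verb followed by one object); the vocabulary and function name are
invented for illustration. Because the syntax is fixed, interpretation
reduces to a table lookup, and anything outside the language is
rejected rather than guessed at.

```python
# Hypothetical vocabulary for a two-word restricted command language.
VERBS = {"open", "erase", "send"}
OBJECTS = {"letter", "report", "message"}


def interpret(utterance):
    """Return (verb, object) if the utterance is a valid command, else None.

    The language admits exactly one syntactic form, VERB OBJECT, so there
    is no ambiguity to resolve: an utterance either fits or it does not.
    """
    words = utterance.lower().split()
    if len(words) == 2 and words[0] in VERBS and words[1] in OBJECTS:
        return (words[0], words[1])
    return None


print(interpret("Send message"))
print(interpret("the letter to Aunt Jenny I wrote last March"))
```

The second call returns None: the natural language phrase says more
than the restricted language can express, which is precisely the cost
paid for easy interpretation.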
Expression Contrasts
A natural language sentence can be spoken, written, typed,
gestured, or selected from a menu. An artificial language statement
also can be spoken, written, typed, gestured, or selected from a
menu.
Language expression can take many forms, generally
differentiated as being more or less continuous or involving
selection from a set of options (e.g., a menu). Speaking can
involve isolated words or continuous speech recognition. Writing
can involve handwriting or typing; drawing can be free form or can
use prespecified options. Gesturing, independently or to manipulate
objects, can be free form, can involve a full-scale natural language
(e.g., American Sign Language), or can involve a more restricted
set of prespecified options (e.g., pointing, dragging, stretching,
rotating). Virtual reality and other visualization techniques
represent a multimedia form of expression that may involve speech,
gesture, direct manipulation, and haptic and other elements.
Thus, the different ways of saying things in a language may also
be divided into two structural categories (free form and
structured) and several different realization categories: typing,
speaking, pointing. Free-form expression is usually more difficult
to process than structured expression. For example, a sentence in
natural language can be spoken "free form" (this is what we usually
think of with natural language), or it might be specified by
picking one word at a time out of a structured menu.3 In the structured form the system can
control what the user gets to choose to "say" next, and so it is
much easier for a system to interpret
and handle. Within a given form, some means of realization may
be easier to handle than others (e.g., correctly typed words are
easier to interpret than handwritten words; freehand drawings are
more difficult than structured CAD/CAM (computer-aided
design/computer-aided manufacturing) diagrams). It is also
important to note that more structured systems may be preferable
for certain applications, such as those involving completion of
forms (Oviatt et al., 1994).
Menu/icon systems thus provide an alternative way of expressing
command-like languages. They have underlying languages, typically
very much like command languages. The commands (natural language
verb equivalents) are often menu items (e.g., "select," "edit");
the parameters (natural language noun equivalents) are icons (or
open files); and the statements (natural language sentence
equivalents) are sequences of select "nouns" and "verbs." The menus
and icons provide the structure within which a user can say
something in the language.
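The correspondence between menu/icon actions and command-like
statements can be sketched in a few lines. The event format and
function name below are hypothetical: icon selections accumulate as
"nouns," and a menu choice supplies the "verb" that closes the
statement.

```python
def interpret_selections(events):
    """Turn a stream of GUI events into (verb, nouns) statements.

    Each event is a (kind, value) pair, where kind is "icon" (a noun
    selection) or "menu" (a verb that applies to the nouns selected so
    far). This event format is an assumption made for illustration.
    """
    statements = []
    selected = []
    for kind, value in events:
        if kind == "icon":
            selected.append(value)
        elif kind == "menu":
            statements.append((value, tuple(selected)))
            selected = []  # the statement is complete; start a new one
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return statements


events = [
    ("icon", "report.txt"), ("menu", "edit"),
    ("icon", "draft1"), ("icon", "draft2"), ("menu", "delete"),
]
print(interpret_selections(events))
```

The structure imposed by the menus and icons is what makes the
interpreter this short: the user can only produce events the system
already knows how to read.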
Devices
The hardware realization of communication can take many forms;
common ones include microphones and speakers, keyboards and mice,
drawing pads, touch-sensitive screens, light pens, and push
buttons. The choice of device interacts with the choice of medium:
display, film/videotape, speaker/audiotape, and so on. There may
also be interactions between expression and device (an obvious
example is the connection between pointing device (mouse,
trackball, joystick) and pull-down menus or icons). On the other
hand, it is also possible to relax some of these associations to
allow for alternative surfaces (e.g., keyboard alternatives to
pointers, aural alternatives to visual outputs). Producing
interfaces for every citizen will entail providing for alternative
input/output devices for a given language-expression combination;
it might also call for alternative approaches to expression.
Comparisons Among Graphical User
Interfaces, Natural Language, and Speech
The language-expression-device framework can be used to gain
perspective on current standard interface types and on the research
opportunities and challenges presented by ECIs. For example, it
makes clear that natural language processing and speech recognition
(and other technologies that may be associated colloquially)
introduce different issues and different tradeoffs. A speech-based
interface such as AT&T's long-distance voice recognition
system, which can recognize phrases such as "collect call" and
"calling card,"4 can combine a
restricted language with
speech as a means of expression. As this example illustrates,
neither speech recognition with unlimited vocabulary nor
complete/comprehensive language understanding is necessary to
provide natural language-like input to a system within a restricted
domain and task. Similarly, it is possible to improve restricted
language interfaces by applying principles from natural language
communication.
Current graphical user interface/menu/icon systems tightly
constrain what one can say, both by starting with a very
constrained language and by having a structured way in which one
can express things in that language. They are at the opposite end
of both the language and the expression spectrum from natural
languages. It is thus clear why they are easier to process, but
also why they are more constraining (Cohen and Oviatt, 1994).
Ongoing efforts to develop speech interfaces for Web browsers
provide a concrete example of the importance of understanding the
different tradeoffs of each of these dimensions. Choosing speech on
the expression layer rather than pointing and clicking would lead
to being able to "speak" the icons and hyperlinks that are designed
for keyboard and mouse. Although this may suffice in certain
settings (substituting one modality for another can be useful in
hands-free contexts and for those with physical limitations), it does
not necessarily expand a system's capabilities or lead to new
paradigms for interactions. An alternative approach would be to
explore how spoken language technology can expand the user's
ability to obtain the desired information easily and quickly from
the Web, leading to a different, probably more expressive,
language. From this perspective, speech would augment rather than
replace the mouse and keyboard, and a user would be able to choose
among many interface language-expression options to achieve a task
in the most natural and efficient manner.
Natural language interaction is particularly appropriate when
the information space is broad and diverse or when the user's
request contains complex constraints. Both of these situations
occur frequently on the Web. For example, finding a specific home
page or document now requires remembering a uniform resource
locator (URL), searching through the Web for a pointer to the desired
document, or using one of the keyword search engines available.
Current interfaces present the user with a fixed set of choices at
any point, of which one is to be selected. Only by stepping through
the offered choices and conforming to the prescribed organization
of the Web can users reach the documents they desire. The multitude
of indexes and meta-indexes on the Web is testimony to the reality
and magnitude of the problem. The power of a human/natural language
in this situation is that it allows the user to specify what
information or document is desired (e.g., "Show me the White House
home page," "Will it rain tomorrow in Seattle?" or "What is the ZIP
code for
Orlando, Florida?") without having to know where and how the
information is stored. A natural language, regardless of whether it
is expressed using speech, typing, or handwriting, offers a user
significantly more power in expressing constraints, thereby freeing
the user from having to adhere to a rigid, preconceived indexing
and command hierarchy.
In examining the state of the art of various input/output
technologies, it is important to recognize that no single choice is
right for all interfaces. In fact, one of the major challenges of
interface design may be designing a language that is powerful
enough for a user to say what needs to be said, yet as constrained
as possible, so that processing is easier and misinterpretation is
less likely. In looking at input/output options, it will be useful
to keep in mind where various options fall on one or another of
these scales and the tradeoffs implicit in choosing a given
option.
Technologies For Communicating With
Systems
Humans modulate energy in many ways. Recognizing that fact
allows for exploration of a rich set of alternatives and
complements (at any time, a user-chosen subset of controls and
displays) that a focus on simplicity of interface design as the
primary goal can obscure. Current direct manipulation interfaces
with two-dimensional display and mouse input make use, minimally,
of one arm with two fingers and a thumb and one eye, about what is
used to control a television remote. It was considered a stroke of
genius, of course, to reduce all computer interactions to this
simple set as a transition mechanism to enable people to learn to
use computers without much training. There are no longer any
reasons (including cost) to remain stuck in this transition mode.
We need to develop a fuller coupling of human and computer, with
attention to flexibility of input and output.
In some interactive situations, for example, all a computer or
information appliance needs for input is a modulated signal that it
can use to direct rendered data to the user's eyes, ears, and skin.
Over 200 different transducers have been used to date with people
having disabilities. In work with severely disabled children, David
Warner, of Syracuse University, has developed a suite of sensors to
let kids control computer displays with muscle twitches, eye
movement, facial expressions, voice, or whatever signal they can
manage to modulate. The results evoke profound emotion in patients,
doctors, and observers and demonstrate the value of research on
human capabilities to modulate energy in real time, the sensors
that can transduce those energies, and effective ways to render the
information affected by such interactions.
The state of the art in a range of technologies for
communicating with systems is reviewed below. Also addressed are
the device and expression layers of the model described in the
previous section and summarized in Box 3.1. The choice of
language (natural, restricted, or direct manipulation) influences but
does not dictate the technologies discussed here. The exception is
the subsection, "Natural Language Processing," which also
encompasses the language layer of the model and discusses how
choices along a spectrum from fully natural languages to relatively
restricted languages influence the performance of various
expression modes, particularly speech input.
Speech Synthesis
Text-to-speech systems, or speech synthesizers, take
unrestricted text as input and produce a synthetic spoken version
of that text as output. Most current commercial synthesizers
exhibit a high degree of intelligibility, but none sound truly
natural. The major barriers to naturalness are deficiencies of text
normalization, intonational assignment, and synthesized voice
quality. Female speech and children's speech are generally less
acceptable than adult male synthetic speech, probably because they
have been studied less (Roe and Wilpon, 1994).
In the course of transforming text into speech, all
text-to-speech systems must do the following:
•
Identify words and determine their
pronunciations;
•
Decide how such items as abbreviations and numbers
should be pronounced (text normalization);
•
Determine which words should be made prominent in
the output, where pauses should be inserted, and what the overall
shape of the intonational contour should be (intonation
assignment);
•
Compute appropriate durations and amplitudes for
each of the words that will be synthesized;
•
Determine how the overall intonational contour
will be realized for the text to be synthesized;
•
Identify which acoustic elements will be used to
produce the spoken text (for concatenative synthesizers) or to
retrieve the sequences of appropriate parameters to generate
synthetic elements (for formant synthesizers);5 and
•
Synthesize the utterance from the specifications
and/or acoustic elements identified.
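The text normalization step in the list above can be illustrated with
a deliberately small sketch. The abbreviation table, the number range
(0 to 99), and the function names are assumptions chosen for brevity;
a real synthesizer needs context to decide, for example, whether "St."
means "Street" or "Saint," which is one reason normalization remains a
barrier to naturalness.

```python
# Hypothetical abbreviation table; real systems need context-dependent rules.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])


def normalize(text):
    """Expand known abbreviations and small numbers, token by token."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdecimal() and int(token) < 100:
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # leave everything else untouched
    return " ".join(out)


print(normalize("Dr. Smith lives at 42 Elm St."))
```

Later stages (intonation assignment, duration, and synthesis) would
operate on the normalized word string rather than on the raw text.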
While most systems permit some form of user control over various
parameters at many of these stages, to fine-tune system defaults,
documentation
and tools for such control are usually lacking, and most users
lack the requisite background to produce satisfying results.
Particularly for concatenative synthesizers, it is difficult and
time consuming to produce new voices, since each voice requires
that a new set of concatenative units be recorded and segmented.
While most research groups are developing tools in an attempt to
automate this process (often by using automatic speech recognition
systems to produce a first-pass segmentation), none have succeeded
in eliminating the need for laborious hand correction of the
database. There have also been efforts in recent years to automate
the production of other components of synthesis, to facilitate the
production of synthesizers in many languages from a single
architecture.
We know that synthetic speech should sound better. It is not
clear, exactly, how to decide what is better: More natural and more
human-like? More intelligible? More intelligible at normal talking
speeds or at high speeds? Speech is usually used for conversational
modes of interaction. When speech is being used for presenting a
Web page, for example, there is additional information that needs
to be provided: Which words form links? Which words are italicized?
How is this information presented most effectively? How should
words be dealt with that have multiple different pronunciations in
different parts of the country or to different individuals?
Speech Input/Recognition
The full integration of voice as an input medium, if achievable,
could alleviate many of the known limitations of existing
human-machine interfaces. People with poor or no literacy skills,
people whose hands are busy, and people suffering from cumulative
trauma disorders associated with typing and pointing (or seeking to
avoid them) could all benefit from spoken communication with
systems. While the capabilities envisioned in such a system are
well beyond the state of the art in both speech recognition and
language understanding at present, the technology has advanced
sufficiently to allow very simple voice-based applications to
emerge (see below).
Speech recognition research has made significant progress in the
past 15 years (Roe and Wilpon, 1994; Cole and Hirschman, 1995; Cole
et al., 1996). The gains have come from the convergence of several
technologies: higher-accuracy continuous speech recognition based
on better speech modeling techniques, better recognition search
strategies that reduce the time needed for high-accuracy
recognition, and increased power of audio-capable, off-the-shelf
workstations. As a result of these advances, real-time,
speaker-independent, continuous speech recognition, with
vocabularies
of a few thousand words, is now possible in software on regular
workstations.
In terms of recognition performance, word error rates have
dropped by more than an order of magnitude in the past decade and
are expected to continue to fall with further research. These
improvements have come about as a result of technical as well as
programmatic innovations. Technically, there have been advances in
two areas. First, a paradigm shift from rule-based to model-based
methods has taken place. In particular, probabilistic hidden Markov
models (HMMs) have proven to be an excellent method of modeling
phonemes in various contexts. This model-based paradigm, with its
ability to estimate model parameters automatically from training
data, has shown its power and versatility in the application of the
technology to various languages, using the same software. Second,
the use of statistical grammars, which estimate the probability of
two- and three-word sequences, has been instrumental in improving
recognition accuracy, especially for large-vocabulary tasks. These
simple statistical grammars have, so far, proven to be superior to
traditional rule-based grammars for speech recognition
purposes.
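A statistical grammar of the two-word kind can be sketched as a table
of counts. The example below uses plain maximum-likelihood estimates
with no smoothing, which is an assumption made for brevity (practical
recognizers smooth these estimates so that unseen word pairs do not
receive zero probability). The tiny training corpus and the function
names are invented for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count word pairs and the contexts in which they occur.

    Each sentence is prefixed with a start marker "<s>" so that the
    probability of the first word can also be estimated.
    """
    contexts = Counter()
    pairs = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        contexts.update(words[:-1])          # every word that has a successor
        pairs.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, pairs


def bigram_prob(contexts, pairs, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if contexts[prev] == 0:
        return 0.0
    return pairs[(prev, word)] / contexts[prev]


corpus = ["collect call please", "calling card please", "collect call"]
contexts, pairs = train_bigrams(corpus)
print(bigram_prob(contexts, pairs, "<s>", "collect"))
print(bigram_prob(contexts, pairs, "collect", "call"))
print(bigram_prob(contexts, pairs, "collect", "card"))
```

A recognizer can use such probabilities to prefer word sequences that
resemble its training data when the acoustic evidence alone is
ambiguous.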
Programmatically, the collection and dissemination of standard,
common training and test corpora worldwide, the sponsorship of
common evaluations, and the dissemination at workshops of
information about competing methods have all ensured very rapid
progress in the technology. This programmatic approach was
pioneered by the Defense Advanced Research Projects Agency (DARPA),
which continues to sponsor common evaluations and initiated the
establishment of the Linguistic Data Consortium, which has been in
charge of the collection and dissemination of common corpora. A
similar approach is now being taken in Europe.
Word error rates for speaker-independent continuous speech
recognition vary a great deal, depending on the difficulty of the
task: from less than 0.3 percent for connected digits, to 3 percent
for a 2,500-word travel information task, to 10 percent for
articles read from the Wall Street Journal, to 27 percent
for transcription of broadcast news programs, to 40 percent for
conversational speech over the telephone. Although word error rates
in the laboratory can be quite small for some tasks, error rates
can increase by a factor of four or more when the same systems are
used in the field. This increase has various causes: heavy accents,
ambient noise, different microphones, hesitations and restarts, and
straying from the system's vocabulary.
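Word error rate is conventionally computed as the number of word
substitutions, deletions, and insertions needed to turn the reference
transcript into the recognizer's output, divided by the number of
words in the reference. The sketch below implements that conventional
definition with a standard edit-distance table; the chapter does not
spell out its scoring procedure, so treating its figures as following
this definition is an assumption. The example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                # deletion
                dist[i][j - 1] + 1,                # insertion
                dist[i - 1][j - 1] + substitution  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# One deleted word ("the") and one inserted word ("now"): 2 errors in 4 words.
print(word_error_rate("call the operator please", "call operator please now"))
```

Because insertions are counted, the rate can exceed 100 percent when
the recognizer outputs many more words than were spoken.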
Speech recognition has begun to enter the mainstream of everyday
life, chiefly through telephone-based applications (Margulies,
1995). The most visible of these applications involve directory
assistance services, such as the recognition of a few words (e.g.,
the digits and words such as "operator," "yes/no," "collect") or
recognition of the names of cities in a
Page 110
Local-Area Communications
Physical communications networking can be categorized as an
interworking of three networking levels: local, access, and core
(or "wide area"). Almost any network-based activity of a
residential user is likely to use all three.
Local area networks (LANs) are on the end-user's premises, such
as a house, apartment or office building, or university campus.
Ethernet, the most widely deployed LAN technology, is already
appearing in homes for computer access to cable-based data access
systems such as TimeWarner's RoadRunner, Com21's access system, and
@Home's access system. It could be in millions of American homes by
the year 2000. In general, the 10-megabit-per-second (Mbps)
Ethernet is the favored communications interface for connecting
personal computers and computing devices to set-top boxes and other
network interface devices being developed for high-speed subscriber
access networks. A properly engineered shared-bandwidth
architecture such as Ethernet allows multiple devices to have the
high "burst rate" capability needed for good performance, such as
fast transfer of an image, with only rare degradation from
congestion. It is "always on," allowing devices always to be
connected and ready to satisfy user needs immediately, as opposed
to a tedious connection setup.
A residence will be able to simultaneously operate not only
several human-oriented user interfaces in personal computers,
heating/cooling and appliance controls, light switches,
communicating pocket calendars and watches, and so on, but also
user interfaces used by such devices as furnaces, garage doors, and
washing machines. The introduction of IPv6 in the next decade will
create an extremely large pool of Internet addresses, allowing each
human being in the world to own hundreds or thousands of them. This
development will foster the interconnection of a wide range of
devices with embedded systems, a phenomenon that underscores the
concern not to cast the NII or ECI challenges in overly personal
computer-centric terms.
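The size of that address pool can be checked with simple arithmetic. The sketch below assumes the 128-bit address length of IPv6 and a round world population of 6 billion; the population figure and the function name are illustrative assumptions, not values taken from this chapter.

```python
# Rough arithmetic on the IPv6 address pool.
# Assumptions (not from the chapter): 128-bit addresses, about 6 billion people.
IPV6_ADDRESS_BITS = 128
WORLD_POPULATION = 6_000_000_000  # illustrative round figure


def addresses_per_person(bits: int = IPV6_ADDRESS_BITS,
                         population: int = WORLD_POPULATION) -> int:
    """Return the number of addresses per person if the pool were split evenly."""
    return (2 ** bits) // population
```

Under these assumptions an even split yields on the order of 10^28 addresses per person, so "hundreds or thousands" is a very conservative statement. Real allocation is hierarchical rather than even, so usable numbers would be far smaller but still ample.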
Local networking is not necessarily restricted to one shared
wired facility such as Ethernet, which is beginning in the home at
10 Mbps but will likely evolve to "Fast Ethernet" commercial
versions or to ATM (asynchronous transfer mode) connection-oriented
communications, at 100 Mbps and higher. It can include wireless
local networking, generalizations of the cordless phone to cordless
personal computers and other devices, with burst rates of at least
several megabits per second. Local networking is likely to include
assigned (not shared) digital channels in various media for such
applications as video programming and other stream or bulk uses, at
aggregate data rates of hundreds of megabits per second.
How much bandwidth is enough? Assuming "always connected" and
good performance from the other network elements to be described,
10 Mbps symmetric should be adequate for almost all processor-based
applications, including fast-response image transfers (a 5-megabit
image in 0.5 second) and high-quality MPEG-2 or H.323
(conferencing/videophone) video at 4 Mbps. For streaming media such
as video, additional requirements of reserved capacity and minimal
queuing delay may be needed, requirements for which ATM is well
suited. ATM breaks traffic into uniformly-sized "cells" that can be
efficiently switched and reassembled with specified
quality-of-service guarantees. Forecasts of how soon ATM will be
available directly at consumer communicating devices vary, but
there is likely to be significant availability in 5 to 10 years.
For future applications with very complex immersive environments,
multiple high-definition video streams, or other
bandwidth-intensive needs, fast Ethernet and ATM should suffice.
Additional transmission facilities for program distribution could
use these or other technologies.
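The bandwidth estimate above can be checked with a short calculation. The sketch below ignores protocol overhead and contention (a simplifying assumption), and the function name is hypothetical. At 10 Mbps, a half-second transfer corresponds to 5 megabits of image data; a 5-megabyte image (40 megabits) would take about 4 seconds at the same rate.

```python
MBPS = 1_000_000  # bits per second in one megabit per second


def transfer_seconds(size_bits: float, rate_bps: float) -> float:
    """Time to move size_bits over a link at rate_bps, ignoring overhead."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return size_bits / rate_bps


LINK_RATE = 10 * MBPS                 # 10-Mbps Ethernet, as in the text
five_megabits = 5 * 1_000_000         # image size read as megabits
five_megabytes = 5 * 8 * 1_000_000    # image size read as megabytes

fast = transfer_seconds(five_megabits, LINK_RATE)    # 0.5 second
slow = transfer_seconds(five_megabytes, LINK_RATE)   # 4.0 seconds
```

The same function shows why a single 4-Mbps video stream fits comfortably on a 10-Mbps link while still leaving capacity for bursts of other traffic.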
Both shared-bandwidth networks such as Ethernet and dedicated
high-capacity channels could reside in the same physical medium,
which might be fiber, coax, or twisted-pair. The cost of a LAN has
been falling steadily, with Ethernet cards for personal computers
well below $100. The cost of wiring a new house or apartment
building with cable for Ethernet is low, but the cost could be
substantial for rewiring an old residence. Wireless networking, to
bypass the wiring problem, is available now, and it may be priced
comparably to Ethernet, for comparable capacity, in 4 to 5
years.
Access Communications
The access network is the set of transmission facilities,
control features, and network-based services that sits between a
user's premises and the core public network. The twisted-pair
subscriber line running from a telephone office to a user's
residence is part of the telephone access network, for example.
There are four basic paradigms offered (and in development):
telephone company services via the twisted-pair subscriber line,
cable company services via a coaxial cable (coax) feed, wireless
access via higher-powered cellular mobile or lower-powered PCS
(personal communications services), and direct broadcast satellite
service. There are additional paradigms, such as terrestrial
microwave, that are of secondary importance compared with these
four. The access network has long been regarded as a performance
bottleneck. The telephone channel, restricted to 3-kHz (kilohertz)
bandwidth (and data rates of about 30 kbps for reliable
transmission) by filters and transmission systems designed for
voice, presents both bandwidth limitations and connection delays
that seriously degrade performance.
"Access" can be a confusing term. An Internet service provider
offers access service to the Internet and some access facilities
such as TCP/IP software, but may not provide the physical pipe into
the home. For the moment, the discussion is restricted to access
networks that include the physical transmission facilities but
returns later to Internet service provider facilities because they
have a critical influence on the performance of Web browsers and
other Internet-oriented user interfaces.
Twisted Pair-based Telephone Company Services. The
first paradigm, access via a twisted-pair subscriber line, is
advancing with ISDN (integrated services digital network), ADSL
(asymmetric digital subscriber line), VDSL (very-high-speed digital
subscriber line), and HDSL (high-speed digital subscriber
line).23
Cable-based Access Services. A local cable
television (CATV) service company maintains a cable distribution
system that is still largely dedicated to broadcasting video
programming. The coaxial cable network, now actually combining
optical fiber trunks with coaxial branches and "drops" to
subscribers, is a "tree and branch" architecture well suited to
broadcast and not so well suited to upstream communications from
the user. It is not well suited to upstream communications because
of noise aggregation problems from many drops and branches coming
together and because the capacity of the cable, however large, is
being shared with bandwidth-hungry downstream video services and by
a great many subscribers.
Nevertheless, the cable industry has succeeded in evolving a
promising HFC (hybrid fiber coax) network architecture that can
service both video distribution and interactive communications
needs.24 The HFC system provides
digital channels with signals produced by cable modems, for which a
downstream channel may generate a 30-Mbps signal within a 6-MHz
bandwidth. Instead of one analog video signal, this digital
transmission can carry seven or eight high-quality MPEG-2 digital
video signals or one digital HDTV (high-definition television)
signal plus two MPEG-2 ordinary digital video signals. More
important for the NII, the digital capacity can be used for an
arbitrary mix of signals, supporting medical imaging, language
instruction, software downloads, and an infinite array of other
applications. A cable system could typically implement up to a few
dozen such 30-Mbps channels plus 80 old-fashioned analog channels
for subscribers who have not yet purchased the digital TV sets
expected to hit the U.S. market in 1998.
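The channel-packing figures above follow from dividing the 30-Mbps digital channel among video streams. The sketch below uses the 4-Mbps MPEG-2 rate cited earlier in this chapter; the HDTV rate of about 19.4 Mbps is an assumption introduced here for illustration, as is the function name.

```python
def streams_that_fit(channel_mbps: float, stream_mbps: float,
                     reserved_mbps: float = 0.0) -> int:
    """Whole number of streams that fit after reserving some capacity."""
    if stream_mbps <= 0:
        raise ValueError("stream_mbps must be positive")
    available = channel_mbps - reserved_mbps
    return max(0, int(available // stream_mbps))


CHANNEL_MBPS = 30.0   # one downstream cable-modem channel (from the text)
MPEG2_MBPS = 4.0      # high-quality MPEG-2 rate cited earlier in the chapter
HDTV_MBPS = 19.4      # assumed HDTV stream rate, not stated in the chapter

plain = streams_that_fit(CHANNEL_MBPS, MPEG2_MBPS)                 # 7 streams
with_hdtv = streams_that_fit(CHANNEL_MBPS, MPEG2_MBPS, HDTV_MBPS)  # 2 streams
```

At exactly 4 Mbps per stream the channel holds seven streams; an eighth fits only if the average rate drops to 3.75 Mbps or less, which is consistent with the "seven or eight" range given above.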
Upstream capacity shared among many subscribers is much more
constrained. Standards are being developed that will allow a
user to share with neighbors a 1.5-Mbps upstream channel (one of
about 20 such channels serving a group of 125 to 500 subscribers).
Other modem designs allow a pool of users to share a 10-Mbps
upstream channel, mimicking the behavior of Ethernet. Here, just as
with ADSL, the operator is betting that traffic will be asymmetric
and that the user will not have a performance complaint even though
the upstream bandwidth is not especially generous.
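The operator's bet on asymmetric traffic can be made concrete with a rough estimate of upstream capacity per subscriber. The sketch below pools the roughly 20 channels of 1.5 Mbps across the subscriber group and caps any one user at a single channel's rate; the even-sharing model and the `active_fraction` parameter are simplifying assumptions, not part of any cable modem standard.

```python
def upstream_share_kbps(channels: int, channel_mbps: float,
                        subscribers: int,
                        active_fraction: float = 1.0) -> float:
    """Average upstream rate per active subscriber, capped at one channel."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    active = max(1, round(subscribers * active_fraction))
    pooled_kbps = channels * channel_mbps * 1000
    return min(channel_mbps * 1000, pooled_kbps / active)


small_group = upstream_share_kbps(20, 1.5, 125)   # 240.0 kbps each
large_group = upstream_share_kbps(20, 1.5, 500)   # 60.0 kbps each
```

With every subscriber transmitting at once, the share falls to between 60 and 240 kbps; the design works only if a small fraction of users send upstream at any moment.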
Above this physical channel level, the cable industry's model
usually includes IP services with the same "always on" flavor that
professionals enjoy at work. This is an important performance
advantage for cable access, supporting broadcast information
services that flash the latest bulletin on a computer or TV screen,
quick Internet telephony perhaps by touching a miniature picture in
the screen directory, and immediate linking to a distant Web site
(contingent on performance being good farther upstream). If the
service, including getting started and customer premise setup,25 is done well, the popular conception
of Internet service as difficult to get started and unreliable
after that could change radically, and the Web browser could indeed
become a universal user interface.
Wireless Access Services: Location Transparency and
Consciousness and Power/Bandwidth Tradeoffs. Wireless
access, currently in cellular mobile networks and soon in PCS
networks, supports mobility of persons, devices, and services. It
makes possible carrying wearable or pocket devices, doing computing
in a car (perhaps with a "heads-up" display on the windshield, used
only when it is safe to do so, of course), reading documents and
messages on an electronic "infopad" at meetings, and sending
"electronic postcards" from digital cameras and camcorders. The new
and large unlicensed NII Supernet spectrum authorized by the
Federal Communications Commission, in the relatively high 5-GHz
band, will give a large boost to interactive multimedia services
when mass-produced, low-cost radio modems become a reality. That
could happen within 3 to 4 years.
Wireless access can support both location transparency, in which
the user's application appears the same regardless of location, and
location consciousness, in which the application finds and exploits
local resources and can offer location-dependent services, such as
giving directions to the nearest drugstore. These two features are
not incompatible, and both contribute to the utility and usability
of a user interface.
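A minimal sketch can show how the two properties coexist. Below, the application makes the same call wherever the user is (location transparency), while the answer depends on the user's position and a local directory of resources (location consciousness). The `Place` type, the flat kilometer grid, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Place:
    """A hypothetical local resource with a position on a flat grid."""
    name: str
    x_km: float
    y_km: float


def nearest(places: Iterable[Place], x_km: float, y_km: float) -> Place:
    """Return the place closest to the user's current position."""
    return min(places,
               key=lambda p: math.hypot(p.x_km - x_km, p.y_km - y_km))


drugstores = [Place("Main Street", 0.0, 1.0), Place("Harbor", 5.0, 5.0)]
at_home = nearest(drugstores, 0.0, 0.0)     # Main Street
downtown = nearest(drugstores, 6.0, 5.0)    # Harbor
```

The interface seen by the user does not change between the two calls; only the position supplied by the access network does.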
Because of the power constraints imposed by small portable
devices, including but not limited to pocket telephones, medical
monitoring and alerting devices, communicating digital cameras and
camcorders, communicating watches, communicating pocket calendars,
and even some
laptop computers, it is important for the quality of the user
interface that the wireless access system offer appropriate
tradeoffs between communications and processing resources. One way
this is realized is to concentrate the power of the portable device
on display functions, such as a bright sharp display, and leave
media processing (such as MPEG and videoconferencing digital video
coding/decoding) to processors accessed through the wireless
network. However, this balance of function may imply an
unacceptable cost for the substantial communications capacity to
carry the unencoded video information. Another issue is how to
minimize power use on portable systems that are always listening.
Further research will be required to identify a reasonable balance
between processing and communications power in the system.
The microcellular PCS and Supernet networks are well suited to
this need, aiming for burst transmission rates of 25 Mbps or more
in small (perhaps 300-meter-wide) microcells. This compares very
favorably with present-day telephony-oriented cellular mobile
networks, where modems may provide up to about 20 kbps
communications rate. Higher rates are possible in the digital
cellular mobile systems becoming widely deployed now, but probably
not more than 256 kbps, still far below microcellular networks.
The low Earth-orbiting satellite (LEO) systems planned for
personal communications from anywhere in the world, which will
compete to some extent with terrestrial microcellular PCS systems,
could offer the significant user interface advantage of having
exactly the same user interface anywhere in the world. This would
remove a major anxiety for many users.
Direct Broadcast Satellite Distribution Services.
Satellite services could augment wired facilities to improve the
performance of the user interface. In particular, downloading of
large information files to proxy servers in nearby network offices
or in the end-user's equipment itself would reduce the delays of
access to information in distant servers. There are cache memories
in Web browsers that save Web HTML objects requested by users
because there is a high probability that the objects will be
requested again, but a proxy server does something else. It caches
information when it has been requested by one user, under the
assumption that if the material is popular other users may request
it as well. This has the effect of improving response time
considerably for those users and offering added possibilities for
customization. There are many important research questions in
selecting material for proxy servers, updating strategies,
customization for users, and integrating the satellite facility
smoothly with the wired network.
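The distinction between a browser cache and a shared proxy cache can be illustrated with a small sketch. Here one cache serves many users, so an object fetched for the first requester is served locally to everyone after that. The class, the `fetch_from_origin` callable, and the absence of any expiry or update policy are simplifying assumptions; selecting and refreshing material are exactly the open questions noted above.

```python
from typing import Callable, Dict


class ProxyCache:
    """Hypothetical shared cache placed between many users and distant servers."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_fetches = 0  # counts slow trips to the distant server

    def get(self, url: str) -> bytes:
        """Serve from the cache when possible; otherwise fetch once and keep."""
        if url not in self._store:
            self._store[url] = self._fetch(url)
            self.origin_fetches += 1
        return self._store[url]


def slow_origin(url: str) -> bytes:
    """Stand-in for a distant, congested server."""
    return f"content of {url}".encode()


proxy = ProxyCache(slow_origin)
first_user = proxy.get("http://example.com/news")
second_user = proxy.get("http://example.com/news")  # served from the proxy
```

Only the first request reaches the origin server; the second user sees the faster local response.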
Direct broadcast satellite service in the NII would also include
its
present function of distributing video programming directly to
user TVs. In the future it is possible that continual
magazine-style broadcasting of video information clips, captured
and displayed immediately by user devices rather than retrieved
from cache servers, also will be part of the nation's information
infrastructure. This would offer the freshest-possible material,
supporting, for example, a customized user information service in
which information is updated even as the reader observes it.
Core Network Communications: QoS, Interoperability, and Value-Added Services
The core network interconnects access networks. It aggregates
traffic, and is, or should be, designed to provide differing
quality-of-service (QoS) treatment for different classes of
traffic.26 Continuous voice and
video media should enjoy minimum delay, and data files should be
transferred with minimum error rate. ATM is already widely deployed
in the core network. Research and development on QoS control is
already extensive, and further work, on topics such as
renegotiation of offered capacity and dynamic user control over
QoS, would improve the performance of future user interfaces. For
example, a user with several applications running could trade QoS
among them, improving video resolution, for example, at the expense
of the rate of transfer of a new software module being downloaded
in the background.
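The idea of trading QoS among applications can be sketched as moving capacity from one allocation to another while the total stays fixed. The function, the application names, and the rates below are illustrative assumptions; an actual network would negotiate such changes through its own signaling.

```python
from typing import Dict


def trade_capacity(allocation: Dict[str, float], donor: str,
                   recipient: str, amount_mbps: float) -> Dict[str, float]:
    """Return a new allocation with amount_mbps moved from donor to recipient."""
    if amount_mbps < 0:
        raise ValueError("amount_mbps must be non-negative")
    if allocation[donor] < amount_mbps:
        raise ValueError("donor does not have that much capacity")
    updated = dict(allocation)   # leave the caller's allocation untouched
    updated[donor] -= amount_mbps
    updated[recipient] += amount_mbps
    return updated


before = {"video": 4.0, "download": 6.0}
after = trade_capacity(before, donor="download", recipient="video",
                       amount_mbps=2.0)
```

Here the user improves video resolution by giving the video stream 2 Mbps taken from the background download, which simply finishes later.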
There are additional services that either the core network or
the access network, or indeed parties other than the network
operators, can provide to enhance user interface performance. For
example, a multiparty desktop audio/video conference can be
displayed on one user's screen as a custom combination of pictures
of the other participants with a corresponding spatial distribution
of their voices. This can be done either in the user equipment by
processing multiple audio/video information streams all coming to
that user or by a processing service in the network (or offered by
a third party) called a "multimedia bridge" that creates the
customized display for the user and supplies that user with only a
single audio/video information stream. If access bandwidth is at a
premium, the network-based bridging service provides a high-quality
user interface at a minimum cost in access bandwidth.
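The saving in access bandwidth can be estimated directly. Without a bridge, each participant receives one stream from every other participant; with a bridge, each receives a single combined stream. The sketch below assumes equal-rate streams and uses the 4-Mbps video rate cited earlier as an illustrative value.

```python
def downstream_mbps(participants: int, stream_mbps: float,
                    bridged: bool) -> float:
    """Downstream access bandwidth one participant needs in a conference."""
    if participants < 2:
        raise ValueError("a conference needs at least two participants")
    incoming_streams = 1 if bridged else participants - 1
    return incoming_streams * stream_mbps


unbridged = downstream_mbps(5, 4.0, bridged=False)  # 16.0 Mbps
bridged = downstream_mbps(5, 4.0, bridged=True)     # 4.0 Mbps
```

For a five-party conference the bridge reduces the per-user downstream requirement from 16 Mbps to 4 Mbps, at the cost of processing in the network.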
Performance Impacts of Internet Services Architecture
The Internet, which utilizes all of the communications hierarchy
outlined above, is considered by many to be the heart of the NII.
As the multimedia Internet evolves and assumes much of the quality
control (and charge for service!) functionality of the telephone
network, this is
likely to become true by definition. The Internet is defined by
use of IP, which carries packets from one kind of network to
another without the application having to directly control any
services in those networks.
Although they do not, in general, provide the access
transmission facilities, Internet service providers do supply other
access facilities that have a large influence on the performance of
user interfaces. These include at least the following:
• Adequate modem pools and fast log-on for dial-up service;
• Direct low-level packet interconnection to the Internet, as well as
higher-level services such as e-mail, UseNet servers, domain name
servers, and proxy Web servers;
• Gateway services between Internet telephony and public network
telephony (evolving in the near future to multimedia real-time
communications); and
• Documentation and instruction for use of browser applications,
e-mail, and various Internet services and resources.
It would require a lengthy report to describe how each of these
affects user interface performance. Suffice it to say that a major
objective in providing good service is the avoidance of server
congestion, by means of the use of proxy Web servers to give users
the impression of fast response from uncongested access to a nearby
server, when in fact the originating server is far away and highly
congested. Fast response time is, as emphasized earlier, an
important measure of good performance of the user interface.
We might also reemphasize the importance of being "always
connected" to Internet access for applications such as receiving
timely information from "push" servers (such as the fast-developing
customized current information services producing ever-changing
displays in screen savers), immediate delivery of e-mail, fast
receipt and initiation of real-time audio/video calls, and
participating in the on-line work of a distributed group. An
always-connected transmission access facility is required, of
course, which must be matched by similar facilities27 for the Internet service
provider.
As with providers of wireless access services, Internet service
providers will soon be required to support mobility services, such
as locating and characterizing nomadic users. There are significant
research questions in coordinating Internet routing and
service-class support policies with the movement of individuals, in
transferring customer profiles for Internet services, and in other
aspects of mobility support.
Software Architecture: Distributed Object Environments and Transportable Software
Management of a mobility environment, particularly location
transparency and location consciousness, is complex, and further
research is needed. Distributed object environments, a software
structure being used more and more in communications as well as
applications systems, have a large potential to help resolve the
complexities and improve performance.28 For example, the global availability
of a distributed object environment would make abstract service
objects available in a consistent format everywhere, with those
objects translating user needs into instructions to local
systems.
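The notion of an abstract service object that translates a user need into local instructions can be sketched as follows. The classes and command strings are hypothetical and do not represent CORBA or any particular distributed object system; they only illustrate one interface with differing local implementations.

```python
from abc import ABC, abstractmethod


class ClimateService(ABC):
    """Hypothetical abstract service object, identical wherever the user is."""

    @abstractmethod
    def set_temperature(self, celsius: float) -> str:
        """Translate the user's request into a command for the local system."""


class CelsiusThermostat(ClimateService):
    def set_temperature(self, celsius: float) -> str:
        return f"SET {celsius:.1f}C"


class FahrenheitThermostat(ClimateService):
    def set_temperature(self, celsius: float) -> str:
        # The local equipment expects Fahrenheit, so convert before sending.
        return f"SET {celsius * 9 / 5 + 32:.1f}F"


def make_comfortable(service: ClimateService) -> str:
    """Application code: the same call regardless of the local equipment."""
    return service.set_temperature(20.0)
```

The application calls `make_comfortable` without knowing which local system it is addressing; the object supplied by the environment handles the translation.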
Transportable software is another important object-oriented
technology that proceeds from a different assumption, that a common
"virtual machine" (a special operating system on top of the real
one) can be created on different platforms, so that software in
"applets" (and applications) can be moved around from one machine
to another.29 Java is a widely
accepted language and virtual machine structure. Web browsers now
commonly implement the Java virtual machine, allowing application
applets to be downloaded from Web sites and executed in the user's
computer. This facilitates animated displays and other features in
the user interface, with much better performance than if the
software executed in the Web server and large quantities of display
information had to be transmitted to the user's browser. It also
facilitates customization of the Web browser user interface for
users with special needs and constraints.
Transportable software also has great potential for
"programmable networks" in which communications protocols and
services are not fixed but can be changed on user request by
sending the appropriate applets to network elements, such as
switches, where they execute. This, too, can improve performance
where alternative protocols are better matched to applications
needs, making the user interface more responsive and pleasant to
use.
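A toy model of a programmable network element is sketched below. The element applies whatever handler is currently installed, and a user request replaces that handler with a different one. The class and handler names are hypothetical, compression stands in for an "alternative protocol," and the security checks that a real system would require are omitted.

```python
import zlib
from typing import Callable

Handler = Callable[[bytes], bytes]


def passthrough(payload: bytes) -> bytes:
    """Default behavior: forward the payload unchanged."""
    return payload


class NetworkElement:
    """Hypothetical switch whose forwarding behavior can be replaced."""

    def __init__(self, handler: Handler = passthrough) -> None:
        self._handler = handler

    def install(self, handler: Handler) -> None:
        """Accept transported code that changes how payloads are handled."""
        self._handler = handler

    def forward(self, payload: bytes) -> bytes:
        return self._handler(payload)


switch = NetworkElement()
plain = switch.forward(b"hello hello hello")
switch.install(zlib.compress)           # user requests a compressing service
packed = switch.forward(b"hello hello hello")
```

After the new handler is installed, the same `forward` call produces compressed output that the receiver can restore with `zlib.decompress`.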
Notes
1. See Gunter (1992), Semantics of Programming Languages, for
more extensive discussion.
2. For example, the two sentences below differ only in a single
word, but the resulting structure of the preferred interpretation
is significantly different (Frazier and Fodor, 1978; Shieber, 1983,
gives a computational model that elegantly handles this particular
psycholinguistic feature). In the first sentence, "on the rack"
modifies "positioned," whereas in the second, it modifies "dress":
Susan positioned the dress on the rack. Susan wanted the dress on
the rack.
3. Texas Instruments had an early natural language system that
did this.
4. This example was discussed by John Thomas, of NYNEX, at the
workshop.
5. Concatenative synthesizers achieve synthesis by concatenating
small segments (e.g., diphones) of stored digitized speech. Formant
synthesizers use a rule-based approach to achieve synthesis, by
specifying acoustic parameters that characterize a digital filter
and how these parameters change as different phonemes are
sequenced.
6. Personal communication, John C. Thomas, NYNEX, December 12,
1996.
7. A system introduced by IBM in 1996 for voice recognition
software was designed to enable radiologists to dictate reports
into a personal computer. Recognizing 2,000 words and requiring
some training, its support for conversational discourse, in a
context where certain technical phrases may be used frequently, was
contrasted in the press to the need to pause after individual words
in older commercial software (Zuckerman, 1996).
8. Candace Sidner, of Lotus Development, and Raymond Perrault,
of SRI, contributed much of the content of this subsection.
9. Indexing and retrieval constitute a growing application area,
especially with the increased desire to organize and access large
amounts of data, much of which is available as text.
10. This section concentrates on the state of the art of
complete end-to-end natural language processing systems and does
not describe research in individual areas. The steering committee
notes that there has been significant progress, ranging from new
grammatical formalisms to approaches to lexical semantics to
dialogue models.
11. There is much promising research on syntactic models, such
as the TAG (tree-adjoining grammars) work (see Joshi et al., 1981,
1983, 1995; Shieber, 1983), which are computationally tractable
syntactic formalisms with greater power than context-free grammars,
and on lexical semantics.
12. Although space prevents including detailed references here,
the interested reader is directed in particular to the recent
years' conference proceedings of the Association for Computational
Linguistics, the European Association for Computational
Linguistics, the international meeting in Computational Linguistics
(COLING), the DARPA Spoken Language and MUC workshops, and the
journals Artificial Intelligence, Computational Linguistics,
and Machine Translation.
13. For applications involving database query, or for more
sophisticated command and control, the mapping between the sequence
of words and their meaning can be very complicated indeed. DARPA
has funded applications-oriented research in language understanding
(Roe and Wilpon, 1994; Cole et al., 1996) in the context of
database query, where the user requests the answer to a query by
typing or uttering the query. In most language understanding
systems to date, a set of syntactic and/or semantic rules is
applied to the query to obtain its meaning, which is then used to
retrieve the answer. If the query refers to information obtained in
previous queries, another set of rules that deal with discourse is
used to disambiguate the meaning. Pragmatic information about the
specific application is often encoded in the rules as well. Even
for a simple application like retrieval of air travel information,
hundreds of linguistic rules are hand coded by computational
linguists. Many of these rules must be rewritten for each new
application.
14. The Linguistic Data Consortium at the University of
Pennsylvania, which is sponsored by government and industry, now
makes much of this data available, from different sources, for
different tasks, and in different languages.
15. Note that portable devices raise the larger issue of data
durability: portable devices may be easier to lose or break, which
raises questions about ease of backup for the data they
contain.
16. Much of the cross-industry disagreement revolved around
interlacing, a technique that has long been used in television to
increase resolution and that takes advantage
of the extremely high line-to-line and temporal coherence of
images produced by television cameras. Computer output, especially
text and graphics, tends to be hard edged and to flicker badly when
displayed on interlace monitors. Although one can convert
interlaced broadcast TV to noninterlaced at the receiver end easily
enough, there is a cost issue that affects the likelihood of
flooding the market with the cheapest sets possible, hence
affecting penetration and return on investment. The computer
industry (hardware, software, and netware), of course, wants the
low-end TVs of the future to handle digital output in a reasonable
format; the television industry wants a 16 × 9 interlaced
format (which is really a 32 × 9 format non-interlaced).
17. The Web is, of course, a great source for visual input.
Copyright concepts of fair use and royalties will necessarily
adapt, as they will for text, and audio quotations, samples, and
outright theft.
18. Blake Hannaford, of the University of Washington,
contributed much of the content of this subsection.
19. In fact, the graphics produced are not Braille but simply
dot graphics printed on a Braille printer with the same resolution
or dot spacing as Braille. This is a common technique, but it
produces relatively low resolution graphics.
20. The WIMP interface will not serve this future, though
elements will be involved (keypads, pointing, etc.). In its current
form it is arguably dangerous to people susceptible to repetitive
stress disorders, unusable by a large segment of the population
with disabilities, and far too simple for navigation in complex
spaces.
21. As noted in Herndon et al. (1994), a slider or dial for
volume control has 1 degree of freedom; the mouse for picking,
drawing, or two-dimensional location has 2 DOF, a 6D mouse or
head-tracker for docking or view control has 6 DOF, a glove or face
device for hand/face animation can have 16 or more DOF, and a body
suit for whole-body animation can have over 100 DOF.
22. Many users of today's Internet telephony services experience
a long delay, sometimes of the order of a second, in transmission,
actually due more to buffering in the user's computer to smooth out
arriving packets.
23. "Basic rate" ISDN, providing an aggregate 144-kbps symmetric
service to a subscriber, suffered from a too-long development,
unattractive rate structure, and general ambivalence on the part of
telephone operators, but is now widely available and popular for
Internet and "work at home" access needs. The usual access rate is
128-kbps symmetric, obtained by tying together two 64-kbps channels provided
within the 144-kbps aggregate service. From the user's point of
view, ISDN still suffers from the need to set up a connection,
although setup is usually quite fast, and from per-minute charges
even for local calls.
ADSL, now focused on the generally asymmetric traffic
requirements of computer communications sessions, offers 1.5 to 6
Mbps downstream (network to subscriber) and up to 384 kbps upstream
(from the subscriber). A subscriber's ADSL service has the
potential to be always connected, permanently linked, for example,
through a router in the telephone office into a high-speed data
network. It is not yet clear that telephone companies have the
"always-connected" paradigm in mind. Telephone companies have
wavered in their commitment to ADSL, so it is a very tentative
forecast that ADSL service at acceptable cost will be available to
millions of telephone subscribers in 5 years.
Although ADSL could vastly improve the performance of multimedia
user interfaces, it should be recognized (and this will hold for the
other broadband access mechanisms as well) that contention for
capacity on networks upstream, and congestion at servers, may also
seriously constrain performance.
HDSL, which provides symmetric capacity of 1.5 Mbps and up and
usually is designed to work over two twisted pair lines, is not
generally associated with residential users but could quickly
overtake ADSL if households begin to generate high-capacity
traffic.
VDSL, at rates of 25 Mbps or higher, requires a distribution
point closer to the subscriber than a present-day telephone office.
Its potential penetration is difficult to predict and depends a
great deal on the success and competitive implications of
cable-based data services.
24. Cable interactive access services are just beginning to be
commercially available. It is a fairly safe prediction that by
mid-1999 millions of cable subscribers will be offered this
service.
25. It is a challenge to the cable industry to make subscription
and service provisioning simple and fast, and some standards
interoperability questions discussed later, such as "plug and play"
of digital set-top boxes, remain to be resolved.
26. The conventional "best-effort" IP service does not require
any special capabilities from the core network, but the new
QoS-conscious IP services and, of course, ATM do. The core network
must deploy technologies such as edge switches and access
multiplexers that aggregate traffic arriving under various
communications protocols, and must closely control QoS parameters
for multiswitch routings.
27. For the modem-based ISPs this implies higher rates, but the
cable model may allow "always-on" capability without major
increases in hardware investment.
28. CORBA (Common Object Request Broker Architecture),
standardized by the Object Management Group, is a leading candidate
for a universally accepted architecture, although there are other
distributed object systems proposed by major software vendors, such
as Microsoft's ActiveX.
29. Transportable software and object broker systems such as
CORBA are complementary more than competitive. CORBA provides
important object location and management services and facilitates
use of existing applications software by wrapping applications
(written in whatever computer language) in CORBA objects with
standard IDL interfaces. The Java virtual machine requires new
applications, all in the Java language, and applets may not execute
as efficiently as software written for the underlying operating
system, but it facilitates the movement of executable software,
with appropriate security constraints, with the benefits outlined
above. There are many examples now of CORBA-based systems in which
CORBA objects are invoked by transportable Java applets.