Self-configuration and Adaptive Coordination
Many of the anticipated applications of networked systems of embedded computers (EmNets) will be realized only if the systems are capable of configuring and reconfiguring themselves automatically. This chapter focuses on mechanisms needed to achieve automatic reconfiguration. In many EmNets, individual nodes will need to assemble themselves into a networked system, find available resources on the network, and respond to changes in their desired functionality and in the operating environment with little human intervention or guidance.1
A set of basic underlying mechanisms will be required to ensure that EmNets are self-configuring and adaptive. For example, components will need to be able to discover other resources on the network and communicate with them. Systems will need to be able to sense changing environmental conditions or changing system capabilities and respond appropriately so that the entire system, as well as individual components, can operate as effectively and efficiently as possible. Both software and hardware adaptability will be important; EmNets will consist not only of elements that can change their software but also of those that take advantage of reconfigurable computing technologies to adapt limited hardware to the operating environment. Many EmNets will contain components that are constrained in terms of their physical size, amount of memory available, and/or availability of local energy sources. For these system components, both the need for efficiency and the constraints on how it is achieved will be more severe than is the case for more traditional distributed computing systems. Efficient system designs will exploit higher-capacity and resource-rich components where they exist in the overall system and will exploit the redundancy provided by deploying large numbers of inexpensive components. Many current efforts do not focus on systems that operate under these kinds of constraints. Work on the design of personal digital assistants (PDAs) and cell phones, for example, does not need to take into account very large numbers of interacting elements, distributed control, severe energy constraints, or the kinds of physical coupling that many EmNets must accommodate. Approaches taken in the design of smart spaces for homes or office environments are relevant, but such systems generally have more infrastructure to support them than many of the EmNets discussed here.

1This requirement is central to DARPA’s self-healing minefield program, for example. For more information on this program, see <http://www.darpa.mil/ato/programs/apla/contractors.html>.
This chapter examines approaches to providing the mechanisms needed to support self-configuration and adaptive coordination of EmNets. The first section defines these key concepts. The second discusses the elements of self-configuration and adaptive coordination in existing distributed systems, serving as a primer on the state of the art. The final section of this chapter outlines the research needed to realize the vision for robust, scalable EmNets.
Self-configuration (sometimes referred to as reconfiguration) and adaptive coordination (sometimes referred to as adaptation) refer to the spectrum of changes that a system makes to itself in response to occurrences in its environment and internally. Neither of these terms is meant to convey infinite flexibility. The changes that self-configuration and adaptive coordination induce in a system should always be within the constraints of the system’s planned functionality (admittedly, one such change might be to modify the functionality of the system). For the purposes of this report, the terms self-configuration and adaptive coordination differ with respect to the frequency and degree of the changes they induce in, or respond to within, an EmNet. Making a sharp distinction between the two is not as important as recognizing that some techniques are more relevant to one than to the other. In the rest of this chapter the terms are distinguished in order to highlight the techniques that are more appropriate for each.
Self-configuration involves the addition, removal, or modification of elements contained in an EmNet, along with the resulting process of establishing interoperability among the components and locating essential services (such as data aggregation nodes in sensor networks). Put another way, self-configuration is the process of interconnecting available elements into an ensemble that will perform the required functions at the desired performance level. As such, self-configuration changes the composition of an EmNet and may alter the distribution of functionality across the components that make up the system or may even alter the system’s overall function based on which components are available.
Adaptive coordination involves changes in the behavior of a system as it responds to changes in the environment or system resources. For example, to achieve a long lifetime, a system may need mechanisms by which nodes can mediate their actions based on the density of redundant components. Nodes with redundant capabilities might be programmed to alternate responsibility for a given task in the style of sentry duty rotation. Similarly, EmNets could implement multiple levels of service, depending on locally perceived conditions or detected events. Thus, adaptive coordination refers to changes in operational parameters that are made because of variations in available resources or load. Included in these resources are available energy, computational resources, and communication bandwidth. In general, adaptive coordination induces less dramatic changes in system architecture than does self-configuration and does not alter the system’s function. The two processes often occur on different time scales. Adaptive coordination tends to take place more quickly than does self-configuration, with a very short lag time between the moment a change is detected in the operating environment and the time the system adapts its behavior.
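The sentry-style rotation described above can be sketched in a few lines. The following is an illustrative Python sketch, not a protocol from the literature; the node identifiers, cluster membership list, and epoch length are all assumed:

```python
# Sketch: redundant nodes take turns staying awake ("sentry duty") so
# that coverage is maintained while the other nodes sleep to save energy.
# Each node computes the same schedule from shared knowledge of the
# cluster, so no extra coordination messages are needed.

def on_duty(node_id: int, cluster: list[int], time_s: float,
            epoch_s: float = 60.0) -> bool:
    """Return True if this node is the active sentry for the current epoch."""
    epoch = int(time_s // epoch_s)
    # Rotate responsibility deterministically through the sorted cluster.
    members = sorted(cluster)
    return members[epoch % len(members)] == node_id

# Three redundant nodes covering the same area; duty rotates each minute:
cluster = [4, 7, 9]
schedule = [[n for n in cluster if on_duty(n, cluster, t)]
            for t in (0, 60, 120)]
```

Because the schedule is a pure function of time and membership, a node that sleeps through several epochs can still determine whether it is on duty when it wakes.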
Another dimension to bear in mind is the level at which the configuration or adaptive coordination occurs. This level can range from reconfigurable hardware to operating systems and run-time environments all the way to application-specific code. Levels vary in the extent of the effect of the reconfiguration and/or adaptive coordination as well as in the amount of code that needs to be stored or retrieved to make the change.

A crucial facility that must accompany EmNets’ ability to adaptively reconfigure themselves is the facility for self-monitoring. Despite some of the most rigorous testing in existence, many of today’s highly complex systems are prone to failure when reconfigured. Telephone switching systems, for example, have suffered severe outages when new software is brought online. Yet this report suggests that EmNets must be able to change along many distinct axes, perhaps without an expert present. New system testing and software update technology will therefore have to be developed; meeting this challenge has proven very difficult even in more conventional systems, and EmNets intensify the need. EmNets will also have to be able to convey their current operational state to their users. As argued elsewhere in this study, establishing that state requires far more than just tallying hardware resources. An EmNet will require a way to monitor how well it is performing and to compare this result against its goals; it will also require a means for reporting such information to users.2
The nature of the configuration or adaptive coordination depends heavily on the type of application the EmNet supports. In automobiles, for example, the focus of self-configuration would probably be on accommodating the heterogeneity of system components continuously introduced to, and removed from, the system as the people, conditions, equipment, and procedures vary. Unlike more standard computer networks, such embedded monitoring networks must be built assuming that there is no professional system administration, so the configuration must be highly (if not completely) automatic. Further complicating such networks are two typical requirements (as, for example, would be needed for automobile control): that the overall network be capable of making certain service guarantees and that some operations (such as notifications of life- or safety-threatening events) take precedence over other forms of network traffic.
In sensor networks that might be used for precision agriculture or environmental monitoring, system composition will vary less because the application is more constrained, but more attention must be paid to adapting the nodes’ operational parameters to unpredictable and varying environmental conditions. This is particularly challenging and critical in energy-constrained devices, which must minimize the communication they expend on overhead functions and for which opportunistic listening can be relatively expensive because it keeps a power-consuming communication resource (for example, a radio or other wireless device) active. Extensive capabilities that incorporate both adaptive coordination and reconfiguration will be required in systems such as those used on a battlefield, where changes in both the environment and system makeup can occur rapidly yet certain service guarantees are absolutely required.
SELF-CONFIGURATION AND ADAPTIVE COORDINATION IN DISTRIBUTED SYSTEMS
This section discusses the elements of self-configuration and adaptive coordination in existing distributed systems. These elements include the notion of service discovery, as well as the critical issues of interfaces and interoperability. The discussion is primarily applicable to self-configuration; however, it is likely that adaptive coordination will require similar elements (e.g., mobile code). This background is useful in preparing to analyze the issues posed by EmNets. How EmNets differ from other types of distributed systems will become clearer as the analysis proceeds; later in this chapter, research challenges in these areas are examined. In general, EmNets present more extreme versions of the problems encountered in distributed systems, but they also pose a few unique problems of their own, such as severe energy constraints.
Discovery in Distributed Systems
Automatic self-configuration requires the ability to interoperate with new and old system components without human intervention. System components must be able to automatically discover each other and the services they represent. Building on the interface concepts of network configuration, wire protocols, and code mobility (discussed later in this chapter), this subsection discusses the issues involved in device and service discovery and how they relate to self-configuration. How entities on an existing network communicate is generally viewed as the interoperability problem. How those entities find each other, advertise their own services, or join the network is generally taken to be a separate problem, referred to as the discovery problem. Generally, the discovery problem can be divided into four parts:
How does a network entity join the physical network; that is, how is it authorized and given a network address and a network identity?
Once an entity is on the network and wishes to provide a service to other entities on the network, how does it indicate that willingness?
If an entity is looking for a service on the network, how does it go about finding that service?
How does geographic location affect the services an entity can discover or select for use?
Joining the Network
In traditional computing networks, the task of joining a system to a network has been done by hand: A system administrator configures the system with a particular network identity and then updates the appropriate routing and/or naming tables with the information needed to find the new member of the network. As networks have been scaled up, techniques have been introduced that allow the partitioning of the large network into smaller subnets and the propagation of (manually entered) bootstrapping information from the subnets to the larger networks. However, the advent of larger networks and networks that have little or no professional administration (such as those in the home or in networks of embedded systems) has led to an interest in automating this bootstrapping mechanism.
Mechanisms that automate joining a network have been around for some time. The Apollo Domain system, for example, allowed a node (workstation or server) to be connected to the network by finding a location broker with which the new node registered. Having completed this registration, the new node could be found by any other node in the network. The AppleTalk protocol enabled not only computers but also peripheral devices, such as printers, to join the network and be found automatically by other entities in the network. However, these mechanisms have been confined to particular (proprietary) networks and have not been generally adopted, especially in networks of smaller, embedded systems. One reason is that such mechanisms assume resource-rich environments as opposed to the resource- and energy-constrained environments that many embedded systems and most EmNets must contend with.
The actual mechanism most generally used for such bootstrapping tends to be conditioned (if not fully determined) by the physical network to which the device is attached. In an Ethernet Transmission Control Protocol (TCP)/Internet Protocol (IP) environment, for example, the Dynamic Host Configuration Protocol (DHCP) is commonly used to hand out addresses to entities that are connected to the network. Part of the Universal Plug and Play (UPnP) specification is a mechanism allowing devices to assign themselves a network address on networks where DHCP is not present. For IEEE 1394 (otherwise known as FireWire), however, a very different mechanism is needed because the network itself will produce the equivalent of a bus interrupt when a new device is plugged in, thus informing every other device of the presence of a new entity. Networks designed for cell phone use have yet another way of allowing the phone to be recognized in the cell. The roaming function allows a phone to register its new location with a central database that then tells the phone’s home location how to reroute calls. The range of services achievable by automatic discovery and joining mechanisms is determined in part by whether nodes have unique identifiers or whether at boot time they are literally identical.
Joining the network entails locating essential services as well as obtaining network-level address and routing information. Existing mechanisms make use of multicast3 and well-known service-location addresses to bootstrap this process.
Advertising and Finding Services
The problem of advertising a service once a physical connection to the network has been established has been approached in a number of different ways. Perhaps the most common approach in general computing systems has been naming and directory services, in which the service that wishes to advertise itself creates an entry in a naming service or a directory service that allows others who know the name or description of the service (or at least the type of service) to get a reference to the new offering. Such mechanisms generally assume that there is a human being somewhere in the loop, because both naming systems and directory servers are string based, with the meaning of the string left to the user. When programs look for services, they need to know the name or description under which the service is registered. Some directory services have evolved rather complex ontologies in the form of description schemas to allow such programmatic access to services.
A different approach has been taken by service traders and the Jini system (Arnold and Waldo, 2000), in which services are identified by the interfaces they support. In a traditional trader system (such as those found in the Distributed Computing Environment (DCE)4 or the Common Object Request Broker Architecture (CORBA)5 trading service), a service registers itself by indicating what interfaces it supports; clients look up a service by asking for a reference to something that supports a particular interface. If more than one object has been registered that implements a given interface, then any of the objects can be returned by such a query. In the Jini lookup service, services register by their Java language type; they can be returned to any client asking for something that is at least an instance of the requested class (for example, the returned object might be a subclass of the requested class).
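The type-based lookup described above can be illustrated with a small sketch (here in Python rather than Java). The Printer classes and the lookup interface are hypothetical; the point is that a query for an interface can return any registered object that is at least an instance of the requested type:

```python
# Sketch of a Jini-style lookup service: services register under their
# type, and a client asking for Printer may receive a LaserPrinter,
# since a subclass satisfies the requested interface.

class Printer:
    def print_job(self, doc: str) -> str:
        return f"printing {doc}"

class LaserPrinter(Printer):  # a subclass may be returned for a Printer query
    pass

class LookupService:
    def __init__(self):
        self._services = []

    def register(self, service: object) -> None:
        self._services.append(service)

    def lookup(self, interface: type):
        # Any registered object implementing the interface may be returned.
        for service in self._services:
            if isinstance(service, interface):
                return service
        return None

registry = LookupService()
registry.register(LaserPrinter())
match = registry.lookup(Printer)   # returns the registered LaserPrinter
```

Contrast this with a naming service, where the client would have to know the string under which the printer was registered rather than the interface it supports.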
The problem of how an entity finds the place to advertise its services is not always addressed by the systems described above; most naming or directory systems consider this problem to be part of the general bootstrapping mechanism and assume that it is dealt with in some fashion outside their scope. The Service Location Protocol (SLP) is a mechanism that enables either clients or services to find a service directory. Essentially, the entity interested in finding a service directory (either to register a service or find one that has been registered) issues a multicast request that will return the address of a service-finding service. This service supports a well-known interface that allows querying for a service directory, which is much like a standard directory service in which services can be registered under a description or found if they match a description.
The Jini system is similar to SLP in that it begins (on TCP/IP networks) with a multicast request to the local network neighborhood. Rather than returning a directory of service locators, however, the Jini multicast request returns a reference that implements the interface to a Jini lookup service (including the stub code, or driver, allowing communication with the service) that can be used by the service provider (or client) to access that lookup service directly. Universal Plug and Play (UPnP) also makes use of a multicast request, but in UPnP what is multicast is a description (in the form of a Universal Resource Locator (URL) indicating where the description can be found) of the device that is joining the network. All entities that might want to use such a device must watch for such a multicast, and based on the description they will determine whether they have the code needed to communicate with that device. There is no central repository of services in the UPnP mechanism. Bluetooth’s service discovery protocol (SDP) is specific to Bluetooth communications; it focuses on discovering services available from or through Bluetooth devices and can coexist with other service discovery protocols.
Not all basic networking systems support multicast, so any extension of these types of service-finding protocols to such networks would require that some other bootstrapping mechanism be used to find the initial repository of descriptions or objects. This mechanism could be as simple as a conventionally agreed-upon URL that would be used to identify such a repository or a well-known name of some other form. Such approaches would need to find a way of preventing the entity with the conventional name from becoming a single point of failure (or they would need to determine that such a single point of failure was acceptable in the particular application). Other networks might allow entirely different approaches. An example of this is IEEE 1394 (FireWire), in which, as mentioned previously, attaching a device to the network generates a wire-level interrupt to all other devices attached to the network. On such a network, the service repository could simply notice when a new device was attached to the wire and send to that device the information needed to connect to the service repository.
For systems deployed in the physical infrastructure, a service’s location (either absolute or relative to another entity) may determine how it is used or even selected. The mapping between physical location and network connectivity is important. (See Chapter 2 for a discussion of the technologies that enable the determination of geographic location.) In wired or hybrid networks, two devices that are physically close may be, in fact, quite distant in terms of network communication. For example, a desktop personal computer (PC) and a cell phone may both be network-enabled, but for them to communicate, packets must travel through many network segments, including the building’s network, the link between the building and local backbone, the connection between the backbone and the cellular phone company, another hop to the appropriate base station, and finally, from the base station to the phone itself. Thus, when a device needs to determine, for example, the closest printer, network proximity is not at all likely to be an accurate measure.
Geographic location is intimately connected to discovery. If each device knows its own geolocation and can provide that information to the discovery servers, then it may be possible to answer the question about “closeness” during the discovery phase. Access to services may also be based on location. If one assumes physical security measures permit a user to enter a particular space, then services locally available in that space can be put at that user’s disposal without requiring further authentication. Without location information, users would have to obtain access to the local networks, with accompanying security risks. Thus, location can be quite useful in optimizing service discovery as well as in connecting the physical and virtual worlds so that security measures in one can be applied in the other.
In other types of EmNets, particularly resource-constrained, wireless networks, network organization needs to correspond more closely with geography in order to be efficient in its use of scarce energy resources (since communication over longer distances consumes significantly more energy). In these systems, geolocation may serve as a building block for organization of the network itself—for example, through the use of geographic routing (Karp and Kung, 2000).
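The flavor of geographic routing can be conveyed by a sketch of greedy forwarding in the spirit of Karp and Kung's GPSR, in which each node uses only the coordinates of its neighbors and of the destination. (GPSR's perimeter-mode recovery around "voids" is omitted, and planar coordinates are assumed.)

```python
# Sketch of greedy geographic forwarding: forward the packet to the
# neighbor closest to the destination, using purely local information.
import math

def distance(a: tuple[float, float], b: tuple[float, float]) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])

def next_hop(here, neighbors, dest):
    """Pick the neighbor that makes the most progress toward dest.

    Returns None when no neighbor is closer to dest than this node,
    the "void" case in which GPSR would switch to perimeter routing.
    """
    best = min(neighbors, key=lambda n: distance(n, dest), default=None)
    if best is not None and distance(best, dest) < distance(here, dest):
        return best
    return None

# A node at the origin with three neighbors, forwarding toward (10, 0):
hop = next_hop((0.0, 0.0), [(1.0, 1.0), (3.0, 0.5), (-2.0, 0.0)], (10.0, 0.0))
```

Because the decision requires no routing tables, per-node state stays constant as the network grows, which is what makes the approach attractive for large, energy-constrained EmNets.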
Interfaces and Interoperability
Both self-configuration and adaptive coordination require interfaces, or standardized ways of communicating between components. An interface is simply a convention that is agreed to outside the scope of the communication of interest but that permits the communication to occur. These interoperability agreements can exist at every level of system abstraction, including electrical, signaling, transport, network, and application levels. Moreover, these agreements extend to code mobility and application adaptation. When EmNets communicate, they must assemble a collection of information that will be interpretable by the receiver. This information may include not only data but also code that the receiver can execute to interpret the data, process it in some way, or forward it to other entities. The format of the information must comply with the interface on which both entities agree in advance.
At the lowest level, interoperability requires the assembling of information (data and code) into a sequence of bits that will be properly interpreted by receivers on the network. At higher levels, this means supporting an abstract machine for which the sender can include instructions within the information it sends. If there is agreement with the receiver on the execution semantics of these instructions, this serves as a powerful model for extending the functions that each device is capable of performing. That is, it becomes possible to move code from one entity to another so that functionality can be modified and extended in ways not predicted by those who originally deployed the device. Other levels of interoperability include transport protocols (e.g., TCP/IP) that permit a sequence of network packets to be generated and reassembled at the other end, as well as remote procedure calls (RPC) and remote method invocations (RMI) that permit one entity to execute an operation on another by sending parameter data and having the result returned.
How interoperability is to be achieved is often one of the major design decisions that needs to be made for networked systems.6 In traditional distributed systems, mechanisms such as DCE RPC and CORBA pass a method or procedure identifier to the receiver to indicate the code that is to be invoked on the data by the receiver. Parameters are linearized and included in the RPC packet. More specialized systems can make either or both of these classes of information (procedure identifier and input parameter data) implicit. In a simple system in which data are sent from embedded sensors to a central processing node, only the data need be transmitted, because the operation to be performed on the data is known by the receiving node. In some publish/subscribe systems, even the data that triggered the notification of an event need not be explicitly passed, because the notification itself is enough to indicate the data that triggered the notification. In a more complex, ad hoc sensor network, intermediate nodes between the originator and its final destination may aggregate the data. Thus, the interpretation of the data may change as it travels from node to node. Each node may want to indicate to the next how to properly interpret and process each data item.
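The explicit case, in which a procedure identifier and linearized parameters travel together in the packet, might be sketched as follows; the opcode values and field layout are invented for illustration:

```python
# Sketch of marshalling an RPC-style request: an operation identifier
# followed by its parameters is packed into a byte layout that both
# ends agree on in advance (the interface agreement).
import struct

OP_READ_TEMP = 1   # hypothetical operation codes fixed by the interface
OP_SET_RATE = 2

def marshal(opcode: int, sensor_id: int, value: float) -> bytes:
    # Network byte order: 1-byte opcode, 2-byte sensor id, 4-byte float.
    return struct.pack("!BHf", opcode, sensor_id, value)

def unmarshal(packet: bytes) -> tuple[int, int, float]:
    return struct.unpack("!BHf", packet)

packet = marshal(OP_SET_RATE, 17, 0.5)
opcode, sensor, value = unmarshal(packet)
```

A system that made the operation implicit would simply omit the opcode byte, shrinking every packet at the cost of fixing the receiver's behavior in advance.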
The remainder of this section discusses address configuration, wire protocols, and code mobility as illustrative examples of key interface and interoperability concepts.
Address Configuration

One of the most familiar types of self-configuration is the process by which new devices are added to local area networks. The Dynamic Host Configuration Protocol (DHCP) performs this function on IP networks. A device new to the network must obtain a new IP address in order to have packets routed to it appropriately. A DHCP server for a network allocates a set of IP addresses to acceptable visitors for a limited period of time. DHCP servers do not have to be present on every subnetwork but must be reachable through the standard routing mechanisms. A device finds a DHCP server using a discovery message that is propagated by the network routers to a nearby DHCP server. The server then responds with the IP address for the device to use. This address may be dynamically allocated or determined based on the physical address of the device’s network interface card (providing a mechanism for mobile devices to store and later retrieve their network parameters). Devices can easily determine if they need to obtain an address using DHCP if their request packets are not acknowledged. This is an indication that the IP address being used is no longer compatible with the network at which the device is now located.
The DHCP packet format provides a standard interface for devices to use in connecting in a new network environment, thus ensuring interoperability at the level of IP packets. The servers’ functions provide a higher-level interface that provides addresses only to authorized visitors and only for limited periods of time.
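A much-simplified model of this lease-based allocation is sketched below. It ignores DHCP's actual packet format, relay agents, and lease renewal, and the address pool is invented:

```python
# Sketch: a server hands out addresses from a pool for a limited time,
# remembering a device by the hardware address of its network interface
# so a returning device can get its previous address back.

class DhcpServer:
    def __init__(self, pool: list[str], lease_s: int = 3600):
        self.pool = list(pool)
        self.lease_s = lease_s
        self.leases: dict[str, str] = {}   # hardware address -> IP address

    def discover(self, hw_addr: str) -> tuple[str, int]:
        """Answer a discovery with an (address, lease duration) offer."""
        if hw_addr in self.leases:          # returning device: same address
            return self.leases[hw_addr], self.lease_s
        ip = self.pool.pop(0)               # otherwise allocate from the pool
        self.leases[hw_addr] = ip
        return ip, self.lease_s

server = DhcpServer(pool=["10.0.0.5", "10.0.0.6"])
first = server.discover("00:11:22:33:44:55")
again = server.discover("00:11:22:33:44:55")   # same device, same address
```

The limited lease duration is what lets the pool be reclaimed from devices that leave the network without notice, an important property when no administrator is present to clean up.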
Wire Protocols

The most common way of ensuring interoperability is to define a standard protocol that all entities on the network will use to identify operations and to convert between their own internal data representations and the data representation used on the wire. Each entity on the network contains some code that performs this conversion. In a standard RPC system, the code used by a client for this purpose is called the stub code and the corresponding code on the server side is called the skeleton code. This code is often produced by a compiler, which uses as input a description of the interface offered by the server, although handwritten or manually specialized code is often used to improve the performance of the overall system.
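The division of labor between stub and skeleton might be sketched as follows; the JSON message format and the SensorService interface are illustrative choices, and the network between the two sides is collapsed into a direct call:

```python
# Sketch: a client-side stub turns a method call into a wire message,
# and a server-side skeleton turns the message back into a local call.
# Only the stub and skeleton know the message format.
import json

class SensorService:                      # the server's real implementation
    def read(self, channel: int) -> int:
        return 20 + channel               # made-up sensor reading

class Skeleton:
    def __init__(self, target):
        self.target = target

    def handle(self, message: bytes) -> bytes:
        request = json.loads(message)     # decode method name and arguments
        result = getattr(self.target, request["method"])(*request["args"])
        return json.dumps({"result": result}).encode()

class Stub:
    def __init__(self, skeleton: Skeleton):
        self.skeleton = skeleton          # stands in for the network link

    def read(self, channel: int) -> int:  # same interface the client expects
        message = json.dumps({"method": "read", "args": [channel]}).encode()
        return json.loads(self.skeleton.handle(message))["result"]

stub = Stub(Skeleton(SensorService()))
reading = stub.read(3)                    # looks like a local call
```

Because the client sees only the `read` method, the message format could be changed without touching client application code, provided the stub and skeleton are updated together, which is exactly the coupling that makes wire protocols hard to evolve.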
This approach to interoperability has a number of advantages. It makes very few assumptions about the devices that make up the network, requiring only that they have the computational power to create the stream of bits and transmit them over the wire (if the entities are sending information) or to recreate information from a stream of bits received from the wire (if the entities are receiving information). Much of the code needed to create the wire stream or recreate the data from the wire stream can be generated automatically from fairly high-level descriptions, allowing a higher level of abstraction to be presented to the human programmer.
There are disadvantages to this approach as well. Because such systems are defined by the wire protocol, the patterns of communication between the various entities are very difficult to change. Such a change essentially requires a revision of the wire protocol, which in turn requires the eventual updating of all of the communicating entities on the network. Such changes are generally needed because of changing hardware or changing requirements, which can be thought of as a scaling of the network over time. The longer the network is expected to run, the more likely it is that changes will be needed to accommodate new hardware (or new software services offered to existing hardware) or that the tasks expected of the network of devices will change or evolve (or, perhaps, a flaw in the original design will need to be fixed). Sometimes these changes can be made using the existing protocols; however, because those protocols define the information sent from one entity to another, it is often necessary to enhance the protocol before such changes can be made.
Mobile Code

Mobile code, or the capability to dynamically deliver and load new code to be installed on network nodes, provides a mechanism for extending the lifetime of a system. The idea is to create a higher level of abstraction, an interface agreement for communicating information that is more complex semantically. By elevating the level at which the common interface is defined, mobile code enables the protocols used by system nodes to be updated over time or modified for specialization or optimization purposes. Mobile code still requires an initial interface agreement regarding how the code will be transmitted and loaded, but given this foundation and a constant physical layer for communication, it provides a graceful upgrade mechanism for network nodes.
Running Mobile Code

In current client-server systems, what is known by each of the communicating entities is the (programmatic) interface used by the client to talk to the service. When the client wishes to use this interface, the client receives from the service a reference, which includes the stub code needed to talk to the service. This code is loaded into the client and presents to the client the programmatic interface that is expected for that service. Because the actual form of the bits on the wire is encapsulated in stub code that comes from the service itself, the wire protocol becomes a private matter between the service and the code it hands out. The client can be, in some sense, far more ignorant; rather than needing code that knows how to translate into a common wire protocol, the client needs to know only which method to call. The details of how information is encapsulated into a stream of bits are known only to the code supplied by the service.
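The idea of loading service-supplied stub code can be sketched in Python; the stub source below is a trivial stand-in for code that would arrive over the network, and the "wire protocol" is reduced to a closed-over string:

```python
# Sketch: the client receives stub source text "over the wire," loads it
# dynamically, and thereafter talks to the service only through the
# loaded code, never seeing the protocol itself.
import types

STUB_SOURCE = """
def make_stub(service_state):
    # How state is encoded is known only to this code and the service.
    def get_status():
        return "status:" + service_state
    return get_status
"""

def load_stub(source: str, service_state: str):
    module = types.ModuleType("downloaded_stub")
    exec(source, module.__dict__)          # dynamic code loading
    return module.make_stub(service_state)

get_status = load_stub(STUB_SOURCE, "ok")
```

The service can later change its internal protocol and simply hand out new stub source, which is the graceful upgrade path described above; the cost, as the next paragraph notes, is that every client must be able to load and run foreign code.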
The disadvantage of this approach is that it requires considerably more from the entities participating in the network. In particular, it requires that all of the entities be able to load code dynamically and that there be a form of code that all of the participants can understand. For this to be possible there needs to be some platform-level homogeneity in the network that allows code moved from one machine to another to run on the receiving machine. There is a spectrum of approaches to providing this common level. One approach (used in some active networks research7) is to construct the network out of devices that are homogeneous at the lowest level, meaning they use the same processor and operating system. Among the advantages of this approach, optimized binary code can be moved and run on another machine, and resource use on the various devices can be controlled. However, the approach limits the flexibility of the overall network, making it difficult to introduce new types of nodes; it also presents problems in scaling over time, because the network of devices will not be able to make use of new processor or binary code environments. The approach also requires a large amount of trust in the code being moved, as there are no restrictions on what that code can do and no ways of establishing that the code is either well meaning or well written.
At the other end of the spectrum is an approach that uses a high-level scripting language, such as TCL or Python, as the homogeneity layer.
According to a DARPA-funded program at the Massachusetts Institute of Technology, active networks “allow individual users, or groups of users, to inject customized programs into the nodes of the network [and] enable a massive increase in the complexity and customization of the computation that is performed within the network.” See <http://www.sds.lcs.mit.edu/darpa-activenet/> for more information.
This approach requires that every member of the network have both the interpreter for the common language and the necessary native libraries available so that the portable scripts can be run. It provides a good layer of insulation from the hardware but requires a fairly large execution environment and set of libraries; it pays the price in performance (most of the scripting languages are between one and two orders of magnitude slower than object code performing the same functions) and, correspondingly, in power consumption. However, this approach is safer than moving binary code, because the scripting language can incorporate limits on what the code can do (as achieved in “safe TCL”).
A middle ground between these two divergent approaches is to define a virtual machine and move code that assumes the existence of that machine; this is the method used in systems (such as Jini) built on Java. This approach allows a more compact representation of the mobile code than can be found in most scripting languages, because byte codes are moved rather than text. The environment is far safer than those in which pure binary code is moved, because the virtual machine can make checks on the incoming code and enforce security policies. A rather large environment is still required, but it is often no larger than that required by the scripting approach, and work is being done to make it smaller. The performance degradation is smaller than that found in the scripting approach, although still in the range of 10 to 20 percent.
Resources Newly introduced code may require more resources than does the code already extant at a node. These resources may or may not be available at that node or may be beyond a limit set for the function the mobile code performs. Therefore, negotiation and resource allocation are clearly important aspects of this mechanism. A device seeking to introduce code into another device may first have to negotiate for the necessary resources and must expect to propagate the code only if it is granted those resources. The negotiation will include presenting the appropriate access privileges for modifying the code to be run on another node.
Advantages of Mobile Code Mobile code has many advantages over wire protocols. First, the way services represent information on the wire can be updated without the need to coordinate updates with all clients and services simultaneously. Because the stub code used by the client is obtained, when needed, from the service itself, the service can change the communication protocol used between the client and the service by simply updating the code handed out. The client will receive the new code automatically on an as-needed basis when it next wants to contact the service. Second, this approach allows different implementations of the same service to use different communication protocols. Because the communication protocol is used between the service-supplied stub and the service, the protocol can differ among services, even if those services are implementations of the same interface.
Third, if the method of code movement is combined with a polymorphic language and virtual machine such as Java or Inferno, then the service can evolve in an upwardly compatible fashion to offer new functionality without being incompatible with old clients. If the new functionality can be expressed as an extension or subtype of the existing functionality, then the code handed out by the service to the client can implement all of the existing procedures or methods as well as the new procedures or methods. This design enables old clients to treat the service just as they always did, while allowing new clients (or clients that can reflectively discover and use the new functionality) to use the new aspects of the service. This advantage can be obtained, however, only by requiring a universal type system in addition to code mobility.
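The kind of upwardly compatible evolution described above can be sketched as follows (Python used purely for illustration; the Sensor and CalibratedSensor interfaces and the calibration offset are invented for this example). The old client uses only the base interface; the new client reflectively discovers the extension.

```python
# Sketch: a new service implements an extension (subtype) of the old
# interface, so old clients continue to work unchanged.

from abc import ABC, abstractmethod

class Sensor(ABC):                        # original interface
    @abstractmethod
    def read(self) -> float: ...

class CalibratedSensor(Sensor):           # extension adds functionality
    @abstractmethod
    def read_calibrated(self) -> float: ...

class NewServiceStub(CalibratedSensor):   # code handed out by new service
    def read(self) -> float:
        return 10.0                       # behaves exactly as before
    def read_calibrated(self) -> float:
        return 10.0 - 0.25                # new: applies a calibration offset

def old_client(s: Sensor) -> float:
    return s.read()                       # knows nothing of the extension

def new_client(s: Sensor) -> float:
    # Reflectively discovers the richer interface when present.
    if isinstance(s, CalibratedSensor):
        return s.read_calibrated()
    return s.read()

stub = NewServiceStub()
print(old_client(stub), new_client(stub))
```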
Adaptive Coordination in Existing Networks
Making any network of systems adaptive is a challenge, and EmNets increase the challenge by adding constraints not found in other systems. Moreover, the type of adaptive coordination needed in EmNets has only recently begun to be studied in more traditional networks of computing systems, so there is little existing knowledge on which to draw. As background for an analysis of research needs related to EmNets, this section provides examples of how adaptive coordination is handled in more traditional systems. The problems addressed are load balancing, ad hoc routing, and TCP’s adaptive congestion control mechanism.
Load Balancing
Load balancing in distributed systems received much research attention in the 1980s as distributed computing became more prevalent. The essential problem is how to distribute processing, storage, or access demand across a set of servers as that demand increases and in some cases as the availability of underlying resources (e.g., servers) increases or decreases (Mullender, 1992). Typical load balancing requires collecting load statistics from servers and assigning new demand based on those statistics. Some approximations may be used in the absence of current load data. Systems may reassign demands based on data or reassign only if there is a failure. Techniques vary with regard to optimization level, robustness, communication cost, and convergence time. The more distributed the system, and the greater the delay and delay variance, the
more difficult it is to collect timely statistics and achieve a solution that is both efficient and stable.
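A minimal sketch of statistics-based load assignment follows (Python purely for illustration; the server names, load figures, and the optimistic local correction for stale statistics are all invented assumptions, not a description of any particular system).

```python
# Sketch: least-loaded assignment against periodically collected,
# possibly stale, load statistics.

def assign(reported_load: dict, demands: list) -> dict:
    """Assign each (demand, cost) pair to the server that *reported*
    the least load.

    Because statistics arrive only periodically, decisions are made on
    possibly stale data; updating a local estimate after each assignment
    avoids dogpiling one server between reports.
    """
    estimate = dict(reported_load)       # local, possibly stale view
    placement = {}
    for demand, cost in demands:
        target = min(estimate, key=estimate.get)
        placement[demand] = target
        estimate[target] += cost         # optimistic local correction
    return placement

loads = {"s1": 0.2, "s2": 0.5, "s3": 0.9}
jobs = [("a", 0.4), ("b", 0.4), ("c", 0.4)]
print(assign(loads, jobs))
```

Without the local correction, all three jobs would land on s1, which is exactly the kind of oscillation-inducing behavior discussed below for ARPANET routing.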
Load balancing in networks, usually in the form of adaptive routing, addresses one extreme situation at a time in a highly distributed system. The problem is most challenging when the network is large and covers a wide area, in which case global load information for all network nodes and links is clearly unachievable. Therefore, adaptive routing relies on partial information, which may be partial in scope, coverage, or time (that is, out of date). A classic story of early ARPANET design was the move away from highly adaptive distributed routing to a more stable and slower adaptive routing scheme. The old ARPANET routing scheme (McQuillan et al., 1980) attempted to move traffic away from congested links, but by doing so it encouraged the congestion to move to the new path in the network, eventually causing all the traffic to move back to the original path! These oscillations are a simple example of the challenges associated with building adaptive systems. Load balancing is applied successfully when the information required can be obtained in a timely fashion and when the rate of controlled change is much slower than the phenomena to which it responds. Within ISP networks (which are really subsets of the larger Internet), such techniques are applied in the form of “traffic engineering.” However, even in this more limited context, considerable manual configuration is involved.
More recently, very-large-scale distributed services have been proliferating in the context of the World Wide Web. There are Web servers that can be expanded on the fly, by adding more computing capacity without shutting down the existing Web server and then using the added capacity when traffic is heavy.8 These systems adapt to heavy load by allowing the addition of new machines to the Web server cluster in a way that is transparent to system users. This approach can be viewed as human-assisted configuration of the system; once the administrator adds the system to the physical cluster, the software is able to automatically reconfigure itself to make use of the extra capacity.
Ad Hoc Routing
In recent years, other forms of adaptive behavior have been explored in networked systems. One is ad hoc routing (Corson and Macker, 1997). Traditional routing starts with a fixed location for nodes and links and adapts only to occasional node and link failures and recoveries and to variable congestion. Ad hoc routing was developed to provide automatic, nonmanual construction of a network when the network routing elements are not in a fixed topology, that is, when they are mobile. Ad hoc routing protocols continually adapt to changing topology, whereas traditional protocols adapt to topology changes much more slowly and less frequently. The form of adaptive coordination required in ad hoc routing is fairly well understood and seemingly manageable, although there are few examples of operational ad hoc networks. There are clearly limits to the ability of any scheme to keep up with continual rapid change, and there is ongoing work to develop methodologies for characterizing such limits, as well as the behavior of adaptive coordination mechanisms as they approach these limits. Related to the work in ad hoc routing is power-aware routing (Sohrabi and Pottie, 1999), which attempts to adapt routes in such a way as to maximize the total network lifetime as determined by battery depletion. This work is indicative of the type of adaptive algorithms that will be needed to realize the vision of robust, long-lived, and scalable EmNets.
See, for example, the Hosta system from Concept Technologies, Ltd., available at <www.concept-technologies.com>.
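The flavor of power-aware route selection can be sketched as follows (Python purely for illustration; the topology, battery levels, max-min criterion, and exhaustive search are a toy construction, not a description of Sohrabi and Pottie's algorithm). The idea shown is preferring the path whose weakest relay has the most residual energy, rather than the fewest hops.

```python
# Sketch: choose the route whose weakest intermediate node has the
# most residual battery (a max-min criterion), to spread depletion.

from itertools import permutations

def best_path(links: set, battery: dict, src: str, dst: str):
    nodes = set(battery)
    best, best_score = None, -1.0
    # Exhaustive search is fine for a tiny illustrative topology.
    for r in range(len(nodes) + 1):
        for mids in permutations(nodes - {src, dst}, r):
            path = (src,) + mids + (dst,)
            if all((a, b) in links or (b, a) in links
                   for a, b in zip(path, path[1:])):
                # Direct links need no relay, so they score best.
                score = (min(battery[n] for n in path[1:-1])
                         if mids else float("inf"))
                if score > best_score:
                    best, best_score = path, score
    return best

links = {("s", "a"), ("a", "d"), ("s", "b"), ("b", "d")}
battery = {"s": 1.0, "a": 0.1, "b": 0.8, "d": 1.0}
print(best_path(links, battery, "s", "d"))
```

Here the route through b is chosen even though both candidate routes have the same hop count, because relaying through a would quickly exhaust its nearly depleted battery.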
Adaptive Congestion Control in TCP
Another form of adaptive behavior has a completely distributed, local nature—TCP’s adaptive congestion control mechanism. TCP is the transport protocol run in the Internet over the IP protocol. TCP is an end-to-end protocol run on end-system computers (from laptops to desktop PCs to workstations to large servers). TCP provides a virtual connection to the applications that use it, offering in-order, reliable delivery of data. The Internet over which the data are sent exhibits varying data rates due to the heterogeneity of underlying link speeds and variable loading on the links. Van Jacobson introduced adaptive congestion control into TCP (Jacobson, 1988) by which the source of a data stream would reduce its sending rate when it experienced packet loss, an indicator of congestion. When no loss was experienced, the sending rate was slowly increased until all data were sent or additional loss was experienced. In this way, each of the multitude of end systems on the Internet independently adapts its behavior to the dynamic conditions experienced on the network, resulting in a more or less stable system—certainly more stable than it was before adaptive congestion control was introduced. The specifics of the TCP congestion control algorithm have evolved over the years, and a sizable body of research has emerged concerned with the characterization of TCP and the aggregate effect of TCP adaptation on the network (Fall and Floyd, 1996). However, this remains an area of active research because of the challenge associated with characterizing such a large system of adaptive elements.
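Jacobson's additive-increase/multiplicative-decrease rule can be sketched in a few lines (a toy simulation in Python; the fixed capacity threshold standing in for congestion, and the particular increase and decrease constants, are assumptions made purely for illustration).

```python
# Sketch: TCP-style additive-increase/multiplicative-decrease (AIMD).
# Loss is modeled as exceeding a fixed capacity, purely for illustration.

def aimd(rounds: int, capacity: float, incr: float = 1.0, decr: float = 0.5):
    """Return the sender's rate trajectory over `rounds` round-trips."""
    rate, history = 1.0, []
    for _ in range(rounds):
        if rate > capacity:       # packet loss observed: back off sharply
            rate *= decr          # multiplicative decrease
        else:
            rate += incr          # additive increase while all is well
        history.append(rate)
    return history

traj = aimd(rounds=8, capacity=5.0)
print(traj)
```

The sawtooth that results (probe upward, halve on loss, probe again) is what lets many independent senders share a bottleneck without any global coordination.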
RESEARCH CHALLENGES FOR CONFIGURATION AND ADAPTIVE COORDINATION
This section outlines key research challenges related to configuration and adaptation in EmNets. The subsection on adaptive coordination is the most extensive because the concept is fairly new, especially as it applies to EmNets, and there is still no extensive research base on which to rely.
Research Issues in Self-configuration
As background, it is useful to outline some design basics and criteria. EmNets will appear in hybrid environments of mobile and static networks. Users will expect to connect to networks and services as they enter vehicles, buildings, and outdoor environments. The nodes themselves will be diverse in capability, energy availability, nature and quality of connectivity, and priority. Physical node access will depend on context. Variability in priority will dictate when and if a node is revealed or has services revealed to it at the physical layer. Variability in the node population will introduce further complexity. The addition of new nodes to a local cluster may not be permitted owing to performance constraints. At other times, conversely, it may be desirable or even necessary to incorporate high-priority nodes and their services into the network.
The wireless physical layer is limited by low data communications rates, the sharp decay of radiated power with increasing range, and susceptibility to interference. This implies that network resources may not be consistently available at a given point in the network and may exhibit highly variable performance across space and time. Nodes may appear and disappear according to variations in the wireless channel environment. The wireless physical layer is also diverse. Simultaneously present in the environment are systems ranging from local-area, spread-spectrum networks to wide-area cellular, pager, and even satellite communication systems. Methods are needed for joining these different networks and bridging across adjoining cells. Support for networked embedded systems must include capabilities for low-bit-rate, low-power, low-cost access for virtually all nodes.
Ad hoc sensor networks provide an excellent example of the issues to be addressed. Many applications require the deployment of sensors over wide areas with no predetermined arrangement. The devices must discover each other (or at least their nearest neighbors) and decide how sensor information will flow through the network they collectively form. Different devices may take on different roles as generators, routers, or aggregators. Global efficiency can be achieved only if locally derived
information is propagated to other nodes in the network. Devices will need to configure their functions to produce the desired overall effect rather than optimizing for strictly local concerns. Thus, a node may take on the role of router and act as a communications hub, but at the cost of increased energy use. When it eventually loses its ability to perform the function, another device will take its place. Determining how local decisions can lead to efficient global effects is a fundamental challenge for adaptive coordination in ad hoc systems.
EmNets will necessarily be composed of heterogeneous elements. Devices will be optimized for specific functions. For example, some sensors may be small and numerous but also highly constrained, while local aggregators may be more powerful devices with longer-range communications capability and larger power supplies. In addition, the long lifetimes of these systems and the need for adaptation may very well require the ability to upgrade and/or install new code. Trust models need to be developed that will not only control the admission of new code but also police it after admission to verify that it works as advertised. Finally, these systems must be resilient in the face of failures that occur when devices, communications, or other resources become unavailable. The following paragraphs elaborate on these themes.
Configuration via Mobile Code Given the expectation of a rapid evolution in hardware, networking protocols, and basic networking algorithms in EmNets, an approach to discovery and configuration based on mobile code seems promising. Such an approach allows these components to evolve separately, rather than requiring that the whole EmNet evolve in lockstep. However, interesting and important research issues are still presented by approaches that use mobile code.
Although all of the approaches to implementing mobile code have some advantages and disadvantages, certain issues are common to all of the approaches—a point that often gets lost in the discussion of which technique is best. These issues highlight some of the fundamental engineering trade-offs that will need to be made in constructing networks of embedded systems, especially those made up of devices that are constrained in terms of memory, processor speed, power, and bandwidth.9
The most obvious issue is the trade-off that needs to be made between
memory use and the use of mobile code. For many of the small, embedded components in the systems that are the focus of this report, memory is one of the most precious resources. In some ways, the whole notion of mobile code conflicts with memory conservation; the idea that the recipient of the mobile code needs to know only the interface to the received code, and that all else is hidden behind an interface that is implemented (as needed) by the supplier of the mobile code, means that the recipient of the code has given up the capability to control memory use. Any piece of mobile code may exceed the amount of memory available at the recipient. Even if no single piece of code violates the memory constraints on the recipient, as the network scales up (and more code is moved), there will come a point at which an otherwise acceptable (and perhaps small) piece of code will need more memory than is available.
This issue cannot be dealt with at the component level—even if each piece of mobile code is written to be as small as possible (which might not always be the case)—because it is the sum of the pieces of mobile code that causes the problem. This exemplifies the need to understand how local decisions can affect global properties, and vice versa. The code actually loaded onto a node is determined by the use of the network in a specific situation. Thus, it is an aspect of the design of the network, not the components. On the other hand, the network should not have to know the details and limitations of the components present. Its properties are abstract and implemented by the underlying components. Indeed, one reason for using mobile code is to allow building the network without having information about the individual components.
Protocol-based Configuration Mobile code offers the opportunity to tailor devices to new applications and evolve their functions over time. However, the resource requirements for mobile code may dictate other approaches instead, especially on the smallest devices used in EmNets. Such approaches, based on prearranged wire protocols used for communication between the various components, present their own research issues.
The first issue is the need to develop an ontology of devices so that they can be described in a way that is natural and consistent across different systems. If services are to be discovered, then they must be discovered with a description that ensures they will be able to use the wire representation sent to them and to generate data in the wire representation expected from them. How such a convention can be described and how it can be reasonably enforced in large-scale systems such as those envisioned in EmNets is an open research question.
Once this ontology has been described, a set of associated wire representations for the data to be transferred to and from devices of each type needs to be defined. These wire representations need to allow queries of
data that has been sensed in the environment as well as the transfer of control information from one member of an EmNet to another. How to define these representations in a way that will allow the system to evolve is an open research question. In fact, the research issues surrounding protocol-based, self-configuring systems seem to be the converse of the problems posed by mobile-code-based systems. Each approach can solve some problems that arise with the other but is also subject to problems avoided by the other. Protocol-based approaches allow solutions that apply to devices that are severely resource constrained, but they produce systems that are brittle and lack easy paths of evolution. Mobile-code-based approaches allow easy system evolution, but at the price of abstraction, which consumes what could be scarce resources such as memory and communications capabilities.
A promising area of research might center on combining the two approaches in a hierarchical fashion. Small groups of devices could be built using a protocol-based approach. Together, these groups could possess enough resources to allow utilization of the mobile code approach. This method would allow the overall system to evolve, although groups of nodes in the hierarchy would need to evolve in a coordinated fashion. Such localized, planned evolution is much easier to accomplish than global planned evolution in large-scale systems. At the large scale, shared resources could enable use of the mobile code approach, which allows piecewise evolution of the overall system. Thus some devices in EmNets themselves or in the networking infrastructure to which they are connected can serve as code proxies that can offload computation and memory resources from the more resource-constrained devices in the system. Of course, it will now be necessary to communicate with these proxies or groupings more frequently than if the computation could have been performed locally. This increases power consumption and reduces reliability but could provide a more flexible evolutionary path than simply over-provisioning every device. In an agricultural context, for example, the irrigation and fertilization system might operate as a sensor network with relatively constrained devices running wire protocols. However, the controller for the systems might be a more capable, general-purpose computing element that would interoperate with the rest of the enterprise’s inventory and control processes and would benefit from the long-term flexibility of using mobile code technology.
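The proxy arrangement described above might be sketched as follows (Python purely for illustration; the two-byte wire commands, the channel numbering, and the proxy's methods are invented). The constrained node speaks only a fixed, prearranged wire protocol; richer functionality, which could itself be delivered as mobile code, lives at the proxy.

```python
# Sketch: a code proxy fronting a constrained device. The device never
# changes; evolution happens in the proxy layer.

def constrained_node(msg: bytes) -> bytes:
    # Fixed, prearranged wire protocol: b"R<channel>" -> ASCII reading.
    table = {b"R0": b"17", b"R1": b"42"}
    return table[msg]

class SensorProxy:
    """Runs on a resource-rich element on behalf of the constrained
    node; exposes an object interface instead of the raw wire protocol."""
    def read(self, channel: int) -> int:
        reply = constrained_node(b"R%d" % channel)
        return int(reply.decode())
    def read_all(self) -> list:
        # Richer functionality composed at the proxy, not on the node.
        return [self.read(0), self.read(1)]

proxy = SensorProxy()
print(proxy.read_all())
```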
Discovery Protocols Current discovery protocols, whether based on wire protocols or mobile code, require that the entity entering the network be able to find, either directly or indirectly, the other entities of interest in the local network neighborhood. Considerable research (and product development) is being done on discovery protocols and join protocols over
Ethernet-based TCP/IP networks. These networks have a number of properties whose presence is a prerequisite for such protocols to work; in particular, all of the existing or proposed discovery mechanisms require the ability to multicast with limited scope. Not all networks currently in use or under consideration support these mechanisms, however; how discovery would work over such networks is an open issue.
A research issue that needs to be addressed is how discovery mechanisms of any sort can be scaled to larger networks. For discovery mechanisms that are purely peer to peer (that is, there is no rendezvous entity at any level), it is not clear how this can be done other than by specifying some form of region of interest in the network—a concept that is not well supported in existing network topologies. This issue is further complicated by the potential dissonance between geolocation and network proximity, discussed earlier in this chapter.
For discovery protocols that rely on the collection of entity information in some sort of lookup or directory structure, an approach to scaling could be to form a hierarchy of such lookups, with the leaf nodes of the hierarchy consisting of the lookups contacted by the discovering entities and higher-level lookups consisting of information about the previous level of lookup. This approach is standard in hierarchical naming systems, but it is less clear how the approach would work in systems designed to allow programs to find other programs. In such systems, in which the entity to be found is often represented as something other than a human-readable name, it is not clear how to propagate the information about the contents of a lookup service into upper levels of the hierarchy. Some work has begun in this area, and it may be a scalable alternative to the multicast-based, publish-subscribe mechanisms that are used locally (Yu et al., 1999). In some contexts, this lookup-based approach is preferable to the always-listen approach of multicast because of the energy costs associated even with “listening” on low-power wireless channels (see Chapter 2).
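The lookup-hierarchy idea can be sketched as follows (Python purely for illustration; propagating only interface-name summaries upward is one possible design choice, not a description of any deployed system such as Jini). Leaf lookups hold full service records; parents hold only enough to route queries to the right subtree.

```python
# Sketch: hierarchical lookup services. Parents see only summaries
# (interface names), not full records.

class Lookup:
    def __init__(self, parent=None):
        self.records = {}                 # interface name -> endpoint
        self.children = []
        if parent:
            parent.children.append(self)

    def register(self, interface: str, endpoint: str):
        self.records[interface] = endpoint

    def summary(self) -> set:
        # The information a lookup would advertise to its parent.
        s = set(self.records)
        for c in self.children:
            s |= c.summary()
        return s

    def find(self, interface: str):
        if interface in self.records:
            return self.records[interface]
        # Descend only into subtrees that advertise the interface.
        for c in self.children:
            if interface in c.summary():
                return c.find(interface)
        return None

root = Lookup()
east, west = Lookup(root), Lookup(root)
east.register("Thermometer", "node-17")
west.register("Imager", "node-3")
print(root.find("Imager"))
```

As the text notes, the hard open problem is what the summary should be when services are identified by types or code rather than human-readable names; the string names here sidestep exactly that question.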
The issue of low-power discovery is key for EmNets with large numbers of small sensor nodes. Current work on low-power discovery emphasizes assembling the physical layer at low power; this means, for example, that both the transmit and receive duty cycles are kept low. Unique complexities arise when discovery of nodes and physical-layer assets must occur in a multihop context, and the need for correlation to physical location further complicates the issue. Typical deployments often require a cluster architecture. For example, in a healthcare environment, individual clinical spaces will form embedded system clusters, which may have weak interactions with neighboring clusters. Energy, bandwidth, synchronization, and information sharing will motivate clustering. Despite the progress that has been made in developing approaches to discovery and interoperability, additional research will be needed to extend these principles to EmNets.
Trust and Failure Models
The ability of EmNets to self-configure brings up a set of issues related to trust among system components, admission to and allocation of resources, monitoring and policing, and the ability to deal with failures, some of which may be intentionally inflicted. In addition, means are needed to oversee and administer the status of the whole system, including its upgrade status, patterns of resource usage, and overall system health.
Admission Control A critical unresolved issue has to do with how to characterize components and the code they run. Components must be able to make local decisions about what code they will run, whether it resides locally or needs to be imported as mobile code from another node.
The strength of mobile code derives, in part, from the ability to distinguish between the interface (which is all that the client of the mobile-code service needs to know) and the implementation of that service (which gets moved into the client’s address space and hides the details of the service from the client). The implementation of the mobile code can change as new hardware, wire protocols, and software services evolve. The client that will run the mobile code knows only about its functional interface. The challenge is that there may well be a set of characteristics important to the client that is normally not discovered. Such characteristics might include the timing constraints or guarantees that the service needs to meet to function properly, the amount of bandwidth or power it requires, and its memory requirements, including the potential downloading of the code of subcomponents.
The problem is that an interface describes only the syntax needed to talk to the service and the broadest notion of the semantics of the service. Other semantic aspects of the service may also be important, but there is a lack of agreed-upon methods for specifying such semantic characteristics. Techniques that have been developed for software abstraction offer no well-defined middle ground between the interface and the full definition of the implementation. An example of a characteristic that might be needed is quality of service. Information about average and worst-case delay bounds might be required for some application domains. Consider the problem of trying to track a vehicle and then collect an image. The nodes that are detecting and exchanging information for localization purposes must do all the tracking in time to trigger the correct imaging device. How to combine a description of the guarantees that a service can provide with the requirements of the client on the service and the requirements of the service on the client is an area open for research.
EmNet elements need to be able to gather this information about the service they want to use so as to make intelligent admission decisions. However, this is not the end of the issue. Once they make the decision to run the code, they need to ensure that it functions as was advertised. Monitoring and policing are therefore needed to verify the service code does not overstep the agreed-upon bounds. Mechanisms are needed to stop code that does not live up to its contract. Admission control and policing decisions are further complicated by negotiations between EmNet elements as to who should run which services. If a device agrees to run a service that other devices are counting on, it has to devise a plan for offloading those functions if it finds itself unable to meet the service’s requirements or if the service oversteps its bounds. All of these issues present difficult challenges for the developers of software for EmNets and call for significant research.
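The admission-then-policing cycle described above can be sketched as follows (Python purely for illustration; the single memory budget, the contract numbers, and the immediate-eviction policy are deliberate simplifications of what real negotiation and enforcement would require).

```python
# Sketch: admission control against a negotiated resource contract,
# followed by policing of code that oversteps its contract.

class Host:
    def __init__(self, memory: int):
        self.free = memory
        self.contracts = {}              # code id -> granted budget
        self.used = {}

    def admit(self, code_id: str, requested: int) -> bool:
        if requested > self.free:        # negotiation fails: deny admission
            return False
        self.free -= requested
        self.contracts[code_id] = requested
        self.used[code_id] = 0
        return True

    def allocate(self, code_id: str, amount: int) -> bool:
        # Policing: stop code that oversteps its negotiated contract.
        if self.used[code_id] + amount > self.contracts[code_id]:
            self.evict(code_id)
            return False
        self.used[code_id] += amount
        return True

    def evict(self, code_id: str):
        # Reclaim the entire granted budget from the misbehaving code.
        self.free += self.contracts.pop(code_id)
        self.used.pop(code_id)

host = Host(memory=100)
print(host.admit("tracker", 60))     # granted
print(host.admit("imager", 60))      # denied: would exceed free memory
print(host.allocate("tracker", 80))  # overstep: tracker is evicted
```

The harder, unmodeled parts are exactly the ones the text identifies: offloading a service's responsibilities elsewhere when it must be evicted, and policing resources such as power and bandwidth that are not simple counters.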
Trust and Security Trust models that can be applied to code (as opposed to people) need to be investigated. When code is moved on behalf of a service or device on the network into the address space of a client, the client and service need some way to decide on the level of trust between them. In some embedded systems, trust may not be an issue (for example, when only trusted sensors are allowed into a sensor network). In others, however, several trust issues will be important:
- Whether the receiver trusts the mobile code and allows it to run in any fashion,
- What local resources the code can access if it is allowed to run, and
- What rights the local client might want to delegate to the code if it moves on or needs to make calls to other members in the network.
Although some ideas have been developed about notions of trust in principals, it is not clear that mobile code is a principal, or if such code works on behalf of a principal. Indeed, there are cases in which it makes sense to distinguish between different levels of trust—how much the code is trusted to be accurate and nonharmful (which can be thought of as trust in the producer of the code) and what the code is trusted to access (which can be thought of as trust in whoever or whatever the code is running on behalf of). It may well be that all of the problems in the trust model of code can be accounted for with current trust models and an appropriate mapping of the new entities involved in such systems and the entities already dealt with in the trust models. But currently there is no such
mapping, nor is there any reason to believe that new problems will not arise.
Security in distributed systems has been investigated for some time. Most security mechanisms, however, rely on the ability to trace an action back to a principal, generally a human being, on whose behalf an operation is performed. In an EmNet, however, most of the requests or resources will be made on behalf of a program, which may not have the full identity of a principal. Even if each program or embedded processor could be treated as a principal, it is not clear how that program or processor should go about authenticating itself.
Beyond these fairly standard sorts of security issues, EmNets can pose security concerns that go beyond those generally thought of in distributed security. For systems in which code is moved from one processor to another, it is not enough to mutually authenticate the interacting entities; the code that is moved from one entity to the other needs to be trusted to some degree and must be given (or denied) access to the resources on the system in which it runs. How this is best done, or even if it can be done, is an unanswered question at this time. Some progress has been made in performing code verification prior to the loading of the code through the use of virtual machines, but the principles behind the code verification mechanism are not well understood. Further, the amount of space taken up by the verifier is large, and its cost may outweigh the benefits offered by code verification on small devices. There have been some investigations into the possibility of performing verification before the code is moved and then signing that code to ensure that it is safe (Gun Sirer et al., 1998), but further research in this area is necessary.
The design of operating systems that can support this type of resource accountability and allocation is also an open research area. Accountability is necessary for resources such as power and bandwidth as well as for the more traditional processor cycles and memory. Allocation may be based on any or all of these considerations, and the code run by the operating system must be guaranteed not to be able to obtain more resources than it originally negotiated.
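To make the accountability requirement concrete, the sketch below shows one way an operating system might enforce a negotiated resource allocation. The resource names, units, and API are invented for illustration and are not drawn from any particular system described in the text.

```python
# Sketch of per-task resource accounting. Resource names and budget
# values are illustrative assumptions, not taken from any real system.

class ResourceBudget:
    """Tracks a task's negotiated allocation of several resource types."""

    def __init__(self, limits):
        self.limits = dict(limits)          # negotiated ceilings
        self.used = {r: 0 for r in limits}  # consumption so far

    def charge(self, resource, amount):
        """Debit usage; refuse any request exceeding the negotiated limit."""
        if self.used[resource] + amount > self.limits[resource]:
            raise PermissionError(f"{resource} budget exceeded")
        self.used[resource] += amount

# A task that negotiated 100 mJ of energy and 4 KB of bandwidth:
task = ResourceBudget({"energy_mJ": 100, "bandwidth_B": 4096})
task.charge("energy_mJ", 60)      # allowed
task.charge("bandwidth_B", 1024)  # allowed
```

The point of the sketch is the guarantee in `charge`: code cannot obtain more of a resource than it originally negotiated, whether that resource is energy, bandwidth, cycles, or memory.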
Failure Models and Monitoring

Additional research needed in the area of discovery has to do with failure models for automatically configured networks. Once a device has joined such a network, how is it discovered that the device has failed? If the automatically configured network has some conceptually central place where members of the network are registered, what happens when that place fails? The Jini system has a reasonably well-specified failure model, covering both the failure of components that are registered with a lookup service and the failure of the lookup service itself. This model is implemented using the concept of leases. Leases are
granted for a specified period of time. If the device does not return to renew the lease, then it is assumed that the device has failed or left the network and is no longer available. Leases can be used in this manner in both directions, helping a client keep track of a server and—as is more common—helping a server keep track of a client. However, this does not solve all the problems, because the lease server itself may fail and a new node may need to take on this responsibility. The approach taken in the Jini system is not the only possibility; others should be investigated.
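A minimal sketch of the lease mechanism described above, in the spirit of (but not copied from) the Jini model; the class, method names, and durations are assumptions made for illustration.

```python
# Lease-based failure detection: a device that fails to renew its lease
# within the lease duration is presumed failed or departed.
import time

class LeaseTable:
    def __init__(self, duration):
        self.duration = duration
        self.expiry = {}            # device id -> absolute expiry time

    def register(self, device, now=None):
        now = time.monotonic() if now is None else now
        self.expiry[device] = now + self.duration

    renew = register                # renewal is just re-registration

    def reap(self, now=None):
        """Drop devices whose leases lapsed; they are presumed gone."""
        now = time.monotonic() if now is None else now
        dead = [d for d, t in self.expiry.items() if t <= now]
        for d in dead:
            del self.expiry[d]
        return dead

table = LeaseTable(duration=30.0)
table.register("sensor-7", now=0.0)
table.register("sensor-9", now=0.0)
table.renew("sensor-9", now=25.0)
assert table.reap(now=31.0) == ["sensor-7"]   # sensor-7 never renewed
```

Note that this sketch embodies the single point of failure discussed above: if the node holding the table fails, another node must take over the role.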
An issue related to failure is system health. In many EmNet applications, an administrator or user will need to know what problems the system is contending with. For example, a shortage of elements in one area (owing to malfunction or outside attack) could create a low-bandwidth bottleneck, while a surplus in another area (owing to malfunction or intentional interference) could cause communications interference. This matters because EmNets are unlikely to be deployed in applications where a total system failure can be tolerated and fixed by simply rebooting. A key design goal is thus to have them degrade gracefully (for example, by having nodes or elements take over for others that fail). The Internet provides a reasonable example of how this might be accomplished, although it is not, of course, subject to the additional constraints under which EmNets operate.
Additional research is needed into how to characterize systems and their components based on this concept. There may be much to borrow here from the ideas of dual control, in which the behavior of system elements is characterized in situ by stressing them purposefully. What is learned from the interaction can then be used to recognize a problem when it appears in regular operation. In addition, it will be important to record system behavior so that unintended behavior that emerges when a particular combination of elements or EmNets interacts can be studied and remedied. In fact, doing this automatically might create a sort of immune system that monitors operation and takes corrective actions. Of course, such an immune system would itself have to be monitored. This opens up an entirely new area of research that focuses on techniques for restricting the behavior of EmNets within a parameter space that is comprehensible to both humans and machines.
Research Issues for Adaptive Coordination
Several factors make it unlikely that adaptive coordination in EmNets will be mediated or even aided by human operators. One factor is size: EmNets will often be very large, and adaptive coordination will need to take place over a scale (in terms of both the number of networked elements and the extent of the covered physical space) that will preclude human involvement. A second factor is time scale: the period over which adaptive coordination may need to take place is too short to be open to human intervention; by the time a human operator decides what to do, the environmental factors will have changed in such a way as to require a completely different adaptation. A third factor is that the operators, users, and individuals interacting with EmNets may be untrained in the specifics of the system and should not be expected to understand the technology to the depth that would be required to manage adaptive coordination. (See Chapter 4 for a discussion of human factors and the usability of EmNets.) The rest of this discussion focuses on the technical considerations mentioned above.
The large number of elements in such systems suggests a brute force method of achieving adaptive coordination: adding more elements to the EmNet to allow high levels of redundancy without modifying the designed behavior of the nodes. However, this method would require communication bandwidths that would drain the available energy of battery-powered elements. Simple replication is predicated on the idea that bandwidth (and the power needed to use the bandwidth) is an abundant resource, which is not the case in many of the EmNets of interest. In addition, issues of stability might arise with increasing numbers of nodes in the network—additional work in control for EmNets is required to characterize and manage stability.
Monitoring system health is a critical issue for two reasons. First, many envisioned applications of EmNets have reliability and safety concerns that are more severe than those for traditional desktop distributed systems (see Chapter 4), so it is critical that system degradation and signs of imminent failure be detected. More germane to the discussion in this chapter is the need for resource-poor components to adapt to variations in available resources in other components so as to achieve overall system efficiency. However, these same resource constraints make extracting information on dynamic system state expensive. Variations in available resources could arise in the context of normal operation or be due to intruders or malicious attacks. System health monitoring will thus need to incorporate intrusion detection and antijamming facilities.
There are some promising avenues for obtaining the adaptability needed. The low cost of the elements in many applications will enable the use of large numbers of elements in ways that supply redundancy when needed, while lowering or at least limiting the amount of communication required over the system itself. However, this approach will work only if nodes are designed to be adaptive to their environment and to the behavior of other elements in the system. For example, a node might set the frequency of periodic sample communication or its transmit power based on the density of nodes observed within its proximity.
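The kind of local adaptation just described might look like the following sketch, in which a node derives its reporting interval and transmit power from locally observed neighbor density. All thresholds and values are illustrative assumptions, not parameters from the text.

```python
# Density-adaptive node behavior: in dense deployments, redundant
# coverage lets a node report less often and transmit at lower power.

def adapt(neighbor_count, base_interval_s=10.0, max_power_dbm=0):
    """Return (report interval in s, tx power in dBm) for observed density."""
    if neighbor_count >= 8:       # dense: others cover the area; save energy
        return base_interval_s * 4, max_power_dbm - 10
    if neighbor_count >= 3:       # moderate density: partial backoff
        return base_interval_s * 2, max_power_dbm - 5
    return base_interval_s, max_power_dbm   # sparse: full rate and power
```

Because each node decides from purely local observations, no central coordinator or extra communication is required.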
These large numbers of system elements might also allow the system to monitor itself much more carefully, so that failures can be predicted or detected in new and interesting ways. For example, traffic generated by a node could be monitored by its nearest neighbor, which could quickly determine when that traffic pattern changed or ended abruptly, indicating failure or loss of power. Such continuous monitoring would permit nodes to react quickly to losses in the network.
One general area for research is how to exploit the redundancy that may exist in many EmNets. Especially in sensor networks and other systems based on large numbers of inexpensive nodes, some degree of redundancy can be expected. In sensor networks, for example, multiple nodes may provide coverage of overlapping geographic areas. In a smart space, multiple printers, displays, or databases might exist. Not only can this redundancy improve reliability, but it might also ease the process of self-configuration. For example, when nodes need to be upgraded, only a small percentage of the nodes might be upgraded manually, and the others could be instructed to check the new nodes for updates. With inexpensive components, the possibility exists of deploying multiple solutions rather than focusing on finding a single optimal one. In this section, the discussion is primarily about systems in which components are relatively inexpensive, allowing large numbers of them to be deployed.
In some cases, the cost of deployment is fixed within a certain range and grows only slowly as the number of deployed nodes increases. In these contexts, redundancy can be exploited to help achieve long system lifetimes (offering both robustness in the face of environmental dynamics and energy efficiency) if algorithms can be identified for nodes to use in self-configuring. For example, nodes can identify when they need to be operational and when they can sleep, thereby conserving energy to be used when other nodes go to sleep or use up their energy reserves. Such methods of exploiting redundancy require new computational models and abstractions so that elements have the information needed to determine the steps they should take to maintain system performance in the near term while preserving long-term capabilities.
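One hedged sketch of such a self-configuring sleep rule: a node sleeps only when enough awake neighbors already cover its sensing area, with randomization to stagger which of the redundant nodes actually sleeps. The coverage threshold and sleep probability are illustrative assumptions.

```python
# Redundancy-exploiting duty cycling: sleep when covered, conserving
# energy for later, when other nodes sleep or exhaust their reserves.
import random

def decide_state(awake_covering_neighbors, k_cover=2, p_sleep=0.8):
    """Return "sleep" only if at least k_cover awake neighbors already
    cover this node's area; randomization staggers who sleeps."""
    if awake_covering_neighbors >= k_cover and random.random() < p_sleep:
        return "sleep"
    return "awake"
```

Because the rule is probabilistic, not all redundant nodes sleep at once, preserving near-term system performance while extending lifetime.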
Over the years, a number of approaches have been developed to help information technology systems make more efficient use of available resources. Indeed, some key issues in system scalability can be thought of as a set of methods for determining how nodes should take turns, share system resources, or coordinate actions to boost their efficiency and effectiveness. Clustering is an approach in which a single node collects information from other nodes and takes on the task of communicating that
information to other clusters on behalf of individual nodes. Time division multiple access (TDMA) is an example of nodes taking turns using communication slots. Ethernet is an example of the use of carrier sensing and collision detection to coordinate use of a shared channel, with randomization helping to coordinate system operations. TCP/IP congestion control scales in the sense that the users of a shared, congested resource use signals (dropped packets) to coordinate their respective use of the channel, thereby taking turns sending packets through the bottleneck. Multicast transport protocols such as RTP/RTCP and SRM expanded the use of Ethernet's randomized and localized techniques for scalably sharing a resource (see Box 3.1).
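The scalable-session-message idea used by RTP/RTCP and SRM (elaborated in Box 3.1) can be sketched as a simple timer calculation; the constants below are illustrative and are not taken from the protocol specifications.

```python
# Scalable session reporting: each member spaces its reports so that
# aggregate control traffic stays below a fixed fraction of session
# bandwidth, however many members there are.

def report_interval(n_members, session_bw_bps, report_bits=512, frac=0.05):
    """Seconds between one member's reports, for n_members participants."""
    control_bw = frac * session_bw_bps      # budget shared by all members
    return n_members * report_bits / control_bw

# With 10 members the interval is short; with 1,000 it stretches out,
# keeping total control traffic roughly constant.
```

Each member computes this locally from the membership count it observes, so the group self-configures without any central scheduler.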
The systems in which these techniques will be most useful have a large potential solution space. In other words, if there is just one or a very small number of acceptable solutions (for example, if just a few particular nodes out of hundreds or thousands need to take an action), then completely distributed, localized techniques alone are unlikely to provide a good solution. However, if there are many satisfying solutions, then one can envision energy-efficient techniques based on localized algorithms that find satisfying solutions in unpredictable contexts.
The generalizations of the RTCP and SRM techniques, referred to as adaptive fidelity, have potential for uses beyond simply achieving robustness. For example, in a smart space application, wall panels might be manufactured with very large numbers of sensors and actuators embedded. Adaptive fidelity schemes could be used to arrange for smaller numbers of these elements to be active during times of relative inactivity, conducting relatively long-duty-cycle scanning and offering relatively slow response. Triggered by detection of greater activity, additional nodes would move into the low-duty cycle mode and focus on a smaller area of interest; in this way, the collection of nodes would achieve higher fidelity behavior when there was more action to be observed or managed.
RTP/RTCP is a pair of protocols used to facilitate networked multimedia applications (Floyd et al., 1997). RTP provides timing information in application-level data to allow smooth and possibly synchronized playback of data types that must be presented to the user without interruption. RTCP is the control channel for RTP. RTP/RTCP was designed to support potentially very large groups in which a small number of participants transmit simultaneously but a large number may be listening. One of the scaling issues that arose was how to keep the control traffic (the periodic session messages sent by each receiver) from consuming too many resources. The designers developed a technique, later referred to as scalable session messages, in which each receiver monitors the number of other session participants currently sending session messages and adjusts the period of its own session message transmission so as to keep the combined average session message traffic below a defined small percentage of the overall data traffic being sent and received in the session. This technique was applied again in the reliable multicast transport protocol SRM. There, the potentially very large set of data recipients must send session messages to communicate successful or unsuccessful receipt of packets, and the same local algorithm for determining the frequency of session message reporting is used. SRM went on to use localized randomized algorithms more extensively as a means of achieving scalability. In particular, SRM uses localized algorithms for determining who should send requests for retransmissions and who should send the corresponding repairs. This is an example of exploiting redundancy in that all members of a session that have lost a packet are potentially redundant in their role of requesting a retransmission; similarly, all members who correctly received the lost packet are potentially redundant senders of the repair.
SRM elaborated on Ethernet's distributed, randomized resource-usage techniques to identify local algorithms for each node to run that result in efficient sending of requests and replies. Note that SRM does not achieve perfect efficiency; a centralized scheme with global knowledge will always do better in any particular case. But SRM, by defining localized algorithms for each node to run, allows the collection of members to self-configure to an efficient state. It is more scalable than centralized approaches when the location of packet loss is unpredictable and nonstationary.
Another technique for exploiting redundancy might be to program or design EmNets to take advantage of opportunistic behaviors. For example, they could delay some basic reporting functions (for example, transmitting, reorganizing, calibrating, and reporting system health) until greater bandwidth, energy, or processing capabilities become available. Some nodes could enter a sleep mode when redundancy is detected, thereby saving power and contributing to longer system lifetimes. Self-configuration itself could take competing paths in which mobile code may be distributed at times when, or to locations where, the combination
of circumstances (bandwidth, operational real-time constraints, etc.) enables a self-configuration operation. Distribution of self-configuration commands, code, and verification acknowledgements may all adapt accordingly. This type of capability will require nodes to contain algorithms that provide flexibility in operating conditions.
An important part of adaptive coordination is the capability of individual nodes to monitor their own status and that of their operating environment. Nodes will need to gather information about changes in the status of other nodes (for example, that a nearby node has failed or entered a different operating state), changes in the availability of resources (for example, limited power or loss of a communications link), and changes in the environment being sensed and responded to. The nodes will need to rely on a variety of sensing modalities. For example, they may need optical sensors to indicate whether they have lost line of sight to another node with which they communicate frequently, and they will need checks on their own power levels. One of the most critical areas of research, as yet unexplored, will be the characterization of large-scale distributed systems composed of physically coupled, autonomous and/or adapting elements operating under unpredictable, highly resource-constrained environmental conditions.
Centralized Versus Decentralized Control
An issue that needs to be addressed with regard to both self-configuration and adaptive coordination is control of the system configuration. If individual elements of an EmNet can change their technical characteristics, capabilities, and operating modes—either through upgraded hardware or software or through adaptive coordination—how can the system guarantee its overall performance and stability or be sure that individual elements have access to the bandwidth, quality of service, or other properties they need in the system? Conversely, if a system contains large numbers of nodes, how can a central node control the overall configuration of the network in a timely fashion?
Issues of adaptive coordination, configuration, and, more generally, control can be addressed through any of several schemes. At one end of the spectrum are centralized schemes in which individual components are not self-configuring but the overall system is. At the other end of the spectrum are decentralized schemes in which individual components are themselves self-configuring. All cases require that some policy be expressed at the time the system is deployed (and probably afterward) that guides and focuses the self-configuration, with respect to not only the humans involved and the self-configuring system but also the centralized
controller and the distributed elements. (See Box 3.2 for a discussion of cooperative behavior and control.)
The viability of a centralized versus a decentralized scheme depends on several factors, including the scale of the system and the rate of anticipated change. Central control across a large network may be impossible to implement in a time-bounded fashion. (For a brief discussion of traditional control and systems theory as it relates to EmNets, see Box 3.3.) Local functions need to be optimized and reconfigured as the environment changes. If environmental conditions are not predictable and change faster than information can be extracted and analyzed, then a decentralized scheme is needed. But decentralization introduces issues of adaptive coordination and overall system performance. How can overall system performance be optimized if decisions are made locally? How can requirements for overall system performance be specified from a single point?
To provide for a degree of centralized control in a large EmNet with numerous elements, some sort of hierarchical, tiered structure will be needed. Many EmNets will be composed of heterogeneous collections of elements, each with different sets of capabilities and constraints. Some elements may be far less restricted than others in terms of, for example, the amount of power available to them; the system ought to be able to adapt by making such elements bear the brunt of power-intensive tasks. Other elements may be less restricted in terms of available memory or bandwidth, or they may have persistent storage easily available. Adaptive mechanisms can exploit system heterogeneity by using extra power where it exists in the overall system to offload work from elements with lower energy capacity.
Even when all nodes start out with equivalent capabilities, it may be efficient from a system-lifetime perspective to have the system select a small number of nodes to execute higher-power operations using higher-power resources (for example, long-range radio). Robustness can still be achieved by arranging for the “hierarchy” to self-configure using automated mechanisms for selecting which nodes will run the higher-energy resources. Automated hierarchy formation and clustering imply a need for automated reelection and selection in the face of failures. The adaptive coordination can take place efficiently and rapidly as the various elements adapt based on local measurements of environmental conditions and available resources.
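A hedged sketch of such automated role selection: the election below weights candidates by remaining energy, in the spirit of rotating cluster-head schemes generally rather than any specific protocol named in the text.

```python
# Energy-weighted election of the node that will run a higher-power
# resource (for example, a long-range radio). Re-running the election
# after failures or energy drain rotates the burden automatically.
import random

def elect_head(nodes, energy):
    """Pick a head with probability proportional to remaining energy,
    so power-intensive roles fall to the best-supplied nodes."""
    total = sum(energy[n] for n in nodes)
    r = random.uniform(0, total)
    acc = 0.0
    for n in nodes:
        acc += energy[n]
        if r <= acc:
            return n
    return nodes[-1]
```

Because the election uses only each node's reported energy, it can be rerun locally and rapidly whenever the current head fails or runs down.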
As such systems adapt by reconfiguring the tasks that each element performs based on its capabilities, the distinction between configuration and adaptive coordination may begin to blur. How those capabilities are communicated and allocated is an open area of research, as are questions of how groups of machines with different capabilities could be organized to perform a set of activities that are presented to the rest of the system as a single unit. Similar hierarchical organizations have been used in more traditional systems, but they are not based on the capabilities of the individual components in the manner described above. How to adapt the overall system configuration (or subsystem configuration) to maximize the information obtained while minimizing the use of scarce resources is a promising area for future research.
A possible approach to distributed control is directed diffusion, which amounts to controlling a system by means of activation and inhibition messages, the sum of which can either reinforce or discourage a course of action. As an example, consider a sensor network in which multiple nodes have access to the outside world through a specialized node with long-range communications capabilities that communicates with the rest of the nodes by passing messages from one node to another (that is, via multihop connections). If several nodes observe an event, directed diffusion can help determine which nodes should be involved in deciding whether to report the event, which one should do the processing, and what information should flow to the long-range link, given a desire to minimize energy expenditures.
If latency (delay) in making a decision is not an issue and the probability of a node accurately detecting an event is related to the strength of the signal it receives relative to background noise (the signal-to-noise ratio, or SNR), then the nodes can wait a period of time based on the SNR before alerting or inhibiting neighbors. The node that receives the signal with the highest SNR will send its alert first, communicating a message to the long-range link and sending short inhibition signals to its neighbors. The other nodes then refrain from transmitting their decisions or activating one another to engage in cooperative detection. If the signal at the node with the highest SNR is still below the threshold for reporting the event, the node could instead activate its neighbors, asking for their decisions and the relative certainty of those decisions. These activation messages propagate outward with reduced intensity (that is, they require progressively higher certainties to respond), and nodes with higher SNRs reply sooner. When enough replies have been combined to allow the original node to make a decision with the desired confidence, the event can be reported to the long-range link.
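The SNR-weighted waiting rule in the directed-diffusion example can be sketched as follows; the timing constants and SNR ceiling are illustrative assumptions.

```python
# SNR-weighted backoff: the node with the strongest detection reports
# first and can then inhibit its neighbors, so only one report is sent.

def backoff_delay(snr_db, max_wait_s=2.0, snr_ceiling_db=30.0):
    """Higher SNR -> shorter wait, so the most confident node speaks first."""
    snr_db = max(0.0, min(snr_db, snr_ceiling_db))
    return max_wait_s * (1.0 - snr_db / snr_ceiling_db)

# Three nodes observing the same event:
delays = {n: backoff_delay(s) for n, s in {"a": 24.0, "b": 9.0, "c": 15.0}.items()}
first = min(delays, key=delays.get)    # the first node to fire sends the
                                       # report and inhibits the others
```

The same pattern (wait a time inversely related to confidence, then suppress peers) recurs in SRM's repair timers, another instance of exploiting redundancy among potential responders.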
Some systems may benefit from decentralized control schemes, which also require further research and analysis. The minimum number of bits that must be communicated to make a reliable decision is unknown for all but the simplest of problems involving more than one sensor node. Given the high power cost of communications, it would be useful to know what that threshold is and thus whether particular algorithms are anywhere near optimal. (For a discussion of local computation vs. communication as related to EmNets, see Box 3.4.) If the processing problem is cast as a rate-distortion problem, in which (weighted) false-alarm and missed-detection probabilities constitute the distortion and communications energy takes the role of rate, then additional questions can be explored. For instance, what is the effect of array density on the rate-distortion region for a given communications and signal propagation law and set of source statistics? This is a deep and difficult problem (for example, under what conditions is the rate-distortion region convex?), but its solution could have a large payoff. Preliminary progress has been made on simple versions of this problem, but a huge problem space remains to be explored.
The interaction between a system element and its neighboring elements is not typically considered in control theory but is essential to modeling EmNets. The interaction between a node and its immediate neighbors is critical, because physical stimuli are often correlated with physical space and because the communications costs and latencies to near neighbors are likely to be lower than average. Centralized control rules can be devised for such a group, but the complexity of the decision-making process, even for a relatively small collection of nodes, will demand some decentralization and probably hierarchy as well. Layered control hierarchies are notoriously difficult to optimize, but perhaps by scaling to large numbers designers can reduce the demand for efficient use of the individual components. In any scheme, the fundamental issue of stability arises. Once the design moves away from centralized control, the theory for characterizing the system and guaranteeing stability is not well developed. Note that actuation, signal processing, and communications (or, more likely, a combination of these) all raise fundamental questions of resource allocation in response to a physical stimulus. Accordingly, a solution in any one of these domains may well have applications to all the rest. The problem of cooperation thus appears to offer an excellent opportunity for multidisciplinary research; there are probably lessons to be learned from diverse disciplines, with a potentially high payoff. (An example of an area in which multidisciplinary approaches are used is distributed robotics, described in Box 3.5.)
EmNets bring together two established research communities: distributed systems and control. Control is a rich research area that studies how to use feedback to optimize the behavior of electromechanical systems. Control has its roots in simple servo control systems but is now used in the design and operation of a wide class of electronic and electromechanical systems. Often these systems have hundreds of processors and components from multiple vendors. Some of these systems run chemical plants, manufacturing plants, and even buildings. By bringing together these two areas, EmNets open a number of new research directions.
Control theory is used to solve a number of difficult problems. For example, in flight control systems, the dynamics of the plane are carefully studied to create an optimal controller for the system. Often this controller is combined with a number of estimators that produce estimates of what the measured parameters should be. An estimator can be used to provide input from sensors that might not be read each cycle (for example, the computation might require 25 data points when only 10 are being collected at any given time) or to check that the current model of the system matches the actual system. In some highly critical situations, banks of estimators can be used to model how the system would behave under various fault conditions. During normal operation these estimators will match the system poorly, but under a fault condition one of them may become a better match than the nominal model. Thus, when a fault does occur (such as the loss of an engine in an aircraft), that fault's estimator has current information and can be used to update the control equations for the plane, allowing it to continue to function at some reduced performance until the error is repaired.
Rather than using a fixed system model, model predictive control adapts the system model and the control formulation, solving an optimal control problem at each step using current sensor data and measured system performance. This type of control was initially used in large-scale refineries, where cycle times are very long (tens of minutes), providing sufficient time for the required computation.
Both types of system rely on obtaining sensor measurements at fixed time increments. While networks are often used in control systems, their properties are not considered in the problem formulation. For high-performance control loops, sensors are given logically separate networks (or even physically separate wires) to collect the data, making variable packet delay and possible data loss nonissues. In addition, in almost all cases the control algorithm is centralized rather than run in a distributed fashion. The long cycle time of many process control systems makes the issue of networks in these systems uninteresting, and in any case existing technology meets their requirements. While robust operation is critically important, with commands being issued to individual pumps, valves, heaters, and the like (in a factory setting), the long cycles provide time to consider and reject outlying data, and every actuator is likely to have a secondary sensor for redundancy and prediction checking.
While the notion of fixed time samples is fundamental to most control theory, some methods might migrate to network-based systems more easily. One possibility is Lyapunov methods, in which each unit greedily minimizes a value function that serves as a coordinator. This transposes to asynchronous systems very nicely. In general, the actions of each unit would have to be coordinated carefully (simple examples show that activating two feedback systems simultaneously can lead to disastrous loss of performance or instability), but if there is a value function that each is separately minimizing, the actions are automatically coordinated.
To the standard control issues EmNets add resource constraints, distributed systems, and networks. In control environments, networks are assumed to be stable, not to lose information, and not to introduce delays. All of these assumptions are likely to be violated at some point for EmNets, posing new research challenges.
NOTE: The committee thanks Stephen Boyd of Stanford University for his guidance in developing this description.
One of the design choices that must be made in EmNets is the balance between local computation and the communication of data back to a more centralized processing node. In other words, to what extent should an individual node process the data it has collected or been sent when it also has the option of communicating raw, unprocessed data to another node for processing? This issue is particularly important in EmNets that operate with limited stores of energy and must therefore minimize energy consumption. It is extremely important in systems that rely on wireless communications to transport data because of the energy requirements of wireless systems. Many sensor networks will be in this category, as will mobile elements of other EmNets, such as smart spaces.
The high energy consumption of wireless communications systems leads to unique conclusions about the distribution of tasks in the distributed embedded system network. For example, in a typical wireless sensor network, the network’s task is to identify events that occur in the network environment and communicate these occurrences to a remote user. Conventionally, this would be done by transmitting received sensor information to a remote asset for processing. EmNets composed of many distributed devices become collectively more capable if significant computation is performed locally, with the goal of recognizing local events and communicating only event identification codes, as opposed to complete data sets.
As an example of the trade-off between computation and communication in an EmNet, consider a wireless sensor system that is distributed over a large surface. Communication occurs between nodes in a multihop architecture, in which information is passed from the source node to the destination node through a number of intermediate, proximate nodes. Under these conditions, the signal power received from any one node declines rapidly as the distance from the transmitting node increases.
This severe decay of wireless signal strength with distance has a profound influence on the balance between communication and computation resources. System designers must choose between communicating data directly for remote processing and performing local processing, then communicating a shorter message, or perhaps none at all, to a remote node. The energy required to transmit even a short message could instead power a significant amount of local computation, and that computation budget enables potentially quite powerful information processing that can reduce the amount of information that must be communicated. Hence, considerable design and development effort will need to be directed to the deployment of EmNets that leverage powerful local computation and cooperative processing to identify local events and even command local action. Low-power wireless embedded systems will therefore create demands for a rich set of novel network and distributed computing solutions that were not needed in conventional wireline systems.
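A back-of-the-envelope comparison suggests the scale of this trade-off. The energy constants below are hypothetical, chosen only to illustrate the disparity between radio and CPU costs:

```python
# Hypothetical energy budget comparing "ship raw data" against
# "process locally, ship an event code". Constants are illustrative.

E_TX_PER_BIT = 1e-6      # joules to transmit one bit over the radio
E_PER_INSTR  = 1e-9      # joules per local CPU instruction

def ship_raw(raw_bytes):
    """Energy to transmit a raw data block."""
    return raw_bytes * 8 * E_TX_PER_BIT

def process_then_ship(instructions, code_bytes=4):
    """Energy to process locally, then transmit a short event code."""
    return instructions * E_PER_INSTR + code_bytes * 8 * E_TX_PER_BIT

raw_cost   = ship_raw(1000)                # a 1 kB sample block
local_cost = process_then_ship(1_000_000)  # 1M instructions + 4 B code

# Under these assumptions, the node can execute on the order of a
# million instructions for less than the cost of shipping a kilobyte.
print(raw_cost, local_cost)
```

The exact ratio depends on the radio, distance, and processor, but the qualitative conclusion, that local computation is far cheaper per useful bit than transmission, holds across a wide range of realistic parameters.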
A sensor network is an example of an EmNet that illustrates the benefits of using system architectures and adaptive coordination to improve overall system performance in the face of stringent resource constraints. Sensor networks generally require constant vigilance by at least a subset of the sensors so that desired events can be reliably detected. At the same time, the system must avoid generating false alarms when a particular event has not occurred. Sensor networks can employ a power-conserving hierarchical detection scheme to meet these objectives. For example, individual sensors may use energy-efficient procedures for detecting acoustic, magnetic, infrared, or other forms of energy and then attempt to make a detection decision independently. If the sensor cannot reliably make a decision, it could employ some processing and sensing to seek information from nearby sensors. These processes involve larger expenditures of energy, especially if the sensor and its neighbors must communicate. Additional processing, using a large neural network or some other sophisticated procedure, could be used to provide greater assurance if necessary. In the worst case, raw data might be transmitted back to a remote site where human operators analyze the data and determine whether an event has been detected. This step consumes large amounts of energy and must be avoided, except when absolutely necessary.
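The escalation logic of such a power-conserving detection hierarchy might be sketched as follows. The thresholds, stage costs, and function names are hypothetical; a real system would add further, costlier stages (a heavier classifier, then raw-data upload to human operators) in the same pattern:

```python
# Sketch of tiered detection: each stage is more reliable but costs
# more energy, and a stage runs only when the cheaper stage below it
# cannot reach a decision. All parameters are illustrative.

def local_detect(sample, lo=0.2, hi=0.8):
    """Cheap single-sensor test: True/False, or None if ambiguous."""
    if sample >= hi:
        return True
    if sample <= lo:
        return False
    return None  # ambiguous -- escalate to the next stage

def neighbor_vote(samples, hi=0.5):
    """Costlier stage: majority vote over neighboring readings."""
    votes = sum(1 for s in samples if s >= hi)
    return votes > len(samples) / 2

def detect(sample, neighbor_samples, energy_log):
    energy_log.append("local")       # always pay the cheap stage
    decision = local_detect(sample)
    if decision is not None:
        return decision
    energy_log.append("neighbors")   # pay for radio + extra sensing
    return neighbor_vote([sample] + neighbor_samples)

log = []
print(detect(0.9, [], log))               # confident local detection
print(detect(0.5, [0.6, 0.7, 0.1], log))  # escalates to neighbor vote
```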
As this example illustrates, there are trade-offs to be made with regard to the extent of processing conducted by individual sensors and the amount of information communicated among them. In many applications, there will be no events to report much of the time and no need to apply the most expensive algorithm, transmitting data to human operators for analysis. But there may be too many circumstances in which the least expensive detection algorithm fails. A processing hierarchy can yield large reductions in energy consumption while assuring the required level of reliability. Processing hierarchies are intertwined with networking and data storage issues. How long and where data are stored (or queued) will differ at different levels in the hierarchy; the decision on whether to communicate with neighboring nodes, and with which ones, will depend on the signal-processing task. The amount of energy consumed by communications and the degree to which energy is scarce will affect the processing strategy (that is, the willingness to communicate and whether processing is centralized or distributed). All of this, in turn, depends on the physical constraints the system faces; once again, the physical layer intrudes.
BOX 3.5 Distributed Robotics

Distributed robotics is the study of algorithms for the control and coordination of groups or teams of robots. A multirobot group is a superb example of a networked embedded system that embodies challenges in control, communication, and coordination as it faces uncertainty in sensing and action, unexpected failures, and a dynamic environment. The notion of a single, centralized controller coordinating a distributed robot group is considered untenable, as it is neither scalable nor robust. Thus, control must be distributed to the individual robots, which must communicate and adapt as necessary to produce globally efficient behavior of the system as a whole.

Several key methodologies are relevant to multirobot control, as they are to individual robot control. Reactive control involves the lookup and execution of precompiled, stateless collections of rules, with no looking into the past or planning for the future. Deliberative control uses centralized world models and planning but scales poorly with the complexity of the control problem and the group size. Hybrid control attempts a compromise between reactive and deliberative approaches by employing both and mediating between them as necessary; this is a dominant paradigm in robotics. The other dominant paradigm is behavior-based control, which is of particular relevance in distributed robotics.

Behavior-based controllers consist of collections of behaviors: time-extended processes or control laws that achieve and maintain goals. For example, "avoid obstacles" maintains the goal of preventing collisions, and "go home" achieves the goal of reaching some destination. Behaviors can be implemented in software or hardware and as processing elements or as procedures. Each behavior can take inputs from the robot's sensors (for example, camera, ultrasound, infrared, tactile) and/or from other behaviors in the system and send outputs to the robot's effectors (for example, wheels, grippers, arm, speech) and/or to other behaviors. Thus, a behavior-based controller is a structured network of interacting behaviors. Behaviors themselves embed state and can form arbitrary representations when networked together. Thus, behavior-based systems are not limited in their expressive and learning capabilities, and they are well known for their real-time response and scalability. The metaphor of a robot being controlled by a collection of behaviors scales well to systems of robots that are themselves behavior collections. Currently, behavior-based control appears to be the de facto standard for distributed multirobot control, owing to its robust and scalable properties.

As EmNets evolve to include actuation and mobility, lessons can be learned from distributed robotics. The significant open problems in distributed robot control include the synthesis and analysis of adaptive group behavior, group coordination, and communication strategies that facilitate dynamic, run-time, efficient resource allocation within a distributed system. Distributed robots need to be self-configuring and will usually be unattended. Latency is also an important concern for both types of systems. Both are likely to interact with humans at some points or on some level, and it may be the case that usability and interaction issues will overlap. However, the constraints on EmNets differ in some ways. Many EmNets will have severe power limitations, whereas many distributed robots may be large enough to incorporate more than adequate battery power. In addition, EmNets will probably consist of many more components and nodes than distributed robots would need to incorporate.

NOTE: The committee thanks Maja Mataric and Gaurav Sukhatme of the University of Southern California for their guidance in developing this description.

Given the amount of energy needed to communicate even a short message, it often pays to process data locally to reduce the volume of traffic and to use multihop routing and advanced communications techniques, such as coding, to reduce energy consumption. Collaborative processing can extend the effective range of sensors and enable new functions. Consider, for example, the problem of target location. With a dense array of networked sensors, one way to track the position of an object (for example, a target or a detected event) is for every node that detects a disturbance to make a report. The centroid of the reporting nodes is one possible estimate of the target's position. This approach requires the exchange of very few bits of information per node.

Much more precise position estimates can be achieved with a technique called beam forming, in which individual sensors exchange information about detected events and the times at which they were detected. Although this approach consumes more energy, it offers several benefits: higher-quality data for subsequent classification decisions, long-range position location, and even some self-location and calibration possibilities for the nodes.12 In some applications, sparse clusters of nodes using beam-forming techniques might be preferable to a dense deployment of less-intelligent nodes, or it might be better to combine both. For example, a dense network of less-intelligent sensors deployed in conjunction with a sparser array of intelligent nodes could capture information on demand for beam forming. Such collaborative processing can be regarded as a further extension of the signal-processing hierarchy to multiple nodes, with the collaboration being extremely expensive in terms of energy use but performed only rarely, so that its marginal energy cost may be acceptable.
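As a minimal sketch of the centroid estimate described above, each reporting node contributes only its own (known) position, and the estimate is the mean of those positions. The node positions and helper name are illustrative:

```python
# Centroid position estimate: every node that detects the disturbance
# reports its own position; the estimate is the mean of the reports.

def centroid_estimate(reports):
    """reports: list of (x, y) positions of nodes that detected the event."""
    if not reports:
        return None  # no detections, no estimate
    xs = [x for x, _ in reports]
    ys = [y for _, y in reports]
    return (sum(xs) / len(reports), sum(ys) / len(reports))

# Nodes at the corners of a square all detect an event near its center.
print(centroid_estimate([(0, 0), (0, 2), (2, 0), (2, 2)]))  # (1.0, 1.0)
```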
Key to any network collaboration is synchronization among the elements of the network. Synchronization depends on both the accuracy of the local clocks and the ability of the network to coordinate local clock accuracy. Both long- and short-term clock drift matter for providing various levels of functionality. For spread-spectrum communication, high-accuracy clock synchronization with the received signal is necessary to decode the information sent. However, only relative synchronization is needed for node-to-node communication, because the propagation delay is not quantified at each node. In addition to enabling communication, coordinated synchronization is important as a means to enhance power savings, enable collaborative sensing, and allow multisensor self-location.
The local power consumption of a remote node must be reduced to the bare minimum needed to support continuous sensing and a minimal level of event detection, while retaining the ability to expend power as needed for communications or more intensive processing. This is appropriate when events are expected to be frequent enough that every node in the network must remain vigilant. For longer-lived sensors in environments with a lower probability of events, communication and processing support may be set up to operate intermittently. If the network uses a form of TDMA communication, then each sensor must stay synchronized to report events with low latency. In addition, to coordinate sensing times and enable coherent collaborative processing, each node needs to be synchronized to a global timescale. Clock drift on each sensor thus limits how long sensors can go without communicating, and hence the power savings achievable by powering down the radio. Additionally, if a sensor field is put into a somnolent state in which most sensors are powered down, total network power savings will be greater if the sensors coordinate their sleep times (which requires synchronization) than if they power down at random to provide an overall reduced-alert state.
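The way clock drift bounds radio-off intervals can be illustrated with a simple guard-time calculation. The crystal accuracies and guard time below are assumed figures, not values from the text:

```python
# Back-of-the-envelope sketch: how long can a node sleep (radio off)
# before worst-case clock drift consumes the guard time budgeted
# around its TDMA slot? All numbers are illustrative.

def max_sleep_s(drift_ppm, guard_time_s):
    """Sleep time after which worst-case relative drift eats the guard time.

    drift_ppm: per-node crystal accuracy in parts per million. Both
               endpoints may drift in opposite directions, so the
               worst-case relative drift is twice the per-node figure.
    guard_time_s: slack budgeted around each TDMA slot, in seconds.
    """
    relative_drift = 2 * drift_ppm * 1e-6
    return guard_time_s / relative_drift

# A 50 ppm crystal and a 1 ms guard time allow about 10 s of sleep
# between resynchronizations; a 1 ppm oscillator stretches that to 500 s.
print(max_sleep_s(50, 1e-3))
print(max_sleep_s(1, 1e-3))
```

The design tension is visible directly: a cheaper, less accurate clock forces either more frequent wake-ups (energy spent on synchronization) or wider guard times (energy spent idling with the radio on).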
Collaborative sensing (for example, using beam-forming algorithms) benefits from synchronization of all the sensing inputs. Combining results from multiple sensors at different locations to counter jamming, enhance resolution, or enable distributed sensing requires relative timing information. On the coarsest scale, timing is needed to establish which event occurred where. Finer timing resolution allows coordinated events to be recognized by coherently combining results from multiple sensors, fully realizing the utility of a distributed sensor system. Indeed, the effective resolution of coherently combined inputs from multiple sensors is limited by the accuracy of their time synchronization.
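A toy delay-and-sum example illustrates why coherent combining depends on timing: with correct delays the common signal reinforces, while timing error smears it. The signals, delays, and helper function are synthetic:

```python
# Delay-and-sum combining over discrete samples: shift each sensor's
# stream by its known propagation delay, then sum. Correct delays
# stack the shared pulse; wrong delays leave it spread out.

def delay_and_sum(streams, delays):
    """Shift each stream left by its delay (in samples) and sum them."""
    n = min(len(s) - d for s, d in zip(streams, delays))
    return [sum(s[d + i] for s, d in zip(streams, delays)) for i in range(n)]

pulse = [0, 0, 1, 0, 0, 0]
# Three sensors observe the same pulse with different propagation delays.
s0 = pulse + [0, 0]
s1 = [0] + pulse + [0]
s2 = [0, 0] + pulse

aligned = delay_and_sum([s0, s1, s2], delays=[0, 1, 2])
misaligned = delay_and_sum([s0, s1, s2], delays=[0, 0, 0])

# Correct delays stack the pulse to amplitude 3; zero delays leave
# three separate peaks of amplitude 1.
print(max(aligned), max(misaligned))
```

With a real array the delays are fractions of a sample period, which is precisely why the achievable combining gain is bounded by how tightly the sensors' clocks are synchronized.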
Programming EmNets to achieve significant collaborative processing raises some of the same challenges as are faced in parallel computing and distributed databases. Neither model adequately addresses the combined constraints of EmNets, however. For example, in contrast to parallel computing environments, the data in an EmNet will be inherently distributed. And in contrast to distributed databases, EmNets are much more resource constrained. An assumption in distributed databases is that moving the data from place to place is relatively inexpensive. In EmNets, the emphasis will be on performing the processing where the data are located. Some techniques from each of these models may prove useful, however, and their applications to EmNets merit further investigation.
Finally, the cooperative and collaborative nature of EmNets might frequently create requirements for configuration actions that are implemented across all or nearly all the nodes in a network. If a system is self-configuring, at times there may be a need to clearly identify the subsets of the system that have changed or been upgraded. This is referred to as a need for “atomicity,” in which the system as a whole is considered a single, atomic (indivisible) entity. Specifically, the configuration of network protocols or security functions may be an action that must be applied with complete assurance across all nodes in a network. Errors in configuration for one node in a vast area may have unbounded impact. Atomicity of some kind may be needed when a change must be collective or coordinated, but it might not be achievable using standard techniques because there is no enumeration or unique identification of individual components. Moreover, there is a possibility that not all elements need to be upgraded; some components may be disconnected or obstructed for significant periods of time. If a piece of the system is changed, there must be a way for the system to detect whether the resulting final state is workable. How does one determine that enough components have been upgraded to take on the new behavior? How do old components detect that they should change their behavior when they encounter new ones?
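One best-effort answer to the last question (it provides eventual agreement, not true atomicity) might resemble the following epoch-based sketch, in which each node carries a configuration epoch, adopts any higher epoch it hears, and enables the new behavior only once enough recently heard neighbors have caught up. All names and the two-thirds threshold are hypothetical:

```python
# Hypothetical sketch: detecting and adopting a collective change
# without global enumeration of components. Each node gossips its
# configuration epoch in periodic hello messages. The 2/3 threshold
# and all identifiers are illustrative assumptions.

class Node:
    def __init__(self, epoch=0):
        self.epoch = epoch
        self.heard = {}          # neighbor id -> last epoch heard

    def on_hello(self, neighbor_id, neighbor_epoch):
        """Process a periodic hello message from a neighbor."""
        if neighbor_epoch > self.epoch:
            self.epoch = neighbor_epoch   # lazily adopt the upgrade
        self.heard[neighbor_id] = neighbor_epoch

    def new_behavior_active(self, threshold=2 / 3):
        """Enable the new behavior only when enough neighbors caught up."""
        if not self.heard:
            return False
        caught_up = sum(1 for e in self.heard.values() if e >= self.epoch)
        return caught_up / len(self.heard) >= threshold

n = Node(epoch=0)
n.on_hello("a", 1)   # an upgraded neighbor appears; n adopts epoch 1
n.on_hello("b", 0)   # a laggard keeps the new behavior disabled
print(n.new_behavior_active())   # False: 1 of 2 neighbors at epoch 1
n.on_hello("b", 1)
print(n.new_behavior_active())   # True: 2 of 2 neighbors caught up
```

Such a scheme leaves open exactly the failure modes the text raises: disconnected or obstructed nodes can remain below threshold indefinitely, and the system still needs some way to verify that its mixed-epoch state is workable.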
Self-configuration involves the addition, removal, or modification of elements in an EmNet and the subsequent process of establishing interoperability. In contrast, adaptive coordination addresses changes in the behavior of a system as it responds to changes in the environment or in system resources (such as remaining energy). Together, these processes are critical for creating robust and scalable unattended EmNets. The state of the art in self-configuration is fairly well developed, with well-understood approaches to address assignment, service discovery, and mobile code. However, significant research progress is needed to achieve automatic self-configuration among large numbers of distributed nodes while still conforming to well-defined trust and failure models, which are critical to embedded systems applications. Adaptive coordination is a well-developed discipline for centralized systems, and distributed coordination is widely applied outside of embedded applications (for instance, in Internet applications and protocols), but much work remains to be done in distributed adaptive coordination to support embedded applications. Promising directions include techniques for exploiting system redundancies and localized processing, as well as collaborative signal-processing techniques. Such techniques are particularly critical for unattended, resource-constrained systems.
References

Abelson, Harold, Don Allen, Daniel Coore, Chris Hanson, George Homsy, Thomas F. Knight, Jr., Radhika Nagpal, Erik Rauch, Gerald Jay Sussman, and Ron Weiss. 2000. "Amorphous computing." Communications of the ACM 43(5). Also as MIT Artificial Intelligence Memo 1665, August 1999.
Arnold, Ken, and Jim Waldo, eds. 2000. The Jini Specifications, 2nd ed. Cambridge, Mass.: Addison-Wesley.
Corson, M. Scott, and Joseph Macker. 1999. "Mobile Ad Hoc Networking (MANET): Routing Protocol Performance Issues and Evaluation Considerations." IETF RFC 2501.
Fall, K., and S. Floyd. 1996. “Simulation-based comparisons of Tahoe, Reno, and SACK TCP.” Computer Communication Review 26(3):5-21.
Floyd, S., V. Jacobson, C. Liu, S. McCanne, and L. Zhang. 1997. "A reliable multicast framework for light-weight sessions and application level framing." IEEE/ACM Transactions on Networking 5(6):784-803. An earlier version of this paper appeared in ACM SIGCOMM '95, August 1995, pp. 342-356.
Gun Sirer, Emin, Robert Grimm, Brian Bershad, Arthur Gregory, and Sean McDirmid. 1998. “Distributed virtual machines: A system architecture for network computing.” Eighth ACM Sigops European Workshop.
Intanagonwiwat, Chalermek, Ramesh Govindan, and Deborah Estrin. 2000. “Directed diffusion: Scalable and robust communication paradigm for sensor networks.” Proceedings of the Sixth Annual International Conference on Mobile Computing and Networks (MobiCOM 2000), Boston, Mass. Available online at <http://lecs.cs.ucla.edu/~estrin/papers/diffusion.ps>.
Jacobson, V. 1988. "Congestion avoidance and control." ACM SIGCOMM '88.
Karp, B., and H.T. Kung. 2000. “GPSR: Greedy perimeter stateless routing for wireless networks.” Proceedings of the Sixth Annual International Conference on Mobile Computing and Networks (MobiCOM 2000).
McQuillan, J., I. Richer, and E. Rosen. 1980. "The new routing algorithm for the ARPANET." IEEE Transactions on Communications 28(5):711-719.
Mullender, Sape. 1992. “Kernel support for distributed systems.” Distributed Systems, 2nd ed. S. Mullender, ed. Cambridge, Mass.: Addison-Wesley.
Parsons, David. 1992. The Mobile Radio Propagation Channel. New York: John Wiley & Sons.
Sohrabi, K., and G.J. Pottie. 1999. “Performance of a novel self-organization protocol for wireless ad-hoc sensor networks,” IEEE VTS 50th Vehicular Technology Conference 2:1222-1226.
Sohrabi, Katayoun, Gregory J. Pottie, and Bertha Manriquez. 1998. “Near-ground wideband channel measurement in 800-1000 MHz.” IEEE 1998 Vehicular Technology Conference.
Yu, H., D. Estrin, and R. Govindan. 1999. “A hierarchical proxy architecture for Internet-scale event services.” Proceedings of WETICE’99, June.