Basic Research in Information Science and Technology for Air Force Needs

5 Basic Research for Air Force Information Management and Integration

BACKGROUND

The primary reason for developing a network-centric Air Force is to gain the ability to rapidly collect and distribute information and thereby create information superiority and, from it, military superiority. This is a very challenging objective. While the Air Force has greatly increased its ability to collect information, the capability does not yet exist for quickly managing that information so as to create knowledge where it is needed while avoiding information overload.

Technology to manage and integrate information is critical to the Air Force for its current operations and its long-term challenges. The need for this technology arises in finding and tracking, command and control, decision support for situation assessment, counterintelligence, and public affairs. For example, finding and tracking require integrating data from large numbers of sensors and fusing them into actionable information. Command and control and situation assessment have similar requirements, such as tying together sensors and databases into a publish-subscribe environment.

More concretely, the Air Force will need an integrated battlespace management system that combines all of these capabilities in order to significantly improve its ability to conduct effective joint and coalition warfare. Such a system is dependent on several key IS&T advances: information exchange with complex event processing; transformation of multiple sources of information into a common representation and then into knowledge; and distributed collaboration via shared knowledge.
Building such a system is made more difficult by many real-world factors that the current state of the art cannot yet handle. For example, integration often needs to be done quickly between systems that were not designed to interoperate, such as those of partners in an ad hoc coalition. Efficient techniques for evaluating complex subscriptions may enable warfighters to identify targets more quickly, while those targets are still vulnerable to attack. Sensors may be moving and unreliable, so that integration needs to be more dynamic and flexible. As power sources for sensors and networking technology improve and the cost and size of sensors decrease, we can expect the number of sensors and the data rate per sensor to increase. The combination of these effects will greatly increase the overall data rate from sensors, seriously taxing the scalability of current stream-processing algorithms. Inaccuracies in human sources of information must also be factored in.

The Air Force also needs advanced data integration technology in support of nonkinetic operations, that is, those that do not rely on damage to physical objects. For example, public relations challenges might require quickly searching and correlating information across many data sources, only some of which may be owned by the Air Force and many of which have not been integrated at the time the search request is issued. Influence operations might rely on very fast unearthing of information about particular individuals (e.g., a leader of a threatening group) or groups (e.g., a tribe, clan, or religious community). Other nonkinetic operations might depend on quick searches of blueprints, building permits, inventories, cybercrime logs, and so on.

Commercial database and distributed systems technology can be expected to meet many Air Force needs.
However, for many technical areas, the Air Force places a higher value than the commercial sector on obtaining highly innovative solutions or on obtaining them well before their widespread commercial availability. Because the Air Force uses many custom-developed application systems, off-the-shelf solutions are often insufficient, particularly for aspects of databases and distributed systems that enable interfacing with other systems: schema reconciliation, metadata, and so on. Rather, fundamental models of information integration are needed, models that can be tailored to unique Air Force requirements.

What is the current state of the art? Robust database system products exist to support the storage and management of structured information, and recent research is leading to versions of these products that can manage XML documents. Extract, transform, and load (ETL) tools are available to help clean and integrate data, for example, to store it in a data warehouse. Enterprise information integration (EII) tools are becoming available that enable querying heterogeneous databases, typically relational or XML, although this approach is much less mature than data warehousing. Data integration projects, particularly for data warehouses, are now an important part of the IT work done in large enterprises. However, such projects are usually quite expensive. Simple ones typically require a year of elapsed time and several person-years of engineering effort, and complex projects require much more effort.

There is also a large market for application servers and related middleware products that support the integration of distributed applications. Application servers enable the integration of applications through remote procedure call (RPC). The newest incarnation of this technology is service-oriented architecture (SOA), where the RPC is Internet-enabled using the XML-based Simple Object Access Protocol (SOAP). Enterprise application integration (EAI) tools have a similar goal, but they use asynchronous message-passing and generally run behind the firewall. As in the database area, the deployment of such products is usually an expensive multiyear affair.

There is some redundancy of application development tasks appropriate to ETL, EII, EAI, and/or application servers. For example, data may need to be cleaned and reshaped in the same way for each of them. This redundancy increases the cost of integration and the potential for inconsistencies between functionally related solutions. However, there are technical impediments to eliminating this redundancy, such as incompatible execution models (e.g., streaming vs. RPC vs. batch processing). This area is ripe for research.

MAJOR INFORMATION MANAGEMENT CHALLENGES FOR AIR FORCE IS&T

A central issue in the ubiquitous deployment of IT is the efficient and effective use of the large variety and large quantities of data that can be acquired, transmitted, and stored.
In the specific area of national defense, both human and automated decision makers must rapidly sort through voluminous multimedia data and fuse disparate sources to create knowledge. The first step is to integrate raw data that arrive in different formats and are managed by different data management technologies so that they can be correlated and searched using a common mechanism. There has been much progress on integrating classical formatted files with semistructured data, such as XML, which is becoming a standard feature of commercial database systems. The integration of other information management functions into a common database infrastructure—such as information retrieval, natural language processing, geographical information systems, image and video stream processing, and data streams—is not as
far along. It is an open question how best to integrate these functions into a common database system architecture.

Similarly, the Air Force faces major open questions on how to manage and share information in a distributed system. The “publish-subscribe” paradigm is an increasingly popular one for enabling distributed information-system integration and is being explored by the Air Force to work with DOD’s Global Information Grid. That concept includes a common area where information is “published,” along with “subscription” information for each user that defines which posted information that user’s systems will download from the common area. While the publish-subscribe concept has been shown to scale to thousands of participants within stable local area network environments,[1] there are many Air Force requirements that will be met only with further research:

An Air Force publish-subscribe system must work in an unstable wide-area network environment such as a battlespace network.

An Air Force publish-subscribe system must scale to a very large number of users and data sources, while coping with a bursty load that generates network and server contention.

A publish-subscribe system’s scalability depends in part on the amount of state information it needs to maintain about old events. Thus, in order to scale to the sizes contemplated by the Air Force, systems based on this paradigm must weed out published information that is outdated or redundant, even if it has not been replaced by a clear duplicate, except that in some cases old information is still useful for some subscribers.

The system must distinguish between subscription rules that do not fire because the rule’s conditions do not hold and those that do not fire because their data sources are unavailable. The former situation indicates that no action is required, but the latter is ambiguous.

A system that handles more expressive subscription rules can be more selective in identifying events of interest to battlefield commanders and can reduce the cognitive load on commanders by tracking the state of partially satisfied subscriptions. However, the expressiveness of subscription rules is limited by the speed with which they can be processed and the amount of memory needed to maintain the state of partially satisfied subscriptions (remembering, for example, that two of the three requested events have occurred). In effect, rules are continuous queries being posed on streams of data arriving from sensors and other data sources. Database-style query optimization will be needed to cope with high data rates and the large number of rules that need to be concurrently checked. It will also be needed to process queries over a combination of historical data and newly arriving data.

[1] A. Carzaniga, D.S. Rosenblum, and A.L. Wolf, “Design and evaluation of a wide-area event notification service,” ACM Transactions on Computer Systems 19(3):332-383 (2001); Yi-Min Wang, Lili Qiu, Chad Verbowski, Dimitris Achlioptas, Gautam Das, and Paul Larson, “Summary-based routing for content-based event distribution networks,” Computer Communication Review, October 2.

The system must have some “intelligence” to take context into account when interpreting subscription rules. For example, a user’s information needs can vary tremendously depending on external conditions, such as whether a battle is ongoing, whether other elements of the chain of command are online, and whether other information sources are operational.

The importance of timeliness can vary greatly according to the information posted or the external conditions. Rules must be evaluated on input data streams at a rate that suits the timeliness requirements.

The relative importance of certain rules can vary according to external conditions. When the system is overloaded due to a burst of activity in one area of the battlespace, performance should degrade gracefully; that is, the system must not thrash. The most important rules should get priority access to the limited system resources.

Security of the published information is critical, and the system must be trustworthy even if an adversary has gained access to publish or subscribe. In an Air Force setting, it must also offer multilevel security.

The above requirements are examples and not comprehensive. Even so, they form a long list of challenges that spans many areas of computer science in addition to publish-subscribe concepts, such as database systems, security, and data mining.
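
Several of the requirements listed above (tracking partially satisfied subscriptions, telling "conditions do not hold" apart from "data source unavailable," and favoring important rules under overload) can be made concrete with a small sketch. The following Python is only an illustration under stated assumptions: the `Subscription` class, the `Status` values, the event and source names, and the simple "all required events seen" rule form are hypothetical and do not describe any actual Air Force or Global Information Grid system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    FIRED = "fired"      # every required event has been seen
    PENDING = "pending"  # sources are reachable; conditions do not hold yet
    UNKNOWN = "unknown"  # a needed source is unavailable, so silence is ambiguous


@dataclass
class Subscription:
    """Hypothetical rule: fire once all required event types have occurred."""
    name: str
    required: frozenset   # event types the rule waits for
    sources: dict         # event type -> name of the source that publishes it
    priority: int = 0     # larger means more important
    seen: set = field(default_factory=set)  # state of partial satisfaction

    def observe(self, event_type):
        # Remember only the events this rule cares about.
        if event_type in self.required:
            self.seen.add(event_type)

    def status(self, available_sources):
        missing = self.required - self.seen
        if not missing:
            return Status.FIRED
        # If a still-missing event comes from a source that is down,
        # the rule's failure to fire carries no information.
        if any(self.sources[e] not in available_sources for e in missing):
            return Status.UNKNOWN
        return Status.PENDING


def select_for_evaluation(subscriptions, capacity):
    """Under overload, evaluate only the highest-priority rules."""
    ranked = sorted(subscriptions, key=lambda s: s.priority, reverse=True)
    return ranked[:capacity]


rule = Subscription(
    name="convoy-alert",
    required=frozenset({"radar_contact", "radio_intercept", "visual_confirm"}),
    sources={"radar_contact": "radar-1",
             "radio_intercept": "sigint-1",
             "visual_confirm": "uav-7"},
    priority=10,
)
rule.observe("radar_contact")
rule.observe("radio_intercept")   # two of the three requested events have occurred

status_sources_up = rule.status({"radar-1", "sigint-1", "uav-7"})
status_uav_down = rule.status({"radar-1", "sigint-1"})
rule.observe("visual_confirm")
status_complete = rule.status({"radar-1", "sigint-1"})
```

The memory cost the text mentions shows up here as the `seen` set kept per rule; a real system would also need to expire that state and to evaluate far more expressive conditions than a fixed set of event types.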
Moreover, in addition to publish-subscribe, other distributed computing paradigms might be relevant, such as peer-to-peer messaging, media delivery, grid services, distributed services for wireless networks, and overlay networks. Different paradigms offer different options for power management, net capacity, security, and other factors.

Information needs to be understood at a more abstract level. The Air Force needs a model and architecture for situation understanding and a means of incorporating situation modeling, model-based processing, situation projection, and top-down management of situation understanding in order to explore topics in information fusion. It also needs a scientific basis and technologies for multisensor fusion for air and ground targets. Some of these topics are extensions of ongoing work in intelligence, surveillance, and reconnaissance methods. These sorts of analyses need to account for the uncertainty of information sources, sometimes due to known error characteristics of instrumentation but also due to hard-to-quantify uncertainties such as those inherent to human-supplied intelligence.

An even bolder question would be, How can a computer understand data and information in context? In principle, background understanding of a mission or related intelligence could help a computer interpret information from the battlespace. For example, it might help identify objects in video or image data. If such context-dependent processing were possible, perhaps information-understanding algorithms could be embedded in sensors and networks to enable rapid data assessment and hence rapid situation assessment.

Source compression methods might also be possible when multiple sensors collect correlated observations, whether or not the collection was done with coordination. If the correlation is known, as is often the case (e.g., when data are bursty), redundancy can be compressed out. This is critical in order to fit the large volumes of sensed data into modest network capacities, especially in a difficult communication environment. One challenge is the compression of highly correlated data without fine coordination among sensing platforms.

Defense applications today involve a wide variety of information-bearing signals. Examples include audio and video communication signals, video and still light imagery, hyperspectral imagery, and radar imagery, to name but a few. Often, these different data acquisition modes will be employed simultaneously in the same system. To use such signals effectively, it is necessary to represent them in digital form in a hierarchy ranging from the basic uncompressed analog-to-digital form to a compressed digital form to a symbolic form.
Continuing research will be required to find multilevel representations of multimodal signal sets so that they can be efficiently stored, transmitted, manipulated, and understood by digital techniques. Although much progress has been made in standardizing signal and data representations (e.g., MPEG7 and XML), additional AFOSR-sponsored research is needed to focus on Air Force signal representation problems. Given appropriate digital representations of multimodal signal sets, it is also important to find new ways to search for desired information and to aggregate that information into a convenient form. This means that basic digital signal processing topics such as signal enhancement, signal modeling, redundancy removal, and feature extraction will continue to be important fundamental research topics. This also requires advances in multimedia data mining and machine learning in the defense context. Combining information across different signal modes and fusing that information into robust multimedia “documents” with high semantic value is another topic of broad interest beyond just the Air Force, but there are also Air Force-specific aspects to this challenge.
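
The earlier remark that redundancy can be "compressed out" when sensors collect correlated observations can be quantified with entropies. The sketch below is an illustration only: the two reading sequences are made-up data, and the empirical-entropy estimate assumes independent, identically distributed samples, which real sensor streams rarely satisfy. It compares the cost of coding each stream separately, H(A) + H(B), with the joint entropy H(A, B); the difference is the mutual information, the number of bits per sample that exploiting the correlation could save in principle.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy of a sequence, in bits per sample."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical quantized readings from two sensors observing the same scene.
sensor_a = [0, 0, 1, 1, 2, 2, 3, 3]
sensor_b = [0, 0, 1, 1, 2, 2, 3, 0]  # matches sensor_a except for the last reading

h_a = entropy(sensor_a)
h_b = entropy(sensor_b)
h_joint = entropy(list(zip(sensor_a, sensor_b)))

separate_bits = h_a + h_b                # coding each stream on its own
savings_bits = separate_bits - h_joint   # mutual information between the streams
```

The Slepian-Wolf theorem of information theory states that, for such correlated sources, encoders that do not communicate with each other can still approach the joint-entropy rate if the streams are decoded together. Building practical codes that do so for real sensing platforms is part of the open challenge the text describes.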
Throughout information management and integration, there is a critical need for security, but many of the research challenges are military-specific. A very difficult challenge is how to enable the multilevel security needed in a fully functional publish-subscribe system. However, some of the most important issues there are policy ones, not technical ones. There is a great need for research in defensive and offensive information operations, some of which has already begun at the AFRL. There is also a great need for better understanding of how humans react to information operations (that is, what actions will create the desired effects) and ways to evaluate options for influence operations, trade-offs, and courses of action. Damage assessment for influence operations is a special challenge so far unexplored.

RECOMMENDED BASIC RESEARCH IN INFORMATION MANAGEMENT AND INTEGRATION

Following are areas of information management and integration functionality that are important for future Air Force systems and are either beyond the state of the art and require basic research or are areas where progress has been slow and could be sped up by additional research initiatives:

Large-scale sensor networks need to support queries, because it is often infeasible to send all sensor data to a central location to process queries. Research is needed to understand where to place query-processing functionality when nodes have limited power, limited computing capability, and limited network bandwidth and connectivity. Conversely, as network and power limitations are overcome, higher data rates can be expected, which will lead to optimization trade-offs different from those being investigated under today’s limitations.

Applications needed in the field will demand coping with dynamic, late-binding, and unpredictable structure in the data, or even no structure at all.
Accordingly, dealing with semistructured content with unpredictable structure (for the data model, for querying and routing documents, and for efficient execution) is critical.

Data arriving from different sensors and other sources may be inconsistent. Yet they need to be rolled up into a coherent view of the environment, thereby being turned into knowledge. Technologies are needed for fusing uncertain or inconsistent data and for querying databases containing incomplete or inconsistent information. Some of these technologies may be generic across arbitrary data types, while others may be specialized, for example, for merging location information from mobile sensors. It is also important to be able to mine inconsistent data for patterns, to obtain hints about a coherent view without necessarily being able to construct such a view. It may be helpful to create and maintain an ontology of the information domain, to help direct the data mining or interpret the results.

Database mechanisms are needed to track the lineage of all stored data, from its initial source through transformations and updates that were applied to it. Solutions exist for tracing schema information, for example, from the column of an output report to the input data sources and transformations of those inputs that generate it. However, much less is known about how to track lineage at the level of data instances, especially when arbitrary transformations, not just database queries, can be applied.

Database system mechanisms are needed to deduce the degree of certainty of query results as a function of the uncertainty of the raw data, under appropriate probabilistic models. Uncertainty may arise, for example, from measurement errors, incomplete structure, imprecise knowledge of the location where a measurement was made, data cleaning of raw data, or conflicting data from different sources. Some of the challenges are to identify mathematical models of uncertainty that have general utility across a variety of applications, to develop efficient deduction algorithms for these models, and to find optimal ways of making these algorithms effective in the context of information management systems (e.g., in a database system, in middleware, or in application programs).

New approaches are needed to reduce elapsed time and level of effort to integrate independently designed databases, such as the databases supporting each participant in a joint mission.
Such approaches must be embodied in tools that can help an engineer design, develop, and test mappings between databases, find and reuse existing mappings in the same domain as the mapping to be generated, and evolve and customize mappings as data sources change and new ones are added. Here, too, the creation and maintenance of an ontology may be useful to organize a repository of schemas and mappings.

The Air Force needs advanced distributed systems technology in support of battlespace information management. As discussed above, in the section on major information management challenges, publish-subscribe is one important distributed systems infrastructure, though other distributed computing paradigms should also be explored. Although there is much research on publish-subscribe systems for centralized infrastructure, many improvements in the functionality of publish-subscribe systems are still needed and will require additional research. These include scalability to wide-area, unstable network environments; filtering outdated or redundant information; coping with intermittent availability of event streams; supporting more expressive subscription rules; improved optimization of rule execution; processing strategies that account for context, timeliness, and priority; and incorporating security guarantees. Architectural principles that enable us to create secure, scalable, reliable publish-subscribe platforms would also be valuable.

Air Force distributed systems need to be interoperable, secure, massive in scale, highly available, dynamically reconfigurable, and easily managed. These system-level characteristics are notoriously more difficult to engineer than component-level capabilities. Moreover, some military requirements for these system-level characteristics are beyond today’s state of the art.

The data that flow through such distributed systems are subject to the same integration problems described above for information management. For example, identifying errors, cleaning the data, fusing data from multiple sources into a coherent view, and tracking data lineage are all relevant to data flowing through a distributed system. Some data must be moved from a real-time publish-subscribe system into a database to enable historical queries of (possibly recent) publications, to obtain answers to point questions, and to identify trends.

System management is important, especially in an environment where information publishers, subscribers, and even major servers are unreliable.
System managers must know who is connected, must be notified when users and information sources connect and disconnect, must be warned of impending performance and reliability problems, must be helped in understanding the impact of these problems on system behavior, and must be able to reconfigure the system dynamically in response to these and other changes. Again, techniques are needed to ensure these capabilities can be made available despite incompatible network architectures and security domains.

Performance and scalability, which are always a major concern, are especially hard to manage in heterogeneous distributed systems that are frequently reconfigured as components and communications fail and recover. For example, quality of service for real-time streaming of audio and visual information needs to be maintained, while avoiding overload that interferes with concurrent high-priority interactive workloads.
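
Two of the research areas recommended above, tracking lineage at the level of data instances and deducing the certainty of a result from the uncertainty of the raw data, can be illustrated together with a toy example. The sketch below rests on loudly stated assumptions: the `Report` record, the identifiers, and the confidence values are hypothetical, and the fusion rule is a simple noisy-OR that treats the reports' errors as statistically independent. Finding models of uncertainty that are actually appropriate is exactly the open problem the text identifies, so this is one crude candidate, not a recommended method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical data instance carrying a confidence and a lineage trail."""
    report_id: str
    target: str
    confidence: float     # assumed probability that the report is correct
    lineage: tuple = ()   # ids of the reports this one was derived from


def fuse(reports, fused_id):
    """Combine reports about one target, assuming independent errors."""
    targets = {r.target for r in reports}
    if len(targets) != 1:
        raise ValueError("all reports must refer to the same target")
    # Noisy-OR: P(at least one report is correct)
    #         = 1 - product over reports of P(that report is wrong).
    p_all_wrong = 1.0
    for r in reports:
        p_all_wrong *= 1.0 - r.confidence
    # Instance-level lineage: record exactly which inputs produced the result.
    sources = tuple(r.report_id for r in reports)
    return Report(fused_id, targets.pop(), 1.0 - p_all_wrong, sources)


radar = Report("radar-17", "vehicle-A", 0.6)
human = Report("humint-03", "vehicle-A", 0.5)
fused = fuse([radar, human], "fused-01")
```

Because the fused record keeps the identifiers of its inputs, a later correction to `radar-17` could be traced forward to every derived result that depends on it, which is the kind of instance-level lineage the text says is poorly understood for arbitrary transformations.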