Broadening Research Interactions
The Electronic Records Archives (ERA) program of the National Archives and Records Administration (NARA), currently being undertaken to build significant new capabilities for preserving digital records, has contracted out for two system designs in preparation for the acquisition of an operational system. As a prelude and ongoing complement to procurement activities, the ERA program has sponsored several research and development (R&D) activities (Box 4.1). This chapter examines the role that future research could play in helping NARA better understand and address emerging and future technology issues and medium- and long-term technical challenges.
RATIONALE FOR SUPPORTING RESEARCH
Information technology (IT) research has traditionally been concentrated in several federal agencies, many of which have research as part of their central mission.1 A recent National Research Council (NRC) report, Information Technology Research, Innovation, and E-Government,2 broadly considers the rationale for IT research by a broader set of federal mission agencies. It concludes that, where possible, government should follow the private sector in designing and implementing IT-based services. However, the report also concludes that there are some areas, such as the long-term preservation of government records, in which government requirements differ, at least somewhat, from those in the commercial world. In these areas, the report says, federal agencies should recognize and act on their role as a demand leader.
The Web site of the Electronic Records Archives program lists the following as current research projects:
SOURCE: National Archives and Records Administration. 2005. Electronic Records Archives: Research. Available online at <http://www.archives.gov/electronic_records_archives/research/research.html>. Accessed April 8, 2005.
The NRC report goes on to characterize some of the benefits of research activities for both government agency sponsors and researchers. Government stands to benefit from collaboration between government agencies and the IT research community despite the fact that these two groups are at opposite ends of an extensive supply chain that also includes vendors and system integrators. Although government agencies and researchers may appear to be unlikely allies, they have a shared interest in innovation and in meeting future needs.
Through research, agencies can gain an understanding of emerging and future technologies, and of the technology risks they face. The most effective form of risk reduction occurs when an organization does in-house testing of research prototypes; with in-house testing the organization gains firsthand experience about what is easy and what is not, where the system is reliable and where it fails, and how to run the system on a daily basis. A somewhat-less-effective form of risk reduction occurs when an external organization develops and tests the prototype; in this case the organization learns something, but most of the expertise about how the system really works resides within the external research organization. Agencies also gain an opportunity to influence the trajectory of technologies through engagement with researchers. Research should not, however, be seen as a way of short-circuiting the IT supply chain of researchers, technology vendors, and system integrators, and agencies should not necessarily be early adopters. Even so, when agencies and researchers work together, the overall risk of innovation in acquisition can be reduced.
Absorptive capacity for research results requires in-house people who can be engaged at a technical level with relevant research communities and who will take results from the researchers, test them in-house, and explain and propagate results within the agency. Mere funding is not enough. The real goal is substantive technical coupling between the research community and the agency that allows the organization to learn about new technologies, to learn more broadly about a research area than it would otherwise, and to understand how to apply research that it is not itself supporting.
RESEARCH MANAGEMENT FOR AGENCIES WITHOUT A TRACK RECORD IN SPONSORING RESEARCH IN INFORMATION TECHNOLOGY
For an agency such as NARA that has neither a long track record of sponsoring IT research nor a large research budget, how can limited research resources be spent most effectively? Information Technology Research, Innovation, and E-Government observes that most federal agencies have limited experience in and capability for managing IT research programs. The report goes on to recommend that consideration be given to cross-agency collaboration with agencies that already have IT research programs and/or that share common research interests.
Partners with Research Management Expertise
Several federal agencies, most notably the National Science Foundation (NSF) and the Defense Advanced Research Projects Agency (DARPA), but also the National Institute of Standards and Technology, the National Aeronautics and Space Administration (NASA), the Department of Energy, the National Institutes of Health and its National Library of Medicine, the armed services laboratories, and others, have extensive expertise in the management of IT research projects. The selection of specific research proposals and routine management of the research process—for example, selecting among peer-reviewed IT research proposals or maintaining an arm’s-length relationship with research organizations—are capabilities that these agencies have developed over time as they have gained experience, identified and worked with particular research communities, brought in researchers to manage programs, and otherwise developed research management talent. These organizations maintain relationships with varied research and user communities, which gives them easy access to a wide range of expertise and peer review when needed.
The agencies referred to above also have experience in launching research programs in new areas, including the building of a research community around a set of technical topics that were not previously identified as related or important. This process is more complicated than simply stating a problem and looking for people to work on it. Researchers will not be drawn to work on problems that are not well thought out, for example, or that are shallow or short-term enough that they do not lead to significant research results (e.g., sufficient for the promotion of academics or the funding of start-up companies). Strong, successful research communities often are developed by several customers and funders. The digital libraries program managed by NSF is a highly relevant example of successful program development and community building.
Exchanging staff for limited assignments is a way of both learning about relevant research and letting others learn about one’s own research needs. NARA staff members could spend profitable 6-month assignments in other research organizations to learn what is on the cutting edge and let the researchers know what NARA’s problems are. Similarly, members of research organizations could spend profitable 6-month assignments inside NARA learning what NARA is really up against and offering suggestions on how to apply what they know to these problems.
Partners with Shared Research Interests
Many of NARA’s most basic needs for digital preservation and access are shared to varying degrees with organizations such as the Library of Congress, the National Library of Medicine, NASA, the National Oceanic and Atmospheric Administration (NOAA), and the intelligence agencies, all of which must maintain indefinite access to digital records. The Library of Congress has established the National Digital Information Infrastructure and Preservation Program (NDIIPP) to work collaboratively with federal agencies, research libraries, and other organizations that have an interest and expertise in digital preservation. It has partnered with the National Science Foundation to establish a research grant program focused on digital preservation and has launched joint research activities with four universities to investigate various approaches to digital preservation using a test collection.
A number of the technical problems faced by NARA are likely to be faced by other agencies as well. For example, semiautomatic redaction or declassification is a topic that could be investigated with partners such as DARPA or the intelligence community’s Advanced Research and Development Activity (ARDA). Another organization with shared interests in preservation is the Department of Veterans Affairs, which is currently developing the HealtheVet-VistA program to provide long-term access to veterans’ electronic health records (Box 4.2).
Partnership on Research-Oriented Prototypes
As NARA becomes engaged with a broader range of research communities, it will be natural for it to participate in research-oriented prototypes. The National Science Foundation’s Digital Government program has, for example, fostered close relationships between universities and agencies such as the Bureau of Labor Statistics, the Census Bureau, and the Environmental Protection Agency. NSF’s Digital Libraries program has fostered similar government and academic partnerships. These partnerships give the research community access to real data and users and thus an opportunity to receive real-world feedback about their research. In return, government agencies gain experience with new technologies before the innovations become commercially available, and the agencies develop long-term relationships with outside experts. Such partnerships and in-house testing of research prototypes should eventually become an important part of NARA’s research portfolio.
HealtheVet-VistA is a program currently under design. It will create an online environment in which veterans’ electronic health records, medical images such as x-rays, pathology slides, scanned documents, results of cardiology examinations, wound photos, and endoscopies, among other records, are stored for ready access.
Currently, the Department of Veterans Affairs (VA) treats more than 3 million patients annually in more than 1,300 health units, 163 hospitals, and 850 clinics. The Committee on Veterans’ Affairs of the U.S. House of Representatives has mandated that the VA retain all veterans’ health records (which are in electronic form) for 75 years after the death of veterans. Given technology obsolescence, this mandated retention period creates an enormous challenge for the VA. The VA also faces a comparable challenge in standardizing veterans’ electronic health records across all of its components and in implementing “upstream” creation and capture of veterans’ electronic health records that facilitate access for 75 years or more.
Setting aside the commitment of the National Archives and Records Administration (NARA) to permanent storage (i.e., forever) of selected electronic records created by federal agencies, the VA and NARA face a comparable challenge over the next hundred years in ensuring long-term access to electronic records. The two agencies also share common interests in approaches, tools, and procedures to support long-term access to electronic records.
ENGAGING THE RESEARCH COMMUNITY THROUGH MEANS OTHER THAN FUNDING RESEARCH
NARA can engage relevant research communities in a number of ways other than direct sponsorship of research. They include the following:
Hosting workshops and seminars. Workshops and seminars bring researchers and NARA employees together, allowing a two-way exchange of ideas and problems. The ERA program took advantage of this opportunity by holding a November 2004 conference, called “Partnerships in Innovation: Serving a Networked Nation,” organized in conjunction with the University of Maryland’s Institute for Advanced Computer Studies.
Hiring students. Another way of engaging research communities is to seek out and hire their students. This is well understood in industry as a very effective approach to technology transfer.
Suggesting important problems and providing access to artifacts (data, software, and system capabilities) and subject-matter experts. As the NRC report Information Technology Research, Innovation, and E-Government observes, working on government IT problems offers researchers access to applications with a “richness and texture often lacking in the laboratory” as well as other potential benefits.3
The opportunity of providing researchers with access to artifacts is discussed in more detail in the following subsection.
Providing Access to Artifacts
As a complement to sponsoring research, NARA can also engage the research community by providing data. This opportunity represents a very low cost yet effective way of engaging substantively with the research community. Academic researchers are often starved for data; they have good ideas, but no way to test them against reality. Providing interesting data in a form that the research community can use can be enough to get the best people in the world working on a problem, essentially for free. For example, Reuters created two databases of categorized newswire articles for research use and made them easy to acquire. Almost every text classification algorithm developed in the preceding decade was tested on these data sets, and hundreds of research papers have been published about them, none paid for by Reuters. Similar stories can be told about data sets from news organizations such as the Wall Street Journal, AP Newswire, Ziff-Davis, and the Financial Times, and about medical data sets based on the National Library of Medicine’s MedLine system.
There are many vehicles for providing data to the research community. Since 1992 the National Institute of Standards and Technology (NIST) has created a variety of widely used data sets that have focused research attention on a diverse set of problems, often of interest to the U.S. intelligence community. Some universities, for example the University of California at Irvine and the University of Pennsylvania, have specialized in creating and becoming long-term distributors of research data sets. Partnering with an organization that has prior experience in this area, such as NIST, the University of California at Irvine, or the Linguistic Data Consortium (LDC) at the University of Pennsylvania, would help NARA learn how to create data sets that are engaging and have long-term value to various research communities.
Chapter 3 in this report points out the benefits of providing an application programming interface (API) that gives direct access to some of the ERA’s digital archive so that third parties can develop new tools. This approach is particularly applicable to research initiatives. Google, for example, allows simple API-level access to its search engine to anyone who registers. The advantage to Google is that others can spend their own research dollars to develop applications that might eventually be interesting or useful to Google. API-level access can also provide access to users, enabling researchers to conduct experiments with live populations in real time, for example, by varying system characteristics and observing user behavior. Such access would be extremely attractive to a variety of computer science and social science research communities, and NARA could provide it at low cost relative to the benefits obtained.
RESEARCH CHALLENGES FACING THE NATIONAL ARCHIVES AND RECORDS ADMINISTRATION
The research problems facing NARA—and appropriate research strategies—fall into two categories:
Problems shared with other organizations. Many of NARA’s technical problems are the same as those faced by designers and operators of any large digital library or repository. For dealing with these problems, NARA’s greatest leverage will come from drawing on research sponsored by other organizations, learning about best practices, inducing others to work on problems of interest to NARA by offering corpora of content that researchers can use to test new ideas, and participating in joint research programs with other federal agencies and organizations that face similar challenges. Wherever technologies may be shared with other applications, partnerships to jointly address shared problems will help stretch limited research resources.
Problems specific to government archives and similar institutions. A few problems are specific to the preservation of records or otherwise unique to NARA and may need separate research thrusts. Working on these more specialized problems may require specific engagement with existing research communities. Partnership with agencies that have greater experience in managing IT research programs is also likely to be the most effective mechanism for addressing this class of research problems.
What are some of the specific research problems that NARA faces? The organization must develop its own list, but it need not start from scratch in doing so.4 The subsections below provide some illustrative examples of research problems in each class.
Research Problems Shared with Other Organizations
Following is a list of examples of research problems that NARA shares with other organizations, together with appropriate strategies.
Automatic classification. Traditional archiving processes rely heavily on controlled-vocabulary metadata, which is usually assigned manually. Large-scale use of automatic categorization to assign metadata provides a way to handle the expected flood of digital records in a cost-effective manner. There has been significant progress in this technology area during the past decade, and there is now a greater understanding of what makes problems “difficult” and “easy.” Automatic categorization, which is used routinely in a variety of commercial settings, is often as accurate as human categorization is. Automatic categorization must become a core competency for NARA. Human effort should be applied only when automatic solutions are inadequate. The use of this technology can be broadened to related problems such as filtering records for retention and distinguishing between federal and presidential records.
Engagement with leaders in the research community would provide NARA an opportunity to better understand the state of the art. NARA’s contributions in this case might well not involve direct support for research; NARA could make significant contributions to the research community by making corpora available, stating its own requirements clearly, and learning from research sponsored by other agencies.
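The division of labor described above, in which software assigns categories and human effort is reserved for cases where the automatic result is inadequate, can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-one smoothing, chosen here only because it is compact; the report does not name a particular algorithm. The class name, the `min_margin` threshold, and the whitespace tokenizer are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, labeled_docs):
        """labeled_docs is an iterable of (text, label) pairs."""
        for text, label in labeled_docs:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def scores(self, text):
        """Return a log-probability score for each known label."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            result[label] = score
        return result

    def classify(self, text, min_margin=1.0):
        """Return (label, needs_review).

        needs_review is True when the best label beats the runner-up by less
        than min_margin (a hypothetical threshold), signaling that a person
        should check the assignment.
        """
        ranked = sorted(self.scores(text).items(), key=lambda kv: kv[1], reverse=True)
        best_label, best_score = ranked[0]
        margin = best_score - ranked[1][1] if len(ranked) > 1 else float("inf")
        return best_label, margin < min_margin
```

In use, a classifier trained on manually categorized records would label new records directly and route only the low-margin ones to staff. How well such a scheme works on archival material is an empirical question, and it is the kind of question that shared corpora would let researchers answer.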
Scholars’ tools for searching large collections of records. Some shortcomings of record classification can be remedied by increasingly powerful search methods. While a casual searcher may be content with a result that is “good enough,” scholars want better tools and are likely to be continually pressing the limits of technology. For example, Google already offers a “Google Scholar” search designed to find online scholarly material, and libraries and technical literature publishers are beginning to offer customized searching. Many of these emerging techniques will probably be useful to scholars searching the ERA, but the archives may pose some unique problems. This is a case in which NARA could stimulate a research community with a database of archival records and a set of demanding users with specific needs.
See, for example, Library of Congress and National Science Foundation, 2003, It’s About Time: Research Challenges in Digital Archiving and Long-Term Preservation, Margaret Hedstrom (ed.), Library of Congress, Washington, D.C., p. ix, available online at <http://www.digitalpreservation.gov/repor/NSF_LC_Final_Report.pdf>; accessed May 1, 2005.
Preserving computer programs. Improved capabilities for preserving software may be of general interest to the community interested in digital preservation, including NARA, but they may or may not be required by NARA to preserve records. Here a modest level of funding in a jointly funded program of research looking broadly at digital preservation problems might be appropriate.
How to preserve Web sites. Government Web sites publish information in a wide variety of formats, use static and dynamic information, use information from a single source or many sources, and present general or user-specific views of information. Government Web sites also collect information from the public—for example, public comments on proposed regulations. What government Web site material should be archived? The variety of problems posed by government Web sites could be viewed as daunting, but it can also be an advantage in that the sites afford a means to gather much valuable information. Of course not all Web sites produce unique “records”; some are merely access portals to an underlying set of records that may be archived in other ways. Web sites should be a particular priority for archiving because they are so dynamic and also because they provide information that may not be captured by other archiving processes. An essential part of the national historical record may have simply been lost because government Web sites were not systematically archived in the past decade. (Serious efforts aimed at archiving Web content, however, are being made by such organizations as the Internet Archive.) NARA will need to co-develop policies and technical approaches for handling Web sites.
The interplay between information technology and human and organizational behavior related to records. NARA, like many other organizations, has an interest in understanding how people and organizations use new technologies and an interest in the implications of this use for new forms of communication, new types of records, and new divisions of responsibilities for administrative and record-keeping activities. Relevant areas of research include business-process redesign, organizational restructuring as a result of new information technologies, adoption and adaptation of new technologies, work flow analysis and design, compliance with policies and requirements, and design of incentive mechanisms.
Problem Areas in Which NARA and Similar Institutions Have Special Interests
How to tease out (or define) records in interactive systems. Interactions with and within government increasingly involve ephemeral views not stored as records but merely computed from online databases through software that itself changes over time. Web sites (discussed above) are one important example.
Transactions systems. How should government transactions systems be preserved? Are there uniform ways (e.g., logs or snapshots) to archive them? Or should transactions systems be designed to emit records in order to preserve records as part of the standard work flow?
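The last option raised above, a transactions system designed to emit records as part of its normal work flow, can be sketched in a few lines. This is an illustration of the idea only, not a proposed design: the ledger, its field names, and the use of JSON lines as the archival format are all assumptions introduced here.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RecordEmittingLedger:
    """Toy transactions system that emits one archival record per transaction."""

    balances: dict = field(default_factory=dict)
    # A plain list stands in for append-only archival storage (an assumption).
    archive_log: list = field(default_factory=list)

    def apply(self, account, amount, reason):
        """Apply a transaction and emit its record in the same step."""
        before = self.balances.get(account, 0)
        after = before + amount
        self.balances[account] = after
        record = {
            "sequence": len(self.archive_log) + 1,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "account": account,
            "amount": amount,
            "reason": reason,
            "balance_before": before,
            "balance_after": after,
        }
        # Serialize to self-describing text so the record can be read later
        # without the software that produced it.
        self.archive_log.append(json.dumps(record, sort_keys=True))
        return after


def replay(archive_log):
    """Rebuild the final balances from the emitted records alone."""
    balances = {}
    for line in archive_log:
        record = json.loads(line)
        balances[record["account"]] = record["balance_after"]
    return balances
```

The `replay` function is the point of the sketch: if the emitted records are sufficient to reconstruct the system's state, the archive does not depend on preserving the live database or its software. Whether that holds for real government systems, and at what cost, is the open research question.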
Providing digital assurances over a long period of time. The computer science and cryptography to support authenticity, integrity, and chain-of-custody requirements associated with long-term preservation are likely to be leading edge for the foreseeable future. A research program would explore new techniques to be able to assure integrity and authenticity of records held for many decades. Is an end-to-end scheme that would span hundreds of years possible?
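One well-known building block for integrity and chain-of-custody assurances is a hash chain, in which each custody event is bound to the record's digest and to the previous event. The sketch below assumes SHA-256 and an unchanging bitstream; the function names and the entry layout are hypothetical. It does not answer the question posed above, because any single hash function may be weakened over decades and format migration changes the bits being hashed.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous digest" for the first entry


def entry_digest(previous_digest, record_digest, event):
    """Digest binding one custody event to the record and to the prior entry."""
    payload = json.dumps(
        {"previous": previous_digest, "record": record_digest, "event": event},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(chain, record_bytes, event):
    """Append a custody event (e.g., 'accessioned') for the given record."""
    previous = chain[-1]["digest"] if chain else GENESIS
    record_digest = hashlib.sha256(record_bytes).hexdigest()
    chain.append({
        "event": event,
        "record": record_digest,
        "previous": previous,
        "digest": entry_digest(previous, record_digest, event),
    })


def verify(chain, record_bytes):
    """Return True if the chain is intact and every entry matches the record."""
    previous = GENESIS
    record_digest = hashlib.sha256(record_bytes).hexdigest()
    for entry in chain:
        if entry["previous"] != previous or entry["record"] != record_digest:
            return False
        if entry["digest"] != entry_digest(previous, entry["record"], entry["event"]):
            return False
        previous = entry["digest"]
    return True
```

Altering the record, or editing or removing an earlier custody entry, causes `verify` to fail, provided the final digest is itself kept somewhere trustworthy. Extending such a scheme across algorithm changes and format migrations is part of what a research program in this area would need to explore.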
With necessarily limited resources, a limited capacity to manage research programs, and a large set of research challenges related to the general topic of archiving and digital preservation, NARA has to decide how and where it should invest. Questions for NARA to consider include these: What agencies share an interest in NARA’s problem or a similar problem? What research community is working on that problem or a similar problem? How can NARA most effectively join with and influence that community? Does NARA need to sponsor its own research, or can it learn from or participate in others’ research programs? What results are anticipated, and how will the results be brought back to NARA?
As the ERA starts to operate, NARA will surely uncover new problems. Some of the problems will lead to evolution of the system and its operation in straightforward ways—that is, known methods or engineering techniques will solve them. But operation may also reveal deeper problems worthy of new research activities.