
Suggested Citation:"2 Research Opportunities." National Research Council. 2000. Summary of a Workshop on Information Technology Research for Federal Statistics. Washington, DC: The National Academies Press. doi: 10.17226/9874.


2 Research Opportunities

Research opportunities explored in the workshop's panel presentations and small-group discussions are described in this chapter, which illustrates the nature and range of information technology (IT) research issues (including human-computer interaction, database systems, data mining, metadata, information integration, and information security) that arise in the context of the work being conducted by the federal statistical agencies. The chapter also touches on two other challenges pertinent to the work of the federal statistical agencies: survey instruments and the need to limit disclosure of confidential information. This discussion represents neither a comprehensive examination of IT challenges nor a prioritization of research opportunities, and it does not attempt to focus on the more immediate challenges associated with implementation.

HUMAN-COMPUTER INTERACTION

One of the real challenges associated with federal statistical data is that the people who make use of it have a variety of goals. There are, first of all, hundreds or thousands of specialists within the statistical system who manipulate the data to produce the reports and indices that government agencies and business and industry depend on. Then there are the thousands, and potentially millions, of persons in the population at large who access the data. Some users access statistical resources daily, others only occasionally, and many others only indirectly, through third parties, but all depend in some fashion on these resources to support important decisions. Federal statistics resources support an increasingly diverse range of users (e.g., high school students, journalists, local community groups, business market analysts, and policy makers) and tasks. The pervasiveness of IT, exemplified by the general familiarity with the Web interface, is continually broadening the user base.

BOX 2.1

Some Policy Issues Associated with Electronic Dissemination

In her presentation at the workshop, Patrice McDermott, from OMB Watch, observed that if information suddenly began to be disseminated by electronic means alone, some people would no longer be able to access it. Even basic telephone service, a precursor for low-cost Internet access, is not universal in the United States. It is not clear that schools and libraries can fill the gap: schools are not open, for the most part, to people who do not have children attending them, and finding resources to invest in Internet access remains a challenge for both schools and public libraries. McDermott added that research by OMB Watch indicates that people see a substantial difference between being directed to a book that contains Census data and being helped to access and navigate through online information. Another issue is the burden imposed by the shifting of costs: if information is available only in electronic form, users and intermediaries such as libraries end up bearing much of the cost of providing access to it, including, for example, the costs of telecommunications, Internet service, and printing.

Workshop participants observed, however, that many people are likely to remain without ready access to information online, raising a set of social and policy questions (Box 2.1). Over time, a growing fraction of potential users can be expected to gain network access, making it increasingly beneficial to place information resources online, together with capabilities that support their interpretation and enhance the statistical literacy of users. In the meantime, online access is being complemented by published sources and by the journalists, community groups, and other intermediaries who summarize and interpret the data.

The responsibility of a data product designer or provider does not end with the initial creation of that product. There are some important human-computer interaction (HCI) design challenges in supporting a wide range of users. A key HCI design principle is “know thy user”; various approaches to learning about and understanding user abilities and needs are discussed below. Besides underscoring the need to focus on users, workshop participants pointed to some specific issues: universal access, support for users with limited statistical literacy, improved visualization techniques, and new modes of interacting with data. These are discussed in turn below.


User Focus

Iterative, user-centered design and testing are considered crucial to developing usable and useful information products. A better understanding of typical users and the most common tasks they perform, which could range from retrieving standard tables to building sophisticated queries, would facilitate the design of Web sites to meet those users' needs. One important approach discussed at the workshop is to involve the user from the start, through various routine participatory activities, in the design of sites. Capturing people's routine interactions with online systems to learn what users are doing, what they are trying to do, what questions they are asking, and what problems they are having makes it possible to improve the product design. If, for example, a substantial number of users are seen to ask the same question, the system should be modified to ensure that the answer to this question is easily available, an approach analogous to the “frequently asked questions” concept. Customer or market surveys can also be used in conjunction with ongoing log and site analyses to better understand the requirements of key user groups. There are many techniques that do not associate data with individuals and so are sensitive to privacy considerations.¹ For example, collecting frequent queries requires aggregation only at the level of the site, not of the individual. Where individual-level data are useful, they could be made anonymous.
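As a rough illustration of site-level aggregation, the Python sketch below counts how often each query appears in a log while never reading the user identifier. The record layout, the field names (`user`, `query`), and the `min_count` threshold are assumptions made for this example, not features of any agency system.

```python
from collections import Counter


def frequent_queries(log_records, min_count=2):
    """Return (query, count) pairs seen at least min_count times.

    Only the query text is read; any user identifier in a record is
    ignored, so counts are aggregated at the level of the site.
    """
    counts = Counter(
        " ".join(rec["query"].lower().split())  # normalize case and spacing
        for rec in log_records
    )
    return [(q, n) for q, n in counts.most_common() if n >= min_count]


# Hypothetical log records; the "user" field is present but never read.
sample_log = [
    {"user": "a", "query": "Unemployment rate"},
    {"user": "b", "query": "unemployment  rate"},
    {"user": "c", "query": "median income by state"},
    {"user": "a", "query": "unemployment rate"},
]

top = frequent_queries(sample_log)
```

Queries that clear the threshold are candidates for a prominently placed answer, in the spirit of the “frequently asked questions” approach described above.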

Universal Access

The desire to provide access to statistical information for a broad range of citizens raises concerns about what measures must be taken to ensure universal access.² Access to computers, once the province of a small number of expert programmers, now extends to a wider set of computer-literate users and an even larger segment of the population sufficiently skilled to use the Web to access information. The expanding audience for federal statistical data represents both an opportunity and a challenge for information providers.

¹	Data on user behavior must be collected and analyzed in ways that are sensitive to privacy concerns and that avoid, in particular, tracking the actions of individuals over time (though this inhibits within-subject analyses). There are also the matters related to providing appropriate notice and obtaining consent for such monitoring.

²	This term, similar to the more traditional label “universal service,” also encompasses economic and social issues related to the affordability of access services and technology, as well as the provision of access through community-based facilities, but these are not the focus of this discussion.


Universality considerations apply as well to the interfaces people use to access information. The Web browser provides a common interface across a wide range of applications and extends access to a much larger segment of the population (anyone with a browser). However, the inertia associated with such large installed software bases tends to slow the implementation of new interface technologies. During the workshop, Gary Marchionini argued that adoption of the Web browser interface has locked in a limited range of interactions and in some sense has set interface design back several years. A key challenge in ensuring universal access is finding upgrade trajectories for interfaces that maximize access across the broadest possible audience.³

Providing access to all citizens also requires attention to the diverse physical needs of users. Making every Web site accessible to everyone requires more than delivering just a plain-text version of a document, because such a version lacks the richness of interaction offered by today's interfaces. Some work is already being done; vendors of operating systems, middleware, and applications provide software hooks that support alternative modes of access. The World Wide Web Consortium is establishing standards and defining such hooks to increase the accessibility of Web sites.

Another dimension of universal access is supporting users whose systems vary in terms of hardware performance, network connection speed, and software. The installed base of networked computers ranges from Intel 80286 processors using 14.4-kbps modems to high-performance computers with optical fiber links that are able to support real-time animation. That variability in the installed base presents a challenge in designing new interfaces that are also compatible with older systems and software.

Literacy, Visualization, and Perception

Given the relatively low level of numerical and statistical literacy in the population at large, it becomes especially important to provide users with interfaces that give them useful, meaningful information. Providing data with a bad interface that does not allow users to interpret data sensibly may be worse than not providing the data at all, because the bad interface frustrates nonexpert users and wastes their time. The goal is to provide not merely a data set but also tools that allow making sense of the data. Today, most statistical data is provided in tabular form, the form of presentation with which the statistical community has the longest experience. Unfortunately, although it is well understood by both statisticians and expert users, this form of presentation has significant limitations. Tables can be difficult for unsophisticated users to interpret, and they do not provide an engaging interface through which to explore statistical survey data. Also, the types of analyses that can be conducted using summary tables are much more limited than those that can be conducted when access to more detailed data is provided. Workshop participants pointed to the challenge of developing more accessible forms of presentation as central to expanding the audience for federal statistical data.

³	See Computer Science and Telecommunications Board, National Research Council. 1997. More Than Screen Deep: Toward Every-Citizen Interfaces to the Nation's Information Infrastructure. National Academy Press, Washington, D.C.

Statistics represent complex information that might be thought of as multimedia. Even data tables, when sufficiently large, do not lend themselves to display as simple text. Many of the known approaches to multimedia—such as content-based indexing and retrieval—may be applicable to statistical problems as well. Visualization techniques, such as user-controlled graphical displays and animations, enable the user to explore, discover, and explain trends, outliers, gaps, and jumps, allowing a better understanding of important economic or social phenomena and principles. Well-designed two-dimensional displays are effective for many tasks, but researchers are also exploring three-dimensional and immersive displays. Advanced techniques such as parallel coordinates and novel coding schemes, which complement work being done on three-dimensional and immersive environments, are also worthy of study.

Both representation (what needs to be shown to describe a given set of data) and control (how the user interacts with a system to determine what is displayed) pose challenges. Statisticians have been working on the problem of representation for a very long time. Indeed a statistic itself is a very concise condensation of a very large collection of information. More needs to be done in representing large data sets so that users who are not sophisticated in statistical matters can obtain, in a fairly compact way, the sense of the information in large collections of data. Related to this is the need to provide users with appropriate indications of the effects of sampling error.

Basic human perceptual and cognitive abilities affect the interpretation of statistical products. Amos Tversky and others have identified pervasive cognitive illusions, whereby people try to see patterns in random data.⁴ In the workshop presentation by Diane Schiano, evidence was offered of pervasive perceptual illusions that occur in even the simplest data displays. People make systematic errors in estimating the angle of a single line in a simple two-dimensional graph and in estimating the length of lines and histograms. These are basic perceptual responses that are not subject to cognitive overrides to correct the errors. As displays become more complex, the risk of perceptual errors grows accordingly. Three-dimensional graphics, moreover, are often applied when they should not be, such as when the data are only two-dimensional. More generally, because complex presentations and views can suggest incorrect conclusions, simple, consistent displays are generally better.

⁴	See A. Tversky and D. Kahneman. 1974. “Judgment Under Uncertainty: Heuristics and Biases,” Science 185:1124-1131. One such heuristic/bias is the perception of patterns in random scatter plots. See W.S. Cleveland and R. McGill. 1985. “Graphical Perception and Graphical Methods for Analyzing Scientific Data,” Science 229 (August 30):828-833.

The interpretation of complex data sets is aided by good exploratory tools that can provide both an overview of the data and facilities for navigating through them and zooming in (or “drilling down”) on details. To illustrate the navigation challenge, Cathryn Dippo of the Bureau of Labor Statistics noted that the Current Population Survey's (CPS's) typical monthly file alone contains roughly 1,000 variables, and the March file contains an additional 3,000. Taking into account various supplements to the basic survey, the CPS has 20,000 to 25,000 variables, a number that rapidly becomes confusing for a user trying to interpret or even access the data. That figure is for just one survey; the surveys conducted by the Census Bureau contain some 100,000 variables in all.

Underscoring the importance of providing users with greater support for interaction with data, Schiano pointed to her research that found that direct manipulation through dynamic controls can help people correct some perceptual illusions associated with data presentation. Once users are allowed to interact with an information object and to choose different views, perception is vastly improved. Controls in common use today are limited largely to scrolling and paging through fairly static screens of information. However, richer modes of control are being explored, such as interfaces that let the user drag items around, zoom in on details, and aggregate and reorder data. The intent is to allow users to manipulate data displays directly in a much more interactive fashion.

Some of the most effective data presentation techniques emerging from human-computer interaction research involve tightly coupled interactions. For example, when the user moves a slider (a control that allows setting the value of a single variable visually), that action should have an immediate and direct effect on the display; users are not satisfied by an unresponsive system. Building systems that satisfy these requirements in the Web environment, where network communications latency delays data delivery and makes it hard to tightly couple a user action and the resulting display, is an interesting challenge. What, for example, are the optimal strategies for allocating data and processing between the client and the server in a networked environment in order to support this kind of interactivity?

Two key elements of interactivity are the physical interface and the overall style of interaction. The trend in physical interfaces has been toward a greater diversity of devices. For example, a mouse or other two-dimensional pointing device supplements keyboard input in desktop computing, while a range of three-dimensional interaction devices are used in more specialized applications. Indeed, various sensors are being developed that offer enhanced direct manipulation of data. One can anticipate that new ways of interacting will become commonplace in the future. How can these diverse and richer input and output devices be used to disseminate statistical information better? The benefits of building more flexible, interactive systems must be balanced against the risk that the increased complexity can lead unsophisticated users to draw the wrong conclusions (e.g., when they do not understand how the information has been transformed by their interactions with it).

Also at work today is a trend away from static displays toward what Gary Marchionini termed “hyperinteraction,” which leads users to expect quick action and instant access to large quantities of information by pointing and clicking across the Web or by pressing the button on a TV remote control. An ever-greater fraction of the population has such expectations, affecting how one thinks about disseminating statistical information.

DATABASE SYSTEMS

Database systems cover a range of applications, from the large-scale relational database systems widely used commercially, to systems that provide sophisticated statistical tools and spreadsheet applications that provide simple data-manipulation functionality along with some analysis capability. Much of the work today in the database community is motivated by a commercial interest in combining transactions, analysis, and mining of multiple databases in a distributed environment. For example, data warehouse environments—terabyte or multiterabyte systems that integrate data from various locations—replicate transactions databases to support problem solving and decision making. Workshop participants observed that the problems of other user communities, such as the federal statistics community, can be addressed in this fashion as well.

Problems cited by the federal statistics community include legacy migration, information integration across heterogeneous databases, and mining data from multiple sources. These challenges, perhaps more mundane than the splashier Web development activities that many IT users are focused on, are nonetheless important. William Cody noted in the workshop that the database community has not focused much on these hard problems but is now increasingly addressing them in conjunction with its application partners. Commercial systems are beginning to address these needs.

Today's database systems do not build in all of the functionality to perform many types of analysis. There are several approaches to enhancing functionality, each with its advantages and disadvantages. Database systems can be expanded in an attempt to be all things to all people, or they can be constructed so that they can be extended using their own internal programming language. Another approach is to give users the ability to extract data sets for analysis using other tools and application languages. Researchers are exploring what functions are best incorporated in databases, looking at such factors as the performance trade-offs between the overhead of including a function inside a database and the delay incurred if a function must be performed outside the database system or in a separate database system.

Building increased functionality into database systems offers the potential for increasing overall processing efficiency, Cody observed. There are delays inherent in transferring data from one database to another; if database systems have enhanced functionality, processing can be done on a real-time or near-real-time basis, allowing much faster access to the information. Built-in functionality also permits databases to perform integrated tasks on data inside the database system. Also, relational databases lend themselves to parallelization, whereas tools external to databases have not been built to take as much advantage of it. Operations that can be included in the database engine are thus amenable to parallelization, allowing parallel processing computing capabilities to be exploited.

Cody described the likely evolution over the coming years of an interactive, analytic data engine, which has as its core a database system enriched with new functions. Users would be able to interact with the data more directly through visualization tools, allowing interactive data exploration. This concept is simple, but selecting and building the required set of basic statistical operations into database systems and creating the integration tools needed to use a workstation to explore databases interactively are significant challenges that will take time. Statistics-related operations that could be built into database systems include the following:

Data-mining operations. By bringing data-mining primitives into the database, mining operations can occur automatically as data are collected in operational systems and transferred into warehousing systems rather than waiting until later, after special data sets have been constructed for data mining.
Enhanced statistical analysis. Today, general-purpose relational database systems (as opposed to database systems specifically designed for statistical analysis) for the most part support only fairly simple statistical operations. A considerable amount of effort is being devoted to figuring out which additional statistical operators should and could be included in evolving database systems. For example, could one perform a regression or compute statistical measures such as covariances and correlations directly in the database?
Time series operators. The ability to conduct a time-series analysis within a database system would, for example, allow one to derive a forecast based on the information coming in real time to a database.
Sampling. Sampling design is a sophisticated practice. Research is addressing ways to introduce sampling into database systems so that the user can make queries based on samples and obtain confidence limits around these results. While today's database systems use sampling during the query optimization process to estimate the result sizes of intermediate tables, sampling operators are not available to the end-user application. SQL, which is the standard language used to interact with database systems, provides a limited set of operations for aggregating data, although this has been augmented with the recent addition of new functionality for online analytical processing.
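To make the idea of computing a correlation "directly in the database" concrete, the following sketch uses Python's standard `sqlite3` module to gather all the required sums with a single SQL aggregate query. The table name `obs`, the column names, and the data are hypothetical; only the final square root is taken outside SQL, because not every SQLite build includes math functions.

```python
import math
import sqlite3


def correlation_in_db(conn, table, x, y):
    """Pearson correlation of columns x and y from SQL aggregates only.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    row = conn.execute(
        f"SELECT COUNT(*), SUM({x}), SUM({y}), "
        f"SUM({x}*{x}), SUM({y}*{y}), SUM({x}*{y}) FROM {table}"
    ).fetchone()
    n, sx, sy, sxx, syy, sxy = row
    cov = n * sxy - sx * sy      # n^2 times the covariance
    var_x = n * sxx - sx * sx    # n^2 times the variance of x
    var_y = n * syy - sy * sy    # n^2 times the variance of y
    return cov / math.sqrt(var_x * var_y)


# Hypothetical in-memory table with perfectly linear data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (hours REAL, wage REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
)
r = correlation_in_db(conn, "obs", "hours", "wage")
```

Because the database returns six numbers rather than every row, the data never leave the engine, which is the efficiency argument made in the text. The sum-of-squares formula used here can lose precision on large or poorly scaled data; a production operator would likely use a more numerically stable method.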

Additional support for statistical operations and sampling would allow, for example, estimating the average value of a variable in a data set containing millions of records by requesting that the database itself take a sample and calculate its average. The direct result, without any additional software to process the data, would be the estimated mean together with some confidence limit that would depend on the variance and the sample size.
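The estimate described here can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any database product: it draws a simple random sample, uses a normal approximation with z = 1.96 for a roughly 95 percent interval, ignores the finite-population correction, and runs on synthetic data. The function name `sampled_mean` is invented for the example.

```python
import math
import random
import statistics


def sampled_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of values from a simple random sample.

    Returns (mean, half_width); mean +/- half_width is an approximate
    confidence interval whose width depends on the sample variance and
    the sample size.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * std_err


# Synthetic stand-in for one column of a large table (true mean 49.5).
population = [float(i % 100) for i in range(200_000)]
estimate, half_width = sampled_mean(population, sample_size=10_000)
```

Quadrupling the sample size halves the half-width, which is the trade-off a built-in sampling operator would let users control.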

Before the advent of object-relational database systems, which add object-oriented capabilities to relational databases, adding such extensions would generally have required extensive effort by the database vendor. Today, object-relational systems make it easier for third parties, as well as sophisticated users, to add both new data types and new operations into a database system. Since it is probably not reasonable to push all of the functionality of a statistical analysis product such as SAS into a general-purpose database system, a key challenge is to identify particular aggregation and sampling techniques and statistical operations that would provide the most leverage in terms of increasing both performance and functionality.

DATA MINING

Data mining enables the use of historical data to support evidence-based decision making, often without the benefit of explicitly stated statistical hypotheses, to create algorithms that can make associations that were not obvious to the database user. Ideas for data mining have been explored in a wide variety of contexts. In one example, researchers at Carnegie Mellon University studied a medical database containing several hundred medical features of some 10,000 pregnant women over time. They applied data-mining techniques to this collection of historical data to derive rules that better predict the risk of emergency caesarian sections for future patients. One pattern identified in the data predicts that when three conditions are met (no previous vaginal delivery, an abnormal second-trimester ultrasound reading, and the infant malpresenting), the patient's risk of an emergency caesarian section rises from a base rate of about 7 percent to approximately 60 percent.⁵
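The way such a rule is evaluated can be sketched as a comparison between the overall outcome rate and the rate among records that satisfy every condition. In the Python below, the field names are hypothetical and the 100 records are synthetic, constructed only to reproduce the approximate rates quoted above; they are not patient data, and the code does not discover rules, it only scores one.

```python
def rule_lift(records, conditions, outcome):
    """Return (base_rate, rule_rate, n_matched) for a conjunctive rule."""
    base_rate = sum(r[outcome] for r in records) / len(records)
    matched = [r for r in records if all(r[c] for c in conditions)]
    if not matched:
        raise ValueError("no records satisfy the rule")
    rule_rate = sum(r[outcome] for r in matched) / len(matched)
    return base_rate, rule_rate, len(matched)


CONDITIONS = ["no_prev_vaginal", "abnormal_ultrasound", "malpresenting"]

# Synthetic records: 10 satisfy all three conditions (6 with the outcome),
# 90 satisfy none (1 with the outcome).
records = [
    dict.fromkeys(CONDITIONS, True) | {"emergency_c": i < 6} for i in range(10)
] + [
    dict.fromkeys(CONDITIONS, False) | {"emergency_c": i < 1} for i in range(90)
]

base, rule, n = rule_lift(records, CONDITIONS, "emergency_c")
```

Reporting the number of matching records alongside the rates matters, because a rule supported by only a handful of cases may not hold for future patients.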

Data mining finds use in a number of commercial applications. A database containing information on software purchasers (such as age, income, what kind of hardware they own, and what kinds of software they have purchased so far) might be used to forecast who would be likely to purchase a particular software application in the future. Banks or credit card companies analyze historical data to identify customers that are likely to close their accounts and move to another service provider; predictive rules allow them to take preemptive action to retain accounts. In manufacturing, data collected over time from manufacturing processes (e.g., records containing various readings as items move down a production line) can be used by decision makers interested in process improvements in a production facility.

Both statisticians and computer scientists make use of some of the same data-mining tools and algorithms; researchers in the two fields have similar goals but somewhat different approaches to the problem. Statisticians, much as they would before beginning any statistical analysis, seek through interactions with the data owner to gain an understanding of how and why the data were collected, in part to make use of this information in the data mining and in part to better understand the limitations on what can be determined by data mining. The computer scientist, on the other hand, is more apt to focus on discovering ways to efficiently manipulate large databases in order to rapidly derive interesting or indicative trends and associations. Establishing the statistical validity of these methods and discoveries may be viewed as something that can be done at a later stage. Sometimes information on the conditions and circumstances under which the data were collected may be vague or even nonexistent, making it difficult to provide strong statistical justification for choosing particular data-mining tools or to establish the statistical validity of patterns identified from the mining; the statistician is arguably better equipped to understand the limitations of employing data mining in such circumstances. Statisticians seek to separate structure from noise in the data and to justify the separation based on principles of statistical inference. Similarly, statisticians approach issues like subsampling methodology as a statistical problem.

⁵	This example is described in more detail in Tom M. Mitchell. 1999. “Machine Learning and Data Mining,” Communications of the ACM 42(11).

Research on data mining has been stimulated by the growth in both the quantity of data that is being collected and in the computing power available for analyzing it. At present, a useful set of first-generation algorithms has been developed for doing exploratory data analysis, including logistic regression, clustering, decision-tree methods, and artificial-neural-net methods. These algorithms have already been used to create a number of applications; at least 50 companies today market commercial versions of such analysis tools.

One key research issue is the scalability of data-mining algorithms. Mining today frequently relies on approaches such as selecting subsets of the data (e.g., by random sampling) and summarizing them, or deriving smaller data sets by methods other than selecting subsets (e.g., to perform a regression relating two variables, one might divide the data into 1,000 subgroups and perform the regression on each group, yielding a derived subset consisting of 1,000 sets of regression coefficients). For example, to mine a 4-terabyte database, one might do the following: sample it down to 200 gigabytes, aggregate it to 80 gigabytes, and then filter the result down to 10 gigabytes.
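The derived-data idea in the parenthetical above (fit one regression per subgroup and keep only the coefficients) can be sketched as follows. The grouping rule (record id modulo the number of groups), the function names, and the synthetic rows are assumptions made for illustration; a real application would choose subgroups on substantive grounds.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regress_by_group(rows, n_groups):
    """Assign rows to n_groups by record id and fit one line per group."""
    groups = defaultdict(list)
    for rec_id, x, y in rows:
        groups[rec_id % n_groups].append((x, y))
    return {g: fit_line(pts) for g, pts in groups.items()}


# Synthetic rows (id, x, y) lying exactly on y = 3 + 2x.
rows = [(i, float(i % 17), 3.0 + 2.0 * (i % 17)) for i in range(50_000)]
coeffs = regress_by_group(rows, n_groups=1000)
```

Here 50,000 rows are reduced to 1,000 coefficient pairs, a derived data set that is far smaller than the original and can itself be mined or summarized.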

A relatively new area for data mining is multimedia data, including maps, images, and video. These are much more complex than the numerical data that have traditionally been mined, but they are also potentially rich new sources of information. While existing algorithms can sometimes be scaled up to handle these new types of data, mining them frequently requires completely new methods. Methods to mine multimedia data together with more traditional data sources could allow one to learn something that had not been known before. To use the earlier example, which involved determining risk factors in pregnancy, one would analyze not only the traditional features such as age (a numerical field) and childbearing status (a Boolean field) but also more complex multimedia features such as videosonograms and unstructured text notes entered by physicians. Another multimedia data-mining opportunity suggested at the workshop was to explore X-ray images (see Box 2.2) and numerical and text clinical data collected by the NHANES survey.

Active experimentation is an interesting research area related to data mining. Most analysis methods today analyze precollected samples of data. With the Internet and connectivity allowing researchers to easily tap multiple databases, there is an opportunity to explore algorithms that would, after a first-pass analysis of an initial data set, search data sources on the Internet to collect additional data that might inform, test, or improve conjectures that are formed from the initial data set. In his presentation at the workshop, Tom Mitchell explored some of these implications of the Internet for data collection and analysis. An obvious opportunity is to make interview forms available on the Web and collect information from user-administered surveys. A more technically challenging opportunity is to make use of Web information that is already available. How might one use that very large, heterogeneous collection of data to augment the more carefully collected but smaller data sets that come from statistical surveys? For example, many companies in the United States have Web sites that provide information on current and new products, the company's location, and other information such as recruiting announcements. Mitchell cited work by his research group at Carnegie Mellon on extracting data from corporate Web sites to collect such information as where they are headquartered, where they have facilities, and what their economic sector is. Similarly, most universities have Web sites that describe their academic departments, degree programs, research activities, and faculty. Mitchell described a system that extracts information from the home pages of university faculty. It attempts to locate and identify faculty member Web sites by browsing university Web sites, and it extracts particular information on faculty members, such as their home department, the courses they teach, and the students they advise.⁶

BOX 2.2

National Health and Nutrition Examination Survey X-ray Image Archive

Lewis Berman of the National Center for Health Statistics presented some possible uses of the NHANES X-ray image archive. He described NHANES as the only nationally representative sampling of X rays and indicated that some effort had been made to make this set of data more widely available. For example, more than 17,000 X-ray cervical and lumbar spine images from NHANES II have been digitized.¹ In collaboration with the National Library of Medicine, these data are being made accessible online under controlled circumstances via Web tools, along with collateral data such as reported back pain at the time of the X ray. Other data sets that could also be useful to researchers include hand and knee films from NHANES III, a collection of hip X rays, and a 30-year compilation of electrocardiograms. NHANES data could also provide a resource that would allow the information technology and medical communities to explore issues ranging from multimedia data mining to the impact of image compression on the accuracy of automated diagnosis.

¹ The images from NHANES II were scanned at 175 microns on a Lumisys Scanner. The cervical and lumbar spine images have a resolution of 1,463 × 1,755 × 12 bits (5 MB per image) and 2,048 × 2,487 × 12 bits (10 MB per image), respectively. Although the images are stored as 2 bytes/pixel, they capture only 12 bits of gray scale.

METADATA

The term “metadata” is generally used to indicate the descriptions and definitions that underlie data elements. Metadata provides data about data. For example, what, precisely, is meant by “household” or “income” or “employed”? In addition to metadata describing individual data elements, there is a host of other information associated with a survey, also considered metadata, that may be required to understand and interpret a data set. These include memos documenting the survey, the algorithms⁷ used to derive results from survey responses (e.g., how it is determined whether someone is employed), information on how surveys are constructed, information on data quality, and documentation of how the interviews are actually conducted (not just the questions asked but also the content of training materials and definitions used by interviewers in gathering the data). Workshop participants observed that better metadata and metadata tools and systems could have a significant impact on the usability of federal statistics, and they cited several key areas, discussed below.

Metadata, ranging from definitions of data fields to all other documentation associated with the design and conduct of a statistical survey, can be extensive. Martin Appel of the Census Bureau observed that attempts to manually add metadata have not been able to keep up with the volume of data that are generated. In particular, statistical data made available for analysis are frequently derived from calculations performed on other data, making the task of tying a particular data element to the appropriate metadata more complex. Tools for automatically generating and maintaining metadata as data sets are created, augmented, manipulated, and transformed (also known as self-documenting) could help meet this demand.

⁶ M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998. “Learning to Extract Symbolic Knowledge from the World Wide Web,” Proceedings of the 1998 National Conference on Artificial Intelligence (July). Available online at <http://www.cs.cmu.edu/~tom/publications.html>.

⁷ Simply including computer code as metadata may not be the most satisfactory method; even high-level language programs may not be useful as metadata. Another approach would be to use specification languages, which make careful statements about what computer code should do. These are more compact and more readable than typical computer code, although some familiarity with the specification language and comfort with its more formal nature are required. As with computer code itself, a description in a specification language cannot readily be interpreted by a nonexpert user, but it can be interpreted by a tool that can present salient details to nonexpert users. These languages are applicable not only to representing a particular computer program but also to representing larger systems, such as an entire statistical collection and processing system.

Even if fully satisfactory standards and tools are developed for use in future surveys, there remain legacy issues because the results of statistical surveys conducted in past decades are still of interest. For instance, the NHANES databases contain 30 years of data, during which time span similar but not identical questions were asked and evaluated, complicating the study of long-term health trends. Much work remains to provide a metadata system for these survey data that will permit their integration.

Another, related challenge is how to build tools that support the search and retrieval of metadata. A new user seeking to make sense of a Census data set may well need to know the difference between a “household” and a “family” or a “block group” and a “block” in order to make sense of that set. More generally, metadata are critical to help users make sense of data—for instance, what a particular piece of data means, how it was collected, and how much trust can be placed in it. The development of automatic display techniques that allow metadata associated with a particular data set to be quickly and easily accessed was identified as one area of need. For example, when a user examines a particular data cell, the associated metadata might be automatically displayed. At a minimum, drill-down facilities, such as the inclusion of a Web link in an online statistical report pointing to the relevant metadata, could be provided. Such tools should describe not only the raw data but also what sort of transformations were performed on them. Finally, as the next section discusses, metadata can be particularly important when one wishes to conduct an analysis across data from multiple sources.
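One way to picture metadata that travel with a data element, record each transformation, and can be displayed on demand is the following minimal sketch. Everything in it is an assumption made for illustration: the class names `Metadata` and `DataElement`, the `drill_down` method, the definition text, and the numbers are hypothetical and do not describe any agency's system.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    definition: str          # what the element means
    source: str              # where and how it was collected
    transformations: list = field(default_factory=list)  # processing history


@dataclass
class DataElement:
    name: str
    values: list
    meta: Metadata

    def transform(self, description, func):
        """Apply func to every value and record the step in the metadata.

        A new element is returned, so the original and its history are kept.
        """
        new_meta = Metadata(
            self.meta.definition,
            self.meta.source,
            self.meta.transformations + [description],
        )
        return DataElement(self.name, [func(v) for v in self.values], new_meta)

    def drill_down(self):
        """Text a display tool might show when a user selects a data cell."""
        lines = [
            f"{self.name}: {self.meta.definition}",
            f"source: {self.meta.source}",
        ]
        for step, text in enumerate(self.meta.transformations, start=1):
            lines.append(f"step {step}: {text}")
        return "\n".join(lines)


# Hypothetical element with made-up values.
income = DataElement(
    "household_income",
    [41_000, 58_400, 230_000],
    Metadata(
        "hypothetical definition: total annual income of all household members",
        "hypothetical survey, self-reported",
    ),
)
in_thousands = income.transform("divided by 1,000", lambda v: v / 1000)
rounded = in_thousands.transform("rounded to nearest whole number", round)
print(rounded.drill_down())
```

Because each derived element carries the full list of steps, a display tool can describe both the raw data and the transformations applied to them without any manual annotation.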

INFORMATION INTEGRATION

Given the number of different statistical surveys and agencies conducting surveys, “one-stop shopping” for federal statistical data would make statistical data more accessible. Doing so depends on capabilities that allow analyzing data from multiple sources. The goal would be to facilitate both locating the relevant information across multiple surveys and linking it to generate new results. Several possible approaches were discussed at the workshop.

Metadata standards, including both standardized formats for describing the data and sets of commonly agreed-on meanings, are one key to fully exploiting data sets from multiple sources. Without them, for instance, it is very difficult to ascertain which fields in one data set correspond to which fields in the other set and to what extent the fields are comparable. While the framework provided by the recently developed XML standard, including the associated document type definitions (DTDs), offers some degree of promise, work is needed to ensure that effective DTDs for federal statistical data sets are defined. XML DTDs, because they specify only certain structural characteristics of data, are only part of the solution; approaches for defining the semantics of statistical data sets also need to be developed. Standards do not, moreover, provide a solution for legacy data sets.
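The distinction between structure and semantics can be made concrete with a small sketch. It assumes a hypothetical XML layout in which each field description carries a `name`, a `concept` identifier drawn from a shared vocabulary, and a `unit`; none of these element or attribute names comes from an actual federal standard. The parser used here (Python's `xml.etree.ElementTree`) checks only that the XML is well formed and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Two hypothetical metadata descriptions with the same structure.
SURVEY_A = """<dataset name="survey_a">
  <field name="HHINC" concept="household-income" unit="USD-per-year"/>
  <field name="AGE" concept="age" unit="years"/>
</dataset>"""

SURVEY_B = """<dataset name="survey_b">
  <field name="income_hh" concept="household-income" unit="USD-per-month"/>
  <field name="resp_age" concept="age" unit="years"/>
  <field name="state" concept="state-of-residence" unit="none"/>
</dataset>"""


def fields_by_concept(xml_text):
    """Map each concept identifier to the (field name, unit) that carries it."""
    root = ET.fromstring(xml_text)
    return {
        f.get("concept"): (f.get("name"), f.get("unit"))
        for f in root.iter("field")
    }


def match_fields(xml_a, xml_b):
    """List concepts present in both data sets and whether the units agree."""
    a = fields_by_concept(xml_a)
    b = fields_by_concept(xml_b)
    matches = []
    for concept in sorted(a.keys() & b.keys()):
        name_a, unit_a = a[concept]
        name_b, unit_b = b[concept]
        matches.append((concept, name_a, name_b, unit_a == unit_b))
    return matches


for row in match_fields(SURVEY_A, SURVEY_B):
    print(row)
```

Both descriptions share the same structure, yet the field names alone (`HHINC`, `income_hh`) would not reveal the correspondence; it is the agreed-on `concept` values that make the match possible, and the unit comparison flags a pair that corresponds but is not directly comparable.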

Another approach to information integration is to leverage the existing metadata, such as the text labels that describe the rows and columns in a statistical table or descriptions of how the data have been collected and processed, that accompany the data sets. Finding ways of using these metadata to represent and relate the contents of tables and databases so that analyses can be performed is an interesting area for further research.

The database community is exploring how to use database systems to integrate information originating from different systems throughout an organization (data warehousing). Database system developers are building tools that provide an interactive, analytical front end that integrates access to information in databases along with tools for visualizing the data. Research is being done on such things as data transformations and data cleaning and on how to model different data sources in an integrated way.

SURVEY INSTRUMENTS

The way in which data are collected is critical: without high-quality data up front, later work will have little value. Improved tools for administering surveys, whether they use paper and pencil, are computer-assisted, or are interviewee (end-user) administered, would also help. Discussions at the workshop suggested that a new generation of tools for developing surveys would offer statistical agencies greater flexibility in developing sound, comprehensive surveys. The current generation of tools is hard to use and requires that significant amounts of customized code be designed, written, and debugged. The complexity of the surveys sponsored by the federal government exceeds that of most other surveys, so it is unlikely that software to support this complex process will ever become mainstream. Workshop participants suggested that the federal government should for this reason consider consolidating its efforts to develop (or have others develop) such software. Some particular needs are associated with survey tools:


Improved survey software tools. It would be useful to easily modify surveys that have already been developed or deployed; such modification can be difficult when extensive custom coding is required to create a survey instrument. High-level language tools (so-called fourth-generation languages), like those developed by the database industry, which demonstrate that families of sophisticated applications can be developed without requiring programmers to write extensive amounts of customized computer code, may also ease the task of developing surveys.
Flexibility in navigation. Better software tools would, for example, permit users to easily back up to earlier answers and to correct errors. Heather Contrino, discussing the American Travel Survey CATI system, observed that if a respondent provides information about several trips during the trip section of the survey and then recalls another trip during the household section, it would be useful if the interviewer could immediately go back to a point in the survey where the new information should be captured and then proceed with the survey. The new CATI system used for the 1995 American Travel Survey provides some flexibility, but more would improve survey work. The issue, from an IT research perspective, is developing system designs that ensure internal consistency of the survey data acquired from subjects while also promoting more flexible interactions, such as adapting to respondents' spontaneous reports.
Improved ease of use. Being able to visualize the flow of the questionnaire would be especially helpful. In complex interviews, an interviewer can lose his or her place and become disoriented, especially when following rarely used paths. This difficulty could be ameliorated by showing, for example, the current location in the survey in relation to the overall flow of the interview. Built-in training capabilities would also enhance the utility of future tools. Ideally, they should be able to coach the interviewer on how to administer the survey.
Monitoring the survey process. Today, survey managers monitor the survey process manually. Tools for automatically monitoring the survey could be designed and implemented so that, as survey results are uploaded by the survey takers, status tables could be automatically produced and heuristic and statistical techniques used to detect abnormal conditions. Automated data collection would improve the timeliness of data collection and enhance monitoring efforts. While the data analyst is generally interested only in the final output from a survey instrument, the survey designer also wants information on the paths taken through the survey, including, for example, any information that was entered and then later modified. This is similar to the analyses of “click trace” that track user paths through Web sites.
On-the-fly response checking. It would be useful to build in checks to identify inappropriate data values or contradictory answers immediately, as an interview is being conducted, rather than having to wait for post-interview edits and possibly incurring the cost and delay of a follow-up interview to correct the data. Past attempts to build in such checks are reported to have made the interview instruments run excessively slowly, so the checks were removed.

Improved performance. Another dimension to the challenges of conducting surveys is the hardware platform. Laptops are the current platform of choice for taking a survey. However, the current generation of machines is not physically robust in the field, is too difficult to use, and is too heavy for many applications (e.g., when an interviewer stands in a doorway, as happens when a household is being screened for possible inclusion in a survey). Predictable advances in computer hardware will address size and shape, weight, and battery life problems while advances in processing speed will enable on-the-fly checking, as noted above. Continued commercial innovation in portable computer devices, building on the present generation of personal digital assistants, which provide sophisticated programmability, appears likely to provide systems suitable for many of these applications. It is, of course, a separate matter whether procurement processes and budgets can assimilate use of such products quickly.

New modes of interaction with survey instruments. Another set of issues relates to the limitations of keyboard entry. While a keyboard is suitable for a telephone interview or an interview conducted inside someone's house, it has some serious limitations in other circumstances, such as when an interviewer is conducting an initial screening interview at someone's doorstep or in a driveway. Advances in speech-to-text technology might offer advantages for certain types of interviews, as might handwriting recognition capability, which is being made available in a number of computing devices today. Limited-vocabulary (e.g., “yes,” “no,” and numerical digits), speaker-independent speech recognition systems have been used for some time in survey work.⁸ The technology envisioned here would provide speaker-independent capability with a less restricted vocabulary. With this technology it would be possible to capture answers in a much less intrusive fashion, which could lead to improvements in overall survey accuracy. Speech-to-text would also help reduce human intermediation if it could allow interviewees to interact directly with the survey instrument. There are significant research questions regarding the implications of different techniques for administering survey questionnaires, with some results in the literature suggesting that choice of administration technique can affect survey results significantly.⁹ More research on this question, as well as on the impact of human intermediation on data collection, would be valuable.

⁸ The Bureau of Labor Statistics started using this technology for the Current Employment Survey in 1992. See Richard L. Clayton and Debbie L.S. Winter. 1992. “Speech Data Entry: Results of a Test of Voice Recognition for Survey Data Collection,” Journal of Official Statistics 8:377-388.
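The on-the-fly response checking and flexible navigation described in the list above can be sketched as a small validation routine that runs each time an answer is keyed in, including when the interviewer goes back to change an earlier answer. This is a minimal sketch under stated assumptions: the field names, the ranges in `RANGE_RULES`, and the single consistency rule are invented for illustration and are not taken from any survey instrument.

```python
# Hypothetical range rules: field name -> (lowest, highest) allowed value.
RANGE_RULES = {
    "age": (0, 120),
    "household_size": (1, 30),
    "years_at_address": (0, 120),
}


def consistency_problems(answers):
    """Cross-field checks that flag contradictory answers."""
    problems = []
    if "age" in answers and "years_at_address" in answers:
        if answers["years_at_address"] > answers["age"]:
            problems.append("years_at_address exceeds age")
    return problems


def enter_answer(answers, field_name, value):
    """Check one answer as it is entered; store it only if it passes.

    Returns a list of problem messages (empty when the answer is accepted).
    Re-entering an earlier field runs the same checks against the answers
    already collected, so backing up cannot introduce a contradiction.
    """
    problems = []
    limits = RANGE_RULES.get(field_name)
    if limits is not None and not (limits[0] <= value <= limits[1]):
        problems.append(f"{field_name}={value} outside {limits[0]}..{limits[1]}")
    candidate = dict(answers)
    candidate[field_name] = value
    problems.extend(consistency_problems(candidate))
    if not problems:
        answers[field_name] = value
    return problems


answers = {}
print(enter_answer(answers, "age", 34))               # accepted
print(enter_answer(answers, "years_at_address", 40))  # contradicts age
print(enter_answer(answers, "years_at_address", 10))  # accepted
print(enter_answer(answers, "age", 8))                # going back: rejected
```

Checks of this kind are cheap individually; the slowdowns reported for past instruments suggest that the cost depends on how many rules must be re-evaluated per keystroke, which is a design question this sketch does not address.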

LIMITING DISCLOSURE

Maintaining the confidentiality of respondents in data collected under pledges of confidentiality is an intrinsic part of the mission of the federal statistical agencies. It is this promise of protection against disclosure of confidential information—protecting individual privacy or business trade secrets—that convinces many people and businesses to comply willingly and openly with requests for information about themselves, their activities, and their organizations. Hence, there are strong rules in place governing how agencies may (and may not) share data,¹⁰ and data that divulge information about individual respondents are not released to the public. Disclosure limitation is a research area that spans both statistics and IT; researchers in both fields have worked on the issue in the past, and approaches and techniques from both fields have yielded insights. While nontechnical approaches play a role, IT tools are frequently employed to help ease the tension between society's demands for data and the agencies' ability to collect information and maintain its confidentiality.

Researchers rely on analysis of data sets from federal statistical surveys, which are viewed as providing the highest-quality data on a number of topics, to explore many economic and social phenomena. While some of their analysis can be conducted using public data sets, some of it depends on information that could be used to infer information about individual respondents, including microdata, which are the data sets containing records on individual respondents. Statistical agencies must strike a balance between the benefits obtained by releasing information for legitimate research and the potential for unintended disclosures that could result from releasing information. The problem is more complicated than simply whether or not to release microdata. Whenever an agency releases statistical information, it is inherently disclosing some information about the source of the data from which the statistics are computed and potentially making it easier to infer information about individual respondents.

⁹ See, e.g., Sara Kiesler and Lee Sproull. 1986. “Response Effects in the Electronic Survey,” Public Opinion Quarterly 50:243-253 and Wendy L. Richman, Sara Kiesler, Suzanne Weisband, and Fritz Drasgow. 1999. “A Meta-analytic Study of Social Desirability Distortion in Computer-Administered Questionnaires, Traditional Questionnaires, and Interviews,” Journal of Applied Psychology 84(5, October):754-775.

¹⁰ These rules were clarified and stated consistently in Office of Management and Budget, Office of Information and Regulatory Affairs. 1997. “Order Providing for the Confidentiality of Statistical Information,” Federal Register 62(124, June 27):33043. Available online at <http://www.access.gpo.gov/index.html>.

Contrary to what is sometimes assumed, protecting data confidentiality is not as simple as merely suppressing names and other obvious identifiers. In some cases, one can re-identify such data using record linkage techniques. Record linkage, simply put, is the process of using identifying information in a given record to identify other records containing information on the same individual or entity.¹¹ For example, a set of attributes such as geographical region, sex, age, race, and so forth may be sufficient to identify individuals uniquely. Moreover, because multiple sources of data may be drawn on to infer identity, understanding how much can be inferred from a particular set of data is difficult. A simple example provided by Latanya Sweeney in her presentation at the workshop illustrates how linking can be used to infer identity (Box 2.3).
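The linkage idea can be sketched in a few lines. The records below are entirely made up for illustration (they are not Sweeney's data), and the helper names `fraction_unique` and `link` are hypothetical. The sketch measures how often a combination of attributes singles out one person in an identified list, then joins a de-identified file to that list on the shared attributes.

```python
from collections import Counter

# Made-up identified list (for example, a purchased registration list).
voters = [
    {"name": "Voter A", "zip": "00001", "birth": "1970-01-01", "sex": "F"},
    {"name": "Voter B", "zip": "00001", "birth": "1970-01-01", "sex": "M"},
    {"name": "Voter C", "zip": "00002", "birth": "1970-01-01", "sex": "F"},
    {"name": "Voter D", "zip": "00002", "birth": "1985-06-15", "sex": "M"},
    {"name": "Voter E", "zip": "00002", "birth": "1985-06-15", "sex": "M"},
]

# Made-up "anonymous" file: names removed, other attributes retained.
medical = [
    {"zip": "00001", "birth": "1970-01-01", "sex": "F", "diagnosis": "condition-1"},
    {"zip": "00002", "birth": "1985-06-15", "sex": "M", "diagnosis": "condition-2"},
]

KEYS = ("zip", "birth", "sex")


def key_of(record, keys):
    return tuple(record[k] for k in keys)


def fraction_unique(records, keys):
    """Share of records whose combination of key attributes is unique."""
    counts = Counter(key_of(r, keys) for r in records)
    unique = sum(1 for r in records if counts[key_of(r, keys)] == 1)
    return unique / len(records)


def link(deidentified, identified, keys=KEYS):
    """Attach a name to each de-identified record that matches exactly one
    identified record on the key attributes; ambiguous matches are skipped."""
    counts = Counter(key_of(r, keys) for r in identified)
    names = {key_of(r, keys): r["name"] for r in identified}
    linked = []
    for record in deidentified:
        k = key_of(record, keys)
        if counts.get(k) == 1:
            linked.append((names[k], record["diagnosis"]))
    return linked


print(fraction_unique(voters, ("birth",)))
print(fraction_unique(voters, ("birth", "sex")))
print(fraction_unique(voters, KEYS))
print(link(medical, voters))
```

Adding attributes to the key raises the share of unique records, which is why suppressing names alone offers little protection once a second data source shares several fields with the released file.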

Both technical and nontechnical approaches have a role in improving researcher access to statistical data. Agencies are exploring a variety of nontechnical solutions to complement their technical solutions. For example, the National Center for Education Statistics allows researchers access to restricted-use data under strict licensing terms, and the National Center for Health Statistics (NCHS) recently opened a research data center that makes data files from many of its surveys available, both on-site and via remote access, under controlled conditions. The Census Bureau has established satellite centers for secured access to research data in partnership with the National Bureau of Economic Research, Carnegie Mellon University, and the University of California (at Berkeley and at Los Angeles), and it intends to open additional centers.¹² Access to data requires specific contractual arrangements aimed at safeguarding confidentiality, and de-identified public-use microdata user files can be accessed through third parties. For example, data from the National Crime Victimization Survey are made available through the Interuniversity Consortium for Political and Social Research (ICPSR) at the University of Michigan. Members of the research community are, of course, interested in finding less restrictive ways of giving researchers access to confidential data that do not compromise the confidentiality of that data.

¹¹ For an overview and series of technical papers on record linkage, see Committee on Applied and Theoretical Statistics, National Research Council and Federal Committee on Statistical Methodology, Office of Management and Budget. 1999. Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition. National Academy Press, Washington, D.C.

¹² See U.S. Census Bureau, Office of the Chief Economist, 1999. Research Data Centers. U.S. Census Bureau, Washington, D.C., last revised September 28. Available online at <http://www.census.gov/cecon/www/rdc.html>.


BOX 2.3

Using External Data to Re-identify Personal Data

Removing names and other unique identification information is not sufficient to prevent re-identifying the individuals associated with a particular data record. Latanya Sweeney illustrated this point in her presentation at the workshop using an example of how external data sources can be used to determine the identity of the individuals associated with medical records. Hospitals and insurers collect information on individual patients. Because such data are generally believed to be anonymous once names and other unique identifiers have been removed, copies of these data sets are provided to researchers and sold commercially. Sweeney described how she re-identified these seemingly anonymous records using information contained in voter registration records, which are readily purchased for many communities.

Voter registration lists, which provide information on name, address, and so forth, are likely to have three fields in common with de-identified medical records: zip code, birth date, and sex. How unique a link can be established using this information? In one community where Sweeney attempted to re-identify personal data, there are 54,805 voters. The range of possible birth dates (year, month, day) is relatively small (about 36,500 dates over 100 years) and so potentially can be useful in identifying individuals. In the community she studied, there is a concentration of people in their 20s and 30s, and birth date alone uniquely identifies about 12 percent of the community's population. That is, for those people, knowing a person's birth date and that the person lived in that community is enough to identify him or her uniquely. Birth date and gender were unique for 29 percent of the voters; birth date and zip code, for 69 percent; and birth date and full postal code, for 97 percent.

Academic work on IT approaches to disclosure limitation has so far been confined largely to techniques for limiting disclosure resulting from release of a given data set. However, as the example provided by Sweeney illustrates, disclosure limitation must also address the extent to which released information can be combined with other, previously released statistical information, including administrative data and commercial and other publicly available data sets, to make inferences. Researchers have recognized the importance of understanding the impact on confidentiality of these external data sources, but progress has been limited because the problem is so complex. The issue is becoming more important for at least two reasons. First, the quantity of personal information being collected automatically is increasing rapidly (Box 2.4) as the Web grows and database systems become more sophisticated. Second, the statistical agencies, to meet the research needs of their users, are being asked to release “anonymized” microdata to support additional data analyses. As a result, a balancing act must be performed between the benefits obtained from data release and the potential for unwanted disclosure that comes from linking with other databases. What is the disclosure effect, at the margin, of the release of a particular set of data from a statistical agency?

BOX 2.4

Growth in the Collection of Personal Data

At the workshop, Latanya Sweeney described a metric she had developed to provide a sense of how the amount of personal data is growing. Her measure (disk storage per person, calculated as the amount of storage in the form of hard disks sold per year divided by the adult world population) is based on the assumption that access to inexpensive computers with very large storage capacities is enabling the collection of an increasing amount of personal data. Based on this metric, the several thousand characters of information that could be printed on an 8 1/2 by 11 inch piece of paper would have documented some 2 months of a person's life in 1983. The estimate seems reasonable: at that time such information probably would have been limited to that contained in school or employment records, the telephone calls contained on telephone bills, utility bills, and the like. By 1996, that same piece of paper would document 1 hour of a person's life. The growth can be seen in the increased amount of information contained on a Massachusetts birth certificate; it once had 15 fields of information but today has more than 100. Similar growth is occurring in educational data records, grocery store purchase logs, and many other databases, observed Sweeney. Projections for the metric in 2000, with 20-gigabyte drives widely available, are that the information contained on a single page would document less than 4 minutes of a person's life; that information includes image data, Web and Internet usage data, biometric data (gathered for health care, authentication, and even Web-based clothing purchases), and so on.

The issue of disclosure control has also been addressed in the context of work on multilevel security in database systems, in which the security authorization level of a user affects the results of database queries.¹³ A simple disclosure control mechanism such as classifying individual records is not sufficient because of the possible existence of an inference channel whereby information classified at a level higher than that for which a user is cleared can be inferred by that user based on information at lower levels (including external information) that is possessed by that user. Such channels are, in general, hard to detect because they may involve a complex chain of inferences and because of the ability of users to exploit external data.¹⁴

¹³ See National Research Council and Social Science Research Council. 1993. Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. National Academy Press, Washington, D.C., pp. 150-151; and D.E. Denning et al. 1988. “A Multilevel Relational Data Model,” Proceedings of the 1987 IEEE Symposium on Research Security and Privacy. IEEE Computer Society, Los Alamitos, Calif.

Various statistical disclosure-limiting techniques have been and are being developed to protect different types of data. The degree to which these techniques need to be unique to specific data types has not been resolved. The bulk of the research by statistics researchers on statistical disclosure limitation has focused on tabular data, and a number of disclosure-limiting techniques have been developed to protect the confidentiality of individual respondents (both people and businesses), among them the following:

Cell suppression—the blanking of table entries that would provide information that could be narrowed down to too small a set of individuals;
Swapping—exchanging pieces of information among similar individuals in a data set; and
Top coding—aggregating all individuals above a certain threshold into a single top category. This allows, for example, hiding information about an individual whose income was significantly greater than the incomes of the other individuals in a given set that would otherwise appear in a lone row of a table.
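The three techniques listed above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a description of any agency's production method: the function names, the minimum cell count of 3, and the field names in the usage note are all hypothetical. The suppression function performs primary suppression only; when row or column totals are also published, additional (complementary) suppression is generally needed so that blanked cells cannot be recovered by subtraction.

```python
import random
from typing import Optional


def suppress_small_cells(table: dict[str, int],
                         min_count: int = 3) -> dict[str, Optional[int]]:
    """Cell suppression: blank (None) any cell whose count is below min_count.

    min_count=3 is an assumed threshold chosen only for illustration.
    """
    return {cell: (n if n >= min_count else None) for cell, n in table.items()}


def top_code(values: list[float], threshold: float) -> list[float]:
    """Top coding: report every value above the threshold as the threshold,
    so that all such individuals fall into one 'threshold or more' category."""
    return [min(v, threshold) for v in values]


def swap_attribute(records: list[dict], attr: str, group_by: str,
                   seed: int = 0) -> list[dict]:
    """Swapping: shuffle the values of attr among records that share the
    same group_by value (a simple stand-in for 'similar individuals')."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]          # copy; the input is not modified
    groups: dict = {}
    for i, r in enumerate(out):
        groups.setdefault(r[group_by], []).append(i)
    for idxs in groups.values():
        vals = [out[i][attr] for i in idxs]
        rng.shuffle(vals)
        for i, v in zip(idxs, vals):
            out[i][attr] = v
    return out
```

For instance, `swap_attribute(records, "income", "region")` would exchange income values only among records from the same region. This preserves the income distribution within each region but, as noted in the text that follows, can distort correlations between income and other attributes.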

However, researchers who want access to the data are not yet satisfied with currently available tabular data-disclosure solutions. In particular, some of these approaches rely on distorting the data in ways that can make it less acceptable for certain uses. For example, swapping can alter records in a way that throws off certain kinds of research (e.g., it can limit researchers' ability to explore correlations between various attributes).

While disclosure issues for tabular data sets have received the most attention from researchers, many other types of data are also released, both publicly and to more limited groups such as researchers, giving rise to a host of questions about how to limit disclosure. Some attention has been given to microdata sets and the creation of public-use microdata files. The proliferation of off-the-shelf software for data linking and data combining appears to have raised concerns about releasing microdata. None of the possible solutions to this problem coming from the research community (e.g., random sampling, masking, or synthetic data generation) seems mature enough today to be adopted as a data release technique.

¹⁴ See T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Type Overlap Relations and the Inference Problem,” Proceedings of the 8th IFIP WG 11.3 Working Conference on Database Security, August; T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Issues in Data-Level Monitoring of Conjunctive Inference Channels,” Proceedings of the 8th IFIP WG 11.3 Working Conference on Database Security, August; and T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Detection and Elimination of Inference Channels in Multilevel Relational Database Systems,” Proceedings of the IEEE Symposium on Research in Security and Privacy, May 1993. For an analysis of the conceptual models underlying multilevel security, see Computer Science and Telecommunications Board, National Research Council. 1999. Trust in Cyberspace. National Academy Press, Washington, D.C.

Digital geospatial data, including image data, are becoming more widely available and are of increasing interest to the research community. Opportunities for and interest in linking data sets by spatial coordinates can be expected to grow correspondingly. In many surveys, especially natural resources or environmental surveys, the subject matter is inherently spatial. And spatial data are instrumental in research in many areas, including public health and economic development. The confidentiality of released data based on sample surveys is generally protected by minimizing the chance that a respondent can be uniquely identified using demographic variables and other characteristics. The situations where sampling or observational units (e.g., person, household, business, or land plot) are linked with a spatial coordinate (e.g., latitude and longitude) or another spatial attribute (e.g., Census block or hydrologic unit) have been less well explored. Precise spatial coordinates for sampling or observational units in surveys are today generally considered identifying information and are thus excluded from the information that can be released with a public data set. Identification can also be achieved through a combination of less precise spatial attributes (e.g., county, Census block, hydrologic unit, land use), and care must be taken to ensure that including variables of this sort in a public data set will not allow individual respondents to be uniquely identified.

Techniques to limit information disclosure associated with spatial data have received relatively little attention, and research is needed on approaches that strike an appropriate balance between two opposing forces: (1) the need to protect the confidentiality of sample and observational units when spatial coordinates or related attributes are integral to the survey and (2) the benefits of using spatial information to link with a broader suite of information resources. Such approaches might draw from techniques currently used to protect the confidentiality of alphanumeric human population survey data. For example, random noise might be added to make the spatial location fuzzier, or classes of spatial attributes might be combined to create a data set with lower resolution. It is possible that the costs and benefits of methods for protecting the confidentiality of spatial data will vary from those where only alphanumeric data are involved. In addition, alternative paradigms making use of new information technologies may be more appropriate for problems specific to spatial data. One might, for instance, employ a behind-the-scenes mechanism for accurately combining spatial information where the linkage, such as the merging of spatial data sets, occurs in a confidential “space” to produce a product such as a map or a data set with summaries that do not disclose locations. In some cases, this might include a mechanism that implements disclosure safeguards.
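The two spatial approaches mentioned above, adding random noise to a location and lowering its resolution, might look roughly like the following Python sketch. The function names, the uniform noise distribution, and the grid-cell scheme are assumptions made for illustration; the workshop did not prescribe any particular method, and the sketch makes no claim about how much noise or coarsening would be adequate for a given survey.

```python
import math
import random


def jitter_coordinates(lat: float, lon: float, max_offset_deg: float,
                       rng: random.Random) -> tuple[float, float]:
    """Add independent uniform noise of at most max_offset_deg to each
    coordinate. No clamping to valid latitude/longitude ranges is done."""
    return (lat + rng.uniform(-max_offset_deg, max_offset_deg),
            lon + rng.uniform(-max_offset_deg, max_offset_deg))


def coarsen_coordinates(lat: float, lon: float,
                        cell_deg: float) -> tuple[float, float]:
    """Lower the resolution by snapping a point to the south-west corner of
    the cell_deg x cell_deg grid cell that contains it."""
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg)
```

With a half-degree grid, every sampling unit inside the same cell is reported at the same corner point, so the released file distinguishes cells rather than individual plots or households. Noise and coarsening both reduce the value of the data for linking with other spatial data sets, which is the trade-off described above.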

A third, more general, issue is how to address disclosure limitation when multimedia data such as medical images are considered. Approaches developed for numerical tabular or microdata do not readily apply to images, instrument readings, text, or combinations of them. For example, how does one ensure that information gleaned from medical images cannot be used to re-identify records? Given the considerable interest of both computer scientists and statisticians in applying data-mining techniques to extract patterns from multimedia data, collaboration with computer scientists on disclosure-limiting techniques for these data is likely to be fruitful.

Few efforts have been made to evaluate the success of data release strategies in practice. Suppose, for example, that a certain database is proposed for release. Could one develop an analytical technique to help data managers evaluate the potential for unwanted disclosure caused by the proposed release? The analysis would evaluate the database itself, along with meta-information about other known, released databases, so as to identify characteristics of additional external information that could cause an unwanted disclosure. It could be used to evaluate not only the particular database proposed for release but also the impact of that release on potential future releases of other databases. Several possible approaches were identified by workshop participants. First, one can further develop systematic approaches for testing the degree to which a particular release would identify individuals. Given that it is quite difficult to know the full scope of information available to a would-be “attacker,” it might also be useful to develop models of the information available to and the behavior of someone trying to overcome attempts to limit disclosure and to use these models to test the effectiveness of a particular disclosure limitation approach.
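One very simple form of such a systematic test is to count how many records in a proposed release are unique on a chosen set of attributes that an outsider might plausibly know. The Python sketch below illustrates the idea; the function name, the report fields, and the attribute names in the usage note are hypothetical. It examines only the file itself and does not model the external information or attacker behavior discussed above.

```python
from collections import Counter


def uniqueness_report(records: list[dict], quasi_identifiers: list[str]) -> dict:
    """Count records that are the only ones with their combination of
    values on the given attributes (assumed knowable by an outsider)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    n_unique = sum(1 for k in keys if counts[k] == 1)
    return {
        "n_records": len(records),
        "n_unique": n_unique,                     # records alone in their group
        "smallest_group": min(counts.values()) if counts else 0,
    }
```

Running `uniqueness_report(records, ["county", "age_group", "occupation"])` on a candidate file would show how many respondents stand alone on those three attributes. A data manager could rerun the check with coarser categories to see how much aggregation is needed before no record is unique.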

Another approach, albeit a less systematic one, is to explore red teaming to learn how a given data set could be exploited (including by combining it with other, previously disclosed or publicly available data sets). Red teaming in this context is like red teaming to test information system security (a team of talented individuals is invited to probe for weaknesses in a system),¹⁵ and the technique could benefit from collaboration with IT researchers and practitioners.

¹⁵ A recent CSTB report examining defense command-and-control systems underscored the importance of frequent red teaming to assess the security of critical systems. See Computer Science and Telecommunications Board, National Research Council. 1999. Realizing the Potential of C4I: Fundamental Challenges. National Academy Press, Washington, D.C.


TRUSTWORTHINESS OF INFORMATION SYSTEMS

The challenge of building trustworthy (secure, dependable, and reliable) systems has grown along with the increasing complexity of information systems and their connectedness, ubiquity, and pervasiveness. This is a burgeoning challenge to the federal statistical community as agencies move to greater use of networked systems for data collection, processing, and dissemination. Thus, even as solutions are developed, the goal being pursued often appears to recede.¹⁶

There have been substantial advances in some areas of security and particular problems have been solved. For example, if one wishes to protect information while it is in transit on a network, the technology to do this is generally considered to be available.¹⁷ Hence experts tend to agree that a credit card transaction over the Internet can be conducted with confidence that credit card numbers cannot be exposed or tampered with while they are in transit. On the other hand, there remain many difficult areas: for example, unlike securing information in transit, the problem of securing the information on the end systems has, in recent years, not received the attention that it demands. Protecting against disclosure of confidential information and ensuring the integrity of the collection, analysis, and dissemination process are critical issues for federal statistical agencies.

¹⁶ The recent flap over the proposed Federal Intrusion Detection Network (FIDnet) indicates that implementing security measures is more complicated in a federal government context.

¹⁷ For a variety of reasons, including legal and political issues associated with restrictions that have been placed on the export of strong cryptography from the United States, these technologies are not as widely deployed as some argue they should be. See, e.g., Computer Science and Telecommunications Board, National Research Council. 1996. Cryptography's Role in Securing the Information Society. National Academy Press, Washington, D.C. These restrictions have recently been relaxed.

For the research community that depends on federal statistics, a key security issue is how to facilitate access to microdata sets without compromising their confidentiality. As noted above, the principal approach being used today is for researchers to relocate themselves temporarily to agency offices or one of a small number of physically secured data centers, such as those set up by the Census Bureau and the NCHS. Unfortunately, the associated inconveniences, such as the need for frequent travel, are cited by researchers as a significant impediment to working with microdata. Another possible approach being explored is the use of various security techniques to permit off-site access to data. NCHS is one agency that has established remote data access services for researchers. This raises several issues. For example, what is the trade-off between permitting off-site users to replicate databases to their own computers in a secure fashion for local analysis and permitting users to have secured remote access to external analysis software running on computers located at a secured center? Both approaches require attention to authentication of users and both require safeguards, technological or procedural, to prevent disclosure as a result of the microdata analysis.¹⁸

Another significant challenge in the federal statistics area is maintaining the integrity of the process by which statistical data are collected, processed, and disseminated. Federal statistics carry a great deal of authority because of the reputation that the agencies have developed, a reputation that demands careful attention to information security. Discussing the challenges of maintaining the back-end systems that support the electronic dissemination of statistics products, Michael Levi of the Bureau of Labor Statistics cited several demands placed on statistics agencies: systems that possess automated failure detection and recovery capabilities; better configuration management including installation, testing, and reporting tools; and improved tools for intrusion prevention, detection, and analysis.

As described above, the federal statistical community is moving away from manual, paper-and-pencil modes of data collection to more automated modes. This trend started with the use of computer-assisted techniques (e.g., CAPI and CATI) to support interviewers and over time can be expected to move toward more automated modes of data gathering, including embedded sensors for automated collection of data (e.g., imagine if one day the American Travel Survey were to use Global Positioning System satellite receivers and data recorders instead of surveys). Increasing automation increases the need to maintain the traceability of data to its source as the data are transferred from place to place (e.g., uploaded from a remote site to a central processing center) and are processed into different forms during analysis (e.g., to ensure that the processed data in a table in fact reflect the original source data). In other words, there is a greater challenge in maintaining process integrity—a chain of evidence from source to dissemination.
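One way to picture such a chain of evidence is a log in which each processing step records a cryptographic hash of the data it produced together with the hash of the previous log entry, so that any later alteration breaks the chain. The Python sketch below is an assumption-laden illustration of that idea rather than a technique described at the workshop; the function and field names are hypothetical. A hash chain by itself only detects changes. It does not stop someone who controls the whole log from recomputing it, so a real system would also need measures such as digital signatures or independent copies.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of a log entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_step(chain: list[dict], step: str, data: bytes) -> list[dict]:
    """Return a new chain with one more entry describing a processing step."""
    body = {
        "step": step,                                    # e.g. "upload", "tabulate"
        "data_hash": hashlib.sha256(data).hexdigest(),   # fingerprint of the data
        "prev_hash": chain[-1]["entry_hash"] if chain else "",
    }
    return chain + [dict(body, entry_hash=_digest(body))]


def verify_chain(chain: list[dict]) -> bool:
    """Check that every entry is intact and linked to its predecessor."""
    prev = ""
    for entry in chain:
        body = {k: entry[k] for k in ("step", "data_hash", "prev_hash")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True
```

In use, a field upload and each subsequent transformation would call `append_step`, and a reviewer could call `verify_chain` before dissemination to confirm that the recorded history from source to published table has not been altered.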

There are related challenges associated with avoiding premature data release. In some instances, data have been inadvertently released before the intended point in time. For example, the Bureau of Labor Statistics prematurely released part of its October 1998 employment report.

¹⁸ A similar set of technical requirements arises in supporting the geographically dispersed workers who conduct field interviews and report the data that have been collected. See, for example, Computer Science and Telecommunications Board, National Research Council. 1992. Review of the Tax Systems Modernization of the Internal Revenue Service. National Academy Press, Washington, D.C.


According to press reports citing a statement made by BLS Commissioner Katharine G. Abraham, this happened when information was moved to an internal computer by a BLS employee who did not know it would thereupon be transferred immediately to the agency's World Wide Web site and thus be made available to the public.¹⁹ The processes for managing data apparently depended on manual procedures. What kind of automated process-support tools could be developed to make it much more difficult to release information prematurely?
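As one hedged illustration of what such a process-support tool might do, the Python sketch below places a single automated gate in front of publication: an item is released only if it has been explicitly approved and its scheduled release time has arrived. The names `publish` and `EmbargoError`, the approval flag, and the dates in the usage note are hypothetical and are not drawn from any BLS system.

```python
from datetime import datetime, timezone
from typing import Optional


class EmbargoError(Exception):
    """Raised when an item may not yet be made public."""


def publish(item: str, release_at: datetime, approved: bool,
            now: Optional[datetime] = None) -> str:
    """Release an item only if it is approved and its release time has passed.

    release_at and now must be timezone-aware datetimes; now defaults to the
    current UTC time and is a parameter so the check can be tested.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if not approved:
        raise EmbargoError(f"{item}: release has not been approved")
    if now < release_at:
        raise EmbargoError(f"{item}: embargoed until {release_at.isoformat()}")
    return f"published {item}"
```

For example, a call made one minute before a scheduled release at 13:30 UTC on 7 January 2000 would raise `EmbargoError`, while the same call at or after 13:30 would succeed. The design point is that moving a file to an internal computer would not by itself make the file public; publication would require passing the gate.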

In the security research literature, problems and solutions are abstracted into a set of technologies or building blocks. The test of these building blocks is how well researchers and technologists can apply them to understand and address the real needs of customers. While there are a number of unsolved research questions in information security, solutions can in many cases be obtained through the application of known security techniques. Of course the right solution depends on the context; security design is conducted on the basis of knowledge of vulnerabilities and threats and the level of risk that can be tolerated, and this information is specific to each individual application or system. Solving real problems also helps advance more fundamental understanding of security; the constraints of a particular problem environment can force rethinking of the structure of the world of building blocks.

¹⁹ John M. Berry. 1998. “BLS Glitch Blamed on Staff Error; Premature Release of Job Data on Web Site Boosted Stocks,” Washington Post, November 7, p. H03.