SUMMARY OF A WORKSHOP ON INFORMATION TECHNOLOGY RESEARCH for Federal Statistics

2

Research Opportunities

Research opportunities explored in the workshop's panel presentations and small-group discussions are described in this chapter, which illustrates the nature and range of IT research issues—including human-computer interaction, database systems, data mining, metadata, information integration, and information security—that arise in the context of the work being conducted by the federal statistical agencies. The chapter also touches on two other challenges pertinent to the work of the federal statistical agencies—survey instruments and the need to limit disclosure of confidential information. This discussion represents neither a comprehensive examination of information technology (IT) challenges nor a prioritization of research opportunities, and it does not attempt to focus on the more immediate challenges associated with implementation.

HUMAN-COMPUTER INTERACTION

One of the real challenges associated with federal statistical data is that the people who make use of it have a variety of goals. There are, first of all, hundreds or thousands of specialists within the statistical system who manipulate the data to produce the reports and indices that government agencies and business and industry depend on. Then there are the thousands, and potentially millions, of persons in the population at large who access the data. Some users access statistical resources daily, others only occasionally, and many others only indirectly, through third parties, but all depend in some fashion on these resources to support important decisions. Federal statistics resources support an increasingly diverse range of users (e.g., high school students, journalists, local community groups, business market analysts, and policy makers) and tasks. The pervasiveness of IT, exemplified by the general familiarity with the Web interface, is continually broadening the user base.

BOX 2.1
Some Policy Issues Associated with Electronic Dissemination

In her presentation at the workshop, Patrice McDermott, from OMB Watch, observed that if information suddenly began to be disseminated by electronic means alone, some people would no longer be able to access it. Even basic telephone service, a precursor for low-cost Internet access, is not universal in the United States. It is not clear that schools and libraries can fill the gap: schools are not open, for the most part, to people who do not have children attending them, and finding resources to invest in Internet access remains a challenge for both schools and public libraries. McDermott added that research by OMB Watch indicates that people see a substantial difference between being directed to a book that contains Census data and being helped to access and navigate through online information. Another issue is the burden imposed by the shifting of costs: if information is available only in electronic form, users and intermediaries such as libraries end up bearing much of the cost of providing access to it, including, for example, the costs of telecommunications, Internet service, and printing.

Workshop participants observed, however, that many are likely to remain without ready access to information online, raising a set of social and policy questions (Box 2.1).
However, over time, a growing fraction of potential users can be expected to gain network access, making it increasingly beneficial to place information resources online, together with capabilities that support their interpretation and enhance the statistical literacy of users. In the meantime, online access is being complemented by published sources and by the journalists, community groups, and other intermediaries who summarize and interpret the data.

The responsibility of a data product designer or provider does not end with the initial creation of that product. There are some important human-computer interaction (HCI) design challenges in supporting a wide range of users. A key HCI design principle is “know thy user”; various approaches to learning about and understanding user abilities and needs are discussed below. Besides underscoring the need to focus on users, workshop participants pointed to some specific issues: universal access, support for users with limited statistical literacy, improved visualization techniques, and new modes of interacting with data. These are discussed in turn below.

User Focus

Iterative, user-centered design and testing are considered crucial to developing usable and useful information products. A better understanding of typical users and the most common tasks they perform, which could range from retrieving standard tables to building sophisticated queries, would facilitate the design of Web sites to meet those users' needs. One important approach discussed at the workshop is to involve the user from the start, through various routine participatory activities, in the design of sites.

Capturing people's routine interactions with online systems to learn what users are doing, what they are trying to do, what questions they are asking, and what problems they are having makes it possible to improve the product design. If, for example, a substantial number of users are seen to ask the same question, the system should be modified to ensure that the answer to this question is easily available—an approach analogous to the “frequently asked questions” concept. Customer or market surveys can also be used in conjunction with ongoing log and site analyses to better understand the requirements of key user groups. There are many techniques that do not associate data with individuals and so are sensitive to privacy considerations.1 For example, collecting frequent queries requires aggregation only at the level of the site, not of the individual. Where individual-level data are useful, they could be made anonymous.

Universal Access

The desire to provide access to statistical information for a broad range of citizens raises concerns about what measures must be taken to ensure universal access.2 Access to computers, once the province of a small number of expert programmers, now extends to a wider set of computer-literate users and an even larger segment of the population sufficiently skilled to use the Web to access information.
The expanding audience for federal statistical data represents both an opportunity and a challenge for information providers.

1   Data on user behavior must be collected and analyzed in ways that are sensitive to privacy concerns and that avoid, in particular, tracking the actions of individuals over time (though this inhibits within-subject analyses). There are also the matters related to providing appropriate notice and obtaining consent for such monitoring.

2   This term, similar to the more traditional label “universal service,” also encompasses economic and social issues related to the affordability of access services and technology, as well as the provision of access through community-based facilities, but these are not the focus of this discussion.

Universality considerations apply as well to the interfaces people use to access information. The Web browser provides a common interface across a wide range of applications and extends access to a much larger segment of the population (anyone with a browser). However, the inertia associated with such large installed software bases tends to slow the implementation of new interface technologies. During the workshop, Gary Marchionini argued that adoption of the Web browser interface has locked in a limited range of interactions and in some sense has set interface design back several years. A key challenge in ensuring universal access is finding upgrade trajectories for interfaces that maximize access across the broadest possible audience.3

Providing access to all citizens also requires attention to the diverse physical needs of users. Making every Web site accessible to everyone requires more than delivering just a plain-text version of a document, because such a version lacks the richness of interaction offered by today's interfaces. Some work is already being done; vendors of operating systems, middleware, and applications provide software hooks that support alternative modes of access. The World Wide Web Consortium is establishing standards and defining such hooks to increase the accessibility of Web sites.

Another dimension of universal access is supporting users whose systems vary in terms of hardware performance, network connection speed, and software. The installed base of networked computers ranges from Intel 80286 processors using 14.4-kbps modems to high-performance computers with optical fiber links that are able to support real-time animation. That variability in the installed base presents a challenge in designing new interfaces that are also compatible with older systems and software.
3   See Computer Science and Telecommunications Board, National Research Council. 1997. More Than Screen Deep: Toward Every-Citizen Interfaces to the Nation's Information Infrastructure. National Academy Press, Washington, D.C.

Literacy, Visualization, and Perception

Given the relatively low level of numerical and statistical literacy in the population at large, it becomes especially important to provide users with interfaces that give them useful, meaningful information. Providing data with a bad interface that does not allow users to interpret data sensibly may be worse than not providing the data at all, because the bad interface frustrates nonexpert users and wastes their time. The goal is to provide not merely a data set but also tools that help users make sense of the data.

Today, most statistical data is provided in tabular form—the form of presentation with which the statistical community has the longest experience. Unfortunately, although it is well understood by both statisticians and expert users, this form of presentation has significant limitations. Tables can be difficult for unsophisticated users to interpret, and they do not provide an engaging interface through which to explore statistical survey data. Also, the types of analyses that can be conducted using summary tables are much more limited than those that can be conducted when access to more detailed data is provided. Workshop participants pointed to the challenge of developing more accessible forms of presentation as central to expanding the audience for federal statistical data.

Statistics represent complex information that might be thought of as multimedia. Even data tables, when sufficiently large, do not lend themselves to display as simple text. Many of the known approaches to multimedia—such as content-based indexing and retrieval—may be applicable to statistical problems as well.

Visualization techniques, such as user-controlled graphical displays and animations, enable the user to explore, discover, and explain trends, outliers, gaps, and jumps, allowing a better understanding of important economic or social phenomena and principles. Well-designed two-dimensional displays are effective for many tasks, but researchers are also exploring three-dimensional and immersive displays. Advanced techniques such as parallel coordinates and novel coding schemes, which complement work being done on three-dimensional and immersive environments, are also worthy of study.

Both representation (what needs to be shown to describe a given set of data) and control (how the user interacts with a system to determine what is displayed) pose challenges. Statisticians have been working on the problem of representation for a very long time.
Indeed, a statistic itself is a very concise condensation of a very large collection of information. More needs to be done in representing large data sets so that users who are not sophisticated in statistical matters can obtain, in a fairly compact way, the sense of the information in large collections of data. Related to this is the need to provide users with appropriate indications of the effects of sampling error.

Basic human perceptual and cognitive abilities affect the interpretation of statistical products. Amos Tversky and others have identified pervasive cognitive illusions, whereby people try to see patterns in random data.4 In the workshop presentation by Diane Schiano, evidence was offered of pervasive perceptual illusions that occur in even the simplest data displays. People make systematic errors in estimating the angle of a single line in a simple two-dimensional graph and in estimating the length of lines and histograms. These are basic perceptual responses that are not subject to cognitive overrides to correct the errors. As displays become more complex, the risk of perceptual errors grows accordingly. Because of this, three-dimensional graphics are often applied when they should not be, such as when the data are only two-dimensional. More generally, because complex presentations and views can suggest incorrect conclusions, simple, consistent displays are generally better.

4   See A. Tversky and D. Kahneman. 1974. “Judgment Under Uncertainty: Heuristics and Biases,” Science 185:1124-1131. One such heuristic/bias is the perception of patterns in random scatter plots. See W.S. Cleveland and R. McGill. 1985. “Graphical Perception and Graphical Methods for Analyzing Scientific Data,” Science 229 (August 30):828-833.

The interpretation of complex data sets is aided by good exploratory tools that can provide both an overview of the data and facilities for navigating through them and zooming in (or “drilling down”) on details. To illustrate the navigation challenge, Cathryn Dippo of the Bureau of Labor Statistics noted that the Current Population Survey's (CPS's) typical monthly file alone contains roughly 1,000 variables, and the March file contains an additional 3,000. Taking into account various supplements to the basic survey, the CPS has 20,000 to 25,000 variables, a number that rapidly becomes confusing for a user trying to interpret or even access the data. That figure is for just one survey; the surveys conducted by the Census Bureau contain some 100,000 variables in all.

Underscoring the importance of providing users with greater support for interaction with data, Schiano pointed to her research that found that direct manipulation through dynamic controls can help people correct some perceptual illusions associated with data presentation. Once users are allowed to interact with an information object and to choose different views, perception is vastly improved.
Controls in common use today are limited largely to scrolling and paging through fairly static screens of information. However, richer modes of control are being explored, such as interfaces that let the user drag items around, zoom in on details, and aggregate and reorder data. The intent is to allow users to manipulate data displays directly in a much more interactive fashion.

Some of the most effective data presentation techniques emerging from human-computer interaction research involve tightly coupled interactions. For example, when the user moves a slider (a control that allows setting the value of a single variable visually), that action should have an immediate and direct effect on the display—users are not satisfied by an unresponsive system. Building systems that satisfy these requirements in the Web environment, where network communications latency delays data delivery and makes it hard to tightly couple a user action and the resulting display, is an interesting challenge. What, for example, are the optimal strategies for allocating data and processing between the client and the server in a networked environment in order to support this kind of interactivity?

Two key elements of interactivity are the physical interface and the overall style of interaction. The trend in physical interfaces has been toward a greater diversity of devices. For example, a mouse or other two-dimensional pointing device supplements keyboard input in desktop computing, while a range of three-dimensional interaction devices are used in more specialized applications. Indeed, various sensors are being developed that offer enhanced direct manipulation of data. One can anticipate that new ways of interacting will become commonplace in the future. How can these diverse and richer input and output devices be used to disseminate statistical information better? The benefits of building more flexible, interactive systems must be balanced against the risk that the increased complexity can lead unsophisticated users to draw the wrong conclusions (e.g., when they do not understand how the information has been transformed by their interactions with it).

Also at work today is a trend away from static displays toward what Gary Marchionini termed “hyperinteraction,” which leads users to expect quick action and instant access to large quantities of information by pointing and clicking across the Web or by pressing the button on a TV remote control. An ever-greater fraction of the population has such expectations, affecting how one thinks about disseminating statistical information.

DATABASE SYSTEMS

Database systems cover a range of applications, from the large-scale relational database systems widely used commercially, to systems that provide sophisticated statistical tools and spreadsheet applications that provide simple data-manipulation functionality along with some analysis capability.
Much of the work today in the database community is motivated by a commercial interest in combining transactions, analysis, and mining of multiple databases in a distributed environment. For example, data warehouse environments—terabyte or multiterabyte systems that integrate data from various locations—replicate transactions databases to support problem solving and decision making. Workshop participants observed that the problems of other user communities, such as the federal statistics community, can be addressed in this fashion as well.

Problems cited by the federal statistics community include legacy migration, information integration across heterogeneous databases, and mining data from multiple sources. These challenges, perhaps more mundane than the splashier Web development activities that many IT users are focused on, are nonetheless important. William Cody noted in the workshop that the database community has not focused much on these hard problems but is now increasingly addressing them in conjunction with its application partners. Commercial systems are beginning to address these needs.

Today's database systems do not build in all of the functionality to perform many types of analysis. There are several approaches to enhancing functionality, each with its advantages and disadvantages. Database systems can be expanded in an attempt to be all things to all people, or they can be constructed so that they can be extended using their own internal programming language. Another approach is to give users the ability to extract data sets for analysis using other tools and application languages. Researchers are exploring what functions are best incorporated in databases, looking at such factors as the performance trade-offs between the overhead of including a function inside a database and the delay incurred if a function must be performed outside the database system or in a separate database system.

Building increased functionality into database systems offers the potential for increasing overall processing efficiency, Cody observed. There are delays inherent in transferring data from one database to another; if database systems have enhanced functionality, processing can be done on a real-time or near-real-time basis, allowing much faster access to the information. Built-in functionality also permits databases to perform integrated tasks on data inside the database system. Also, relational databases lend themselves to parallelization, whereas tools external to databases have not been built to take as much advantage of it. Operations that can be included in the database engine are thus amenable to parallelization, allowing parallel processing computing capabilities to be exploited.
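
One way to see why in-engine statistical operations parallelize well is that many of them can be decomposed into per-partition partial aggregates that are combined at the end. The following is a minimal Python sketch of that decomposition, not a description of any system discussed at the workshop. The function names `partial_sums` and `parallel_mean_variance` and the toy partitions are assumptions made for illustration, and threads stand in for the parallel workers of a database engine.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(partition):
    """Per-partition aggregates: count, sum, and sum of squares."""
    return len(partition), sum(partition), sum(x * x for x in partition)


def parallel_mean_variance(partitions):
    """Combine partial aggregates into a mean and a population variance."""
    # Each partition is summarized independently; threads are used here only
    # to illustrate the decomposition, not to claim a real speedup.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(partial_sums, partitions))
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical data split across three partitions.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = parallel_mean_variance(partitions)
print(mean, variance)
```

Only the three small aggregates per partition need to be moved and combined, which is the property that makes such operations attractive candidates for inclusion in a parallel engine.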
Cody described the likely evolution over the coming years of an interactive, analytic data engine, which has as its core a database system enriched with new functions. Users would be able to interact with the data more directly through visualization tools, allowing interactive data exploration. This concept is simple, but selecting and building the required set of basic statistical operations into database systems and creating the integration tools needed to use a workstation to explore databases interactively are significant challenges that will take time. Statistics-related operations that could be built into database systems include the following:

Data-mining operations. By bringing data-mining primitives into the database, mining operations can occur automatically as data are collected in operational systems and transferred into warehousing systems rather than waiting until later, after special data sets have been constructed for data mining.

Enhanced statistical analysis. Today, general-purpose relational database systems (as opposed to database systems specifically designed for statistical analysis) for the most part support only fairly simple statistical operations. A considerable amount of effort is being devoted to figuring out which additional statistical operators should and could be included in evolving database systems. For example, could one perform a regression or compute statistical measures such as covariances and correlations directly in the database?

Time series operators. The ability to conduct a time-series analysis within a database system would, for example, allow one to derive a forecast based on the information coming in real time to a database.

Sampling. Sampling design is a sophisticated practice. Research is addressing ways to introduce sampling into database systems so that the user can make queries based on samples and obtain confidence limits around these results. While today's database systems use sampling during the query optimization process to estimate the result sizes of intermediate tables, sampling operators are not available to the end-user application.

SQL, which is the standard language used to interact with database systems, provides a limited set of operations for aggregating data, although this has been augmented with the recent addition of new functionality for online analytical processing. Additional support for statistical operations and sampling would allow, for example, estimating the average value of a variable in a data set containing millions of records by requesting that the database itself take a sample and calculate its average. The direct result, without any additional software to process the data, would be the estimated mean together with some confidence limit that would depend on the variance and the sample size.
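
The sampling idea in the preceding paragraph can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `records` table, its `income` column, the synthetic values, and the helper `estimate_mean` are all hypothetical; SQLite has no dedicated sampling operator, so `ORDER BY RANDOM() LIMIT ?` stands in for one; and the confidence limit uses a normal approximation that ignores the finite-population correction.

```python
import math
import sqlite3


def estimate_mean(conn, sample_size, z=1.96):
    """Estimate mean income from a simple random sample drawn inside SQLite."""
    # The database draws the sample and returns three aggregates only.
    n, mean, mean_sq = conn.execute(
        "SELECT COUNT(*), AVG(income), AVG(income * income) "
        "FROM (SELECT income FROM records ORDER BY RANDOM() LIMIT ?)",
        (sample_size,),
    ).fetchone()
    # Sample variance recovered from the mean and the mean of squares.
    variance = max(0.0, mean_sq - mean * mean) * n / (n - 1)
    # Half-width of an approximate 95% interval (z = 1.96 by default).
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width, n


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, income REAL)")
# Synthetic values for illustration only; not real survey data.
conn.executemany(
    "INSERT INTO records (income) VALUES (?)",
    ((float(20000 + (i * 37) % 60000),) for i in range(50000)),
)

estimate, lower, upper, n = estimate_mean(conn, 2000)
print(f"estimated mean {estimate:.0f}, interval {lower:.0f} to {upper:.0f}, n={n}")
```

As the text notes, the width of the interval depends on the sample variance and the sample size: quadrupling `sample_size` roughly halves it.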
Before the advent of object-relational database systems, which add object-oriented capabilities to relational databases, adding such extensions would generally have required extensive effort by the database vendor. Today, object-relational systems make it easier for third parties, as well as sophisticated users, to add both new data types and new operations into a database system. Since it is probably not reasonable to push all of the functionality of a statistical analysis product such as SAS into a general-purpose database system, a key challenge is to identify particular aggregation and sampling techniques and statistical operations that would provide the most leverage in terms of increasing both performance and functionality.

DATA MINING

Data mining enables the use of historical data to support evidence-based decision making—often without the benefit of explicitly stated statistical hypotheses—to create algorithms that can make associations that were not obvious to the database user. Ideas for data mining have been explored in a wide variety of contexts. In one example, researchers at Carnegie Mellon University studied a medical database containing several hundred medical features of some 10,000 pregnant women over time. They applied data-mining techniques to this collection of historical data to derive rules that better predict the risk of emergency caesarian sections for future patients. One pattern identified in the data predicts that when three conditions are met—no previous vaginal delivery, an abnormal second-trimester ultrasound reading, and the infant malpresenting—the patient's risk of an emergency caesarian section rises from a base rate of about 7 percent to approximately 60 percent.5

Data mining finds use in a number of commercial applications. A database containing information on software purchasers (such as age, income, what kind of hardware they own, and what kinds of software they have purchased so far) might be used to forecast who would be likely to purchase a particular software application in the future. Banks or credit card companies analyze historical data to identify customers that are likely to close their accounts and move to another service provider; predictive rules allow them to take preemptive action to retain accounts. In manufacturing, data collected over time from manufacturing processes (e.g., records containing various readings as items move down a production line) can be used by decision makers interested in process improvements in a production facility.

Both statisticians and computer scientists make use of some of the same data-mining tools and algorithms; researchers in the two fields have similar goals but somewhat different approaches to the problem.
Statisticians, much as they would before beginning any statistical analysis, seek through interactions with the data owner to gain an understanding of how and why the data were collected, in part to make use of this information in the data mining and in part to better understand the limitations on what can be determined by data mining. The computer scientist, on the other hand, is more apt to focus on discovering ways to efficiently manipulate large databases in order to rapidly derive interesting or indicative trends and associations. Establishing the statistical validity of these methods and discoveries may be viewed as something that can be done at a later stage.

Sometimes information on the conditions and circumstances under which the data were collected may be vague or even nonexistent, making it difficult to provide strong statistical justification for choosing particular data-mining tools or to establish the statistical validity of patterns identified from the mining; the statistician is arguably better equipped to understand the limitations of employing data mining in such circumstances. Statisticians seek to separate structure from noise in the data and to justify the separation based on principles of statistical inference. Similarly, statisticians approach issues like subsampling methodology as a statistical problem.

5   This example is described in more detail in Tom M. Mitchell. 1999. “Machine Learning and Data Mining,” Communications of the ACM 42(11).

Research on data mining has been stimulated by growth both in the quantity of data being collected and in the computing power available for analyzing it. At present, a useful set of first-generation algorithms has been developed for doing exploratory data analysis, including logistic regression, clustering, decision-tree methods, and artificial-neural-net methods. These algorithms have already been used to create a number of applications; at least 50 companies today market commercial versions of such analysis tools.

One key research issue is the scalability of data-mining algorithms. Mining today frequently relies on approaches such as selecting subsets of the data (e.g., by random sampling) and summarizing them, or deriving smaller data sets by methods other than selecting subsets (e.g., to perform a regression relating two variables, one might divide the data into 1,000 subgroups and perform the regression on each group, yielding a derived subset consisting of 1,000 sets of regression coefficients). For example, to mine a 4-terabyte database, one might do the following: sample it down to 200 gigabytes, aggregate it to 80 gigabytes, and then filter the result down to 10 gigabytes.

A relatively new area for data mining is multimedia data, including maps, images, and video.
These are much more complex than the numerical data that have traditionally been mined, but they are also potentially rich new sources of information. While existing algorithms can sometimes be scaled up to handle these new types of data, mining them frequently requires completely new methods. Methods to mine multimedia data together with more traditional data sources could allow one to learn something that had not been known before. To use the earlier example, which involved determining risk factors in pregnancy, one would analyze not only the traditional features such as age (a numerical field) and childbearing status (a Boolean field) but also more complex multimedia features such as videosonograms and unstructured text notes entered by physicians. Another multimedia data-mining opportunity suggested at the workshop was to explore X-ray images (see Box 2.2) and numerical and text clinical data collected by the NHANES survey.

Active experimentation is an interesting research area related to data mining. Most analysis methods today analyze precollected samples of data. With the Internet and connectivity allowing researchers to easily

as an interview is being conducted, rather than having to wait for post-interview edits and possibly incurring the cost and delay of a follow-up interview to correct the data. Past attempts to build in such checks are reported to have made the interview instruments run excessively slowly, so the checks were removed.

Improved performance. Another dimension to the challenges of conducting surveys is the hardware platform. Laptops are the current platform of choice for taking a survey. However, the current generation of machines is not physically robust in the field, is too difficult to use, and is too heavy for many applications (e.g., when an interviewer stands in a doorway, as happens when a household is being screened for possible inclusion in a survey). Predictable advances in computer hardware will address size and shape, weight, and battery life problems, while advances in processing speed will enable on-the-fly checking, as noted above. Continued commercial innovation in portable computer devices, building on the present generation of personal digital assistants, which provide sophisticated programmability, appears likely to provide systems suitable for many of these applications. It is, of course, a separate matter whether procurement processes and budgets can assimilate use of such products quickly.

New modes of interaction with survey instruments. Another set of issues relates to the limitations of keyboard entry. While a keyboard is suitable for a telephone interview or an interview conducted inside someone's house, it has some serious limitations in other circumstances, such as when an interviewer is conducting an initial screening interview at someone's doorstep or in a driveway. Advances in speech-to-text technology might offer advantages for certain types of interviews, as might handwriting recognition capability, which is being made available in a number of computing devices today. Limited-vocabulary (e.g., “yes,” “no,” and numerical digits), speaker-independent speech recognition systems have been used for some time in survey work.8 The technology envisioned here would provide speaker-independent capability with a less restricted vocabulary. With this technology it would be possible to capture answers in a much less intrusive fashion, which could lead to improvements in overall survey accuracy. Speech-to-text would also help reduce human intermediation if it could allow interviewees to interact directly with the survey instrument. There are significant research questions regarding the implications of different techniques for administering survey questionnaires, with some results in the literature suggesting that choice of administration technique can affect survey results significantly.9 More research on this question, as well as on the impact of human intermediation on data collection, would be valuable.

8   The Bureau of Labor Statistics started using this technology for the Current Employment Survey in 1992. See Richard L. Clayton and Debbie L.S. Winter. 1992. “Speech Data Entry: Results of a Test of Voice Recognition for Survey Data Collection,” Journal of Official Statistics 8:377-388.

LIMITING DISCLOSURE

Maintaining the confidentiality of respondents in data collected under pledges of confidentiality is an intrinsic part of the mission of the federal statistical agencies. It is this promise of protection against disclosure of confidential information (protecting individual privacy or business trade secrets) that convinces many people and businesses to comply willingly and openly with requests for information about themselves, their activities, and their organizations. Hence, there are strong rules in place governing how agencies may (and may not) share data,10 and data that divulge information about individual respondents are not released to the public.

Disclosure limitation is a research area that spans both statistics and IT; researchers in both fields have worked on the issue in the past, and approaches and techniques from both fields have yielded insights. While nontechnical approaches play a role, IT tools are frequently employed to help ease the tension between society's demands for data and the agencies' ability to collect information and maintain its confidentiality.

Researchers rely on analysis of data sets from federal statistical surveys, which are viewed as providing the highest-quality data on a number of topics, to explore many economic and social phenomena. While some of their analysis can be conducted using public data sets, some of it depends on information that could be used to infer information about individual respondents, including microdata, which are the data sets containing records on individual respondents.
9   See, e.g., Sara Kiesler and Lee Sproull. 1986. “Response Effects in the Electronic Survey,” Public Opinion Quarterly 50:243-253; and Wendy L. Richman, Sara Kiesler, Suzanne Weisband, and Fritz Drasgow. 1999. “A Meta-analytic Study of Social Desirability Distortion in Computer-Administered Questionnaires, Traditional Questionnaires, and Interviews,” Journal of Applied Psychology 84(5, October):754-775.

10   These rules were clarified and stated consistently in Office of Management and Budget, Office of Information and Regulatory Affairs. 1997. “Order Providing for the Confidentiality of Statistical Information,” Federal Register 62(124, June 27):33043. Available online at <http://www.access.gpo.gov/index.html>.

Statistical agencies must strike a balance between the benefits obtained by releasing information for legitimate research and the potential for unintended disclosures that could result from releasing information. The problem is more complicated than simply whether or not to release microdata. Whenever an agency releases statistical information, it is inherently disclosing some information about the source of the data from which the statistics are computed and potentially making it easier to infer information about individual respondents.

Contrary to what is sometimes assumed, protecting data confidentiality is not as simple as merely suppressing names and other obvious identifiers. In some cases, one can re-identify such data using record linkage techniques. Record linkage, simply put, is the process of using identifying information in a given record to identify other records containing information on the same individual or entity.11 For example, a set of attributes such as geographical region, sex, age, race, and so forth may be sufficient to identify individuals uniquely. Moreover, because multiple sources of data may be drawn on to infer identity, understanding how much can be inferred from a particular set of data is difficult. A simple example provided by Latanya Sweeney in her presentation at the workshop illustrates how linking can be used to infer identity (Box 2.3).

11   For an overview and series of technical papers on record linkage, see Committee on Applied and Theoretical Statistics, National Research Council and Federal Committee on Statistical Methodology, Office of Management and Budget. 1999. Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition. National Academy Press, Washington, D.C.

Both technical and nontechnical approaches have a role in improving researcher access to statistical data. Agencies are exploring a variety of nontechnical solutions to complement their technical solutions. For example, the National Center for Education Statistics allows researchers access to restricted-use data under strict licensing terms, and the National Center for Health Statistics (NCHS) recently opened a research data center that makes data files from many of its surveys available, both on-site and via remote access, under controlled conditions. The Census Bureau has established satellite centers for secured access to research data in partnership with the National Bureau of Economic Research, Carnegie Mellon University, and the University of California (at Berkeley and at Los Angeles), and it intends to open additional centers.12 Access to data requires specific contractual arrangements aimed at safeguarding confidentiality, and de-identified public-use microdata user files can be accessed through third parties. For example, data from the National Crime Victimization Survey are made available through the Interuniversity Consortium for Political and Social Research (ICPSR) at the University of Michigan. Members of the research community are, of course, interested in finding less restrictive ways of giving researchers access to confidential data that do not compromise the confidentiality of that data.

12   See U.S. Census Bureau, Office of the Chief Economist. 1999. Research Data Centers. U.S. Census Bureau, Washington, D.C., last revised September 28. Available online at <http://www.census.gov/cecon/www/rdc.html>.

BOX 2.3 Using External Data to Re-identify Personal Data

Removing names and other unique identification information is not sufficient to prevent re-identifying the individuals associated with a particular data record. Latanya Sweeney illustrated this point in her presentation at the workshop using an example of how external data sources can be used to determine the identity of the individuals associated with medical records. Hospitals and insurers collect information on individual patients. Because such data are generally believed to be anonymous once names and other unique identifiers have been removed, copies of these data sets are provided to researchers and sold commercially. Sweeney described how she re-identified these seemingly anonymous records using information contained in voter registration records, which are readily purchased for many communities.

Voter registration lists, which provide information on name, address, and so forth, are likely to have three fields in common with de-identified medical records: zip code, birth date, and sex. How unique a link can be established using this information? In one community where Sweeney attempted to re-identify personal data, there are 54,805 voters. The range of possible birth dates (year, month, day) is relatively small (about 36,500 dates over 100 years) and so potentially can be useful in identifying individuals. In the community she studied, there is a concentration of people in their 20s and 30s, and birth date alone uniquely identifies about 12 percent of the community's population. That is, for these individuals, given a person's birth date and knowledge that the person lived in that community, one could uniquely identify him or her. Birth date and gender were unique for 29 percent of the voters; birth date and zip code, for 69 percent; and birth date and full postal code, for 97 percent.
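The uniqueness percentages quoted in the box are the kind of figure that can be computed by counting how many records have a combination of field values that no other record shares. The sketch below is a minimal illustration of that calculation and is not Sweeney's code or data: the function name `fraction_unique`, the field names, and the six invented records are all assumptions.

```python
from collections import Counter


def fraction_unique(records, fields):
    """Share of records whose combination of `fields` appears exactly once."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Invented voter-list rows (hypothetical values, not real people or places).
voters = [
    {"birth_date": "1965-03-02", "sex": "F", "zip": "00001"},
    {"birth_date": "1965-03-02", "sex": "M", "zip": "00001"},
    {"birth_date": "1965-03-02", "sex": "M", "zip": "00002"},
    {"birth_date": "1971-11-20", "sex": "F", "zip": "00002"},
    {"birth_date": "1980-07-14", "sex": "F", "zip": "00001"},
    {"birth_date": "1980-07-14", "sex": "F", "zip": "00001"},
]

for fields in (["birth_date"], ["birth_date", "sex"], ["birth_date", "sex", "zip"]):
    print(fields, round(fraction_unique(voters, fields), 2))
```

In this toy table the unique share rises from 1 in 6 records to 2 in 6 and then 4 in 6 as fields are added, mirroring the direction of the trend reported in the box. A record that is unique on fields shared with an external list is a candidate for linkage to that list.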
Academic work on IT approaches to disclosure limitation has so far been confined largely to techniques for limiting disclosure resulting from release of a given data set. However, as the example provided by Sweeney illustrates, disclosure limitation must also address the extent to which released information can be combined with other, previously released statistical information, including administrative data and commercial and other publicly available data sets, to make inferences. Researchers have recognized the importance of understanding the impact on confidentiality of these external data sources, but progress has been limited because the problem is so complex.

The issue is becoming more important for at least two reasons. First, the quantity of personal information being collected automatically is increasing rapidly (Box 2.4) as the Web grows and database systems become more sophisticated. Second, the statistical agencies, to meet the research needs of their users, are being asked to release “anonymized” microdata to support additional data analyses. As a result, a balancing act must be performed between the benefits obtained from data release and the potential for unwanted disclosure that comes from linking with other databases. What is the disclosure effect, at the margin, of the release of a particular set of data from a statistical agency?

BOX 2.4 Growth in the Collection of Personal Data

At the workshop, Latanya Sweeney described a metric she had developed to provide a sense of how the amount of personal data is growing. Her measure (disk storage per person, calculated as the amount of storage in the form of hard disks sold per year divided by the adult world population) is based on the assumption that access to inexpensive computers with very large storage capacities is enabling the collection of an increasing amount of personal data.

Based on this metric, the several thousand characters of information that could be printed on an 8 1/2 by 11 inch piece of paper would have documented some 2 months of a person's life in 1983. The estimate seems reasonable: at that time such information probably would have been limited to that contained in school or employment records, the telephone calls contained on telephone bills, utility bills, and the like. By 1996, that same piece of paper would document 1 hour of a person's life. The growth can be seen in the increased amount of information contained on a Massachusetts birth certificate; it once had 15 fields of information but today has more than 100. Similar growth is occurring in educational data records, grocery store purchase logs, and many other databases, observed Sweeney. Projections for the metric in 2000, with 20-gigabyte drives widely available, are that the information contained on a single page would document less than 4 minutes of a person's life, information that includes image data, Web and Internet usage data, biometric data (gathered for health care, authentication, and even Web-based clothing purchases), and so on.
The issue of disclosure control has also been addressed in the context of work on multilevel security in database systems, in which the security authorization level of a user affects the results of database queries.13 A simple disclosure control mechanism such as classifying individual records is not sufficient because of the possible existence of an inference channel whereby information classified at a level higher than that for which a user is cleared can be inferred by that user based on information at lower levels (including external information) that is possessed by that user. Such channels are, in general, hard to detect because they may involve a complex chain of inferences and because of the ability of users to exploit external data.14

13   See National Research Council and Social Science Research Council. 1993. Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. National Academy Press, Washington, D.C., pp. 150-151; and D.E. Denning et al. 1988. “A Multilevel Relational Data Model,” Proceedings of the 1987 IEEE Symposium on Research Security and Privacy. IEEE Computer Society, Los Alamitos, Calif.

Various statistical disclosure-limiting techniques have been and are being developed to protect different types of data. The degree to which these techniques need to be unique to specific data types has not been resolved. The bulk of the research by statistics researchers on statistical disclosure limitation has focused on tabular data, and a number of disclosure-limiting techniques have been developed to protect the confidentiality of individual respondents (including people and businesses), including the following:

• Cell suppression: the blanking of table entries that would provide information that could be narrowed down to too small a set of individuals;
• Swapping: exchanging pieces of information among similar individuals in a data set; and
• Top coding: aggregating all individuals above a certain threshold into a single top category. This allows, for example, hiding information about an individual whose income was significantly greater than the incomes of the other individuals in a given set that would otherwise appear in a lone row of a table.

However, researchers who want access to the data are not yet satisfied with currently available tabular data-disclosure solutions. In particular, some of these approaches rely on distorting the data in ways that can make it less acceptable for certain uses. For example, swapping can alter records in a way that throws off certain kinds of research (e.g., it can limit researchers' ability to explore correlations between various attributes).
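The three tabular techniques named above can be sketched in a few lines each. This is a simplified illustration under stated assumptions, not any agency's procedure: the function names, thresholds, and sample values are hypothetical, and the suppression function performs primary suppression only (production systems also need complementary suppression so that blanked cells cannot be recovered from row and column totals).

```python
def top_code(values, threshold):
    """Report every value above `threshold` as the threshold itself."""
    return [min(value, threshold) for value in values]


def suppress_small_cells(table, min_count):
    """Blank (None) any nonzero cell count smaller than `min_count`."""
    return {cell: (None if 0 < count < min_count else count)
            for cell, count in table.items()}


def swap_attribute(records, field, pairs):
    """Exchange `field` between paired record indices; other fields are untouched."""
    swapped = [dict(record) for record in records]
    for i, j in pairs:
        swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped


# Hypothetical inputs for illustration only.
incomes = [32_000, 48_000, 51_000, 2_500_000]
table = {("County A", "under 30"): 120, ("County A", "30 and over"): 2,
         ("County B", "under 30"): 45, ("County B", "30 and over"): 0}
people = [{"region": "North", "income": 32_000},
          {"region": "South", "income": 48_000},
          {"region": "North", "income": 51_000}]

print(top_code(incomes, 150_000))
print(suppress_small_cells(table, 5))
print(swap_attribute(people, "region", [(0, 1)]))
```

The swap leaves the overall count of each region unchanged but breaks the pairing between region and income for the swapped records, which is the distortion of correlations that the text notes researchers object to.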
14   See T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Type Overlap Relations and the Inference Problem,” Proceedings of the 8th IFIP WG 11.3 Working Conference on Database Security, August; T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Issues in Data-Level Monitoring of Conjunctive Inference Channels,” Proceedings of the 8th IFIP WG 11.3 Working Conference on Database Security, August; and T.F. Lunt, T.D. Garvey, X. Qian, and M.E. Stickel. 1994. “Detection and Elimination of Inference Channels in Multilevel Relational Database Systems,” Proceedings of the IEEE Symposium on Research in Security and Privacy, May 1993. For an analysis of the conceptual models underlying multilevel security, see Computer Science and Telecommunications Board, National Research Council. 1999. Trust in Cyberspace. National Academy Press, Washington, D.C.

While disclosure issues for tabular data sets have received the most attention from researchers, many other types of data are also released, both publicly and to more limited groups such as researchers, giving rise to a host of questions about how to limit disclosure. Some attention has been given to microdata sets and the creation of public-use microdata files. The proliferation of off-the-shelf software for data linking and data combining appears to have raised concerns about releasing microdata. None of the possible solutions to this problem coming from the research community (e.g., random sampling, masking, or synthetic data generation) seems mature enough today to be adopted as a data release technique.

Digital geospatial data, including image data, are becoming more widely available and are of increasing interest to the research community. Opportunities for and interest in linking data sets by spatial coordinates can be expected to grow correspondingly. In many surveys, especially natural resources or environmental surveys, the subject matter is inherently spatial. And spatial data are instrumental in research in many areas, including public health and economic development. The confidentiality of released data based on sample surveys is generally protected by minimizing the chance that a respondent can be uniquely identified using demographic variables and other characteristics. The situations where sampling or observational units (e.g., person, household, business, or land plot) are linked with a spatial coordinate (e.g., latitude and longitude) or another spatial attribute (e.g., Census block or hydrologic unit) have been less well explored. Precise spatial coordinates for sampling or observational units in surveys are today generally considered identifying information and are thus excluded from the information that can be released with a public data set. Identification can also be achieved through a combination of less precise spatial attributes (e.g., county, Census block, hydrologic unit, land use), and care must be taken to ensure that including variables of this sort in a public data set will not allow individual respondents to be uniquely identified.
Techniques to limit information disclosure associated with spatial data have received relatively little attention, and research is needed on approaches that strike an appropriate balance between two opposing forces: (1) the need to protect the confidentiality of sample and observational units when spatial coordinates or related attributes are integral to the survey and (2) the benefits of using spatial information to link with a broader suite of information resources. Such approaches might draw from techniques currently used to protect the confidentiality of alphanumeric human population survey data. For example, random noise might be added to make the spatial location fuzzier, or classes of spatial attributes might be combined to create a data set with lower resolution. It is possible that the costs and benefits of methods for protecting the confidentiality of spatial data will differ from those where only alphanumeric data are involved. In addition, alternative paradigms making use of new information technologies may be more appropriate for problems specific to spatial data. One might, for instance, employ a behind-the-scenes mechanism for accurately combining spatial information where the linkage, such as the merging of spatial data sets, occurs in a confidential “space” to produce a product such as a map or a data set with summaries that do not disclose locations. In some cases, this might include a mechanism that implements disclosure safeguards.

A third, more general, issue is how to address disclosure limitation when multimedia data such as medical images are considered. Approaches developed for numerical tabular or microdata do not readily apply to images, instrument readings, text, or combinations of them. For example, how does one ensure that information gleaned from medical images cannot be used to re-identify records? Given the considerable interest of both computer scientists and statisticians in applying data-mining techniques to extract patterns from multimedia data, collaboration with computer scientists on disclosure-limiting techniques for these data is likely to be fruitful.

Few efforts have been made to evaluate the success of data release strategies in practice. Suppose, for example, that a certain database is proposed for release. Could one develop an analytical technique to help data managers evaluate the potential for unwanted disclosure caused by the proposed release? The analysis would evaluate the database itself, along with meta-information about other known, released databases, so as to identify characteristics of additional external information that could cause an unwanted disclosure. It could be used to evaluate not only the particular database proposed for release but also the impact of that release on potential future releases of other databases.

Several possible approaches were identified by workshop participants. First, one can further develop systematic approaches for testing the degree to which a particular release would identify individuals. Given that it is quite difficult to know the full scope of information available to a would-be “attacker,” it might also be useful to develop models of the information available to and the behavior of someone trying to overcome attempts to limit disclosure and to use these models to test the effectiveness of a particular disclosure limitation approach. Another approach, albeit a less systematic one, is to explore red teaming to learn how a given data set could be exploited (including by combining it with other, previously disclosed or publicly available data sets). Red teaming in this context is like red teaming to test information system security (a team of talented individuals is invited to probe for weaknesses in a system15), and the technique could benefit from collaboration with IT researchers and practitioners.

15   A recent CSTB report examining defense command-and-control systems underscored the importance of frequent red teaming to assess the security of critical systems. See Computer Science and Telecommunications Board, National Research Council. 1999. Realizing the Potential of C4I: Fundamental Challenges. National Academy Press, Washington, D.C.
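As an illustration of the two spatial masking ideas raised earlier in this section (adding random noise to a location, and lowering resolution by merging locations into coarser classes), the following sketch perturbs and coarsens a latitude/longitude pair. The function names, the uniform noise distribution, and the half-degree grid are assumptions chosen for simplicity; the text does not prescribe any particular method, and a real release would need an analysis of how much protection a given noise level actually provides.

```python
import math
import random


def jitter(lat, lon, max_offset_deg, rng):
    """Add independent uniform noise of at most `max_offset_deg` to each coordinate."""
    return (lat + rng.uniform(-max_offset_deg, max_offset_deg),
            lon + rng.uniform(-max_offset_deg, max_offset_deg))


def coarsen(lat, lon, cell_deg):
    """Snap a point to the south-west corner of its `cell_deg` grid cell."""
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg)


# Hypothetical sampling-unit location, for illustration only.
rng = random.Random(42)
plot = (40.25, -100.75)

noisy = jitter(plot[0], plot[1], 0.05, rng)
gridded = coarsen(plot[0], plot[1], 0.5)
print(noisy)
print(gridded)
```

Jittering keeps each unit distinct but moves it, while coarsening maps every unit in a cell to the same reported location; which trade-off is acceptable depends on whether the analysis needs distances between units or only regional summaries.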

TRUSTWORTHINESS OF INFORMATION SYSTEMS

The challenge of building trustworthy (secure, dependable, and reliable) systems has grown along with the increasing complexity of information systems and their connectedness, ubiquity, and pervasiveness. This is a burgeoning challenge to the federal statistical community as agencies move to greater use of networked systems for data collection, processing, and dissemination. Thus, even as solutions are developed, the goal being pursued often appears to recede.16 There have been substantial advances in some areas of security and particular problems have been solved. For example, if one wishes to protect information while it is in transit on a network, the technology to do this is generally considered to be available.17 Hence experts tend to agree that a credit card transaction over the Internet can be conducted with confidence that credit card numbers cannot be exposed or tampered with while they are in transit. On the other hand, there remain many difficult areas: for example, unlike securing information in transit, the problem of securing the information on the end systems has, in recent years, not received the attention that it demands.

16   The recent flap over the proposed Federal Intrusion Detection Network (FIDnet) indicates that implementing security measures is more complicated in a federal government context.

17   For a variety of reasons, including legal and political issues associated with restrictions that have been placed on the export of strong cryptography from the United States, these technologies are not as widely deployed as some argue they should be. See, e.g., Computer Science and Telecommunications Board, National Research Council. 1996. Cryptography's Role in Securing the Information Society. National Academy Press, Washington, D.C. These restrictions have recently been relaxed.

Protecting against disclosure of confidential information and ensuring the integrity of the collection, analysis, and dissemination process are critical issues for federal statistical agencies. For the research community that depends on federal statistics, a key security issue is how to facilitate access to microdata sets without compromising their confidentiality. As noted above, the principal approach being used today is for researchers to relocate themselves temporarily to agency offices or one of a small number of physically secured data centers, such as those set up by the Census Bureau and the NCHS. Unfortunately, the associated inconveniences, such as the need for frequent travel, are cited by researchers as a significant impediment to working with microdata. Another possible approach being explored is the use of various security techniques to permit off-site access to data. NCHS is one agency that has established remote data access services for researchers. This raises several issues. For example, what is the trade-off between permitting off-site users to replicate databases to their own computers in a secure fashion for local analysis and permitting users to have secured remote access to external analysis software running on computers located at a secured center? Both approaches require attention to authentication of users and both require safeguards, technological or procedural, to prevent disclosure as a result of the microdata analysis.18

Another significant challenge in the federal statistics area is maintaining the integrity of the process by which statistical data are collected, processed, and disseminated. Federal statistics carry a great deal of authority because of the reputation that the agencies have developed, a reputation that demands careful attention to information security. Discussing the challenges of maintaining the back-end systems that support the electronic dissemination of statistics products, Michael Levi of the Bureau of Labor Statistics cited several demands placed on statistics agencies: systems that possess automated failure detection and recovery capabilities; better configuration management including installation, testing, and reporting tools; and improved tools for intrusion prevention, detection, and analysis.

As described above, the federal statistical community is moving away from manual, paper-and-pencil modes of data collection to more automated modes. This trend started with the use of computer-assisted techniques (e.g., CAPI and CATI) to support interviewers and over time can be expected to move toward more automated modes of data gathering, including embedded sensors for automated collection of data (e.g., imagine if one day the American Travel Survey were to use Global Positioning System satellite receivers and data recorders instead of surveys).
18   A similar set of technical requirements arises in supporting the geographically dispersed workers who conduct field interviews and report the data that have been collected. See, for example, Computer Science and Telecommunications Board, National Research Council. 1992. Review of the Tax Systems Modernization of the Internal Revenue Service. National Academy Press, Washington, D.C.

Increasing automation increases the need to maintain the traceability of data to its source as the data are transferred from place to place (e.g., uploaded from a remote site to a central processing center) and are processed into different forms during analysis (e.g., to ensure that the processed data in a table in fact reflect the original source data). In other words, there is a greater challenge in maintaining process integrity, that is, a chain of evidence from source to dissemination.

There are related challenges associated with avoiding premature data release. In some instances, data have been inadvertently released before the intended point in time. For example, the Bureau of Labor Statistics prematurely released part of its October 1998 employment report. According to press reports citing a statement made by BLS Commissioner Katharine G. Abraham, this happened when information was moved to an internal computer by a BLS employee who did not know it would thereupon be transferred immediately to the agency's World Wide Web site and thus be made available to the public.19 The processes for managing data apparently depended on manual procedures. What kind of automated process-support tools could be developed to make it much more difficult to release information prematurely?

19   John M. Berry. 1998. “BLS Glitch Blamed on Staff Error; Premature Release of Job Data on Web Site Boosted Stocks,” Washington Post, November 7, p. H03.

In the security research literature, problems and solutions are abstracted into a set of technologies or building blocks. The test of these building blocks is how well researchers and technologists can apply them to understand and address the real needs of customers. While there are a number of unsolved research questions in information security, solutions can in many cases be obtained through the application of known security techniques. Of course the right solution depends on the context; security design is conducted on the basis of knowledge of vulnerabilities and threats and the level of risk that can be tolerated, and this information is specific to each individual application or system. Solving real problems also helps advance more fundamental understanding of security; the constraints of a particular problem environment can force rethinking of the structure of the world of building blocks.
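One known building block that bears on the traceability concern raised above (a chain of evidence from source to dissemination) is a hash chain, in which each processing step records a digest covering both its own content and the digest of the step before it. The sketch below is an assumption-laden illustration, not a technique described at the workshop: the function names `add_step` and `verify`, the record layout, and the sample steps are hypothetical. It uses SHA-256 from Python's standard `hashlib` module.

```python
import hashlib
import json


def _digest(description, payload, previous):
    """SHA-256 over a canonical JSON encoding of one step and its predecessor's digest."""
    body = json.dumps(
        {"description": description, "payload": payload, "previous": previous},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def add_step(chain, description, payload):
    """Append a processing step linked to the digest of the previous step."""
    previous = chain[-1]["digest"] if chain else ""
    chain.append({
        "description": description,
        "payload": payload,
        "previous": previous,
        "digest": _digest(description, payload, previous),
    })
    return chain


def verify(chain):
    """Return True only if every step's digest and back-link are intact."""
    previous = ""
    for step in chain:
        expected = _digest(step["description"], step["payload"], previous)
        if step["previous"] != previous or step["digest"] != expected:
            return False
        previous = step["digest"]
    return True


# Hypothetical pipeline: field collection, upload, tabulation.
chain = []
add_step(chain, "collected in field", {"responses": [3, 5, 4]})
add_step(chain, "uploaded to processing center", {"responses": [3, 5, 4]})
add_step(chain, "tabulated", {"mean": 4.0})
print(verify(chain))

chain[0]["payload"]["responses"][0] = 9  # simulate tampering with source data
print(verify(chain))
```

A chain like this detects after-the-fact alteration of any recorded step, but it does not by itself prove that the tabulation was computed correctly from the source, nor does it protect the chain from someone able to rewrite every digest; those gaps would need additional controls such as signatures or independent storage.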