Expanding Access to Research Data: Reconciling Risks and Opportunities

3
Benefits of Access

The United States, like all modern societies today, depends on complex data to develop legislation, design policies, and evaluate programs. Although aggregate data are widely available from many federal agencies, especially the U.S. Census Bureau, data in that form do not permit in-depth, multivariate analysis of the trends, antecedents, or possible consequences of social and other phenomena of interest. Such analyses require access to microdata, which permit the use of statistical models to study specific questions. As noted in Chapter 2, most of these analyses are done by outside researchers rather than the agencies. Analysts also need access to microdata in order to evaluate data quality, although some of this work is also done by the agencies themselves. Thus, access to microdata by outside researchers is critical to both substantive and methodological work.

This chapter discusses the role of data access in the scientific process, some of the specific ways access to research data has contributed to policy making, and the role of access in addressing the question of data quality. We begin with a brief consideration of individuals as the source of much data and the structure of the federal government as it relates to research and data collection.

DATA COLLECTION AND RESEARCH

Much of the data needed in a modern society comes from individuals. In many cases, people are willing to provide information because it results in direct financial or other benefit to them. In buying a home, for example, the detailed financial information from prospective buyers allows banks to determine what kind of mortgage they will offer, which, in turn, allows prospective buyers to know what they can afford. In this case, as in many others, high-quality information reduces transaction and other costs, which in turn results in lower prices for consumers. Instant access to personal financial data and streamlined credit-reporting systems have not only enhanced the ability of lenders to assess risk quickly, but also increased competition among lending institutions with the result that mortgage rates have been reduced significantly—by some estimates, as much as 2 percentage points—saving American consumers billions of dollars a year (see McCullagh, 2004). In other sectors, mechanisms such as frequent shopper cards and on-line credit applications have reduced the prices of groceries and Internet purchases. In the market context, people seem generally willing to accept the underlying rationale for surrendering a degree of privacy and confidentiality—and running the risk that their data will be misused—because the financial and other benefits are personal, immediate, and clear.

The benefits of data requested by governments are often less recognizable: they accrue to society as a whole, not just the data providers; they may take years to be realized in the form of new laws or programs; and they may be used in indirect and complex ways that are not obvious. The benefits of supplying information to a grocery store to save 50 cents on a can of tuna fish are transparent; the benefits of supplying data to a statistical agency that may contribute to improved research on retirement decisions that could, in turn, improve the functioning of pension or social security systems are not.
The lack of transparency in the value of personal data for societal purposes has two consequences: people may be reluctant to support (through taxes) government data collection, and there is evidence that people are increasingly reluctant to respond to government requests for information (see Chapter 4).

The United States has not only a decentralized federal statistical and data collection system (see National Research Council, 2005), but also a decentralized, pluralistic structure for basic and applied social science research and policy analysis (as well as other kinds of research). Most of that research, supported by federal grants and contracts, is carried out at universities, nonprofit research institutions, and for-profit research companies, although some federal agencies conduct considerable in-house research. State and local governments, private foundations, and corporate and individual donations are also sources of social science research support at universities and private research organizations, including advocacy and public interest groups with a variety of policy preferences and perspectives.

Federal statistical and other data collection agencies also carry out some research and analysis, but the largest share of their budgets is allocated to data collection and processing. In-house research is generally limited to descriptive studies, such as analyses of trends and group comparisons, along with significant methodological research to improve data quality and the effectiveness of data collection. Statistical agencies do few studies that have specific policy-related conclusions, although their work often relates to policy questions. One reason underlying this approach is that the agencies must avoid undercutting their credibility as a source of high-quality, objective information. As a matter of principle, substantive analyses by statistical agency staff should be relevant to public understanding and policy issues but “not take positions on policy options or be designed with any particular policy agenda in mind” (National Research Council, 2005:41).

Because the scope of research by statistical agencies is often narrowly focused, data access by other researchers is necessary to ensure that alternative methodologies and uses are fully explored to advance social science knowledge and the design and evaluation of public policies. Research access provides opportunities for disparate academic and policy communities to communicate and learn from one another; it also provides valuable information to statistical agencies about their data.

DATA ACCESS AND THE SCIENTIFIC PROCESS

Empirical science includes not only data collection and use, but also data access and sharing. Data access is especially central to the production of policy-relevant social science research in which microdata, which comprise detailed information about individual units (people, households, firms), and particularly longitudinal microdata, which comprise repeated observations on the same units, play an essential role.
A large portion of such data is collected in surveys conducted or funded by government agencies. Many of the raw data produced from these surveys contain individual and group identifiers along with sensitive information; the data are typically collected under a promise of confidentiality and are to be used only for research or statistical purposes.

Almost 20 years ago, a study by the Committee on National Statistics described the benefits of data sharing—some of which apply to data access more generally—and its essential role in science (National Research Council, 1985:9-16; see also Sieber, 1991). Data sharing promotes new research and allows for exploration of new questions without necessitating new data collection. Economies of scale are also created. The same datasets can be used for multiple purposes without substantial new investments: data gathered by researchers to answer one set of questions may be useful to others to answer another. Finding new ways to use existing data may also lessen the need for new collection efforts, which, in turn, reduces the burden on respondents. Sharing data can also lead to file linkages and the creation of new, more powerful datasets for examining public policy issues.

When government-funded research is used for decision making, data sharing allows for analysis of problems by investigators with diverse perspectives. Policy disputes related to interpretation are common, and, with wide dissemination of data to researchers, debate can be better informed. In contrast, much of the policy-related research that is commissioned by private interests is never published, so it cannot be corroborated or extended to new work.

When data are shared along with study results, the research community and data collection agencies can improve and hone their own data collection methods and analytic capabilities. Faulty techniques that might not otherwise have been acknowledged as such can be identified, and techniques that are effective can be promoted (we return to this important point below). Researcher access also makes it more likely that additional information about the statistical procedures that underlie the data, which might otherwise not be completely documented by agencies, is archived.

Perhaps most important, data sharing fosters an open research community and reinforces transparent scientific inquiry. Data sharing allows for verification, refutation, or refinement of original results. In this way, data access safeguards the scientific enterprise by ensuring that other scientists can replicate important findings (see Abowd and Lane, 2003). Although replication is not a common scientific activity in the social sciences, philosophers and historians of science (e.g., Kuhn, 1962) agree that it is an important one.
“It is by means of wide and complete disclosure, and the skeptical efforts to replicate novel research findings, that scientific communities collectively build bodies of ‘reliable knowledge’” (David, 2001:2). Moreover, there is a considerable amount of informal replication: when an investigator extends previous results, he or she may begin by trying to replicate the first finding. Replication acts as an important disciplinary device for both academic researchers and government statisticians. In addition, as is evident from news stories of research fraud, scientists have sometimes misrepresented the results of research by altering their data or reporting only some observations. Wide access to research data helps ensure that such misrepresentations, when they occur, will be identified by other researchers.

The U.S. National Science Foundation has required that data used in projects supported by its grants be placed in a publicly accessible archive. There are similar requirements at the U.S. National Institute of Justice and the Robert Wood Johnson Foundation. The National Institutes of Health have pursued policies to encourage data sharing for further analysis or replication studies in such areas as DNA sequencing, mapping information, and crystallographic coordinates (Soete and ter Weel, 2003). Some academic journals promote reproducibility of empirical findings for the articles they publish. However, many journals that have a policy of making the underlying data available (e.g., the American Economic Review) waive the requirement if any portions of the data are “restricted use,” which undermines the value of the policy.

Some important data sets produced by statistical agencies pose particularly difficult challenges of confidentiality protection: they are therefore accessible only in a restricted access mode—a secure research data center, a monitored remote access arrangement, or through a licensing agreement (see Chapter 2). If scientific replication is to be encouraged, application procedures to use confidential data need to be as streamlined as possible so that researchers, including graduate students and junior researchers, are not discouraged from applying by the length of time and amount of resources required for review (see Chapter 5).

MICRODATA FOR POLICY-RELEVANT RESEARCH

The nation faces a range of complex policy issues, including the provision and funding of health care, education standards, retirement income security, and savings and consumption behavior. Equally complex policy issues are posed by such economic changes as increasing globalization of trade and shifts in the relative importance of various industry sectors. In turn, these changes have widespread ramifications for employment opportunities and income security. Addressing policy issues in these areas requires increasingly sophisticated behavioral modeling, which, in turn, requires detailed microdata, particularly longitudinal microdata. The more detailed the data, the more utility they have for research.
For example, the inclusion of geographic details—such as state, county, or city of residence—in microdata sets from large national probability sample surveys would permit modeling disparities in health, economic, and other outcomes that vary significantly across geographic areas. Similarly, the inclusion of contextual variables for cities and neighborhoods in microdata sets from national surveys would permit analyses of many policy-relevant issues. Some research data centers (RDCs) do currently offer access to geographic detail for selected surveys: for example, the National Center for Health Statistics includes a version of the microdata from the National Health Interview Survey with state and county identifiers at its RDC, and researchers can make special arrangements at the University of Michigan to use microdata from the Panel Study of Income Dynamics with contextual neighborhood-level variables derived from census and other data.

Detailed microdata permit in-depth analyses of socioeconomic trends and their antecedents and consequences. Such analyses require multivariate behavioral modeling, which cannot be effectively undertaken with aggregate data. For example, it has become increasingly clear over recent decades that analysis of aggregate statistics does not give policy makers an accurate view of the functioning of the economy (Abowd and Lane, 2003). Indeed, the creative turbulence that is a hallmark of the U.S. economy and a major contributor to its success is not apparent from macrolevel indicators. Analysis of microdata has revealed how flux in labor markets factors into job creation (Haltiwanger, David, and Schuh, 1996) and how widespread reallocation of factors of production (e.g., workers) from one firm to another firm in and across narrowly defined industries is a major contributor to U.S. productivity growth—more important than investment in equipment and structures (Foster, Haltiwanger, and Krizan, 2001).

Detailed microdata are also needed for modeling economic decisions and other kinds of social behavior. Indeed, research is expanding into areas that were relatively untapped even a few years ago. For example, research using attitudinal information in combination with socioeconomic data about individuals and families to model savings behavior—although anticipated 50 years ago by Klein and Goldberger (1955), who used data on consumer sentiment to forecast consumption—has become a vibrant field in recent years with the expanded collection of microdata and their increasing availability to researchers.

A number of applied microdata examples relate to one prominent public policy issue—population aging (see Woodbury et al., 1999). The issue is at the forefront of public attention because of the changing demographic structure of the U.S. population and the budget pressures that face Medicare and Social Security.
To develop policies for an aging population in an informed manner, researchers must assess such trends as increasing life expectancy, changing retirement and savings patterns, changes in pension plans, and declines in employer-provided health insurance coverage. Rapidly changing medical technologies and increasing costs of care add further complexity to the analysis. Microdata for individuals and families can be used to simulate outcomes under different possible policies and to estimate costs and benefits associated with various policy options (see National Research Council, 1991, 1997). Data from the Health and Retirement Study (HRS), for example, have been instrumental in answering such questions as how Social Security benefits interact with pensions and savings in household efforts to finance retirement, how social security age eligibility requirements affect retirement rates and timing, and how changes in out-of-pocket medical expenses affect the use of federal programs.1

1   For a full bibliography of research using the HRS data, see the website for the survey at the University of Michigan: hrsonline.isr.umich.edu/papers/sho_papers.php?hfyle=bib_all [May 2005].
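As a rough illustration of the kind of policy simulation described above, the following Python sketch applies two versions of a benefit rule to a few weighted household records and compares total cost and the number of recipient households. Everything in it is an assumption made for illustration: the records, the survey weights, the `benefit` rule, and the `threshold` and `rate` parameters are hypothetical and are not drawn from the HRS or from any actual program.

```python
def benefit(income, threshold, rate):
    """Hypothetical rule: pay `rate` times the shortfall of income below `threshold`."""
    return max(0.0, threshold - income) * rate


def simulate(households, threshold, rate):
    """Return (weighted total cost, weighted number of recipient households)."""
    total_cost = 0.0
    recipients = 0.0
    for household in households:
        payment = benefit(household["income"], threshold, rate)
        # Each sample record stands in for `weight` households in the population.
        total_cost += payment * household["weight"]
        if payment > 0:
            recipients += household["weight"]
    return total_cost, recipients


# Made-up microdata: one record per sampled household.
households = [
    {"income": 8000, "weight": 100},
    {"income": 15000, "weight": 200},
    {"income": 30000, "weight": 150},
]

baseline = simulate(households, threshold=10000, rate=0.5)
reform = simulate(households, threshold=20000, rate=0.5)

print("baseline cost, recipients:", baseline)
print("reform cost, recipients:  ", reform)
print("added cost of reform:     ", reform[0] - baseline[0])
```

Because the rule is applied record by record, the same run also shows which households gain and by how much, a breakdown that aggregate totals alone could not provide.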

Another example of microdata needs can be found in research on pollution abatement. In this case, and many others like it, use of aggregate data leads to biased estimates of relationships among variables because different firms in an industry respond to regulations in different ways. Moreover, when aggregated, industry responses are weighted to represent the universe of firms at a given time. As that universe changes as a result of the entry and exit of firms, the assigned weights will no longer be correct and neither will be the analyses based on them (see Abowd and Lane, 2004). Microdata also allow for a much broader range of analyses than do aggregate data: for example, examining relationships among variables for individual classes of firms in an industry or industries (McGuckin, 1995).

The expansion of research on the human dimensions of environmental change is another. Researchers increasingly include individual-level contextual variables in their models—the schools respondents attend, the neighborhoods they live in, the firms they work for, and the people with whom they interact. Linking data on people and their environments—including biological and spatial data—is at the very core of this kind of research (Rindfuss, 2002).

More generally, the increasing complexity of social and economic activity requires data that can be used to separate out demographic interactions and economic and ecological effects. A key characteristic of microdata is that they allow the marginal effects of key variables to be isolated, adjusting for other factors. Research into the relationship between household income and Medicare expenditures is an example: interestingly, the results of recent studies have been mixed.
Work by Battacharya and Lakdawalla (2003), which uses household-level data from the Medicare Current Beneficiary Survey linked to Medicare claims records, does not find a positive association between income and Medicare use. In contrast, McClellan and Skinner (2004), using insurance claims and data from the census and the Panel Study of Income Dynamics (PSID), find that households in high-income neighborhoods pay more in Medicare taxes but receive more in benefits. It will take more studies, possibly with other microdata sets, to answer the important policy question about redistribution from poor to well-to-do households through the Medicare system.

The microdata-based literature on Medicare shows other interesting relationships. For instance, a small proportion of the elderly population accounts for a very large proportion of Medicare expenditures, and those who account for a high proportion of expenditures in one year are likely to be above-average users in subsequent and preceding years (Garber, MaCurdy, and McClellan, 1998). The importance of the Medicare program, and research about it, is difficult to overstate: total disbursements in 2002 were $265.7 billion, and its costs are growing faster than those of Social Security. Understanding Medicare program use, and its correlation with income and health, is critical to understanding its current effects and to making projections and policy recommendations for the future.

LONGITUDINAL MICRODATA

For many research applications, it is desirable to analyze not just microdata, but also longitudinal microdata—repeated observations on the same units over time. For example, decisions by individuals and firms that affect retirement behavior and benefits occur over long periods of time, requiring microdata sets that follow people through their working lives (National Research Council, 1997:70-71). Similarly, understanding the cumulative effects of racial or gender discrimination on employment, health, and other outcomes requires longitudinal data on generations of individuals and families (National Research Council, 2004a:Ch. 11).

Longitudinal data contribute to high-quality research in at least two distinct ways. First, such data allow more accurate estimation than is typically possible with a single cross-sectional survey of such information as transitions between states (for example, a household’s income falling below the poverty line), durations in a particular state, and changes in variables of interest. Because shorter recall periods tend to result in more accurate reporting of retrospective information, collecting information each year about the past year’s activities will produce more accurate data than asking for, say, a 10-year history (Bound, Brown, and Mathiowetz, 2001). Second, longitudinal data allow a researcher to control for the role of unobserved characteristics in explaining variation in outcomes among individuals, so long as the unobserved characteristics are relatively stable for individuals over time (Brown, 2003).

Longitudinal data generally derive from three sources: surveys, administrative records, and policy experiments.
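The second contribution of longitudinal data noted above, control for stable unobserved characteristics, can be illustrated with a small simulation. The Python sketch below is one common way to exploit repeated observations (subtracting each unit's own averages, often called the "within" or fixed-effects transformation); the chapter does not name a particular estimator, so this choice is an assumption. The data are simulated, and the function names, sample sizes, and coefficients are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope from a least-squares fit of y on a single x with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_panel(n_units=500, n_waves=4, beta=1.0, seed=0):
    """Simulated panel: each unit has a stable trait the analyst never observes."""
    rng = random.Random(seed)
    rows = []
    for unit in range(n_units):
        trait = rng.gauss(0, 1)  # unobserved and constant over time
        for _ in range(n_waves):
            x = trait + rng.gauss(0, 1)  # x is correlated with the trait
            y = beta * x + 2.0 * trait + rng.gauss(0, 0.5)
            rows.append((unit, x, y))
    return rows


def within_slope(rows):
    """Slope after subtracting each unit's own means from x and y."""
    by_unit = {}
    for unit, x, y in rows:
        by_unit.setdefault(unit, []).append((x, y))
    dx, dy = [], []
    for obs in by_unit.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            dx.append(x - mean_x)
            dy.append(y - mean_y)
    return ols_slope(dx, dy)


rows = simulate_panel()
pooled = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])
within = within_slope(rows)
print(f"pooled slope, ignoring the panel structure: {pooled:.2f}")
print(f"within slope, using repeated observations:  {within:.2f}")
```

Under these simulation assumptions the true effect is 1. The pooled estimate absorbs the influence of the unobserved trait and should land near 2 in large samples, while the within estimate should land near 1. The correction works only because the trait is constant over time, which mirrors the condition stated in the text.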
Experiments in such areas as welfare reform usually combine data from surveys and administrative records for baseline information around the time of the intervention with follow-up surveys and administrative records at a later time to assess the effects of the intervention (Brown, 2003; National Research Council, 2001b).

Longitudinal data collections are major investments, and their improvement over time is an important goal to which widespread access can contribute. For example, timely researcher access has led to identification of ways to improve economic measures in the HRS, such as the linking of asset values with income from assets (Hurd, Juster, and Smith, 2003). This progress can be contrasted with the lack of progress in the measurement of pension entitlements: those data are complex but, more important, less available. It has taken many years for researchers to understand their weaknesses, in part because few of them have had access to the data and the chance to become familiar with them (see National Research Council, 1997:97-101).

Nearly anything that is done to facilitate the use of longitudinal data for social science research also has the potential to contribute to informed policy discussion. The list of topics to which longitudinal microdata have been applied is a long one: welfare reform, job training, unemployment insurance, preschool programs, retirement, employer-provided health insurance, policies affecting the disabled, K-12 educational reform, occupational safety, and tax policy. However, because of the time it takes to gather and then analyze the data, they are not always available when policy makers want them.

For policy makers, the best-case scenario may appear to be one in which there is a brief policy intervention, the effects of interest are short run, and the data needed for evaluation are contained in routinely collected administrative records. The experiments that preceded the 1996 welfare reform legislation—which replaced Aid to Families with Dependent Children with the Temporary Assistance for Needy Families (TANF) program—are a good example of this happy coincidence. The main effect that the experiments were designed to assess was the extent to which, and how quickly, the prompt provision of job search assistance would move recipients off the welfare rolls; administrative record-keeping for the experimental programs went a long way toward providing the data needed for this assessment. Yet longer term effects of welfare reform, such as the extent to which former welfare recipients hold jobs for 2 or more years, are also of interest, and these kinds of assessments require longitudinal microdata (see National Research Council, 2001b).
Moreover, the early results of experiments can be misleading, so that an adequate period of measurement is needed before effects can be confidently measured. For example, early analysis of the effects of the Seattle and Denver income maintenance experiments indicated that income support leads to marriage dissolution (Groenveld, Tuma, and Hannan, 1980). Subsequent analysis, however, suggested that long-term stability in marriage was enhanced by income support: what had been captured by the data at the beginning of the experiment was a one-time enabling of divorce for women who had no other source of financial support (Cain and Wissoker, 1990). Longitudinal data can also contribute to policy analyses when a program continues for a long enough time so that evaluations of its early incarnations can be informative in guiding subsequent decisions. Head Start is an example. The earliest cohorts of Head Start participants have reached adulthood, and later cohorts have progressed far enough in school so that the medium-term effects on achievement can be assessed.

Nevertheless, because Head Start was not designed as an experiment using randomly assigned control groups, the program evaluations have been less clear-cut than they might have been if a field experiment had been conducted.2

A third situation in which longitudinal data have timely policy relevance is when an ongoing longitudinal data program contains the information needed to address a current policy question. For example, the PSID was originally created in 1968 to study poverty-related problems. It fulfilled that goal well in its early years, and in its more than three decades of existence it has also been the basis for very informative work on duration of spells of welfare receipt (O’Neill et al., 1984; Hoynes and MaCurdy, 1994; Boisjoly, Harris, and Duncan, 1998; Duncan, 2000). The data are now beginning to be used to assess the effects of more recent welfare reforms. In 1979 the National Longitudinal Survey of Youth (NLSY) was begun, and it and the PSID extended data collection to the children of original PSID respondents. Both data sets have been able to shed light on the consequences for children of poverty, welfare receipt, and maternal employment. Similarly, the HRS, having continued long enough for its initial cohort to reach retirement age, has become the data set of choice for many policy discussions related to retirement.

LINKING SURVEY AND ADMINISTRATIVE DATA

Thirty-five years ago, interagency agreements permitted the linkage of some microdata from the Internal Revenue Service (IRS), the Social Security Administration (SSA), and the Census Bureau’s Current Population Survey (CPS). A publicly available 1973 CPS-SSA-IRS exact-match file was the basis for a major dynamic microsimulation model of social welfare policies and retirement income and was also used to analyze the quality of income reporting in the March CPS.
A 1978 CPS-SSA exact-match file was the basis for another microsimulation model of retirement income, although that file was not made publicly available. Because linked data present challenges for minimizing the likelihood of re-identifying individuals, concerns about increasing nonresponse rates to government surveys and, subsequently, legislation (e.g., the 1976 Tax Reform Act, P.L. 94-455) led agencies to curtail the development of linked microdata for public use. The linkages that were performed (for example, of March CPS files with limited tax return information) were for internal agency use only (see National Research Council, 1991:66-68, 134-135).

2   In general, it is easier to mount evaluations with random assignment for experimental programs than for ongoing ones. For Head Start, if program administrators were to define priority classes of applicants and select randomly when they cannot serve everyone in the highest priority class, this difficulty could be overcome. See also the report of an evaluation of the High/Scope Perry Preschool Project, which did use random assignment and found substantial positive effects after an interval of some 35 years (Schweinhart, 2004).

Linking survey and administrative microdata can create datasets that facilitate a broad spectrum of research relevant to complex policy questions. For example, by linking Wisconsin income tax records, Social Security earnings and benefits records, and probate records, Menchik and David (1983) determined that prospective Social Security benefits did not have a perceptible effect on lifetime wealth accumulation. Linkages such as those of HRS with Social Security records introduce detail to the data that are particularly constructive for modeling savings incentives, retirement decisions, and other dynamic economic behavior. In this case, the potential research benefits of linking were sufficiently large that the Social Security Administration approved the link with minimal controversy (the linked data are available through special access arrangements at the University of Michigan).

Another example that illustrates the benefits of data linkage is research to investigate the effects of community context on child development and socialization patterns and the effects of the availability of child care on parents’ work decisions (Gordon, 1999). By having access to an NLSY file with detailed geographic codes for survey respondents, Gordon was able to add contextual data—such as the availability of child care—for the neighborhoods in which respondents lived. Gordon’s application highlights the tradeoff between data precision and disclosure risks.
Access to census tract-level geocoding permitted more sensitive construction of community and child care variables central to the study; however, it also increased the identifiability of individual NLSY records.3

3 It is important to keep in mind the distinction between identifiability of individual records and the risk of re-identification. The first is a statistical matter; the second assumes, in addition, intruder motivation (see Chapter 4).

The new Medicare drug benefit provides another example in which data linkage might contribute significantly to policy-relevant research and modeling. Policy makers would like to know the effects of the legislation: who gains and by how much and how the benefit changes drug consumption, retirement decisions, and other behaviors. To observe and model these responses, individual survey data and linked Medicare data are needed on the same people both before and after the policy change.

Data linkage also has the potential to reduce data collection costs, both direct costs and the cost of respondent burden. Linking existing information from different surveys or a combination of survey and administrative data, as in the continuing Medical Expenditure Panel Survey (MEPS), can streamline the data collection process by reducing the need to duplicate surveys. If survey designers know that links to administrative data can be made, they can limit the length of questionnaires as well. Yet statistical agencies have not made extensive use of linkages of administrative records and survey data in household surveys. They have made much more extensive use of administrative records, such as tax records, for business data collection.

Another benefit from linkages of survey data and administrative records is that they can improve data accuracy and scope by giving researchers access to information that individuals may not be able to recall or estimate accurately. For example, the lifetime earnings data in the Social Security files that are linked to the HRS are virtually impossible for respondents to recall or even find in their own records. Similarly, respondents may not have immediate access to information in medical records or have the technical knowledge to answer some questions. Yet administrative data contain their own errors, such as omissions and duplications, and they may use different concepts and cover different populations than surveys. Thus, although there are many advantages of linked data, care is needed in making linkages and in using the linked data.

In addition to their role in complementing survey data, administrative records can be used to estimate the measurement errors in survey reports, an idea that dates back at least to Ferber (1966) and the work of the Inter-University Consortium for Savings Research. For example, for the Survey of Income and Program Participation (SIPP), a census of federal and state administrative records was taken in 1983-1984 for four states to ascertain the validity of reporting for eight income maintenance programs (Marquis and Moore, 1990).
Analysis of survey reports and administrative data for the Food Stamp Program determined that systematic differences in reporting biased the relationships derived from the SIPP sample and that error-prone respondents were more likely to drop out of the SIPP panels.

Another way in which linkage increases data utility is by making it possible to get more research mileage from isolated datasets that would otherwise have limited application. In addition to the above-noted value in reducing data collection redundancies and improving data accuracy in a cost-effective manner, linking information sources may provide increased flexibility to meet future research needs with existing datasets. For example, if SIPP records were linked to administrative tax returns or wage and salary data from state unemployment insurance records, the income data would be more accurate, the cost of the survey would decrease, and the utility of the data for research and policy analysis would increase.
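
The use of administrative records to gauge measurement error in survey reports, described above, can be illustrated with a minimal sketch. Everything in it is hypothetical: the person identifiers, field names, and values are invented for illustration and are not drawn from SIPP or any agency file. The sketch also treats the administrative record as the benchmark, which is itself an assumption, since administrative data contain their own errors.

```python
from statistics import mean

# Hypothetical survey reports, keyed by an invented person identifier.
survey = {
    "p1": {"reported_participation": True, "reported_benefit": 120.0},
    "p2": {"reported_participation": False, "reported_benefit": 0.0},
    "p3": {"reported_participation": True, "reported_benefit": 90.0},
    "p4": {"reported_participation": False, "reported_benefit": 0.0},
}

# Hypothetical administrative records for the same program.
admin = {
    "p1": {"participation": True, "benefit": 100.0},
    "p2": {"participation": True, "benefit": 80.0},
    "p3": {"participation": True, "benefit": 90.0},
    "p5": {"participation": True, "benefit": 60.0},
}


def link_records(survey, admin):
    """Exact-match link on the shared identifier; unmatched records are dropped."""
    matched_ids = survey.keys() & admin.keys()
    return {pid: {**survey[pid], **admin[pid]} for pid in matched_ids}


def reporting_error_summary(linked):
    """Compare survey reports with administrative values for linked records."""
    # Participants according to the administrative file (assumed benchmark).
    participants = [r for r in linked.values() if r["participation"]]
    # Participants whose survey report says they did not participate.
    missed = [r for r in participants if not r["reported_participation"]]
    # Reported minus administrative benefit amount, per linked record.
    errors = [r["reported_benefit"] - r["benefit"] for r in linked.values()]
    return {
        "matched": len(linked),
        "underreport_rate": len(missed) / len(participants) if participants else None,
        "mean_benefit_error": mean(errors) if errors else None,
    }


if __name__ == "__main__":
    linked = link_records(survey, admin)
    print(reporting_error_summary(linked))
```

Records that appear in only one source (here, the invented `p4` and `p5`) fall out of the comparison. In practice, such unmatched cases would need separate attention, because they may reflect the coverage differences between surveys and administrative files noted earlier.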

The benefits of linked data extend beyond social science research. Linking records is essential for high-quality studies of effectiveness in many areas of medical research because it permits the analysis of data already available. For example, linking together emergency medical service (EMS) data, hospital records, and death registries allows researchers to follow patients through the pre-hospital, hospital, and post-discharge stages (National Highway Traffic Safety Administration, 2001).

Data linkage also facilitates research on infrequent events, such as rare diseases that affect only a small percentage of the population. In such cases, working from general sample data does not provide adequate sample sizes for target groups, and population-based data, which are very expensive to collect, are often required. Linkage can, in some instances, provide a much less costly substitute.

ACCESS AND DATA QUALITY

Researchers’ access to and use of the complex data collected by federal statistical agencies are essential to maintain and improve data quality (Abowd and Lane, 2004). The findings derived from such analyses undergo reexamination and reinvigoration when disseminated to the research community. Researchers’ use of government data creates an effective feedback loop by revealing data quality and processing problems, as well as new data needs, which can spur statistical agencies to improve their operations and make their data more relevant. The use of data by outside researchers can also verify or improve sampling frames for surveys and censuses. Data access and use by a variety of researchers, for diverse purposes, provide a range of feedback to data collection agencies. Agencies may also be able to generate new data products when they combine existing records with other information from research.
The relationship between data use and data quality is the essential foundation for the common interest of the statistical system and the wider research community in broad and responsible access to data. That relationship is well recognized by such agencies as the Census Bureau. The agency’s Center for Economic Studies web page states: “Exposing to the light of research the conceptual and processing assumptions that are embedded in the Census Bureau’s micro databases constitutes a core element in the Census Bureau’s commitment to quality” (see mission statement at ces.census.gov). This recognition is not new. McGuckin (1992) argued more than a decade ago that coordinated research efforts between in-house and outside researchers offer the best model for ensuring that agencies maximize the benefits from data users. In fact, McGuckin (1992:19) argued that it is a primary responsibility of statistical agencies to facilitate researcher access to confidential microdata files. Such access, by improving the microdata for research and policy analysis, also improves the quality and usefulness of the aggregate statistics on trends and distributions that are the bread and butter of statistical agency output. Thus, the benefits from research access to complex microdata accrue not only to policy makers, but also, and importantly, to the statistical agencies themselves.

There is growing appreciation for the point of view that the largest single improvement that the U.S. statistical system could make is to enhance the capabilities for analysis of statistical data by researchers inside and outside of government, which, in turn, would enable statistical agencies to better understand and improve their data (see Abowd and Lane, 2003).

There are many examples of the synergy that can be created from large numbers of researchers using the same datasets, which allows for corroboration of results and an accumulation of the benefits of knowledge. In a workshop featuring results from Wave 1 of the HRS, for example, much was learned about what had gone right and wrong (Journal of Human Resources, 1995). On a practical level, multiple users assure an increased return on the investment in expensive data collection projects.

Although statistical agencies expend substantial resources to ensure that they produce the best possible product, there is no substitute for actual research use of microdata to identify data anomalies. Indeed, there is general recognition of the direct correlation between the quality of a statistical agency’s data and its openness to external research. A variety of studies offer evidence that the U.S. statistical system now collects more relevant and higher quality statistics as a result of disclosing both the survey instruments and the data to outside researchers (e.g., Levitan and Gallo, 1990; McGuckin and Nguyen, 1990; Taeuber, 1981; Triplett, 1991; see also Soete and ter Weel, 2003).

By increasing access to their data, statistical and other data collection agencies will almost certainly improve both the quality of the data and their usefulness for research and policy analysis. Such improvement will, in turn, increase the value of the investments in collecting, processing, and maintaining the data.