5 Data Collection and Processing

In this chapter we consider the operational aspects of SIPP: how the data are collected and processed. Survey operations, as distinct from design, evaluation, and analysis, represent by far the largest component of total survey costs. Moreover, the care and efficiency with which a survey is operated directly and substantially affect the quality and timeliness of the data. Hence, no assessment of SIPP would be complete without a review of SIPP's present operations and the Census Bureau's plans for future changes.

¹ Our evaluation of SIPP operations and plans for future improvements is greatly indebted to the work of panel members Martin David and Randall Olsen and to Carol Sheets, head of the data processing staff for the National Longitudinal Surveys of Labor Market Experience (NLS) at Ohio State University. These people visited the Census Bureau headquarters twice and developed a background paper with assessments and recommendations; Dr. Olsen also visited the Census Bureau's Chicago regional office. They and the entire panel are very appreciative of the wholehearted cooperation of Census Bureau staff during the site visits and in response to requests of the panel for information about SIPP operations.

CURRENT OPERATIONS

Table 5-1 shows a rough distribution of SIPP costs by function as a reference for our discussion of SIPP interviewing and data processing operations. Field costs that are associated with the interviewing staff (travel and communications, payments to interviewers, and training) amount to 41 percent of the total. Data processing costs, of which about two-thirds are
associated with the regional offices and one-third with headquarters, amount to another 27 percent of the total annual expenditure of $31 million for SIPP.

THE SURVEY OF INCOME AND PROGRAM PARTICIPATION

TABLE 5-1 SIPP Budget, by Major Function, Fiscal 1992 (in percent)

Function                                      Component   Percent of Budget
Sample design and selection                                      6.0
Questionnaire development and materials                          3.0
Field                                                           40.8
  Travel and communications                       7.2
  Payments to interviewers (420 people)          29.4
  Training                                        4.2
Data processing                                                 27.2
  Regional office data entry (keying)             4.2
  Regional office clerical operations            12.0
  Other regional office activities                3.0
  Data processing (headquarters)                  8.0
Research and evaluation                                         10.0
Data analysis, reports, and printing                             8.0
Data dissemination                                               3.0
Administration                                                   2.0
Total costs                                                    100.0 ($31.0 million)

NOTE: Distribution of costs is before institution of maximum telephone interviewing in February 1992.
SOURCE: Estimates from Census Bureau staff.

Interviewing

From the beginning, SIPP was expected to involve labor-intensive interviewing procedures in order to obtain high-quality responses to detailed questions on complex aspects of households' socioeconomic status and well-being. For the 1984-1990 panels, face-to-face interviewing has been the preferred mode and the one used in most cases. Telephone interviewing has been permitted to follow up for information not obtained in face-to-face interviews, to interview people who would not or could not participate otherwise, and to interview sample people who moved to locations more than 100 miles from a SIPP primary sampling unit area. For the 1984 and 1985 panels, about 5-6 percent of interviews were conducted by telephone, with the proportion increasing from the second through the final wave of each panel (Jabine, King, and Petroni, 1990:20).

SIPP interviewers collect information from respondents using paper-and-pencil techniques. At each visit, the interviewer updates a large control card (containing basic demographic characteristics of household members, housing structure characteristics, telephone numbers, and some other items) and completes a bulky questionnaire for each adult aged 15 and older, using numerous flash cards to aid respondents. The questionnaire differs across waves because of the inclusion of different topical modules; the wave 1 questionnaire also differs from all other waves because of the use of dependent interviewing for many items after wave 1 (i.e., reminding respondents of their answers in the prior wave and updating the information rather than asking each question afresh).

The questionnaires are highly structured, with complex skip patterns and a good deal of redundancy as a way of jogging respondents' memories and providing a basis to check for inconsistencies or impute missing information. Interviewers must transcribe many items, either during the course of the interview or prior to the interview (to capture needed information from the previous wave). For example, each income source mentioned in the recipiency section must be coded onto the income source summary at the back of the questionnaire. For each code on that summary, the interviewer asks questions about income amounts received during the 4-month reference period.

Despite the magnitude and complexity of the task, interviewing in SIPP has proceeded quite smoothly. At the outset of the program, there were fears that the interviewers (and respondents) could not cope well with such a long and involved survey. Indeed, the turnover rate for interviewers was initially high (32 percent in fiscal 1986), but in fiscal 1988 the rate was down to 18.5 percent, in comparison with 20-25 percent for other major surveys conducted by the Census Bureau (Jabine, King, and Petroni, 1990:24).
Interviewers have also become more experienced: in fiscal 1986 only about 33 percent had 3 or more years of survey experience; in fiscal 1988 almost 60 percent of the interviewers had that much experience. Continuous training is provided to SIPP interviewers, and their work is monitored in several ways (e.g., by personal observation and reinterview). Often their reactions are sought about the success of one or another experiment and about proposed changes in procedures (e.g., greater use of telephone interviewing).

Although the SIPP interviewers are highly professional in their work, it is also evident that the answers they elicit from respondents are often flawed (see Chapter 3). It appears likely that the structure of the questionnaire contributes to such data quality problems in SIPP as underreporting of asset income and confusion among program names. Also, paper-and-pencil techniques with such a long, involved questionnaire inevitably lead to inefficiencies and introduce opportunities for interviewer as well as respondent errors (e.g., transcribing errors and mistakes in following the skip patterns).

Recently, the Census Bureau decided to switch to a mode of maximum
telephone interviewing as a cost-cutting measure. Beginning in February 1992, waves 1, 2, and 6 of each SIPP panel are to be conducted as before by face-to-face interviewing to the extent feasible; however, the remaining waves are to be conducted by telephone, again to the maximum extent feasible. The telephoning and personal visits will be carried out by the same interviewers using the same questionnaire, with the interviewers making phone calls from their homes.

The Census Bureau conservatively expects to save about $500,000 per year from the switch (roughly 4 percent of total costs associated with interviewing; see Table 5-1), due to reductions in travel costs and the time of interviewers.² The plan is to use the savings to improve SIPP's data products and dissemination program. The Bureau hopes that there will be little loss of data quality.³ Experiments conducted with maximum telephone interviewing in 1985-1986 found relatively few differences in nonresponse rates and analytical measures between the experimental and control groups (Gbur and Petroni, 1989; Gbur, Cantwell, and Petroni, 1990). However, there was some evidence, particularly for blacks, that maximum telephone interviewing produced lower estimates of the poverty rate and other measures related to low income and receipt of means-tested program benefits. Also, the experiments covered only two successive waves, so no information is available on mode differences over a longer period.

Regional Office Operations

The Census Bureau's 12 regional offices play an important role in processing SIPP data. Clerks check the completed questionnaires mailed in by the SIPP interviewers for errors and omissions and assign geographic codes for sample people who moved. Data entry clerks key the information from the questionnaires, using software that checks for the presence of identifiers and selected control card data items.
Batches of keyed questionnaires are verified, and data files for accepted batches are transmitted electronically to Census Bureau headquarters in Suitland, Maryland. Quarterly reports on verification results indicate that error levels in the keying operations are very low (Jabine, King, and Petroni, 1990:81).

² Interviewers are paid by the hour. In order not to reduce the pay of interviewers already on the staff, the Census Bureau planned to hire fewer new interviewers than would otherwise be needed for the 1992 panel, which has a larger initial sample size than the 1991 panel.

³ The program to improve the data products is in the formative stages, and so there is a lack of available detail. This makes it impossible to determine whether the $500,000 will be too much, too little, or about right to support these future changes. Likewise, it is impossible to determine whether these future improvements justify the possible risk to data collection of maximum telephone interviewing.
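
As a rough check on the "roughly 4 percent" figure cited for the expected savings, the following Python sketch recomputes the savings as a share of field (interviewing) costs, using only the figures quoted in Table 5-1 and in the text. The variable and function names are illustrative assumptions and do not correspond to any Census Bureau system.

```python
# Figures quoted in Table 5-1 and the surrounding text.
# All names below are illustrative, not taken from the source.
TOTAL_BUDGET = 31.0e6        # total SIPP budget, fiscal 1992 (dollars)
FIELD_SHARE = 0.408          # share of the budget spent on field costs
EXPECTED_SAVINGS = 500_000   # expected annual savings (dollars)


def savings_share(total_budget, field_share, savings):
    """Return the savings as a fraction of field (interviewing) costs."""
    field_costs = total_budget * field_share
    return savings / field_costs


field_costs = TOTAL_BUDGET * FIELD_SHARE
share = savings_share(TOTAL_BUDGET, FIELD_SHARE, EXPECTED_SAVINGS)

print(f"Field costs: ${field_costs / 1e6:.2f} million")
print(f"Savings as a share of field costs: {share:.1%}")
```

Under these inputs the savings come to just under 4 percent of field costs, which is consistent with the rounded figure given in the text.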
Errors in the data that are diagnosed in Suitland are returned to the regional offices for correction, if possible. Regional editors are given little latitude to use judgment or knowledge of the case to edit problematic cases. Calls to interviewers to resolve problems are rare, and follow-up calls to respondents even more so.

Home Office Operations

The Census Bureau's home office in Suitland, Maryland, handles all subsequent editing and preparation of SIPP data files, with the exception of coding of industry and occupation, which is accomplished at the Bureau's processing facility in Jeffersonville, Indiana. Data for each wave of each panel are processed separately. Steps in data preparation include (see Jabine, King, and Petroni, 1990:80-81):

· checking each file transmitted from the regional offices to ensure that all expected cases, both interviews and noninterviews, are received;
· transmitting keyed verbal descriptions of occupation and industry to the Jeffersonville facility for coding;
· imputing data for noninterviewed people in interviewed households (Type Z nonresponse);
· performing extensive consistency edits within and between sections of the questionnaire, between the control card and the questionnaire, and among responses for people in the same family and household;
· performing extensive sets of edits and imputations on each section of the questionnaire, including topical modules, to ensure that responses appear when they should and to impute missing values;⁴
· developing recodes based on combinations of data items to add to the data records;
· checking the accuracy of geographic codes;
· imputing an estimated household size for households that moved and could not be located, to use in the calculation of weights for movers;
· calculating cross-sectional weights for each month in the wave; and
· reformatting records and altering some data items to protect confidentiality as input to microdata files that are suitable for public release.

Later, after all waves of a panel are processed, the data for selected items are further edited for consistency over time, longitudinal weights are developed, and a public-use longitudinal file constructed. Changes due to longitudinal edits are not carried over to the cross-sectional wave files.

⁴ When edit programs diagnose a problem, that problem is resolved mechanically. While operationally efficient, in some cases this approach may degrade data quality.
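
Two of the data preparation steps listed above can be illustrated with a minimal Python sketch: identifying Type Z nonrespondents (noninterviewed people in interviewed households) and flagging one inconsistency between the control card and the questionnaire. The record layout, field names, and sample data are hypothetical, and the sketch only flags cases; it does not attempt to reproduce the Census Bureau's actual edit or imputation rules.

```python
from dataclasses import dataclass

# Questionnaires are completed for household members aged 15 and older.
ELIGIBLE_AGE = 15


@dataclass
class Person:
    # Hypothetical record layout, for illustration only.
    person_id: str
    household_id: str
    age: int            # age as recorded on the control card
    interviewed: bool   # True if a questionnaire was completed


def find_type_z(people):
    """Return eligible noninterviewed people in interviewed households."""
    interviewed_households = {p.household_id for p in people if p.interviewed}
    return [
        p for p in people
        if p.age >= ELIGIBLE_AGE
        and not p.interviewed
        and p.household_id in interviewed_households
    ]


def find_age_conflicts(people):
    """Flag people with a questionnaire whose control-card age is under 15."""
    return [p for p in people if p.interviewed and p.age < ELIGIBLE_AGE]


people = [
    Person("101", "H1", 42, True),
    Person("102", "H1", 39, False),   # eligible, not interviewed: Type Z
    Person("103", "H1", 12, False),   # child, no questionnaire expected
    Person("201", "H2", 67, False),   # whole household not interviewed
    Person("301", "H3", 14, True),    # questionnaire conflicts with age
]

type_z_ids = [p.person_id for p in find_type_z(people)]
conflict_ids = [p.person_id for p in find_age_conflicts(people)]
print("Type Z cases:", type_z_ids)
print("Control card conflicts:", conflict_ids)
```

Keeping the flagging step separate from any resolution step makes it possible to review flagged cases before they are resolved mechanically, which bears on the data quality concern raised in footnote 4.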
Although SIPP home office processing operations have settled down and are currently running relatively smoothly, it is no overstatement to say that data processing at Suitland has been the Achilles heel of the SIPP program. When SIPP began in a great rush, there was no time to evaluate the processing system that had been used for the Income Survey Development Program (ISDP) or think through the computing requirements for a continuing longitudinal survey of the size and scope of SIPP. The Census Bureau modified a system developed for the Current Population Survey (CPS) to use for SIPP, which treated each wave of each panel as a separate cross-section and was highly inflexible. This decision was dictated by outmoded hardware and software at Suitland (problems that generally affect data processing at the Census Bureau) and the fact that the programming staff were trained primarily in low-level assembly and procedural languages. SIPP had to contend with the limited disk space available on the Suitland office's UNISYS equipment (being phased out), necessitating slow and outmoded tape-to-tape operations for many processing steps, and with the limitations of FORTRAN for editing and cleaning programs. For database management, the SIPP staff used the internally developed system, RIM, that lacked features of modern database management systems.

Initially, the SIPP processing staff were able to keep up with the flow of data. The first report from the 1984 panel (providing measures of income and program participation for the second quarter of 1983) was released in September 1984, as was a public-use microdata file for wave 1, only 8 months after the last month of data collection. However, as the data continued to pour in from the field, month after month, the processing system buckled under the strain.
And the initial success in prompt release of microdata files was undermined by user reports of errors, which necessitated the recall of most of the 1984 panel cross-section files.⁵ Final files for the core information from waves 1-9 of the 1984 panel were still released on a reasonably timely basis, about 13 months on average after the last month of data collection. However, topical module files were delayed, with an average release date of about 22 months after data collection; and the 1984 longitudinal panel file was not released until April 1988, or 20 months after the last month of data collection.

⁵ Recalls were necessitated not only because of errors, but also because of design flaws. For example, wave 1 public-use files omitted the employer number, an identifier essential to establishing continuity of jobs from wave to wave.

The introduction of a new panel each year added greatly to the strain on the data processing staff, particularly given the need to rewrite large sections of computer code to keep up with changes to the questionnaire and to other aspects of the survey, changes that were inevitable for a new, complex data collection program. As a result, delivery schedules deteriorated
greatly. The Census Bureau did not publish any reports from the 1985 or later panels until 1990 (see Chapter 6). Microdata files from the 1985 panel took an average of 31 months from the last month of data collection until release, and files from the 1986 panel took an average of 26 months from the last month of collection until release. Not until midway through the 1987 panel did the Census Bureau begin to achieve delivery times in the range of a year after data collection (Committee on National Statistics, 1989:Table 2-4).

To enable the data processing to catch up, the Census Bureau decided in late 1987 to freeze the core questionnaire, permitting only changes that appeared absolutely essential to meet the survey's goals of providing improved data on income and program participation.⁶ The agency also strove to minimize changes in the fixed topical modules. This strategy was successful in that the Census Bureau began to meet its delivery targets of release of public-use files within a year of collection. However, giving up flexibility in the questionnaire was a high price to pay for a new, still evolving survey that is intended to be responsive to emerging policy concerns, particularly as some of the initial design decisions had already limited the detail in the SIPP questionnaire to try to make it easier to process the data. As examples, respondents were asked about earnings for a maximum of two employers and about program income for a maximum of six sources. Also, respondents were asked about earnings on a monthly basis rather than in terms of individual paychecks; hence, respondents who were paid biweekly or on some other basis had to engage in considerable mental arithmetic to answer the questions.

During the past few years, the Census Bureau has shown commendable attention to user needs and concerns with regard to data products.
Not only have delivery schedules been speeded up, but the data processing staff, working with an advisory group from the Association of Public Data Users, have recently redesigned the core data files in a person-month format to be much more accessible for many analyses (see Chapter 6). However, many other needed improvements, for example, longitudinal editing of the wave files and an automated system to generate complete and accurate documentation (e.g., documentation of edits and imputations), have yet to be made.

⁶ For example, some wording changes were made in the 1988 panel to try to reduce the magnitude of the seam problem (e.g., asking respondents specifically to indicate the month in which program payments began before providing monthly amounts).

The Census Bureau is aware of the problems that have afflicted SIPP operations, and the agency is planning major improvements through the adoption of new technology. Specifically, the Bureau plans to convert SIPP interviewing from paper-and-pencil techniques to computer-assisted personal
interviewing (CAPI) by 1995. Also, the Census Bureau already has well under way a program to replace its UNISYS equipment with networked VAX computers, and the SIPP staff intend to switch to a commercial database management system for processing. We review the Bureau's plans for CAPI and database management technology for SIPP below.⁷ We also consider investment needs for continuing education of the SIPP data processing staff and issues involved in the transition to the new technology, together with the new survey design for SIPP.

COMPUTER-ASSISTED INTERVIEWING

There is currently considerable interest in the use of various methods of computer-assisted survey information collection (CASIC) (see Subcommittee on Computer Assisted Survey Information Collection, 1990). Relevant techniques include:

· centralized computer-assisted telephone interviewing (CATI), in which interviewers clustered at one or more central locations telephone respondents, read them questions displayed by a computer, and enter the answers into the computer (CATI can also operate in a decentralized mode, in which interviewers call respondents from their homes);
· decentralized computer-assisted personal interviewing (CAPI), in which interviewers go to respondents' homes or offices with a portable computer and read the questions from and record the answers into the computer; and
· various forms of computer-assisted self-interviewing (CASI), including prepared data entry (PDE), in which respondents themselves use a personal computer or terminal to fill out interactively the survey questionnaire; touchtone data entry (TDE), in which respondents answer computer-generated questions by pressing buttons on a telephone; and voice recognition entry (VRE), in which respondents answer questions by speaking directly into a telephone.
⁷ We note that the use of innovative data collection and processing technology, while promising many benefits for SIPP (and other surveys), is not a panacea. For data quality to be high, respondents must understand the questions and be motivated to answer them fully and accurately. In Chapter 7 we discuss a relatively new methodological research program at the Census Bureau that is applying cognitive techniques to the issue of how well respondents understand and answer the current SIPP questionnaire. On the basis of the results of that work, experiments are in progress to assess alternative, less structured interviewing techniques that promise to improve data quality.

These methods promise many advantages, including:

· improved data quality because the computer program automatically controls skip patterns and includes editing features to prevent or detect inconsistencies and other errors on the spot; also, keying errors are likely to
be reduced because there is no need for clerks to key the paper questionnaire (although the interviewers themselves may make keying mistakes);
· more timely data capture and development of analysis files because some data entry steps are eliminated and because of extensive upfront edits; and
· increased flexibility in data gathering because multiple versions of the questionnaire (e.g., in different languages) can be readily offered and changes to the questionnaire more readily programmed and documented.

We note that CASIC methods are undergoing development and that survey organizations are still learning how to use them effectively. The process of converting to a CASIC survey operation can be painful, and it is not always the case that the potential advantages from CASIC techniques will be realized in a particular application. Nonetheless, the potential gains clearly warrant investment in development and implementation.

At the present time, CATI, which is the oldest CASIC technique in use, is widely employed by government, academic, and private survey organizations in the United States and abroad. It is estimated that there are more than 1,000 CATI installations throughout the world (Subcommittee on Computer Assisted Survey Information Collection, 1990:11). The Census Bureau maintains a CATI installation and has considerable experience with the technique.

CAPI is a newer technique that is just beginning to be used in the United States.⁸ Evaluations of large-scale pilot studies for the NLS new youth cohort (NLSY) in 1989 (300 cases) and 1990 (2,400 cases, one quarter of the national effort) were very favorable (Olsen et al., 1990; Olsen, 1992). CAPI training for the NLSY took the same time as paper-and-pencil training, and there were no serious field problems. Data transmission over telephone lines was smoothly implemented and error-free.
Compared with paper-and-pencil cases, the CAPI data were determined to have fewer errors and to be of uniformly higher quality in the dimensions examined (skip errors, undocumented codes, internal inconsistencies, etc.), even though the paper-and-pencil cases were subsequently edited and the CAPI cases were accepted without cleaning.

⁸ The Netherlands developed a CAPI system called BLAISE (after Blaise Pascal) for collecting household survey data as early as the mid-1980s (see Bethlehem and Keller, 1991).

The CAPI pilot study for the Medicare Current Beneficiary Survey (CBS) in early 1991 was also successful, and the initial rounds of interviewing in fall 1991 and winter 1992 for the full CBS sample of about 15,000 Medicare beneficiaries have proceeded smoothly (Sperry, 1991; Sperry, Bittner, and Branden, n.d.). On the basis of the pilot study, the survey contractor for the CBS (Westat, Inc.) determined that additional training
was required, with a particular focus on ways to solve problems during the interviews. Also, there were initial problems, which are currently being resolved, with transmitting the data over modems attached to the interviewers' personal computers. In all other respects, including interviewer and respondent acceptance of the technique, preliminary indications of data quality, timeliness, and the ability to feed back data from an earlier interview to the next interview, the CAPI procedures appear to be working well for the CBS.

Not all experiences with CAPI have been as favorable. The Census Bureau's initial effort to collect the AIDS supplement for the Health Interview Survey (HIS) was not a success, due to hardware and software problems (National Center for Health Statistics and Bureau of the Census, 1988). However, the Census Bureau is proceeding with further tests of CATI and CAPI for the HIS, using newer portable computers with materially increased performance. Problems were also encountered in the use of CAPI by a private contractor for the 1987-1988 Nationwide Food Consumption Survey, although these problems appeared to stem largely from management failures rather than the use of CAPI per se (U.S. General Accounting Office, 1991).

The Census Bureau is committed to expanding the use of CASIC interviewing techniques for both its household and establishment surveys, and there is a high-level task force working on a Bureau-wide CASIC implementation strategy (Bureau of the Census, 1991f). The Bureau is currently working to convert the CPS to both CATI and CAPI by January 1994. Both techniques are needed because the CPS uses maximum personal interviewing for the first month in which an address is in the sample and maximum telephone interviewing for the remaining interviews.

The Census Bureau is also planning to convert SIPP data collection to CAPI methods by February 1995.
SIPP is a nearly ideal application for CAPI because it is a large, complex survey with a continuing field effort. As part of its CAPI planning for SIPP, the Census Bureau will undoubtedly evaluate the experience with maximum telephone interviewing in the panels under way as of early 1992 and determine the most cost-effective mix of telephone and personal interviews. The decentralized SIPP interviewing staff could administer a computer-assisted interview in both modes. In its review, the Bureau should also consider the possible contribution of a centralized CATI operation, which affords opportunities for increased quality control of the interviewers' work. CATI might, for example, be used to interview SIPP cases that move from one primary sampling unit (PSU) to another.

Potential Improvements for SIPP

CAPI technology offers enormous potential to improve the timeliness and quality of SIPP data and other aspects of the SIPP program, but it is also
relatively new. We therefore describe in some detail the sorts of capabilities that the Census Bureau should expect and plan for in a fully implemented CAPI system and their implications for the smooth running of the SIPP processing system. In the next section we consider the cost implications. And in the subsequent section we provide a list of important functions that we believe a SIPP CAPI system should have and review the capabilities of the Census Bureau's existing CAPI software.

Successful implementation of CAPI for SIPP should produce significant improvements in timeliness of data processing and analysis. If there is no imputation, weighting, or special coding to be done (i.e., industry and occupation), it should be possible to produce frequencies and provide Census Bureau analysts with a fully documented data file that is suitable for analysis with a widely used software package (such as SAS) within a week or two at most after the last case is transmitted from an interviewer.⁹ Given the need for various kinds of post-field processing of the data, it is essential that such processing operations be fully integrated with the design of the questionnaire. Such integration is needed to maximize smooth, timely operations and minimize bottlenecks (see further discussion, below).

CAPI should improve data quality by greatly reducing interviewer error and supporting more complex questionnaire design than is feasible for paper instruments. For example, some analysts believe that better quality data can be obtained by collecting information on income, employment, and program participation in the form of event histories in which the respondent supplies start and end dates, instead of by using fixed monthly reference periods (see discussion in Chapter 7 on cognitive research). CAPI would make it easier to collect event history data, which have often been hard to manage in paper formats.
CAPI would further improve data quality by readily enforcing the natural temporal ordering of events (e.g., jobs must be started before they are left).

Obtaining the full power of CAPI to improve data quality at reduced cost and time requires that the entire process of data editing and cleaning be redesigned, taking into account the ability to perform real-time checks with CAPI. The Census Bureau will need to review and restructure its edit specifications for SIPP, deciding which potential inconsistencies it wants to resolve during interviews; which inconsistencies it wants to eliminate by structuring the questions and allowable responses so that inconsistent replies are not logically possible; and which inconsistencies it will not attempt to resolve, prevent, or eliminate.

⁹ The availability of such a file would permit Census Bureau analysts to have an early look at the raw data and assess data quality in terms of item nonresponse rates, extreme values, and the like.
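
To make the idea of a real-time check concrete, here is a minimal Python sketch of an edit that a CAPI instrument could run while the interviewer is still with the respondent, enforcing the temporal ordering of a job spell in an event history. The function name, arguments, and messages are hypothetical; actual CAPI systems express such checks in their own authoring languages, and this is only one way the logic might look.

```python
from datetime import date
from typing import List, Optional


def check_job_spell(start: date,
                    end: Optional[date],
                    interview_date: date) -> List[str]:
    """Return a list of problems with one reported job spell.

    An empty list means the spell passes these checks. `end` is None
    for a job the respondent still holds at the time of interview.
    """
    problems = []
    if start > interview_date:
        problems.append("start date is after the interview date")
    if end is not None:
        if end < start:
            problems.append("job ends before it starts")
        if end > interview_date:
            problems.append("end date is after the interview date")
    return problems


interview = date(1992, 2, 10)

# A spell whose end date precedes its start date is flagged immediately,
# so the interviewer can ask the respondent to clarify.
bad_spell = check_job_spell(date(1991, 3, 1), date(1991, 1, 15), interview)
print(bad_spell)

# An ongoing job that started before the interview passes the checks.
ongoing_spell = check_job_spell(date(1991, 3, 1), None, interview)
print(ongoing_spell)
```

A check of this kind belongs to the first category described above, inconsistencies resolved during the interview; the same rule could instead be built into the allowable responses so that an impossible end date cannot be entered at all.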
OCR for page 147
· BLAISE, which is the system developed by the Netherlands Central Bureau of Statistics;
· CASES, which is maintained at the University of California at Berkeley;¹² and
· the Ohio State CAPI system, which is maintained by the Center for Human Resource Research at Ohio State University and used for the NLSY.

We urge the Census Bureau to give high priority to investigating existing outside CAPI systems to find one that meets the needs of SIPP more effectively than QUISC.

Recommendation 5-1: We strongly support the Census Bureau's goal to convert SIPP to computer-assisted personal interviewing (CAPI). Since the Bureau's current CAPI software system (QUISC) does not appear to meet the data collection requirements for SIPP, the Census Bureau should give high priority to investigating other available CAPI systems and determine the most appropriate system for SIPP.

DATABASE MANAGEMENT

The Census Bureau is currently in the process of updating its computing equipment, including replacing UNISYS mainframe, batch-oriented processors with networked VAX computers that facilitate interactive processing and the use of modern database management technology. The new equipment will assist data processing operations throughout the Bureau. The SIPP staff plan to take advantage of the shift to the VAX network by converting their data files and processing to powerful database management system (DBMS) software that is commercially available, such as Oracle or Relational Data Base software (Bureau of the Census, n.d.). Another candidate is Scientific Information Retrieval (SIR) software, which is used for the NLSY. These commercial systems offer various capabilities and features of the relational database model, which was originally developed as a logically rigorous and complete statement of database structure and manipulation (see Codd, 1985; Date, 1987). Other kinds of database management systems embody network or hierarchical database models.¹³ (For further discussion and assessment of database management systems, see Gray, 1984; Silberschatz, Stonebraker, and Ullman, 1990.)

Database management systems offer important capabilities that can facilitate processing and analysis of SIPP's complex data sets that embrace multiple panels, waves, households, families, people, and sources of income. They permit large databases to be accessed in an interactive mode by multiple users, which can support editing and imputation procedures that use information from other waves of data and make it possible for analysts to readily review problem cases as needed. Database management systems also provide interfaces to statistical packages that are widely used for analysis and estimation. In addition, DBMS technology, especially RDBMS software, facilitates the integration of data and documentation.

Relational database management systems offer other features that are likely to be especially helpful for a survey like SIPP. They have query languages for obtaining information from the database, using logical operations, which can be of direct utility for editing complex data. The powerful structured query language (SQL) has recently been adopted as an industry standard that will be supported to some degree by all RDBMS vendors.¹⁴ RDBMS technology also embodies consistency features that greatly reduce the opportunity for errors to occur in data processing.¹⁵ Finally, an important feature of RDBMS systems is that they provide flexibility in handling changes to a questionnaire without disrupting the entire database structure. In particular, RDBMS technology offers dynamic independence, that is, the ability to add new data to the system without restructuring the existing data, provided that the initial database design anticipates this need.

¹² We note that the Census Bureau's QUISC system evolved, like CASES, from a system that was originally developed at Berkeley. We understand that CASES, perhaps augmented with some features from QUISC, has recently become the Census Bureau's leading candidate for future CATI/CAPI development for SIPP and other surveys.

¹³ The term "relational," which distinguishes the relational model from traditional network or hierarchical database models, refers primarily to the organizational structure of the data. A relational database creates a series of rectangular tables or "flat" files, each of which is "normalized" according to the relational model in order to contain information in a very simple structure (e.g., in the SIPP context, there might be separate files for people, families, jobs, and income types). Relationships between entities (e.g., people having jobs) are also represented in these tables, as is the internal documentation of the database (the set of tables) itself. This simple but powerful structure is key to many of the advantages of relational database management technology, including its query and processing capabilities. However, for performance and other practical reasons, no current relational database management system (RDBMS) software conforms completely to the relational model in all of its features. Nevertheless, the term RDBMS is used for a DBMS that attempts to implement most of the key relational features.

¹⁴ Query languages operate in a different manner from scientific programming languages and statistical packages. Analysts would not want to use query languages in place of statistical packages for estimation purposes; however, interfaces can be designed to exploit the power of the RDBMS for efficient data retrieval together with the computational capability of a statistical package like SAS or SPSS. An example of a linkage between a statistical package and a DBMS is the PROC SQL module of SAS.

¹⁵ The relational database model specifies structural integrity constraints that enforce structural consistency on the data. In addition, the logical rules that govern data entry can draw on any part of the existing data to enforce consistency in data values; consistency may be applied to individuals, households, or other entities and combinations of entities. A properly designed RDBMS will ensure that adjustments to the database do not leave garbage in the system.
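The relational ideas described above (separate normalized tables, relationships held in the tables, structural integrity constraints, and a query language used for editing) can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a convenient stand-in for a commercial RDBMS; the table layout, column names, and month numbering are hypothetical and are not drawn from the actual SIPP file structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Two normalized tables: one row per person, one row per job.
conn.executescript("""
CREATE TABLE person (
    person_id    INTEGER PRIMARY KEY,
    household_id INTEGER NOT NULL
);
CREATE TABLE job (
    job_id      INTEGER PRIMARY KEY,
    person_id   INTEGER NOT NULL REFERENCES person(person_id),
    wave        INTEGER NOT NULL,
    start_month INTEGER NOT NULL,   -- months counted from the start of the panel
    end_month   INTEGER             -- NULL while the job is still held
);
""")

conn.execute("INSERT INTO person VALUES (1, 100)")
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 3, 9), (2, 1, 2, 11, 10)],  # job 2 ends before it starts
)

# Structural integrity: a job for a person who is not on file is refused.
try:
    conn.execute("INSERT INTO job VALUES (3, 99, 1, 1, NULL)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A query used as an edit: list jobs whose dates are out of order.
suspect_jobs = conn.execute(
    "SELECT job_id FROM job "
    "WHERE end_month IS NOT NULL AND end_month < start_month"
).fetchall()

print(orphan_rejected, suspect_jobs)
```

The first check is enforced by the database itself, whereas the second is an ordinary query that an analyst could run at any time; a production design would have to decide which rules belong in which category.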
We strongly support the development of an improved database management system for SIPP that integrates the documentation with the data and facilitates timely, accurate, and flexible data processing (we indicate below some specific functions that a DBMS needs to provide for SIPP). To achieve this goal will require adequate disk space and processing resources. We urge the Census Bureau to allocate sufficient disk space and processing resources to SIPP so that the data processing and analysis staffs can store and access SIPP data on-line in a DBMS, using magnetic tapes only for backup and other special purposes.

Needed Database Management Capabilities for SIPP

There are several capabilities that we believe it is important for an improved database management system to provide for SIPP. First, the system should be designed with sufficient flexibility so that changes in the SIPP questionnaire that are expected to improve data quality or relevance can be accommodated. It should not be the case in the future, as has happened in the past, that difficulties in processing lead the Census Bureau to "freeze" the interview content for some time. As we noted above, the relational database model has features that make it possible to change a questionnaire without having to redesign the rest of the database structure.

Second, the system should facilitate the ability to supply fully edited data from the previous interview in sufficient time to use in the next interview with the respondent. There needs to be careful coordination of this feedback capability, which is critical for achieving data quality improvements in SIPP at the source, with the design and operation of the CAPI system (see below).¹⁶

Third, the system should have the capability to supply values for missing data in a timely manner. DBMS software generally offers advantages in this regard.
Because the technology permits large data sets to be on-line, the use of a DBMS should enable the Census Bureau to move away from treating each SIPP wave as a separate cross-section for imputation purposes. Instead, it should be possible for the Bureau to develop more extensive yet timely longitudinal imputations by using data from surrounding waves, a goal that the SIPP data processing staff have indicated is a high priority.¹⁷

Fourth, the database management system should support the ready modification of imputation procedures when required. The current cross-sectional imputation system for SIPP is very inflexible and is known to be less than optimal in some respects (e.g., in the imputation of income and asset values for program recipients; see Chapter 3). To the extent that some imputations must continue to be made on a cross-sectional rather than longitudinal basis, the database management system should provide a capability to implement modifications to the imputation scheme and evaluate their effects on data quality in a timely manner. Longitudinal imputations will likely be of better quality because information from other waves is used for the actual respondent instead of information from the same wave for another (albeit similar) person. However, longitudinal imputations involve complex logic, and the automated imputation schemes that are implemented at the beginning of a panel will not likely be optimal for the full panel. Hence, there will be a continuing need for a ready capability to modify the longitudinal imputation procedures as knowledge is gained of their implications for data quality.¹⁸

In this regard, we are concerned about the SIPP staff's plans to use a DBMS and at the same time retain their FORTRAN-based editing and imputation programs. It seems unwise to have a hybrid system that does not make full use of the capabilities of the chosen database management system. The Census Bureau argues that it does not want to become overly dependent on commercial software vendors, but adopting a particular commercial system is no longer necessarily a major risk. Most vendors of DBMSs are committed to operating on different computers under various operating systems (technically, they offer "platform independence"). Furthermore, in the case of relational systems, databases can readily be moved from one RDBMS to another if a change in the RDBMS becomes necessary.

Fifth, the database management system developed for SIPP should provide a ready interface to such statistical packages as SAS and SPSS. Such interfaces will facilitate internal analysis of the data by Census Bureau staff both for evaluation purposes (e.g., analyzing the effect of imputations on the quality of estimates from the data) and for substantive studies on income, program participation, and related topics.

Finally, the database management system that is used to construct the SIPP database should also support construction of a complete corresponding database of the documentation. At present, there is no documentation database for SIPP that can be related to the data, which contributes to problems in releasing fully documented analysis files on a timely basis and hinders users in obtaining a complete understanding of the file structures and data content. This lack also substantially reduces the ability to institute more modern methods of releasing data (for example, supplying data on compact disks with extraction software or providing a facility to create extracts over such communications networks as the Internet). For most cost-effective use, these access methods require integration of the documentation, ideally including frequency counts for each variable, with the data.¹⁹

We cannot overstress the importance of seeking a database management system with comprehensive documentation capabilities and then using those capabilities to the fullest in preparing data files from SIPP. As noted above, DBMS technology, especially RDBMS software, often facilitates integration of data and documentation. An RDBMS can be implemented to maintain a vocabulary of names for each measurement, each transformation or other processing procedure, and each relationship encompassed in the database. The RDBMS will ensure that variable names are uniquely and permanently assigned, no matter how many users are making independent uses of the data. This capability should make it possible to track processing activities and changes and to generate updated descriptions of the database as each SIPP panel proceeds and as new panels are created. This tracking information can be used to produce documentation for all data processing steps, such as weighting, imputation, construction of recoded variables, and development of analysis files. The result should be greater productivity in processing multiple panels and public-use files and greater clarity and completeness in the documentation of all data processing steps.

Integration of CAPI and Database Management

We have noted the importance of integrating the CAPI and database management systems for SIPP to facilitate smooth, timely data processing and to minimize errors. The CAPI system will likely perform all data entry functions; however, if any paper forms remain, then the database management system should be used to enter data from them.

¹⁶ Presumably, most of the editing will be performed within the CAPI system, but some additional editing may be required within the DBMS. Careful coordination of the CAPI and database management systems is also needed to achieve flexibility with regard to the questionnaire content.

¹⁷ The ability to handle large data sets on-line should also make it possible to readily produce multiple-wave analysis files with appropriate weights, as well as to integrate the processing of waves from separate panels that represent the same time period of data collection: for example, wave 7 of the 1991 panel and wave 4 of the 1992 panel will be fielded at the same time. However, we note that there are some capabilities of a DBMS that would be desirable for survey processing but are not yet commercially available. For example, current systems do not support economical ways of dealing with "versions" of data that will arise as information for each SIPP panel is captured in successive interviews and longitudinal weights and imputations are altered to make use of accumulating information.

¹⁸ In addition, it could be useful for the database management system to provide a capability for multiple imputation, in which a range of imputed responses is generated for each missing value in order to permit users to assess the variability in an estimate that can be attributed to the imputation process (see Little and Rubin, 1987).

¹⁹ See Chapter 6 for a review of the current computer data products and documentation for SIPP and a discussion of gaps and needed improvements: for example, there is currently no documentation at all of the data editing and imputation procedures.
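The documentation capabilities discussed above (a vocabulary of uniquely assigned variable names and a running record of processing steps) might be sketched along the following lines. This is a hedged illustration built on Python's standard `sqlite3` module; the table names, the helper functions `register`, `log_step`, and `history`, and the sample variable `EARN_M` are all invented for the example and do not describe an existing Census Bureau system.

```python
import sqlite3

doc = sqlite3.connect(":memory:")
doc.execute("PRAGMA foreign_keys = ON")
doc.executescript("""
CREATE TABLE variable (
    name        TEXT PRIMARY KEY,   -- a name can be registered only once
    description TEXT NOT NULL
);
CREATE TABLE processing_step (
    step_id   INTEGER PRIMARY KEY,
    variable  TEXT NOT NULL REFERENCES variable(name),
    operation TEXT NOT NULL,        -- e.g. 'imputation', 'recode', 'weighting'
    note      TEXT NOT NULL
);
""")


def register(name, description):
    """Add a variable to the vocabulary; return False if the name is taken."""
    try:
        with doc:  # commits on success, rolls back on error
            doc.execute("INSERT INTO variable VALUES (?, ?)", (name, description))
        return True
    except sqlite3.IntegrityError:
        return False


def log_step(variable, operation, note):
    """Record one processing step applied to a registered variable."""
    with doc:
        doc.execute(
            "INSERT INTO processing_step (variable, operation, note) "
            "VALUES (?, ?, ?)",
            (variable, operation, note),
        )


def history(variable):
    """Return the recorded steps for a variable, oldest first."""
    rows = doc.execute(
        "SELECT operation, note FROM processing_step "
        "WHERE variable = ? ORDER BY step_id",
        (variable,),
    )
    return rows.fetchall()


register("EARN_M", "Monthly earnings (hypothetical variable)")
log_step("EARN_M", "imputation", "missing values filled from adjacent wave")
log_step("EARN_M", "recode", "top-coded for public-use file")
print(history("EARN_M"))
```

Because the step log lives in the same database as the vocabulary, a description of how each variable was produced can be regenerated on demand rather than written up after the fact.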
Ideally, the CAPI system chosen for SIPP will generate the following inputs to the database management system:

· a data dictionary that defines all questionnaire items;
· the logical rules that clearly determine the universe for each question;
· the logical rules applied at the time of an interview to enforce consistencies;
· the list of responses partitioned into sets for each separate universe defined by the rules of the interview, for example (in the context of SIPP), one set for the address, one set for each individual, one set for each job, and one set for each spell of property or program income receipt, all recorded on the basis of the relevant accounting or reference period;
· the list of exceptions, comments, and annotations to each question; and
· the set of information about the environment of the interview: time, place, mode, interviewer, duration, etc.

Whether an RDBMS or some other database management system is used, we stress again how important it is that the DBMS for SIPP have the capacity to internally track and maintain the information needed to document fully the data content and the data collection and processing activities, including imputation, weighting, construction of new variables, and reformatting. This information is necessary to produce fully documented internal and public-use files and to properly feed back information for use by the CAPI system in subsequent interviews.

Recommendation 5-2: We strongly support the Census Bureau's plans to adopt a new database management system for SIPP. The Census Bureau should use the capabilities of a DBMS to the fullest in seeking to make improvements in processing, analyzing, and documenting the data from SIPP. The processing performed by the database management system should be fully integrated with the SIPP CAPI system.
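As a rough illustration of the first two inputs listed above, a data dictionary can carry each item's universe rule alongside its definition, so that the same rule serves both the interview and later editing. The sketch below is an assumption-laden example: the `Item` class, the three sample items, and the age threshold are hypothetical and are not taken from the SIPP questionnaire.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Answers = Dict[str, Any]


@dataclass(frozen=True)
class Item:
    """One data-dictionary entry. Names and rules are hypothetical."""
    name: str
    text: str
    in_universe: Callable[[Answers], bool]  # True if the item should be asked


DATA_DICTIONARY: List[Item] = [
    Item("AGE", "Age at last birthday", lambda a: True),
    Item("WORKED", "Held a job during the reference period?",
         lambda a: a.get("AGE", 0) >= 15),
    Item("HOURS", "Usual hours worked per week",
         lambda a: a.get("WORKED") is True),
]


def items_in_universe(answers: Answers) -> List[str]:
    """Names of the items whose universe rule is satisfied by the answers."""
    return [item.name for item in DATA_DICTIONARY if item.in_universe(answers)]


def out_of_universe_answers(answers: Answers) -> List[str]:
    """Names of answered items that the universe rules say should be skipped."""
    by_name = {item.name: item for item in DATA_DICTIONARY}
    return [name for name in answers
            if name in by_name and not by_name[name].in_universe(answers)]


print(items_in_universe({"AGE": 40, "WORKED": True}))
print(out_of_universe_answers({"AGE": 12, "HOURS": 40}))
```

Keeping the rule with the definition means the database can reapply it after the interview, which is one way the CAPI and database management systems could share a single source of truth.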
INVESTING IN THE DATA PROCESSING STAFF

The Census Bureau has a distinguished history of making seminal contributions to data collection and processing technology. However, in recent years, the Bureau has lagged behind best practice and has lacked the hardware and software with which to implement state-of-the-art methods of data collection and processing. We are pleased that technological improvements are under way. We urge the Census Bureau to recognize the need to devote resources to modernization of its computing hardware and software on a continuing basis.
We further urge the Census Bureau to recognize the need for a continuing program of investment in the education and training of the data processing staff. In order to make best use of new hardware and software, the staff must be fully trained in new data processing methods. Steps must also be taken to ensure that the data processing staff regularly visit and learn from other organizations.

To date, the SIPP processing staff have been so focused on production problems that they have not been able to devote time and resources to modernization. We believe that SIPP needs the equivalent of at least one full-time staff member devoted to systems modernization: this does not mean one person, but some of the time of the best staff that is devoted to continuing education, software development, networking, and systems development. Improvements in data processing systems may later reduce staff requirements in data processing, but investments must first be made.

The SIPP processing staff should have the resources and be encouraged to visit other data processing facilities. They could find it useful to meet with the staff at the National Opinion Research Center (NORC) and Ohio State University who deal with the NLS, and the staff at the University of Michigan who deal with the PSID, both longitudinal surveys that are similar to SIPP. The SIPP staff could also learn from the experience of the Statistics of Income Division at the Internal Revenue Service and its transition to the Oracle RDBMS. They could also examine how statistical agencies in other countries (e.g., Statistics Canada) are using database management systems in the processing of survey data.

Just as the Census Bureau invests in the staff that work on survey design, methodology, and evaluation (e.g., by sponsoring methodological research conferences), it is critical that the Bureau invest in the equally important data production people.
The ability of the processing staff to make full use of improved technology greatly depends on the support they receive for continued development and use of their data processing skills.

Recommendation 5-3: In view of the major advances that continue to occur in computing hardware and software, the Census Bureau should devote significant resources to continued education and training of its data processing staff. In particular, the SIPP processing staff should take advantage of the experience of data processing facilities outside the Census Bureau that deal with longitudinal surveys.

TRANSITION TO A CAPI/DATABASE MANAGEMENT SYSTEM FOR SIPP

Moving a survey to CAPI and an improved database management system affects nearly every step of the survey, from questionnaire design through
data dissemination. We are concerned that there may not be enough time to fully test and work out the inevitable operational problems prior to the scheduled implementation in February 1995 of a full-blown CAPI/database management system for SIPP. The new technologies must not only be fully developed and tested in their own right, but also mesh with other changes to SIPP, such as questionnaire content and format changes. The length and complexity of the SIPP questionnaire, which entail the need for complex editing and data processing procedures, and the frequency of interviews will pose substantial challenges to the smooth implementation of CAPI/database management technology.

We believe it is critical to the future success of SIPP that all aspects of the redesign work well from the start. It would be tragic to have a replay of the situation that occurred at the inauguration of SIPP, when the need to move immediately into the field precluded making needed changes to the data processing system, with consequent adverse effects for the delivery of data products. At present, the Census Bureau has more than 2 years before the scheduled redesign in 1995 to develop a CAPI/database management system for SIPP. However, this amount of time may not be sufficient to ensure a problem-free changeover, particularly given all of the other changes that are likely to be introduced at the same time and the fact that a final decision has not yet been made on which CAPI or database management system to use.
A review of the current Census Bureau schedule (Fischer, 1992) shows the following key milestones that relate to CAPI and database management technology:

· finalize the content of the wave 1 questionnaire by December 1992;
· develop the CAPI wave 1 questionnaire (including specifications and programming) in January-October 1993 and have a dress rehearsal in February 1994;
· develop the CAPI wave 2 questionnaire in March 1993-March 1994 and have a dress rehearsal in June 1994;
· develop the wave 1 systems design in June 1992-April 1993 and the specifications for the wave 1 processing system in January-November 1993; and
· develop the wave 2 systems design in November 1992-October 1993 and the specifications for the wave 2 processing system in November 1993-October 1994.

This schedule is very tight. It requires making an early decision on the content of the questionnaire, which precludes making much use of results from the current program of cognitive questionnaire research or from the forward record-check studies that we strongly urge for questionnaire testing and experimentation (see Chapter 7). The schedule also allows very little
time for testing the integration of the CAPI/database management system for wave 2 and subsequent interviews, which is essential for such functions as feeding back data from the previous to the current interview.

Although we agree that it is important to have the SIPP redesign implemented on a timely basis, we do not believe there is any magic to the current scheduled start-up date. If additional time for development and operational testing would permit a smoother transition, then we believe that time would be well spent. We suggest that the Census Bureau consider the following schedule (see Table 5-2): field a somewhat smaller panel of, say, 15,000 households in 1995 that uses CAPI and database management technology for data collection and processing. The primary purpose of the 1995 panel would be to conduct a full-bore dress rehearsal of the new system, identifying operational problems that could be corrected in time for implementation of all aspects of the redesign with the 1996 panel. (The redesign of the sample per se could be introduced in 1995 as scheduled. Indeed, it would be advantageous to do so, as then all panels that are based on the new technology would have the same sample design.) In the best outcome, the 1995 panel would provide a smooth transition for 1996 and also provide high-quality data for users; at worst, the 1995 panel would fulfill the former very important function. At a minimum, the 1995 panel would include three interviews; if it is going well, it could be continued for another year or two and used for additional testing that could feed into the next new panel, which would start up in 1998.²⁰ (Continuing the 1995 panel, as shown in Table 5-2, has the advantage of avoiding big changes in the total interviewing workload.)

TABLE 5-2 Suggested Schedule for Implementing the SIPP Redesign and Use of CAPI/DBMS Technology, Including a Large Dress Rehearsal Panel in 1995

[Table body: wave numbers by panel (1992, 1993, 1994, 1995, 1996, 1998) and calendar year (1994-2000), divided into Paper and Pencil and CAPI/DBMS collection; in 2000 the 1998 panel continues and the 2000 panel begins. The individual wave entries are not legible in this copy.]

Original sample size (households): 1992 panel, 20,000; 1993 panel, 20,000; 1994 panel, 20,000; 1995 panel, 15,000; 1996 panel, 26,700; 1998 panel, 26,700.

NOTE: ( ) indicates optional.

This schedule has several benefits, primarily that it should greatly reduce the risks from unanticipated operational problems with the CAPI/database management system. Also, it should permit greater flexibility over the next couple of years for the Census Bureau to experiment with questionnaire content and format. Even with an additional year, we are not sanguine that a CAPI system could be developed to handle the type of free-form questionnaire that is being evaluated in the cognitive research program (see Chapter 7). However, we do believe that it should be possible to improve the current structured questionnaire by making use of the results of that research. Yet another advantage of this schedule is that a new panel will start up at the time of the year 2000 decennial census, thereby providing a better opportunity to compare the census and SIPP than if SIPP panels were introduced in odd years.
There remains the question of what to do with the 1993 and 1994 panels, which will have begun with paper-and-pencil methods: namely, whether to switch them to the CAPI/database management system when the 1995 test panel is introduced or run the paper-and-pencil and CAPI systems side by side. As we noted above, the latter strategy means increased costs. However, a sudden switch could pose other kinds of problems: for example, the regional offices would have to cope with an abrupt reduction in workload. Also, the data processing staff would have to make special efforts to quickly move the data from the most recent waves of existing panels so that the next wave could be CAPI and also prepare CAPI versions of the SIPP questionnaires for existing as well as new panels. These efforts would greatly increase the burden on the data processing staff.

In our view, the problem of the 1993 and 1994 panels is another argument for running the 1995 panel as a dress rehearsal for the CAPI/database management system with a somewhat smaller sample size that should free up resources to handle the new as well as the old systems. Under this plan, it seems most feasible to continue the use of paper-and-pencil methods for the 1993 panel, which will have only two remaining waves in 1995. We also suggest that the Census Bureau consider truncating the 1994 panel at six instead of eight waves (see Table 5-2). If this step is taken, it would seem most feasible to continue the 1994 panel under the old system: the 1994 panel would have only three remaining waves in 1995 and would end before the start of the 1996 panel. Beginning in 1996, the Census Bureau would have only the CAPI/database management system to run.²¹

Recommendation 5-4: The Census Bureau should make every effort to ensure smooth implementation of CAPI and an improved database management system for SIPP under the new design of 4-year panels introduced every 2 years. One option that the Census Bureau should consider is to field a large-scale dress rehearsal panel in 1995 as a way to work out any operational problems. Under this scheme, full implementation of the SIPP survey redesign would occur in 1996.

²⁰ To the extent possible, the regular SIPP data products should be provided to users from the 1995 panel, although, if problems arise, it may be necessary to release data products as "research" files or reports to be used with caution. (The Census Bureau has issued "research" products in the past, when it had reason to doubt their quality or viewed them as preliminary.)

²¹ Under the suggested schedule, it will be important to carefully consider the order of topical modules and determine which ones (e.g., program eligibility) are essential to include if the 1994 panel is truncated.