5
Data Collection and Processing
In this chapter we consider the operational aspects of SIPP: how the data
are collected and processed. Survey operations, as distinct from design,
evaluation, and analysis, represent by far the largest component of total
survey costs. Moreover, the care and efficiency with which a survey is
operated directly and substantially affect the quality and timeliness of the
data. Hence, no assessment of SIPP would be complete without a review of
SIPP's present operations and the Census Bureau's plans for future changes.
CURRENT OPERATIONS
Table 5-1 shows a rough distribution of SIPP costs by function as a reference
for our discussion of SIPP interviewing and data processing operations. Field
costs that are associated with the interviewing staff (travel and
communications, payments to interviewers, and training) amount to 41 percent
of the total. Data processing costs, of which about two-thirds are
1Our evaluation of SIPP operations and plans for future improvements is greatly indebted to
the work of panel members Martin David and Randall Olsen and to Carol Sheets, head of the
data processing staff for the National Longitudinal Surveys of Labor Market Experience (NLS)
at Ohio State University. These people visited the Census Bureau headquarters twice and
developed a background paper with assessments and recommendations; Dr. Olsen also visited
the Census Bureau's Chicago regional office. They and the entire panel are very appreciative
of the wholehearted cooperation of Census Bureau staff during the site visits and in response to
requests of the panel for information about SIPP operations.
THE SURVEY OF INCOME AND PROGRAM PARTICIPATION
TABLE 5-1 SIPP Budget, by Major Function, Fiscal 1992 (in percent)

Function                                       Percent of Budget
Sample design and selection                          6.0
Questionnaire development and materials              3.0
Field                                               40.8
  Travel and communications                          7.2
  Payments to interviewers (420 people)             29.4
  Training                                           4.2
Data processing                                     27.2
  Regional office data entry (keying)                4.2
  Regional office clerical operations               12.0
  Other regional office activities                   3.0
  Data processing (headquarters)                     8.0
Research and evaluation                             10.0
Data analysis, reports, and printing                 8.0
Data dissemination                                   3.0
Administration                                       2.0
Total costs                                        100.0 ($31.0 million)

NOTE: Distribution of costs is before institution of maximum telephone
interviewing in February 1992.
SOURCE: Estimates from Census Bureau staff.
associated with the regional offices and one-third with headquarters, amount
to another 27 percent of the total annual expenditure of $31 million for
SIPP.
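As a rough check on the shares quoted in the text, the percentages in Table 5-1 can be summed directly. The sketch below is illustrative only: the variable names are ours, and the assignment of the sub-item values to rows is an assumption, chosen because it is the one under which the Field and Data processing subtotals and the grand total all add up.

```python
# Percentages read from Table 5-1 (fiscal 1992). The row assignment of the
# sub-item values is an assumption that makes the subtotals consistent.
field = {
    "travel_and_communications": 7.2,
    "payments_to_interviewers": 29.4,
    "training": 4.2,
}
data_processing = {
    "regional_data_entry": 4.2,
    "regional_clerical": 12.0,
    "regional_other": 3.0,
    "headquarters": 8.0,
}
other_functions = {
    "sample_design_and_selection": 6.0,
    "questionnaire_development": 3.0,
    "research_and_evaluation": 10.0,
    "analysis_reports_printing": 8.0,
    "data_dissemination": 3.0,
    "administration": 2.0,
}

# Subtotals for the two largest functions, rounded to one decimal place
# to avoid floating-point noise.
field_share = round(sum(field.values()), 1)
dp_share = round(sum(data_processing.values()), 1)

# Fraction of data processing costs incurred in the regional offices.
regional_fraction = (dp_share - data_processing["headquarters"]) / dp_share

# Grand total across all major functions.
total_share = round(field_share + dp_share + sum(other_functions.values()), 1)

print(f"Field: {field_share}%  Data processing: {dp_share}%")
print(f"Regional share of data processing: {regional_fraction:.0%}")
print(f"Total: {total_share}%")
```

Under this reading, the regional offices account for roughly 70 percent of data processing costs, which is consistent with the "about two-thirds" stated in the text.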
Interviewing
From the beginning, SIPP was expected to involve labor-intensive inter-
viewing procedures in order to obtain high-quality responses to detailed
questions on complex aspects of households' socioeconomic status and well-
being. For the 1984-1990 panels, face-to-face interviewing has been the
preferred mode and the one used in most cases. Telephone interviewing has
been permitted to follow up for information not obtained in face-to-face
interviews, to interview people who would not or could not participate
otherwise, and to interview sample people who moved to locations more
than 100 miles from a SIPP primary sampling unit area. For the 1984 and
1985 panels, about 5-6 percent of interviews were conducted by telephone,
with the proportion increasing from the second through the final wave of
each panel (Jabine, King, and Petroni, 1990:20).
SIPP interviewers collect information from respondents using paper-and-pencil
techniques. At each visit, the interviewer updates a large control
card (containing basic demographic characteristics of household members,
housing structure characteristics, telephone numbers, and some other items)
and completes a bulky questionnaire for each adult aged 15 and older, using
numerous flash cards to aid respondents. The questionnaire differs across
waves because of the inclusion of different topical modules; the wave 1
questionnaire also differs from all other waves because of the use of depen-
dent interviewing for many items after wave 1 (i.e., reminding respondents
of their answers in the prior wave and updating the information rather than
asking each question afresh).
The questionnaires are highly structured, with complex skip patterns
and a good deal of redundancy as a way of jogging respondents' memories
and providing a basis to check for inconsistencies or impute missing infor-
mation. Interviewers must transcribe many items, either during the course
of the interview or prior to the interview (to capture needed information
from the previous wave). For example, each income source mentioned in
the recipiency section must be coded onto the income source summary at
the back of the questionnaire. For each code on that summary, the
interviewer asks questions about income amounts received during the 4-month
reference period.
Despite the magnitude and complexity of the task, interviewing in SIPP
has proceeded quite smoothly. At the outset of the program, there were
fears that the interviewers (and respondents) could not cope well with such
a long and involved survey. Indeed, the turnover rate for interviewers was
initially high (32 percent in fiscal 1986), but in fiscal 1988 the rate was
down to 18.5 percent, in comparison with 20-25 percent for other major
surveys conducted by the Census Bureau (Jabine, King, and Petroni, 1990:24).
Interviewers have also become more experienced: in fiscal 1986 only about
33 percent had 3 or more years of survey experience; in fiscal 1988 almost
60 percent of the interviewers had that much experience. Continuous train-
ing is provided to SIPP interviewers, and their work is monitored in several
ways (e.g., by personal observation and reinterview). Often their reactions
are sought about the success of one or another experiment and about pro-
posed changes in procedures (e.g., greater use of telephone interviewing).
Although the SIPP interviewers are highly professional in their work, it
is also evident that the answers they elicit from respondents are often flawed
(see Chapter 3). It appears likely that the structure of the questionnaire
contributes to such data quality problems in SIPP as underreporting of asset
income and confusion among program names. Also, paper-and-pencil
techniques with such a long, involved questionnaire inevitably lead to
inefficiencies and introduce opportunities for interviewer as well as respondent
errors (e.g., transcribing errors and mistakes in following the skip patterns).
Recently, the Census Bureau decided to switch to a mode of maximum
telephone interviewing as a cost-cutting measure. Beginning in February
1992, waves 1, 2, and 6 of each SIPP panel are to be conducted as before by
face-to-face interviewing to the extent feasible; however, the remaining waves
are to be conducted by telephone, again to the maximum extent feasible.
The telephoning and personal visits will be carried out by the same
interviewers using the same questionnaire, with the interviewers making phone
calls from their homes.
The Census Bureau conservatively expects to save about $500,000 per
year from the switch (roughly 4 percent of total costs associated with
interviewing; see Table 5-1), due to reductions in travel costs and the time of
interviewers.2 The plan is to use the savings to improve SIPP's data products
and dissemination program. The Bureau hopes that there will be little
loss of data quality.3 Experiments conducted with maximum telephone
interviewing in 1985-1986 found relatively few differences in nonresponse
rates and analytical measures between the experimental and control groups
(Gbur and Petroni, 1989; Gbur, Cantwell, and Petroni, 1990). However,
there was some evidence, particularly for blacks, that maximum telephone
interviewing produced lower estimates of the poverty rate and other mea-
sures related to low income and receipt of means-tested program benefits.
Also, the experiments covered only two successive waves, so no information
is available on mode differences over a longer period.
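The "roughly 4 percent" figure for the expected savings can be reproduced from the Table 5-1 totals. The following is a minimal sketch with variable names of our own choosing; it assumes that "total costs associated with interviewing" corresponds to the Field share of 40.8 percent of the $31.0 million budget.

```python
# Inputs taken from the text and Table 5-1.
total_budget = 31.0e6        # total annual SIPP expenditure, dollars
field_share = 0.408          # Field costs as a fraction of the budget
expected_savings = 500_000   # conservative annual savings, dollars

# Assumption: interviewing costs are the Field line of Table 5-1.
field_costs = total_budget * field_share

# Savings expressed relative to field costs and to the whole budget.
savings_pct_of_field = 100 * expected_savings / field_costs
savings_pct_of_total = 100 * expected_savings / total_budget

print(f"Field costs: ${field_costs:,.0f}")
print(f"Savings as share of field costs: {savings_pct_of_field:.1f}%")
print(f"Savings as share of total budget: {savings_pct_of_total:.1f}%")
```

On these assumptions the savings come to just under 4 percent of field costs, which matches the figure in the text, and to less than 2 percent of the total SIPP budget.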
Regional Office Operations
The Census Bureau's 12 regional offices play an important role in processing
SIPP data. Clerks check the completed questionnaires mailed in by the
SIPP interviewers for errors and omissions and assign geographic codes for
sample people who moved. Data entry clerks key the information from the
questionnaires, using software that checks for the presence of identifiers
and selected control card data items. Batches of keyed questionnaires are
verified, and data files for accepted batches are transmitted electronically to
Census Bureau headquarters in Suitland, Maryland. Quarterly reports on
verification results indicate that error levels in the keying operations are
very low (Jabine, King, and Petroni, 1990:81).
2Interviewers are paid by the hour. In order not to reduce the pay of interviewers already on
the staff, the Census Bureau planned to hire fewer new interviewers than would otherwise be
needed for the 1992 panel, which has a larger initial sample size than the 1991 panel.
3The program to improve the data products is in the formative stages, and so there is a lack
of available detail. This makes it impossible to determine whether the $500,000 will be too
much, too little, or about right to support these future changes. Likewise, it is impossible to
determine whether these future improvements justify the possible risk to data collection of
maximum telephone interviewing.
Errors in the data that are diagnosed in Suitland are returned to the
regional offices for correction, if possible. Regional editors are given little
latitude to use judgment or knowledge of the case to edit problematic cases.
Calls to interviewers to resolve problems are rare, and follow-up calls to
respondents even more so.
Home Office Operations
The Census Bureau's home office in Suitland, Maryland, handles all subse-
quent editing and preparation of SIPP data files, with the exception of
coding of industry and occupation, which is accomplished at the Bureau's
processing facility in Jeffersonville, Indiana. Data for each wave of each
panel are processed separately. Steps in data preparation include (see Jabine,
King, and Petroni, 1990:80-81):
· checking each file transmitted from the regional offices to ensure
that all expected cases, both interviews and noninterviews, are received;
· transmitting keyed verbal descriptions of occupation and industry to
the Jeffersonville facility for coding;
· imputing data for noninterviewed people in interviewed households
(Type Z nonresponse);
· performing extensive consistency edits within and between sections
of the questionnaire, between the control card and the questionnaire, and
among responses for people in the same family and household;
· performing extensive sets of edits and imputations on each section of
the questionnaire, including topical modules, to ensure that responses ap-
pear when they should and to impute missing values;4
· developing recodes based on combinations of data items to add to
the data records;
· checking the accuracy of geographic codes;
· imputing an estimated household size for households that moved and
could not be located, to use in the calculation of weights for movers;
· calculating cross-sectional weights for each month in the wave; and
· reformatting records and altering some data items to protect confi-
dentiality as input to microdata files that are suitable for public release.
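The first step in the list above, confirming that every expected case has arrived from the regional offices, amounts to a reconciliation of two lists of identifiers. The sketch below is a hypothetical illustration, not the Census Bureau's procedure: the record layout, the `CaseRecord` and `reconcile` names, and the case identifiers are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    """One transmitted case (hypothetical layout)."""
    case_id: str
    outcome: str  # "interview" or "noninterview"


def reconcile(expected_ids, received):
    """Compare the expected case list with the records actually received.

    Returns the case ids that are missing, unexpected, or duplicated.
    """
    seen = set()
    duplicates = set()
    for record in received:
        if record.case_id in seen:
            duplicates.add(record.case_id)
        seen.add(record.case_id)
    expected = set(expected_ids)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
        "duplicates": sorted(duplicates),
    }


# Illustrative data: one case never arrives, one arrives twice, and one
# arrives that was not on the control list.
expected_ids = ["A1", "A2", "A3"]
received = [
    CaseRecord("A1", "interview"),
    CaseRecord("A3", "noninterview"),
    CaseRecord("A3", "noninterview"),
    CaseRecord("B9", "interview"),
]
report = reconcile(expected_ids, received)
print(report)
```

Note that noninterviews count as received cases here, since the text specifies that both interviews and noninterviews must be accounted for.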
Later, after all waves of a panel are processed, the data for selected
items are further edited for consistency over time, longitudinal weights are
developed, and a public-use longitudinal file constructed. Changes due to
longitudinal edits are not carried over to the cross-sectional wave files.
4When edit programs diagnose a problem, that problem is resolved mechanically. While
operationally efficient, in some cases this approach may degrade data quality.
Although SIPP home office processing operations have settled down
and are currently running relatively smoothly, it is no overstatement to say
that data processing at Suitland has been the Achilles heel of the SIPP
program. When SIPP began in a great rush, there was no time to evaluate
the processing system that had been used for the Income Survey Develop-
ment Program (ISDP) or think through the computing requirements for a
continuing longitudinal survey of the size and scope of SIPP. The Census
Bureau modified a system developed for the Current Population Survey
(CPS) to use for SIPP, which treated each wave of each panel as a separate
cross-section and was highly inflexible. This decision was dictated by
outmoded hardware and software at Suitland (problems that generally affect
data processing at the Census Bureau) and the fact that the programming
staff were trained primarily in low-level assembly and procedural languages.
SIPP had to contend with the limited disk space available on the Suitland
office's UNISYS equipment (being phased out), necessitating slow and
outmoded tape-to-tape operations for many processing steps, and with the
limitations of FORTRAN for editing and cleaning programs. For database
management, the SIPP staff used the internally developed system, RIM, that
lacked features of modern database management systems.
Initially, the SIPP processing staff were able to keep up with the flow
of data. The first report from the 1984 panel (providing measures of income
and program participation for the second quarter of 1983) was released
in September 1984, as was a public-use microdata file for wave 1,
only 8 months after the last month of data collection. However, as the data
continued to pour in from the field, month after month, the processing
system buckled under the strain. And the initial success in prompt release
of microdata files was undermined by user reports of errors, which
necessitated the recall of most of the 1984 panel cross-section files.5 Final
files for the core information from waves 1-9 of the 1984 panel were still
released on a reasonably timely basis, about 13 months on average after the
last month of data collection. However, topical module files were delayed,
with an average release date of about 22 months after data collection; and
the 1984 longitudinal panel file was not released until April 1988, or 20
months after the last month of data collection.
The introduction of a new panel each year added greatly to the strain on
the data processing staff, particularly given the need to rewrite large
sections of computer code to keep up with changes to the questionnaire and to
other aspects of the survey, changes that were inevitable for a new, complex
data collection program. As a result, delivery schedules deteriorated
5Recalls were necessitated not only because of errors, but also because of design flaws. For
example, wave 1 public-use files omitted the employer number, an identifier essential to estab-
lishing continuity of jobs from wave to wave.
greatly. The Census Bureau did not publish any reports from the 1985 or
later panels until 1990 (see Chapter 6). Microdata files from the 1985 panel
took an average of 31 months from the last month of data collection until
release, and files from the 1986 panel took an average of 26 months from
the last month of collection until release. Not until midway through the
1987 panel did the Census Bureau begin to achieve delivery times in the
range of a year after data collection (Committee on National Statistics,
1989:Table 2-4).
To enable the data processing to catch up, the Census Bureau decided in
late 1987 to freeze the core questionnaire, permitting only changes that
appeared absolutely essential to meet the survey's goals of providing
improved data on income and program participation.6 The agency also strove
to minimize changes in the fixed topical modules. This strategy was suc-
cessful in that the Census Bureau began to meet its delivery targets of
release of public-use files within a year of collection. However, giving up
flexibility in the questionnaire was a high price to pay for a new, still
evolving survey that is intended to be responsive to emerging policy
concerns, particularly as some of the initial design decisions had already
limited the detail in the SIPP questionnaire to try to make it easier to process
the data. As examples, respondents were asked about earnings for a maxi-
mum of two employers and about program income for a maximum of six
sources. Also, respondents were asked about earnings on a monthly basis
rather than in terms of individual paychecks; hence, respondents who were
paid biweekly or on some other basis had to engage in considerable mental
arithmetic to answer the questions.
During the past few years, the Census Bureau has shown commendable
attention to user needs and concerns with regard to data products. Not only
have delivery schedules been speeded up, but the data processing staff,
working with an advisory group from the Association of Public Data Users,
have recently redesigned the core data files in a person-month format
to be much more accessible for many analyses (see Chapter 6). However,
many other needed improvements, for example, longitudinal editing of the
wave files and an automated system to generate complete and accurate
documentation (e.g., documentation of edits and imputations), have yet to
be made.
The Census Bureau is aware of the problems that have afflicted SIPP
operations, and the agency is planning major improvements through the
adoption of new technology. Specifically, the Bureau plans to convert SIPP
interviewing from paper-and-pencil techniques to computer-assisted personal
6For example, some wording changes were made in the 1988 panel to try to reduce the
magnitude of the seam problem (e.g., asking respondents specifically to indicate the month in
which program payments began before providing monthly amounts).
interviewing (CAPI) by 1995. Also, the Census Bureau already has well
under way a program to replace its UNISYS equipment with networked
VAX computers, and the SIPP staff intend to switch to a commercial data-
base management system for processing. We review the Bureau's plans for
CAPI and database management technology for SIPP below.7 We also
consider investment needs for continuing education of the SIPP data pro-
cessing staff and issues involved in the transition to the new technology,
together with the new survey design for SIPP.
COMPUTER-ASSISTED INTERVIEWING
There is currently considerable interest in the use of various methods of
computer-assisted survey information collection (CASIC) (see Subcommittee
on Computer Assisted Survey Information Collection, 1990). Relevant
techniques include:
· centralized computer-assisted telephone interviewing (CATI), in which
interviewers clustered at one or more central locations telephone
respondents, read them questions displayed by a computer, and enter the answers
into the computer (CATI can also operate in a decentralized mode, in which
interviewers call respondents from their homes);
· decentralized computer-assisted personal interviewing (CAPI), in which
interviewers go to respondents' homes or offices with a portable computer
and read the questions from and record the answers into the computer; and
· various forms of computer-assisted self-interviewing (CASI), includ-
ing prepared data entry (PDE), in which respondents themselves use a per-
sonal computer or terminal to fill out interactively the survey questionnaire;
touchtone data entry (TDE), in which respondents answer computer-gener-
ated questions by pressing buttons on a telephone; and voice recognition
entry (VRE), in which respondents answer questions by speaking directly
into a telephone.
These methods promise many advantages, including:
· improved data quality because the computer program automatically
controls skip patterns and includes editing features to prevent or detect
inconsistencies and other errors on the spot; also, keying errors are likely to
7We note that the use of innovative data collection and processing technology, while prom-
ising many benefits for SIPP (and other surveys), is not a panacea. For data quality to be high,
respondents must understand the questions and be motivated to answer them fully and accu-
rately. In Chapter 7 we discuss a relatively new methodological research program at the
Census Bureau that is applying cognitive techniques to the issue of how well respondents
understand and answer the current SIPP questionnaire. On the basis of the results of that work,
experiments are in progress to assess alternative, less structured interviewing techniques that
promise to improve data quality.
be reduced because there is no need for clerks to key the paper question-
naire (although the interviewers themselves may make keying mistakes);
· more timely data capture and development of analysis files because
some data entry steps are eliminated and because of extensive upfront edits;
and
· increased flexibility in data gathering because multiple versions of
the questionnaire (e.g., in different languages) can be readily offered and
changes to the questionnaire more readily programmed and documented.
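To make the first advantage concrete, the following sketch shows one way a program can control skip patterns so that an interviewer is never shown a question that does not apply. It is a toy illustration under our own assumptions: the question identifiers, the routing table, and the `next_question` and `run_interview` functions are hypothetical and do not describe any particular CASIC system.

```python
# Routing table (hypothetical): for each question, an ordered list of
# (condition, next question) pairs. The first condition that holds wins.
# A next question of None marks the end of the interview.
ROUTING = {
    "worked": [
        (lambda a: a["worked"] == "yes", "employer"),
        (lambda a: True, "looked_for_work"),
    ],
    "employer": [(lambda a: True, "earnings")],
    "earnings": [(lambda a: True, None)],
    "looked_for_work": [(lambda a: True, None)],
}


def next_question(current, answers, routing=ROUTING):
    """Return the id of the question that follows `current`, or None."""
    for condition, target in routing.get(current, []):
        if condition(answers):
            return target
    return None


def run_interview(responses, start="worked"):
    """Walk the instrument, recording only answers on the routed path.

    `responses` stands in for what the respondent would say if asked;
    answers to questions that are skipped are never recorded.
    """
    answers = {}
    path = []
    question = start
    while question is not None:
        if question not in responses:
            raise KeyError(f"no response supplied for {question!r}")
        answers[question] = responses[question]
        path.append(question)
        question = next_question(question, answers)
    return path, answers


employed_path, employed_answers = run_interview(
    {"worked": "yes", "employer": "Acme", "earnings": 1200}
)
# "earnings" is supplied here but must be skipped for a nonworker.
nonworker_path, nonworker_answers = run_interview(
    {"worked": "no", "looked_for_work": "yes", "earnings": 999}
)
print(employed_path)
print(nonworker_path)
```

Because the routing lives in a table rather than in printed instructions, a change to the questionnaire is a change to the table, which relates to the flexibility advantage noted in the last item of the list.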
We note that CASIC methods are undergoing development and that
survey organizations are still learning how to use them effectively. The
process of converting to a CASIC survey operation can be painful, and it is
not always the case that the potential advantages from CASIC techniques
will be realized in a particular application. Nonetheless, the potential gains
clearly warrant investment in development and implementation.
At the present time, CATI, which is the oldest CASIC technique in use,
is widely employed by government, academic, and private survey organizations
in the United States and abroad. It is estimated that there are more
than 1,000 CATI installations throughout the world (Subcommittee on
Computer Assisted Survey Information Collection, 1990:11). The Census Bureau
maintains a CATI installation and has considerable experience with the
technique.
CAPI is a newer technique that is just beginning to be used in the
United States.8 Evaluations of large-scale pilot studies for the NLS new
youth cohort (NLSY) in 1989 (300 cases) and 1990 (2,400 cases, one quarter
of the national effort) were very favorable (Olsen et al., 1990; Olsen,
1992). CAPI training for the NLSY took the same time as paper-and-pencil
training, and there were no serious field problems. Data transmission over
telephone lines was smoothly implemented and error-free. Compared with
paper-and-pencil cases, the CAPI data were determined to have fewer errors
and to be of uniformly higher quality in the dimensions examined (skip
errors, undocumented codes, internal inconsistencies, etc.), even though the
paper-and-pencil cases were subsequently edited and the CAPI cases were
accepted without cleaning.
The CAPI pilot study for the Medicare Current Beneficiary Survey
(CBS) in early 1991 was also successful, and the initial rounds of inter-
viewing in fall 1991 and winter 1992 for the full CBS sample of about
15,000 Medicare beneficiaries have proceeded smoothly (Sperry, 1991; Sperry,
Bittner, and Branden, n.d.). On the basis of the pilot study, the survey
contractor for the CBS (Westat, Inc.) determined that additional training
8The Netherlands developed a CAPI system called BLAISE (after Blaise Pascal) for collect-
ing household survey data as early as the mid-1980s (see Bethlehem and Keller, 1991).
was required, with a particular focus on ways to solve problems during the
interviews. Also, there were initial problems, which are currently being
resolved, with transmitting the data over modems attached to the
interviewers' personal computers. In all other respects, including interviewer
and respondent acceptance of the technique, preliminary indications of data
quality, timeliness, and the ability to feed back data from an earlier inter-
view to the next interview, the CAPI procedures appear to be working well
for the CBS.
Not all experiences with CAPI have been as favorable. The Census
Bureau's initial effort to collect the AIDS supplement for the Health
Interview Survey (HIS) was not a success, due to hardware and software
problems (National Center for Health Statistics and Bureau of the Census, 1988).
However, the Census Bureau is proceeding with further tests of CATI and
CAPI for the HIS, using newer portable computers with materially increased
performance. Problems were also encountered in the use of CAPI by a private
contractor for the 1987-1988 Nationwide Food Consumption Survey, although
these problems appeared to stem largely from management failures rather
than the use of CAPI per se (U.S. General Accounting Office, 1991).
The Census Bureau is committed to expanding the use of CASIC inter-
viewing techniques for both its household and establishment surveys, and
there is a high-level task force working on a Bureau-wide CASIC
implementation strategy (Bureau of the Census, 1991f). The Bureau is currently
working to convert the CPS to both CATI and CAPI by January 1994. Both
techniques are needed because the CPS uses maximum personal interview-
ing for the first month in which an address is in the sample and maximum
telephone interviewing for the remaining interviews. The Census Bureau is
also planning to convert SIPP data collection to CAPI methods by February
1995. SIPP is a nearly ideal application for CAPI because it is a large,
complex survey with a continuing field effort. As part of its CAPI planning
for SIPP, the Census Bureau will undoubtedly evaluate the experience with
maximum telephone interviewing in the panels under way as of early 1992
and determine the most cost-effective mix of telephone and personal inter-
views. The decentralized SIPP interviewing staff could administer a com-
puter-assisted interview in both modes. In its review, the Bureau should
also consider the possible contribution of a centralized CATI operation,
which affords opportunities for increased quality control of the interview-
ers' work. CATI might, for example, be used to interview SIPP cases that
move from one primary sampling unit (PSU) to another.
Potential Improvements for SIPP
CAPI technology offers enormous potential to improve the timeliness and
quality of SIPP data and other aspects of the SIPP program, but it is also
relatively new. We therefore describe in some detail the sorts of capabili-
ties that the Census Bureau should expect and plan for in a fully imple-
mented CAPI system and their implications for the smooth running of the
SIPP processing system. In the next section we consider the cost implica-
tions. And in the subsequent section we provide a list of important func-
tions that we believe a SIPP CAPI system should have and review the
capabilities of the Census Bureau's existing CAPI software.
Successful implementation of CAPI for SIPP should produce signifi-
cant improvements in timeliness of data processing and analysis. If there is
no imputation, weighting, or special coding to be done (i.e., industry and
occupation), it should be possible to produce frequencies and provide Cen-
sus Bureau analysts with a fully documented data file that is suitable for
analysis with a widely used software package (such as SAS) within a week
or two at most after the last case is transmitted from an interviewer.9 Given
the need for various kinds of post-field processing of the data, it is essential
that such processing operations be fully integrated with the design of the
questionnaire. Such integration is needed to maximize smooth, timely op-
erations and minimize bottlenecks (see further discussion, below).
CAPI should improve data quality by greatly reducing interviewer error
and supporting more complex questionnaire design than is feasible for pa-
per instruments. For example, some analysts believe that better quality data
can be obtained by collecting information on income, employment, and
program participation in the form of event histories in which the respondent
supplies start and end dates, instead of by using fixed monthly reference
periods (see discussion in Chapter 7 on cognitive research). CAPI would
make it easier to collect event history data, which have often been hard to
manage in paper formats. CAPI would further improve data quality by
readily enforcing the natural temporal ordering of events (e.g., jobs must be
started before they are left).
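The kind of real-time check on temporal ordering described above can be sketched briefly. The example is hypothetical: the `Spell` record, the `check_spell` function, and the dates are our own assumptions, intended only to show how an interviewing program might flag an impossible event history while the respondent is still available to correct it.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Spell:
    """One reported event-history spell, e.g., a job or a program spell."""
    label: str
    start: date
    end: Optional[date] = None  # None means the spell is still ongoing


def check_spell(spell: Spell, reference_start: date,
                reference_end: date) -> List[str]:
    """Return a list of problems found with a reported spell.

    An empty list means the spell passes these checks.
    """
    problems = []
    if spell.end is not None and spell.end < spell.start:
        problems.append("end date precedes start date")
    if spell.start > reference_end:
        problems.append("spell starts after the reference period")
    if spell.end is not None and spell.end < reference_start:
        problems.append("spell ends before the reference period")
    return problems


# A 4-month reference period (dates are illustrative).
ref_start, ref_end = date(1992, 1, 1), date(1992, 4, 30)

ok_job = Spell("job 1", start=date(1991, 11, 4), end=date(1992, 2, 14))
bad_job = Spell("job 2", start=date(1992, 3, 2), end=date(1992, 2, 1))

ok_problems = check_spell(ok_job, ref_start, ref_end)
bad_problems = check_spell(bad_job, ref_start, ref_end)
print(ok_problems)
print(bad_problems)
```

In a paper instrument, an inconsistency like the second spell would surface only during later editing, if at all; in a computer-assisted interview the interviewer could be prompted to resolve it on the spot.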
Obtaining the full power of CAPI to improve data quality at reduced
cost and time requires that the entire process of data editing and cleaning be
redesigned, taking into account the ability to perform real-time checks with
CAPI. The Census Bureau will need to review and restructure its edit
specifications for SIPP, deciding which potential inconsistencies it wants to
resolve during interviews; which inconsistencies it wants to eliminate by
structuring the questions and allowable responses so that inconsistent re-
plies are not logically possible; and which inconsistencies it will not at-
tempt to resolve, prevent, or eliminate.
9The availability of such a file would permit Census Bureau analysts to have an early look
at the raw data and assess data quality in terms of item nonresponse rates, extreme values, and
the like.
· BLAISE, which is the system developed by the Netherlands Central
Bureau of Statistics;
· CASES, which is maintained at the University of California at
Berkeley;12 and
· the Ohio State CAPI system, which is maintained by the Center for
Human Resource Research at Ohio State University and used for the NLSY.
We urge the Census Bureau to give high priority to investigating existing
outside CAPI systems to find one that meets the needs of SIPP more
effectively than QUISC.
Recommendation 5-1: We strongly support the Census Bureau's
goal to convert SIPP to computer-assisted personal interview-
ing (CAPI). Since the Bureau's current CAPI software system
(QUISC) does not appear to meet the data collection require-
ments for SIPP, the Census Bureau should give high priority to
investigating other available CAPI systems and determine the
most appropriate system for SIPP.
DATABASE MANAGEMENT
The Census Bureau is currently in the process of updating its computing
equipment, including replacing UNISYS mainframe, batch-oriented processors
with networked VAX computers that facilitate interactive processing
and the use of modern database management technology. The new equipment
will assist data processing operations throughout the Bureau. The
SIPP staff plan to take advantage of the shift to the VAX network by
converting their data files and processing to powerful database management
system (DBMS) software that is commercially available, such as Oracle or
Relational Data Base software (Bureau of the Census, n.d.). Another
candidate is Scientific Information Retrieval (SIR) software, which is used for
the NLSY. These commercial systems offer various capabilities and features
of the relational database model, which was originally developed as a
logically rigorous and complete statement of database structure and
manipulation (see Codd, 1985; Date, 1987). Other kinds of database management
systems embody network or hierarchical database models.13 (For
12We note that the Census Bureau's QUISC system evolved, like CASES, from a system
that was originally developed at Berkeley. We understand that CASES, perhaps augmented
with some features from QUISC, has recently become the Census Bureau's leading candidate
for future CATI/CAPI development for SIPP and other surveys.
13The term "relational," which distinguishes the relational model from traditional network
or hierarchical database models, refers primarily to the organizational structure of the data. A
relational database creates a series of rectangular tables or "flat" files, each of which is "nor-
malized," according to the relational model in order to contain information in a very simple
further discussion and assessment of database management systems, see
Gray, 1984; Silberschatz, Stonebraker, and Ullman, 1990.)
Database management systems offer important capabilities that can fa-
cilitate processing and analysis of SIPP's complex data sets that embrace
multiple panels, waves, households, families, people, and sources of in-
come. They permit large databases to be accessed in an interactive mode by
multiple users, which can support editing and imputation procedures that
use information from other waves of data and make it possible for analysts
to readily review problem cases as needed. Database management systems
also provide interfaces to statistical packages that are widely used for analysis
and estimation. In addition, DBMS technology, especially RDBMS soft-
ware, facilitates the integration of data and documentation.
Relational database management systems offer other features that are
likely to be especially helpful for a survey like SIPP. They have query
languages for obtaining information from the database, using logical opera-
tions, which can be of direct utility for editing complex data. The powerful
structured query language (SQL) has recently been adopted as an industry
standard that will be supported to some degree by all RDBMS vendors.14
RDBMS technology also embodies consistency features that greatly reduce
the opportunity for errors to occur in data processing.15 Finally, an impor-
tant feature of RDBMS systems is that they provide flexibility in handling
changes to a questionnaire without disrupting the entire database structure.
In particular, RDBMS technology offers dynamic independence, that is, the
ability to add new data to the system without restructuring the existing data,
provided that the initial database design anticipates this need.
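
As a minimal sketch of the query-based editing and structural flexibility described above, the following Python example uses the standard sqlite3 module as a stand-in for an RDBMS. The table names, column names, and the edit rule (earnings reported for a person with no job record in the same wave) are hypothetical illustrations, not SIPP specifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical normalized tables: one row per person-wave, one row per job.
cur.execute("CREATE TABLE person (person_id INTEGER, wave INTEGER, "
            "earnings REAL, PRIMARY KEY (person_id, wave))")
cur.execute("CREATE TABLE job (person_id INTEGER, wave INTEGER, "
            "job_no INTEGER, hours REAL)")

cur.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (2, 1, 800.0), (3, 1, 0.0)])
cur.executemany("INSERT INTO job VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 40.0)])

# Edit query: people who report earnings but have no job record in that wave.
flagged = cur.execute("""
    SELECT p.person_id, p.wave
    FROM person AS p
    LEFT JOIN job AS j
      ON j.person_id = p.person_id AND j.wave = p.wave
    WHERE p.earnings > 0 AND j.person_id IS NULL
    ORDER BY p.person_id
""").fetchall()
print(flagged)

# A new questionnaire topic is accommodated by adding a table;
# the existing tables and the edit query above are left untouched.
cur.execute("CREATE TABLE child_care (person_id INTEGER, wave INTEGER, "
            "weekly_cost REAL)")
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The flagged list is the set of cases an analyst would review; whether such a case is an error or a legitimate response is a judgment the query itself does not make.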
structure (e.g., in the SIPP context, there might be separate files for people, families, jobs, and
income types). Relationships between entities (e.g., people having jobs) are also represented
in these tables, as is the internal documentation of the database (the set of tables) itself. This
simple but powerful structure is key to many of the advantages of relational database manage-
ment technology, including its query and processing capabilities. However, for performance
and other practical reasons, no current relational database management system (RDBMS) soft-
ware conforms completely to the relational model in all of its features. Nevertheless, the term
RDBMS is used for a DBMS that attempts to implement most of the key relational features.
14Query languages operate in a different manner from scientific programming languages
and statistical packages. Analysts would not want to use query languages in place of statistical
packages for estimation purposes; however, interfaces can be designed to exploit the power of
the RDBMS for efficient data retrieval together with the computational capability of a statisti-
cal package like SAS or SPSS. An example of a linkage between a statistical package and a
DBMS is the PROC SQL module of SAS.
15The relational database model specifies structural integrity constraints that enforce struc-
tural consistency on the data. In addition, the logical rules that govern data entry can draw on
any part of the existing data to enforce consistency in data values; consistency may be applied
to individuals, households, or other entities and combinations of entities. A properly designed
RDBMS will ensure that adjustments to the database do not leave garbage in the system.
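
The two kinds of consistency described in footnote 15, structural constraints and rules on data values, can be sketched with Python's standard sqlite3 module standing in for an RDBMS. The tables, columns, and the rule that weekly hours must lie between 0 and 168 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE job (
        job_id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        weekly_hours REAL CHECK (weekly_hours BETWEEN 0 AND 168)
    )
""")
conn.execute("INSERT INTO person VALUES (1)")
conn.execute("INSERT INTO job VALUES (1, 1, 40)")

rejected = []
bad_rows = [
    (2, 99, 35),   # structural error: no person 99 exists
    (3, 1, 400),   # value error: more hours than a week contains
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO job VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])  # the database refuses the inconsistent row

job_count = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
print(rejected, job_count)
```

Because the rules live in the database definition rather than in each processing program, every program that writes to the tables is held to the same checks.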
We strongly support the development of an improved database manage-
ment system for SIPP that integrates the documentation with the data and
facilitates timely, accurate, and flexible data processing (we indicate below
some specific functions that a DBMS needs to provide for SIPP). To achieve
this goal will require adequate disk space and processing resources. We
urge the Census Bureau to allocate sufficient disk space and processing
resources to SIPP so that the data processing and analysis staffs can store
and access SIPP data on-line in a DBMS, using magnetic tapes only for
backup and other special purposes.
Needed Database Management Capabilities for SIPP
There are several capabilities that we believe it is important for an im-
proved database management system to provide for SIPP. First, the system
should be designed with sufficient flexibility so that changes in the SIPP
questionnaire that are expected to improve data quality or relevance can be
accommodated. It should not be the case in the future, as has happened in
the past, that difficulties in processing lead the Census Bureau to "freeze"
the interview content for some time. As we noted above, the relational
database model has features that make it possible to change a questionnaire
without having to redesign the rest of the database structure.
Second, the system should facilitate the ability to supply fully edited
data from the previous interview in sufficient time to use in the next inter-
view with the respondent. There needs to be careful coordination of this
feedback capability, which is critical for achieving data quality improve-
ments in SIPP at the source, with the design and operation of the CAPI
system (see below).
Third, the system should have the capability to supply values for miss-
ing data in a timely manner. DBMS software generally offers advantages in
this regard. Because the technology permits large data sets to be on-line,
the use of a DBMS should enable the Census Bureau to move away from
treating each SIPP wave as a separate cross-section for imputation pur-
poses. Instead, it should be possible for the Bureau to develop more exten-
sive yet timely longitudinal imputations by using data from surrounding
waves, a goal that the SIPP data processing staff have indicated is a high
priority.17
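
The difference between treating each wave as a separate cross-section and using surrounding waves can be sketched in a few lines of Python. The data, the function names, and the use of a donor mean in place of a hot-deck draw are simplifying assumptions made for illustration; this is not the Census Bureau's imputation procedure.

```python
from statistics import mean

# Hypothetical monthly income by wave for three people; None marks a missing report.
income = {
    "A": [1000.0, None, 1200.0],
    "B": [500.0, 520.0, None],
    "C": [900.0, 950.0, 1000.0],
}

def cross_sectional(data, person, wave):
    """Impute from other people's reports in the same wave (mean of donors)."""
    donors = [v[wave] for p, v in data.items()
              if p != person and v[wave] is not None]
    return mean(donors)

def longitudinal(data, person, wave):
    """Impute from the same person's nearest reported waves."""
    series = data[person]
    before = [v for v in series[:wave] if v is not None]
    after = [v for v in series[wave + 1:] if v is not None]
    if before and after:
        return (before[-1] + after[0]) / 2   # average of the surrounding waves
    if before:
        return before[-1]                    # carry the last report forward
    if after:
        return after[0]                      # carry the next report backward
    return cross_sectional(data, person, wave)  # no own data: fall back

cs_a = cross_sectional(income, "A", 1)   # uses B and C in the same wave
lg_a = longitudinal(income, "A", 1)      # uses A's own waves on either side
lg_b = longitudinal(income, "B", 2)      # only earlier waves are available
print(cs_a, lg_a, lg_b)
```

The longitudinal version needs every wave for a person to be available at once, which is the practical reason on-line storage in a DBMS matters for this kind of imputation.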
16Presumably, most of the editing will be performed within the CAPI system, but some
additional editing may be required within the DBMS. Careful coordination of the CAPI and
database management systems is also needed to achieve flexibility with regard to the question-
naire content.
17The ability to handle large data sets on-line should also make it possible to readily
produce multiple-wave analysis files with appropriate weights, as well as to integrate the
processing of waves from separate panels that represent the same time period of data collec
Fourth, the database management system should support the ready modi-
fication of imputation procedures when required. The current cross-sec-
tional imputation system for SIPP is very inflexible and is known to be less
than optimal in some respects (e.g., in the imputation of income and asset
values for program recipients; see Chapter 3). To the extent that some
imputations must continue to be made on a cross-sectional rather than lon-
gitudinal basis, the database management system should provide a capabil-
ity to implement modifications to the imputation scheme and evaluate their
effects on data quality in a timely manner. Longitudinal imputations will
likely be of better quality because information from other waves is used for
the actual respondent instead of information from the same wave for an-
other (albeit similar) person. However, longitudinal imputations involve
complex logic, and the automated imputation schemes that are implemented
at the beginning of a panel will not likely be optimal for the full panel.
Hence, there will be a continuing need for a ready capability to modify the
longitudinal imputation procedures as knowledge is gained of their implica-
tions for data quality.
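
One way to gauge how much an estimate depends on the imputation step is multiple imputation (see Little and Rubin, 1987). The following is a minimal sketch only: simple random donor draws stand in for a real imputation model, the data are invented, and the number of imputations is arbitrary. It shows the between-imputation spread of an estimate and omits the within-imputation variance that a full analysis would also combine.

```python
import random
from statistics import mean, variance

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical reported values; None marks item nonresponse.
reported = [10.0, 12.0, None, 9.0, None, 11.0, 13.0]
donors = [v for v in reported if v is not None]

M = 5  # number of imputed data sets (an arbitrary choice for illustration)
estimates = []
for _ in range(M):
    # Each pass fills every missing value with a randomly drawn donor value.
    completed = [v if v is not None else rng.choice(donors) for v in reported]
    estimates.append(mean(completed))

combined = mean(estimates)      # point estimate averaged over the imputations
between = variance(estimates)   # spread attributable to the imputation step
print(combined, between)
```

A large between-imputation spread relative to the estimate signals that conclusions are sensitive to how the missing values were filled in.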
In this regard, we are concerned about the SIPP staff's plans to use a
DBMS and at the same time retain their FORTRAN-based editing and im-
putation programs. It seems unwise to have a hybrid system that does not
make full use of the capabilities of the chosen database management sys-
tem. The Census Bureau argues that it does not want to become overly
dependent on commercial software vendors, but adopting a particular com-
mercial system is no longer necessarily a major risk. Most vendors of
DBMSs are committed to operating on different computers under various
operating systems (technically, they offer "platform independence"). Fur-
thermore, in the case of relational systems, databases can readily be moved
from one RDBMS to another if a change in the RDBMS becomes necessary.
Fifth, the database management system developed for SIPP should pro-
vide a ready interface to such statistical packages as SAS and SPSS. Such
interfaces will facilitate internal analysis of the data by Census Bureau staff
both for evaluation purposes (e.g., analyzing the effect of imputations on
tion: for example, wave 7 of the 1991 panel and wave 4 of the 1992 panel will be fielded at
the same time. However, we note that there are some capabilities of a DBMS that would be
desirable for survey processing but are not yet commercially available. For example, current
systems do not support economical ways of dealing with "versions" of data that will arise as
information for each SIPP panel is captured in successive interviews and longitudinal weights
and imputations are altered to make use of accumulating information.
18In addition, it could be useful for the database management system to provide a capability
for multiple imputation, in which a range of imputed responses is generated for each missing
value in order to permit users to assess the variability in an estimate that can be attributed to
the imputation process (see Little and Rubin, 1987).
the quality of estimates from the data) and for substantive studies on in-
come, program participation, and related topics.
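
The division of labor described in footnote 14, with the DBMS handling retrieval and a statistical tool handling computation, might look like the following sketch. Python's sqlite3 and statistics modules stand in for an RDBMS and a statistical package; the table, the imputed flag column, and the amounts are hypothetical.

```python
import sqlite3
from statistics import mean, median

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE income (person_id INTEGER, wave INTEGER, "
             "amount REAL, imputed INTEGER)")
conn.executemany("INSERT INTO income VALUES (?, ?, ?, ?)", [
    (1, 1, 1000.0, 0), (2, 1, 1400.0, 0), (3, 1, 1500.0, 1),
    (1, 2, 1100.0, 0), (2, 2, 1500.0, 1), (3, 2, 1300.0, 0),
])

def amounts(wave, include_imputed):
    """Retrieve one wave with SQL; return a plain list for analysis."""
    sql = "SELECT amount FROM income WHERE wave = ?"
    if not include_imputed:
        sql += " AND imputed = 0"
    sql += " ORDER BY person_id"
    return [row[0] for row in conn.execute(sql, (wave,))]

all_wave1 = amounts(1, include_imputed=True)
reported_wave1 = amounts(1, include_imputed=False)

# Comparing the two estimates shows the effect of imputation on wave 1.
mean_all = mean(all_wave1)
mean_reported = mean(reported_wave1)
median_all = median(all_wave1)
print(mean_all, mean_reported, median_all)
```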
Finally, the database management system that is used to construct the
SIPP database should also support construction of a complete corresponding
database of the documentation. At present, there is no documentation data-
base for SIPP that can be related to the data, which contributes to problems
in releasing fully documented analysis files on a timely basis and hinders
users in obtaining a complete understanding of the file structures and data
content. This lack also substantially reduces the ability to institute more
modern methods of releasing data, for example, supplying data on compact
disks with extraction software or providing a facility to create extracts over
such communications networks as the Internet. For most cost-effective use,
these access methods require integration of the documentation, ideally in-
cluding frequency counts for each variable, with the data.19 We cannot
overstress the importance of seeking a database management system with
comprehensive documentation capabilities and then using those capabilities
to the fullest in preparing data files from SIPP.
As noted above, DBMS technology, especially RDBMS software, often
facilitates integration of data and documentation. An RDBMS can be implemented
to maintain a vocabulary of names for each measurement, each transforma-
tion or other processing procedure, and each relationship encompassed in
the database. The RDBMS will ensure that variable names are uniquely and
permanently assigned, no matter how many users are making independent
uses of the data. This capability should make it possible to track processing
activities and changes and to generate updated descriptions of the database
as each SIPP panel proceeds and as new panels are created. This tracking
information can be used to produce documentation for all data processing
steps, such as weighting, imputation, construction of recoded variables, and
development of analysis files. The result should be greater productivity in
processing multiple panels and public-use files and greater clarity and com-
pleteness in the documentation of all data processing steps.
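
A documentation database of the kind described above can be sketched with Python's standard sqlite3 module. The table layouts, the variable name, and the logged action are hypothetical; the point is only that unique names, a processing history, and frequency counts can all be held in and generated from the same system as the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical documentation tables kept alongside the data.
conn.execute("CREATE TABLE variable_doc (name TEXT PRIMARY KEY, label TEXT)")
conn.execute("CREATE TABLE processing_log (step_no INTEGER PRIMARY KEY, "
             "variable TEXT REFERENCES variable_doc(name), action TEXT)")
conn.execute("CREATE TABLE response (person_id INTEGER, tenure TEXT)")

conn.execute("INSERT INTO variable_doc VALUES ('tenure', 'Housing tenure')")
conn.executemany("INSERT INTO response VALUES (?, ?)",
                 [(1, "own"), (2, "rent"), (3, "rent"), (4, None)])

# A second attempt to register the same variable name is refused.
try:
    conn.execute("INSERT INTO variable_doc VALUES ('tenure', 'Duplicate')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# Record an imputation step, then regenerate documentation from the tables.
conn.execute("UPDATE response SET tenure = 'rent' WHERE tenure IS NULL")
conn.execute("INSERT INTO processing_log (variable, action) "
             "VALUES ('tenure', 'imputed missing values')")

frequencies = conn.execute(
    "SELECT tenure, COUNT(*) FROM response GROUP BY tenure ORDER BY tenure"
).fetchall()
history = conn.execute(
    "SELECT d.label, l.action FROM processing_log AS l "
    "JOIN variable_doc AS d ON d.name = l.variable ORDER BY l.step_no"
).fetchall()
print(duplicate_allowed, frequencies, history)
```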
Integration of CAPI and Database Management
We have noted the importance of integrating the CAPI and database man-
agement systems for SIPP to facilitate smooth, timely data processing and
to minimize errors. The CAPI system will likely perform all data entry
functions; however, if any paper forms remain, then the database manage-
ment system should be used to enter data from them.
19See Chapter 6 for a review of the current computer data products and documentation for
SIPP and a discussion of gaps and needed improvements: for example, there is currently no
documentation at all of the data editing and imputation procedures.
Ideally, the CAPI system chosen for SIPP will generate the following
inputs to the database management system:
· a data dictionary that defines all questionnaire items;
· the logical rules that clearly determine the universe for each ques-
tion;
· the logical rules applied at the time of an interview to enforce con-
sistencies;
· the list of responses partitioned into sets for each separate universe
defined by the rules of the interview, for example (in the context of SIPP),
one set for the address, one set for each individual, one set for each job, and
one set for each spell of property or program income receipt, all recorded
on the basis of the relevant accounting or reference period;
· the list of exceptions, comments, and annotations to each question;
and
· the set of information about the environment of the interview: time,
place, mode, interviewer, duration, etc.
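
The first three inputs in the list above (a data dictionary, universe rules, and consistency rules) can be represented as data that both the CAPI system and the DBMS read. The following Python sketch is one hypothetical representation; the class names, item names, question wording, and rules are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]

@dataclass
class Item:
    """One hypothetical data dictionary entry."""
    name: str
    text: str
    universe: Callable[[Record], bool]   # who should be asked this item

@dataclass
class Check:
    """One hypothetical consistency rule applied during the interview."""
    message: str
    passes: Callable[[Record], bool]

items = [
    Item("age", "How old are you?", lambda r: True),
    Item("has_job", "Did you have a job this month?",
         lambda r: r["age"] >= 15),
    Item("hours", "How many hours per week did you usually work?",
         lambda r: r.get("has_job") is True),
]
checks = [
    Check("hours must be between 1 and 168",
          lambda r: "hours" not in r or 1 <= r["hours"] <= 168),
]

def items_in_universe(record: Record) -> List[str]:
    """Names of the items this respondent should be asked."""
    return [item.name for item in items if item.universe(record)]

def failed_checks(record: Record) -> List[str]:
    """Messages for every consistency rule the record violates."""
    return [c.message for c in checks if not c.passes(record)]

child = {"age": 10}
worker = {"age": 40, "has_job": True, "hours": 200}
print(items_in_universe(child))
print(items_in_universe(worker))
print(failed_checks(worker))
```

Keeping the rules as data rather than burying them in interview code is what allows the same definitions to drive the interview, the later edits, and the documentation.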
Whether an RDBMS or some other database management system is
used, we stress again how important it is that the DBMS for SIPP have the
capacity to internally track and maintain the information needed to docu-
ment fully the data content and the data collection and processing activities,
including imputation, weighting, construction of new variables, and refor-
matting. This information is necessary to produce fully documented inter-
nal and public-use files and to properly feed back information for use by the
CAPI system in subsequent interviews.
Recommendation 5-2: We strongly support the Census Bureau's
plans to adopt a new database management system for SIPP.
The Census Bureau should use the capabilities of a DBMS to
the fullest in seeking to make improvements in processing, ana-
lyzing, and documenting the data from SIPP. The processing
performed by the database management system should be fully
integrated with the SIPP CAPI system.
INVESTING IN THE DATA PROCESSING STAFF
The Census Bureau has a distinguished history of making seminal contribu-
tions to data collection and processing technology. However, in recent
years, the Bureau has lagged behind best practice and has lacked the hard-
ware and software with which to implement state-of-the-art methods of data
collection and processing. We are pleased that technological improvements
are under way. We urge the Census Bureau to recognize the need to devote
resources to modernization of its computing hardware and software on a
continuing basis.
We further urge the Census Bureau to recognize the need for a continu-
ing program of investment in the education and training of the data process-
ing staff. In order to make best use of new hardware and software, the staff
must be fully trained in new data processing methods. Steps must also be
taken to ensure that the data processing staff regularly visit and learn from
other organizations.
To date, the SIPP processing staff have been so focused on production
problems that they have not been able to devote time and resources to
modernization. We believe that SIPP needs the equivalent of at least one
full-time staff member devoted to systems modernization: this does not
mean one person, but some of the time of the best staff that is devoted to
continuing education, software development, networking, and systems de-
velopment. Improvements in data processing systems may later reduce staff
requirements in data processing, but investments must first be made.
The SIPP processing staff should have the resources and be encouraged
to visit other data processing facilities. They could find it useful to meet
with the staff at the National Opinion Research Center (NORC) and Ohio
State University who deal with the NLS, and the staff at the University of
Michigan who deal with the PSID, both longitudinal surveys that are simi-
lar to SIPP. The SIPP staff could also learn from the experience of the
Statistics of Income Division at the Internal Revenue Service and its transi-
tion to the Oracle RDBMS. They could also examine how statistical agen-
cies in other countries (e.g., Statistics Canada) are using database manage-
ment systems in the processing of survey data.
Just as the Census Bureau invests in the staff that work on survey
design, methodology, and evaluation (e.g., by sponsoring methodological
research conferences), it is critical that the Bureau invest in the equally
important data production people. The ability of the processing staff to
make full use of improved technology greatly depends on the support they
receive for continued development and use of their data processing skills.
Recommendation 5-3: In view of the major advances that con-
tinue to occur in computing hardware and software, the Census
Bureau should devote significant resources to continued educa-
tion and training of its data processing staff. In particular, the
SIPP processing staff should take advantage of the experience
of data processing facilities outside the Census Bureau that deal
with longitudinal surveys.
TRANSITION TO A CAPI/DATABASE
MANAGEMENT SYSTEM FOR SIPP
Moving a survey to CAPI and an improved database management system
affects nearly every step of the survey, from questionnaire design through
data dissemination. We are concerned that there may not be enough time to
fully test and work out the inevitable operational problems prior to the
scheduled implementation in February 1995 of a full-blown CAPI/database
management system for SIPP. The new technologies must not only be fully
developed and tested in their own right, but also mesh with other changes to
SIPP, such as questionnaire content and format changes. The length and
complexity of the SIPP questionnaire, which entail the need for complex
editing and data processing procedures, and the frequency of interviews will
pose substantial challenges to the smooth implementation of CAPI/database
management technology.
We believe it is critical to the future success of SIPP that all aspects of
the redesign work well from the start. It would be tragic to have a replay of
the situation that occurred at the inauguration of SIPP, when the need to
move immediately into the field precluded making needed changes to the
data processing system, with consequent adverse effects for the delivery of
data products. At present, the Census Bureau has more than 2 years before
the scheduled redesign in 1995 to develop a CAPI/database management
system for SIPP. However, this amount of time may not be sufficient to
ensure a problem-free changeover, particularly given all of the other changes
that are likely to be introduced at the same time and the fact that a final
decision has not yet been made on which CAPI or database management
system to use.
A review of the current Census Bureau schedule (Fischer, 1992) shows
the following key milestones that relate to CAPI and database management
technology:
· finalize the content of the wave 1 questionnaire by December 1992;
· develop the CAPI wave 1 questionnaire (including specifications and
programming) in January-October 1993 and have a dress rehearsal in Febru-
ary 1994;
· develop the CAPI wave 2 questionnaire in March 1993-March 1994
and have a dress rehearsal in June 1994;
· develop the wave 1 systems design in June 1992-April 1993 and the
specifications for the wave 1 processing system in January-November 1993;
and
· develop the wave 2 systems design in November 1992-October 1993
and the specifications for the wave 2 processing system in November 1993-
October 1994.
This schedule is very tight. It requires making an early decision on the
content of the questionnaire, which precludes making much use of results
from the current program of cognitive questionnaire research or from the
forward record-check studies that we strongly urge for questionnaire testing
and experimentation (see Chapter 7). The schedule also allows very little
time for testing the integration of the CAPI/database management system
for wave 2 and subsequent interviews, which is essential for such functions
as feeding back data from the previous to the current interview.
Although we agree that it is important to have the SIPP redesign imple-
mented on a timely basis, we do not believe there is any magic to the
current scheduled start-up date. If additional time for development and
operational testing would permit a smoother transition, then we believe that
time would be well spent. We suggest that the Census Bureau consider the
following schedule (see Table 5-2): field a somewhat smaller panel of, say,
15,000 households in 1995 that uses CAPI and database management tech-
nology for data collection and processing. The primary purpose of the 1995
panel would be to conduct a full-bore dress rehearsal of the new system,
identifying operational problems that could be corrected in time for imple-
mentation of all aspects of the redesign with the 1996 panel. (The redesign
of the sample per se could be introduced in 1995 as scheduled. Indeed, it
would be advantageous to do so, as then all panels that are based on the
new technology would have the same sample design.)

TABLE 5-2 Suggested Schedule for Implementing the SIPP Redesign and
Use of CAPI/DBMS Technology, Including a Large Dress Rehearsal Panel
in 1995

Panel    Collection Method    Original Sample Size (households)
1992     Paper and pencil     20,000
1993     Paper and pencil     20,000
1994     Paper and pencil     20,000
1995     CAPI/DBMS            15,000
1996     CAPI/DBMS            26,700
1998     CAPI/DBMS            26,700

[Table body, which lists the wave numbers fielded for each panel in calendar
years 1994 through 1999, is not recoverable from the source text.]
In 2000, the 1998 panel continues and the 2000 panel begins.
NOTE: ( ) indicates optional.

In the best outcome, the 1995 panel would provide a smooth transition
for 1996 and also provide high-quality data for users; at worst, the 1995
panel would fulfill the former very important function. At a minimum, the
1995 panel would include three interviews; if it is going well, it could be
continued for another year or two and used for additional testing that could
feed into the next new panel, which would start up in 1998.20 (Continuing
the 1995 panel, as shown in Table 5-2, has the advantage of avoiding big
changes in the total interviewing workload.)
This schedule has several benefits, primarily that it should greatly re-
duce the risks from unanticipated operational problems with the CAPI/data-
base management system. Also, it should permit greater flexibility over the
next couple of years for the Census Bureau to experiment with question-
naire content and format. Even with an additional year, we are not sanguine
that a CAPI system could be developed to handle the type of free-form
questionnaire that is being evaluated in the cognitive research program (see
Chapter 7). However, we do believe that it should be possible to improve
the current structured questionnaire by making use of the results of that
research. Yet another advantage of this schedule is that a new panel will
start up at the time of the year 2000 decennial census, thereby providing a
better opportunity to compare the census and SIPP than if SIPP panels were
introduced in odd years.
There remains the question of what to do with the 1993 and 1994
panels, which will have begun with paper-and-pencil methods, namely, whether
to switch them to the CAPI/database management system when the 1995
test panel is introduced or run the paper-and-pencil and CAPI systems side
by side. As we noted above, the latter strategy means increased costs.
However, a sudden switch could pose other kinds of problems: for ex-
ample, the regional offices would have to cope with an abrupt reduction in
workload. Also, the data processing staff would have to make special ef-
forts to quickly move the data from the most recent waves of existing
panels so that the next wave could be CAPI and also prepare CAPI versions
20To the extent possible, the regular SIPP data products should be provided to users from
the 1995 panel, although, if problems arise, it may be necessary to release data products as
"research" files or reports to be used with caution. (The Census Bureau has issued "research"
products in the past, when it had reason to doubt their quality or viewed them as preliminary.)
of the SIPP questionnaires for existing as well as new panels. These efforts
would greatly increase the burden on the data processing staff.
In our view, the problem of the 1993 and 1994 panels is another argu-
ment for running the 1995 panel as a dress rehearsal for the CAPI/database
management system with a somewhat smaller sample size that should free
up resources to handle the new as well as the old systems. Under this plan,
it seems most feasible to continue the use of paper-and-pencil methods for
the 1993 panel, which will have only two remaining waves in 1995. We
also suggest that the Census Bureau consider truncating the 1994 panel at
six instead of eight waves (see Table 5-2). If this step is taken, it would
seem most feasible to continue the 1994 panel under the old system: the
1994 panel would have only three remaining waves in 1995 and would end
before the start of the 1996 panel. Beginning in 1996, the Census Bureau
would have only the CAPI/database management system to run.21
Recommendation 5-4: The Census Bureau should make every
effort to ensure smooth implementation of CAPI and an im-
proved database management system for SIPP under the new
design of 4-year panels introduced every 2 years. One option
that the Census Bureau should consider is to field a large-scale
dress rehearsal panel in 1995 as a way to work out any opera-
tional problems. Under this scheme, full implementation of the
SIPP survey redesign would occur in 1996.
21Under the suggested schedule, it will be important to carefully consider the order of
topical modules and determine which ones (e.g., program eligibility) are essential to include if
the 1994 panel is truncated.