Alternative Sampling Frames For Personnel Surveys
Users of data on scientists and engineers commonly complain that it is not possible to identify subgroups in this population in sufficient detail, that there is not enough information about the working environment, career paths, and other characteristics of scientists and engineers, and that trends over time, such as changes in the numbers of science and engineering immigrants, are not well monitored. Statistical problems may, in part, explain insufficient data in these areas. SRS should investigate whether alternative sampling frames for its personnel surveys would make it possible to improve these data in the future.
Scientists and engineers are what are known in the survey methods field as a "rare population," which makes them difficult to study in a cost-effective manner. At first glance, this classification appears to be hyperbole. For 1993, the NSF SESTAT system estimated that there were 12 million college graduates under age 76 with either employment or training (or both) in a science and engineering field, which is not a small group in total numbers. The Current Population Survey (CPS), which is the largest continuing U.S. household survey, has about 48,000 households and 126,000 people in the sample each month, of which an estimated 6,000 people would be scientists and engineers. This sample size is adequate for reliable estimates of the total population of scientists and engineers. (That is, the sample size would be adequate if the CPS permitted identifying people trained but not currently working as scientists and engineers in addition to those employed in science and engineering.)1
However, users of science and engineering personnel data are almost never interested in the total: their interests center on particular groups (e.g., women and minorities in specific fields) and comparing across groups. For a group amounting to 50,000 people, which is not atypical for specific fields, the sample size in one month of the CPS is only about 25 cases, which is not adequate for analysis purposes. Also, the total population of working scientists and engineers, while not that small numerically, is a small percentage of the total household population (5% in 1993).
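The figure of about 25 cases follows from simple proportional scaling. The sketch below reproduces that arithmetic from the numbers quoted in the text; the function name is hypothetical, and the calculation assumes that a subgroup appears in the CPS sample in proportion to its share of the science and engineering population.

```python
def expected_sample_cases(group_population, total_population, total_sample_cases):
    """Expected sample cases for a subgroup, assuming proportional representation."""
    return total_sample_cases * group_population / total_population


# Figures quoted in the text (1993 SESTAT estimate, one month of the CPS).
SE_POPULATION = 12_000_000  # scientists and engineers in the population
CPS_SE_CASES = 6_000        # estimated scientists and engineers in one CPS month
CPS_PEOPLE = 126_000        # all people in one CPS month

# A specific field with 50,000 people, as in the example above.
subgroup_cases = expected_sample_cases(50_000, SE_POPULATION, CPS_SE_CASES)

# Share of the CPS sample made up of scientists and engineers, in percent.
se_share_percent = round(100 * CPS_SE_CASES / CPS_PEOPLE, 1)

print(subgroup_cases)    # 25.0
print(se_share_percent)  # 4.8
```

The second figure, just under 5 percent, is the share of the household sample that the text rounds to 5%.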
To obtain adequate sample size for a rare population in a cost-effective manner, it is necessary to find some way to screen the general population. Simply expanding the sample size of a general household survey (e.g., the CPS) is prohibitively expensive because the "yield" (the number of sample cases of interest) is so low—about 5 cases for each 100 people added to the sample in this instance.
The NSF Approach
Historically, NSF has used the decennial census long-form sample as a screening device for obtaining adequate samples of scientists and engineers. NSF sponsored surveys drawn from census respondents following the 1960, 1970, 1980, and 1990 censuses. Individuals identified as scientists and engineers in the 1972, 1982, and 1993 postcensal surveys were resurveyed over the decade at 2-year intervals. Also, surveys were conducted regularly of new entrants to the field, specifically, new bachelor's and master's degree recipients in science and engineering disciplines identified by institutions of higher education. In addition, NSF funded continuing surveys of Ph.D.-level scientists and engineers. This approach has several problems, however, of which two are of particular significance.
(1) First, the census long-form questionnaire is not a particularly efficient screener for scientists and engineers. It is possible to use the data on current occupation to oversample people who report they worked as a scientist or engineer and to use the data on level of education attained to limit the sample to college graduates. However, the census does not ask about degree field, and NSF has never been able to get such a question included in the census. Hence, it is not possible with the census data to oversample people with science and engineering degrees. Given user interest not only in people who work as scientists and engineers, but also in those who are trained in science and engineering fields, the deficiencies of the census questionnaire present significant challenges for a cost-effective sample design.
One approach is to draw a large sample not only of people reporting science and engineering occupations, but also of other college graduates. This approach quickly becomes expensive because the stratum of other college graduates includes many cases who are not of interest. Another approach is to sample other college graduates at a low rate relative to working scientists and engineers, which permits a smaller overall sample size. However, if the differences in the sampling rates are too great, there will be substantial increases in the standard errors of estimates that are based, as will usually be the case, on both strata (working scientists and engineers and other college graduates). Differences in sampling rates of 100 to 1 were used in the NSF 1982 postcensal survey, and the results were disastrous for the quality of the estimates.
The CNSTAT Panel to Study the NSF Scientific and Technical Personnel Data System recommended that the differences in sampling rates be no more than 4 to 1; as a compromise, the design of the 1993 postcensal survey (the National Survey of College Graduates) has sampling rates that vary by no more than 8 to 1. To reduce the differences between sampling rates while maintaining the precision of estimates of scientists and engineers, the overall size of the 1993 NSCG was increased to 215,000 initial sample cases from 138,000 cases in the 1982 postcensal survey. Because of its large sample size, the postcensal survey has to be conducted by mail to be affordable, which, in turn, limits the kind and amount of detail that can be ascertained.
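The effect of widely differing sampling rates on precision can be illustrated with Kish's approximation for the design effect due to unequal weighting, n * sum(w^2) / (sum(w))^2, where each case's weight w is the inverse of its sampling rate. This is a common rule of thumb, not necessarily the calculation used by the panel or by NSF. In the sketch below, the split of 900 cases from the high-rate stratum and 100 cases from the low-rate stratum is a hypothetical allocation chosen only for illustration, and the function name is likewise an assumption.

```python
def kish_design_effect(sample_sizes, weights):
    """Kish's approximate design effect from unequal weighting.

    sample_sizes[i] is the number of sample cases in stratum i and
    weights[i] is the weight (inverse sampling rate) of each of those cases.
    """
    n = sum(sample_sizes)
    sum_w = sum(m * w for m, w in zip(sample_sizes, weights))
    sum_w2 = sum(m * w * w for m, w in zip(sample_sizes, weights))
    return n * sum_w2 / sum_w ** 2


# Hypothetical allocation: 900 cases sampled at the high rate (weight 1)
# and 100 cases sampled at the low rate (weight equal to the rate ratio).
HYPOTHETICAL_ALLOCATION = [900, 100]

design_effects = {
    ratio: kish_design_effect(HYPOTHETICAL_ALLOCATION, [1, ratio])
    for ratio in (1, 4, 8, 100)
}

for ratio, deff in design_effects.items():
    print(f"sampling rates {ratio} to 1: design effect {deff:.2f}")
```

Under this hypothetical allocation the design effect is 1 when the rates are equal and grows as the ratio widens, so the same number of interviews yields progressively less precision for estimates that combine both strata. Actual values depend on the real allocation and on how the survey variable relates to the strata.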
(2) Second, the approach of using the decennial census as a screener to identify the stock of scientists and engineers to follow up over the decade together with new graduates in science and engineering fields means that the NSF data system cannot readily identify some population groups of interest. There is no cost-effective way to add these groups to the sample.
One such group is people who, during the decade, move into science and engineering jobs from non-science and engineering backgrounds. The computer science field is one example in which a significant number of people who work in that field were not trained in a science or engineering discipline. This kind of movement across the boundaries between science and engineering and other disciplines cannot be identified in the NSF system until the next postcensal survey.
Another population group of interest that the NSF approach misses is scientists and engineers who enter the United States during the decade who were not trained in the United States. While there are other sources of data about immigrants, they are problematic in both level of detail and quality. Again, new immigrants cannot be captured in the NSF data system until the next postcensal survey.
A Better Approach for the Future?
Two new surveys offer the possibility that NSF could develop a cost-effective science and engineering personnel data system that responds more fully to user needs. One survey, which began in 1994, is the National Immunization Survey (NIS), sponsored by the Centers for Disease Control in the Department of Health and Human Services. The second survey, which is scheduled to become fully operational in 2003, is the American Community Survey (ACS) that is being developed by the Census Bureau.
The NIS is a random digit dialing (RDD) survey in which about 3 million telephone numbers per year are called to identify families with young children, who are asked to respond to questions about immunization. It is possible that NSF could pay to have questions added to the NIS screener to identify people working or trained as scientists and engineers, who would be asked to respond to an NSF-designed questionnaire. RDD surveys miss that portion of the population without telephones, but this lack should not be a problem for the science and engineering population. To minimize respondent burden for families with both a scientist or engineer and young children, the sample design could eliminate overlap by assigning a portion of such families the immunization questionnaire and a portion the NSF questionnaire.
To maintain the desired sample sizes for both NSF and CDC, NSF may need to provide funding not only for added questions, but also for added screening interviews. Even so, the costs of piggybacking on the NIS would likely compare favorably with the costs of the current system. Moreover, such an approach should afford the opportunity to include such population groups as immigrants and, perhaps, to have a more richly detailed questionnaire than in the current system.
The ACS is designed as a mailout-mailback survey, with telephone and personal followup, in which 250,000 households are to be surveyed each month, for a total of 3 million households per year (about 7–8 million people), with no overlap in the samples across months. The survey is currently being tested in pilot sites and is planned to become fully operational beginning in 2003. If implementation of the ACS proceeds as planned, it will likely replace the census long-form questionnaire in the year 2010. The ACS questionnaire will include the basic long-form data, including educational level and occupation. With NSF support, it could be possible to add a question on degree field, which would permit using the responses to identify both working and trained scientists and engineers for a follow-up mail survey. If respondents provide telephone numbers, it could be possible to conduct the NSF follow-up survey by phone.
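The scale of the planned ACS sample can be put in rough perspective with the figures already quoted. The sketch below is an illustration only and does not reproduce Box D-1: it assumes that the 1993 share of about 5 scientists and engineers per 100 people in the household population carries over to the ACS sample, and that a subgroup of 50,000 is represented in proportion to its share of the 12 million total. All names are hypothetical.

```python
# Figures quoted in the text.
ACS_HOUSEHOLDS_PER_MONTH = 250_000
ACS_PEOPLE_PER_YEAR = (7_000_000, 8_000_000)  # "about 7-8 million people"
SE_POPULATION = 12_000_000                    # 1993 SESTAT estimate
SE_PER_100_PEOPLE = 5                         # assumed to apply to the ACS sample

households_per_year = ACS_HOUSEHOLDS_PER_MONTH * 12

# Expected scientists and engineers among ACS sample persons (low, high).
se_cases = tuple(people * SE_PER_100_PEOPLE // 100 for people in ACS_PEOPLE_PER_YEAR)

# Expected cases for a specific field of 50,000 people, before any
# subsampling or nonresponse.
subgroup_cases = tuple(round(cases * 50_000 / SE_POPULATION) for cases in se_cases)

print(households_per_year)  # 3000000
print(se_cases)             # (350000, 400000)
print(subgroup_cases)       # (1458, 1667)
```

Under these assumptions, a group that contributes about 25 cases in one month of the CPS would contribute on the order of 1,500 cases in a year of the ACS, before allowing for subsampling, nonresponse, or design features that reduce the effective sample size.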
With either the NIS or the ACS, NSF would need to decide how frequently over the course of a decade to use one or the other survey as a screener. One approach would be to begin the decade by using the NIS or ACS to obtain a large sample of scientists and engineers who would then be followed up at regular intervals. At these same intervals, the NIS or ACS could be used to identify new entrants (e.g., immigrants, new degree recipients) to add to the longitudinal sample. Alternatively, the NIS or ACS could be the source of regular cross-sectional surveys of scientists and engineers. Also, one or the other survey could be used on occasion to identify subsamples of scientists and engineers to receive questionnaires on special topics.
Box D-1 provides approximate sample sizes from the SESTAT system, the March CPS, and the ACS for various size groups of scientists and engineers to illustrate the gains from use of the ACS. (The gains would presumably be similar for the NIS.) The estimates for the ACS are derived by assuming that every case of interest receives the NSF questionnaire; in fact, it would be possible to subsample to reduce costs. The sample sizes shown for the CPS and ACS are approximate and do not take account of sample design features that could reduce the effective sample size.
No information on costs is available at present, but it would be worthwhile for NSF to investigate the costs and benefits of using the ACS or the NIS versus the current approach in the future.
Box D-1. Illustrative Approximate Sample Sizes for Estimates of Scientists and Engineers: NSF SESTAT, March CPS, ACS