Appendix
SOME METHODOLOGICAL ISSUES IN ANALYZING DATA ON IMMIGRATION

INTRODUCTION

The body of this report has been concerned largely with the process of collecting and disseminating data on immigration and the foreign-born. Analytical issues have been touched on, but no detailed examination of analytical procedures has been attempted. This appendix incorporates three papers concerning analysis, the first two by Kenneth Hill, of the panel staff, and the third by Kenneth Wachter, a member of the panel. Although the papers have benefited from the comments of a number of reviewers, they nonetheless represent the views of the authors rather than of the panel as a collective entity. They are included here because they concern issues central to the panel's charge and were prepared as a part of its overall work plan. We hope they will serve to stimulate both discussion and new research.

The first paper outlines three methodological procedures that could be applied to data that either are available or could be made available at little expense and with little change in current administrative practices. The methods outlined are aimed at measuring stocks or flows that are poorly documented by existing statistics: emigration of immigrants admitted for permanent residence (and, coincidentally, an estimate of average coverage of the Alien Address Report system), the size and growth of the population of illegally resident aliens, and net flows of U.S. citizens. These methods are intended to be illustrations of ways in which particular types of data might be used for analytical purposes and to indicate the potential analytical value of compiling or processing data that are already collected. The estimates obtained by these methods are also intended to be illustrative rather than substantive; for substantive applications, the necessary data must be available and the extensive assumptions underlying the methods must be evaluated in the light of the results obtained. The methods proposed do have some promise for producing useful new estimates, to complement rather than to replace existing ones. It is hoped that, even if these methods in the form presented do not prove viable or prove to be excessively sensitive to critical and unsupportable assumptions, their presentation will stimulate discussion and the development of new approaches to the use of available or easily generated data.
The second piece is concerned with estimating the size of the illegally resident population of the United States. Estimating the size of this population, and still more its characteristics, poses serious and special measurement problems, since the population itself is, for obvious reasons, anxious to avoid any unnecessary contact with officialdom. As a result, the methods applied, though often ingenious, also often rely on extensive assumptions that are hard to justify. Hill reviews the major empirical studies that have been made of the size of the illegal population and examines their results in the context of their methodology and assumptions. Several of the methods have been reviewed elsewhere, and little new about these methods is presented here; however, some of the methods have not been subjected to detailed examination before, and it seemed useful to cover all the major methods and their results in one place and to pull together and evaluate all the available empirical estimates of the size of the illegal population. The paper is intended as an evaluation of the various estimates and thus concentrates on the negative rather than on the positive aspects of the methodologies used. The reader should bear in mind, however, that the measurement problems involved are particularly severe and that any methods used will inevitably involve assumptions and approximations that are hard to justify. It is an area in which new approaches or the use of different data are to be welcomed, and in which a wide margin of uncertainty in the estimates derived should not be interpreted as a criticism of the methodology or of the attempt.

The third piece discusses the issues of imputation and treatment of missing data with particular reference to procedures of the Immigration and Naturalization Service (INS) and the presentation of data in the INS statistical yearbooks. Wachter argues strongly that the procedures currently used by INS should be reviewed in terms of their statistical validity and should be carefully documented in the Statistical Yearbook so that users can be aware of how the necessary imputations have been made and be alerted to how such imputations might affect the data.
Indirect Approaches to Assessing Stocks and Flows of Migrants
Kenneth Hill

INTRODUCTION

Statistics for U.S. migrant groups are of very variable quality. The best data made available by the INS cover first arrivals of permanent immigrants, or those changing status to permanent immigrant, and those naturalizing to U.S. citizenship. These data seem to be fairly reliable, in general if not with regard to all the available detail, even though they have suffered from severe processing and publication delays in the last few years. Figures on first arrivals of refugees, published very promptly by the Office of Refugee Resettlement, also seem to be reliable. Some elements of inflow are thus adequately covered by existing statistics.

Data on inflows of temporary visitors, returning citizens, and returning resident aliens are less satisfactory. Although total arrivals by air are reasonably well recorded by the INS, processing of arrival declarations for aliens has been sporadic in recent years, and permanent residents are no longer required to complete such declarations. The situation for the inflow through land border ports of entry is worse; in many cases no direct head count is made, the total flow being estimated as the product of numbers of cars and an average occupancy figure derived from semiannual surveys, which are also used to estimate citizen/noncitizen ratios (see Chapter 4 for a more complete description of these procedures). Since the gross inflow across land borders represents the great majority of total inflow, INS estimates of total inflow cannot be regarded as satisfactory and in any case exclude any inflow of undocumented aliens.

No systematic attempt is even made to record outflow; although temporary visitors are required to complete a declaration on departure, compliance is high only at airports. Departures of all passengers by air are recorded by the INS from airline reports, but coverage of charter flights appears to be incomplete (see Chapter 5).
No attempt is made to record departures of citizens or permanent residents at land border points or even to estimate the number of vehicles crossing. There is thus no basis for estimating gross outflow from the United States and no basis for monitoring changes in population stock. Until 1981, the INS attempted to monitor the stock of resident aliens through the Alien Address Reporting system; however, reporting was widely felt to be incomplete and the system was scrapped, although resident aliens are still required to register changes of address with the INS. (This requirement is seldom observed, however, and the forms are not processed.)

Information on the population stock and inflows is available from the decennial census, which collects country of birth, citizenship, period of arrival for the foreign-born, and residence one year and five years before the census. The accuracy of some of the information, which
is self-reported, is open to question, and the coverage by the census of undocumented aliens is unknown.

There are thus major deficiencies in U.S. international migration statistics, the two most important being the size and structure of the undocumented population and emigration of both U.S. citizens and noncitizens. Numerous ingenious approaches have been developed to obtain estimates of the stocks and flows involved. Siegel et al. (1980) provide a useful review of methods used to estimate illegal immigration, and methods of estimating emigration have been reviewed by Passel and Peck (1979) and Warren and Kraly (1985).

This paper describes three potentially useful new indirect approaches to the estimation of stocks and flows of U.S. migrants. Unfortunately, the approaches are based on data that are no longer collected (from the Alien Address Reporting system), on data that are collected but not processed (from records of deportable aliens located in the United States), or on data that are difficult to compile (from foreign census counts of U.S. citizens or the U.S.-born living abroad). The immediate practical applicability of the methods described is thus severely limited, but the methods are described in order to indicate some directions that analysis could take if fairly simple procedures for data collection, processing, or compilation were instituted. They are not proposed as final solutions to the measurement problems with which they are concerned. Like all indirect methods, they too involve assumptions and approximations that will affect the results. Rather, these approaches illustrate how certain types of data could be used to obtain estimates of stocks and flows of people in, into, and out of the United States. We hope this illustration of somewhat different approaches to the problem will generate further thinking and stimulate additional research in this area.
The first method uses information on deportable aliens located by duration of illegal residence and other simple characteristics to estimate the size and structure of the nonlegal population of the United States. The second method combines information from the Alien Address Reporting system with information on numbers of new immigrants and naturalizations to estimate both the coverage of the address reporting program and the emigration of resident aliens. The third method uses census data from other countries on the U.S. citizen or U.S.-born population resident in those countries to scale information from an administrative data source (Internal Revenue Service tax filer records) on the U.S. population living abroad.

THE SIZE AND STRUCTURE OF THE NONLEGAL POPULATION OF THE UNITED STATES

Numerous methods have been described for estimating the number of nonlegal residents of the United States or major components of this population (see the second paper in this appendix for a review of the more important studies). The approaches proposed here use information on locations of deportable aliens by duration of illegal residence to estimate the size and duration structure of the underlying population, first assuming the population to be demographically stable and second using duration-specific growth rates. The INS collects information on deportable aliens located on form I-213 (see Appendix A) but has not
processed the data systematically, although Davidson (1981) has described the results of processing a sample of the forms completed in calendar 1978.

The use of I-213 data to estimate either numbers or characteristics of the nonlegal population is not straightforward for a number of reasons. First, the locations occur very largely at short durations of illegal stay; in fiscal 1982, for example, 75 percent of the 963,000 locations were of aliens with a duration of illegal stay of 30 days or less, and 50 percent occurred at entry; a high proportion of these locations may be of the same person located several times in the same year. Second, the located aliens cannot be regarded as a random sample of the underlying population, since probabilities of location are likely to vary by characteristics such as sex, nationality, and occupation. Third, the quality of the data on the I-213 is widely regarded as low, and although no thorough evaluation has been made, Davidson (1981) shows that employment and residence characteristics suffer from high levels of nonresponse. These shortcomings no doubt partly explain the INS's failure to process I-213 forms on a routine basis.

Before describing and illustrating the methods in detail, it is useful to provide a general explanation of why the methods might be expected to work at all. To make any analytical use of locations, it has to be assumed that the number of locations is related to the number of deportable aliens who can be located. If the number of locations is determined by INS targets, or the INS locates as many deportable aliens as it can given existing resources and manpower, there will be no systematic relationship between locations and population at risk, and locations will provide no basis for estimating the size of the population.
No empirical basis exists for assuming a relationship between locations and population, but it does seem plausible that if the deportable alien population were doubled, the INS would locate at least some more deportable aliens without any increase in effort, although locations might increase by a factor of less than two.

Even accepting this assumption of a positive elasticity of locations to population, it might appear at first sight that a series of numbers of locations by duration could only indicate relative, not absolute, rates of location by duration. Sets of location rates that are the same in duration pattern but different in level will produce the same numbers of locations at each duration when applied to populations that share a given distribution by duration but are appropriately scaled. It is not obvious, therefore, that recorded numbers of locations can tell us anything about the size of the underlying population. However, there is a link between the two because the number of locations affects the size of the population, in much the same way as deaths affect the size and age distribution of a closed population. If locations were the only source of attrition, the parallel with deaths would be exact, and methods for estimating population size from deaths by age for a stable population (that is, a population changing at a single, constant rate at all ages and thus maintaining a constant age structure though not a constant size), such as that proposed by Preston et al. (1980), or for a general population (Preston and Coale, 1982), could be applied to locations by duration. In practice, voluntary return migration, change of status, and deaths also contribute to the attrition of the population of illegal aliens, so estimates based on locations alone will underestimate the true size of
the population unless allowance is made for other unobserved types of loss.

The first approach assumes that the illegal alien population is stable in the demographic sense of having a constant, unchanging rate of change at each duration of illegal residence. In such a stable population, the number reaching duration d in a year, N(d), can be expressed in terms of the number of entries in the year E, the stable rate of change r, and the probability of surviving from entry to d, p(d):

$$N(d) = E\,e^{-rd}\,p(d) \tag{1}$$

The average population at all durations, P, can be found by integrating equation 1:

$$P = \int_0^w N(d)\,dd = E \int_0^w e^{-rd}\,p(d)\,dd \tag{2}$$

where w is the highest duration attained. In any population, the rate of change r is equal to the entry rate, E/P, less the loss rate, L/P, where L is total losses; substituting rP + L for E in equation 2 and rearranging gives

$$\frac{P}{L} = \frac{\int_0^w e^{-rd}\,p(d)\,dd}{1 - r\int_0^w e^{-rd}\,p(d)\,dd} \tag{3}$$

Also in a stable population, survival to duration d can be expressed in terms of losses by duration, l(x), and r:

$$p(d) = \frac{\int_d^w l(x)\,e^{rx}\,dx}{\int_0^w l(x)\,e^{rx}\,dx} \tag{4}$$

If we now assume that losses from INS locations, D(d), form a constant proportion of all losses at all durations d, p(d) can be expressed in terms of D(d) and r, since the constant proportion will cancel out in equation 4.

We can now apply equations 3 and 4 to Davidson's data on locations by duration of illegal stay for 1978, assuming different growth rates, and limiting the analysis to locations at durations of one month or more. Equation 4 has been evaluated assuming that locations are distributed evenly over each duration group, applying a value of d for the midpoint of the interval, except for the open 7+ years interval, for which a value of 9.5 years was assumed. The integrals in equation 3 were then evaluated trapezoidally for each duration category. Calculations are shown in Table B-1.
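As a rough illustration of how equations 3 and 4 might be evaluated, the following Python sketch computes p(d) and P/L from locations grouped by duration. The duration boundaries and location counts are hypothetical placeholders, not the values in Table B-1, and the function names are introduced here only for illustration. The open interval is closed at an assumed upper bound of 12 years, which is consistent with the 9.5-year midpoint mentioned above but is otherwise an assumption.

```python
import math

def survival_from_losses(bounds, losses, r):
    """Equation 4 in discrete form: p(d) at each duration boundary.

    Losses in each group are placed at the group midpoint and weighted
    by exp(r * midpoint); p(d) is the weighted share of losses that
    occur at or beyond duration d.
    """
    mids = [(lo + hi) / 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    weighted = [n * math.exp(r * m) for n, m in zip(losses, mids)]
    total = sum(weighted)
    p, remaining = [], total
    for w in weighted:
        p.append(remaining / total)
        remaining -= w
    p.append(0.0)  # no survivors beyond the last boundary
    return p

def population_to_loss_ratio(bounds, losses, r):
    """Equation 3: P/L = I / (1 - r*I), with I integrated trapezoidally."""
    p = survival_from_losses(bounds, losses, r)
    f = [math.exp(-r * d) * pd for d, pd in zip(bounds, p)]
    integral = sum(
        (hi - lo) * (f0 + f1) / 2
        for lo, hi, f0, f1 in zip(bounds[:-1], bounds[1:], f[:-1], f[1:])
    )
    return integral / (1 - r * integral)

# HYPOTHETICAL duration boundaries (years) and location counts per group;
# these are illustrative only and are not taken from Table B-1.
bounds = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 12.0]
locations = [90_000, 40_000, 35_000, 20_000, 20_000, 10_000, 15_000]

for r in (0.0, 0.05, 0.10):
    ratio = population_to_loss_ratio(bounds, locations, r)
    print(f"r = {r:.2f}: P/L = {ratio:.3f}")
```

Because only the ratio of weighted sums enters equation 4, the same p(d) results whether the inputs are total losses or locations that form a constant share of losses.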
The ratio P/L of average population to average annual losses increases from 1.655 for a zero growth rate to 1.882 for a growth rate of 5 percent and to 2.168 for a growth rate of 10 percent; an annual growth rate of 10 percent implies a population doubling time of about seven years. The estimated P/L is thus surprisingly insensitive to the assumed growth rate. The estimates of P/L do not provide a basis for estimating P directly, since we do not know the value of L, average annual losses. However, we can obtain estimates of P for a range of assumptions about the value of L/D, the ratio of total losses to location losses. Total locations at one month duration or more were 231,274 in 1978. If locations were 25 percent of total losses, the value of L would be 0.93 million, and the alien population present illegally in the United States for a month or
more would then be 1.53 million for a growth rate of zero, 1.74 million for a growth rate of 5 percent, and 2.01 million for a growth rate of 10 percent. If locations were 50 percent of total losses, each estimate would be halved.

Locations data do not suggest a rapid growth rate of the population. In 1979, 245,118 deportable aliens illegally resident for a month or more were located, so if location rates remained constant the underlying population grew at 5.8 percent annually. A growth rate around 5 percent thus seems more likely than one of 10 percent. We have little guidance for a plausible figure for L/D, though Garcia y Griego (1980:Figure 3.3), using data from the Mexican CENIET border survey on migration histories of Mexicans returned by the INS, found that about 60 percent of returns to Mexico over the period 1970-1977 resulted from INS locations, and about 40 percent were voluntary. These results suggest that an L/D ratio of 2.0, allowing for deaths and legalizations in addition to voluntary returns, is more plausible than a ratio of 4.0, at least for Mexican illegal residents. Using these assumptions, the data suggest an illegal population resident one month or more that averaged around 0.9 million in 1978.

This procedure can also provide a number of other interesting results. For a growth rate of 5 percent, the ratio P/L is estimated at 1.882; this ratio is the inverse of the loss rate, which is therefore estimated as 0.531. The entry rate, E/P, is equal to the loss rate plus the growth rate, and is therefore estimated as 0.581. If the ratio L/D is taken to be 2.0, P is equal to 0.871 million, implying a value of E of 0.506 million.
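The arithmetic behind these figures can be retraced with a short Python sketch. The location count and the P/L ratios are the values quoted in the text; the variable and function names are hypothetical, and the relation P = (P/L) x (L/D) x D is simply the chain of ratios described above.

```python
# Figures quoted in the text for 1978.
LOCATIONS = 231_274  # locations at durations of one month or more (D)
PL_RATIO = {0.00: 1.655, 0.05: 1.882, 0.10: 2.168}  # P/L by assumed growth rate

def population(pl_ratio, losses_per_location, locations=LOCATIONS):
    """P = (P/L) * (L/D) * D."""
    return pl_ratio * losses_per_location * locations

# Case 1: locations are 25 percent of all losses, so L/D = 4.
for r, pl in PL_RATIO.items():
    print(f"r = {r:.2f}, L/D = 4: P = {population(pl, 4.0) / 1e6:.2f} million")

# Case 2: L/D = 2 with a growth rate of 5 percent.
r = 0.05
pl = PL_RATIO[r]
P = population(pl, 2.0)
loss_rate = 1 / pl          # L/P, the inverse of P/L
entry_rate = loss_rate + r  # E/P = loss rate + growth rate
E = entry_rate * P          # entrants reaching one month of residence

print(f"P = {P / 1e6:.3f} million")
print(f"loss rate = {loss_rate:.3f}, entry rate = {entry_rate:.3f}")
print(f"E = {E / 1e6:.3f} million")
```

Run as written, the second case reproduces the population of about 0.871 million, the loss and entry rates of 0.531 and 0.581, and the implied number of entries E of about 0.506 million.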
This value is the number of illegals achieving a month's residence in 1978; since locations under a month in 1978 totalled 0.817 million, and the value of E is estimated at 0.506 million reaching a month without being located, total entries are estimated (assuming all losses at durations less than a month result from locations) at 1.323 million, of which the Border Patrol located 62 percent at entry or during the first month of illegal residence.

We can also use the p(d) function to calculate duration-specific annual location rates, dividing the life table losses p(d) - p(d+n) by person-years lived by the life table population, approximated by n[p(d) + p(d+n)]/2, where n is the length of the duration interval in years, and then dividing by 2.0 again to allow for the assumption that only half the losses resulted from locations. The resulting location rates ${}_n l_d$ are shown in the last column of Table B-1. One comforting feature of the rates is that those for the open interval, ${}_w l_d$, which are set at 0.200 by assigning a uniform distribution over 5 years, are more or less consistent with the rates for shorter durations. A discomforting feature is that the rates are lowest for the duration interval 1-2 years, whereas we might expect them to decline steadily with duration. A possible explanation would be that location losses represent a lower proportion of all losses at long durations than at short durations. This explanation is tested in Table B-2, in which locations numbers are inflated by variable duration-specific factors, averaging 2.0 overall, and then manipulated using a growth rate of 5 percent. Three models are presented, (a) with the location proportion of all losses rising with duration, (b) with it falling, and (c) with it starting high for duration 1-6 months, falling sharply to a minimum for duration 7-12 months, then
rising steadily as duration increases. Model (b) does indeed produce location rates that are essentially constant at durations over one year. More surprisingly, the results using these three models suggest that the procedure is not very sensitive even to substantial variations in the location to total loss ratios by duration, the estimated total population varying from 0.79 million for model (a) to 1.24 million for model (b).

The assumption of stability can be dropped if information is available on duration-specific growth rates. If duration-specific location rates were constant from year to year, population growth rates could be calculated directly from the numbers of locations in successive years, since the locations growth rates would be identical to the underlying population growth rates. Even if we wished not to assume constant rates, we could assume a constant duration pattern for the rates and an overall growth rate to which the duration-specific rates would be scaled. To apply this procedure, we need information on locations by duration for at least two consecutive years. Unfortunately, such useful data are not available, but we present the methodology required and illustrate the effects of departure from stability for two different cases.

Preston and Coale (1982) have shown that for a non-stable population,

$$N(a) = B\,\exp\!\left[-\int_0^a r(x)\,dx\right] p(a) \tag{5}$$

where N(a) is the population aged a, B the number of births, r(x) the growth rate at age x, and p(a) the probability of surviving to age a, all at some particular time t. By integration, the total population P is given by:

$$P = \int_0^w N(a)\,da = B \int_0^w \exp\!\left[-\int_0^a r(x)\,dx\right] p(a)\,da \tag{6}$$

In any population, the birth rate B/P is equal to the loss rate L/P plus the growth rate R, so equation 6 can be rewritten (replacing age by duration) as

$$\frac{P}{L} = \frac{\int_0^w \exp\!\left[-\int_0^d r(x)\,dx\right] p(d)\,dd}{1 - R\int_0^w \exp\!\left[-\int_0^d r(x)\,dx\right] p(d)\,dd} \tag{7}$$

If we can estimate p(d) and r(x), we can then use this equation to estimate P/L.
The variable growth rate version of equation 4 is

$$p(d) = \frac{\int_d^w l(y)\,\exp\!\left[\int_0^y r(x)\,dx\right] dy}{\int_0^w l(y)\,\exp\!\left[\int_0^y r(x)\,dx\right] dy} \tag{8}$$
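Equations 7 and 8 can be evaluated numerically in much the same way as their stable counterparts. The Python sketch below is one possible discretization, assuming growth rates that are constant within each duration group and losses placed at group midpoints. All input values (boundaries, counts, and the falling and rising rate patterns) are hypothetical, and the function names are introduced here for illustration only; the overall rate of 0.05 is imposed rather than derived from the duration-specific rates.

```python
import math

def cumulative_growth(bounds, rates, d):
    """Integral of r(x) from 0 to d, for rates constant within each group."""
    total = 0.0
    for lo, hi, r in zip(bounds[:-1], bounds[1:], rates):
        if d <= lo:
            break
        total += r * (min(d, hi) - lo)
    return total

def survival_variable_growth(bounds, losses, rates):
    """Equation 8 in discrete form, with losses placed at group midpoints."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    weighted = [n * math.exp(cumulative_growth(bounds, rates, m))
                for n, m in zip(losses, mids)]
    total = sum(weighted)
    p, remaining = [], total
    for w in weighted:
        p.append(remaining / total)
        remaining -= w
    p.append(0.0)  # no survivors beyond the last boundary
    return p

def population_to_loss_ratio_variable(bounds, losses, rates, overall_rate):
    """Equation 7, with the integral evaluated trapezoidally."""
    p = survival_variable_growth(bounds, losses, rates)
    f = [math.exp(-cumulative_growth(bounds, rates, d)) * pd
         for d, pd in zip(bounds, p)]
    integral = sum(
        (hi - lo) * (f0 + f1) / 2
        for lo, hi, f0, f1 in zip(bounds[:-1], bounds[1:], f[:-1], f[1:])
    )
    return integral / (1 - overall_rate * integral)

# HYPOTHETICAL inputs, for illustration only (boundaries must start at 0).
bounds = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 12.0]
losses = [90_000, 40_000, 35_000, 20_000, 20_000, 10_000, 15_000]
falling = [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02]
rising = falling[::-1]

for label, rates in (("falling", falling), ("rising", rising)):
    ratio = population_to_loss_ratio_variable(bounds, losses, rates, 0.05)
    print(f"{label} growth rates: P/L = {ratio:.3f}")
```

When every duration-specific rate equals the overall rate, this calculation reduces to the stable case of equations 3 and 4.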
Thus, given values of l(d) (or ${}_n l_d$) and r(d) (or ${}_n r_d$), we can obtain p(d), the survival function needed in equation 7. Note that the values of l(d) again do not need to be at the correct level, as long as they have the true duration pattern, since a constant level factor will cancel out from the top and bottom of equation 8. Thus we can use locations ${}_n D_d$ in place of losses ${}_n l_d$ in equation 8 if we assume that locations make up a constant proportion of total losses for all durations.

We have no data to which to apply this more flexible approach, since Davidson's data on locations by duration are for 1978 only and provide no guidance concerning duration-specific changes in locations. However, we can test the sensitivity of the stable assumption estimates derived above to a non-stable underlying population by assuming different patterns of duration-specific growth rates. Using the basic model with an overall growth rate of 5 percent and a constant location to loss ratio of 0.5, we illustrate in Table B-3 the estimates obtained assuming first that duration-specific growth rates fall with duration and second that they rise. The P/L ratios obtained bracket the ratio for a stable population, lower for falling rates and higher for rising rates, but differ from it by only 4 or 5 percent. Thus it appears that the stable procedure is actually quite insensitive to departures from stability, at least for the range of growth rates tested, as it was to substantial differences in the stable growth rate used. This insensitivity arises from the heavy concentration of locations at short durations for which the growth rate has only a modest effect.

In conclusion, these methods make some strong assumptions, but the results are not very sensitive to many of them. Deviations from stability appear to be relatively unimportant, and the stability assumption can be relaxed if data are available for more than one year.
Similarly, the results are not highly sensitive to the stable growth rate assumed in the stable method or to the overall growth rate in the non-stable method. The results are more sensitive to location to loss ratios that change sharply with duration, although ratios that change by more than a factor of two affect the overall population to loss ratio by less than 50 percent. The assumption to which the final estimate is directly proportional is the overall location to loss ratio; a value of this ratio of 0.25 will produce an estimate of the illegal population exactly twice as large as will a value of 0.50. However, overall the methodology turns out to be surprisingly robust to deviations from the assumptions. It is likely to work best for groups with similar location and other loss probabilities, so it could usefully be applied to data on locations classified by sex and nationality groups, though not by age since age would introduce entries to and departures from the population considered as a result of birthdays. Data for consecutive years would also prove useful for relaxing the assumption of stability and for examining the consistency of the results. Given the limited data available, the results using location to loss ratios that fall with duration appear most plausible; with an overall location to loss ratio of 0.5, they suggest an average illegal alien population resident a month or more of 1.2 million for 1978, a figure by no means inconsistent with other empirical estimates available. This figure of course excludes the contribution of illegal immigrants at durations of 10 days or less, but their contribution in terms of person-years lived must be fairly small, even if their number is large;
for 1978, it would increase the estimate of 1.2 million by less than 0.1 million. This estimate is of course only arrived at in order to illustrate how these methods work. More extensive data, permitting repeated applications, the relaxation of certain assumptions, and separate analyses for more homogeneous subgroups, are necessary to establish the ultimate value of the methods for estimation purposes.

ESTIMATING EMIGRATION OF RESIDENT ALIENS

Until 1981, most aliens resident in the United States were required to report their address to the INS in January every year. Reporting was made by completing and mailing to the INS a special card (form I-53) available at post offices and elsewhere. The information collected is described in Chapter 4, and the form reproduced in Appendix A. Figures from the reporting system were published in the INS Statistical Yearbook by nationality and state of residence. Reporting under the system was widely regarded as being incomplete, one of the reasons why the Alien Address Reporting (AAR) system was dropped after 1981, and year-to-year fluctuations in the numbers of reporting foreigners can only be explained plausibly in terms of varying coverage.

However, the information available provides some basis for estimating the emigration of permanent resident aliens. If all recording were complete, the number of permanent residents reporting in year t+1, PR(t+1), should be equal to the number who reported in year t, PR(t), plus immigrants (both arriving and changing status), ${}_1I_t$, less naturalizations, ${}_1N_t$, emigration, ${}_1E_t$, and deaths in the United States of permanent immigrants, ${}_1D_t$.
Thus

$$PR(t+1) = PR(t) + {}_1I_t - {}_1N_t - ({}_1E_t + {}_1D_t) \tag{9}$$

If reporting in years t and t+1 was k(t) and k(t+1) complete, and PRR(t) and PRR(t+1) are the numbers reporting, then

$$\frac{PRR(t+1)}{k(t+1)} = \frac{PRR(t)}{k(t)} + {}_1I_t - {}_1N_t - ({}_1E_t + {}_1D_t)$$

or

$$\frac{PRR(t+1)}{PRR(t)} = \frac{k(t+1)}{k(t)} - k(t+1)\,\frac{{}_1E_t + {}_1D_t}{PRR(t)} + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)}$$

Since PRR(t) = k(t)[PR(t)], we can write

$$\frac{PRR(t+1)}{PRR(t)} = \frac{k(t+1)}{k(t)} - \frac{k(t+1)}{k(t)}\,\frac{{}_1E_t + {}_1D_t}{PR(t)} + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)} = \frac{k(t+1)}{k(t)}\,[1 - R(t)] + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)} \tag{10}$$
where R(t) is a loss ratio of deaths and emigrants divided by the initial population; if deaths and emigration are regarded as minimal for immigrants during their year of entry, R(t) can be regarded approximately as a loss rate equal to the sum of the death and emigration rates (note that the denominator of R(t) is the true, not the reported, population at time t). If over a number of years k(t) and R(t) are approximately constant, equation 10 becomes

$$\frac{PRR(t+1)}{PRR(t)} = (1 - R) + k\,\frac{{}_1I_t - {}_1N_t}{PRR(t)} \tag{11}$$

where R is the loss rate, k is the average coverage completeness of the AAR system, and ${}_1I_t$, ${}_1N_t$, PRR(t), and PRR(t+1) can be obtained from INS statistics. R and k can thus be estimated by plotting the ratios in equation 11 and fitting a straight line of intercept (1-R) and slope k. The estimated value R is not an emigration rate but rather a combined emigration and death rate. The emigration element could be obtained by subtracting a death rate calculated on the basis of the age distribution of the population being considered; this death rate would probably not exceed 10 per 1,000 for the immigrant populations from most countries of origin.

The derivation above suggests some practical implications for applying the method. Since R(t) and k(t) are assumed to be constant, the method should be applied to groups as homogeneous as possible, such as country of origin by sex groups. It is also clear that the method will not work well if (a) the fluctuations in k(t) or R(t) are large, or (b) ${}_1I_t - {}_1N_t$ is small relative to PRR(t), or (c) (${}_1I_t - {}_1N_t$)/PRR(t) varies little over time. Simulations suggest that the line should be fitted to the points using a group mean procedure, ordering the observations by the values of (${}_1I_t - {}_1N_t$)/PRR(t); that the resulting estimate of R is reasonably robust to random fluctuations in R(t) and k(t); but that the resulting estimate of k is much more sensitive to such fluctuations.
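To make the fitting step concrete, the Python sketch below estimates R and k from equation 11 on a synthetic series. The text does not spell out the group mean procedure; the version here, which orders the points by (${}_1I_t - {}_1N_t$)/PRR(t), splits them into lower and upper halves, and joins the two group means, is an assumed interpretation. All figures are hypothetical, and the series is generated without noise from known values of R and k so that the fit can be checked.

```python
from statistics import mean

def group_mean_fit(x, y):
    """Fit y = a + b*x by joining the means of the lower and upper halves
    of the observations ordered by x (the middle point is left out of the
    slope calculation when the number of points is odd)."""
    pairs = sorted(zip(x, y))
    half = len(pairs) // 2
    lower, upper = pairs[:half], pairs[-half:]
    x_lo, y_lo = mean(px for px, _ in lower), mean(py for _, py in lower)
    x_hi, y_hi = mean(px for px, _ in upper), mean(py for _, py in upper)
    slope = (y_hi - y_lo) / (x_hi - x_lo)
    intercept = mean(y) - slope * mean(x)  # line through the overall mean
    return intercept, slope

def estimate_loss_and_coverage(prr, net_additions):
    """Equation 11: PRR(t+1)/PRR(t) = (1 - R) + k * (I - N)/PRR(t).

    prr           reported permanent residents, one value per year
    net_additions immigrants less naturalizations, one value per interval
    Returns the estimates (R, k).
    """
    x = [net / prr[t] for t, net in enumerate(net_additions)]
    y = [prr[t + 1] / prr[t] for t in range(len(net_additions))]
    intercept, slope = group_mean_fit(x, y)
    return 1 - intercept, slope

# HYPOTHETICAL, noise-free series built from known values of R and k.
TRUE_R, TRUE_K = 0.03, 0.80
net_additions = [20_000, 55_000, 30_000, 80_000, 45_000, 65_000]
pr = [1_000_000.0]  # true permanent resident population, year by year
for net in net_additions:
    pr.append(pr[-1] * (1 - TRUE_R) + net)
prr = [TRUE_K * value for value in pr]  # reported population

R_hat, k_hat = estimate_loss_and_coverage(prr, net_additions)
print(f"estimated R = {R_hat:.4f}, estimated k = {k_hat:.4f}")
```

With real data, fluctuations in R(t) and k(t) would scatter the points around the line, which is where the sensitivity of the slope estimate noted above becomes relevant.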
It is also necessary to discuss in more detail the effects of the assumption that R(t) and k(t) can be summarized by average values R and k applying to the whole period. Simulations suggest that random variations around the average values will have little effect on R but will have a more pronounced effect on the estimate of k, tending to reduce its value. Underlying trends in R(t) and k(t) might be expected to have more substantial effects, however. Limited simulations suggest that trends in R(t) result in overestimates of R and k if R(t) is increasing, and underestimates of R and k if R(t) is declining; the effect on R is small, the estimate not deviating much from the average value, but the effect on k is substantial, and the estimate might be in error by as much as plus or minus 5 percent for a trend in R(t) over a 15-year period of about 1 percent per annum. A trend over time in k(t) has relatively little effect on the estimate of k, which works out close to the weighted average of k(t) regardless of the direction of the trend, but the estimate of R is biased upward by declining coverage and downward by increasing coverage. In general it can be concluded that the estimates of R and k are reasonably robust to trends in R(t) and k(t), so long as
OCR for page 244
possibly for no other reason than that the efficiency of the Border Patrol has increased, causing more entries to fail early and thus to be repeated. The size and growth of the illegal alien population may not be problems of the magnitude sometimes suggested, although any substantial number of illegal residents may cause social and economic problems, particularly at the local level; these wider issues are not considered in this discussion, which is limited to the size of the population only.
TABLE B-8 Estimated Mexican Emigrants by Age Group and Sex, 1960-1970 and 1970-1980, Using Variable Growth Rate Procedure (in thousands)

                       Males                        Females
Age Group    1960-1970(a)  1970-1980(a)   1960-1970(a)  1970-1980(a)
10-14             112           196            -67          -143
15-19             209           190            -48          -141
20-24              89            58             48            42
25-29              62            39            209           179
30-34             -67          -196             27          -163
35-39              35           -42             25           -28
40-44              -7           -12             27            45
45-49             -30            56             22             2
50-54             115           -52            137           -43
55-59             -41           -88            -38           -57
60-64              28            24             24            18
65-69             -56            -4            -43             4
70-74             -16           -17              4           -15
75-79             -10           -74            -19           -81
Total             423            78            308          -381

(a) The life tables used were from the United Nations model life tables (UN 1982), selected with an expectation of life at age 10 of 56.53 years (males, 1960-1970), 58.21 years (males, 1970-1980), 60.28 years (females, 1960-1970), and 62.09 years (females, 1970-1980).
TABLE B-10 Distribution of Person-Years Lived by Located Deportable Aliens and Fiscal Year of Residence, 1977-1984

Duration      Fiscal Year                     Fiscal Year of Residence
Category      of Location     1977    1978    1979    1980    1981    1982    1983    1984

4 days        1977             404       -       -       -       -       -       -       -
              1978               -     421       -       -       -       -       -       -
              1979               -       -     405       -       -       -       -       -
              1980               -       -       -     360       -       -       -       -
              1981               -       -       -       -     361       -       -       -
              1982               -       -       -       -       -     352       -       -
              1983          [illegible]
              1984               -       -       -       -       -       -       1     589

4-30 days     1977            3309       -       -       -       -       -       -       -
              1978              57    3215       -       -       -       -       -       -
              1979               -      56    3148       -       -       -       -       -
              1980               -       -      38    2141       -       -       -       -
              1981               -       -       -      39    2202       -       -       -
              1982               -       -       -       -      41    2275       -       -
              1983               -       -       -       -       -      49    2767       -
              1984               -       -       -       -       -       -      52    2920

1-6 months    1977           25099       -       -       -       -       -       -       -
              1978            3518   23547       -       -       -       -       -       -
              1979               -    4232   28319       -       -       -       -       -
              1980               -       -    3280   21948       -       -       -       -
              1981               -       -       -    3447   23069       -       -       -
              1982               -       -       -       -    3490   23358       -       -
              1983               -       -       -       -       -    3408   22805       -
              1984               -       -       -       -       -       -    3224   21575

7-12 months   1977           17166       -       -       -       -       -       -       -
              1978            8613   14355       -       -       -       -       -       -
              1979               -   10242   17070       -       -       -       -       -
              1980               -       -    8963   14938       -       -       -       -
              1981               -       -       -   11099   18498       -       -       -
              1982               -       -       -       -   11214   18689       -       -
              1983               -       -       -       -       -    9985   16642       -
              1984               -       -       -       -       -       -    8858   14763

1 year        1977           63541       -       -       -       -       -       -       -
              1978           99493   49747       -       -       -       -       -       -
              1979           81527   81527   40764       -       -       -       -       -
              1980           63550   63550   63550   31775       -       -       -       -
              1981           14609   73047   73047   73047   36524       -       -       -
              1982               -   18992   94960   94960   94960   47480       -       -
              1983               -       -   18971   94853   94853   94853   47427       -
              1984               -       -       -   14911   74556   74556   74556   37278

Total (thousands)            380.9   342.9   352.5   363.5
The Imputation and Treatment of Missing Data
Kenneth Wachter

Every statistical reporting system needs well-defined, routine procedures for the treatment of missing data. That data are missing does not in itself reflect badly on administrators who collect or report them; all good data collection involves missing values, and they are an ordinary fact of life. What does reflect badly on those who report statistics is to fail to recognize the importance of obtaining as complete a response as possible, to deny that there were missing values, or to pretend that the reported values are free from gaps in coverage and free from nonresponse. Nothing undermines confidence in an administrative tabulation so badly as the absence of a column showing the number of offices that failed to report or whose reports could not be included in the tabulation, the number of forms submitted incomplete, and so forth. Reporting the extent of missing values builds confidence in a report, by showing that the agency seeks a realistic view of the completeness that its compilations achieve.

Of course, it is better when data are not missing, although no large reporting system comes close to 100 percent reporting. There is no statistical magic that can make up for information that is not there. There does now exist a large body of statistical know-how for minimizing the bad effects that missing data could otherwise have on reported totals and on inferences about patterns and trends. Some of these statistical methods are already routine and are in continual use by government agencies, including the Census Bureau. Others are at the stage of research development and testing. A good recent overall account can be found in the entry on "Incomplete Data" in the Encyclopedia of Statistical Sciences (Little and Rubin, 1982).

The first rule of treating missing data is that procedures must be as uniform, standardized, and well-documented as possible.
It is more important to know what the numbers before one mean and how they were arrived at than to have them compiled by superior but unfathomable methods. It is better to have a run-of-the-mill but uniform reporting system than to have a system in which (unbeknownst to readers of its reports) certain district offices have pursued every last elusive case while others have misplaced whole bundles of forms. Even if star offices can be identified, they cannot be regarded as a random subset of all offices. A good approach to missing data is to target a random sample of offices for intensive follow-up of missing cases and then to present correction multipliers based on the sample follow-up. Sampling can be an efficient, cost-effective way to allocate resources for improvement in data quality. Using the INS Statistical Yearbook as an example, these general considerations lead immediately to the observation that all tables in the yearbook should include a row or column enumerating cases with status unknown. If unknown or uncertain cases have been distributed among other
rows or columns, the formula for distributing them should be stated in the table notes. For example, Table 10A on page 28 of the 1979 Statistical Yearbook illustrates this point in having a row for "no occupation reported." In contrast, however, the row for "unknown marital status" in Table 10A contains zeroes for all years and both sexes. The zeroes suggest that ambiguous cases have been assigned by some rules that are not explained, since it is not credible that of some 2 million people not a single person failed to check a box, not a single coder accidentally punched a nonsense digit, and not a single error eluded the agents who accepted the forms. Furthermore, age must have been missing from some records, since age nonresponse is commonplace. A row showing the number of cases that lacked ages next to the median age row would be reassuring. For another example, consider Table 17C on page 54. A row showing numbers of temporary visitors whose region of last permanent residence was uncertain, or a formula showing how such cases have been allocated among regions, would bolster confidence and enhance the value of this table.

A further observation is that the footnotes in each table in the Statistical Yearbook should include a statement of the basic data source or sources from which the table is derived. For instance, Table 17C of the 1979 Statistical Yearbook is probably derived from the I-94 Nonimmigrant Arrival/Departure Forms. But that cannot be determined from the yearbook itself. Such footnotes would aid both outside readers and those in the INS who trace errors and regenerate the tables in later years. Which tables derive from the same basic data and which from different sets of data? The yearbook should also contain a statement of the total numbers of I-94 forms accounted for in central office tabulations and the number of those that were incomplete, missing matching departure records, and so forth.
In that way the information that would help in the interpretation of all the tables based on those particular forms would be assembled in one place.

It is very important in treating missing data to know whether the data are missing at random. For example, suppose that a border office typically fails to report at all when it is so busy with exceptional numbers of apprehensions that there is no time for statistical work. Or suppose that at border crossing points, fewer booths are staffed at peak weekends or peak hours, because of staff holidays or staff shortages. In such situations the data are not missing at random. There is a relationship between the values that would have been reported if they were not missing and the fact that those values are missing, and that relationship seriously undermines the statistics. Such situations can sometimes be prevented by astute staffing decisions, but they are bound to occur. The preface to the INS Statistical Yearbook should discuss the most salient such situations, on the basis of direct consultation with the officers in the field who know the realities firsthand.

Formalized statistical techniques that compensate for or diminish the bad effects of missing data come under the general heading of imputation. Missing information can be imputed or made up on the basis of information that is not missing. The first rule of imputation is that information must never be imputed to records in a way that does not allow the imputed cases to be separated from the actually reported cases at every later stage of the analysis. For instance, values should never be
imputed into the basic records, like I-94 forms, unless they are coded in such a way that the imputed values will be clearly distinguished from the real values whenever totals are assembled. When data are computerized, it is generally easy to add a code to each value indicating whether it is imputed. Then totals can be run off with and without the imputed values. In this way, the effects of imputation can be observed.

Of the various imputation strategies now in use, three are mentioned here: "hot deck imputation," generalized regression single or multiple imputation, and incomplete data likelihood maximization with algorithms like those of the "E-M" type. Good accounts of these methods can be found in the entry on incomplete data in the Encyclopedia of Statistical Sciences. For an account of hot deck imputation, see Ford (1983). This is the type of method in widest use among government agencies, especially in the Census Bureau. For regression-based imputation, a new variant that allows unbiased estimation of standard errors in cross-tabulations has been pioneered by Rubin and called "multiple imputation" (Rubin, 1980). It is not restricted to sample surveys alone. A large experiment using this method to insert 1980 occupational codes into the 1970 census public-use sample is now under way at the Census Bureau. For likelihood maximization methods, more formal statistical expertise in model building is required, although the results can repay the extra effort if the data are otherwise of high quality. A good entree to this extensive statistical literature is the cautionary article by Little and Rubin (1983).

The statistical virtues of these methods are not the only considerations in a decision about which to use. The simplicity of the methods and the feasibility of implementing them in practice under the difficult conditions that the INS often faces must be taken into account.
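The rule that imputed values must stay distinguishable from reported ones can be sketched in Python together with a simple hot deck. This is a minimal sketch under stated assumptions, not an INS or Census Bureau procedure: the record layout and field names (`citizenship`, `birth_year`, `occupation`) are hypothetical, the donor is taken to be the most recently processed record with a reported occupation and the same country of citizenship and decade of birth (following the I-94 illustration this paper gives for the hot deck method), and the data are invented.

```python
from collections import Counter


def hot_deck_impute(records):
    """Sequential hot deck for a missing occupation, with an imputation flag.

    The donor is the last previously processed record that reported an
    occupation and agrees on citizenship and decade of birth. Imputed
    values never serve as donors. Input records are not modified.
    """
    last_reported = {}  # (citizenship, birth decade) -> last reported occupation
    completed = []
    for rec in records:
        rec = dict(rec)  # work on a copy
        key = (rec["citizenship"], rec["birth_year"] // 10)
        rec["imputed"] = False
        if rec["occupation"] is None:
            donor_value = last_reported.get(key)
            if donor_value is not None:
                rec["occupation"] = donor_value
                rec["imputed"] = True
            # with no donor available, the value simply stays missing
        else:
            last_reported[key] = rec["occupation"]
        completed.append(rec)
    return completed


def tabulate(completed):
    """Return occupation counts without and with the imputed values."""
    reported_only = Counter()
    with_imputed = Counter()
    for rec in completed:
        value = rec["occupation"] if rec["occupation"] is not None else "not reported"
        with_imputed[value] += 1
        reported_only["not reported" if rec["imputed"] else value] += 1
    return reported_only, with_imputed


# Invented records, in processing order.
records = [
    {"citizenship": "Mexico", "birth_year": 1951, "occupation": "farm worker"},
    {"citizenship": "Canada", "birth_year": 1948, "occupation": "engineer"},
    {"citizenship": "Mexico", "birth_year": 1957, "occupation": None},
    {"citizenship": "Canada", "birth_year": 1962, "occupation": None},
    {"citizenship": "Mexico", "birth_year": 1953, "occupation": "teacher"},
    {"citizenship": "Mexico", "birth_year": 1950, "occupation": None},
]

completed = hot_deck_impute(records)
reported, with_imputed = tabulate(completed)
print("reported only:", dict(reported))
print("with imputed: ", dict(with_imputed))
```

Keeping the flag on every record is what allows both tabulations to be produced from the same file, so that the effect of the imputation remains visible to readers.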
When missing values on individual forms like the I-94 are at issue, the simplest and most easily implemented formal imputation method is the hot deck method. When missing blocks of data in aggregate tables are at issue, for example if a computer tape or the transmissions from one district office should be garbled or misplaced, a likelihood maximization method would be efficient and appropriate.

The idea of hot deck imputation is to substitute for values missing on one record the values that occur on another record whose values have high probability of agreeing with those that are not missing. As an example, consider I-94 forms that are missing entry 11, occupation. A person coding the I-94 forms, finding a missing occupation, would go back, either manually or by computer, to the last-processed I-94 form which showed, say, the same country of citizenship (entry 3) and decade of birth (entry 2). The value from that form would then be coded as occupation for the form with occupation missing, along with a code showing that this occupation value was an imputed value rather than a true value. The final tabulations of admissions by occupation would then show separate values for occupations without imputations and occupations including imputations. It is essential that any hot deck imputation use a set of formal rules that state that if such and such entries are missing, then the donor form from which the missing values are supplied shall be the first form that is similar in a number of prespecified variables. The selection of donor form should not be left to the judgment of the coder. It is also essential that the rules be simple, particularly if they are
being implemented by hand, but also, in the interests of efficiency, if they are being implemented by computer. Thus the requirements for a match to a donor should not be overly rigid, yet should be appropriate for the variable to be imputed.

The use of likelihood maximization methods, especially those that employ E-M algorithms, demands the specification of a statistical model and therefore demands a trained statistician to formulate the model. In an example such as that of a missing tape, the need might be to estimate cells in a cross-tabulation, in which the total number of records with missing age and sex values might be known, but in which their distribution among the cells of the table might be uncertain. In such a case, a standard model for contingency tables could be used. It remains essential, however, to present the table without as well as with the entries adjusted for the missing data. The impact of the statistical adjustments needs to be visible, so that readers and administrators can assess their plausibility. Likelihood maximization methods would be recommended if fairly large quantities of data, numbering, say, into the thousands of records or 5 percent of the total sample, proved missing. For cases in which few values were missing, either in absolute amounts or relative to the size of the sample, it is generally not cost-effective to adjust.

Any statistical system has to deal with missing data. With a record-generating system as large and complex as that of the INS, what is simplest and most easily implemented is undoubtedly best. It is therefore right to advocate a pragmatic approach, rather than a fancy theoretical solution, and to encourage, above all, common sense, uniformity, and candor.

REFERENCES

Ford, B.L.
  1983 An overview of hot-deck procedures. Pp. 185-207 in W.G. Madow, I. Olkin, and D.B. Rubin, eds., Incomplete Data in Sample Surveys: Theories and Bibliographies. Vol. 2. New York: Academic Press.
Little, R.J., and Rubin, D.
  1982 Incomplete data. In S. Kotz and N.L. Johnson, eds., Encyclopedia of Statistical Sciences. New York: Wiley.
  1983 On jointly estimating parameters and missing data by maximizing the complete-data likelihood. American Statistician 37:218-220.
Rubin, D.B.
  1980 Pp. 1-9 in Bureau of the Census, Handling Nonresponse in Sample Surveys by Multiple Imputation. Washington, D.C.: U.S. Department of Commerce.