| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Appendix
SOME METHODOLOGICAL ISSUES IN ANALYZING DATA ON IMMIGRATION
INTRODUCTION
The body of this report has been concerned largely with the process of
collecting and disseminating data on immigration and the foreign-born.
Analytical issues have been touched on, but no detailed examination of
analytical procedures has been attempted. This appendix incorporates
three papers concerning analysis, the first two by Kenneth Hill, of the
panel staff, and the third by Kenneth Wachter, a member of the panel.
Although the papers have benefited from the comments of a number of
reviewers, they nonetheless represent the views of the authors rather
than of the panel as a collective entity. They are included here because
they concern issues central to the panel's charge and were prepared as a
part of its overall work plan. We hope they will serve to stimulate both
discussion and new research areas.
The first paper outlines three methodological procedures that could
be applied to data that either are available or could be made available
at little expense and with little change in current administrative
practices. The methods outlined are aimed at measuring stocks or flows
that are poorly documented by existing statistics: emigration of
immigrants admitted for permanent residence (and, coincidentally, an
estimate of average coverage of the Alien Address Report system), the
size and growth of the population of illegally resident aliens, and net
flows of U.S. citizens. These methods are intended to be illustrations
of ways in which particular types of data might be used for analytical
purposes and to indicate the potential analytical value of compiling or
processing data that are already collected. The estimates obtained by
these methods are also intended to be illustrative rather than
substantive--for substantive applications, the necessary data must be
available and the extensive assumptions underlying the methods must be
evaluated in the light of the results obtained. The methods proposed do
have some promise for producing useful new estimates, to complement
rather than to replace existing ones, and it is hoped that, even if these
methods in the form presented do not prove viable or prove to be
excessively sensitive to critical and unsupportable assumptions, their
presentation will stimulate discussion and the development of new
approaches to the use of available or easily generated data.
The second piece is concerned with estimating the size of the
illegally resident population of the United States. Estimating the size
of this population, and still more its characteristics, poses serious and
special measurement problems, since the population itself is, for obvious
reasons, anxious to avoid any unnecessary contact with officialdom. As a
result, the methods applied, though often ingenious, also often rely on
extensive assumptions that are hard to justify. Hill reviews the major
empirical studies that have been made of the size of the illegal
population and examines their results in the context of their methodology
and assumptions. Several of the methods have been reviewed elsewhere,
and little new about these methods is presented here; however, some of
the methods have not been subjected to detailed examination before, and
it seemed useful to cover all the major methods and their results in one
place and to pull together and evaluate all the available empirical
estimates of the size of the illegal population. The paper is intended
as an evaluation of the various estimates and thus concentrates on the
negative rather than on the positive aspects of the methodologies used.
The reader should bear in mind, however, that the measurement problems
involved are particularly severe and that any methods used will
inevitably involve assumptions and approximations that are hard to
justify. It is an area in which new approaches or the use of different
data are to be welcomed, and in which a wide margin of uncertainty in the
estimates derived should not be interpreted as a criticism of the
methodology or of the attempt.
The third piece discusses the issues of imputation and treatment of
missing data with particular reference to procedures of the Immigration
and Naturalization Service (INS) and the presentation of data in the INS
statistical yearbooks. Wachter argues strongly that the procedures
currently used by INS should be reviewed in terms of their statistical
validity and should be carefully documented in the Statistical Yearbook
so that users can be aware of how the necessary imputations have been
made and be alerted to how such imputations might affect the data.
Indirect Approaches to Assessing Stocks and Flows of Migrants
Kenneth Hill
INTRODUCTION
Statistics for U.S. migrant groups are of very variable quality. The
best data made available by the INS cover first arrivals of permanent
immigrants, or those changing status to permanent immigrant, and those
naturalizing to U.S. citizenship. These data seem to be fairly reliable,
in general if not with regard to all the available detail, even though
they have suffered from severe processing and publication delays in the
last few years. Figures on first arrivals of refugees, published very
promptly by the Office of Refugee Resettlement, also seem to be
reliable. Some elements of inflow are thus adequately covered by
existing statistics. Inflows of temporary visitors, returning citizens,
and returning resident aliens are less satisfactory. Although total
arrivals by air are reasonably well recorded by the INS, processing of
arrival declarations for aliens has been sporadic in recent years, and
permanent residents are no longer required to complete such
declarations. The situation for the inflow through land border ports of
entry is worse; in many cases no direct head count is made, the total
flow being estimated as the product of numbers of cars and an average
occupancy figure derived from semiannual surveys, also used to estimate
citizen/noncitizen ratios (see Chapter 4 for a more complete description
of these procedures). Since the gross inflow across land borders
represents the great majority of total inflow, INS estimates of total
inflow cannot be regarded as satisfactory and in any case exclude any
inflow of undocumented aliens.
No systematic attempt is even made to record outflow; although
temporary visitors are required to complete a declaration on departure,
compliance is high only at airports. Departures of all passengers by air
are recorded by the INS from airline reports, but coverage of charter
flights appears to be incomplete (see Chapter 5). No attempt is made to
record departures of citizens or permanent residents at land border
points or even to estimate the number of vehicles crossing. There is
thus no basis for estimating gross outflow from the United States and no
basis for monitoring changes in population stock. Until 1981, the INS
attempted to monitor the stock of resident aliens through the Alien
Address Reporting system; however, reporting was widely felt to be
incomplete and the system was scrapped, although resident aliens are
still required to register changes of address with the INS. (This
requirement is seldom observed, however, and the forms are not
processed.) Information on the population stock and inflows is available
from the decennial census, which collects country of birth, citizenship,
period of arrival for the foreign-born, and residence one year and five
years before the census. The accuracy of some of the information, which
is self-reported, is open to question, and the coverage by the census of
undocumented aliens is unknown.
There are thus major deficiencies in U.S. international migration
statistics, the two most important being the size and structure of the
undocumented population and emigration of both U.S. citizens and of
noncitizens. Numerous ingenious approaches have been developed to obtain
estimates of the stocks and flows involved. Siegel et al. (1980) provide
a useful review of methods used to estimate illegal immigration, and
methods of estimating emigration have been reviewed by Passel and Peck
(1979) and Warren and Kraly (1985).
This paper describes three potentially useful new indirect approaches
to the estimation of stocks and flows of U.S. migrants. Unfortunately,
the approaches are based on data that are no longer collected, from the
Alien Address Reporting system, on data that are collected but not
processed, from records of deportable aliens located in the United States,
or on data that are difficult to compile, from foreign census counts of
U.S. citizens or the U.S.-born living abroad. The immediate practical
applicability of the methods described is thus severely limited, but the
methods are described in order to indicate some directions that analysis
could take if fairly simple procedures for data collection, processing,
or compilation were instituted. They are not proposed as final solutions
to the measurement problems with which they are concerned. Like all
indirect methods, they too involve assumptions and approximations that
will affect the results. Rather, these approaches illustrate how certain
types of data could be used to obtain estimates of stocks and flows of
people in, into, and out of the United States. We hope this illustration
of the application of somewhat different approaches to the problem will
generate further thinking, which may stimulate additional future research
in this area.
The first method uses information on deportable aliens located by
duration of illegal residence and other simple characteristics to
estimate the size and structure of the nonlegal population of the United
States. The second method combines information from the Alien Address
Reporting system with information on numbers of new immigrants and
naturalizations to estimate both the coverage of the address reporting
program and the emigration of resident aliens. The third method uses
census data from other countries on the U.S. citizen or U.S.-born
population resident in those countries to scale information from an
administrative data source--Internal Revenue Service tax filer
records--on the U.S. population living abroad.
THE SIZE AND STRUCTURE OF THE NONLEGAL POPULATION OF THE UNITED STATES
Numerous methods have been described for estimating the number of
nonlegal residents of the United States or major components of this
population (see the second paper in this appendix for a review of the
more important studies). The approaches proposed here use information on
locations of deportable aliens by duration of illegal residence to
estimate the size and duration structure of the underlying population,
first assuming the population to be demographically stable and second
using duration-specific growth rates. The INS collects information on
deportable aliens located on form I-213 (see Appendix A) but has not
processed the data systematically, although Davidson (1981) has described
the results of processing a sample of the forms completed in calendar
1978.
The use of I-213 data to estimate either numbers or characteristics
of the nonlegal population is not straightforward for a number of
reasons. First, the locations occur very largely at short durations of
illegal stay; in fiscal 1982, for example, 75 percent of the 963,000
locations were of aliens with a duration of illegal stay of 30 days or
less, and 50 percent occurred at entry; a high proportion of these
locations may be of the same person located several times in the same
year. Second, the located aliens cannot be regarded as a random sample
of the underlying population, since probabilities of location are likely
to vary by characteristics such as sex, nationality, and occupation.
Third, the quality of the data on the I-213 is widely regarded as low,
and although no thorough evaluation has been made, Davidson (1981) shows
that employment and residence characteristics suffer from high levels of
nonresponse. These shortcomings no doubt partly explain the INS's
failure to process I-213 forms on a routine basis.
Before describing and illustrating the methods in detail, it is
useful to provide a general explanation of why the methods might be
expected to work at all. To make any analytical use of locations, it has
to be assumed that the number of locations is related to the number of
deportable aliens who can be located. If the number of locations is
determined by INS targets, or the INS locates as many deportable aliens
as it can given existing resources and manpower, there will be no
systematic relationship between locations and population at risk, and
locations will provide no basis for estimating the size of the
population. No empirical basis exists for assuming a relationship
between locations and population, but it does seem plausible that if the
deportable alien population were doubled, the INS would locate at least
some more deportable aliens without any increase in effort, although
locations might increase by a factor of less than two. Even accepting
this assumption of a positive elasticity of locations to population, it
might appear at first sight that a series of numbers of locations by
duration could only indicate relative, not absolute, rates of location by
duration. Sets of location rates that are the same in duration pattern
but different in level will produce the same numbers of locations at each
duration when applied to populations that share a given distribution by
duration but are appropriately scaled. It is not obvious, therefore,
that recorded numbers of locations can tell us anything about the size of
the underlying population. However, there is a link between the two
because the number of locations affects the size of the population, in
much the same way as deaths affect the size and age distribution of a
closed population. If locations were the only source of attrition, the
parallel with deaths would be exact, and methods for estimating
population size from deaths by age for a stable population (that is, a
population changing at a single, constant rate at all ages and thus
maintaining a constant age structure though not a constant size), such as
that proposed by Preston et al. (1980), or for a general population
(Preston and Coale, 1982), could be applied to locations by duration. In
practice, voluntary return migration, change of status, and deaths also
contribute to the attrition of the population of illegal aliens, so
estimates based on locations alone will underestimate the true size of
the population unless allowance is made for other unobserved types of
loss.
The first approach assumes that the illegal alien population is
stable in the demographic sense of having a constant, unchanging rate of
change at each duration of illegal residence. In such a stable
population, the number reaching duration d in a year, N(d), can be
expressed in terms of the number of entries in the year E, the stable
rate of change r, and the probability of surviving from entry to d, p(d):
$$N(d) = E\,e^{-rd}\,p(d) \qquad (1)$$
The average population at all durations, P, can be found by integrating
equation 1:
$$P = \int_0^w N(d)\,dd = E \int_0^w e^{-rd}\,p(d)\,dd \qquad (2)$$
where w is the highest duration attained. In any population, the rate of
change r is equal to the entry rate, E/P, less the loss rate, L/P, where
L is total losses; substituting rP + L for E in equation (2) and
rearranging gives
$$\frac{P}{L} = \frac{\int_0^w e^{-rd}\,p(d)\,dd}{1 - r\int_0^w e^{-rd}\,p(d)\,dd} \qquad (3)$$
Also in a stable population, survival to duration d can be expressed in
terms of losses by duration, l(d), and r:
$$p(d) = \frac{\int_d^w l(x)\,e^{rx}\,dx}{\int_0^w l(x)\,e^{rx}\,dx} \qquad (4)$$
If we now assume that losses from INS locations, D(d), form a constant
proportion of all losses at all durations d, p(d) can be expressed in
terms of D(d) and r, since the constant proportion will cancel out in
equation 4.
We can now apply equations 3 and 4 to Davidson's data on locations by
duration of illegal stay for 1978, assuming different growth rates, and
limiting the analysis to locations at durations of one month or more.
Equation 4 has been evaluated assuming that locations are distributed
evenly over each duration group, applying a value of d for the midpoint
of the interval, except for the open 7+ years interval, for which a value
of 9.5 years was assumed. The integrals in equation 3 were then
evaluated trapezoidally for each duration category. Calculations are
shown in Table B-1. The ratio P/L, average population to average losses,
increases from 1.655 for a zero growth rate to 1.882 for a growth rate of
5 percent to 2.168 for a growth rate of 10 percent; an annual growth rate
of 10 percent implies a population doubling time of seven years. The
estimated P/L is surprisingly insensitive to the assumed growth rate.
The estimates of P/L do not provide a basis for estimating P
directly, since we do not know the value of L, average annual losses.
However, we can obtain estimates of P for a range of assumptions about
the value of L/D, total losses to location losses. Total locations at
one month duration or more were 231,274 in 1978. If locations were 25
percent of total losses, the value of L would be 0.93 million, and the
alien population present illegally in the United States for a month or
more would then be 1.53 million for a growth rate of zero, 1.74 million
for a growth rate of 5 percent, and 2.01 million for a growth rate of 10
percent. If locations were 50 percent of total losses, each estimate
would be halved.
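The arithmetic behind these figures can be checked with a short Python sketch. The location total and the P/L ratios are taken from the text; the function and variable names are hypothetical.

```python
LOCATIONS_1978 = 231_274  # locations at durations of one month or more
P_OVER_L = {0.00: 1.655, 0.05: 1.882, 0.10: 2.168}  # P/L by assumed growth rate

def population_estimate(locations, location_share_of_losses, p_over_l):
    """P = (P/L) * L, where L = locations / (share of losses due to location)."""
    total_losses = locations / location_share_of_losses
    return p_over_l * total_losses

for share in (0.25, 0.50):
    for r, ratio in P_OVER_L.items():
        p = population_estimate(LOCATIONS_1978, share, ratio)
        print(f"share = {share:.2f}  r = {r:.2f}  P = {p / 1e6:.2f} million")
```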
Locations data do not suggest a rapid growth rate of the population.
In 1979, 245,118 deportable aliens illegally resident for a month or more
were located, so if location rates remained constant the underlying
population grew at 5.8 percent annually. A growth rate around 5 percent
thus seems more likely than one of 10 percent. We have little guidance
for a plausible figure for L/D, though Garcia y Griego (1980:Figure 3.3),
using data from the Mexican CENIET border survey on migration histories
of Mexicans returned by the INS, found that about 60 percent of returns
to Mexico over the period 1970-1977 resulted from INS locations, and
about 40 percent were voluntary. These results suggest that an L/D ratio
of 2.0, allowing for deaths and legalizations in addition to voluntary
returns, is more plausible than a ratio of 4.0, at least for Mexican
illegal residents. Using these assumptions, the data suggest an illegal
population resident one month or more that averaged around 0.9 million in
1978.
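As a check on the two figures in this paragraph, the following Python sketch recomputes them from the numbers given in the text. It assumes that the 5.8 percent figure is a continuously compounded (logarithmic) annual rate, which is the reading that matches the quoted value; the variable names are hypothetical.

```python
import math

LOCATIONS_1978 = 231_274
LOCATIONS_1979 = 245_118

# ASSUMPTION: the growth rate is the log of the ratio of successive totals.
growth_rate = math.log(LOCATIONS_1979 / LOCATIONS_1978)

P_OVER_L = 1.882  # stable estimate for a 5 percent growth rate
L_OVER_D = 2.0    # total losses relative to location losses
population = P_OVER_L * L_OVER_D * LOCATIONS_1978

print(f"growth rate = {growth_rate:.3f}")
print(f"population  = {population / 1e6:.3f} million")
```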
This procedure can also provide a number of other interesting
results. For a growth rate of 5 percent, the ratio P/L is estimated at
1.882; this ratio is the inverse of the loss rate, which is therefore
estimated as 0.531. The entry rate, E/P, is equal to the loss rate plus
the growth rate, and is therefore estimated as 0.581. If the ratio L/D
is taken to be 2.0, P is equal to 0.871 million, implying a value of E of
0.506 million. This value is the number of illegals achieving a month's
residence in 1978; since locations under a month in 1978 totalled 0.817
million, and the value of E is estimated at 0.506 million reaching a
month without being located, total entries are estimated (assuming all
losses at durations less than a month result from locations) at 1.323
million, of which the Border Patrol located 62 percent at entry or during
the first month of illegal residence.
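The chain of derived quantities in this paragraph can be traced step by step. The Python sketch below uses only numbers quoted in the text; the variable names are hypothetical.

```python
P_OVER_L = 1.882                 # for a growth rate of 5 percent
GROWTH_RATE = 0.05
L_OVER_D = 2.0
LOCATIONS_ONE_MONTH_PLUS = 231_274
LOCATIONS_UNDER_ONE_MONTH = 817_000

loss_rate = 1 / P_OVER_L                      # L/P
entry_rate = loss_rate + GROWTH_RATE          # E/P
population = P_OVER_L * L_OVER_D * LOCATIONS_ONE_MONTH_PLUS
entries_reaching_one_month = entry_rate * population
# Assumes, as the text does, that all losses before one month are locations.
total_entries = LOCATIONS_UNDER_ONE_MONTH + entries_reaching_one_month
share_located_early = LOCATIONS_UNDER_ONE_MONTH / total_entries

print(f"loss rate     = {loss_rate:.3f}")
print(f"entry rate    = {entry_rate:.3f}")
print(f"E             = {entries_reaching_one_month / 1e6:.3f} million")
print(f"total entries = {total_entries / 1e6:.3f} million")
print(f"located early = {share_located_early:.0%}")
```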
We can also use the p(d) functions to calculate duration-specific
annual location rates, dividing the life table losses p(d) - p(d+n) by
person-years lived by the life table population, approximated by
n[p(d) + p(d+n)]/2, where n is the length of the duration interval in
years, and
then dividing by 2.0 again to allow for the assumption that only half the
losses resulted from locations. The resulting location rates ${}_n l_d$ are
shown in the last column of Table B-1. One comforting feature of the
rates is that those for the open interval, ${}_w l_d$, which are set at
0.200 by assigning a uniform distribution over 5 years, are more or less
consistent with the rates for shorter durations. A discomforting feature
is that the rates are lowest for the duration interval 1-2 years, whereas
we might expect them to decline steadily with duration. A possible
explanation would be that location losses represent a lower proportion of
all losses at long durations than at short durations. This explanation
is tested in Table B-2, in which locations numbers are inflated by
variable duration-specific factors, averaging 2.0 overall, and then
manipulated using a growth rate of 5 percent. Three models are
presented, (a) with the location proportion of all losses rising with
duration, (b) with it falling, and (c) with it starting high for duration
1-6 months, falling sharply to a minimum for duration 7-12 months, then
rising steadily as duration increases. Model (b) does indeed produce
location rates that are essentially constant at durations over one year.
More surprisingly, the results using these three models suggest that
the procedure is not very sensitive even to substantial variations in the
location to total loss ratios by duration, the estimated total population
varying from 0.79 million for model (a) to 1.24 million for model (b).
The assumption of stability can be dropped if information is
available on duration-specific growth rates. If duration-specific
location rates were constant from year to year, population growth rates
could be calculated directly from the numbers of locations in successive
years, since the locations growth rates would be identical to the
underlying population growth rates. Even if we wished not to assume
constant rates, we could assume a constant duration pattern for the rates
and an overall growth rate to which the duration-specific rates would be
scaled. To apply this procedure, we need information on locations by
duration for at least two consecutive years. Unfortunately, such useful
data are not available, but we present the methodology required and
illustrate the effects of departure from stability for two different
cases.
Preston and Coale (1982) have shown that for a non-stable population,
$$N(a) = B\,e^{-\int_0^a r(x)\,dx}\,p(a) \qquad (5)$$
where N(a) is the population aged a, B the number of births, r(x) the
growth rate at age x, and p(a) the probability of surviving to age a, all
at some particular time t. By integration, the total population P is
given by:
$$P = \int_0^w N(a)\,da = B \int_0^w e^{-\int_0^a r(x)\,dx}\,p(a)\,da \qquad (6)$$
In any population, the birth rate B/P is equal to the loss rate L/P plus
the growth rate R, so equation 6 can be rewritten (replacing age by
duration) as
$$\frac{P}{L} = \frac{\int_0^w e^{-\int_0^d r(x)\,dx}\,p(d)\,dd}{1 - R\int_0^w e^{-\int_0^d r(x)\,dx}\,p(d)\,dd} \qquad (7)$$
If we can estimate p(d) and r(x), we can then use this equation to estimate
P/L. The variable growth rate version of equation 4 is
$$p(d) = \frac{\int_d^w l(y)\,e^{\int_0^y r(x)\,dx}\,dy}{\int_0^w l(y)\,e^{\int_0^y r(x)\,dx}\,dy} \qquad (8)$$
Thus, given values of l(d) (or ${}_n l_d$) and r(d) (or ${}_n r_d$) we can
obtain p(d), the survival function needed in equation 7. Note that the
values of l(d) again do not need to be the correct level, as long as they
have the true duration pattern, since a constant level factor will cancel
out from the top and bottom of equation 8. Thus we can use locations
${}_n D_d$ in place of losses ${}_n l_d$ in equation 8 if we assume that
locations make up a constant proportion of total losses for all durations.
We have no data to which to apply this more flexible approach, since
Davidson's data on locations by duration are for 1978 only and provide no
guidance concerning duration-specific changes in locations. However, we
can test the sensitivity of the stable assumption estimates derived above
to a non-stable underlying population by assuming different patterns of
duration-specific growth rates. Using the basic model with an overall
growth rate of 5 percent and a constant location to loss ratio of 0.5, we
illustrate in Table B-3 the estimates obtained assuming first that
duration-specific growth rates fall with duration and second that they
rise. The P/L ratios obtained bracket the ratio for a stable population,
lower for falling rates and higher for rising rates, but differ from it
by only 4 or 5 percent. Thus it appears that the stable procedure is
actually quite insensitive to departures from stability, at least for the
range of growth rates tested, as it was to substantial differences in the
stable growth rate used. This insensitivity arises from the heavy
concentration of locations at short durations for which the growth rate
has only a modest effect.
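A sensitivity test of this kind can be sketched in Python by evaluating equations 7 and 8 with a stepwise growth rate, one value per duration group. Everything numerical below is an assumption made for illustration: the duration groups, the location counts, and the two growth-rate patterns are invented and are not those of Table B-3, and the function names are hypothetical. Losses are placed at interval midpoints and the integral is evaluated trapezoidally, mirroring the stable calculation.

```python
import math

def cumulative_growth(bounds, rates, d):
    """Integral of the stepwise growth rate r(x) from the first boundary to d."""
    total = 0.0
    for lo, hi, r in zip(bounds[:-1], bounds[1:], rates):
        if d <= lo:
            break
        total += r * (min(d, hi) - lo)
    return total

def nonstable_p_over_l(bounds, counts, rates, overall_rate):
    """Equations 7 and 8, discretized by duration group."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    weights = [c * math.exp(cumulative_growth(bounds, rates, m))
               for c, m in zip(counts, mids)]
    total = sum(weights)
    p, remaining = [], total
    for w in weights:          # equation 8: survival at each group boundary
        p.append(remaining / total)
        remaining -= w
    p.append(0.0)
    f = [math.exp(-cumulative_growth(bounds, rates, b)) * s
         for b, s in zip(bounds, p)]
    integral = sum((hi - lo) * (f0 + f1) / 2
                   for lo, hi, f0, f1 in zip(bounds[:-1], bounds[1:], f[:-1], f[1:]))
    return integral / (1 - overall_rate * integral)   # equation 7

# ASSUMPTION: hypothetical inputs, for illustration only.
bounds = [1 / 12, 0.5, 1, 2, 3, 5, 7, 12]
counts = [120_000, 40_000, 30_000, 15_000, 12_000, 6_000, 8_000]
patterns = {
    "constant": [0.05] * 7,
    "falling": [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02],
    "rising": [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08],
}
for name, rates in patterns.items():
    print(f"{name:8s} P/L = {nonstable_p_over_l(bounds, counts, rates, 0.05):.3f}")
```

When every group is given the same rate and that rate is also used as the overall rate, the function reduces to the stable calculation, which provides a consistency check.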
In conclusion, these methods make some strong assumptions, but the
results are not very sensitive to many of them. Deviations from
stability appear to be relatively unimportant, and the stability
assumption can be relaxed if data are available for more than one year.
Similarly, the results are not highly sensitive to the stable growth rate
assumed in the stable method or to the overall growth rate in the
non-stable method. The results are more sensitive to locations to losses
ratios that change sharply with duration, although ratios that change by
more than a factor of two affect the overall population to loss ratio by
less than 50 percent. The assumption to which the final estimate is
directly proportional is the overall location to loss ratio; a value of
this ratio of 0.25 will produce an estimate of the illegal population
exactly twice as large as will a value of 0.50. However, overall the
methodology turns out to be surprisingly robust to deviations from the
assumptions. It is likely to work best for groups with similar location
and other loss probabilities, so it could usefully be applied to data on
locations classified by sex and nationality groups, though not by age
since age would introduce entries to and departures from the population
considered as a result of birthdays. Data for consecutive years would
also prove useful for relaxing the assumption of stability and for
examining the consistency of the results.
Given the limited data available, the results using location to loss
ratios that fall with duration appear most plausible; with an overall
location to loss ratio of 0.50 they suggest an average illegal alien
population resident a month or more of 1.2 million for 1978, a figure by
no means inconsistent with other empirical estimates available. This
figure of course excludes the contribution of illegal immigrants at
durations of 30 days or less, but their contribution in terms of
person-years lived must be fairly small, even if their number is large;
for 1978, it would increase the estimate of 1.2 million by less than 0.1
million. This estimate is of course only arrived at in order to
illustrate how these methods work. More extensive data, permitting
repeated applications, the relaxation of certain assumptions, and
separate analyses for more homogeneous subgroups, are necessary to
establish the ultimate value of the methods for estimation purposes.
ESTIMATING EMIGRATION OF RESIDENT ALIENS
Until 1981, most aliens resident in the United States were required to
report their address to the INS in January every year. Reporting was
made by completing and mailing to the INS a special card (form I-53)
available at post offices and elsewhere. The information collected is
described in Chapter 4, and the form reproduced in Appendix A. Figures
from the reporting system were published in the INS Statistical Yearbook
by nationality and state of residence.
Reporting under the system was widely regarded as being incomplete,
one of the reasons why the Alien Address Reporting (AAR) system was
dropped after 1981, and year-to-year fluctuations in the numbers of
reporting foreigners can only be explained plausibly in terms of varying
coverage. However, the information available provides some basis for
estimating the emigration of permanent resident aliens. If all recording
is complete, the number of permanent residents reporting in year t+1,
PR(t+1), should be equal to the number who reported in year t, PR(t),
plus immigrants (both arriving and changing status), ${}_1I_t$, less
naturalizations, ${}_1N_t$, emigration, ${}_1E_t$, and deaths in the United
States of permanent immigrants, ${}_1D_t$. Thus
$$PR(t+1) = PR(t) + {}_1I_t - {}_1N_t - ({}_1E_t + {}_1D_t) \qquad (9)$$
If reporting in years t and t+1 was k(t) and k(t+1) complete, and PRR(t)
and PRR(t+1) are the numbers reporting, then
$$\frac{PRR(t+1)}{k(t+1)} = \frac{PRR(t)}{k(t)} + {}_1I_t - {}_1N_t - ({}_1E_t + {}_1D_t)$$
or
$$\frac{PRR(t+1)}{PRR(t)} = \frac{k(t+1)}{k(t)} - k(t+1)\,\frac{{}_1E_t + {}_1D_t}{PRR(t)} + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)}$$
Since PRR(t) = k(t)[PR(t)], we can write
$$\frac{PRR(t+1)}{PRR(t)} = \frac{k(t+1)}{k(t)} - \frac{k(t+1)}{k(t)}\,\frac{{}_1E_t + {}_1D_t}{PR(t)} + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)}$$
$$= \frac{k(t+1)}{k(t)}\,[1 - R(t)] + k(t+1)\,\frac{{}_1I_t - {}_1N_t}{PRR(t)} \qquad (10)$$
where R(t) is a loss ratio of deaths and emigrants divided by the initial
population; if deaths and emigration are regarded as minimal for
immigrants during their year of entry, R(t) can be regarded approximately
as a loss rate equal to the sum of the death and emigration rates (note
that the denominator of R(t) is the true, not the reported, population at
time t). If over a number of years k(t) and R(t) are approximately
constant, equation 10 becomes
$$\frac{PRR(t+1)}{PRR(t)} = (1 - R) + k\,\frac{{}_1I_t - {}_1N_t}{PRR(t)} \qquad (11)$$
where R is the loss rate, k is the average coverage completeness of the
AAR system, and ${}_1I_t$, ${}_1N_t$, PRR(t) and PRR(t+1) can be obtained
from INS statistics. R and k can thus be estimated by plotting the
ratios in equation 11, and fitting a straight line of intercept (1-R) and
slope k.
The estimated value R is not an emigration rate but rather a combined
emigration and death rate. The emigration element could be obtained by
subtracting a death rate calculated on the basis of the age distribution
of the population being considered; this death rate would probably not
exceed 10 per 1,000 for the immigrant populations from most countries of
origin.
The derivation above suggests some practical implications for
applying the method. Since R(t) and k(t) are assumed to be constant, the
method should be applied to groups as homogeneous as possible, such as
country of origin by sex groups. It is also clear that the method will
not work well if (a) the fluctuations in k(t) or R(t) are large, or (b)
${}_1I_t - {}_1N_t$ is small relative to PRR(t), or (c) $({}_1I_t - {}_1N_t)/PRR(t)$
varies little over time. Simulations suggest that the line should be
fitted to the points using a group mean procedure, ordering the
observations by the values of $({}_1I_t - {}_1N_t)/PRR(t)$; that the
resulting estimate of R is reasonably robust to random fluctuations in
R(t) and k(t); but that the resulting estimate of k is much more
sensitive to such fluctuations.
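The fitting step can be illustrated with a small simulation in Python. This is a sketch under stated assumptions rather than the procedure actually used: the text does not specify how the groups are formed, so the example takes the lower and upper halves of the observations ordered by the horizontal variable and draws the line through the two group means. The population series is synthetic, generated with a known loss rate and coverage so that the recovery of R and k can be verified; all names and numbers are hypothetical.

```python
def fit_group_means(x, y):
    """Line through the means of the lower and upper halves of the
    observations, ordered by x. Returns (intercept, slope)."""
    pairs = sorted(zip(x, y))
    half = len(pairs) // 2
    low, high = pairs[:half], pairs[-half:]
    x_lo = sum(p[0] for p in low) / half
    y_lo = sum(p[1] for p in low) / half
    x_hi = sum(p[0] for p in high) / half
    y_hi = sum(p[1] for p in high) / half
    slope = (y_hi - y_lo) / (x_hi - x_lo)
    return y_lo - slope * x_lo, slope

def estimate_loss_rate_and_coverage(reported, net_additions):
    """Equation 11: regress PRR(t+1)/PRR(t) on (I - N)/PRR(t)."""
    years = range(len(reported) - 1)
    x = [net_additions[t] / reported[t] for t in years]
    y = [reported[t + 1] / reported[t] for t in years]
    intercept, slope = fit_group_means(x, y)
    return 1 - intercept, slope  # (R, k)

# ASSUMPTION: synthetic data with constant loss rate and coverage.
TRUE_R, TRUE_K = 0.03, 0.80
net_additions = [300_000, 450_000, 250_000, 500_000,
                 350_000, 400_000, 280_000, 520_000]  # immigrants less naturalizations
true_population = [4_000_000.0]
for net in net_additions:
    true_population.append(true_population[-1] * (1 - TRUE_R) + net)
reported = [TRUE_K * pop for pop in true_population]

estimated_R, estimated_k = estimate_loss_rate_and_coverage(reported, net_additions)
print(f"R = {estimated_R:.4f}, k = {estimated_k:.4f}")
```

Because the synthetic series satisfies equation 11 exactly, the fitted line recovers the assumed values; adding random or trending variation to the loss rate or coverage in the simulation is a natural way to explore the sensitivity discussed in the text.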
It is also necessary to discuss in more detail the effects of the
assumption that R(t) and k(t) can be summarized by average values R and k
applying to the whole period. Simulations suggest that random variations
around the average values will have little effect on R but will have a
more pronounced effect on the estimate of k, tending to reduce its
value. Underlying trends in R(t) and k(t) might be expected to have more
substantial effects, however. Limited simulations suggest that trends in
R(t) result in overestimates in R and k if R(t) is increasing, and
underestimates of R and k if R(t) is declining; the effect on R is small,
the estimate not deviating much from the average value, but the effect on
k is substantial, and the estimate might be in error by as much as plus
or minus 5 percent for a trend in R(t) over a 15-year period of about
1 percent per annum. A trend over time in k(t) has relatively little
effect on the estimate of k, which works out close to the weighted
average of k(t) regardless of the direction of the trend, but the
estimate of R is biased upward by declining coverage and downward by
increasing coverage. In general it can be concluded that the estimates
of R and k are reasonably robust to trends in R(t) and k(t), so long as
OCR for page 244
possibly for no other reason than that the efficiency of the Border
Patrol has increased, causing more entries to fail early and thus to be
repeated. The size and growth of the illegal alien population may not be
problems of the magnitude sometimes suggested, although any substantial
number of illegal residents may cause social and economic problems,
particularly at the local level; these wider issues are not considered in
this discussion, which is limited to the size of the population only.
REFERENCES
Bean, F.D., King, A.G., and Passel, J.S.
1983 The number of illegal migrants of Mexican origin in the United
States: Sex ratio-based estimates for 1980. Demography
20(1):99-110.
CENIET
1981 Informe Final: Los Trabajadores Mexicanos en los Estados
Unidos (Encuesta Nacional de Emigracion a la Frontera Norte del
Pais y a los Estados Unidos--ENEFNEU). Secretaria del
Trabajo y Prevision Social. Centro Nacional de Informacion y
Estadisticas del Trabajo. Mexico City.
Cue, R.A.
1976 Men from an Underdeveloped Society: The Socioeconomic and
Spatial Origins and Initial Destination of Documented Mexican
Immigrants. Unpublished Doctoral Dissertation. University of
Texas at Austin, Austin, Texas.
Davidson, C.A.
1981 Characteristics of Deportable Aliens Located in the Interior of
the United States. Paper presented at the annual meetings of
the Population Association of America, Washington, D.C.
Garcia y Griego, M.
1980 El Volumen de la Migracion de Mexicanos no Documentados a los
Estados Unidos (Nuevas Hipotesis). Secretaria del Trabajo y
Prevision Social. Centro Nacional de Informacion y
Estadisticas del Trabajo. Mexico City.
Goldberg, H.
1974 Estimates of Emigration from Mexico and Illegal Entry into the
United States, 1960-1970, by the Residual Method. Unpublished
graduate research paper. Center for Population Research.
Georgetown University, Washington, D.C.
Heer, D.M.
1979 What is the annual net flow of undocumented Mexican immigrants
to the United States? Demography 16(3):417-423.
Interagency Task Force on Immigration Policy
1979 Staff Report. Departments of Justice, Labor and State.
Washington, D.C.
IUSSP
1981 Indirect Procedures for Estimating Emigration. IUSSP Papers,
No. 18. IUSSP, Liege, Belgium.
Lancaster, C., and Scheuren, F.J.
1978 Counting the Uncountable Illegals: Some Initial Statistical
Speculations Employing Capture-Recapture Techniques. 1977
Proceedings of the Social Statistics Section. Part 1, pp.
530-535. American Statistical Association.
Preston, S.H., and Coale, A.J.
1982 Age structure, growth, attrition, and accession: A new
synthesis. Population Index 48(2):217-259.
Reichert, J.S., and Massey, D.S.
1979 Patterns of migration from a Mexican sending community: a
comparison of legal and illegal migrants. International
Migration Review 13:599-623.
Robinson, J.G.
1980 Estimating the approximate size of the illegal alien population
in the United States by the comparative trend analysis of
age-specific death rates. Demography 17(2):159-176.
Siegel, J.S., Passel, J.S., and Robinson, J.G.
1980 Preliminary Review of Existing Studies of the Number of Illegal
Residents in the United States. Mimeo. Bureau of the Census,
U.S. Department of Commerce, Washington, D.C.
Warren, R.
1982 Estimation of the Size of the Illegal Population in the United
States. Paper presented at the 1982 annual meetings of the
Population Association of America, San Diego.
Warren, R., and Passel, J.S.
1983 Estimates of Illegal Aliens from Mexico Counted in the 1980
United States Census. Paper presented at the annual meeting of
the Population Association of America, Pittsburgh.
TABLE B-8 Estimated Mexican Emigrants by Age Group and Sex, 1960-1970
and 1970-1980, Using Variable Growth Rate Procedure (in thousands)

                     Males                    Females
Age Group   1960-1970a   1970-1980a   1960-1970a   1970-1980a
10-14             112          196          -67         -143
15-19             209          190          -48         -141
20-24              89           58           48           42
25-29              62           39          209          179
30-34             -67         -196           27         -163
35-39              35          -42           25          -28
40-44              -7          -12           27           45
45-49             -30           56           22            2
50-54             115          -52          137          -43
55-59             -41          -88          -38          -57
60-64              28           24           24           18
65-69             -56           -4          -43            4
70-74             -16          -17            4          -15
75-79             -10          -74          -19          -81
Total             423           78          308         -381

a The life tables used were from the United Nations model life tables
(UN 1982) selected with an expectation of life at age 10 of 56.53 years
(males 60-70), 58.21 years (males 70-80), 60.28 years (females 60-70)
and 62.09 years (females 70-80).

TABLE B-10 Distribution of Person-Years Lived by Located Deportable
Aliens and Fiscal Year of Residence, 1977-1984

Duration    Fiscal Year                Fiscal Year of Residence
Category    of Location   1977   1978   1979   1980   1981   1982   1983   1984
4 days      1977           404      -      -      -      -      -      -      -
            1978             -    421      -      -      -      -      -      -
            1979             -      -    405      -      -      -      -      -
            1980             -      -      -    360      -      -      -      -
            1981             -      -      -      -    361      -      -      -
            1982             -      -      -      -      -    352      -      -
            1983
            1984             -      -      -      -      -      -      1    589
4-30 days   1977          3309      -      -      -      -      -      -      -
            1978            57   3215      -      -      -      -      -      -
            1979             -     56   3148      -      -      -      -      -
            1980             -      -     38   2141      -      -      -      -
            1981             -      -      -     39   2202      -      -      -
            1982             -      -      -      -     41   2275      -      -
            1983             -      -      -      -      -     49   2767      -
            1984             -      -      -      -      -      -     52   2920
1-6 months  1977         25099      -      -      -      -      -      -      -
            1978          3518  23547      -      -      -      -      -      -
            1979             -   4232  28319      -      -      -      -      -
            1980             -      -   3280  21948      -      -      -      -
            1981             -      -      -   3447  23069      -      -      -
            1982             -      -      -      -   3490  23358      -      -
            1983             -      -      -      -      -   3408  22805      -
            1984             -      -      -      -      -      -   3224  21575
7-12 months 1977         17166      -      -      -      -      -      -      -
            1978          8613  14355      -      -      -      -      -      -
            1979             -  10242  17070      -      -      -      -      -
            1980             -      -   8963  14938      -      -      -      -
            1981             -      -      -  11099  18498      -      -      -
            1982             -      -      -      -  11214  18689      -      -
            1983             -      -      -      -      -   9985  16642      -
            1984             -      -      -      -      -      -   8858  14763
1 year      1977         63541      -      -      -      -      -      -      -
            1978         99493  49747      -      -      -      -      -      -
            1979         81527  81527  40764      -      -      -      -      -
            1980         63550  63550  63550  31775      -      -      -      -
            1981         14609  73047  73047  73047  36524      -      -      -
            1982             -  18992  94960  94960  94960  47480      -      -
            1983             -      -  18971  94853  94853  94853  47427      -
            1984             -      -      -  14911  74556  74556  74556  37278
Total (thousands)        380.9  342.9  352.5  363.5

The Imputation and Treatment of Missing Data
Kenneth Wachter
Every statistical reporting system needs well-defined, routine procedures
for the treatment of missing data. That data are missing does not in
itself reflect badly on administrators who collect or report them; all good
data collection involves missing values--they are an ordinary fact of
life. What does reflect badly on those who report statistics is to fail
to recognize the importance of obtaining as complete a response as
possible, to deny that there were missing values, or to pretend that the
reported values are free from gaps in coverage and free from
nonresponse. Nothing undermines confidence in an administrative
tabulation so badly as the absence of a column showing the number of
offices that failed to report or whose reports could not be included in
the tabulation, the number of forms submitted incomplete, and so forth.
Reporting the extent of missing values builds confidence in a report, by
showing that the agency seeks a realistic view of the completeness that
its compilations achieve.
Of course, it is better when data are not missing, although no large
reporting system comes close to 100 percent reporting. There is no
statistical magic that can make up for information that is not there.
There does now exist a large body of statistical know-how for minimizing
the bad effects that missing data could otherwise have on reported totals
and on inferences about patterns and trends. Some of these statistical
methods are already routine and are in continual use by government
agencies, including the Census Bureau. Others are at the stage of
research development and testing. A good recent overall account can be
found in the entry on "Incomplete Data" in the Encyclopedia of
Statistical Sciences (Little and Rubin, 1982).
The first rule of treating missing data is that procedures must be as
uniform, standardized, and well-documented as possible. It is more
important to know what the numbers before one mean and how they were
arrived at than to have them compiled by superior but unfathomable
methods. It is better to have a run-of-the-mill but uniform reporting
system than to have a system in which (unbeknownst to readers of its
reports) certain district offices have pursued every last elusive case
while others have misplaced whole bundles of forms. Even if star offices
can be identified, they cannot be regarded as a random subset of all
offices. A good approach to missing data is to target a random sample of
offices for intensive follow-up of missing cases and then to present
correction multipliers based on the sample follow-up. Sampling can be an
efficient, cost-effective way to allocate resources for improvement in
data quality.
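
As a minimal sketch of the sample follow-up idea, the multiplier can be computed as the ratio of cases found after intensive follow-up to cases originally reported, taken over the sampled offices, and then applied to the reported total for all offices. The office names, counts, seed, and the ratio form of the multiplier are all assumptions made for illustration; they are not taken from INS practice.

```python
import random


def correction_multiplier(sample):
    """Ratio of cases found after follow-up to cases originally reported.

    sample is a list of (reported, after_follow_up) pairs, one per office.
    """
    reported = sum(r for r, _ in sample)
    complete = sum(c for _, c in sample)
    return complete / reported


# Hypothetical offices: name -> count originally reported.
reported_counts = {"A": 900, "B": 450, "C": 1200, "D": 300, "E": 750, "F": 600}

# Draw the follow-up sample at random rather than choosing "star" offices.
# The fixed seed lets the draw be documented and repeated.
rng = random.Random(1986)
sampled_offices = rng.sample(sorted(reported_counts), 3)

# Hypothetical follow-up results: each sampled office is taken to have
# found 10 percent more cases than it first reported.
follow_up = [
    (reported_counts[o], reported_counts[o] * 11 // 10) for o in sampled_offices
]

multiplier = correction_multiplier(follow_up)
adjusted_total = multiplier * sum(reported_counts.values())
print(sampled_offices, round(multiplier, 3), round(adjusted_total))
```

Both the unadjusted total and the multiplier would be published, so that readers can see the size of the correction.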
Using the INS Statistical Yearbook as an example, these general
considerations lead immediately to the observation that all tables in the
yearbook should include a row or column enumerating cases with status
unknown. If unknown or uncertain cases have been distributed among other
rows or columns, the formula for distributing them should be stated in
the table notes. For example, Table 10A in the 1979 Statistical Yearbook
on page 28 illustrates this point in having a row for "no occupation
reported." In contrast, however, the row for "unknown marital status" in
Table 10A contains zeroes for all years and both sexes. The zeroes
suggest that ambiguous cases have been assigned by some rules that are
not explained, since it is not credible that of some 2 million people not
a single person failed to check a box, not a single coder accidentally
punched a nonsense digit, and not a single error eluded the agents who
accepted the forms. Furthermore, age must have been missing from some
records, since age nonresponse is commonplace. A row showing the number
of cases that lacked ages next to the median age row would be
reassuring. For another example, consider Table 17C on page 54. A row
showing numbers of temporary visitors whose region of last permanent
residence was uncertain, or a formula showing how such cases have been
allocated among regions would bolster confidence and enhance the value of
this table.
A further observation is that the footnotes in each table
in the Statistical Yearbook should include a statement of the basic data
source or sources from which the table is derived. For instance, Table
17C of the 1979 Statistical Yearbook is probably derived from the I-94
Nonimmigrant Arrival/Departure Forms. But that cannot be determined from
the yearbook itself. Such footnotes would aid both outside readers and
those in the INS who trace errors and regenerate the tables in later
years. Which tables derive from the same basic data and which from
different sets of data? The yearbook should also contain a statement of
the total numbers of I-94 forms accounted for in central office
tabulations and the number of those that were incomplete, missing
matching departure records, and so forth. In that way the information
that would help in the interpretation of all the tables based on those
particular forms would be assembled in one place.
It is very important in treating missing data to know whether the
data are missing at random. For example, suppose that a border office
typically fails to report at all when it is so busy with exceptional
numbers of apprehensions that there is no time for statistical work. Or
suppose that at border crossing points, fewer booths are staffed at peak
weekends or peak hours, because of staff holidays or staff shortages. In
such situations the data are not missing at random. There is a
relationship between the values that would have been reported if they
were not missing and the fact that those values are missing--a
relationship that seriously undermines the statistics. Such situations
can sometimes be prevented by astute staffing decisions, but they are
bound to occur. The preface to the INS Statistical Yearbook should
discuss the most salient such situations, on the basis of direct
consultation with the officers in the field who know the realities
firsthand.
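
A small numerical sketch shows why such situations matter. The daily counts and the reporting rule below are hypothetical: the office is assumed to file no report on any day with more than 150 apprehensions. Treating the missing days as if they were missing at random then understates both the daily mean and the total.

```python
# Hypothetical daily apprehension counts at one border office.
true_counts = [40, 55, 38, 210, 47, 190, 52, 44, 230, 50]

# Assumed reporting rule for illustration: no report is filed on days with
# more than 150 apprehensions, because staff have no time for statistics.
reported = [c for c in true_counts if c <= 150]

true_mean = sum(true_counts) / len(true_counts)
reported_mean = sum(reported) / len(reported)

# Scaling the reported mean up to all days assumes missing-at-random days
# resemble reported days, which is false under the rule above.
naive_total = reported_mean * len(true_counts)
true_total = sum(true_counts)

print(round(true_mean, 1), round(reported_mean, 1))
print(true_total, round(naive_total))
```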
Formalized statistical techniques that compensate for or diminish the
bad effects of missing data come under the general heading of
imputation. Missing information can be imputed or made up on the basis
of information that is not missing. The first rule of imputation is that
information must never be imputed to records in a way that does not allow
the imputed cases to be separated from the actually reported cases at
every later stage of the analysis. For instance, values should never be
imputed into the basic records, like I-94 forms, unless they are coded in
such a way that the imputed values will be clearly distinguished from the
real values whenever totals are assembled. When data are computerized,
it is generally easy to add a code to each value indicating whether it is
imputed. Then totals can be run off with and without the imputed
values. In this way, the effects of imputation can be observed.
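
One way to carry such a code is sketched below. The record layout (an `Entry` with a `value` and an `imputed` flag) and the occupation values are hypothetical; the point is only that the same tabulation can be run with and without the imputed values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    value: str
    imputed: bool = False  # True when the value was filled in, not reported


def tabulate(entries):
    """Return (counts of reported values only, counts including imputed)."""
    reported_only = Counter(e.value for e in entries if not e.imputed)
    with_imputed = Counter(e.value for e in entries)
    return reported_only, with_imputed


# Hypothetical occupation entries; the third one was imputed.
entries = [
    Entry("farmer"),
    Entry("clerk"),
    Entry("farmer", imputed=True),
    Entry("farmer"),
]
reported_only, with_imputed = tabulate(entries)
print(reported_only["farmer"], with_imputed["farmer"])
```

The difference between the two counts for any category is the number of imputed cases assigned to it.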
Of the various imputation strategies now in use, three are mentioned
here: "hot deck imputation," generalized regression single or multiple
imputation, and incomplete data likelihood maximization with algorithms
like those of the "E-M" type. Good accounts of these methods can be
found in the entry on incomplete data in the Encyclopedia of Statistical
Sciences. For an account of hot deck imputation, see Ford (1983). This
is the type of method in widest use among government agencies, especially
in the Census Bureau. For regression-based imputation, a new variant
that allows unbiased estimation of standard errors in cross-tabulations
has been pioneered by Rubin and called "multiple imputation" (Rubin,
1980). It is not restricted to sample surveys alone. A large experiment
using this method to insert 1980 occupational codes into the 1970 census
public-use sample is now under way at the Census Bureau. For likelihood
maximization methods, more formal statistical expertise in model building
is required, although the results can repay the extra effort if the data
are otherwise of high quality. A good entree to this extensive
statistical literature is the cautionary article by Little and Rubin
(1983).
The statistical virtues of these methods are not the only
considerations in a decision about which to use. The simplicity of the
methods and the feasibility of implementing them in practice under the
difficult conditions that the INS often faces must be taken into
account. When missing values on individual forms like the I-94 are at
issue, the simplest and most easily implemented formal imputation method
is the hot deck method. When missing blocks of data in aggregate tables
are at issue, for example if a computer tape or the transmissions from
one district office should be garbled or misplaced, a likelihood
maximization method would be efficient and appropriate.
The idea of hot deck imputation is to substitute for values missing
on one record the values that occur on another record whose values have
high probability of agreeing with those that are not missing. As an
example, consider I-94 forms that are missing entry 11, occupation. A
person coding the I-94 forms, finding a missing occupation, would go
back, either manually or by computer, to the last-processed I-94 form
which showed, say, the same country of citizenship (entry 3) and decade
of birth (entry 2). The value from that form would then be coded as
occupation for the form with occupation missing, along with a code
showing that this occupation value was an imputed value rather than a
true value. The final tabulations of admissions by occupation would then
show separate values for occupations without imputations and occupations
including imputations.
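
The procedure just described can be sketched as a sequential hot deck. The field names (`citizenship`, `birth_decade`, `occupation`) and the records are hypothetical stand-ins for the I-94 entries. Two choices below are assumptions rather than anything stated in the text: only actually reported values serve as donors, and a record with no earlier matching donor is left missing.

```python
def hot_deck(records, match_keys, target):
    """Fill missing `target` values from the last-processed matching record.

    Records are processed in order. A donor is the most recent record with
    the same values on `match_keys` and a reported `target`. Every output
    record carries a flag saying whether its `target` value was imputed.
    """
    last_donor = {}
    flag = target + "_imputed"
    result = []
    for record in records:
        record = dict(record)  # work on a copy; leave the input untouched
        match_class = tuple(record[key] for key in match_keys)
        if record.get(target) is None:
            if match_class in last_donor:
                record[target] = last_donor[match_class]
                record[flag] = True
            else:
                record[flag] = False  # no donor yet; value stays missing
        else:
            last_donor[match_class] = record[target]
            record[flag] = False
        result.append(record)
    return result


# Hypothetical forms in processing order.
forms = [
    {"citizenship": "MX", "birth_decade": 1950, "occupation": "farmer"},
    {"citizenship": "CA", "birth_decade": 1950, "occupation": "teacher"},
    {"citizenship": "MX", "birth_decade": 1950, "occupation": None},
    {"citizenship": "MX", "birth_decade": 1940, "occupation": None},
    {"citizenship": "CA", "birth_decade": 1950, "occupation": "nurse"},
    {"citizenship": "CA", "birth_decade": 1950, "occupation": None},
]
completed = hot_deck(forms, ["citizenship", "birth_decade"], "occupation")
for form in completed:
    print(form["occupation"], form["occupation_imputed"])
```

Because the donor is fixed by the matching rule and the processing order, the result does not depend on the judgment of the coder.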
It is essential that any hot deck imputation use a set of formal
rules that state that if such and such entries are missing, then the
donor form from whom the missing values are supplied shall be the first
form that is similar in a number of prespecified variables. The
selection of donor form should not be left to the judgment of the coder.
It is also essential that the rules be simple, particularly if they are
being implemented by hand, but also, in the interests of efficiency, if
they are being implemented by computer. Thus the requirements for a
match to a donor should not be overly rigid, yet they should be
appropriate for the variable to be imputed.
The use of likelihood maximization methods, especially those that
employ E-M algorithms, demands the specification of a statistical model
and therefore demands a trained statistician to formulate the model. In
an example such as that of a missing tape, the need might be to estimate
cells in a cross-tabulation, in which the total number of records with
missing age and sex values might be known, but in which their
distribution among the cells of the table might be uncertain. In such a
case, a standard model for contingency tables could be used. It remains
essential, however, to present the table without as well as with the
entries adjusted for the missing data. The impact of the statistical
adjustments needs to be visible, so that readers and administrators can
assess their plausibility. Likelihood maximization methods would be
recommended if fairly large quantities of data, numbering, say, into the
thousands of records or 5 percent of the total sample, proved missing.
For cases in which few values were missing, either in absolute amounts or
relative to the size of the sample, it is generally not cost-effective to
adjust.
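
For a two-way table, an E-M adjustment of this kind might look like the sketch below. It assumes a simple unrestricted multinomial model, data missing at random, and positive counts in every fully classified cell. To make the iteration non-trivial it also assumes that some records are classified on only one of the two variables (for example sex known but age missing, or the reverse); the counts and the function name are hypothetical.

```python
def em_adjust_table(full, row_only, col_only, max_iter=500, tol=1e-12):
    """Allocate partially classified records among the cells of a table.

    full[i][j]  : records classified on both variables
    row_only[i] : records with only the row variable known
    col_only[j] : records with only the column variable known
    Returns the adjusted table of expected cell counts.
    """
    n_rows, n_cols = len(full), len(full[0])
    n_full = sum(sum(row) for row in full)
    n_total = n_full + sum(row_only) + sum(col_only)
    # Start from the proportions among fully classified records.
    p = [[cell / n_full for cell in row] for row in full]
    for _ in range(max_iter):
        # E-step: spread each partially classified count over the cells it
        # could belong to, in proportion to the current cell probabilities.
        expected = [[float(cell) for cell in row] for row in full]
        for i in range(n_rows):
            row_p = sum(p[i])
            for j in range(n_cols):
                expected[i][j] += row_only[i] * p[i][j] / row_p
        for j in range(n_cols):
            col_p = sum(p[i][j] for i in range(n_rows))
            for i in range(n_rows):
                expected[i][j] += col_only[j] * p[i][j] / col_p
        # M-step: re-estimate the cell probabilities from the completed table.
        new_p = [[cell / n_total for cell in row] for row in expected]
        change = max(
            abs(new_p[i][j] - p[i][j])
            for i in range(n_rows)
            for j in range(n_cols)
        )
        p = new_p
        if change < tol:
            break
    return expected


# Hypothetical sex-by-age-group table with partially classified records.
full = [[120, 80], [60, 140]]
row_only = [20, 30]   # sex known, age group missing
col_only = [10, 40]   # age group known, sex missing
adjusted = em_adjust_table(full, row_only, col_only)
for before, after in zip(full, adjusted):
    print(before, [round(cell, 1) for cell in after])
```

Printing the unadjusted and adjusted rows side by side follows the recommendation above that the table be presented both with and without the adjustment.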
Any statistical system has to deal with missing data. With a
record-generating system as large and complex as that of the INS, what is
simplest and most easily implemented is undoubtedly best. It is
therefore right to advocate a pragmatic approach, rather than a fancy
theoretical solution, and to encourage, above all, common sense,
uniformity, and candor.
REFERENCES
Ford, B.L.
1983 An overview of hot-deck procedures. Pp. 185-207 in W.G. Madow,
I. Olkin, and D.B. Rubin, eds., Incomplete Data in Sample
Surveys: Theories and Bibliographies. Vol. 2. New York:
Academic Press.
Little, R.J., and Rubin, D.
1982 Incomplete data. In S. Kotz and N.L. Johnson, eds.,
Encyclopedia of Statistical Sciences. New York: Wiley.
1983 On jointly estimating parameters and missing data by maximizing
the complete-data likelihood. American Statistician 37:218-220.
Rubin, D.B.
1980 Pp. 1-9 in Bureau of the Census, Handling Nonresponse in Sample
Surveys by Multiple Imputation. Washington, D.C.: U.S.
Department of Commerce.