Model-Based Approaches to Estimating Migration Flows

**INTRODUCTION**

The migration process for undocumented people is complex and dynamic, as described in Chapter 2. Undocumented migrants and their agents adapt quickly to changes in resources and strategies on the U.S. side of the border. Furthermore, the migration of undocumented people responds not only to enforcement efforts by the United States but also to labor market factors in the United States and Mexico, local laws and regulations on both sides of the border, and “competing” traffic across the border—including the highly profitable drug traffic going north and arms traffic heading south.

To estimate the number of illegal crossings at the U.S.–Mexico border, enforcement agencies require information that is not only precise but also timely (e.g., available on a quarterly basis and soon after the end of a quarter). The fact that the migration process is highly dynamic makes this difficult. For example, neither the emergence of drug violence along the border nor the severe economic recession in the United States was anticipated as recently as 5 years ago. The need for accurate estimates at the border at a geographically detailed level introduces additional challenges. Any effective information system will have to be agile, adapting to altered flows and externalities as the process evolves.

No single data source is able to provide direct estimates of the number of illegal attempts to cross the U.S.–Mexico border (see the discussion in Chapters 4 and 5). While several U.S. and Mexican surveys (described in detail in Chapter 3) address specific aspects of the migration process, they tend to do so only in limited ways (as discussed in Chapter 4). A new
survey, or substantial modification of a current one, would be very costly.
Moreover, its design would have to be sufficiently flexible to reflect the
dynamic nature of the migration process.
Similarly, administrative data collected by the U.S. Department of
Homeland Security (DHS) along the U.S.–Mexico border (which were col-
lected for purposes other than the estimation of migration flows) are likely
to provide only a partial picture of the activities of undocumented migrants
and cannot be used in isolation to draw inferences about migration flows
(see Chapter 5). The difficulty in estimating flow from current data sources
persists even if statistical modeling techniques, such as capture-recapture
methodology and other sampling strategies, are used to estimate these hard-
to-count populations.
Based on the panel’s conversations with U.S. Border Patrol (USBP)
agents during site visits to Arizona and California, it is clear that USBP
already attempts to combine information from different sources to forecast
border-crossing activity, albeit in informal ways. In addition to whatever
the surveys may indicate, agents make use of their own administrative data,
their previous experience, and other sources of information that include,
for example, occupancy rates of hotels on the Mexican side of the border,
sign-cutting (i.e., observing and tracking footprints and other physical signs
of migrant passage), and remote sensing data.
Building upon what USBP already does in practice, this chapter dis-
cusses more formal ways for combining varied sources of information
to estimate unauthorized migration flows with geographic and annual/
quarterly specificity. These methods include conventional approaches, such
as probability models, regression models, and spatiotemporal processes,
and more recent methods such as agent-based modeling.
To fit a model, one wants a training sample in which both the explanatory
variables (such as economic pressure, enforcement effort, and point of origin)
and the true values of the response variable (such as the flow of illegal
immigrants at a specific portion of the border) are known. Such a training
sample is difficult to obtain in this situation and will never be fully
achieved. Nevertheless, a model for illegal flow will include many components
for which data exist. For example, each border station records the
number of people in different demographic segments who are interdicted
that month, and surveys are available that indicate how many people in a
particular town chose to seek work in the United States. A mathematical
model for illegal immigration that is founded on good social science theory
can be fit to the available data, and it offers reasonable hope of correctly
tracking the unmeasured data. This hope can be approximately validated,
or disconfirmed, by checking whether the model’s broad predictions for, say,
the total number of illegal Mexican immigrants are consistent with estimates
obtained from other sources (e.g., the cost of day labor or the number of
illegal aliens found during random traffic stops).

**TELEPHONE CARDS: A THOUGHT EXPERIMENT ON QUANTIFYING A DIFFICULT-TO-MEASURE POPULATION**

Two glaring gaps in the information required to estimate the effective-
ness of the resources that have been deployed at the U.S.–Mexico border
during the past decade are the proportion of undocumented crossers who
succeed in their first or later attempts and the proportion of apprehended
migrants who are deterred from further crossing attempts. In the course
of its deliberations, the panel discussed a number of different ideas con-
cerning creative sampling methods for estimating different components of
undocumented immigration. The panel describes here one such idea for
quantifying one type of deterrence effect (i.e., the fraction of apprehended
migrants who choose not to attempt to cross the border again). This simple
thought experiment involves providing telephone cards to undocumented
immigrants who are apprehended in the United States and are then returned
to the Mexican side of the border.
Typically, individuals who intend to cross the border without documen-
tation arrive in the border area and make arrangements for illegal crossing
with assistance from a smuggler. Most of those who are apprehended dur-
ing their first attempt and are returned to the Mexican side of the border
will try to cross again within the next few days. If the second attempt is
also unsuccessful, they tend to keep trying, usually over a period of several
days, until finally they either succeed or give up. USBP could, in principle,
provide a phone card from a Mexican telephone company to a randomly
selected subset of apprehended migrants who are about to be returned to
Mexico. The phone cards, which would come preloaded with a certain us-
age value, could be used to call from either side of the border, but only after
the caller is identified as the person who actually received the card. The
toll-free number to activate the card would differ depending on whether
the individual was in the United States or Mexico at the time of activation.
The fraction of individuals activating the card in Mexico would provide an
estimate of the fraction of apprehended individuals who are deterred from
crossing again, and the fraction activating the card in the United States
would provide an estimate of the fraction of individuals that cross success-
fully on the next attempt.
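
The arithmetic behind this thought experiment can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the activation counts, and the activation rates are hypothetical and do not come from the panel's report. The unadjusted estimate treats every activated card as equally informative; the optional rates show how the estimate would shift if migrants in one country were assumed to be less likely to activate their cards than migrants in the other.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def deterred_fraction(activated_mexico, activated_us,
                      rate_mexico=1.0, rate_us=1.0):
    """Estimated share of card users who were deterred from crossing again.

    rate_mexico and rate_us are ASSUMED activation rates (hypothetical
    inputs): the probability that a migrant who ends up in that country
    activates the card at all. Dividing by them re-weights the counts.
    """
    if not (0 < rate_mexico <= 1 and 0 < rate_us <= 1):
        raise ValueError("activation rates must be in (0, 1]")
    mexico = activated_mexico / rate_mexico
    us = activated_us / rate_us
    return mexico / (mexico + us)


# Hypothetical counts, for illustration only.
naive = deterred_fraction(120, 280)
adjusted = deterred_fraction(120, 280, rate_mexico=0.8, rate_us=0.4)
low, high = wilson_interval(120, 120 + 280)

print(f"naive deterred share:    {naive:.3f} ({low:.3f} to {high:.3f})")
print(f"adjusted deterred share: {adjusted:.3f}")
```

The interval reflects sampling variability only. It says nothing about bias from cards that are never activated, which is why the assumed activation rates matter so much to the adjusted figure.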
Several practical problems would need to be resolved in order to imple-
ment the phone card experiment. First, phone cards would have to be suf-
ficiently attractive so that migrants actually use them, but not so valuable
that they render the program too expensive (or induce criminal elements to
prey on returning migrants, or create a black market in phone cards). This
issue might be addressed by experimenting with different phone card values
and by varying the fraction of migrants who receive phone cards across
time and border locations. Second, it would be necessary to ensure that
the user of the phone card is the intended card recipient and not someone
else to whom the card was given or sold. This issue might be addressed
by having the migrant answer simple questions at the time of phone card
activation based on demographic information collected at the time of ap-
prehension. Third, it is necessary to ascertain whether the location from
which the migrant activates the card is the migrant’s final destination (be
it in Mexico or the United States). This issue might be addressed by hav-
ing the card’s earliest possible activation date be 1 to 2 weeks after the
apprehension, so that cards activated in Mexico would predominantly be
activated by discouraged crossers, while cards activated in the United States
would predominantly be activated by successful crossers. Finally, one would
expect that successful undocumented migrants would be more reluctant to
activate their cards in the United States than would unsuccessful undocu-
mented migrants still in Mexico, so the undercount is likely to be different
for successful undocumented crossers than for unsuccessful crossers who
are deterred from further attempts. The risks of non-use and differential use
cannot be ignored. Steps would need to be taken to address the concerns
of undocumented migrants, such as allowing callers not to self-identify and
providing assurances that the identity and locations of the callers are not
required and would not be traced (beyond the country from which the call originates).
This thought experiment highlights the kinds of data that would be
needed if apprehensions data were to be used to estimate stocks or flows
of unauthorized immigrants. However, even though the phone card experi-
ment might be useful in estimating the number of crossers who are success-
ful after having been apprehended once or more than once, it still would not
provide any information about individuals who cross successfully in their
first attempt. One approach to counting such elusive populations is based
on network sampling or link-tracing sampling (see Box 6-1). However,
these methods require careful implementation and additional assumptions
to be usable, and they are not yet sufficiently developed to be clearly helpful
to DHS in filling the critical data gaps.

**SURVEY AND ADMINISTRATIVE DATA-INFORMED MODELING**

Statistical models can provide plausible descriptions of immigration
behavior, and some aspects of their fit can be validated against available
survey and administrative data. A model can be applied to historical data
to see whether its predictions agree with results from previous surveys.
Even though the available surveys do not directly address all questions
of interest to DHS (see Chapters 3 and 4), if a statistical model agrees
with the findings of the surveys on those aspects of flow that the surveys
do capture, then one can reasonably expect that the model has predictive
power for estimating other relevant aspects of flows. Similarly, if a model
produces results that are not supported by previous data, then one of three
conclusions is plausible: the model does not fit the data well, the migration
process has changed significantly over time, or both these conditions apply.
The model must be flexible, and one should expect that it will be necessary
to extend it when new factors come into play, leading to a new round of
model retrofitting and validation.
Beyond timeliness and the possibility of greater accuracy, modeling
has additional advantages. A good model allows policy makers to explore
“what if” scenarios by changing model inputs. In particular, DHS can
explore the impact of different allocations of enforcement resources among
border stations or the impact of new enforcement policies. More impor-
tantly, the process of building a good model can create a stronger under-
standing of the social process underlying immigration behavior. Finally, a
good model should produce accurate estimates of prediction uncertainty.
Predictions from a model that are not paired with estimates of their predic-
tion error have limited value.
This chapter reviews several approaches to survey-informed modeling
and, to the extent possible, offers some comparative guidance in the con-
text of estimating unauthorized migration flows. There are at least three
standard strategies for doing survey-informed estimation: build a prob-
ability model, fit a regression model, or employ a spatiotemporal model.
Newer approaches such as agent-based modeling are based on simulation
but still rely on survey and administrative data for parameter settings and
model components. The estimates that result from the use of models should
usually be fairly accurate if the data are representative and reliable, the
model is valid, and the immigration process does not change. In this case,
disaggregated data on migrants and nonmigrants that allow for the explicit
modeling of the migration decision can be used to verify and validate find-
ings from aggregate data, such as apprehensions.

BOX 6-1
Network Sampling

Link-trace sampling and its variant, respondent-driven sampling (Salganik
and Heckathorn, 2004; Thompson and Seber, 1996), have been used for sampling
elusive, hard-to-reach populations, such as unregulated workers, the homeless,
drug users, and sex workers. These populations are typically characterized not
only by the absence of a serviceable sampling frame but also by the presence
of social relations among members of the population. These social connections
provide a means of reaching individuals in the population via the people they
know. Such approaches are often informally referred to as “snowball sampling.”

Link-tracing sampling strategies such as snowball sampling and
respondent-driven sampling are often used to leverage those social relations
beyond the small population subgroup available to researchers. The initially
selected group is referred to as the “seed” sample. Information about the
social links of individuals in the seed sample to other members of the
population is used to identify and contact members outside the original
population subgroup available to researchers. In most applications of
link-tracing sampling, the seeds are a convenience sample, so the probability
of inclusion of any member is unknown. Therefore, a serious drawback of this
type of sampling is that probability-based inferential methods are
problematic. However, it is possible for the seeds in link-tracing designs to
be selected randomly, even in applications to hard-to-reach populations, for
example, by using a spatial sampling frame (Felix-Medina and Thompson, 2004).a

Respondent-driven sampling is a variation of link-tracing sampling in which
the respondents themselves choose and contact people they know and invite them
to participate in the survey (Heckathorn, 1997, 2007; Salganik and Heckathorn,
2004; Volz and Heckathorn, 2008). While it is possible to reduce the dependence
of the final sample on the seeds, recent simulation studies (Handcock and Gile,
2010) suggest that substantial biases can remain. A common feature of networked
populations is that they exhibit homophily by attributes; that is, the social
ties are more likely to occur between people who have similar attributes. In
the case of undocumented migrants, homophily might occur when the initial group
of seeds includes men from the same geographic area in Mexico. While
link-tracing designs are often effective at acquiring a sample, the degree to
which data so collected can be considered a probability sample is unclear. To
allow valid inference to the population, the designs need to be implemented
carefully and the mechanism of selection of successive waves of the sample must
be well understood. Recent research (Gile, 2011; Gile and Handcock, 2011)
discusses model-based estimation methods that address the biases introduced by
convenience samples of seeds. However, methodological development lags the
data collection efforts.

There have been some applications of link-tracing sampling to unauthorized
border crossings. Two such efforts have been discussed in Chapters 3 and 4: the
Mexican Migration Project and the Mexican Migration Field Research Program.
Neither of the two surveys results in samples that can be used for quantifying
migration flows. Respondent-driven sampling does not appear to have been
systematically used in the context of Mexican migration. Morral and colleagues
(2011) discuss the potential use of respondent-driven sampling to estimate the
stock of undocumented migrants, the probability of eluding capture at the
border, and other quantities associated with the migration process.

One approach to network sampling is to start with seeds who are recent
immigrants from Mexico and use them to recruit other recent immigrants. Key
questions asked of them would include how long they have been in the United
States since their most recent crossing and how many recent immigrants they
know. The recruitment “coupons” for this respondent-driven sampling would not
need to be physical (e.g., an identifying number would be sufficient). Methods
have been developed to estimate population size from respondent-driven sampling
data (Handcock et al., 2011; Salganik et al., 2011). Given the size of the
flows, it is likely that network scale-up methods, which are a form of
post-stratification, would be the most effective means to estimate the
population size. However, significant methodological development and empirical
testing are required before these methods can be recommended.

a See Thompson (2002) for an informative general discussion of link-tracing designs.

**Probability Models**

Massey and Singer (1995) developed a simple probability model for the
number of unsuccessful attempts at illegal immigration before a successful
crossing. Their basic model was a geometric distribution for the number
of attempts before the first success, where it was assumed that attempts
were independent trials with constant probability of success. The estimated
probability of success was obtained from interview data on the number of
people who were successful on the first try, the second try, and so forth.
Using survey data on the number of crossing attempts by migrants, collected
from family members in four Mexican states, they modified the initial
geometric distribution into a Poisson regression model in which the response
was the number of attempts before the first successful trip and the mean
of the Poisson included covariates related to economic factors, gender, and
other variables that might affect the success rate. They also (coarsely)
corroborated their model’s predictions against data from Mexican surveys and
against the data on the number of people legalized under the Immigration
Reform and Control Act of 1986 (IRCA). Their estimates suggest that, as
of 1995, the U.S.–Mexico border was becoming increasingly porous and the
probability of apprehension on any given attempt was about one-third and
falling. They concluded that about 98 percent of individuals who attempted
to cross the border illegally were ultimately successful.
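
The geometric model that Massey and Singer started from can be illustrated with a short Python sketch. It is not their code or their data: the function names and the survey responses below are hypothetical, and the sketch assumes, as the basic model does, that attempts are independent with a constant probability of success. Under that assumption the maximum likelihood estimate of the per-attempt success probability is the number of respondents divided by the total number of attempts they made.

```python
def geometric_mle(failures_before_success):
    """MLE of the per-attempt success probability under a geometric model.

    Each entry is the number of failed attempts (apprehensions) one
    respondent reported before the first successful crossing.
    """
    n = len(failures_before_success)
    if n == 0:
        raise ValueError("need at least one observation")
    total_failures = sum(failures_before_success)
    # n successes out of (n + total_failures) attempts in all
    return n / (n + total_failures)


def prob_success_within(p, max_attempts):
    """Probability of at least one success in max_attempts independent tries."""
    return 1.0 - (1.0 - p) ** max_attempts


# Hypothetical survey responses, for illustration only.
reported_failures = [0, 0, 1, 0, 2, 0, 1, 0, 0, 1]
p_hat = geometric_mle(reported_failures)
within_four = prob_success_within(p_hat, 4)

print(f"estimated success probability per attempt: {p_hat:.3f}")
print(f"probability of success within 4 attempts:  {within_four:.3f}")
```

With these made-up responses the estimated success probability is two-thirds, that is, an apprehension probability of one-third per attempt. The chance of succeeding within four attempts is then 1 - (1/3)^4, roughly 0.99, which shows how a moderate per-attempt apprehension rate can coexist with a very high eventual success rate when migrants keep trying.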
Massey and Espinosa (1997) extended this methodology in various
ways—for example, by proposing a model for estimating the probability
of a first trip and of recurrent trips. The extended model includes not only
macroeconomic variables but also individual and household characteristics,
migration experience of other members of the household, and macroeco-
nomic variables from the community/country of origin. Additional vari-
ables, such as those related to the political and legal contexts in Mexico
and the United States (the Bracero Program, the IRCA period, and so on)
might also have been considered. Nonetheless, the situation at the border
has changed markedly since 1997, and the panel has no confidence that
these older models, which antedate the drug corridors, modern enforce-
ment technology, and other innovations, can provide good guidance for the
current era. Since the older models are unlikely to have the correct form,
it would probably be necessary to rebuild them rather than just refit them
with new data.1
While the policy environment can be updated in a rebuilt model, another
shortcoming of much survey-based regression-type modeling is the
endogeneity of many of the migration determinants. In the presence of
endogenous covariates and dual causality, the ability to simulate counter-
factuals is compromised. More recent studies along the lines of Massey
and Espinosa (1997) address this problem by using instrumental variables
estimation (Angelucci, 2012; Gathmann, 2008; Orrenius, 1999).
Wein and colleagues (2009) and Liu and Wein (2008) extended the
probability modeling to include compartmental modeling. They describe a
system of four submodels, each of which is tuned with historical data and
which interact to produce a probabilistic model for immigration flows. The
four submodels are as follows:
• a multinomial logit model, as in Ben-Akiva and Lerman (1985),
which gives a probabilistic description of the choices made by un-
documented border crossers, such as the location of the attempt;
• an enforcement model, which describes the probabilities of interdic-
tion as a function of enforcement effort and resources;
• a repatriation model, which describes how an apprehended alien is
returned to Mexico; and
• an economic model that accounts for how supply and demand affect
the wages of unskilled immigrant workers.
These submodels can become arbitrarily more sophisticated, incorporating
elements of game theory, queuing theory, and portfolio analysis. The ana-
lyst must solve systems of non-linear equations or differential equations.
Chang and colleagues (2012) developed a practical computational tool for
implementing this model.
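
To make the first submodel concrete, the following Python sketch computes multinomial logit choice probabilities over crossing locations. It is an illustration under stated assumptions rather than the model of Wein and colleagues: the location labels, the two attributes, and the coefficient values are all hypothetical. In a real application the coefficients would be estimated from data.

```python
import math


def systematic_utility(attributes, beta):
    """Linear utility V: the sum of coefficient times attribute value."""
    return sum(beta[name] * value for name, value in attributes.items())


def choice_probabilities(utilities):
    """Multinomial logit probabilities P(j) = exp(V_j) / sum_k exp(V_k)."""
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in utilities.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical coefficients and crossing locations, for illustration only.
beta = {"cost": -0.5, "p_apprehend": -2.0}
locations = {
    "A": {"cost": 2.0, "p_apprehend": 0.50},
    "B": {"cost": 3.0, "p_apprehend": 0.20},
    "C": {"cost": 2.5, "p_apprehend": 0.35},
}
utilities = {name: systematic_utility(a, beta) for name, a in locations.items()}
probs = choice_probabilities(utilities)

for name in sorted(probs):
    print(f"location {name}: {probs[name]:.3f}")
```

Because the apprehension probability enters the utility, raising enforcement at one location in this sketch shifts predicted attempts toward the others. That is the kind of interaction between the choice and enforcement submodels that the compartmental approach is meant to capture.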
The main difficulty with this approach, even in its most mathematically
advanced form, is that the model is hard to tune from historical data,
because the various submodels rely on information that was collected using
different designs, at different data scales, in different time frames, and with
different degrees of precision. This makes it almost impossible to calculate
the errors associated with model predictions. This difficulty is not unique
to this multipart model, and Chang and colleagues (2012) consider the model to
be among the most practical strategies for assessing cross-border flow. In
principle, however, this type of complex probability modeling could be useful
to address several aspects of illegal migration flows, with the exception
of the probability of a successful first attempt.

1 Since the changing relationship between the U.S. and Mexican economies will
also have a non-linear effect on the incentive to migrate, the older models are
likely to have moved outside the range in which approximate linearity would
allow simple retrofitting to work.

**Regression Models**

Multiple linear regression modeling is a standard tool in demography
and econometrics. Regression has been applied to various problems in im-
migration studies, perhaps most pertinently by Lewer and Van den Berg
(2007), but it is more commonly used to estimate the economic impact of
undocumented laborers. When circumstances admit locally linear approxi-
mation to complex phenomena, regression models can be quite effective,
even in nonlinear applications. They are easy to implement, and results are
relatively robust to departures from model assumptions. Multiple regression
is transparent: the coefficients are often directly interpretable, and standard
statistical inference enables tests of those coefficients, sensitivity analysis,
and the calculation of confidence intervals. In the context of illegal immi-
gration, the response variable might be the total number of successful illegal
crossing attempts in a month, or it might be a multivariate response, such
as a vector of illegal crossings at each of a number of different locations.
To estimate immigration flows, an economics perspective would start
by modeling them in terms of the difference in earnings realizable in the
United States and the prospective migrant’s current home; one such ap-
proach has already been briefly discussed in Chapter 5. Migration costs are
then subtracted from the potential gains, where costs include actual travel
expenses, foregone earnings, and the disutility of being away from home
(often ameliorated by migrant networks, which can also be readily modeled
with the right data). The model might allow for benefits and costs to vary
by age, gender, and education level. For example, young men, who have long
work horizons, greater facility for learning English, and lower risk in
an illicit border crossing, would have greater migration incentives than older
people, women, and high-education workers (the latter have relatively high
earnings in Mexico).
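
One way to write down the net-benefit calculation described above is sketched below. The notation is hypothetical and is introduced here only for illustration; it does not appear in the source. The foregone-earnings term is taken to mean earnings lost while traveling and searching for work, which is an assumption about how the two earnings terms are kept from overlapping.

```latex
% Hypothetical notation: i indexes a prospective migrant or demographic group.
% W^{US}_i, W^{MX}_i : expected earnings in the United States and at home
% T_i : travel expenses;  F_i : earnings foregone during the move
% D_i : monetized disutility of being away from home
\[
  N_i \;=\; \bigl(W^{US}_i - W^{MX}_i\bigr) \;-\; \bigl(T_i + F_i + D_i\bigr)
\]
% Under this sketch, migration is attractive for i when N_i > 0.
```

Letting each term vary with age, gender, and education is what allows such a model to produce different incentives for different groups.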
There is a large literature on models for immigration forecasting. Howe
and Jackson (2005) recently surveyed this area, describing and compar-
ing methodologies that have been adopted in the United States, Canada,
and various European countries. But they emphasized that “[t]he poverty
of explanatory models in the current practice of immigration projection
contrasts sharply with the abundance of theories proposed and discussed
by experts in a variety of social science and policy disciplines” (Howe and
Jackson, 2005:19). This apparent gap suggests that there is potential for
research on modeling approaches that formalize and combine the various
theories on illegal immigration (e.g., those based on social networks, rela-
tive strength of dual economies, social capital effects, and policy analysis)
into formal regression models whose accuracy can be assessed using his-
torical data.
Nonetheless, much has already been done to apply regression methods,
broadly defined, to immigration flows. McKenzie and Rapoport (2010),
for example, used ordinary least squares regression on data from the 1997
National Survey of Population Dynamics (ENADID) to study how the edu-
cation level of illegal immigrants depends upon demographic and economic
variables. Their work responds in part to work by Orrenius and Zavodny
(2005), who used a Cox proportional hazards regression model on data
from the Mexican Migration Project to quantify the effect of economic con-
ditions, border enforcement, and migrant networks on the education level
of unauthorized border crossers. Massey and Espinosa (1997) undertook
a broader examination of 41 covariates that, from one or more theoretical
perspectives, might be linked to illegal immigration; their work was based
on 25 samples drawn from border states in western Mexico. Even before
that, Taylor (1987) built explicit economic models that used regression to
estimate income gains that provided incentive for illegal immigration. There
is a great deal more literature on the topic, some of which is discussed in
Chapter 2. But the question that remains to be answered is whether these
regression tools can produce estimates of illegal migration flows (or their
components) with sufficient accuracy and timeliness to meet DHS’s needs.
On the positive side, there are some grounds for optimism. DHS wants
to estimate flows in the recent past, which tends to be easier than the
forecasting problem that has driven much of the literature, especially that
summarized in Howe and Jackson (2005). Also, DHS has access to
administrative data, which can help inform the social, economic, and po-
litical theories that have driven previous modeling efforts. On the negative
side, much of the previous literature was developed using data from before
2005, and it is clear that the illegal immigration process has changed in
important ways.
Nonetheless, using historical survey data and administrative records, a
multiple regression model of some kind (including multivariate regression,
principal components regression, Cox proportional hazards regression, and
so on) could be fit to describe how a specific component of the total flow
might depend on the explanatory variables. For example, if the component
of interest were counts of males aged 17-30, then that variable would be
used as the response, and a model would be fit that included such informa-
tion as the expected difference in income opportunity between the United
States and Mexico, the level of interdiction effort, and perhaps such covari-
ates as the cost of being smuggled, the size of the Hispanic population in the
United States, and indicator variables for seasonality, which affects migrant
farm labor and the home construction industry. The flow $Y_i(t)$ for
demographic segment $i$ in period $t$ can be estimated from a national
survey such as ENADID or the Mexican National
Survey of Occupation and Employment (ENOE), albeit with the difficulties
and limitations that have already been discussed in Chapter 4. The differ-
ence in expected income by age, education, and other individual attributes
can be obtained from economic records, and the amount, distribution, and
type of border enforcement are available from the administrative records.
This kind of modeling can, in principle, be implemented for each demo-
graphic component. An estimate of the total flow is the sum of the estimates
computed for each segment.
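
The segment-by-segment procedure just described can be sketched in plain Python. Everything specific here is an assumption made for illustration: the segment names, the two covariates (an income gap and an enforcement-effort index), and the flow figures are made up, and the data are noise-free so that the fit is exact. The small ordinary least squares routine solves the normal equations directly and is meant as a teaching aid, not a substitute for a statistical package.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows; each row starts with 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(beta, row):
    """Fitted value for one row of covariates."""
    return sum(b * x for b, x in zip(beta, row))


# Hypothetical design: intercept, income gap, enforcement-effort index.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0],
     [1.0, 4.0, 3.0], [1.0, 5.0, 5.0]]
# Hypothetical flows by segment for the same five periods.
segments = {
    "men_17_30": [110.0, 135.0, 140.0, 165.0, 175.0],
    "all_others": [56.0, 68.0, 72.0, 84.0, 90.0],
}

fits = {name: ols(X, y) for name, y in segments.items()}
new_period = [1.0, 3.5, 2.0]
total_flow = sum(predict(beta, new_period) for beta in fits.values())
print(f"estimated total flow: {total_flow:.1f}")
```

Fitting each segment separately lets the coefficients differ across demographic groups, at the cost of needing enough data in every segment to estimate them.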
The simplest nonlinear model for immigration, the gravity model, as-
sumes that the magnitude of the population flow between two locations is
proportional to the product of the sizes of the populations in each of those
locations and inversely proportional to some monotonically increasing
function of the distance between the two locations (e.g., the square of the
distance, as in the Newtonian model for gravitation in astronomy).2 In the
context of this study, the measure of distance would be supplemented by
variables that reflect the amount of border security, the costs and dangers
of traveling, and so forth. Also, the product of the sizes of the populations
might be replaced by the “attractiveness” of the United States, probably
measured in terms of economic advantage.
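A minimal sketch of the gravity model and of the modified form suggested here may help fix ideas. The classical form follows the definition above; the way enforcement and travel cost inflate distance in `effective_distance`, the weights, and all function names are assumptions made for illustration, not a specification taken from the literature cited.

```python
def gravity_flow(pop_origin, pop_dest, distance, k=1.0, exponent=2.0):
    """Classical gravity model: flow is proportional to the product of the
    two population sizes and inversely proportional to distance**exponent."""
    return k * pop_origin * pop_dest / distance ** exponent

def effective_distance(distance, enforcement, travel_cost, w_enf=1.0, w_cost=1.0):
    """Hypothetical composite distance: physical distance inflated by
    border enforcement and by the cost and danger of travel."""
    return distance * (1.0 + w_enf * enforcement + w_cost * travel_cost)

def migration_flow(attractiveness, distance, enforcement, travel_cost,
                   k=1.0, exponent=2.0):
    """Modified form: the population product is replaced by an
    'attractiveness' score and distance by the composite distance."""
    d = effective_distance(distance, enforcement, travel_cost)
    return k * attractiveness / d ** exponent
```

With the default exponent of 2, doubling the distance cuts the classical flow to one quarter, and raising `enforcement` lowers `migration_flow` with everything else held fixed.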
Regression models are popular statistical tools because they are rela-
tively easy to fit and lend themselves to straightforward interpretations.
Relative to other approaches, they also tend to make the least consequential
assumptions. However, the regression modeling approach will not do a
good job of tracking changing mechanisms of migration, and it is unlikely
to provide fine geographic detail given the kinds of survey data currently
available. Over time, as the migration process evolves and the measures
of the inputs become outdated, the validity of the model will drift, and its
predictive accuracy will surely decline.
2 See Sen and Pruthi (1983) for a discussion of the use of regression in fitting a gravity model
to migration flows and Sen and Smith (1995) for a discussion of gravity models in general.
Spatiotemporal Processes
A third standard approach is to model the correlation in flow data
across time and space. A simple time series is a starting point, but it requires
a great deal of faith in the model and its stability. For estimating immigration
flows, the natural time series formulation would disaggregate the total
flow into more homogeneous flows, for example:
• men between the ages of 17 and 40, who are entering the United
States to work as migrant farm laborers;
• pregnant women who are entering the United States to ensure that
their child has U.S. citizenship;
• people who are joining family members already established in the
United States (either legally or illegally); and
• other cases, such as drug smuggling, employees in the building trade,
and so on.
A time series model would be fit to each such flow, probably with
additional covariates to capture the dynamics of the process, such as the
impact of the recent drop in the construction sector or changes in the level
of enforcement activity at the border.
A time series model can easily capture the seasonality of farm labor
demand and construction starts, but it will probably not do a good job of
capturing changing dynamics of flows. Some aspects, such as changing levels
of law enforcement, can be handled through transfer functions, since the
times at which such interventions occur are known. But other aspects, such
as the changing impact of criminal cartels on the immigration pipeline, cannot
be handled this way. The time series literature is rich, and there are certainly strategies for
handling some of these kinds of interventions, but those methods quickly
become complex. Prado and West (2010) offer a recent survey of the area,
with special emphasis on dynamic time series models, which seem likely to
be the type of time series model most relevant to estimating immigration
flows when there may be feedback effects (e.g., if increased flow triggers
increased border security).
A common time series model is the autoregressive process. The simplest
such model treats, say, the total flow $Y(t)$ at time $t$ as a regression on the
past, so that the mean function $\mu(t)$ satisfies
$$\ln \mu(t) = \theta_0 + \sum_{j} \theta_j \ln \mu(t - j) + \varepsilon(t)$$
where $\varepsilon(t)$ is “white noise,” $t$ is the period of interest, and $j$ indexes the lag,
measured in prior time periods. If the time period is a month, then $j$ might take on
values from 1 to 12 when the model postulates that the flow in month $t$
depends on the flows during the previous year. In the case of immigration
flow, one expects that the coefficients $\theta_1$ and $\theta_{12}$ are both positive, since it
seems likely that recent secular trends are well forecast by the flow rate in the
previous month, whereas seasonal effects are captured by the flow rate during
the same month in the preceding year. The coefficients for other months
may be well approximated by 0. In other words, the flow in, for example,
May 2012 can be expected to be positively correlated with the flow in April
2012 (because of local conditions), and also with the flow in May of 2011
(because migration tends to be seasonal). This is a common formulation
for time series models for employment and travel data. The flow in the
previous month captures recent events and trends, such as changes in the
U.S. economy; the flow in the same month of the previous year captures the
annual cycles associated with home construction and migrant farm labor.
However, it would surely be better to disaggregate the total flow by demo-
graphic characteristics, and then model those component flows separately.
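The two-lag version of this autoregression (previous month and same month of the previous year) can be fit by ordinary least squares. The sketch below makes several assumptions that should be read as illustrative only: it regresses observed log-flows rather than the unobserved mean, it sets all other lag coefficients to 0, and it uses a noise-free synthetic series generated from invented coefficients. The function name `fit_two_lag_ar` is hypothetical.

```python
import math

def fit_two_lag_ar(x, lag_a=1, lag_b=12):
    """Least-squares fit of x[t] = th0 + th_a * x[t-lag_a] + th_b * x[t-lag_b],
    solving the centered 2x2 normal equations in closed form."""
    start = max(lag_a, lag_b)
    y = x[start:]
    a = [x[t - lag_a] for t in range(start, len(x))]
    b = [x[t - lag_b] for t in range(start, len(x))]
    n = len(y)
    ma, mb, my = sum(a) / n, sum(b) / n, sum(y) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    say = sum((u - ma) * (w - my) for u, w in zip(a, y))
    sby = sum((v - mb) * (w - my) for v, w in zip(b, y))
    det = saa * sbb - sab ** 2
    th_a = (sbb * say - sab * sby) / det
    th_b = (saa * sby - sab * say) / det
    th0 = my - th_a * ma - th_b * mb
    return th0, th_a, th_b

# Synthetic monthly log-flows: one seasonal year, then the noise-free recursion.
true_theta = (1.0, 0.5, 0.4)  # invented values for theta_0, theta_1, theta_12
log_flow = [9.0 + 0.5 * math.sin(2 * math.pi * m / 12) for m in range(12)]
for t in range(12, 48):
    log_flow.append(true_theta[0]
                    + true_theta[1] * log_flow[t - 1]
                    + true_theta[2] * log_flow[t - 12])

theta0, theta1, theta12 = fit_two_lag_ar(log_flow)

# One-step-ahead forecast for the next month, back on the flow scale.
next_flow = math.exp(theta0 + theta1 * log_flow[-1] + theta12 * log_flow[-12])
```

Because the toy series contains no noise, the fit recovers the generating coefficients; with real data the estimates would be uncertain, and the same fit would be repeated for each demographic component.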
A more sophisticated instantiation of this strategy is to build a spatio-
temporal model for the flows. Such models extend time series analyses to in-
clude spatial correlation structure. Simple versions might allow association
among flows in Mexican states for decisions about whether to immigrate,
while more complicated versions might be able to capture discouragement
at particular border-crossing locations, redirection of flow due to fences or
smuggling cartels, and so on. In particular, this last aspect of redirection
offers the possibility of modeling the “squeezed balloon” aspect of cross-
border traffic, in which increased interdiction at one region simply relocates
the flow to a less monitored region.
Specifically, a standard example of a spatiotemporal model is the conditional
autoregressive (CAR) model. In this setting, such a model might treat the flow at
a particular time $t$ and location $s$ as a Poisson random variable with mean
$\mu_{st}$. The mean is then modeled as
$$\ln \mu_{st} = \beta_0 + \beta_1 x_1(s, t) + \cdots + \beta_p x_p(s, t) + \varepsilon(s, t),$$
where the log function is motivated as the natural link for a Poisson response and the
linear predictor relates the covariates to the response, as
previously discussed. Those covariates would typically include time series
terms, such as the flow in the previous month or in the same month of the
preceding year. The spatial structure can be incorporated in two ways: as
covariates (e.g., the population size in regions of Mexico), or through
correlation among the error terms $\varepsilon(s, t)$. The correlation structure is the most
likely avenue for handling the “squeezed balloon” effect. (See Banerjee and
colleagues (2004) for an extensive treatment of modern spatial modeling.)
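To make the role of the correlated error terms concrete, the sketch below writes out one common parameterization, the "proper" CAR, in which the precision matrix of the errors at a fixed time is $Q = \tau(D - \rho W)$, with $W$ the neighbor adjacency matrix and $D$ the diagonal matrix of neighbor counts. The four-sector chain, the parameter values, and the function names are assumptions made for illustration; a working model would be fit with specialized software and real sector geography.

```python
import math

# Hypothetical chain of four border sectors; sector i borders i-1 and i+1.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def car_precision(rho, tau=1.0):
    """Precision matrix Q = tau * (D - rho * W) of a proper CAR model."""
    n = len(neighbors)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = tau * len(neighbors[i])   # diagonal: number of neighbors
        for j in neighbors[i]:
            q[i][j] = -tau * rho            # off-diagonal: neighbor coupling
    return q

def car_conditional_mean(errors, s, rho):
    """E[eps_s | all other eps] = rho * (average of the neighbors' errors)."""
    nbrs = neighbors[s]
    return rho * sum(errors[j] for j in nbrs) / len(nbrs)

def poisson_mean(beta0, beta1, x, eps):
    """Poisson mean under the log link, with a single covariate x."""
    return math.exp(beta0 + beta1 * x + eps)
```

Any $|\rho| < 1$ gives a valid model here. A positive $\rho$ pulls neighboring sectors together, whereas a negative $\rho$ pushes them apart and is one possible (assumed) way to represent displacement of flow from a heavily enforced sector to its neighbors.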
As is typically the case, more sophisticated modeling approaches real-
ize their promise of improved accuracy only when data are available at
increasingly higher levels of resolution (both in space and time). Some
detailed spatial information about migration experiences at the municipal
level (intensity of out-migration) can be obtained from Mexican Census
data. If it is possible to cross the spatial information on migration with
other economic, social, and enforcement information (such as the intensity
of operations of organized crime along the U.S.–Mexico border), then it
could be possible to produce estimates of the number of attempts to cross
by sector and by time period.
To the panel’s knowledge, appropriately tailored and tested spatiotem-
poral models have not been previously used for immigration flow processes.
From that perspective, this modeling strategy may be better viewed as pro-
spective research, rather than a reasonable plan for near-term implementa-
tion. A first step would require exploring the availability of the data needed
to fit spatiotemporal models.
Simulation Models
The preceding discussion suggested that standard statistical methods,
even with reasonably enhanced survey data, may not be adequate to com-
pletely satisfy the needs of DHS. More recently developed strategies in-
clude simulation models, of which agent-based modeling is an example.
Although the basic idea behind agent-based modeling dates back to the
early 1900s, the approach is computationally intensive and therefore did
not become widely used until the 1990s. Today, agent-based modeling is
used in many disciplines, including economics (e.g., Holland and Miller,
1991), military preparedness, battlefield management, and epidemiological
planning (Caplat et al., 2008), to name a few. The approach has also been
implemented to address the question of migration of human populations
(Edwards, 2008).
Agent-based modeling represents a new strategy for modeling immi-
gration flows using data from surveys and official records. Such models
endow a set of artificial agents with “rules” and then observe the emergent
behavior as the agents interact with each other and their environment. The
notion of “agent” can be quite general. In weather forecasting, for example,
agents can be cubic kilometers of atmosphere, which exchange temperature,
pressure, and humidity according to the laws of physics. In traffic-flow
modeling, the agents are automobiles, with probabilistic rules that prefer
certain spacings, speeds, origins, and destinations, which are chosen by the
programmer to mimic the known activity of the community under study.
The best-known example of agent-based modeling is the “Sugarscape”
created by Epstein and Axtell (1996). In that program, agents are allocated
at random on a flat plane where a nutrient, “sugar,” grows at a fixed rate.
In order to survive, the agents must consume the sugar, which they do at a
faster rate than the sugar grows back. Thus, when the sugar is depleted, the
agent must move to a new location where the sugar has not been consumed.
The first layer of rules prescribes how agents move and generates migra-
tion patterns similar to those in hunter-gatherer societies. A second layer
of rules creates two genders, which reproduce when sufficient resources are
available; this leads to behaviors that reflect the differential equations seen
in population dynamics. Higher-order layers enable barter economics and
division of labor. The main point of Sugarscape is that seemingly complex
social behaviors can be generated by a handful of simple and transparent
rules (Epstein and Axtell, 1996).
A significant difficulty with agent-based models is that their statistical
properties are understudied. In general, it is not clear how one should make
principled uncertainty statements about such models, nor how one can as-
sess goodness-of-fit. On the other hand, these models enjoy a high level of
face validity: if the rule sets are reasonable, then the model may seem more
plausible than a model that encodes human behavior in complex mathemat-
ics. Also, agent-based models are easy to assess; if the emergent behavior is
unreasonable, then the model is inadequate.
In the context of modeling immigration flows with an agent-based
model, administrative data and surveys offer important opportunities for
model tuning and falsification. For example, consider rules of the follow-
ing kind:
• An agent decides whether to attempt illegal immigration according
to a coin toss, where the probability of heads is a function of the
agent’s age, income, marital status, the distance from the U.S. border,
and other relevant covariates.
• If the coin toss leads the agent to attempt to immigrate, then the
agent tries a certain number of times, until discouragement, where
the number of attempts is a probabilistic function of the agent’s
covariates.
• If the agent succeeds, then the agent will attempt to engage in
various kinds of activity in the United States, such as migrant labor,
home construction, joining a family member, and so on.
Obviously, these rules are simplistic and offered only as illustration.
The important point is that one can tune these rules, in principle, according
to data in the administrative records and surveys. If, in a given year, the age
mix of those interdicted at the border does not match the mix generated
by the agent-based model, then this indicates that the model is incorrectly
specified. More directly, the data enable the modeler to fit the functions that
determine how the covariates affect the coin toss, or how easily an agent
with certain characteristics will be discouraged.
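The illustrative rules above translate almost directly into code. The following sketch is deliberately simplistic and rests on assumptions that are not in the source: the logistic form and coefficients in `migrate_probability`, the uniform ranges for agent attributes, a fixed success probability per attempt, and the treatment of every failed attempt as an apprehension. All names are hypothetical.

```python
import math
import random

def migrate_probability(age, income, distance_km):
    """Hypothetical logistic rule for the 'coin toss'; the coefficients are
    placeholders that would be tuned to survey and administrative data."""
    z = 1.0 - 0.05 * abs(age - 25) - 0.0002 * income - 0.001 * distance_km
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n_agents, p_success, max_attempts, seed=0):
    """Run the three rules for n_agents artificial agents and tally outcomes."""
    rng = random.Random(seed)
    out = {"deciders": 0, "attempts": 0, "apprehensions": 0,
           "entries": 0, "apprehended_ages": []}
    for _ in range(n_agents):
        age = rng.randint(16, 60)
        income = rng.uniform(1000, 10000)
        distance_km = rng.uniform(50, 2000)
        # Rule 1: coin toss on whether to attempt migration at all.
        if rng.random() >= migrate_probability(age, income, distance_km):
            continue
        out["deciders"] += 1
        # Rule 2: the agent tries until success or discouragement.
        patience = rng.randint(1, max_attempts)
        for _ in range(patience):
            out["attempts"] += 1
            if rng.random() < p_success:
                out["entries"] += 1   # Rule 3 (activity after entry) is omitted here
                break
            out["apprehensions"] += 1
            out["apprehended_ages"].append(age)
    return out

result = simulate(n_agents=2000, p_success=0.4, max_attempts=3, seed=1)
```

The list `apprehended_ages` is the kind of output that could be compared with the age mix in apprehension records; a persistent mismatch would indicate that the rules are incorrectly specified.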
One can address the problem of making inference from agent-based
models in at least two ways. One way is to do sensitivity analyses and see
how the outputs vary across reasonable ranges of inputs. This is particu-
larly useful given that certain important information (e.g., the probability
of successfully crossing the border in the first, second, or later attempts)
is not available. A second way is to build an emulator, which creates a
mathematically simpler model that approximates the agent-based model.
Using methods introduced by O’Hagan (2001) and developed by Gramacy
and Lee (2008) and Higdon and colleagues (2008), one can use Bayesian
inferences to set credible regions on model outputs.
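Both ideas can be illustrated on a toy problem. In the sketch below, the "simulator" is a closed-form stand-in (the expected number of entries when each agent tries up to a fixed number of times), the swept input is the unknown per-attempt success probability, and the "emulator" is a crude piecewise-linear interpolant of the design runs. These are assumptions made to keep the example short; the interpolant is not the Bayesian Gaussian-process emulation of the work cited above and provides no credible regions.

```python
def expected_entries(n_deciders, p_success, max_attempts):
    """Stand-in simulator: expected successful entries when each agent
    makes up to max_attempts independent attempts."""
    return n_deciders * (1.0 - (1.0 - p_success) ** max_attempts)

def sensitivity_sweep(n_deciders, max_attempts, p_grid):
    """Sensitivity analysis: vary the unknown success probability over a
    plausible range and record the model output at each value."""
    return [(p, expected_entries(n_deciders, p, max_attempts)) for p in p_grid]

def interpolate(runs, p):
    """Crude emulator: piecewise-linear interpolation between design runs."""
    for (p0, y0), (p1, y1) in zip(runs, runs[1:]):
        if p0 <= p <= p1:
            w = (p - p0) / (p1 - p0)
            return y0 + w * (y1 - y0)
    raise ValueError("p lies outside the range of the design runs")

runs = sensitivity_sweep(n_deciders=1000, max_attempts=3,
                         p_grid=[0.1, 0.3, 0.5, 0.7, 0.9])
approx = interpolate(runs, 0.4)  # cheap prediction without rerunning the simulator
```

The spread of outputs across `runs` shows how strongly the estimate depends on an input that cannot be observed directly, which is the practical purpose of the sensitivity analysis.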
The advantages of agent-based models for inferring immigration flow
are that the method is relatively easy to program, relatively easy to validate,
and allows decision makers to flexibly explore “what if” scenarios. The
disadvantages are that the methods for formal statistical inference are still
under development and that building and fitting such a model requires ex-
pertise that DHS has yet to acquire. As discussed in Chapter 5, DHS would
be able to cheaply and effectively “outsource” this analysis to the scholarly
community if it were to make the administrative data from its enforcement
database more widely available.
In the context of immigration modeling, the Secure Border Initiative
(MITRE Corporation, 2008) attempted to produce a simulation model for
cross-border traffic that is essentially an agent-based model. That model
has been criticized for making ad hoc assumptions, and to the best of our
knowledge it has not been retrospectively validated against historical data
(Chang et al., 2012). Nonetheless, if DHS decides to pursue an agent-based
model as a strategy for producing flow estimates, the Secure Border Initia-
tive model is a natural starting point.
CONCLUSION
Existing surveys and administrative data sources do not suffice to es-
timate some important aspects of the migration process; two fundamental
data gaps include the proportion of undocumented migrants who cross the
border undetected and the proportion of migrants who were successfully
deterred after one or more apprehensions. The use of modeling approaches
informed by survey data and administrative data is therefore necessary
for estimating the flows of unauthorized migrants across the U.S.–Mexico
border. Any modeling approach, and the assumptions underlying it, will
need to keep track of mechanisms of change and be continually validated
against historical trends and data. Since all modeling approaches will have
their limitations, there is also much that could be learned by comparing
estimates from multiple methods.
Without access to DHS administrative data, the panel was unable to
assess the strengths and weaknesses of each modeling approach in the
context of estimating the components of illegal migration flows along the
U.S.–Mexico border. If the panel had had access to these data, it might
have been able to make some basic comparisons between the different
approaches and gain some insight into the accuracy of the information ob-
tained from surveys. As a specific example, consider the analysis carried out
using EMIF-N data in Chapter 5. The panel found that several probability
models appeared to fit the re-apprehension estimates from EMIF-N quite
well. If the apprehensions data from DHS had also been available, the panel
might have been able, at least to some extent, to validate (or fail to validate)
EMIF-N. The panel would also have been able to evaluate the impact of
violating standard model assumptions (e.g., the assumption of a constant
population size) on the performance of capture-recapture approaches. More
generally, using administrative data collected over several time periods,
the panel might have been able to fit models using earlier information and
evaluate them by comparing their predictions to observed data from later
periods. This out-of-sample validation approach would have allowed the
panel to compare the predictive ability of different models and explore the
importance of the various assumptions underpinning those models.
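The out-of-sample comparison described here can be sketched in a few lines. The example assumes a synthetic monthly series with a linear trend and a fixed seasonal pattern, a 36-month training window, a 12-month holdout, and two deliberately simple competing forecasters; the data and function names are hypothetical and merely stand in for the administrative series and candidate models the panel had in mind.

```python
def mean_abs_error(forecast, actual):
    """Average absolute forecast error over the holdout period."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

def naive_forecast(train, horizon):
    """Model A: repeat the last observed value."""
    return [train[-1]] * horizon

def seasonal_naive_forecast(train, horizon, period=12):
    """Model B: repeat the value observed one seasonal period earlier."""
    return [train[len(train) - period + (h % period)] for h in range(horizon)]

# Synthetic monthly flow: linear trend plus a fixed seasonal pattern.
season = [0, 10, 30, 50, 40, 20, 0, -10, -20, -30, -20, -10]
series = [100 + 0.5 * t + season[t % 12] for t in range(48)]

# Models see only the earlier periods; later periods are held out for scoring.
train, holdout = series[:36], series[36:]
scores = {
    "naive": mean_abs_error(naive_forecast(train, len(holdout)), holdout),
    "seasonal_naive": mean_abs_error(
        seasonal_naive_forecast(train, len(holdout)), holdout),
}
```

On this strongly seasonal toy series the seasonal forecaster has the smaller holdout error; the same scoring loop would apply to regression, time series, or capture-recapture estimates once comparable historical data were available.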
Although the panel was aware that DHS has been considering specific
modeling approaches (e.g., capture-recapture methods using apprehensions
data), it could not get access to the relevant technical reports commissioned
by DHS. Because the broader scientific community has not hitherto been
engaged with DHS in developing, applying, and continually refining specific
modeling approaches, the evidentiary base to which the panel could refer
was also limited. For all of these reasons, much of the discussion in this
chapter was general in nature. As was discussed in Chapter 5, DHS would
benefit from making the administrative data in its enforcement databases
publicly available to the research community, even if it were necessary to
protect potentially sensitive information through data masking, aggrega-
tion, and other such procedures.
• Conclusion 6.1: Modeling approaches, and the assumptions underlying
them, must keep track of changing mechanisms of migration
and be continually validated against historical trends and data.
Since all modeling approaches have their limitations, there is also
much that could be learned by comparing estimates from multiple
methods.