The migration process for undocumented people is complex and dynamic, as described in Chapter 2. Undocumented migrants and their agents adapt quickly to changes in resources and strategies on the U.S. side of the border. Furthermore, the migration of undocumented people responds not only to enforcement efforts by the United States but also to labor market factors in the United States and Mexico, local laws and regulations on both sides of the border, and “competing” traffic across the border—including the highly profitable drug traffic going north and arms traffic heading south.
To estimate the number of illegal crossings at the U.S.–Mexico border, enforcement agencies require information that is not only precise but also timely (e.g., available on a quarterly basis and soon after the end of a quarter). The fact that the migration process is highly dynamic makes this difficult. For example, neither the emergence of drug violence along the border nor the severe economic recession in the United States was anticipated as recently as 5 years ago. The need for accurate estimates at the border at a geographically detailed level introduces additional challenges. Any effective information system will have to be agile, adapting to altered flows and externalities as the process evolves.
No single data source is able to provide direct estimates of the number of illegal attempts to cross the U.S.–Mexico border (see the discussion in Chapters 4 and 5). While several U.S. and Mexican surveys (described in detail in Chapter 3) address specific aspects of the migration process, they tend to do so only in limited ways (as discussed in Chapter 4). A new
Similarly, administrative data collected by the U.S. Department of Homeland Security (DHS) along the U.S.–Mexico border (which were collected for purposes other than the estimation of migration flows) are likely to provide only a partial picture of the activities of undocumented migrants and cannot be used in isolation to draw inferences about migration flows (see Chapter 5). The difficulty in estimating flow from current data sources persists even if statistical modeling techniques, such as capture-recapture methodology and other sampling strategies, are used to estimate these hard-to-count populations.
Based on the panel’s conversations with U.S. Border Patrol (USBP) agents during site visits to Arizona and California, it is clear that USBP already attempts to combine information from different sources to forecast border-crossing activity, albeit in informal ways. In addition to whatever the surveys may indicate, agents make use of their own administrative data, their previous experience, and other sources of information that include, for example, occupancy rates of hotels on the Mexican side of the border, sign-cutting (i.e., observing and tracking footprints and other physical signs of migrant passage), and remote sensing data.
Building upon what USBP already does in practice, this chapter discusses more formal ways for combining varied sources of information to estimate unauthorized migration flows with geographic and annual/quarterly specificity. These methods include conventional approaches, such as probability models, regression models, and spatiotemporal processes, and more recent methods such as agent-based modeling.
To fit a model, one wants a training sample for which both the explanatory variables (such as economic pressure, enforcement effort, and point of origin) and the true values of the response variable (such as the flow of illegal immigrants at a specific portion of the border) are known. Such a training sample is difficult to obtain in this situation and will never be fully achieved. Nevertheless, a model for illegal flow will include many components for which data exist. For example, each border station records the number of people in different demographic segments who are interdicted that month, and surveys are available that indicate how many people in a particular town chose to seek work in the United States. A mathematical model for illegal immigration that is founded on good social science theory can be fit to the available data, and it offers reasonable hope of correctly tracking the unmeasured data. This hope can be approximately validated, or disconfirmed, if the model’s broad predictions for, say, the total number of illegal Mexican immigrants are not consistent with estimates obtained
TELEPHONE CARDS: A THOUGHT EXPERIMENT ON QUANTIFYING A DIFFICULT-TO-MEASURE POPULATION
Two glaring gaps in the information required to estimate the effectiveness of the resources that have been deployed at the U.S.–Mexico border during the past decade are the proportion of undocumented crossers who succeed in their first or later attempts and the proportion of apprehended migrants who are deterred from further crossing attempts. In the course of its deliberations, the panel discussed a number of different ideas concerning creative sampling methods for estimating different components of undocumented immigration. The panel describes here one such idea for quantifying one type of deterrence effect (i.e., the fraction of apprehended migrants who choose not to attempt to cross the border again). This simple thought experiment involves providing telephone cards to undocumented immigrants who are apprehended in the United States and are then returned to the Mexican side of the border.
Typically, individuals who intend to cross the border without documentation arrive in the border area and make arrangements for illegal crossing with assistance from a smuggler. Most of those who are apprehended during their first attempt and are returned to the Mexican side of the border will try to cross again within the next few days. If the second attempt is also unsuccessful, they tend to keep trying, usually over a period of several days, until finally they either succeed or give up. USBP could, in principle, provide a phone card from a Mexican telephone company to a randomly selected subset of apprehended migrants who are about to be returned to Mexico. The phone cards, which would come preloaded with a certain usage value, could be used to call from either side of the border, but only after the caller is identified as the person who actually received the card. The toll-free number to activate the card would differ depending on whether the individual was in the United States or Mexico at the time of activation. The fraction of individuals activating the card in Mexico would provide an estimate of the fraction of apprehended individuals who are deterred from crossing again, and the fraction activating the card in the United States would provide an estimate of the fraction of individuals that cross successfully on the next attempt.
Several practical problems would need to be resolved in order to implement the phone card experiment. First, phone cards would have to be sufficiently attractive so that migrants actually use them, but not so valuable that they render the program too expensive (or induce criminal elements to prey on returning migrants, or create a black market in phone cards). This issue might be addressed by experimenting with different phone card values and by varying the fraction of migrants who receive phone cards across time and border locations. Second, it would be necessary to ensure that the user of the phone card is the intended card recipient and not someone else to whom the card was given or sold. This issue might be addressed by having the migrant answer simple questions at the time of phone card activation based on demographic information collected at the time of apprehension. Third, it is necessary to ascertain whether the location from which the migrant activates the card is the migrant’s final destination (be it in Mexico or the United States). This issue might be addressed by having the card’s earliest possible activation date be 1 to 2 weeks after the apprehension, so that cards activated in Mexico would predominantly be activated by discouraged crossers, while cards activated in the United States would predominantly be activated by successful crossers. Finally, one would expect that successful undocumented migrants would be more reluctant to activate their cards in the United States than would unsuccessful undocumented migrants still in Mexico, so the undercount is likely to be different for successful undocumented crossers than for unsuccessful crossers who are deterred from further attempts. The risks of non-use and differential use cannot be ignored. Steps would need to be taken to address the concerns of undocumented migrants, such as allowing callers not to self-identify and providing assurances that the identity and locations of the callers are not required and would not be traced (beyond the country of origin).
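The estimation step of the phone card thought experiment, and its sensitivity to differential card use, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the activation counts, and especially the activation rates, which the experiment itself would not reveal and which would have to come from outside information. The naive estimate treats every activated card as equally likely to be activated on either side of the border; the adjusted estimate reweights by assumed activation rates.

```python
import math


def naive_deterred_share(activated_mx, activated_us):
    """Share of activated cards that were activated in Mexico.

    Returns the share and a binomial standard error. This ignores
    cards that were never activated (an assumption of the sketch).
    """
    total = activated_mx + activated_us
    share = activated_mx / total
    std_err = math.sqrt(share * (1.0 - share) / total)
    return share, std_err


def adjusted_deterred_share(activated_mx, activated_us, rate_mx, rate_us):
    """Reweight counts by ASSUMED activation rates on each side.

    rate_mx and rate_us are hypothetical probabilities that a card
    holder in Mexico or in the United States activates the card.
    """
    est_mx = activated_mx / rate_mx  # implied card holders in Mexico
    est_us = activated_us / rate_us  # implied card holders in the U.S.
    return est_mx / (est_mx + est_us)


# Hypothetical counts: 120 activations in Mexico, 280 in the United States.
share, std_err = naive_deterred_share(120, 280)

# If successful crossers are more reluctant to activate (0.4 versus 0.6),
# the implied deterred share falls below the naive figure.
adjusted = adjusted_deterred_share(120, 280, rate_mx=0.6, rate_us=0.4)
```

Running the adjustment over a range of plausible activation rates would show how far the deterrence estimate could move under differential use, which is the concern raised in the paragraph above.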
This thought experiment highlights the kinds of data that would be needed if apprehensions data were to be used to estimate stocks or flows of unauthorized immigrants. However, even though the phone card experiment might be useful in estimating the number of crossers who are successful after having been apprehended once or more than once, it still would not provide any information about individuals who cross successfully in their first attempt. One approach to counting such elusive populations is based on network sampling or link-tracing sampling (see Box 6-1). However, these methods require careful implementation and additional assumptions to be usable, and they are not yet sufficiently developed to be clearly helpful to DHS in filling the critical data gaps.
SURVEY AND ADMINISTRATIVE DATA—INFORMED MODELING
Statistical models can provide plausible descriptions of immigration behavior, and some aspects of their fit can be validated against available survey and administrative data. A model can be applied to historical data to see whether its predictions agree with results from previous surveys. Even though the available surveys do not directly address all questions of interest to DHS (see Chapters 3 and 4), if a statistical model agrees with the findings of the surveys on those aspects of flow that the surveys do capture, then one can reasonably expect that the model has predictive power for estimating other relevant aspects of flows. Similarly, if a model produces results that are not supported by previous data, then one of three conclusions is plausible: the model does not fit the data well, the migration process has changed significantly over time, or both these conditions apply. The model must be flexible, and one should expect that it will be necessary to extend it when new factors come into play, leading to a new round of model retrofitting and validation.
Beyond timeliness and the possibility of greater accuracy, modeling has additional advantages. A good model allows policy makers to explore “what if” scenarios by changing model inputs. In particular, DHS can explore the impact of different allocations of enforcement resources among border stations or the impact of new enforcement policies. More importantly, the process of building a good model can create a stronger understanding of the social process underlying immigration behavior. Finally, a good model should produce accurate estimates of prediction uncertainty. Predictions from a model that are not paired with estimates of their prediction error have limited value.
This chapter reviews several approaches to survey-informed modeling and, to the extent possible, offers some comparative guidance in the context of estimating unauthorized migration flows. There are at least three standard strategies for doing survey-informed estimation: build a probability model, fit a regression model, or employ a spatiotemporal model. Newer approaches such as agent-based modeling are based on simulation but still rely on survey and administrative data for parameter settings and model components. The estimates that result from the use of models should usually be fairly accurate if the data are representative and reliable, the model is valid, and the immigration process does not change. In this case, disaggregated data on migrants and nonmigrants that allow for the explicit modeling of the migration decision can be used to verify and validate findings from aggregate data, such as apprehensions.
BOX 6-1

Link-tracing sampling and its variant, respondent-driven sampling (Salganik and Heckathorn, 2004; Thompson and Seber, 1996), have been used for sampling elusive, hard-to-reach populations, such as unregulated workers, the homeless, drug users, and sex workers. These populations are typically characterized not only by the absence of a serviceable sampling frame but also by the presence of social relations among members of the population. These social connections provide a means of reaching individuals in the population via the people they know. Such approaches are often informally referred to as “snowball sampling.” Link-tracing sampling strategies such as snowball sampling and respondent-driven sampling are often used to leverage those social relations beyond the small population subgroup available to researchers. The initially selected group is referred to as the “seed” sample. Information about the social links of individuals in the seed sample to other members of the population is used to identify and contact members outside the original population subgroup available to researchers. In most applications of link-tracing sampling, the seeds are a convenience sample, so the probability of inclusion of any member is unknown. Therefore, a serious drawback of this type of sampling is that probability-based inferential methods are problematic. However, it is possible for the seeds in link-tracing designs to be selected randomly, even in applications to hard-to-reach populations, for example by using a spatial sampling frame (Felix-Medina and Thompson, 2004).a

Respondent-driven sampling is a variation of link-tracing sampling in which the respondents themselves choose and contact people they know and invite them to participate in the survey (Heckathorn, 1997, 2007; Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008). While it is possible to reduce the dependence of the final sample on the seeds, recent simulation studies (Handcock and Gile, 2010) suggest that substantial biases can remain. A common feature of networked populations is that they exhibit homophily by attributes; that is, the social ties are more likely to occur between people who have similar attributes. In the case of undocumented migrants, homophily might occur when the initial group of seeds includes men from the same geographic area in Mexico. While link-tracing designs are often effective at acquiring a sample, the degree to which data so collected can be considered a probability sample is unclear. To allow valid inference to the population, the designs need to be implemented carefully and the mechanism of selection of successive waves of the sample must be well understood. Recent research (Gile, 2011; Gile and Handcock, 2011) discusses model-based estimation methods that address the biases introduced by convenience samples of seeds. However, methodological development lags the data collection efforts.

There have been some applications of link-tracing sampling to unauthorized border crossings. Two such efforts have been discussed in Chapters 3 and 4: the Mexican Migration Project and the Mexican Migration Field Research Program. Neither of the two surveys results in samples that can be used for quantifying migration flows. Respondent-driven sampling does not appear to have been systematically used in the context of Mexican migration. Morral and colleagues (2011) discuss the potential use of respondent-driven sampling to estimate the stock of undocumented migrants, the probability of eluding capture at the border, and other quantities associated with the migration process.

One approach to network sampling is to start with seeds who are recent immigrants from Mexico and use them to recruit other recent immigrants. Key questions asked of them would include how long they have been in the United States since their most recent crossing and how many recent immigrants they know. The recruitment “coupons” for this respondent-driven sampling would not need to be physical (e.g., an identifying number would be sufficient). Methods have been developed to estimate population size from respondent-driven sampling data (Handcock et al., 2011; Salganik et al., 2011). Given the size of the flows, it is likely that network scale-up methods, which are a form of post-stratification, would be the most effective means to estimate the population size. However, significant methodological development and empirical testing are required before these methods can be recommended.

a See Thompson (2002) for an informative general discussion of link-tracing designs.

Massey and Singer (1995) developed a simple probability model for the number of unsuccessful attempts at illegal immigration before a successful crossing. Their basic model was a geometric distribution for the number of attempts before the first success, where it was assumed that attempts were independent trials with constant probability of success. The estimated probability of success was obtained from interview data on the number of people who were successful on the first try, the second try, and so forth. Using survey data on the number of crossing attempts by migrants, collected from family members in four Mexican states, they modified the initial geometric distribution into a Poisson regression model in which the response was the number of attempts before the first successful trip and the mean of the Poisson included covariates related to economic factors, gender, and other variables that might affect the success rate. They also (coarsely) corroborated their model’s predictions against data from Mexican surveys and against the data on the number of people legalized under the Immigration Reform and Control Act of 1986 (IRCA). Their estimates suggest that, as of 1995, the U.S.–Mexico border was becoming increasingly porous and the probability of apprehension on any given attempt was about one-third and falling. They concluded that about 98 percent of individuals who attempted to cross the border illegally were ultimately successful.
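The basic geometric model attributed to Massey and Singer (1995) can be illustrated with a short sketch. It assumes, as the model does, independent attempts with a constant probability of success, and it uses the textbook maximum likelihood estimator for a geometric distribution; the function names and the sample data are hypothetical and are not taken from their study.

```python
def estimate_success_probability(failures_before_success):
    """Maximum likelihood estimate of the per-attempt success probability.

    Each entry is one migrant's number of failed attempts before the
    first successful crossing. Under a geometric model with success
    probability p, the estimate is n / (n + total failures).
    """
    n = len(failures_before_success)
    total_failures = sum(failures_before_success)
    return n / (n + total_failures)


def prob_success_within(p_success, max_attempts):
    """Probability of at least one success in max_attempts independent tries."""
    return 1.0 - (1.0 - p_success) ** max_attempts


# Hypothetical interview data: failed attempts reported by ten migrants.
reported_failures = [0, 0, 1, 0, 2, 0, 1, 0, 0, 1]
p_hat = estimate_success_probability(reported_failures)

# Chance of eventually crossing if a migrant is willing to try up to 4 times.
within_four = prob_success_within(p_hat, 4)
```

Because interview data of this kind come only from people who eventually crossed, the sketch says nothing about migrants who gave up, which is one reason the simple model needed the extensions described in the text.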
Massey and Espinosa (1997) extended this methodology in various ways, for example by proposing a model for estimating the probability of a first trip and of recurrent trips. The extended model includes not only macroeconomic variables but also individual and household characteristics, migration experience of other members of the household, and macroeconomic variables from the community/country of origin. Additional variables, such as those related to the political and legal contexts in Mexico and the United States (the Bracero Program, the IRCA period, and so on) might also have been considered. Nonetheless, the situation at the border has changed markedly since 1997, and the panel has no confidence that these older models, which antedate the drug corridors, modern enforcement technology, and other innovations, can provide good guidance for the current era. Since the older models are unlikely to have the correct form, it would probably be necessary to rebuild them rather than just refit them with new data.1
While the policy environment can be updated in a rebuilt model, another shortcoming of much survey-based, regression-type modeling is the endogeneity of many of the migration determinants. In the presence of endogenous covariates and dual causality, the ability to simulate counterfactuals is compromised. More recent studies along the lines of Massey and Espinosa (1997) address this problem by using instrumental variables estimation (Angelucci, 2012; Gathmann, 2008; Orrenius, 1999).
Wein and colleagues (2009) and Liu and Wein (2008) extended the probability modeling to include compartmental modeling. They describe a system of four submodels, each of which is tuned with historical data and which interact to produce a probabilistic model for immigration flows. The four submodels are as follows:
• a multinomial logit model, as in Ben-Akiva and Lerman (1985), which gives a probabilistic description of the choices made by undocumented border crossers, such as the location of the attempt;
• an enforcement model, which describes the probabilities of interdiction as a function of enforcement effort and resources;
• a repatriation model, which describes how an apprehended alien is returned to Mexico; and
• an economic model that accounts for how supply and demand affect the wages of unskilled immigrant workers.
These submodels can become arbitrarily more sophisticated, incorporating elements of game theory, queuing theory, and portfolio analysis. The analyst must solve systems of non-linear equations or differential equations. Chang and colleagues (2012) developed a practical computational tool for implementing this model.
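The first of the four submodels, the multinomial logit choice model, can be illustrated with a minimal sketch. It shows only the standard logit formula, in which the probability of choosing an alternative is proportional to the exponential of its utility. The sector names, the utility values, and the linear enforcement penalty are assumptions made for illustration and are not taken from Wein and colleagues (2009).

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit probabilities from a dict of utilities.

    Subtracting the maximum utility before exponentiating leaves the
    probabilities unchanged and avoids numerical overflow.
    """
    top = max(utilities.values())
    weights = {name: math.exp(u - top) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sector_utilities(base_appeal, enforcement, penalty=1.0):
    """Hypothetical utility: base appeal minus a penalty for enforcement."""
    return {s: base_appeal[s] - penalty * enforcement[s] for s in base_appeal}


# Hypothetical crossing sectors with equal appeal but unequal enforcement.
appeal = {"sector_a": 1.0, "sector_b": 1.0, "sector_c": 1.0}
effort = {"sector_a": 2.0, "sector_b": 1.0, "sector_c": 0.0}
probs = choice_probabilities(sector_utilities(appeal, effort))
```

In this toy setting, raising enforcement in one sector lowers its utility and shifts choice probability toward the other sectors, which is the kind of interaction between the choice and enforcement submodels that the compartmental approach is meant to capture.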
The main difficulty with this approach, even in its most mathematically advanced form, is that the model is hard to tune from historical data since the various submodels rely on information that was collected using different designs, at different data scales, in different time frames, and with different degrees of precision. This makes it almost impossible to calculate the errors associated with model predictions. This difficulty is not unique to this multipart model, and Chang and colleagues (2012) consider the model to be among the most practical strategies for assessing cross-border flow. In principle, however, this type of complex probability modeling could be useful to address several aspects of illegal migration flows, with the exception of the probability of a successful first attempt.

1 Since the changing relationship between the U.S. and Mexican economies will also have a non-linear effect on the incentive to migrate, the older models are likely to have moved outside the range in which approximate linearity would allow simple retrofitting to work.
Multiple linear regression modeling is a standard tool in demography and econometrics. Regression has been applied to various problems in immigration studies, perhaps most pertinently by Lewer and Van den Berg (2007), but it is more commonly used to estimate the economic impact of undocumented laborers. When circumstances admit locally linear approximation to complex phenomena, regression models can be quite effective, even in nonlinear applications. They are easy to implement, and results are relatively robust to departures from model assumptions. Multiple regression is transparent: the coefficients are often directly interpretable, and standard statistical inference enables tests of those coefficients, sensitivity analysis, and the calculation of confidence intervals. In the context of illegal immigration, the response variable might be the total number of successful illegal crossing attempts in a month, or it might be a multivariate response, such as a vector of illegal crossings at each of a number of different locations.
To estimate immigration flows, an economics perspective would start by modeling them in terms of the difference between the earnings realizable in the United States and those realizable in the prospective migrant’s current home; one such approach has already been briefly discussed in Chapter 5. Migration costs are then subtracted from the potential gains, where costs include actual travel expenses, forgone earnings, and the disutility of being away from home (often ameliorated by migrant networks, which can also be readily modeled with the right data). The model might allow for benefits and costs to vary by age, gender, and education level. For example, young men, who have long work horizons and greater facility for learning English and who face less risk in an illicit border crossing, would have greater migration incentives than older people, women, and high-education workers (the latter have relatively high earnings in Mexico).
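The cost-benefit comparison just described can be written compactly. The notation below is introduced here only for illustration and is not the panel's: for a prospective migrant with characteristics $x$ (age, gender, education), let $w_{US}(x)$ and $w_{MX}(x)$ denote expected earnings in the two countries and let the three cost terms stand for travel expenses, forgone earnings, and the disutility of being away from home.

```latex
G(x) = \bigl[\, w_{US}(x) - w_{MX}(x) \,\bigr]
     - \bigl[\, c_{\text{travel}} + c_{\text{forgone}}(x) + c_{\text{away}}(x) \,\bigr]
```

Under this sketch, the incentive to migrate rises with $G(x)$. A migrant network would enter as a reduction in $c_{\text{away}}(x)$, and relatively high Mexican earnings for high-education workers would enter as a larger $w_{MX}(x)$.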
There is a large literature on models for immigration forecasting. Howe and Jackson (2005) recently surveyed this area, describing and comparing methodologies that have been adopted in the United States, Canada, and various European countries. But they emphasized that “[t]he poverty of explanatory models in the current practice of immigration projection contrasts sharply with the abundance of theories proposed and discussed by experts in a variety of social science and policy disciplines” (Howe and Jackson, 2005:19). This apparent gap suggests that there is potential for research on modeling approaches that formalize and combine the various theories on illegal immigration (e.g., those based on social networks, relative strength of dual economies, social capital effects, and policy analysis) into formal regression models whose accuracy can be assessed using historical data.
Nonetheless, much has already been done to apply regression methods, broadly defined, to immigration flows. McKenzie and Rapoport (2010), for example, used ordinary least squares regression on data from the 1997 National Survey of Population Dynamics (ENADID) to study how the education level of illegal immigrants depends upon demographic and economic variables. Their work responds in part to work by Orrenius and Zavodny (2005), who used a Cox proportional hazards regression model on data from the Mexican Migration Project to quantify the effect of economic conditions, border enforcement, and migrant networks on the education level of unauthorized border crossers. Massey and Espinosa (1997) undertook a broader examination of 41 covariates that, from one or more theoretical perspectives, might be linked to illegal immigration; their work was based on 25 samples drawn from border states in western Mexico. Even before that, Taylor (1987) built explicit economic models that used regression to estimate income gains that provided incentive for illegal immigration. There is a great deal more literature on the topic, some of which is discussed in Chapter 2. But the question that remains to be answered is whether these regression tools can produce estimates of illegal migration flows (or their components) with sufficient accuracy and timeliness to meet DHS’s needs.
On the positive side, there are some grounds for optimism. DHS wants to estimate flows in the recent past, which tends to be easier than the forecasting problem that has driven much of the literature, especially that summarized in Howe and Jackson (2005). Also, DHS has access to administrative data, which can help inform the social, economic, and political theories that have driven previous modeling efforts. On the negative side, much of the previous literature was developed using data from before 2005, and it is clear that the illegal immigration process has changed in important ways.
Nonetheless, using historical survey data and administrative records, a multiple regression model of some kind (including multivariate regression, principal components regression, Cox proportional hazards regression, and so on) could be fit to describe how a specific component of the total flow might depend on the explanatory variables. For example, if the component of interest were counts of males aged 17-30, then that variable would be used as the response, and a model would be fit that included such information as the expected difference in income opportunity between the United States and Mexico, the level of interdiction effort, and perhaps such covariates as the cost of being smuggled, the size of the Hispanic population in the United States, and indicator variables for seasonality, which affects migrant farm labor and the home construction industry. The flow for demographic segment i in period t, Yi(t), can be estimated from a national survey such as ENADID or the Mexican National Survey of Occupation and Employment (ENOE), albeit with the difficulties and limitations that have already been discussed in Chapter 4. The difference in expected income by age, education, and other individual attributes can be obtained from economic records, and the amount, distribution, and type of border enforcement are available from the administrative records. This kind of modeling can, in principle, be implemented for each demographic component. An estimate of the total flow is the sum of the estimates computed for each segment.
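The segment-by-segment strategy can be illustrated with a deliberately reduced sketch: one regression per demographic segment, with predictions summed to give a total. To stay self-contained it uses a single covariate and ordinary least squares, whereas the models discussed above would include several covariates and possibly other regression families. The segment names, the covariate, and all numbers are hypothetical.

```python
def fit_simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_total_flow(segment_data, new_x):
    """Fit one model per segment and sum the predictions at new_x."""
    total = 0.0
    for covariate, flows in segment_data.values():
        intercept, slope = fit_simple_ols(covariate, flows)
        total += intercept + slope * new_x
    return total


# Hypothetical history: an income-gap index and survey-based segment flows
# (in thousands) over four periods.
income_gap = [1.0, 2.0, 3.0, 4.0]
segments = {
    "males_17_30": (income_gap, [12.0, 14.0, 16.0, 18.0]),
    "all_others": (income_gap, [6.0, 7.0, 8.0, 9.0]),
}
total_flow = estimate_total_flow(segments, new_x=5.0)
```

A fuller version would also carry each segment's prediction error through the sum, since a total without an accompanying uncertainty estimate is of limited use.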
The simplest nonlinear model for immigration, the gravity model, assumes that the magnitude of the population flow between two locations is proportional to the product of the sizes of the populations in each of those locations and inversely proportional to some monotonically increasing function of the distance between the two locations (e.g., the square of the distance, as in the Newtonian model for gravitation in astronomy).2 In the context of this study, the measure of distance would be supplemented by variables that reflect the amount of border security, the costs and dangers of traveling, and so forth. Also, the product of the sizes of the populations might be replaced by the “attractiveness” of the United States, probably measured in terms of economic advantage.
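In symbols, the gravity model described above can be sketched as follows. The notation is assumed here for illustration: $F_{ij}$ is the flow from location $i$ to location $j$, $P_i$ and $P_j$ are the two population sizes, $d_{ij}$ is the distance between them, and $k > 0$ and $\beta > 0$ are parameters, with $\beta = 2$ giving the Newtonian inverse-square case.

```latex
F_{ij} = k \, \frac{P_i \, P_j}{d_{ij}^{\,\beta}}
\qquad\Longrightarrow\qquad
\log F_{ij} = \log k + \log P_i + \log P_j - \beta \log d_{ij}
```

Taking logarithms turns the model into a linear one, which is why regression can be used to fit it. The supplements mentioned in the text would enter as further terms: border security and travel costs as additions to the distance term, and a measure of economic attractiveness in place of the population product.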
Regression models are popular statistical tools because they are relatively easy to fit and lend themselves to straightforward interpretations. Relative to other approaches, they also tend to make the least consequential assumptions. However, the regression modeling approach will not do a good job of tracking changing mechanisms of migration, and it is unlikely to provide fine geographic detail given the kinds of survey data currently available. Over time, as the migration process evolves and the measures of the inputs become outdated, the validity of the model will drift, and its predictive accuracy will surely decline.
2 See Sen and Pruthi (1983) for a discussion of the use of regression in fitting a gravity model to migration flows and Sen and Smith (1995) for a discussion of gravity models in general.

A third standard approach is to model the correlation in flow data across time and space. A simple time series is a starting point, but it requires a great deal of faith in the model and its stability. For estimating immigration flows, one might model separate component flows, such as the following:
• men between the ages of 17 and 40, who are entering the United States to work as migrant farm laborers;
• pregnant women who are entering the United States to ensure that their child has U.S. citizenship;
• people who are joining family members already established in the United States (either legally or illegally); and
• other cases, such as drug smuggling, employees in the building trade, and so on.
A time series model would be fit to each such flow, probably with additional covariates to capture the dynamics of the process, such as the impact of the recent drop in the construction sector or changes in the level of enforcement activity at the border.
A time series model can easily capture the seasonality of farm labor demand and construction starts, but it will probably not do a good job of capturing changing dynamics of flows. Some aspects, such as changing levels of law enforcement, can be handled through transfer functions, since the times at which such interventions occur are known. But other aspects, such as the changing impact of criminal cartels on the immigration pipeline, cannot be handled in this way. The time series literature is rich, and there are certainly strategies for handling some of these kinds of interventions, but those methods quickly become complex. Prado and West (2010) offer a recent survey of the area, with special emphasis on dynamic time series models, which seem likely to be the type of time series model most relevant to estimating immigration flows when there may be feedback effects (e.g., if increased flow triggers increased border security).
A common time series model is the autoregressive process. The simplest such model treats, say, the total flow Y(t) at time t as a regression on the past, so that mean function μ(t) satisfies
where ε(t) is “white noise,” t is the period of interest, and j is the lag, that is, the number of periods before t. If the time period is a month, then j might take on values from 1 to 12 when the model postulates that the flow in month t depends on the flows during the previous year. In the case of immigration flow, one expects that the coefficients θ1 and θ12 are both positive, since it seems likely that recent secular trends are well forecast by the flow rate in the previous month, whereas seasonal effects are captured by the flow rate during the same month in the preceding year. The coefficients for other months may be well approximated by 0. In other words, the flow in, for example, May 2012 can be expected to be positively correlated with the flow in April 2012 (because of local conditions), and also with the flow in May 2011 (because migration tends to be seasonal). This is a common formulation for time series models for employment and travel data. The flow in the previous month captures recent events and trends, such as changes in the U.S. economy; the flow in the same month of the previous year captures the annual cycles associated with home construction and migrant farm labor. However, it would surely be better to disaggregate the total flow by demographic characteristics, and then model those component flows separately.
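The autoregressive formulation just described, with nonzero coefficients only at lags 1 and 12, can be sketched in a few lines. This is a minimal illustration rather than a method used by the panel: the monthly series is synthetic, the intercept term is an added assumption, and the function names `fit_seasonal_ar` and `forecast_next` are hypothetical.

```python
import numpy as np

def fit_seasonal_ar(y, lags=(1, 12)):
    """Least-squares fit of Y(t) = c + sum_j theta_j * Y(t - j) + eps(t)."""
    y = np.asarray(y, dtype=float)
    max_lag = max(lags)
    # Design matrix: an intercept column plus one column per lagged copy of y.
    columns = [np.ones(len(y) - max_lag)]
    for lag in lags:
        columns.append(y[max_lag - lag : len(y) - lag])
    X = np.column_stack(columns)
    target = y[max_lag:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef  # [intercept, theta for lags[0], theta for lags[1], ...]

def forecast_next(y, coef, lags=(1, 12)):
    """One-step-ahead forecast for the period right after the end of y."""
    y = np.asarray(y, dtype=float)
    return coef[0] + sum(c * y[-lag] for c, lag in zip(coef[1:], lags))

# Synthetic monthly "flow" (assumed values, not real data):
# the true coefficients are theta_1 = 0.3 and theta_12 = 0.5.
rng = np.random.default_rng(0)
n_months = 240
flows = np.empty(n_months)
flows[:12] = 100 + 20 * np.sin(2 * np.pi * np.arange(12) / 12)
for t in range(12, n_months):
    flows[t] = 20 + 0.3 * flows[t - 1] + 0.5 * flows[t - 12] + rng.normal(0, 2)

coef = fit_seasonal_ar(flows)
next_month = forecast_next(flows, coef)
print("intercept, theta_1, theta_12:", np.round(coef, 2))
print("forecast for next month:", round(float(next_month), 1))
```

Disaggregating by demographic group, as the text recommends, would amount to calling the same fitting routine once per component series.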
A more sophisticated instantiation of this strategy is to build a spatiotemporal model for the flows. Such models extend time series analyses to include spatial correlation structure. Simple versions might allow association among flows in Mexican states for decisions about whether to immigrate, while more complicated versions might be able to capture discouragement at particular border-crossing locations, redirection of flow due to fences or smuggling cartels, and so on. In particular, this last aspect of redirection offers the possibility of modeling the “squeezed balloon” aspect of cross-border traffic, in which increased interdiction at one region simply relocates the flow to a less monitored region.
A standard example of a spatiotemporal model is the conditional autoregressive (CAR) model. Here one might model the flow at a particular time t and location s as a Poisson random variable with mean µst. The mean is then modeled as ln µst = β0 + β1X1(s, t) + … + βpXp(s, t) + ε(s, t), where the log function is motivated as the natural link and the linear predictor relates the covariates X1, …, Xp to the response, as previously discussed. Those covariates would typically include time series terms, such as the flow in the previous month or in the same month of the preceding year. The spatial structure can be incorporated in two ways: as covariates (e.g., the population size in regions of Mexico), or through correlation among the error terms ε(s, t). The correlation structure is the most likely avenue for handling the “squeezed balloon” effect. (See Banerjee and colleagues (2004) for an extensive treatment of modern spatial modeling.)
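To make the structure of such a model concrete, the following sketch simulates counts from a Poisson model with a log link and spatially correlated errors drawn from a proper CAR prior. Everything specific here is an assumption made for illustration: five hypothetical border sectors arranged in a line, a single made-up covariate, and arbitrary values for `beta0`, `beta1`, `rho`, and `tau`. It simulates from the model and does not fit it to data.

```python
import numpy as np

def car_precision(adjacency, rho=0.9, tau=1.0):
    """Precision matrix Q = tau * (D - rho * W) of a proper CAR model.

    W is a symmetric 0/1 neighbor matrix and D holds each site's neighbor count.
    For |rho| < 1 and every site having a neighbor, Q is positive definite.
    """
    W = np.asarray(adjacency, dtype=float)
    D = np.diag(W.sum(axis=1))
    return tau * (D - rho * W)

def simulate_flows(adjacency, covariate, beta0, beta1, rho=0.9, tau=1.0, seed=0):
    """Draw counts with ln(mu_st) = beta0 + beta1 * X(s, t) + eps(s, t)."""
    rng = np.random.default_rng(seed)
    Q = car_precision(adjacency, rho, tau)
    cov = np.linalg.inv(Q)
    cov = (cov + cov.T) / 2  # remove tiny asymmetries from the inversion
    n_times, n_sites = covariate.shape
    # Errors are correlated across sites, independent across time periods.
    eps = rng.multivariate_normal(np.zeros(n_sites), cov, size=n_times)
    mu = np.exp(beta0 + beta1 * covariate + eps)
    return mu, rng.poisson(mu)

# Hypothetical layout: 5 sectors in a line, each adjacent to its neighbors.
n_sectors, n_months = 5, 24
W = np.zeros((n_sectors, n_sectors))
for i in range(n_sectors - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# Made-up covariate X(s, t), e.g. a standardized enforcement index.
X = np.random.default_rng(1).normal(size=(n_months, n_sectors))

Q = car_precision(W)
mu, counts = simulate_flows(W, X, beta0=4.0, beta1=-0.3)
print(counts[:3])
```

In this parameterization a positive `rho` makes neighboring errors move together, while a negative `rho` makes them move in opposite directions, which is one crude way a displacement effect could enter through the error structure.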
As is typically the case, more sophisticated modeling approaches realize their promise of improved accuracy only when data are available at higher levels of resolution (both in space and time). Some detailed spatial information about migration experiences at the municipal level (intensity of out-migration) can be obtained from Mexican Census data. If it is possible to combine the spatial information on migration with other economic, social, and enforcement information (such as the intensity of operations of organized crime along the U.S.–Mexico border), then it could be possible to produce estimates of the number of attempts to cross by sector and by time period.
To the panel’s knowledge, appropriately tailored and tested spatiotemporal models have not been previously used for immigration flow processes. From that perspective, this modeling strategy may be better viewed as prospective research, rather than a reasonable plan for near-term implementation. A first step would require exploring the availability of the data needed to fit spatiotemporal models.
The preceding discussion suggested that standard statistical methods, even with reasonably enhanced survey data, may not be adequate to completely satisfy the needs of DHS. More recently developed strategies include simulation models, of which agent-based modeling is an example. Although the basic idea behind agent-based modeling dates back to the early 1900s, the approach is computationally intensive and therefore did not become widely used until the 1990s. Today, agent-based modeling is used in many disciplines, including economics (e.g., Holland and Miller, 1991), military preparedness, battlefield management, and epidemiological planning (Caplat et al., 2008), to name a few. The approach has also been implemented to address the question of migration of human populations (Edwards, 2008).
Agent-based modeling represents a new strategy for modeling immigration flows using data from surveys and official records. Such models endow a set of artificial agents with “rules” and then observe the emergent behavior as the agents interact with each other and their environment. The notion of “agent” can be quite general. In weather forecasting, for example, agents can be cubic kilometers of atmosphere, which exchange temperature, pressure, and humidity according to the laws of physics. In traffic-flow modeling, the agents are automobiles, with probabilistic rules that prefer certain spacings, speeds, origins, and destinations, which are chosen by the programmer to mimic the known activity of the community under study.
The best-known example of agent-based modeling is the “Sugarscape” created by Epstein and Axtell (1996). In that program, agents are allocated at random on a flat plane where a nutrient, “sugar,” grows at a fixed rate. In order to survive, the agents must consume the sugar, which they do at a faster rate than the sugar grows back. Thus, when the sugar is depleted, the agent must move to a new location where the sugar has not been consumed. The first layer of rules prescribes how agents move and generates migration patterns similar to those in hunter-gatherer societies. A second layer of rules creates two genders, which reproduce when sufficient resources are available; this leads to behaviors that reflect the differential equations seen in population dynamics. Higher-order layers enable barter economics and division of labor. The main point of Sugarscape is that seemingly complex
A significant difficulty with agent-based models is that their statistical properties are understudied. In general, it is not clear how one should make principled uncertainty statements about such models, nor how one can assess goodness-of-fit. On the other hand, these models enjoy a high level of face validity: if the rule sets are reasonable, then the model may seem more plausible than a model that encodes human behavior in complex mathematics. Also, agent-based models are easy to assess; if the emergent behavior is unreasonable, then the model is inadequate.
In the context of modeling immigration flows with an agent-based model, administrative data and surveys offer important opportunities for model tuning and falsification. For example, consider rules of the following kind:
• An agent decides whether to attempt illegal immigration according to a coin toss, where the probability of heads is a function of the agent’s age, income, marital status, the distance from the U.S. border, and other relevant covariates.
• If the coin toss leads the agent to attempt to immigrate, then the agent tries a certain number of times, until discouragement, where the number of attempts is a probabilistic function of the agent’s covariates.
• If the agent succeeds, then the agent will attempt to engage in various kinds of activity in the United States, such as migrant labor, home construction, joining a family member, and so on.
Obviously, these rules are simplistic and offered only as illustration. The important point is that one can tune these rules, in principle, according to data in the administrative records and surveys. If, in a given year, the age mix of those interdicted at the border does not match the mix generated by the agent-based model, then this indicates that the model is incorrectly specified. More directly, the data enable the modeler to fit the functions that determine how the covariates affect the coin toss, or how easily an agent with certain characteristics will be discouraged.
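The three illustrative rules above can be written as a toy simulation. This is a sketch under loudly stated assumptions: the logistic form of `p_attempt`, the discouragement rule in `max_attempts`, the fixed apprehension probability, and the uniform choice of activity are all hypothetical placeholders, which in practice would be replaced by functions fitted to administrative and survey data.

```python
import math
import random
from dataclasses import dataclass

ACTIVITIES = ["farm labor", "construction", "joining family"]

@dataclass
class Agent:
    age: int
    distance_km: float  # distance from the U.S. border

def p_attempt(agent):
    """Rule 1 (hypothetical): logistic 'coin toss' probability from covariates."""
    score = 1.0 - 0.05 * (agent.age - 17) - 0.001 * agent.distance_km
    return 1.0 / (1.0 + math.exp(-score))

def max_attempts(agent, rng):
    """Rule 2 (hypothetical): attempts tolerated before discouragement (1 to 5)."""
    p_quit = min(0.9, 0.2 + 0.01 * (agent.age - 17))
    tries = 1
    while tries < 5 and rng.random() > p_quit:
        tries += 1
    return tries

def run(agents, p_apprehended=0.5, seed=0):
    """Apply the rules to every agent and tally the emergent totals."""
    rng = random.Random(seed)
    outcomes = {"stayed": 0, "deterred": 0, "entered": 0}
    activities = {name: 0 for name in ACTIVITIES}
    apprehensions = 0
    for agent in agents:
        if rng.random() >= p_attempt(agent):
            outcomes["stayed"] += 1
            continue
        for _ in range(max_attempts(agent, rng)):
            if rng.random() < p_apprehended:
                apprehensions += 1  # caught; may try again
            else:
                outcomes["entered"] += 1
                # Rule 3 (hypothetical): activity chosen uniformly at random.
                activities[rng.choice(ACTIVITIES)] += 1
                break
        else:
            outcomes["deterred"] += 1  # every tolerated attempt ended in apprehension
    return outcomes, apprehensions, activities

population = [Agent(age=17 + i % 40, distance_km=50.0 * (i % 30)) for i in range(1000)]
outcomes, apprehensions, activities = run(population)
print(outcomes, apprehensions, activities)
```

Comparing simulated quantities such as the apprehension count, or the age mix of apprehended agents, with the corresponding administrative figures is the kind of check the text describes for detecting a misspecified rule set.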
One can address the problem of making inference from agent-based models in at least two ways. One way is to do sensitivity analyses and see how the outputs vary across reasonable ranges of inputs. This is particularly useful given that certain important information (e.g., the probability of successfully crossing the border in the first, second, or later attempts) is not available. A second way is to build an emulator, which creates a mathematically simpler model that approximates the agent-based model. Using methods introduced by O’Hagan (2001) and developed by Gramacy
The advantages of agent-based models for inferring immigration flow are that the method is relatively easy to program, relatively easy to validate, and allows decision makers to flexibly explore “what if” scenarios. The disadvantages are that the methods for formal statistical inference are still under development and that building and fitting such a model requires expertise that DHS has yet to acquire. As discussed in Chapter 5, DHS would be able to cheaply and effectively “outsource” this analysis to the scholarly community if it were to make the administrative data from its enforcement database more widely available.
In the context of immigration modeling, the Secure Border Initiative (MITRE Corporation, 2008) attempted to produce a simulation model for cross-border traffic that is essentially an agent-based model. That model has been criticized for making ad hoc assumptions, and to the best of our knowledge it has not been retrospectively validated against historical data (Chang et al., 2012). Nonetheless, if DHS decides to pursue an agent-based model as a strategy for producing flow estimates, the Secure Border Initiative model is a natural starting point.
Existing surveys and administrative data sources do not suffice to estimate some important aspects of the migration process; two fundamental data gaps include the proportion of undocumented migrants who cross the border undetected and the proportion of migrants who were successfully deterred after one or more apprehensions. The use of modeling approaches informed by survey data and administrative data is therefore necessary for estimating the flows of unauthorized migrants across the U.S.–Mexico border. Any modeling approach, and the assumptions underlying it, will need to keep track of mechanisms of change and be continually validated against historical trends and data. Since all modeling approaches will have their limitations, there is also much that could be learned by comparing estimates from multiple methods.
Without access to DHS administrative data, the panel was unable to assess the strengths and weaknesses of each modeling approach in the context of estimating the components of illegal migration flows along the U.S.–Mexico border. If the panel had had access to these data, it might have been able to make some basic comparisons between the different approaches and gain some insight into the accuracy of the information obtained from surveys. As a specific example, consider the analysis carried out using EMIF-N data in Chapter 5. The panel found that several probability models appeared to fit the re-apprehension estimates from EMIF-N quite well. If the apprehensions data from DHS had also been available, the panel might have been able, at least to some extent, to validate (or fail to validate) EMIF-N. The panel would also have been able to evaluate the impact of violating standard model assumptions (e.g., the assumption of a constant population size) on the performance of capture-recapture approaches. More generally, using administrative data collected over several time periods, the panel might have been able to fit models using earlier information and evaluate them by comparing their predictions to observed data from later periods. This out-of-sample validation approach would have allowed the panel to compare the predictive ability of different models and explore the importance of the various assumptions underpinning those models.
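The out-of-sample comparison described here can be illustrated with a small sketch. It assumes a synthetic monthly series in place of the DHS data the panel could not access, and it uses two deliberately simple stand-in forecasters (`persistence` and `seasonal_naive`, both hypothetical names) rather than any model the panel evaluated.

```python
import numpy as np

def rolling_rmse(y, predict, start):
    """One-step-ahead RMSE over periods start..end.

    Each forecast for period t sees only the history y[:t], so later
    observations never leak into the prediction being scored.
    """
    errors = [y[t] - predict(y[:t]) for t in range(start, len(y))]
    return float(np.sqrt(np.mean(np.square(errors))))

def persistence(history):
    """Forecast next month with last month's value."""
    return history[-1]

def seasonal_naive(history):
    """Forecast next month with the value from the same month a year earlier."""
    return history[-12]

# Synthetic, strongly seasonal monthly series (assumed, not real data).
rng = np.random.default_rng(1)
months = np.arange(120)
y = 100 + 30 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 3, size=120)

holdout_start = 96  # first 8 years are history, last 2 years are held out
scores = {
    "persistence": rolling_rmse(y, persistence, holdout_start),
    "seasonal_naive": rolling_rmse(y, seasonal_naive, holdout_start),
}
print(scores)
```

The same scoring loop applies unchanged to any forecaster that maps a history to a one-step prediction, which makes it straightforward to rank competing models on identical held-out periods.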
Although the panel was aware that DHS has been considering specific modeling approaches (e.g., capture-recapture methods using apprehensions data), it could not get access to the relevant technical reports commissioned by DHS. Because the broader scientific community has not hitherto been engaged with DHS in developing, applying, and continually refining specific modeling approaches, the evidentiary base to which the panel could refer was also limited. For all of these reasons, much of the discussion in this chapter was general in nature. As was discussed in Chapter 5, DHS would benefit from making the administrative data in its enforcement databases publicly available to the research community, even if it were necessary to protect potentially sensitive information through data masking, aggregation, and other such procedures.
• Conclusion 6.1: Modeling approaches, and the assumptions underlying them, must keep track of changing mechanisms of migration and be continually validated against historical trends and data. Since all modeling approaches have their limitations, there is also much that could be learned by comparing estimates from multiple methods.