Appendix C
Small-Area Modeling in Space and Time with Multiple Data Sources
Small-area estimation (SAE) methods include a wide range of modeling techniques generally devised to create improved estimates for domains where direct sample-based estimates are not reliable because of small sample sizes. Sample surveys are usually designed to produce reliable estimates for large domains (such as at the state or national level). However, there has been widespread growing interest in developing estimates for progressively finer domains, which has led to rapid development of a multitude of SAE methods in recent years. In SAE, the term “small area” refers to any domain that does not have enough samples for reliable direct sample-based estimates. In various surveys, examples include small areas defined by the intersection of detailed industry, geography, demography, etc. County levels in NASS surveys are typical examples of “small areas.”
The underlying concept of SAE methods is to “borrow strength” by integrating information from multiple data sources, including survey data, or across time and space to improve estimates for small areas. Application of SAE methods at NASS has become increasingly important because of the demand for crop estimates at detailed geographic levels. Using traditional survey approaches to produce reliable estimates at these levels would require a much larger sample; this may be prohibitively expensive and impractical in terms of data collection and processing time. Consequently, reliance solely on sample-based estimates may result in many estimates being withheld from publication because of their poor quality. This type of situation is seen in the current NASS processing. Small-area modeling has been found to provide useful estimates for areas that are otherwise unpublishable; for areas that are considered publishable, it can also lead to
improved efficiency over direct survey estimates. Two key success stories of SAE are the Small Area Income and Poverty Estimates program and the Small Area Health Insurance Estimates program of the U.S. Census Bureau, described elsewhere in this report.
This appendix gives an overview of modeling approaches and suggests those that NASS might pursue in the near and longer terms. It goes on to describe uncertainty measures, characteristics of area and unit models, extension to time, and related problems.
MODELS
The universe of methods that may be considered for SAE to combine information from multiple sources to improve county-level crop estimates is large and varied. Some of the most promising/interesting include geospatial methods, machine learning, and Bayesian methods. These are not mutually exclusive. For example, Bayesian methods are used in geospatial analysis and time domain analysis, as well as with incorporation of multiple types of measurements. When used in a geospatial context, it can be helpful to include explicit spatial effects in order to broaden or borrow spatial support and thus to reduce uncertainty, especially when modeling estimates for small areas. A large number of model types have been proposed and demonstrated in applications, and can be found in texts on spatial statistics, spatial econometrics, and spatial modeling, notably those described in Cressie and Wikle (2011) and Haining (2003). Many of these models are implemented and readily accessible in software packages such as R.
In the classic SAE literature, there is a distinction between unit-level models (e.g., a model for a farm or for a Common Land Unit [CLU]) and area-level models (e.g., a model for a county). Both can be cast as linear mixed models with spatial random effects written as a spatial basis expansion (as there are now areal basis functions). If spatial models are used, then, roughly speaking, unit-level models correspond to the covariance (point-level) modeling approach, while area-level models correspond to the precision (polygon-level) approach. The majority of models available for point-level spatial and spatiotemporal modeling were developed in a noncomplex design setting, and often for Gaussian data.
The panel believes that the Bayesian approach holds great promise as recent developments have allowed combining design-based estimates with space–time smoothing models. For example, Mercer and colleagues (2015) effectively use a spatial Fay-Herriot (1979) model in the context of modeling childhood mortality based on complex survey data. The basic idea is to assume a hierarchical model in which the first stage is taken as the asymptotic distribution of the direct (design-based) estimator. Porter and colleagues (2015) use a similar model with an intrinsic conditional
autoregressive (ICAR) spatial model, and emphasize covariate modeling. You and Zhou (2011) discuss both ICAR and Leroux specifications for the spatial model. Researchers who have worked on multivariate extensions to area-level models include Ghosh and Datta and colleagues and Bell and colleagues (e.g., Franco and Bell, 2016). Multivariate space–time models have been developed by Bradley and colleagues (2015a).
Near-Term Approaches
The panel suggests starting with area-level models. It is straightforward to add covariates to such models. The covariates may be added via a simple linear model or via a more flexible form, such as those used in the machine learning literature; it would be best to begin with simple, interpretable models. In an area-level model, satellite data can be included by taking within-area averages (for example). Note that ecological bias can be avoided (under certain assumptions) by including the within-area variance of the variable in the model, as well as the mean of the variable (Wakefield, 2008).
NASS already has experience with area-level modeling (Cruze et al., 2016; Erciulescu et al., 2016). So far, the use of spatial random effects has not been extensive. The panel suggests that NASS begin by exploring county-level models using the area-level spatial Fay-Herriot model to describe survey measurements. Each alternative data source could be given its own data model, linked to the larger model in a hierarchical Bayes framework. The model could use Besag or Leroux spatial formulations. See Hodges and Reich (2010) for a discussion of possible spatial confounding. Computation with the integrated nested Laplace approximation (INLA) (Rue et al., 2009) approach is fast (as compared with Markov chain Monte Carlo [MCMC], which NASS has been using) and accurate. There is a reliable R implementation of the INLA method, though it is not a standard package.
Longer-Term Approaches
Modeling at the unit level may be more difficult than at the area level but potentially could lead to improved estimates, especially if auxiliary information is available at the unit level. To avoid estimation bias, the sample design should be properly accounted for in the model (for example, stratification could be accounted for by including fixed effects in the model, and cluster sampling by including random effects).
Point-level Fay-Herriot models also are possible. If a spatial Fay-Herriot unit model were used, then the data model could correspond to the asymptotic distribution of the direct estimate, with the spatial model appearing in the process model. Point-level models are somewhat troubling here as farms are not points, but can be very large. It may be that once CLU information
is available, the area-level modeling could be extended to farms, again with a caveat that farms vary significantly in size.
EXPRESSING UNCERTAINTY
The Bayesian approach to modeling naturally leads to intuitive measures of uncertainty. The fundamental output of a Bayesian analysis is a multivariate posterior distribution over all unknown quantities in the model. This distribution is typically of high dimension, and so summarization is required. In particular, summaries of the univariate posterior distribution of quantities of interest may be reported. For example, the posterior median (or posterior mean, if the posterior on the quantity of interest is symmetric) may be quoted along with such quantiles as the 2.5 percent and 97.5 percent, to give a 95 percent interval.
Maps of posterior summaries may be produced. For example, a map of posterior medians may be accompanied by a map of the width of an interval of some percentage coverage (for example 95%). Hatching may also indicate uncertainty. For example, one may map the posterior median at the county level, but hatch with increased hatching as the associated uncertainty (interval estimate, for example) increases.
CHARACTERISTICS OF POINT-LEVEL AND AREA-LEVEL MODELS
There are two main approaches to modeling spatial data:
Point-Level (or Unit-Level) Modeling
- This modeling conceptually treats space as continuous.
- Spatial modeling concentrates on specification of the covariance matrix (e.g., Stein, 1999).
- Intuitive isotropic correlation models based on distance lead to dense matrices, i.e., matrices with few zeroes.
- Unfortunately, if n (the number of units) is large, the fitting of the model is computationally expensive because one must carry out operations (determinants and inverses) on n x n matrices (Rue and Held, 2005).
- This approach is also often referred to as Gaussian random field (GRF) or geostatistical modeling.
- This approach is suited to unit-level (point) modeling.
- Much of the literature (particularly in the environmental sciences) splits the modeling into a data model and a process model.
- Numerous approaches have been suggested for modeling the continuous surface with efficient computation (to give a variance–covariance matrix that is guiding the development).
-
- Fixed-rank kriging (Cressie and Johanneson, 2008) and R implementation via the FRK package may be used.
- Lattice kriging (Nychka et al., 2015) and R implementation via the LatticeKrig package may be used.
- Stochastic partial differential equations (SPDE) (Lindgren et al., 2011) and R implementation via the INLA package may be used.
- Predictive processes (Banerjee et al., 2008) and R implementation in the spBayes package may be used.
- Bradley and colleagues (2016a) provide a review of approaches, with an emphasis on big data.
Polygon-Level (Area-Level) Modeling
- Modeling focuses on the precision (inverse covariance) matrix.
- This is also referred to as a Markovian or conditional modeling approach. Besag (1974) is an early influential reference. More specifically, the approach is also referred to as Gaussian Markov random field (GMRF) modeling.
- The key idea is to model local structure, which leads to sparse matrices (with zeros in the precision matrix corresponding to conditional independencies). There is less intuition on the implied covariances.
- Computation is very efficient with either MCMC or INLA (Rue et al., 2009).
- The approach discretizes space, usually based on administrative regions. This can lead to somewhat ad hoc neighborhood definitions.
- These are also called area (or aggregate) models. There has been much experience in different fields of county-level modeling, particularly with health and census data.
- Common approaches to area-level modeling include:
- intrinsic conditional autoregressive (ICAR) models (Besag et al., 1991);
- Leroux et al. (1999); and
- variants as used in INLA (Riebler et al., 2016).
- Neighbors need to be defined, and the most common method is based on sharing a common boundary.
EXTENSION TO TIME
Extension to time is conceptually straightforward, but joint space–time correlation models require care. Including time may not be important for county crop estimates, even though there is substantial correlation from year to year for some variables. Alternative data sources from the cur-
rent year may provide stronger predictors than previous-year crops data. There is a literature in spatial and spatiotemporal modeling of specifying first-order models rather than second-order models, as this is often easier. Conditional space–time models may also be simpler. See Cressie and Wikle (2011) for further information. See also work on multivariate spatiotemporal mixed-effects models by Bradley and colleagues (2015a).
RELATED PROBLEMS
The change of support problem (COSP) has seen much interest (Cressie, 1993; Gotway and Young, 2002; Cressie and Wikle, 2011; Gelfand, 2010; Bradley et al., 2016b). This problem occurs when one would like to make an inference at a particular spatial resolution, but the data are available at another resolution. Much of this work focuses on normal data and kriging-type approaches, in which block kriging is used. For example, Fuentes and Raftery (2005) combine point and aggregate pollution data, with the latter consisting of outputs from numerical models produced over a gridded surface using MCMC, and evaluate the block kriging integrals on a grid. Berrocal and colleagues (2010) considered the same class of problem, but added a time dimension and used a regression model with coefficients that varied spatially to relate the observed data to the modeled output. Bradley and colleagues (2015b, 2016b) describe non-Gaussian spatial change of support (COS) and space-time COS.
Diggle and colleagues (2013) take a different approach for discrete data and model various applications using log-Gaussian Cox point processes, including the reconstruction of a continuous spatial surface from aggregate data. Their approach is based on MCMC and follows Li and colleagues (2012) in simulating random locations of cases within areas, which is a computationally expensive step. Software to implement this approach is described in Taylor et al. (2015).