Improving Crop Estimates by Integrating Multiple Data Sources (2017)

Chapter: Appendix C: Small-Area Modeling in Space and Time with Multiple Data Sources

Suggested Citation:"Appendix C: Small-Area Modeling in Space and Time with Multiple Data Sources." National Academies of Sciences, Engineering, and Medicine. 2017. Improving Crop Estimates by Integrating Multiple Data Sources. Washington, DC: The National Academies Press. doi: 10.17226/24892.

Appendix C

Small-Area Modeling in Space and Time with Multiple Data Sources

Small-area estimation (SAE) methods include a wide range of modeling techniques devised to improve estimates for domains where direct sample-based estimates are unreliable because of small sample sizes. Sample surveys are usually designed to produce reliable estimates for large domains (such as the state or national level). However, widespread and growing interest in estimates for progressively finer domains has led to rapid development of a multitude of SAE methods in recent years. In SAE, the term “small area” refers to any domain that does not have enough sample for reliable direct sample-based estimates. Across surveys, examples include small areas defined by the intersection of detailed industry, geography, and demography. Counties in NASS surveys are typical examples of “small areas.”

The underlying concept of SAE methods is to “borrow strength” by integrating information from multiple data sources, including survey data, or across time and space to improve estimates for small areas. Application of SAE methods at NASS has become increasingly important because of the demand for crop estimates at detailed geographic levels. Using traditional survey approaches to produce reliable estimates at these levels would require a much larger sample; this may be prohibitively expensive and impractical in terms of data collection and processing time. Consequently, reliance solely on sample-based estimates may result in many estimates being withheld from publication because of their poor quality. This type of situation is seen in the current NASS processing. Small-area modeling has been found to provide useful estimates for areas that are otherwise unpublishable; for areas that are considered publishable, it can also lead to improved efficiency over direct survey estimates. Two key success stories of SAE are the Small Area Income and Poverty Estimates program and the Small Area Health Insurance Estimates program of the U.S. Census Bureau, described elsewhere in this report.

This appendix gives an overview of modeling approaches and suggests those that NASS might pursue in the near and longer terms. It goes on to describe uncertainty measures, characteristics of area and unit models, extension to time, and related problems.

MODELS

The universe of SAE methods that could combine information from multiple sources to improve county-level crop estimates is large and varied. Some of the most promising include geospatial methods, machine learning, and Bayesian methods. These are not mutually exclusive. For example, Bayesian methods are used in geospatial analysis and time domain analysis, as well as for incorporating multiple types of measurements. In a geospatial context, it can be helpful to include explicit spatial effects in order to borrow spatial support and thus reduce uncertainty, especially when modeling estimates for small areas. A large number of model types have been proposed and demonstrated in applications; they can be found in texts on spatial statistics, spatial econometrics, and spatial modeling, notably Cressie and Wikle (2011) and Haining (2003). Many of these models are implemented and readily accessible in software such as R.

In the classic SAE literature, there is a distinction between unit-level models (e.g., a model for a farm or for a Common Land Unit [CLU]) and area-level models (e.g., a model for a county). Both can be cast as linear mixed models with spatial random effects written as a spatial basis expansion (as there are now areal basis functions). If spatial models are used, then, roughly speaking, unit-level models correspond to the covariance (point-level) modeling approach, while area-level models correspond to the precision (polygon-level) approach. The majority of models available for point-level spatial and spatiotemporal modeling were developed in a noncomplex design setting, and often for Gaussian data.

The panel believes that the Bayesian approach holds great promise as recent developments have allowed combining design-based estimates with space–time smoothing models. For example, Mercer and colleagues (2015) effectively use a spatial Fay-Herriot (1979) model in the context of modeling childhood mortality based on complex survey data. The basic idea is to assume a hierarchical model in which the first stage is taken as the asymptotic distribution of the direct (design-based) estimator. Porter and colleagues (2015) use a similar model with an intrinsic conditional autoregressive (ICAR) spatial model, and emphasize covariate modeling. You and Zhou (2011) discuss both ICAR and Leroux specifications for the spatial model. Researchers who have worked on multivariate extensions to area-level models include Ghosh and Datta and colleagues, and Bell and colleagues (e.g., Franco and Bell, 2016). Multivariate space–time models have been developed by Bradley and colleagues (2015a).
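As a rough illustration of the hierarchical structure just described, an area-level spatial Fay-Herriot model might be written as follows. The notation is an assumption made for this sketch and is not taken from the cited papers: \hat{\theta}_i is the direct estimate for county i, \hat{V}_i is its estimated design variance (treated as known), x_i is a covariate vector, \partial i is the set of neighbors of county i, and n_i is the number of those neighbors.

```latex
\begin{aligned}
\hat{\theta}_i \mid \theta_i &\sim \mathrm{N}\!\left(\theta_i,\ \hat{V}_i\right), \\
\theta_i &= x_i^{\top}\beta + u_i, \\
u_i \mid u_{-i} &\sim \mathrm{N}\!\left(\frac{1}{n_i}\sum_{j \in \partial i} u_j,\ \frac{\sigma_u^{2}}{n_i}\right).
\end{aligned}
```

The first line is the data model (the asymptotic distribution of the direct estimator), the second links the county-level quantity to covariates, and the third is one common form of an ICAR prior on the spatial random effects.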

Near-Term Approaches

The panel suggests starting with area-level models. It is straightforward to add covariates to such models. The covariates may be added via a simple linear model or via a more flexible form, such as those used in the machine learning literature; it would be best to begin with simple, interpretable models. In an area-level model, satellite data can be included by taking within-area averages (for example). Note that ecological bias can be avoided (under certain assumptions) by including the within-area variance of the variable in the model, as well as the mean of the variable (Wakefield, 2008).
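The within-area summaries mentioned above can be sketched in a few lines. This is a minimal illustration only: the function name `area_summaries`, the pixel-to-county labels, and the use of the sample variance are assumptions made for the example, not a description of NASS processing.

```python
import numpy as np

def area_summaries(area_ids, values):
    """Return {area: (mean, variance)} for a pixel-level covariate.

    area_ids : label of the area (e.g., county) containing each pixel
    values   : covariate value for each pixel (e.g., a satellite index)
    The variance is the sample variance (ddof=1); areas with a single
    pixel are assigned variance 0.0 (an arbitrary choice for this sketch).
    """
    area_ids = np.asarray(area_ids)
    values = np.asarray(values, dtype=float)
    summaries = {}
    for area in np.unique(area_ids):
        v = values[area_ids == area]
        variance = v.var(ddof=1) if v.size > 1 else 0.0
        summaries[area.item()] = (v.mean(), variance)
    return summaries

# Hypothetical pixel values for two counties, "A" and "B".
demo = area_summaries(["A", "A", "A", "B", "B"], [1.0, 2.0, 3.0, 10.0, 14.0])
```

Both the mean and the variance for each area could then enter the area-level model as covariates.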

NASS already has experience with area-level modeling (Cruze et al., 2016; Erciulescu et al., 2016). So far, the use of spatial random effects has not been extensive. The panel suggests that NASS begin by exploring county-level models using the area-level spatial Fay-Herriot model to describe survey measurements. Each alternative data source could be given its own data model, linked to the larger model in a hierarchical Bayes framework. The model could use Besag or Leroux spatial formulations. See Hodges and Reich (2010) for a discussion of possible spatial confounding. Computation with the integrated nested Laplace approximation (INLA) (Rue et al., 2009) approach is fast (as compared with Markov chain Monte Carlo [MCMC], which NASS has been using) and accurate. There is a reliable R implementation of the INLA method, though it is not a standard package.
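To make the Leroux formulation concrete, the sketch below builds its precision matrix from a county adjacency matrix. The parameterization Q = tau * [lam * (D - W) + (1 - lam) * I] is one common way of writing the Leroux model; the function name, argument names, and the three-county example are assumptions for illustration.

```python
import numpy as np

def leroux_precision(W, lam, tau=1.0):
    """Precision matrix of a Leroux-type spatial random effect.

    W   : symmetric 0/1 adjacency matrix (W[i, j] = 1 if i and j are neighbors)
    lam : spatial dependence parameter in [0, 1];
          lam = 1 gives the (singular) ICAR precision D - W,
          lam = 0 gives independent effects
    tau : overall precision multiplier
    """
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("adjacency matrix must be symmetric")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    D = np.diag(W.sum(axis=1))  # number of neighbors on the diagonal
    return tau * (lam * (D - W) + (1.0 - lam) * np.eye(W.shape[0]))

# Three hypothetical counties in a row: 0 - 1 - 2.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
Q_half = leroux_precision(W, lam=0.5)
```

For any lam strictly less than 1 the matrix is positive definite, which is one practical reason the Leroux form is sometimes preferred to the intrinsic model.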

Longer-Term Approaches

Modeling at the unit level may be more difficult than at the area level but potentially could lead to improved estimates, especially if auxiliary information is available at the unit level. To avoid estimation bias, the sample design should be properly accounted for in the model (for example, stratification could be accounted for by including fixed effects in the model, and cluster sampling by including random effects).
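One way to write a unit-level model that reflects the design in this manner is sketched below. The notation is an assumption for illustration: y_{hcj} is the response for unit j in cluster c of stratum h, \alpha_h is a fixed stratum effect, and b_c is a cluster-level random effect.

```latex
\begin{aligned}
y_{hcj} &= x_{hcj}^{\top}\beta + \alpha_h + b_c + \varepsilon_{hcj}, \\
b_c &\sim \mathrm{N}\!\left(0,\ \sigma_b^{2}\right), \qquad
\varepsilon_{hcj} \sim \mathrm{N}\!\left(0,\ \sigma_\varepsilon^{2}\right).
\end{aligned}
```

A spatial random effect for the area containing the unit could be added to the linear predictor in the same way.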

Point-level Fay-Herriot models also are possible. If a spatial Fay-Herriot unit model were used, then the data model could correspond to the asymptotic distribution of the direct estimate, with the spatial model appearing in the process model. Point-level models are somewhat troubling here as farms are not points, but can be very large. It may be that once CLU information is available, the area-level modeling could be extended to farms, again with a caveat that farms vary significantly in size.

EXPRESSING UNCERTAINTY

The Bayesian approach to modeling naturally leads to intuitive measures of uncertainty. The fundamental output of a Bayesian analysis is a multivariate posterior distribution over all unknown quantities in the model. This distribution is typically of high dimension, and so summarization is required. In particular, summaries of the univariate posterior distribution of quantities of interest may be reported. For example, the posterior median (or posterior mean, if the posterior on the quantity of interest is symmetric) may be quoted along with such quantiles as the 2.5 percent and 97.5 percent, to give a 95 percent interval.
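Given posterior draws for a quantity of interest (for example, from MCMC output), the summaries described above could be computed as in this sketch. The function name and the returned dictionary layout are assumptions made for the example.

```python
import numpy as np

def posterior_summary(draws, level=0.95):
    """Summarize posterior draws by the median and an equal-tailed interval.

    draws : 1-D array of posterior samples for one quantity of interest
    level : interval coverage, e.g. 0.95 uses the 2.5 and 97.5 percent quantiles
    """
    draws = np.asarray(draws, dtype=float)
    alpha = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(draws, [alpha, 0.5, 1.0 - alpha])
    return {"median": median, "lower": lower, "upper": upper,
            "width": upper - lower}

# Hypothetical draws standing in for the posterior of a county-level estimate.
rng = np.random.default_rng(0)
summary = posterior_summary(rng.normal(loc=150.0, scale=10.0, size=4000))
```

The `width` entry is the kind of quantity that could be mapped alongside the posterior median to convey uncertainty.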

Maps of posterior summaries may be produced. For example, a map of posterior medians may be accompanied by a map of the width of an interval with a stated coverage (for example, 95 percent). Hatching may also indicate uncertainty. For example, one may map the posterior median at the county level and make the hatching denser as the associated uncertainty (the width of the interval estimate, for example) increases.

CHARACTERISTICS OF POINT-LEVEL AND AREA-LEVEL MODELS

There are two main approaches to modeling spatial data:

Point-Level (or Unit-Level) Modeling

  • This modeling conceptually treats space as continuous.
  • Spatial modeling concentrates on specification of the covariance matrix (e.g., Stein, 1999).
  • Intuitive isotropic correlation models based on distance lead to dense matrices, i.e., matrices with few zeros.
  • Unfortunately, if n (the number of units) is large, the fitting of the model is computationally expensive because one must carry out operations (determinants and inverses) on n × n matrices (Rue and Held, 2005).
  • This approach is also often referred to as Gaussian random field (GRF) or geostatistical modeling.
  • This approach is suited to unit-level (point) modeling.
  • Much of the literature (particularly in the environmental sciences) splits the modeling into a data model and a process model.
  • Numerous approaches have been suggested for modeling the continuous surface with efficient computation; the need for a tractable variance–covariance matrix guides their development.
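
The density of distance-based covariance matrices noted in the list above can be seen in a small sketch. The exponential correlation function, the parameter names, and the simulated locations are assumptions chosen for illustration; other isotropic families behave similarly.

```python
import numpy as np

def exponential_covariance(coords, sigma2=1.0, range_par=1.0):
    """Isotropic exponential covariance: C[i, j] = sigma2 * exp(-d_ij / range_par).

    coords : (n, 2) array of point locations
    Every pair of points has nonzero covariance, so the n x n matrix is dense.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2 * np.exp(-dist / range_par)

# Hypothetical unit locations in the unit square.
rng = np.random.default_rng(1)
locations = rng.uniform(size=(200, 2))
C = exponential_covariance(locations, sigma2=2.0, range_par=0.3)

# Model fitting needs factorizations such as this one, whose cost grows
# roughly with the cube of n for a dense matrix.
L = np.linalg.cholesky(C)
fraction_nonzero = np.count_nonzero(C) / C.size
```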

Polygon-Level (Area-Level) Modeling

  • Modeling focuses on the precision (inverse covariance) matrix.
  • This is also referred to as a Markovian or conditional modeling approach. Besag (1974) is an early influential reference. More specifically, the approach is also referred to as Gaussian Markov random field (GMRF) modeling.
  • The key idea is to model local structure, which leads to sparse matrices (with zeros in the precision matrix corresponding to conditional independencies). There is less intuition on the implied covariances.
  • Computation is very efficient with either MCMC or INLA (Rue et al., 2009).
  • The approach discretizes space, usually based on administrative regions. This can lead to somewhat ad hoc neighborhood definitions.
  • These are also called area (or aggregate) models. There has been much experience in different fields of county-level modeling, particularly with health and census data.
  • Common approaches to area-level modeling include the ICAR (Besag) and Leroux specifications discussed earlier in this appendix.
  • Neighbors need to be defined, and the most common method is based on sharing a common boundary.
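
The shared-boundary neighbor definition and the resulting sparse precision matrix can be sketched as follows. This is a simplified illustration: areas are hypothetical polygons given as vertex lists, two areas count as neighbors only when they share an identical edge, and the function names are assumptions. Real county boundaries would normally be handled with GIS software.

```python
import numpy as np

def polygon_edges(vertices):
    """Set of undirected edges of a polygon given as an ordered vertex list."""
    n = len(vertices)
    return {frozenset((vertices[i], vertices[(i + 1) % n])) for i in range(n)}

def adjacency_from_polygons(polygons):
    """0/1 adjacency matrix: areas are neighbors if they share an edge."""
    edges = [polygon_edges(p) for p in polygons]
    n = len(polygons)
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if edges[i] & edges[j]:  # at least one common boundary segment
                W[i, j] = W[j, i] = 1
    return W

def icar_precision(W):
    """ICAR structure matrix D - W; zeros mark conditional independence."""
    return np.diag(W.sum(axis=1)) - W

def unit_square(x, y):
    return [(x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1)]

# Hypothetical 3 x 3 grid of square "counties", indexed as 3 * y + x.
squares = [unit_square(x, y) for y in range(3) for x in range(3)]
W = adjacency_from_polygons(squares)
Q = icar_precision(W)
fraction_zero = np.mean(Q == 0)  # share of zero entries in the precision matrix
```

Areas that touch only at a corner are not neighbors under this rule, which is one example of the somewhat ad hoc choices involved in defining a neighborhood.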

EXTENSION TO TIME

Extension to time is conceptually straightforward, but joint space–time correlation models require care. Including time may not be important for county crop estimates, even though there is substantial correlation from year to year for some variables. Alternative data sources from the current year may provide stronger predictors than previous-year crop data. There is a literature in spatial and spatiotemporal modeling on specifying first-order models rather than second-order models, as this is often easier. Conditional space–time models may also be simpler. See Cressie and Wikle (2011) for further information. See also work on multivariate spatiotemporal mixed-effects models by Bradley and colleagues (2015a).
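As one hedged illustration of a conditional space–time specification, the spatial random effects in year t could be modeled given those in year t - 1. The notation is an assumption for this sketch and is not taken from the references: u_t is the vector of county effects in year t, \rho is a temporal autoregression parameter, and Q is a proper spatial precision matrix (for example, of Leroux form).

```latex
\begin{aligned}
\theta_{it} &= x_{it}^{\top}\beta + u_{it}, \\
u_t \mid u_{t-1} &\sim \mathrm{N}\!\left(\rho\, u_{t-1},\ Q^{-1}\right), \qquad |\rho| < 1 .
\end{aligned}
```

Specifying the model one year at a time in this way avoids writing down the full joint space–time covariance directly.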

RELATED PROBLEMS

The change of support problem (COSP) has seen much interest (Cressie, 1993; Gotway and Young, 2002; Cressie and Wikle, 2011; Gelfand, 2010; Bradley et al., 2016b). This problem occurs when one would like to make an inference at a particular spatial resolution, but the data are available at another resolution. Much of this work focuses on normal data and kriging-type approaches, in which block kriging is used. For example, Fuentes and Raftery (2005) combine point and aggregate pollution data, with the latter consisting of outputs from numerical models produced over a gridded surface using MCMC, and evaluate the block kriging integrals on a grid. Berrocal and colleagues (2010) considered the same class of problem, but added a time dimension and used a regression model with coefficients that varied spatially to relate the observed data to the modeled output. Bradley and colleagues (2015b, 2016b) describe non-Gaussian spatial change of support (COS) and space-time COS.
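The idea of evaluating block kriging integrals on a grid can be sketched by averaging a point-level covariance over the grid points that fall in each block. Everything here is an assumption for illustration (the exponential covariance, the function names, and the two hypothetical blocks); it is not the procedure of any of the cited papers.

```python
import numpy as np

def exp_cov(s, t, sigma2=1.0, range_par=1.0):
    """Point-level exponential covariance between location sets s and t."""
    d = np.sqrt(((s[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1))
    return sigma2 * np.exp(-d / range_par)

def block_covariance(points_a, points_b, sigma2=1.0, range_par=1.0):
    """Approximate covariance between the averages of the process over two
    blocks, each represented by a set of grid points: the double integral of
    the point covariance is replaced by a mean over all point pairs."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    return exp_cov(a, b, sigma2, range_par).mean()

def grid_points(x0, y0, size, m):
    """m x m grid of cell-center points inside a square block."""
    centers = (np.arange(m) + 0.5) * size / m
    xx, yy = np.meshgrid(x0 + centers, y0 + centers)
    return np.column_stack([xx.ravel(), yy.ravel()])

# Two hypothetical adjacent unit-square blocks, each discretized on a 10 x 10 grid.
block_a = grid_points(0.0, 0.0, 1.0, 10)
block_b = grid_points(1.0, 0.0, 1.0, 10)
var_a = block_covariance(block_a, block_a)   # variance of the block-A average
cov_ab = block_covariance(block_a, block_b)  # covariance between block averages
```

The variance of a block average is smaller than the point-level variance, which is one way in which inference changes with the spatial support.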

Diggle and colleagues (2013) take a different approach for discrete data and model various applications using log-Gaussian Cox point processes, including the reconstruction of a continuous spatial surface from aggregate data. Their approach is based on MCMC and follows Li and colleagues (2012) in simulating random locations of cases within areas, which is a computationally expensive step. Software to implement this approach is described in Taylor et al. (2015).
