5
Small-Area Estimation
With the notable exception of the estimation of nonprofit R&D funding amounts, the National Center for Science and Engineering Statistics (NCSES) primarily uses standard survey estimates (that is, direct survey-weighted totals and ratios) in its tabulations for National Patterns. Design-based survey regression estimators, although they can be more accurate in many survey applications, are often not used because exploratory studies to develop good survey predictors (whose totals are known or measured accurately by other instruments) are lacking. In relation to the consideration of more sophisticated estimation techniques, the charge to the steering committee included the issue of the responsiveness of the information content to users’ needs.
The workshop covered a set of techniques that could provide, for instance, estimates for state-level R&D funds for specific types of industries. This is a type of tabulation that NCSES believes its users would be very interested in obtaining. In addition, the subject of the distinction between measurement error and definitional vagueness, which is related to data quality, was raised during the discussion period; a summary of that discussion is included as a short section at the end of this chapter.
OVERVIEW
Julie Gershunskaya of the Bureau of Labor Statistics presented a survey of current methods for small-area estimation that have been found useful in various federal statistical applications. Such techniques have the potential to produce R&D statistics on more detailed domains for inclusion in future National Patterns reports. Currently, the National Patterns reports tabulate statistics on R&D funding primarily at the national level, but there are also state-level tabulations for some major categories of funders and performers for the current year. In addition, tabulations of industrial R&D funding are available for about 80 separate North American Industry Classification System (NAICS) codes, down to three digits of detail.¹ These efforts to provide information for subnational domains are commended. Kei Koizumi noted earlier in the workshop that many users would benefit from the publication of statistics on R&D funding for more detailed domains: possibilities include providing R&D funding for substate geographic levels or for domains defined by states crossed by type of industry. There may also be interest in providing future tabulations for particular categories of colleges and universities.
A small area is defined as a domain of interest for which the sample size is insufficient to make direct sample-based estimates of adequate precision. These “small areas” in the context of R&D can be geographic entities, industrial types, sociodemographic groups, or intersections of geography, industry, or demography. Small-area estimation methods are techniques that can be used, when the sample size is inadequate, to produce reliable estimates by using various additional sources of information from other domains or time periods. However, such methods do rely on various assumptions about how that information links to the information from the domain of interest.
Gershunskaya said at the outset that the best strategy to avoid reliance on small-area estimation is to provide for sufficiently reliable direct estimates for the domains of interest at the sample design stage. However, it is typical for surveys carried out by federal statistical agencies to have insufficient sample sizes to support estimates for small domains requested by the user communities. Hence, the need for small-area estimates is widespread.
Gershunskaya differentiated between direct and indirect estimates.
Direct estimates use the values on the variable of interest from only the
sample units for the domain and time period of interest. They are usually
unbiased or nearly so, but due to limited sample size, they can be unreliable.
Indirect estimates “borrow strength” outside the domain or time period
(or both) of interest and so are based on assumptions, either implicitly or
explicitly. As a result of their use of external information, indirect estimates
can have smaller variances than direct estimates, but they can be biased
if the assumptions on which they are based are not valid. The objective
therefore is to try to find an estimator with substantially reduced variance
but with only slightly increased bias.
¹ For 2007 data, see http://www.nsf.gov/statistics/nsf11301/pdf/tab58.pdf [January 2013].

DIRECT ESTIMATORS
Gershunskaya first reviewed the basic Horvitz-Thompson estimator
and then discussed several modifications to it.
Horvitz-Thompson
Introducing some notation, let the quantity of interest be denoted by $Y_d$ for domain $d$ of the population. In considering the application of these methods to R&D statistics, domains could be defined by states or by industries with a certain set of NAICS codes. Each sampled unit $j$ has an associated sample weight, denoted $w_j$, which is equal to the inverse of the unit’s probability of being selected.² (The rough interpretation is that the weight corresponds to the number of population units represented by each sampled unit.)
The Horvitz-Thompson estimator of $Y_d$ is
\[ \hat{Y}_d^{(HT)} = \sum_{j \in s_d} w_j y_j , \]
where $y_j$ is the measurement of interest for sample unit $j$, and $s_d$ denotes the set of sampled units in domain $d$. The Horvitz-Thompson estimator may be unreliable, especially for small domains. To address this, there are various alternative direct estimators that may outperform the Horvitz-Thompson estimator, especially when auxiliary data are available.
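As a concrete illustration of the weighted total just defined, the following minimal Python sketch computes a Horvitz-Thompson estimate for one domain. The function name and the four-firm sample are hypothetical and are not taken from the workshop materials; the weights are assumed to be pure design weights.

```python
def horvitz_thompson_total(y, w):
    """Weighted total: sum of w_j * y_j over the sampled units in one domain."""
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    return sum(wj * yj for wj, yj in zip(w, y))


# Hypothetical domain sample: R&D expenditure and design weights for four firms.
y_d = [120.0, 0.0, 45.0, 300.0]
w_d = [2.0, 10.0, 4.0, 1.0]  # assumed weight = 1 / selection probability

ht_estimate = horvitz_thompson_total(y_d, w_d)
print(ht_estimate)
```

With so few sampled units, a single firm with a large weight can dominate the total, which is the instability the text describes.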
Ratio Estimators
To discuss these estimators, additional notation is needed. Let $y_j$ denote the measurement of interest for sample unit $j$, let $x_j$ denote auxiliary data for sample unit $j$ (assumed univariate to start), and let $X_d$ be the known population total of the auxiliary variable for domain $d$, from administrative or census data. (Note that the dependence of both $y_j$ and $x_j$ on domain $d$ is not explicitly indicated in the notation to keep things more readable.) In the case of R&D statistics, $y_j$ could be the R&D expenditure for company $j$, $x_j$ could be the total payroll for company $j$, and $X_d$ could be the true population total payroll in a particular state. Then the ratio estimator, using sample data, is given by
\[ \hat{Y}_d^{(R)} = X_d \hat{B}_d , \qquad \hat{B}_d = \frac{\hat{Y}_d^{(HT)}}{\hat{X}_d^{(HT)}} , \]
where $\hat{Y}_d^{(HT)}$ and $\hat{X}_d^{(HT)}$ are the Horvitz-Thompson estimators of the respective population totals. If there is a substantial correlation between R&D expenditure and payroll size, the ratio estimator may provide a marked improvement over the Horvitz-Thompson estimator.
² In practice, survey weights are almost never design weights in the sense of being inverse selection probabilities; nonresponse adjustment or imputation (or both) change their properties (see, e.g., Särndal and Lundström, 2005).
A particular case of the ratio estimator is given for the situation where $x_j$ equals 1 if $j$ is in the $d$th domain and equals zero otherwise, which is referred to as the post-stratified estimator. In this case, letting $N_d$ be the number of population units in the domain (assumed to be known), and letting
\[ \hat{N}_d^{(HT)} = \sum_{j \in s_d} w_j \]
be the sample-based estimate of $N_d$, the post-stratified estimator can be written as
\[ \hat{Y}_d^{(PS)} = N_d \frac{\hat{Y}_d^{(HT)}}{\hat{N}_d^{(HT)}} . \]
The post-stratified estimator has improved performance in comparison with the Horvitz-Thompson estimator. However, when the domain sample size is small, the post-stratified estimator can still perform poorly.
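The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values, the known payroll total `X_d`, and the known unit count `N_d` are all hypothetical, and the post-stratified estimator is obtained simply by setting every auxiliary value to 1.

```python
def ht_total(values, weights):
    """Horvitz-Thompson total of `values` over the sampled units of a domain."""
    return sum(w * v for w, v in zip(weights, values))


def ratio_estimate(y, x, w, X_d):
    """Known auxiliary total X_d times the ratio of the HT totals of y and x."""
    return X_d * ht_total(y, w) / ht_total(x, w)


def post_stratified_estimate(y, w, N_d):
    """Special case of the ratio estimator with x_j = 1 for every sampled unit."""
    return ratio_estimate(y, [1.0] * len(y), w, N_d)


# Hypothetical domain sample: R&D expenditure, payroll, and design weights.
y_d = [120.0, 0.0, 45.0, 300.0]
x_d = [1000.0, 200.0, 500.0, 2500.0]
w_d = [2.0, 10.0, 4.0, 1.0]

ratio_est = ratio_estimate(y_d, x_d, w_d, X_d=10000.0)  # assumed known payroll total
ps_est = post_stratified_estimate(y_d, w_d, N_d=18.0)   # assumed known unit count
print(ratio_est, ps_est)
```

If the known total happens to equal its own Horvitz-Thompson estimate, the ratio estimator reduces to the Horvitz-Thompson estimator, so the adjustment only matters when the sample over- or under-represents the auxiliary variable.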
Generalized Regression Estimator
The ratio estimator can be expressed as a special case of the Generalized Regression (GREG) estimator:
\[ \hat{Y}_d^{(GREG)} = \hat{Y}_d^{(HT)} + \left( X_d - \hat{X}_d^{(HT)} \right)^T \hat{B}_d , \]
where:
$X_d$ is a vector of known population totals for domain $d$,
$\hat{X}_d^{(HT)}$ is a vector of Horvitz-Thompson estimates of $X_d$,
$\hat{Y}_d^{(HT)}$ is a Horvitz-Thompson estimate of $Y_d$, and
$\hat{B}_d$ is a vector of coefficients (derived from the sample using a particular formula).

This estimator also belongs to a variety called calibration estimators, as the second term here “corrects” (or “calibrates”) the Horvitz-Thompson estimator for $Y$ using known population totals for $X$.
Note that the estimator for $B_d$ is based on sample data. When the sample size is small, this estimate may be unstable. To address this, one can pool the data over domains to produce a single $\hat{B}$. The resulting modified direct estimator, known as the survey regression estimator, is expressed as follows:
\[ \hat{Y}_d^{(SR)} = \hat{Y}_d^{(HT)} + \left( X_d - \hat{X}_d^{(HT)} \right)^T \hat{B} . \]
Gershunskaya illustrated how this estimator might be applied to NCSES R&D data. Let $X_{im}$ be the known population payroll in industry-type $i$ and state $m$, $\hat{X}_{im}^{(HT)}$ be the Horvitz-Thompson estimate of payroll in industry-type $i$ and state $m$, $\hat{X}_i^{(HT)}$ be the Horvitz-Thompson estimate of the national payroll total for industry-type $i$, and $\hat{Y}_i^{(HT)}$ be the Horvitz-Thompson estimate of the national total of R&D funds for industry-type $i$. Then, one can compute
\[ \hat{B}_i = \frac{\hat{Y}_i^{(HT)}}{\hat{X}_i^{(HT)}} \]
using national data for industry $i$ in the survey regression estimator to estimate
\[ \hat{Y}_{im}^{(SR)} = \hat{Y}_{im}^{(HT)} + \left( X_{im} - \hat{X}_{im}^{(HT)} \right)^T \hat{B}_i . \]
Gershunskaya pointed out that, although $\hat{B}$ in the survey regression estimator is based on a larger sample, the effective sample size still equals the domain sample size. To see why that is so, one can rewrite the survey regression estimator as
\[ \hat{Y}_d^{(SR)} = X_d \hat{B} + \sum_{j \in s_d} w_j \left( y_j - x_j \hat{B} \right) , \]
which shows that the survey regression estimator is the sum of two parts: the fitted value from the regression model, based on the predictors for the domain of interest, and a bias correction obtained by weighting the regression residuals of the sampled units in that domain only. Therefore, the efficiency of the survey regression estimator depends on the variability of the residuals and on the domain sample size.
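The equivalence of the two ways of writing the survey regression estimator can be checked numerically. The sketch below assumes a single auxiliary variable (so the transpose is trivial) and uses a hypothetical pooled slope `B_pooled`; none of the numbers come from the workshop.

```python
def sr_calibration_form(y, x, w, X_d, B):
    """HT total of y plus the calibration correction (X_d - HT total of x) * B."""
    y_ht = sum(wj * yj for wj, yj in zip(w, y))
    x_ht = sum(wj * xj for wj, xj in zip(w, x))
    return y_ht + (X_d - x_ht) * B


def sr_residual_form(y, x, w, X_d, B):
    """Model part X_d * B plus the weighted sum of domain residuals."""
    residual_total = sum(wj * (yj - xj * B) for wj, yj, xj in zip(w, y, x))
    return X_d * B + residual_total


# Hypothetical domain sample and a pooled slope estimated elsewhere (assumption).
y_d = [120.0, 0.0, 45.0, 300.0]
x_d = [1000.0, 200.0, 500.0, 2500.0]
w_d = [2.0, 10.0, 4.0, 1.0]
B_pooled = 0.08

form_a = sr_calibration_form(y_d, x_d, w_d, X_d=10000.0, B=B_pooled)
form_b = sr_residual_form(y_d, x_d, w_d, X_d=10000.0, B=B_pooled)
print(form_a, form_b)
```

The residual term is summed over only the domain's own sampled units, which is why pooling the slope does not increase the effective sample size for the bias correction.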
INDIRECT ESTIMATORS
Gershunskaya then moved to a description of indirect estimators. As
she noted earlier, direct sample-based estimators are unbiased (or nearly so)
but they may have unacceptably large variances. To overcome this problem,
certain assumptions about similarity or relationships between areas or time
periods (or both) are made, and these assumptions allow one to use more
sample data, thus “borrowing strength.” Her first example of an indirect
estimator was the synthetic estimator, which is a sample-based estimator
for which the parameters estimated from larger (or combined) domains or
from other time periods are applied to a small area. She then discussed the
structure-preserving estimator, known as SPREE, and composite estimators.
Synthetic Estimator
To describe synthetic estimation, Gershunskaya began with the usual direct estimate of the sample domain mean from a simple random sample, namely:
\[ \bar{y}_d = \frac{1}{n_d} \sum_{j \in s_d} y_j . \]
Unfortunately, this estimator can be unreliable if the sample size in the domain is small, so one would want to use the data from the other domains to improve its reliability. One obvious candidate, assuming that means are constant over domains, would be the global average over all domains. The resulting estimator,
\[ \bar{y}_d^{(Syn)} = \frac{1}{n} \sum_{j \in s} y_j , \]
is an example of the synthetic estimator. It is much more stable, but it is very likely to be substantially biased because the assumption of a common mean across domains will rarely hold. If there are auxiliary variables, a more realistic assumption than the assumption of a common mean would be to assume, for example, a common regression slope across domains.
Consider again the survey regression estimator:
\[ \hat{Y}_d^{(SR)} = X_d \hat{B} + \sum_{j \in s_d} w_j \left( y_j - x_j \hat{B} \right) . \]
This estimator can be depicted as survey regression equals model plus bias correction. The “model” part of the survey regression estimator turns out to be a synthetic estimator,
\[ \hat{Y}_d^{(Syn)} = X_d \hat{B} . \]
To better understand synthetic estimation, consider an R&D example. A synthetic estimator of R&D expenditure in industry type $i$ and state $m$ is
\[ \hat{Y}_{im}^{(Syn)} = X_{im} \hat{B}_i , \]
where $\hat{B}_i = \hat{Y}_i^{(HT)} / \hat{X}_i^{(HT)}$ and it is assumed that the common ratio $B_i$ of R&D to total payroll holds across all states in industry type $i$.
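A minimal sketch of this synthetic estimator for a single industry follows. The national sample, the state names, and the state payroll totals are hypothetical; the only assumption carried over from the text is that one national ratio of R&D to payroll applies in every state.

```python
def common_ratio(y, x, w):
    """National ratio of the HT total of R&D (y) to the HT total of payroll (x)."""
    y_ht = sum(wj * yj for wj, yj in zip(w, y))
    x_ht = sum(wj * xj for wj, xj in zip(w, x))
    return y_ht / x_ht


# Hypothetical national sample for one industry type.
y_nat = [120.0, 0.0, 45.0, 300.0, 60.0]
x_nat = [1000.0, 200.0, 500.0, 2500.0, 900.0]
w_nat = [2.0, 10.0, 4.0, 1.0, 5.0]
B_i = common_ratio(y_nat, x_nat, w_nat)

# Hypothetical known payroll totals X_im for this industry in two states.
state_payroll = {"State A": 4000.0, "State B": 9000.0}
synthetic = {m: X_im * B_i for m, X_im in state_payroll.items()}
print(B_i, synthetic)
```

No state-level sample data enter the state estimates at all, which is what makes them stable and also what makes them biased when a state's true ratio differs from the national one.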
NCSES has already used a similar approach to produce a Survey of Industrial R&D state estimator, which is described in Slanta and Mulrow (2004). For her example, Gershunskaya said, R&D for state $m$ is estimated as
\[ \hat{Y}_m = Y_{m,s} + \hat{Y}_{m,c} , \]
where
\[ Y_{m,s} = \sum_{j \in s} y_{m,j} \]
is the observed sample total for R&D in state $m$, and $\hat{Y}_{m,c}$ is a prediction of the nonsampled part of the population for R&D in state $m$, which is computed as
\[ \hat{Y}_{m,c} = \sum_{i=1}^{I} R_{im} \hat{Y}_{i,c} , \]
where $R_{im}$ is the ratio of payroll in state $m$ to national payroll total for industry $i$, and
\[ \hat{Y}_{i,c} = \sum_{j \in s} (w_j - 1) y_{i,j} \]
is a prediction for the nonsampled part of R&D in industry type $i$. This approach relies on the assumption that in each industry type $i$, R&D is distributed among states proportionately to each state’s total payroll.
Gershunskaya compared the state estimator from Slanta and Mulrow (2004) with the synthetic estimator based on a common industry slope. For simplicity, Gershunskaya considered the estimator for the whole population, rather than for only the nonsampled part. The Slanta-Mulrow (SM) estimator can then be expressed as
\[ \hat{Y}_m^{(SM)} = \sum_{i=1}^{I} \frac{X_{im}}{X_i} \hat{Y}_i^{(HT)} , \]
and the common industry slope estimator can be expressed as
\[ \hat{Y}_m^{(CIS)} = \sum_{i=1}^{I} X_{im} \frac{\hat{Y}_i^{(HT)}}{\hat{X}_i^{(HT)}} . \]
Both estimators are synthetic estimators and are based on similar assumptions. Notice that in the denominators, the Slanta-Mulrow estimator uses the population total $X_i$, and the common industry slope estimator uses the Horvitz-Thompson estimator $\hat{X}_i^{(HT)}$ of the population total $X_i$. It might be worth evaluating these two competing estimators using BRDIS data. If indeed R&D is correlated with payroll, the common industry slope estimator may prove to be preferable.
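The difference between the two denominators can be seen in a small numerical sketch. All figures below (two industries, two states, and the Horvitz-Thompson totals) are hypothetical and are chosen only so that the estimated payroll totals differ from the known ones.

```python
# Hypothetical known payroll X_im by industry and state.
X = {
    "ind1": {"A": 4000.0, "B": 6000.0},
    "ind2": {"A": 1000.0, "B": 3000.0},
}
# Hypothetical national Horvitz-Thompson totals by industry.
Y_ht = {"ind1": 900.0, "ind2": 200.0}    # R&D
X_ht = {"ind1": 9000.0, "ind2": 5000.0}  # payroll, differs from the known totals


def slanta_mulrow(state):
    """Allocate each national R&D total by the state's share of known payroll X_i."""
    return sum(X[i][state] / sum(X[i].values()) * Y_ht[i] for i in X)


def common_industry_slope(state):
    """Apply the estimated national R&D-to-payroll ratio to known state payroll."""
    return sum(X[i][state] * Y_ht[i] / X_ht[i] for i in X)


sm = {m: slanta_mulrow(m) for m in ("A", "B")}
cis = {m: common_industry_slope(m) for m in ("A", "B")}
print(sm, cis)
```

Summed over states, the Slanta-Mulrow figures reproduce the national Horvitz-Thompson R&D totals exactly, whereas the common industry slope figures sum to the ratio estimates of those totals.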
SPREE
Another synthetic estimator is SPREE, the structure-preserving estimator. It is based on a two-dimensional table of estimates, with elements $C_{im}$, with one dimension indexed by $i$ and running from 1 to $I$ (e.g., type of industry) and the other dimension indexed by $m$ and running from 1 to $M$ (e.g., state). The $C_{im}$ here represents the total of R&D funds for all industries of a certain type in a given state. SPREE assumes that initial estimates of individual cell totals, $C_{im}$, are available from a previous census or from administrative data, though as such they are possibly substantially biased. This approach also assumes that the sample from a current survey is large enough so that one can obtain direct sample-based estimates for the marginal totals, denoted $\hat{Y}_i$ and $\hat{Y}_m$. The goal is to estimate the amount of R&D funding for each of the individual cells by updating them to be consistent with the marginal totals. Iterative proportional fitting (also known as raking) is a procedure that adjusts the cell totals $C_{im}$ so that the modified table conforms to the new marginal estimates. The revised cell totals are the new small-area estimates.
The implicit assumption is that the relative structure of the table has remained constant since the last census, that is,
\[ \frac{C_{ij} / C_{kj}}{C_{il} / C_{kl}} = \frac{Y_{ij} / Y_{kj}}{Y_{il} / Y_{kl}} \]
for any combination of indices $i$, $j$, $k$, and $l$.
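The raking step can be sketched as follows. This is a bare-bones illustration of iterative proportional fitting on a hypothetical 2-by-2 table with strictly positive cells and mutually consistent margins (both sets of margins sum to the same total); the function name, tolerance, and iteration cap are assumptions, not part of any NCSES procedure.

```python
def rake(cells, row_targets, col_targets, max_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns until both sets of margins match."""
    table = [row[:] for row in cells]
    for _ in range(max_iter):
        # Scale each row to its target total.
        for i, row in enumerate(table):
            total = sum(row)
            table[i] = [c * row_targets[i] / total for c in row]
        # Scale each column to its target total.
        for m, target in enumerate(col_targets):
            total = sum(row[m] for row in table)
            for row in table:
                row[m] *= target / total
        # Columns are exact after the last step; stop once rows also agree.
        if max(abs(sum(r) - t) for r, t in zip(table, row_targets)) < tol:
            break
    return table


# Hypothetical prior cell totals C_im (rows: industry types, columns: states).
prior = [[40.0, 10.0], [20.0, 30.0]]
# Hypothetical current direct estimates of the industry and state margins.
updated = rake(prior, row_targets=[60.0, 60.0], col_targets=[70.0, 50.0])

cross_before = prior[0][0] * prior[1][1] / (prior[0][1] * prior[1][0])
cross_after = updated[0][0] * updated[1][1] / (updated[0][1] * updated[1][0])
print(updated, cross_before, cross_after)
```

Because every step multiplies a whole row or a whole column by a constant, the cross-product ratio of the table is unchanged, which is the structure-preservation property stated in the displayed identity.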
In summary, Gershunskaya said, direct estimators are unbiased and should be used when the sample size is sufficient to produce reliable estimates. However, with small samples they have large variances. Synthetic estimators, in contrast, have smaller variances but they are usually based on strong assumptions, and therefore may be badly biased if their assumptions do not hold.
Composite Estimators
Gershunskaya then turned to another type of indirect estimator, composite estimators. They are convex combinations of direct and synthetic estimators, which provide a compromise between bias and variance. They can be expressed as follows:
\[ \hat{Y}_d^{(C)} = \nu_d \hat{Y}_d^{(Direct)} + (1 - \nu_d) \hat{Y}_d^{(Model)} . \]
The central question in using them is how one should choose the weights $\nu_d$. One possible approach is to define weights on the basis of sample coverage in the given area, e.g., selecting $\nu_d$ proportional to $\hat{N}_d / N_d$. However, this method fails to account for variation of the variable of interest in the area. A second possibility is to use weights that minimize the mean squared error of the resulting estimator. This second method depends on potentially unreliable estimates of the mean squared error of composite parts.
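The second weighting approach can be illustrated with a short sketch. It relies on a simplifying assumption that is not stated in the text: the errors of the direct and model-based parts are taken to be uncorrelated, in which case the composite mean squared error is $\nu^2 \, MSE_{Direct} + (1-\nu)^2 \, MSE_{Model}$ and is minimized at $\nu = MSE_{Model} / (MSE_{Direct} + MSE_{Model})$. All numbers are hypothetical.

```python
def composite(direct, model, nu):
    """Convex combination of a direct and a model-based (synthetic) estimate."""
    return nu * direct + (1.0 - nu) * model


def composite_mse(nu, mse_direct, mse_model):
    """MSE of the composite, ASSUMING the two component errors are uncorrelated."""
    return nu ** 2 * mse_direct + (1.0 - nu) ** 2 * mse_model


def mse_optimal_weight(mse_direct, mse_model):
    """Weight on the direct part that minimizes composite_mse."""
    return mse_model / (mse_direct + mse_model)


# Hypothetical inputs: a noisy direct estimate and a more stable synthetic one.
mse_direct, mse_model = 400.0, 100.0
nu_star = mse_optimal_weight(mse_direct, mse_model)
estimate = composite(direct=900.0, model=800.0, nu=nu_star)
print(nu_star, estimate, composite_mse(nu_star, mse_direct, mse_model))
```

In practice the two mean squared errors would themselves have to be estimated, which is the source of the unreliability noted above.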
Methods Based on Explicit Models
In contrast to these approaches that are based on implicit models, the final general category of estimators described by Gershunskaya covers methods based on explicit models. Explicitly stated modeling assumptions allow for the application of standard statistical methods for model selection, model evaluation, the estimation of model parameters, and the production of measures of uncertainty (e.g., confidence intervals, mean squared error) under the assumed model. Methods based on explicit models constitute the core of modern small-area methods.
The most popular small-area methods are based on either the linear mixed model (for continuous variables) or the generalized linear mixed model (for binary or count data). Two types of models are commonly used, area-level and unit-level models. An area-level model (with assumptions pertaining to the aggregate area level) is applied when only area-level auxiliary data are available (rather than auxiliary data for individual units). In this case, direct sample-based estimates play the role of individual data points in the model, with their sampling variances assumed to be known. Generally, area-level models are easier to apply than the unit-level models. One benefit from the application of these models is that they usually take into account the sample design.³
Unit-level models (with the assumptions based on relationships between
individual respondents) require different and more detailed information
(which is why they are seldom used by statistical agencies) and generally
rely on assumptions of independence of units, assumptions that are often
violated in clustered survey designs. But if the assumptions for unit-level
models are tenable and the unit-level data are available, one would want to
use them in place of area-level aggregated models for reasons of efficiency.
However, some complications can arise when trying to account for the
sample design.
Fay-Herriot Small-Area Model
Fay and Herriot (1979) introduced an area-level model in the context of the estimation of per capita income for small places. The authors used the following set of auxiliary variables: county-level per capita income, the value of owner-occupied housing, and the average adjusted gross income per exemption. Fay-Herriot models are often represented using two-level model assumptions, the sampling model and the linking model. The sampling model states that the direct sample estimator estimates the true population parameter without bias and with a certain (sampling) error. The linking model makes certain assumptions (e.g., linear regression relationship) about the true underlying values.
³ However, some areas may not have any sample. If areas are selected into the sample with unequal probabilities related to the true area means, bias may occur as a result.
In the Fay-Herriot model, the sample-based estimate is
\[ \hat{Y}_d^{(Direct)} = \theta_d + \varepsilon_d , \]
that is, the sum of an expected value plus an error term with zero mean and with its variance equal to the sampling variance of the direct sample estimate. The linking model for the mean can be written as
\[ \theta_d = X_d^T \beta + \nu_d . \]
This equation indicates that the mean of the sample estimate is expressed as a linear combination of the auxiliary variables $(X_d^T \beta)$ plus a model error, $\nu_d$, having mean zero and a constant variance, with the model error independent of the sampling error. The entire model can then be expressed as
\[ \hat{Y}_d^{(Direct)} = X_d^T \beta + \nu_d + \varepsilon_d , \]
which is a linear mixed model since it has both fixed and random effects. Under this model, the best unbiased linear estimator (in a certain well-defined sense) for $\theta_d$ has a composite form, as follows:
\[ \hat{\theta}_d = \gamma_d \hat{Y}_d^{(Direct)} + (1 - \gamma_d) X_d^T \hat{\beta} , \]
where
\[ \gamma_d = \frac{A}{A + V_d^{(Direct)}} , \]
$A$ is the variance of the random term in the linking model, and $V_d^{(Direct)}$ is the sampling variance, which is assumed known. The above composite form shows that the direct estimates are shrunk toward the synthetic part, where the smaller $A$ is (i.e., the better the linking model explains the underlying relationship), the more weight goes to the synthetic (i.e., model-based) part. Similarly, areas with estimates with larger sampling variances also have more weight allotted to the synthetic part.
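The shrinkage behavior of the composite form can be sketched as follows. The example makes two simplifying assumptions that go beyond the text: the model variance $A$ is treated as known (in applications it must be estimated), and there is a single auxiliary variable with no intercept, so that $\hat{\beta}$ is a weighted least squares slope with weights $1 / (A + V_d^{(Direct)})$. The three areas and all values are hypothetical.

```python
def fay_herriot(direct, x, v, A):
    """Composite estimates for a one-covariate area-level model with known A.

    direct: direct estimates by area; x: auxiliary value by area;
    v: sampling variances (assumed known); A: model variance (assumed known).
    """
    wts = [1.0 / (A + vd) for vd in v]
    # Weighted least squares slope (no intercept) under the combined model.
    beta = (sum(w * xd * yd for w, xd, yd in zip(wts, x, direct))
            / sum(w * xd * xd for w, xd in zip(wts, x)))
    gammas = [A / (A + vd) for vd in v]
    theta = [g * yd + (1.0 - g) * xd * beta
             for g, yd, xd in zip(gammas, direct, x)]
    return beta, gammas, theta


# Hypothetical direct estimates for three areas with increasing sampling variance.
direct = [10.0, 22.0, 28.0]
x = [1.0, 2.0, 3.0]
v = [1.0, 4.0, 16.0]

beta_hat, gammas, theta_hat = fay_herriot(direct, x, v, A=4.0)
print(beta_hat, gammas, theta_hat)
```

The area with the largest sampling variance receives the smallest weight on its direct estimate, so its composite estimate moves furthest toward the regression prediction.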
R&D Example
Gershunskaya then provided an example to show how one might produce small-area estimates of R&D funds for small domains defined by states and industry types. Let $\hat{Y}_{im}^{(Direct)}$ be a direct sample-based estimator for R&D in industry $i$ and state $m$ from BRDIS. The direct sample estimator provides unbiased measurement of the unobserved truth $\theta_{im}$, with some random error:
\[ \hat{Y}_{im}^{(Direct)} = \theta_{im} + \varepsilon_{im} \quad \text{(sampling model)}. \]
The assumption is that, ignoring an error term, the state-level R&D funds in industry type $i$ are proportional to the state’s total payroll, which can be expressed as
\[ \theta_{im} = X_{im} B_i + \nu_{im} \quad \text{(linking model)}. \]
The resulting small-area estimator can be written as
\[ \hat{\theta}_{im} = X_{im} \hat{B}_i + \gamma_{im} \left( \hat{Y}_{im}^{(Direct)} - X_{im} \hat{B}_i \right) , \quad \text{where } \gamma_{im} = \frac{\hat{A}_i}{\hat{A}_i + \hat{V}_{im}^{(Direct)}} . \]
Estimation of $B_i$ and $A_i$ is straightforward from the data. (A cautionary note: this application differs from the formal Fay-Herriot model since the variances of $\varepsilon_{im}$ must be estimated and can be inaccurate if they are based on small sample sizes.)
Unit-Level Small-Area Modeling
An example of a unit-level model is a small-area model of areas planted with corn and soybeans for 12 Iowa counties (Battese, Harter, and Fuller, 1988). The survey data consisted of $Y_{dj}$, the number of hectares of corn (or soybeans) per segment $j$ in county $d$. The auxiliary variables, collected by satellite, were $x_{1,dj}$, the number of pixels planted with corn per segment $j$ in county $d$, and $x_{2,dj}$, the number of pixels planted with soybeans per segment $j$ in county $d$. The model considered in the paper is called the nested-error regression:
\[ Y_{dj} = \beta_0 + \beta_1 x_{1,dj} + \beta_2 x_{2,dj} + \nu_d + \varepsilon_{dj} , \]
where the error terms are independent. The resulting small-area estimator is
\[ \hat{\theta}_d = \gamma_d \bar{y}_d + (1 - \gamma_d) \left( \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_{1d} + \hat{\beta}_2 \bar{x}_{2d} \right) , \]
where
\[ \gamma_d = \frac{\sigma_\nu^2}{\sigma_\nu^2 + \sigma_e^2 / n_d} . \]

Note that the larger the sample size of an area, the more relative weight is placed on the sample part of the weighted average. The regression coefficients and the error variances are easily estimated from the data. Both the Fay-Herriot and the Battese, Harter, and Fuller models are examples of linear mixed models, Gershunskaya noted.
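The dependence of the weight on the area sample size can be made concrete with a short sketch. The variance components below are hypothetical placeholders, not estimates from Battese, Harter, and Fuller (1988).

```python
def shrinkage_weight(sigma2_nu, sigma2_e, n_d):
    """Weight on the area sample mean: sigma_nu^2 / (sigma_nu^2 + sigma_e^2 / n_d)."""
    return sigma2_nu / (sigma2_nu + sigma2_e / n_d)


# Hypothetical variance components: between-area 1.0, within-area 4.0.
sigma2_nu, sigma2_e = 1.0, 4.0
weights_by_n = {n_d: shrinkage_weight(sigma2_nu, sigma2_e, n_d) for n_d in (1, 4, 16)}
print(weights_by_n)
```

As the number of sampled segments in an area grows, the weight approaches 1 and the estimate relies almost entirely on the area's own sample mean.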
Smoothing Over Time
None of the models presented so far examined the potential from use of the dependent variable collected from previous time periods. This is extremely relevant for R&D statistics, Gershunskaya said, since many of the surveys and censuses used as inputs for National Patterns have a relatively long, stable, historical series, often going back to the 1950s. As an example of a small-area spatial-temporal model, Gershunskaya described the results in Rao and Yu (1994). For areas $d = 1, \ldots, D$ and time periods $t = 1, \ldots, T$, assume that
\[ \hat{Y}_{dt}^{(Direct)} = X_{dt}^T \beta + \nu_d + u_{dt} + \varepsilon_{dt} , \]
where the $u_{dt}$ are random error terms that follow a first-order autoregressive process. In this case, a good small-area estimator (for the current period) is a weighted sum of the direct and synthetic estimators for the current period and model residuals from the previous time periods, namely,
\[ \hat{\theta}_{dT} = \gamma_{dT} \hat{Y}_{dT}^{(Direct)} + (1 - \gamma_{dT}) X_{dT}^T \hat{\beta} + \sum_{t=1}^{T-1} \gamma_{dt} \left( \hat{Y}_{dt}^{(Direct)} - X_{dt}^T \hat{\beta} \right) . \]
Modifications for Discrete Dependent Variables
Gershunskaya then briefly discussed models for discrete dependent variables. The most common case is when $y_{dj}$ is a binary variable. Assume that the quantity of interest is the small-area proportion
\[ P_d = N_d^{-1} \sum_{j \in d} y_{dj} . \]
Then one can formulate an area-level Fay-Herriot-type model using direct sample-based estimates of proportions. However, the area-level approach has shortcomings, one of which is that some areas may have no sample units reporting R&D and thus will be dropped from the model. A unit-level generalized linear mixed model may be more efficient in this case. Assume that $y_{dj}$ is 1 with probability $p_{dj}$ and 0 with probability $1 - p_{dj}$; then the standard model in this situation has the following form:
\[ \log \frac{p_{dj}}{1 - p_{dj}} = x_{dj}^T \beta + \nu_d . \]
Implementing Small-Area Modeling for National Patterns
Gershunskaya then indicated how NCSES could develop explicit small-area models using the National Patterns datasets. She first considered a unit-level model scenario. Assume that wages, employment, and possibly other covariates are obtained from administrative data for all businesses in the target population. Using sample data, one could establish a relationship between R&D funding and auxiliary variables by fitting the parameters of some explicit model. One could then apply the results of this model fitting to the prediction of R&D in the nonsampled part of the above models. However, because there is no explicit question on state-by-industry R&D in the BRDIS questionnaire, a proxy for it would have to be derived. (Although possible, it would currently be a laborious effort.) In such modeling, it would be important to account for the sample design in the variance estimation, which is a serious complication.
The second scenario proposed by Gershunskaya was for an area-level model. Here a current design-based ratio or regression estimator (or other area-level predictor(s), e.g., “true” population values available from an administrative file) could be used in the synthetic part of the composite estimator. It would also be useful to consider alternative direct estimators of R&D that could be used in the area-level model, Gershunskaya said, and she outlined a few possibilities of improved direct estimators (based on the theory developed by Sverchkov and Pfeffermann, 2004). Let $\delta_{m,j}$ be 1 if company $j$ reports R&D in state $m$, and 0 otherwise. If one does not have auxiliary information, an alternative direct sample estimator is
\[ \hat{Y}_m = \sum_{j \in s} y_{m,j} + (N_m - n_m) \frac{\sum_{j \in s} (w_j - 1) y_{m,j}}{\sum_{j \in s} \delta_{m,j} (w_j - 1)} . \]
As an analogue of the ratio estimator, using a company’s payroll $x_j$ or some other auxiliary variable (possibly payroll per employee), a modified form of the previous alternative direct sample estimator can be defined as
\[ \hat{Y}_m = \sum_{j \in s} y_{m,j} + X_{m,c} \frac{\sum_{j \in s} (w_j - 1) y_{m,j}}{\sum_{j \in s} \delta_{m,j} (w_j - 1) x_j} , \]
where $X_{m,c}$ is the total payroll in the nonsampled portion of state $m$. Finally, as an analogue of the modified direct estimator,
\[ \hat{Y}_m = \sum_{j \in s} y_{m,j} + \sum_{i=1}^{I} X_{im,c} \hat{B}_i + (N_m - n_m) \hat{M}_{m,c} , \]
where
\[ \hat{B}_i = \frac{\sum_{j \in s} y_{i,j}}{\sum_{j \in s} \delta_{i,j} x_j} , \quad \text{where } \delta_{i,j} = \begin{cases} 1 & \text{if company } j \text{ reports R\&D in industry } i \\ 0 & \text{otherwise} \end{cases} \]
and
\[ \hat{M}_{m,c} = \frac{\sum_{j \in s} \delta_{m,j} (w_j - 1) \left[ y_{m,j} - \sum_{i=1}^{I} x_{im,j} \hat{B}_i \right]}{\sum_{j \in s} \delta_{m,j} (w_j - 1)} . \]
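The first two alternative direct estimators can be sketched as follows. The sketch is illustrative only: the three-company sample, the known state count `N_m`, the nonsampled payroll total `X_mc`, and the reading of `n_m` as the number of sampled companies counted toward state m are all assumptions made for the example.

```python
def alt_direct_count(y_m, delta_m, w, N_m, n_m):
    """Sample total plus (N_m - n_m) times a weighted mean over reporting units."""
    num = sum((wj - 1.0) * yj for wj, yj in zip(w, y_m))
    den = sum(dj * (wj - 1.0) for dj, wj in zip(delta_m, w))
    if den == 0:
        raise ValueError("no reporting unit with weight above 1 in this state")
    return sum(y_m) + (N_m - n_m) * num / den


def alt_direct_payroll(y_m, delta_m, x, w, X_mc):
    """Sample total plus nonsampled payroll X_mc times a weighted R&D-to-payroll ratio."""
    num = sum((wj - 1.0) * yj for wj, yj in zip(w, y_m))
    den = sum(dj * (wj - 1.0) * xj for dj, wj, xj in zip(delta_m, w, x))
    if den == 0:
        raise ValueError("no reporting unit with weight above 1 in this state")
    return sum(y_m) + X_mc * num / den


# Hypothetical sample of three companies; the second reports no R&D in state m.
y_m = [50.0, 0.0, 30.0]       # R&D reported in state m
delta_m = [1, 0, 1]           # 1 if the company reports R&D in state m
w = [3.0, 5.0, 2.0]           # sample weights
x = [500.0, 800.0, 300.0]     # company payroll

est_count = alt_direct_count(y_m, delta_m, w, N_m=10, n_m=2)
est_payroll = alt_direct_payroll(y_m, delta_m, x, w, X_mc=2000.0)
print(est_count, est_payroll)
```

Both functions keep the observed sample total as is and predict only the nonsampled remainder, using weights of the form $w_j - 1$ for that prediction.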
Final Considerations and Discussion
Gershunskaya concluded her review of small-area estimation methods with a set of important considerations:
• It is important to plan for estimation for domains of interest at the design stage to ensure that one has direct estimates of some reliability to start off.
• Finding a set of good auxiliary variables is crucial for success in small-area modeling.
• Small-area estimation methods are based on assumptions, and therefore evaluation of the resulting estimates is vital.
• Using a statistical model supports a systematic approach for a given problem: (a) the need for explicitly stated assumptions, (b) the need for model selection and checking, and (c) the production of measures of uncertainty.
• It is important to account for the sample design (unequal probabilities of selection and clustering effects) in the model formulation and fitting.

Joel Horowitz pointed out that these models have a long history, and there are methodologies that have been developed that avoid the assumption of linearity or proportionality and that can accommodate estimation errors in the predictors. Gershunskaya agreed that there were nonparametric models that had such properties. She said that her presentation was already detailed and therefore some complicated issues could not be included. Eric Slud added that the sample survey context made some of this particular application more difficult than in the literature that Horowitz was referring to. Slud added that the key issue in applying these techniques is finding predictive covariates. Often, one is restricted to synthetic variables, often measured in more aggregated domains.
Another topic that emerged during the floor discussion was whether
there were likely to be productive predictors available in this context. Slud
said that clearly there were opportunities to use synthetic variables by
using information at higher levels of aggregation. However, the availability
of useful predictors at the respondent level was less clear and would be
known only by subject-matter researchers conducting some exploratory
data analysis.
John Jankowski said that one of NCSES’s highest areas of concern in
terms of data presentation is the state and subnational distribution of R&D
activity. He added that the business sector is the one for which this is most
relevant. He said that a small firm in Massachusetts sampled in BRDIS
could have a weight of 250, so the resulting direct small-area estimates
would likely be unreliable, but he noted that the Slanta and Mulrow (2004)
paper was successful in reducing the magnitude of that problem. However,
that work could not address the small-area distribution by industrial category
because, in the Survey of Industrial R&D, each company's R&D funds were
simply assigned to its major industry category.
Now, however, BRDIS has the entire distribution of R&D by industry
sector and so there is a great deal more potential for the use of small-area
estimation. Jankowski added that BRDIS also provides the geographic
distribution of such funds. Even though geography and industrial sector
are not simultaneously available, he said that it might now be possible to
produce such estimates through some hard work, though it is not certain.
Jankowski added that, as things stand now, if there is a large R&D-performing
company that is 51 percent in one category and 49 percent in another,
100 percent would be assigned to the first category, and users would notice
that. New technology gives us a chance to better distribute those funds to
industrial categories.
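
The contrast Jankowski described can be shown with a minimal sketch. The function names, the category labels, and the total of 100 (in arbitrary units) are hypothetical and chosen only to mirror the 51/49 example; this is not a description of NCSES's actual procedure.

```python
def assign_to_major_category(total_rd, shares):
    """Assign all R&D funds to the category with the largest share."""
    major = max(shares, key=shares.get)
    return {cat: (total_rd if cat == major else 0.0) for cat in shares}


def distribute_by_share(total_rd, shares):
    """Distribute R&D funds across categories in proportion to their shares."""
    total_share = sum(shares.values())
    return {cat: total_rd * share / total_share
            for cat, share in shares.items()}


# Hypothetical company: 100 units of R&D, split 51 percent / 49 percent.
shares = {"category_a": 0.51, "category_b": 0.49}
major_only = assign_to_major_category(100.0, shares)
proportional = distribute_by_share(100.0, shares)
```

Under the first rule, category_b receives nothing even though it accounts for nearly half of the company's R&D; under the second, each category receives its reported share.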
Christopher Hill was concerned that if the National Science Foundation
provided small-area estimates, there would be situations in which experts
in R&D funding would know that the estimates are incorrect because they
have local information. Slud responded that this is the case for every set of
small-area estimates. When such situations occur, they should be seen as
opportunities to improve either the quality of the input data, the form of
the model, or the variables included in the model. Many participants noted
that model validation is an important part of the application of these tech-
niques. Jankowski added that it was unclear in what form such
estimates would be made publicly available.
Slud pointed out that prior to application of this methodology, it is
important to explore how sensitive the results are to the model assump-
tions; that sensitivity depends on the size of the sampling errors relative to
the other errors that one might be able to quantify. Karen Kafadar pointed out that
one advantage with this methodology is that since you get standard errors
for your estimates, you can compare your estimates with ground truth and
know whether they are or are not consistent.
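
Kafadar's point can be sketched as a simple consistency check. The sketch assumes that the estimate is approximately normally distributed and that the ground-truth value is known without error; the function name, the critical value of 1.96 (a 95 percent interval), and the numbers are illustrative assumptions.

```python
def is_consistent(estimate, std_error, ground_truth, z_crit=1.96):
    """Return True if ground_truth lies within estimate +/- z_crit * std_error."""
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    return abs(estimate - ground_truth) <= z_crit * std_error


# Hypothetical small-area estimate of 120 compared with a known value of 100.
wide_interval = is_consistent(estimate=120.0, std_error=15.0, ground_truth=100.0)
narrow_interval = is_consistent(estimate=120.0, std_error=5.0, ground_truth=100.0)
```

With a standard error of 15, the difference of 20 is within the interval and the estimate is consistent with the known value; with a standard error of 5 it is not, which would prompt a closer look at the input data or the model.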
MEASUREMENT ERROR OR DEFINITIONAL VAGUENESS
As noted above, an issue that arose during the workshop concerned
the need to better understand sources of error underlying NCSES survey
and census responses. Survey data are subject to sampling error and
nonsampling error, with nonsampling error often decomposed into nonresponse
error and measurement error. As is almost always the case for survey
estimates, NCSES does not have a complete understanding of the magni-
tude of nonsampling error. Several participants suggested that it would be
beneficial for NCSES to investigate this topic, possibly through greater use
of reinterviews or comparison of survey and census responses with admin-
istrative records (see section on STAR METRICS in Chapter 4).
In particular, several participants pointed out that it would be impor-
tant to distinguish between true measurement error, that is, when the total
R&D funding level is misreported, and differences in interpretation, for
example, in distinguishing between what is applied research and what is
development. It was suggested that this issue could be addressed through the
use of focus groups and other forms of cognitive research. Alternatively, a
subsampling study could be carried out in which answers subject to possible
definitional vagueness would be followed up and resolved. However, several participants
acknowledged that such a study would be expensive and labor intensive.