Random Effects Models for Network Data
Peter D. Hoff*
Department of Statistics
University of Washington
Seattle, Washington
hoff@stat.washington.edu
www.stat.washington.edu/hoff
Abstract
One impediment to the statistical analysis of network data has been the difficulty
in modeling the dependence among the observations. In the very simple case of binary
(0-1) network data, some researchers have parameterized network dependence in terms
of exponential family representations. Accurate parameter estimation for such models
is quite difficult, and the most commonly used models often display a significant lack of
fit. Additionally, such models are generally limited to binary data. In contrast, random
effects models have been a widely successful tool in capturing statistical dependence
for a variety of data types, and allow for prediction, imputation, and hypothesis testing
within a general regression context. We propose novel random effects structures
to capture network dependence, which can also provide graphical representations of
network structure and variability.
1 Network Dependence
Network data typically consist of a set of n nodes and a relational tie $y_{i,j}$, measured on each
ordered pair of nodes $i, j = 1, \ldots, n$. This framework has many applications, including the
study of war, trade, the behavior of epidemics, the interconnectedness of the World Wide
Web, and telephone calling patterns.
It is often of interest to relate each network response $y_{i,j}$ to a possibly pair-specific, vector-valued
predictor variable $x_{i,j}$. A flexible framework for doing so is the generalized linear
model (see, for example, McCullagh and Nelder 1983), in which the expected value of the
response is modeled as a function of a linear predictor $\beta' x_{i,j}$, where $\beta$ is an unknown vector
of regression coefficients to be estimated from the data. The ordinary regression model
$E[y_{i,j}] = \beta' x_{i,j}$ is perhaps the most commonly used model of this type. A generalized
linear model appropriate for binary data is logistic regression, which relates the expectation of the
response to the regression variable via the relation $g(E[y_{i,j}]) = \beta' x_{i,j}$, where $g(p) = \log\frac{p}{1-p}$.
*This research was supported by Office of Naval Research grant N00014-02-1-1011.
DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS
As an example of the use of such statistical models, consider the analysis of strong
friendship ties among 13 boys and 14 girls in a sixth-grade classroom, as collected by Hansell
(1984). Each student was asked if they liked each other student "a lot", "some", or "not
much". A strong friendship tie is considered present if a student likes another student "a lot".
Also recorded is the sex of each student. The data, presented in Figure 1, suggest a general
preference for same-sex friendship ties. Of potential interest is statistical estimation of this
preference, as well as a confidence interval for its value. One approach for such statistical
analysis would be to formulate the logistic regression model
\[ g(E[y_{i,j} \mid x_{i,j}, \beta]) = \beta_0 + \beta_1 x_{i,j}, \]
where $x_{i,j}$ is one if children i and j are of the same sex, and zero otherwise, and $\beta = (\beta_0, \beta_1)$
are parameters to be estimated.
Figure 1: (a) Sociomatrix for friendship data: Rows and columns 1-13 are boys, 14-27 are
girls; (b) Graphical representation of friendship data: Dark blue lines are reciprocated ties.
Estimation of regression coefficients $\beta$ typically proceeds under the assumption that the
observations are conditionally independent given $\beta$ and the $x_{i,j}$'s. However, this assumption
is often violated by many network datasets. For example, the data on friendship ties display
several types of dependence:
Within-node dependence: The number of ties sent by each student varies considerably,
ranging from 0 to 19 with a mean of 5.8 and a standard deviation of 4.7 (the standard
deviation of the number of ties received was 3.2). This node-level variability suggests
that responses from the same individual are positively dependent, in that the probability
that $y_{i,j} = 1$ (i sends a tie to j) is high if we know $y_{i,k} = 1$ for lots of other nodes
k, and lower if $y_{i,k}$ is mostly zero. More formally, we may wish to have a model in
which $\Pr(y_{i,j} = 1 \mid y_{i,1}, \ldots, y_{i,j-1}, y_{i,j+1}, \ldots, y_{i,n})$ is an increasing function of $y_{i,k}$, $k \neq j$.
Reciprocity: For directed relations, it is often expected that $y_{i,j}$ and $y_{j,i}$ are positively
dependent. The classroom data exhibit a sizable degree of reciprocity, in that the
number of pairs in which $y_{i,j} = y_{j,i} = 1$ is 24, which is more than we would expect due
to just random chance: In only 11 of 500 (2.2%) random permutations of the network
data, holding constant the number of ties sent by each student, did the number of
such reciprocal dyads exceed 24. The average number of reciprocal dyads in the 500
permutations was 17.2. This suggests an appropriate model would be one which allowed
for positive dependence between $y_{i,j}$ and $y_{j,i}$.
Transitivity and Balance: In many situations we expect that two nodes with a positive
relation will relate similarly to other nodes. For relations which are positive or negative,
this has led to the concept of "balance", in which a positive value for $y_{i,j}$ implies $y_{i,k}$ and
$y_{j,k}$ are likely to be of the same sign for other nodes k. A related concept is transitivity,
in which a large value of $y_{i,j}$ together with a large value of $y_{j,k}$ implies a large value of
$y_{i,k}$ (see Wasserman and Faust, 1994, Chapter 6).
The classroom data exhibit a large degree of transitivity, in that the number of non-
vacuously transitive ordered triples (see Wasserman and Faust, page 244), is 400. In
500 random permutations of the network data, holding constant the number of ties
sent by each student, the largest observed number of transitive triples was 347. This
indicates the data exhibit significantly more transitivity than would be expected due to
just random chance and node-level variability, and an appropriate model should allow
for some form of transitive dependence.
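The reciprocity and transitivity checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sociomatrix Y is assumed to be a square 0-1 array with a zero diagonal, and the permutation scheme (shuffling each row's off-diagonal entries so that the number of ties sent is preserved) is one plausible reading of the procedure in the text rather than the author's code.

```python
import numpy as np

def count_mutual_dyads(Y):
    """Number of unordered pairs {i, j} with y_ij = y_ji = 1."""
    Y = np.asarray(Y)
    return int(np.sum(np.triu(Y * Y.T, k=1)))

def count_transitive_triples(Y):
    """Number of ordered triples (i, j, k) with y_ij = y_jk = y_ik = 1."""
    Y = np.asarray(Y).copy()
    np.fill_diagonal(Y, 0)
    # (Y @ Y)[i, k] counts the intermediaries j with i -> j and j -> k.
    return int(np.sum((Y @ Y) * Y))

def permute_rows(Y, rng):
    """Shuffle each row's off-diagonal entries, keeping out-degrees fixed."""
    Y = np.asarray(Y)
    n = Y.shape[0]
    out = np.zeros_like(Y)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        out[i, others] = rng.permutation(Y[i, others])
    return out

def permutation_null(Y, stat, n_perm=500, seed=0):
    """Values of `stat` over n_perm row-permuted copies of Y."""
    rng = np.random.default_rng(seed)
    return np.array([stat(permute_rows(Y, rng)) for _ in range(n_perm)])
```

Comparing `count_mutual_dyads(Y)` with the values returned by `permutation_null(Y, count_mutual_dyads)` gives the kind of reference distribution used above; the same call with `count_transitive_triples` does so for transitivity.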
In this article, we discuss statistical regression models which can describe such types of
network dependence. This is done by incorporating random effects structures in a generalized
linear model setting. We discuss parameter estimation for these models in a Bayesian
framework, and provide example statistical analyses of the classroom data described above
and of a dataset on alliances and conflict among New Guinean tribes.
2 Network Random Effects Models
Generalized linear models, or glm's, are ubiquitous tools which extend linear regression
models to non-normal data and transformable additive covariate effects (McCullagh and
Nelder, 1983). A standard glm assumes the expectation of the response variable Yi,j can be
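written as a function of a linear predictor $\eta_{i,j} = \beta' x_{i,j}$. Assuming observations are conditionally
independent given the $x_{i,j}$'s and $\beta$, the model is:
\[ \Pr(y_{1,2}, \ldots, y_{n,n-1} \mid X, \beta) = \prod_{i \neq j} p(y_{i,j} \mid x_{i,j}, \beta) \]
\[ g(E[y_{i,j} \mid x_{i,j}, \beta]) = \eta_{i,j} = \beta' x_{i,j}. \]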
Examples of glm's include ordinary linear regression, logistic regression, Poisson regression,
and quasilikelihood methods.
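For the binary case, the independence model above has a simple log-likelihood: each ordered pair contributes a Bernoulli term with log odds equal to the linear predictor. A minimal Python sketch follows, assuming a single scalar covariate with an intercept and a 0-1 sociomatrix whose diagonal is ignored; the function names are illustrative only.

```python
import numpy as np

def same_group_indicator(group):
    """x_ij = 1 if nodes i and j carry the same group label, else 0."""
    g = np.asarray(group)
    return (g[:, None] == g[None, :]).astype(float)

def independent_logit_loglik(beta0, beta1, Y, X):
    """Log-likelihood of a logistic glm treating all ordered pairs as independent."""
    Y = np.asarray(Y, dtype=float)
    eta = beta0 + beta1 * np.asarray(X, dtype=float)  # linear predictor
    off_diag = ~np.eye(Y.shape[0], dtype=bool)        # drop self-ties
    # Bernoulli log-likelihood with a logit link: y * eta - log(1 + exp(eta))
    loglik = Y * eta - np.logaddexp(0.0, eta)
    return float(loglik[off_diag].sum())
```

Maximizing this function over the two coefficients would give the naive estimates; it is exactly this independence assumption that the random effects introduced next are meant to relax.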
As discussed above, one feature that distinguishes network data is the likely dependence
among the yi,j's. This lack of independence makes standard glm models inappropriate. In
other settings which involve dependent data, a common approach to parameter estimation
has been the generalized linear mixed-effects model (McCulloch and Searle, 2002) in which
it is assumed the network observations can be modeled as conditionally independent given
appropriate random effects terms which can be incorporated into the glm framework. The
model above becomes
\[ \Pr(y_{1,2}, \ldots, y_{n,n-1} \mid X, \beta, \gamma) = \prod_{i \neq j} p(y_{i,j} \mid x_{i,j}, \beta, \gamma_{i,j}) \qquad (1) \]
\[ g(E[y_{i,j} \mid x_{i,j}, \beta, \gamma_{i,j}]) = \eta_{i,j} = \beta' x_{i,j} + \gamma_{i,j}, \]
where $\gamma_{i,j}$ is an unobserved random effect. The distribution of and dependence among the
$\gamma_{i,j}$'s determines the dependence among the $y_{i,j}$'s. For many kinds of network data, we may
wish to find a form for the $\gamma_{i,j}$'s that induces the kinds of dependence described above, such
as within-node dependence, reciprocity, transitivity, and balance.
A simple approach to modeling the node variability that gives rise to within-node dependence
is the use of random intercepts, that is, to let $\gamma_{i,j} = a_i + b_j + \epsilon_{i,j}$, where $a_i$ and $b_j$
represent independently distributed sender- and receiver-specific effects. Such a distribution
on the $a_i$'s and $b_j$'s induces a positive dependence among responses involving a common
node. Typically, the distributions of these effects are taken to be normal distributions with
means equal to zero, and variances to be estimated from the data.
Modeling other forms of network dependence is not as straightforward. In the case of
binary logistic regression, Hoff, Raftery, and Handcock (2002) propose using a latent-variable
approach as a means of modeling balance, transitivity, and reciprocity in network data. As
applied to the glm above, such an approach presumes the error $\epsilon_{i,j}$ can be written as a
function f of independent k-dimensional latent variables $z_i, z_j \in \mathbb{R}^k$, so that $\epsilon_{i,j} = f(z_i, z_j)$,
$i, j = 1, \ldots, n$. The function f is chosen to be simple and to mimic the forms of network
dependence described above. Incorporating both the random intercepts and the $z_i$'s into the
model, and assuming independent normal distributions, (1) becomes
\[ \eta_{i,j} = \beta' x_{i,j} + a_i + b_j + f(z_i, z_j) \]
\[ a_1, \ldots, a_n \sim \text{i.i.d. Normal}(0, \sigma_a^2) \]
\[ b_1, \ldots, b_n \sim \text{i.i.d. Normal}(0, \sigma_b^2) \]
\[ z_1, \ldots, z_n \sim \text{i.i.d. Normal}(0, I_k \times \sigma_z^2), \]
where $\beta$, $\sigma_a^2$, $\sigma_b^2$, and $\sigma_z^2$ are parameters to be estimated, and $I_k$ is the k x k identity matrix.
Additionally, if the researcher is interested in local network structure, it may be desirable to
estimate $a_i$, $b_i$, $z_i$ for each node.
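To make the generative structure concrete, the sketch below simulates a binary network from this model with a logit link. It is an illustration under stated assumptions, not the author's code: the function name and arguments are hypothetical, the linear predictor contains only an intercept (no covariates), and f is taken to be either an inner product or a negative Euclidean distance between latent vectors.

```python
import numpy as np

def simulate_network(n, beta0, sigma_a, sigma_b, sigma_z, k=2,
                     model="inner", seed=0):
    """Draw a 0-1 sociomatrix from the random effects model (intercept only)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, sigma_a, size=n)       # sender effects a_i
    b = rng.normal(0.0, sigma_b, size=n)       # receiver effects b_j
    z = rng.normal(0.0, sigma_z, size=(n, k))  # latent vectors z_i
    if model == "inner":
        f = z @ z.T
    elif model == "distance":
        f = -np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    else:
        raise ValueError("model must be 'inner' or 'distance'")
    eta = beta0 + a[:, None] + b[None, :] + f  # linear predictor eta_ij
    prob = 1.0 / (1.0 + np.exp(-eta))          # inverse logit
    Y = (rng.random((n, n)) < prob).astype(int)
    np.fill_diagonal(Y, 0)                     # no self-ties
    return Y, eta
```

Because f is symmetric in its two arguments, setting both intercept variances to zero yields a symmetric matrix of linear predictors, which is the source of the reciprocity induced by the latent term.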
It remains to choose a suitable function f. One approach is to presume reciprocity,
transitivity, and balance arise due to the existence of unobserved node characteristics, and
that nodes relate preferentially to other nodes with similar values of those characteristics.
This motivates letting f be a measure of "similarity" between the random effects $z_i$ and $z_j$,
which gives rise to a "latent position" interpretation as discussed in Hoff et al. (2002). For
example, consider the following forms for f:
· (distance model) $f(z_i, z_j) = -|z_i - z_j|$;
· (inner product model) $f(z_i, z_j) = z_i' z_j$.
In the case of directed responses, each of the above functions induces a degree of reciprocity,
as $\epsilon_{i,j} = f(z_i, z_j) = f(z_j, z_i) = \epsilon_{j,i}$, due to the symmetry of f. The common error term
induces a positive dependence between $y_{i,j}$ and $y_{j,i}$.
The above functions also give rise to higher-order dependence. For example, the distance
model gives an error structure that is inherently transitive, since $|z_i - z_j| \leq |z_i - z_k| + |z_k - z_j|$
by the triangle inequality. The observation of strong ties from i to k and k to j suggests
that $|z_i - z_k|$ and $|z_k - z_j|$ are small, and therefore $|z_i - z_j|$ cannot be too large and we
might expect strong ties from i to j. The inner product model satisfies a similar but more
complicated relation: in the special case that the vectors $z_i$ are of unit length,
\[ z_i' z_j \geq z_i' z_k + z_k' z_j - \left(1 + 2\sqrt{(1 - z_i' z_k)(1 - z_k' z_j)}\right). \]
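The unit-length relation for the inner product model can be recovered from the triangle inequality itself. The short derivation below is added as a check; the shorthand $a_{ij} = z_i' z_j$ is introduced here for convenience and does not appear in the original.

\[
\begin{aligned}
|z_i - z_j|^2 &= 2 - 2a_{ij} \quad \text{(since } |z_i| = |z_j| = 1\text{)}, \\
\sqrt{2 - 2a_{ij}} &\leq \sqrt{2 - 2a_{ik}} + \sqrt{2 - 2a_{kj}} \quad \text{(triangle inequality)}, \\
2 - 2a_{ij} &\leq 4 - 2a_{ik} - 2a_{kj} + 4\sqrt{(1 - a_{ik})(1 - a_{kj})} \quad \text{(squaring both sides)}, \\
a_{ij} &\geq a_{ik} + a_{kj} - \left(1 + 2\sqrt{(1 - a_{ik})(1 - a_{kj})}\right).
\end{aligned}
\]

When $a_{ik}$ and $a_{kj}$ are both close to one, the square-root term is close to zero and the bound forces $a_{ij}$ to be close to one as well, which is the transitive behavior described above.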
An undirected signed graph is said to be balanced if the product of the relations in all
cycles is nonnegative, i.e. $y_{i_1,i_2} \times y_{i_2,i_3} \times \cdots \times y_{i_{k-1},i_k} \geq 0$ for all sequences of indices for which
the corresponding data are available (Wasserman and Faust 1994, Chapter 6). As $f(z_i, z_j)$
exists in the model for each pair i, j, balance in terms of this random effect is equivalent to the
balance of the complete graph formed by the sociomatrix with i, jth entry equal to $f(z_i, z_j)$.
For a complete signed graph, all cycles are balanced if and only if each triad is balanced,
i.e. $f(z_i, z_j) \times f(z_j, z_k) \times f(z_k, z_i) \geq 0$ for all triples i, j, k. Interestingly, this is satisfied
by the inner product model in one dimension ($z_i \in \mathbb{R}$), as $(z_i z_j) \times (z_j z_k) \times (z_k z_i) = z_i^2 z_j^2 z_k^2 \geq 0$. For
$z_i \in \mathbb{R}^k$, $k > 1$, these terms are not necessarily balanced, although they are "probabilistically"
balanced in the following sense: if the directions of the $z_i$'s are uniformly distributed, then
the expected number of balanced triads exceeds the number of imbalanced triads, with the
difference decreasing with increasing k. An additional feature of the inner product model is
that if the directions of the $z_i$'s are uniformly distributed, then in general $E(z_i' z_j) = 0$. In
particular, if each $z_i$ is a vector of k independent normal random variables with mean 0 and
variance $\sigma_z^2$, then $z_i' z_j$ will have mean 0 and variance $k\sigma_z^4$, furthering the interpretation of
$z_i' z_j$ as an error term.
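Both claims about the inner product term are easy to check numerically. The following Python sketch, with arbitrarily chosen illustrative values for k and the standard deviation, simulates independent normal latent vectors and compares the empirical mean and variance of the inner product with 0 and k times the fourth power of the standard deviation; it also confirms that every triad product is nonnegative in one dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

# Moments of z_i' z_j for independent Normal(0, sigma_z^2 I_k) vectors.
k, sigma_z, n_draws = 3, 1.5, 200_000       # illustrative values only
zi = rng.normal(0.0, sigma_z, size=(n_draws, k))
zj = rng.normal(0.0, sigma_z, size=(n_draws, k))
inner = np.sum(zi * zj, axis=1)
empirical_mean = inner.mean()
empirical_var = inner.var()
theoretical_var = k * sigma_z**4            # each coordinate product has variance sigma_z^4

# Balance in one dimension: (z_i z_j)(z_j z_k)(z_k z_i) = (z_i z_j z_k)^2 >= 0.
z1 = rng.normal(size=10)
P = np.outer(z1, z1)
triad_products = np.einsum("ij,jk,ki->ijk", P, P, P)
all_balanced = bool((triad_products >= 0).all())
```

The variance formula follows because the k coordinate products are independent, each with mean zero and variance equal to the product of the two coordinate variances.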
On the other hand, $-|z_i - z_j|$ is never positive, and so we lack this interpretation for
the distance model. However, the distance model may be easier to interpret as a spatial
representation of network structure: The $z_i$'s can be interpreted as positions in a latent
"social space," with nodes having strong ties to one another being estimated as close together,
and subsets of nodes with strong within-group ties being estimated as clusters in this social
space. Additionally, plotting estimates and confidence regions for the $z_i$'s gives a graphical,
model-based representation of the network data.
3 Parameter Estimation
Given network data $Y = \{y_{i,j}\}$ and possible regressor variables $X = \{x_{i,j}\}$, the goal is to
make statistical inference on the unknown model parameters, which we generically denote
as $\theta$. The parameter $\theta$ may include the regression coefficients $\beta$, the variances of the random
effects, and possibly the random effects themselves. We take a Bayesian approach to parameter
estimation, in that we posit a (potentially diffuse) prior probability distribution $p(\theta)$,
and base our inference on the posterior, or conditional distribution of the parameters given
the information in the data, which is given by Bayes' rule, $p(\theta \mid Y) = p(Y \mid \theta) \times p(\theta)/p(Y)$.
The posterior is generally not available in closed form;
however, we can make approximate random samples from this distribution using Markov
chain Monte Carlo (MCMC) simulation (Gelfand and Smith 1990; Besag, Green, Higdon,
and Mengersen 1995). MCMC-based inference constructs a dependent sequence of $\theta$-values
as follows: Given the ith value $\theta_i$ in the sequence,
· sample a parameter value $\theta^*$ from a proposal distribution $J(\theta^* \mid \theta_i)$;
· compute the acceptance probability
\[ r = \min\left\{1, \frac{p(Y \mid \theta^*)\, p(\theta^*)}{p(Y \mid \theta_i)\, p(\theta_i)} \times \frac{J(\theta_i \mid \theta^*)}{J(\theta^* \mid \theta_i)}\right\}; \]
· set $\theta_{i+1} = \theta^*$ with probability r, otherwise set $\theta_{i+1} = \theta_i$.
The particular details, such as the choice of the proposal distribution J, will depend on the
model and the data. See Hoff et al. (2002) for MCMC algorithms designed specifically for
such latent variable models.
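The three steps above can be sketched generically in Python. This is a minimal random-walk version, assuming a symmetric normal proposal so that the proposal densities cancel in the acceptance ratio; the function names, step size, and the standard normal toy target are illustrative assumptions, and the algorithms actually used for the latent variable models are those described in Hoff et al. (2002).

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter=20_000, step=1.0, seed=0):
    """Random-walk sampler; log_post returns log p(Y | theta) + log p(theta)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        # Step 1: propose theta* from a symmetric proposal centered at theta.
        proposal = theta + step * rng.normal(size=theta.size)
        lp_prop = log_post(proposal)
        # Step 2: log acceptance ratio (the symmetric proposal terms cancel).
        log_r = lp_prop - lp
        # Step 3: accept with probability min(1, r), otherwise keep theta.
        if np.log(rng.random()) < log_r:
            theta, lp = proposal, lp_prop
        samples[t] = theta
    return samples

def log_post_std_normal(theta):
    """Toy target for illustration: a standard normal log density (up to a constant)."""
    return -0.5 * float(np.sum(theta**2))

samples = metropolis_hastings(log_post_std_normal, theta0=[0.0])
```

Working on the log scale avoids numerical underflow when the likelihood is a product over many dyads. In practice an initial portion of the sequence is usually discarded before summarizing the samples.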
The result of the algorithm is a sequence of $\theta$-values having a distribution that is approximately
equal to the target distribution $p(\theta \mid Y)$. Statistical inference can be based on these
samples. For example, a point estimate of $\theta$ is often taken to be the posterior mean, which
is approximated by the average of the sampled $\theta$-values. Posterior confidence intervals can
be based on the sample quantiles.
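Given a vector of sampled values for one parameter, these summaries take only a few lines. The sketch below uses NumPy; the function name and the returned dictionary layout are illustrative choices.

```python
import numpy as np

def summarize(samples, level=0.95):
    """Posterior mean, median, and equal-tailed quantile interval from MCMC draws."""
    samples = np.asarray(samples, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(samples, [tail, 100.0 - tail])
    return {
        "mean": float(samples.mean()),
        "median": float(np.median(samples)),
        "interval": (float(lower), float(upper)),
    }
```

With `level=0.95` the interval runs from the 2.5th to the 97.5th percentile of the draws.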
4 Example Data Analyses
We now apply the methods described above to the statistical analysis of two example
datasets. In the first example, we use the inner product model as a means of making
inference on the preference for same sex friendship ties in Hansell's classroom data. In the
second example, we use the distance model to make inference on the network of alliances
among sixteen New Guinean tribes studied by Read (1954). Both datasets involve binary
network data, although the methods are easily adapted to other types of network data via
an appropriate generalized linear model.
4.1 Classroom Friendships
Hansell's (1984) data exhibit a tendency of children to form same-sex friendship ties, in that
72% of the ties are same-sex. We consider a statistical analysis of this preference, in which
we estimate the log odds of a same-sex tie, as well as make a confidence interval for its value.
This is done via the logistic regression model with random effects described above,
\[ g(E[y_{i,j} \mid \beta, x_{i,j}, \gamma_{i,j}]) = \beta_0 + \beta_1 x_{i,j} + \gamma_{i,j}, \]
where $x_{i,j}$ is the indicator that i and j are of the same sex, $\beta = \{\beta_0, \beta_1\}$ are parameters to
be estimated, and $\gamma_{i,j}$ is a random effect. In this parametrization, $\beta_0$ is the log odds of a
friendship between children of opposite sexes, and $\beta_0 + \beta_1$ is the log odds for children of the
same sex.
As described in the introduction, Hansell's (1984) classroom data exhibit several forms
of network dependence, including node-level variability, reciprocity, and transitivity. This
suggests we model the data with node-specific rates of sending and receiving ties, as well as
a term which captures reciprocity and transitivity. We choose the following inner-product
model with random sender and receiver effects:
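\[ \text{log odds}(y_{i,j} = 1) = \beta_0 + \beta_1 x_{i,j} + a_i + b_j + z_i' z_j \]
\[ a_1, \ldots, a_n \sim \text{i.i.d. Normal}(0, \sigma_a^2) \]
\[ b_1, \ldots, b_n \sim \text{i.i.d. Normal}(0, \sigma_b^2) \]
\[ z_1, \ldots, z_n \sim \text{i.i.d. Normal}(0, \sigma_z^2) \]
The parameters in this model are the regression coefficients $\beta_0$ and $\beta_1$, as well as the variance
terms $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$, which determine the dependencies between ties.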
A Bayesian analysis was performed using the methods outlined in Section 3. The prior
distributions for $\beta_0$ and $\beta_1$ were taken to be independent, diffuse normal distributions, both
having mean zero and variance 100. The variance terms $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$ were given diffuse
inverse-Gamma(2,1) distributions, having an expectation of one but an infinite variance. An MCMC
algorithm was used to obtain 500,000 approximate samples from the posterior distribution
$p(\beta_0, \beta_1, \sigma_a^2, \sigma_b^2, \sigma_z^2 \mid Y)$. Marginal posterior distributions of $\beta_1$, $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$ are presented in
Figure 2. The results suggest a significant preference for same-sex friendship ties, in that the
posterior distribution for $\beta_1$ is centered around a median of 1.49, and a 95% quantile-based
confidence interval for $\beta_1$ is (0.84, 2.~), which does not contain zero. The posterior
distributions of $\sigma_a^2$ and $\sigma_z^2$ have deviated from their prior distributions and have moved to the
right, giving evidence for sender-specific variability as well as the need for the latent variables
$z_1, \ldots, z_n$. The posterior for $\sigma_b^2$ concentrates mass on low values, and is not much different
from the prior distribution, indicating little evidence for strong receiver-specific variability.
In comparison, a naive approach to inference would be to treat each possible tie as a
Bernoulli random variable, independent of all other ties. Using standard logistic regression,
our estimate of $\beta_1$ is 1.3 with a standard error of 0.2, giving an approximate 95% confidence
interval of (0.91, 1.70), which is of substantially smaller width than the interval obtained
with the random effects model. Of course, we might expect the confidence interval based on
this naive analysis to be too small, as it incorrectly assumes all ties between individuals are
independent and thus overestimates the precision of the parameter estimate.
4.2 Tribal alliances
Read (1954) describes a number of network relations between sixteen New Guinean tribes.
Here we consider the network of alliances between tribes, letting $y_{i,j} = 1$ if tribes i and j
have an alliance, and $y_{i,j} = 0$ otherwise. We analyze these data using the simple distance
model with no covariates or separate sender- and receiver-specific random effects:
\[ \text{log odds} \Pr(y_{i,j} = 1 \mid \beta_0, z_i, z_j) = \beta_0 - |z_i - z_j|, \]
where $\beta_0$ represents the baseline log odds of a tie between two nodes that have the same latent
position (i.e. $\beta_0$ is the maximum log odds of a tie), and the $z_i$'s are latent positions in $\mathbb{R}^2$.
[Figure 2 panels include p(beta | Y) and p(sigma_z^2 | Y).]
Figure 2: Marginal posterior distributions for the classroom data: Dashed lines represent
the prior distributions for the variance parameters, solid lines the posterior. Vertical lines
give the posterior median.
Without separate sender- and receiver-specific effects, we may expect that tribes with many
alliances will be estimated as being more centrally located, and those with few ties as being
on the periphery.
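The link between position and number of ties can be seen directly from the model formula. In the Python sketch below, the four positions and the intercept value are made up for illustration and are not estimates from the tribal data; the function name is likewise hypothetical.

```python
import numpy as np

def distance_model_probs(beta0, Z):
    """Tie probabilities under log odds = beta0 - |z_i - z_j|."""
    Z = np.asarray(Z, dtype=float)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return 1.0 / (1.0 + np.exp(-(beta0 - dist)))

# Hypothetical latent positions: node 0 central, node 3 far from the rest.
Z = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 3.0]])
P = distance_model_probs(beta0=1.0, Z=Z)
expected_degree = P.sum(axis=1) - np.diag(P)  # exclude self-pairs
```

Here the central node has the largest expected number of ties and the outlying node the smallest, which is the pattern the estimated positions are expected to reflect when no separate sender or receiver effects are included.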
Bayesian estimates and confidence intervals for $\beta_0$ and the $z_i$'s are obtained using the
methods outlined in Section 3. In particular, samples of latent positions from the posterior
distribution $p(z_1, \ldots, z_n \mid Y)$ are plotted in the first panel of Figure 3 (colors are chosen so
that nearby node locations will have similar colors). Additionally, a black line drawn between
nodes indicates the presence of an alliance.
Ad-hoc approaches, or simple point estimates of latent locations, might uncover some of
the structure of the network. Our method goes beyond this by providing posterior confidence
regions for node locations which in turn give us a model-based measure of uncertainty about
the network structure. Additionally, forms of predictive inference can be obtained from such
a model. For example, suppose that the presence or absence of an alliance between pair (i, j)
is unobserved or missing. The model can be fit with all available information (excluding the
unknown $y_{i,j}$), and from the available information the posterior distributions of $z_i$ and $z_j$
can be obtained. From these, predictive inference about the value of $y_{i,j}$ can be made.
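One way to carry out such a prediction, sketched here in Python under the distance model, is to average the model's tie probability over the posterior samples. The array shapes and names are assumptions for illustration: `beta0_samples` holds S sampled intercepts and `Z_samples` holds the corresponding S sets of latent positions, all obtained from a fit that excluded the unobserved pair.

```python
import numpy as np

def predictive_tie_prob(beta0_samples, Z_samples, i, j):
    """Posterior predictive Pr(y_ij = 1), averaged over MCMC samples."""
    beta0_samples = np.asarray(beta0_samples, dtype=float)  # shape (S,)
    Z_samples = np.asarray(Z_samples, dtype=float)          # shape (S, n, k)
    dist = np.linalg.norm(Z_samples[:, i, :] - Z_samples[:, j, :], axis=1)
    probs = 1.0 / (1.0 + np.exp(-(beta0_samples - dist)))
    return float(probs.mean())
```

Averaging over samples, rather than plugging in point estimates of the positions, carries the uncertainty about the latent positions through to the prediction.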
Figure 3: Tribal alliance network and marginal posterior distributions of locations.
Also collected by Read (1954) were data on conflicts between the tribes. It is interesting
to note that, based on a clustering of nodes (1,2,15,16), (3,4,6,7,8,11,12), and (5,9,10,13,14),
there were no within-cluster conflicts, even though not every tribe within a cluster had an
alliance with every other cluster member. Additionally, node 7, towards the center of
the alliance structure, had no conflicts with any of the other 15 tribes. We note that both
responses (conflict and alliance) could be modeled concurrently by a similar method, in
which a multinomial logistic random effects model is employed in place of the binary logistic
random effects model above.
5 Discussion
This article proposes a form of generalized linear mixed-effects model for the statistical
analysis of network data for which parameter estimation is practical to implement. The
approach has some advantages over existing social network models and inferential procedures:
the approach allows for prediction and hypothesis testing; lends itself to a model-based
method of network visualization; is highly extendable and interpretable in terms of well
known statistical procedures such as regression and generalized linear models; and has a
feasible means of exact parameter estimation.
The models discussed here can capture some types of network dependence, although it is
possible (or even likely) that in many datasets there are types of dependencies that cannot
be well-represented with these models. It then becomes important to develop methods for
assessing model lack of fit, and determining the effect of lack of fit on the estimation of
regression coefficients. Furthermore, it may be useful to combine the types of random effects
discussed here with other types of random effects, or latent variables. For example, Nowicki
and Snijders (2001) discuss a latent class model, a useful model for identifying clusters of
nodes that relate to others in similar ways. Their latent class model, combined with types
of random effects models presented here and possibly other random effects structures, could
provide a rich class of models for dependent network data.
References
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian computation and stochastic
systems," Statist. Sci., 10, 3-66, With comments and a reply by the authors.
Gelfand, A. E. and Smith, A. F. M. (1990), "Sampling-based approaches to calculating marginal
densities," J. Amer. Statist. Assoc., 85, 398-409.