
Random Effects Models for Network Data

Peter D. Hoff*
Department of Statistics
University of Washington
Seattle, Washington
hoff@stat.washington.edu
www.stat.washington.edu/hoff

Abstract

One impediment to the statistical analysis of network data has been the difficulty in modeling the dependence among the observations. In the very simple case of binary (0-1) network data, some researchers have parameterized network dependence in terms of exponential family representations. Accurate parameter estimation for such models is quite difficult, and the most commonly used models often display a significant lack of fit. Additionally, such models are generally limited to binary data. In contrast, random effects models have been a widely successful tool in capturing statistical dependence for a variety of data types, and allow for prediction, imputation, and hypothesis testing within a general regression context. We propose novel random effects structures to capture network dependence, which can also provide graphical representations of network structure and variability.

1 Network Dependence

Network data typically consist of a set of n nodes and a relational tie $y_{i,j}$, measured on each ordered pair of nodes $i, j = 1, \ldots, n$. This framework has many applications, including the study of war, trade, the behavior of epidemics, the interconnectedness of the World Wide Web, and telephone calling patterns. It is often of interest to relate each network response $y_{i,j}$ to a possibly pair-specific vector-valued predictor variable $x_{i,j}$. A flexible framework for doing so is the generalized linear model (see, for example, McCullagh and Nelder 1983), in which the expected value of the response is modeled as a function of a linear predictor $\beta' x_{i,j}$, where $\beta$ is an unknown vector of regression coefficients to be estimated from the data. The ordinary regression model $E[y_{i,j}] = \beta' x_{i,j}$ is perhaps the most commonly used model of this type. A generalized linear model for binary data is logistic regression, which relates the expectation of the response to the regression variables via the relation $g(E[y_{i,j}]) = \beta' x_{i,j}$, where $g(p) = \log\frac{p}{1-p}$.

*This research was supported by Office of Naval Research grant N00014-02-1-1011.

DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS 303

As an example of the use of such statistical models, consider the analysis of strong friendship ties among 13 boys and 14 girls in a sixth-grade classroom, as collected by Hansell (1984). Each student was asked if they liked each other student "a lot", "some", or "not much". A strong friendship tie is considered present if a student likes another student "a lot." Also recorded is the sex of each student. The data, presented in Figure 1, suggest a general preference for same-sex friendship ties. Of potential interest is statistical estimation of this preference, as well as a confidence interval for its value. One approach for such statistical analysis would be to formulate the logistic regression model $g(E[y_{i,j} \mid x_{i,j}, \beta]) = \beta_0 + \beta_1 x_{i,j}$, where $x_{i,j}$ is one if children i and j are of the same sex, and zero otherwise, and $\beta = (\beta_0, \beta_1)$ are parameters to be estimated.

[Figure 1: (a) Sociomatrix for friendship data: rows and columns 1-13 are boys, 14-27 are girls; (b) Graphical representation of friendship data: dark blue lines are reciprocated ties.]

Estimation of regression coefficients $\beta$ typically proceeds under the assumption that the observations are conditionally independent given $\beta$ and the $x_{i,j}$'s. However, this assumption is often violated in network datasets. For example, the data on friendship ties display several types of dependence:

Within-node dependence: The number of ties sent by each student varies considerably, ranging from 0 to 19 with a mean of 5.8 and a standard deviation of 4.7 (the standard deviation of the number of ties received was 3.2). This node-level variability suggests that responses from the same individual are positively dependent, in that the probability that $y_{i,j} = 1$ (i sends a tie to j) is high if we know $y_{i,k} = 1$ for lots of other nodes k, and lower if $y_{i,k}$ is mostly zero. More formally, we may wish to have a model in which $\Pr(y_{i,j} = 1 \mid y_{i,1}, \ldots, y_{i,j-1}, y_{i,j+1}, \ldots, y_{i,n})$ is an increasing function of $y_{i,k}$, $k \neq j$.

Reciprocity: For directed relations, it is often expected that $y_{i,j}$ and $y_{j,i}$ are positively dependent. The classroom data exhibit a sizable degree of reciprocity, in that the number of pairs in which $y_{i,j} = y_{j,i} = 1$ is 24, which is more than we would expect due to just random chance: in only 11 of 500 (2.2%) random permutations of the network data, holding constant the number of ties sent by each student, did the number of such reciprocal dyads exceed 24. The average number of reciprocal dyads in the 500 permutations was 17.2. This suggests an appropriate model would be one which allows for positive dependence between $y_{i,j}$ and $y_{j,i}$.

Transitivity and Balance: In many situations we expect that two nodes with a positive relation will relate similarly to other nodes. For relations which are positive or negative, this has led to the concept of "balance," in which a positive value for $y_{i,j}$ implies $y_{i,k}$ and $y_{j,k}$ are likely to be of the same sign for other nodes k. A related concept is transitivity, in which a large value of $y_{i,j}$ together with a large value of $y_{j,k}$ implies a large value of $y_{i,k}$ (see Wasserman and Faust, 1994, Chapter 6). The classroom data exhibit a large degree of transitivity, in that the number of non-vacuously transitive ordered triples (see Wasserman and Faust, page 244) is 400. In 500 random permutations of the network data, holding constant the number of ties sent by each student, the largest observed number of transitive triples was 347. This indicates the data exhibit significantly more transitivity than would be expected due to just random chance and node-level variability, and an appropriate model should allow for some form of transitive dependence.

In this article, we discuss statistical regression models which can describe such types of network dependence. This is done by incorporating random effects structures in a generalized linear model setting. We discuss parameter estimation for these models in a Bayesian framework, and provide example statistical analyses of the classroom data described above and of a dataset on alliances and conflict among New Guinean tribes.

2 Network Random Effects Models

Generalized linear models, or glm's, are ubiquitous tools which extend linear regression models to non-normal data and transformable additive covariate effects (McCullagh and Nelder, 1983).
A standard glm assumes the expectation of the response variable $y_{i,j}$ can be written as a function of a linear predictor $\eta_{i,j} = \beta' x_{i,j}$. Assuming observations are conditionally independent given the $x_{i,j}$'s and $\beta$, the model is:

$$\Pr(y_{1,2}, \ldots, y_{n,n-1} \mid X, \beta) = \prod_{i \neq j} p(y_{i,j} \mid x_{i,j}, \beta)$$
$$g(E[y_{i,j} \mid x_{i,j}, \beta]) = \eta_{i,j} = \beta' x_{i,j}$$

Examples of glm's include ordinary linear regression, logistic regression, Poisson regression, and quasilikelihood methods. As discussed above, one feature that distinguishes network data is the likely dependence among the $y_{i,j}$'s. This lack of independence makes standard glm models inappropriate.

In other settings which involve dependent data, a common approach to parameter estimation has been the generalized linear mixed-effects model (McCulloch and Searle, 2001), in which it is assumed the network observations can be modeled as conditionally independent given appropriate random effects terms which can be incorporated into the glm framework. The model above becomes

$$\Pr(y_{1,2}, \ldots, y_{n,n-1} \mid X, \beta, \gamma) = \prod_{i \neq j} p(y_{i,j} \mid x_{i,j}, \beta, \gamma_{i,j})$$
$$g(E[y_{i,j} \mid x_{i,j}, \beta, \gamma_{i,j}]) = \eta_{i,j} = \beta' x_{i,j} + \gamma_{i,j}, \qquad (1)$$

where $\gamma_{i,j}$ is an unobserved random effect. The distribution of and dependence among the $\gamma_{i,j}$'s determines the dependence among the $y_{i,j}$'s. For many kinds of network data, we may wish to find a form for the $\gamma_{i,j}$'s that induces the kinds of dependence described above, such as within-node dependence, reciprocity, transitivity, and balance.

A simple approach to modeling the node variability that gives rise to within-node dependence is the use of random intercepts, that is, to let $\gamma_{i,j} = a_i + b_j + \epsilon_{i,j}$, where $a_i$ and $b_j$ represent independently distributed sender- and receiver-specific effects. Such a distribution on the $a_i$'s and $b_j$'s induces a positive dependence among responses involving a common node. Typically, the distributions of these effects are taken to be normal with means equal to zero and variances to be estimated from the data.

Modeling other forms of network dependence is not as straightforward. In the case of binary logistic regression, Hoff, Raftery, and Handcock (2002) propose using a latent-variable approach as a means of modeling balance, transitivity, and reciprocity in network data. As applied to the glm above, such an approach presumes the error $\epsilon_{i,j}$ can be written as a function f of independent k-dimensional latent variables $z_i, z_j \in \mathbb{R}^k$, so that $\epsilon_{i,j} = f(z_i, z_j)$, $i, j = 1, \ldots, n$. The function f is chosen to be simple and to mimic the forms of network dependence described above. Incorporating both the random intercepts and the $z_i$'s into the model, and assuming independent normal distributions, (1) becomes

$$\eta_{i,j} = \beta' x_{i,j} + a_i + b_j + f(z_i, z_j)$$
$$a_1, \ldots, a_n \sim \text{i.i.d. Normal}(0, \sigma_a^2)$$
$$b_1, \ldots, b_n \sim \text{i.i.d. Normal}(0, \sigma_b^2)$$
$$z_1, \ldots, z_n \sim \text{i.i.d. Normal}(0, I_k \times \sigma_z^2),$$

where $\beta$, $\sigma_a^2$, $\sigma_b^2$, and $\sigma_z^2$ are parameters to be estimated, and $I_k$ is the k x k identity matrix.
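To make the structure of this model concrete, the following minimal Python sketch simulates one binary directed network from the linear predictor $\eta_{i,j} = \beta' x_{i,j} + a_i + b_j + f(z_i, z_j)$ with a logit link. It is an illustration under stated assumptions rather than part of the original analysis: the function name `simulate_network`, the storage of covariates as an n x n x p NumPy array, the parameter values, and the placeholder choice $f(z_i, z_j) = z_i' z_j$ are all hypothetical.

```python
import numpy as np

def simulate_network(x, beta, sigma_a, sigma_b, sigma_z, k, f, rng):
    """Draw one binary directed network from the random effects logit model.

    x    : (n, n, p) array of pair-specific covariates (assumed layout)
    beta : (p,) regression coefficients
    f    : function mapping the (n, k) latent matrix to an (n, n) effect matrix
    """
    n = x.shape[0]
    a = rng.normal(0.0, sigma_a, size=n)        # sender effects a_i
    b = rng.normal(0.0, sigma_b, size=n)        # receiver effects b_j
    z = rng.normal(0.0, sigma_z, size=(n, k))   # latent vectors z_i
    # Linear predictor: eta[i, j] = beta'x[i, j] + a[i] + b[j] + f(z_i, z_j)
    eta = x @ beta + a[:, None] + b[None, :] + f(z)
    prob = 1.0 / (1.0 + np.exp(-eta))           # inverse logit link
    y = (rng.random((n, n)) < prob).astype(int)
    np.fill_diagonal(y, 0)                      # self-ties are not defined
    return y, eta

rng = np.random.default_rng(0)
n = 27
group = np.array([0] * 13 + [1] * 14)           # hypothetical two-group labels
same = (group[:, None] == group[None, :]).astype(float)
x = np.stack([np.ones((n, n)), same], axis=-1)  # intercept + same-group indicator
beta = np.array([-2.0, 1.0])                    # made-up coefficients
y, eta = simulate_network(x, beta, 1.0, 0.5, 1.0, 2, lambda z: z @ z.T, rng)
```

Passing f as an argument keeps the sketch agnostic about the form of the latent effect; any function returning an n x n matrix could be substituted.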
Additionally, if the researcher is interested in local network structure, it may be desirable to estimate $a_i$, $b_i$, $z_i$ for each node.

It remains to choose a suitable function f. One approach is to presume reciprocity, transitivity, and balance arise due to the existence of unobserved node characteristics, and that nodes relate preferentially to other nodes with similar values of those characteristics. This motivates letting f be a measure of "similarity" between the random effects $z_i$ and $z_j$, which gives rise to a "latent position" interpretation as discussed in Hoff et al. (2002). For example, consider the following forms for f:

· (distance model) $f(z_i, z_j) = -|z_i - z_j|$;
· (inner product model) $f(z_i, z_j) = z_i' z_j$.
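The two forms of f listed above can be sketched in a few lines of Python. The function names and the small example array are hypothetical; each function takes an n x k matrix whose rows are the latent vectors and returns the n x n matrix of pairwise effects.

```python
import numpy as np

def distance_effect(z):
    """f(z_i, z_j) = -|z_i - z_j| for every pair, as an (n, n) matrix."""
    diff = z[:, None, :] - z[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=-1))

def inner_product_effect(z):
    """f(z_i, z_j) = z_i' z_j for every pair, as an (n, n) matrix."""
    return z @ z.T

# Three illustrative latent positions in two dimensions.
z = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
d = distance_effect(z)
ip = inner_product_effect(z)
```

Both matrices are symmetric, which is the property that later produces a shared error term for the ties i to j and j to i.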

In the case of directed responses, each of the above functions induces a degree of reciprocity, as $\epsilon_{i,j} = f(z_i, z_j) = f(z_j, z_i) = \epsilon_{j,i}$, due to the symmetry of f. The common error term induces a positive dependence between $y_{i,j}$ and $y_{j,i}$.

The above functions also give rise to higher-order dependence. For example, the distance model gives an error structure that is inherently transitive, since $|z_i - z_j| \le |z_i - z_k| + |z_k - z_j|$ by the triangle inequality. The observation of strong ties from i to k and k to j suggests that $|z_i - z_k|$ and $|z_k - z_j|$ are small, and therefore $|z_i - z_j|$ cannot be too large and we might expect strong ties from i to j. The inner product model satisfies a similar but more complicated relation: in the special case that the vectors $z_i$ are of unit length,

$$z_i' z_j \ge z_i' z_k + z_k' z_j - \left(1 + 2\sqrt{(1 - z_i' z_k)(1 - z_k' z_j)}\right).$$

An undirected signed graph is said to be balanced if the product of the relations in all cycles is nonnegative, i.e. $y_{i_1,i_2} \times y_{i_2,i_3} \times \cdots \times y_{i_{k-1},i_k} \ge 0$ for all sequences of indices for which the corresponding data are available (Wasserman and Faust 1994, Chapter 6). As $f(z_i, z_j)$ exists in the model for each pair i, j, balance in terms of this random effect is equivalent to the balance of the complete graph formed by the sociomatrix with i, jth entry equal to $f(z_i, z_j)$. For a complete signed graph, all cycles are balanced if and only if each triad is balanced, i.e. $f(z_i, z_j) \times f(z_j, z_k) \times f(z_k, z_i) \ge 0$ for all triples i, j, k. Interestingly, this is satisfied by the inner product model in one dimension ($z_i \in \mathbb{R}$), as $(z_i z_j) \times (z_j z_k) \times (z_k z_i) \ge 0$. For $z_i \in \mathbb{R}^k$, k > 1, these terms are not necessarily balanced, although they are "probabilistically" balanced in the following sense: if the directions of the $z_i$'s are uniformly distributed, then the expected number of balanced triads exceeds the number of imbalanced triads, with the difference decreasing with increasing k.
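The "probabilistic" balance property can be explored numerically. The sketch below is a Monte Carlo illustration, not a proof, and its details are assumptions: latent vectors are drawn as independent standard normals (so their directions are uniform), and the helper name `balanced_fraction`, the seed, and the number of triads are arbitrary. For each simulated triad it checks whether the product of the three inner-product effects is nonnegative.

```python
import numpy as np

def balanced_fraction(k, n_triples, rng):
    """Fraction of random triads whose inner-product effects multiply to >= 0."""
    z = rng.normal(size=(n_triples, 3, k))    # three latent vectors per triad
    f_ij = (z[:, 0] * z[:, 1]).sum(axis=1)    # z_i' z_j
    f_jk = (z[:, 1] * z[:, 2]).sum(axis=1)    # z_j' z_k
    f_ki = (z[:, 2] * z[:, 0]).sum(axis=1)    # z_k' z_i
    return float(np.mean(f_ij * f_jk * f_ki >= 0))

rng = np.random.default_rng(1)
fractions = {k: balanced_fraction(k, 20000, rng) for k in (1, 2, 10)}
```

In one dimension the product equals $(z_i z_j z_k)^2$, so every simulated triad is balanced; in higher dimensions the estimated fraction should fall toward one half while staying above it, in line with the statement above.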
An additional feature of the inner product model is that if the directions of the $z_i$'s are uniformly distributed, then in general $E(z_i' z_j) = 0$. In particular, if each $z_i$ is a vector of k independent normal random variables with mean 0 and variance $\sigma_z^2$, then $z_i' z_j$ will have mean 0 and variance $k\sigma_z^4$, furthering the interpretation of $z_i' z_j$ as an error term. On the other hand, $-|z_i - z_j|$ is always negative, and so we lack this interpretation for the distance model. However, the distance model may be easier to interpret as a spatial representation of network structure: the $z_i$'s can be interpreted as positions in a latent "social space," with nodes having strong ties to one another being estimated as close together, and subsets of nodes with strong within-group ties being estimated as clusters in this social space. Additionally, plotting estimates and confidence regions for the $z_i$'s gives a graphical, model-based representation of the network data.

3 Parameter Estimation

Given network data $Y = \{y_{i,j}\}$ and possible regressor variables $X = \{x_{i,j}\}$, the goal is to make statistical inference on the unknown model parameters, which we generically denote as $\theta$. The parameter $\theta$ may include the regression coefficients $\beta$, the variances of the random effects, and possibly the random effects themselves. We take a Bayesian approach to parameter estimation, in that we posit a (potentially diffuse) prior probability distribution $p(\theta)$, and base our inference on the posterior, or conditional distribution of the parameters given the information in the data, which is given by Bayes' rule, $p(\theta \mid Y) = p(Y \mid \theta) \times p(\theta) / p(Y)$. A closed form expression for the desired conditional distribution is generally unavailable; however, we can make approximate random samples from this distribution using Markov chain Monte Carlo (MCMC) simulation (Gelfand and Smith 1990; Besag, Green, Higdon, and Mengersen 1995). MCMC-based inference constructs a dependent sequence of $\theta$-values as follows: given the ith value $\theta_i$ in the sequence,

· sample a parameter value $\theta^*$ from a proposal distribution $J(\theta^* \mid \theta_i)$;
· compute the acceptance probability $r = \min\left(1, \frac{p(Y \mid \theta^*)\, p(\theta^*)}{p(Y \mid \theta_i)\, p(\theta_i)} \times \frac{J(\theta_i \mid \theta^*)}{J(\theta^* \mid \theta_i)}\right)$;
· set $\theta_{i+1} = \theta^*$ with probability r, otherwise set $\theta_{i+1} = \theta_i$.

The particular details, such as the choice of the proposal distribution J, will depend on the model and the data. See Hoff et al. (2002) for MCMC algorithms designed specifically for such latent variable models.

The result of the algorithm is a sequence of $\theta$ values having a distribution that is approximately equal to the target distribution $p(\theta \mid Y)$. Statistical inference can be based on these samples. For example, a point estimate of $\theta$ is often taken to be the posterior mean, which is approximated by the average of the sampled $\theta$-values. Posterior confidence intervals can be based on the sample quantiles.

4 Example Data Analyses

We now apply the methods described above to the statistical analysis of two example datasets. In the first example, we use the inner product model as a means of making inference on the preference for same-sex friendship ties in Hansell's classroom data. In the second example, we use the distance model to make inference on the network of alliances among sixteen New Guinean tribes studied by Read (1954). Both datasets involve binary network data, although the methods are easily adapted to other types of network data via an appropriate generalized linear model.

4.1 Classroom Friendships

Hansell's (1984) data exhibit a tendency of children to form same-sex friendship ties, in that 72% of the ties are same-sex. We consider a statistical analysis of this preference, in which we estimate the log odds of a same-sex tie, as well as make a confidence interval for its value.
This is done via the logistic regression model with random effects described above, $g(E[y_{i,j} \mid \beta, x_{i,j}, \gamma_{i,j}]) = \beta_0 + \beta_1 x_{i,j} + \gamma_{i,j}$, where $x_{i,j}$ is the indicator that i and j are of the same sex, $\beta = \{\beta_0, \beta_1\}$ are parameters to be estimated, and $\gamma_{i,j}$ is a random effect. In this parametrization, $\beta_0$ is the log odds of a friendship between children of opposite sexes, and $\beta_0 + \beta_1$ is the log odds for children of the same sex.

As described in the introduction, Hansell's (1984) classroom data exhibit several forms of network dependence, including node-level variability, reciprocity, and transitivity. This suggests we model the data with node-specific rates of sending and receiving ties, as well as a term which captures reciprocity and transitivity. We choose the following inner-product model with random sender and receiver effects:

$$\text{log odds}(y_{i,j} = 1) = \beta_0 + \beta_1 x_{i,j} + a_i + b_j + z_i' z_j$$
$$a_1, \ldots, a_n \sim \text{i.i.d. Normal}(0, \sigma_a^2)$$
$$b_1, \ldots, b_n \sim \text{i.i.d. Normal}(0, \sigma_b^2)$$
$$z_1, \ldots, z_n \sim \text{i.i.d. Normal}(0, \sigma_z^2)$$

The parameters in this model are the regression coefficients $\beta_0$ and $\beta_1$, as well as the variance terms $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$, which determine the dependencies between ties.

A Bayesian analysis was performed using the methods outlined in Section 3. The prior distributions for $\beta_0$ and $\beta_1$ were taken to be independent, diffuse normal distributions, both having mean zero and variance 100. The variance terms $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$ were given diffuse inverse-Gamma(2,1) distributions, having an expectation of one but an infinite variance. An MCMC algorithm was used to obtain 500,000 approximate samples from the posterior distribution $p(\beta_0, \beta_1, \sigma_a^2, \sigma_b^2, \sigma_z^2 \mid Y)$.

Marginal posterior distributions of $\beta_1$, $\sigma_a^2$, $\sigma_b^2$, $\sigma_z^2$ are presented in Figure 2. The results suggest a significant preference for same-sex friendship ties, in that the posterior distribution for $\beta_1$ is centered around a median of 1.49, and a 95% quantile-based confidence interval for $\beta_1$ is (0.84, 2.~), which does not contain zero. The posterior distributions of $\sigma_a^2$ and $\sigma_z^2$ have deviated from their prior distributions and have moved to the right, giving evidence for sender-specific variability as well as the need for the latent variables $z_1, \ldots, z_n$. The posterior for $\sigma_b^2$ concentrates mass on low values, and is not much different from the prior distribution, indicating little evidence for strong receiver-specific variability.
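The mechanics of obtaining such posterior summaries by MCMC can be sketched as follows. This is a deliberately simplified illustration and not the algorithm used for the analysis above: it assumes a random-walk Metropolis sampler with a symmetric normal proposal (so the proposal ratio cancels), it omits the random effects $a_i$, $b_j$, $z_i$, and it runs on simulated toy data rather than the classroom data. The function names, step size, chain length, and burn-in are all hypothetical choices; only the Normal(0, 100) priors on the coefficients are taken from the text.

```python
import numpy as np

def metropolis(log_post, theta0, n_iter, step, rng):
    """Random-walk Metropolis sampler with a symmetric normal proposal J."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.size)
        lp_prop = log_post(proposal)
        # Symmetric J cancels, so r = min(1, posterior ratio); compare on log scale.
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples[i] = theta
    return samples

# Toy data (simulated here, NOT the classroom data): independent Bernoulli ties.
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=500).astype(float)   # stand-in for a same-sex indicator
true_eta = -1.0 + 1.5 * x                        # made-up true coefficients
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_eta))).astype(float)

def log_post(theta):
    """Log posterior (up to a constant) for logistic regression with N(0, 100) priors."""
    eta = theta[0] + theta[1] * x
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    log_prior = -np.sum(theta ** 2) / (2.0 * 100.0)
    return log_lik + log_prior

samples = metropolis(log_post, [0.0, 0.0], 6000, 0.2, rng)
kept = samples[1000:]                            # discard burn-in draws
post_mean = kept.mean(axis=0)
lo, med, hi = np.quantile(kept[:, 1], [0.025, 0.5, 0.975])
```

Here `med` plays the role of the posterior median and `(lo, hi)` the 95% quantile-based interval for the slope coefficient. A sampler for the full random effects model would also need to update the node-level effects and variance terms.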
In comparison, a naive approach to inference would be to treat each possible tie as a Bernoulli random variable, independent of all other ties. Using standard logistic regression, our estimate of $\beta_1$ is 1.3 with a standard error of 0.2, giving an approximate 95% confidence interval of (0.91, 1.70), which is of substantially smaller width than the interval obtained with the random effects model. Of course, we might expect the confidence interval based on this naive analysis to be too small, as it incorrectly assumes all ties between individuals are independent and thus overestimates the precision of the parameter estimate.

4.2 Tribal Alliances

Read (1954) describes a number of network relations between sixteen New Guinean tribes. Here we consider the network of alliances between tribes, letting $y_{i,j} = 1$ if tribes i and j have an alliance, and $y_{i,j} = 0$ otherwise. We analyze these data using the simple distance model with no covariates or separate sender- and receiver-specific random effects:

$$\text{log odds} \Pr(y_{i,j} = 1 \mid \beta_0, z_i, z_j) = \beta_0 - |z_i - z_j|,$$

where $\beta_0$ represents the baseline log odds of a tie between two nodes that have the same latent position (i.e. $\beta_0$ is the maximum log odds of a tie), and the $z_i$'s are latent positions in $\mathbb{R}^2$.

[Figure 2: Marginal posterior distributions for the classroom data, with panels including p(beta_1 | Y) and p(sigma_z^2 | Y): dashed lines represent the prior distributions for the variance parameters, solid lines the posterior. Vertical lines give the posterior median.]

Without separate sender- and receiver-specific effects, we may expect that tribes with many alliances will be estimated as being more centrally located, and those with few ties as being on the periphery.

Bayesian estimates and confidence intervals for $\beta_0$ and the $z_i$'s are obtained using the methods outlined in Section 3. In particular, samples of latent positions from the posterior distribution $p(z_1, \ldots, z_{16} \mid Y)$ are plotted in the first panel of Figure 3 (colors are chosen so that nearby node locations will have similar colors). Additionally, a black line drawn between nodes indicates the presence of an alliance.

Ad-hoc approaches, or simple point estimates of latent locations, might uncover some of the structure of the network. Our method goes beyond this by providing posterior confidence regions for node locations, which in turn give us a model-based measure of uncertainty about the network structure. Additionally, forms of predictive inference can be obtained from such a model. For example, suppose that the presence or absence of an alliance between pair (i, j) is unobserved or missing. The model can be fit with all available information (excluding the unknown $y_{i,j}$), and from the available information the posterior distributions of $z_i$ and $z_j$ can be obtained. From these, predictive inference about the value of $y_{i,j}$ can be made.
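One way to carry out the predictive step described above is to average the tie probability implied by the distance model over posterior draws. The sketch below assumes such draws are already available as NumPy arrays; the function name `predictive_tie_probability` and the randomly generated stand-in draws are hypothetical and do not come from the tribal data.

```python
import numpy as np

def predictive_tie_probability(beta0_samples, zi_samples, zj_samples):
    """Monte Carlo estimate of Pr(y_ij = 1 | Y) under the distance model.

    beta0_samples          : (S,) posterior draws of beta_0
    zi_samples, zj_samples : (S, 2) posterior draws of the two latent positions
    """
    dist = np.linalg.norm(zi_samples - zj_samples, axis=1)   # |z_i - z_j| per draw
    log_odds = beta0_samples - dist
    return float(np.mean(1.0 / (1.0 + np.exp(-log_odds))))

# Hypothetical draws standing in for MCMC output.
rng = np.random.default_rng(3)
S = 1000
beta0 = rng.normal(1.0, 0.2, size=S)
z_i = rng.normal([0.0, 0.0], 0.1, size=(S, 2))
z_near = rng.normal([0.2, 0.0], 0.1, size=(S, 2))   # a node close to i
z_far = rng.normal([3.0, 0.0], 0.1, size=(S, 2))    # a node far from i
p_near = predictive_tie_probability(beta0, z_i, z_near)
p_far = predictive_tie_probability(beta0, z_i, z_far)
```

Averaging over draws, rather than plugging in point estimates of the positions, carries the posterior uncertainty about node locations into the prediction.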

[Figure 3: Tribal alliance network and marginal posterior distributions of locations.]

Also collected by Read (1954) were data on conflicts between the tribes. It is interesting to note that, based on a clustering of nodes (1,2,15,16), (3,4,6,7,8,11,12), and (5,9,10,13,14), there were no within-cluster conflicts, even though not every tribe within a cluster had an alliance with every other cluster member. Additionally, node 7, towards the center of the alliance structure, had no conflicts with any of the other 15 tribes. We note that both responses (conflict and alliance) could be modeled concurrently by a similar method, in which a multinomial logistic random effects model is employed in place of the binary logistic random effects model above.

5 Discussion

This article proposes a form of generalized linear mixed-effects model for the statistical analysis of network data for which parameter estimation is practical to implement. The approach has some advantages over existing social network models and inferential procedures: the approach allows for prediction and hypothesis testing; lends itself to a model-based method of network visualization; is highly extendable and interpretable in terms of well-known statistical procedures such as regression and generalized linear models; and has a feasible means of exact parameter estimation.

The models discussed here can capture some types of network dependence, although it is possible (or even likely) that in many datasets there are types of dependencies that cannot be well-represented with these models. It then becomes important to develop methods for assessing model lack of fit, and determining the effect of lack of fit on the estimation of regression coefficients. Furthermore, it may be useful to combine the types of random effects discussed here with other types of random effects, or latent variables. For example, Nowicki and Snijders (2001) discuss a latent class model, a useful model for identifying clusters of nodes that relate to others in similar ways. Their latent class model, combined with types of random effects models presented here and possibly other random effects structures, could provide a rich class of models for dependent network data.

References

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian computation and stochastic systems," Statist. Sci., 10, 3-66. With comments and a reply by the authors.

Gelfand, A. E. and Smith, A. F. M. (1990), "Sampling-based approaches to calculating marginal densities," J. Amer. Statist. Assoc., 85, 398-409.

Hansell, S. (1984), "Cooperative groups, weak ties, and the integration of peer friendships," Social Psychology Quarterly, 47, 316-328.

Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002), "Latent Space Approaches to Social Network Analysis," Journal of the American Statistical Association, 97, to appear.

McCullagh, P. and Nelder, J. A. (1983), Generalized linear models, Chapman & Hall, London.

McCulloch, C. E. and Searle, S. R. (2001), Generalized, linear, and mixed models, Wiley Series in Probability and Statistics: Texts, References, and Pocketbooks Section, Wiley-Interscience [John Wiley & Sons], New York.

Nowicki, K. and Snijders, T. A. B. (2001), "Estimation and Prediction for Stochastic Block Structures," Journal of the American Statistical Association, 96, 1077-1087.

Read, K. (1954), "Cultures of the central highlands, New Guinea," Southwestern Journal of Anthropology, 10, 1-43.

Wasserman, S. and Faust, K. (1994), Social Network Analysis: Methods and Applications, Cambridge: Cambridge University Press.