Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Predictability of Large-scale Spatially Embedded Networks Carter T. Butts; 10/23/02 Abstract Although it is well-known that there is a relationship between soci~physical dim tance and edge probability in interpersonal networks, the predictive power of such distances for total network structure has not been established. Here, it is shown that upper bounds on the marginal edge probabilities for far-flung dyads can be used to place a lower bound on the predictive power of distance, and one such bound Is de- rived. Application of this bound to the special case of uniformly placed vertices on the plane suggests that only modest constraints are required for distance effects to dominate at large physical scales. Keywords: social networks, spatial models, Shannon entropy, predictive power, dis- tance ~ Introcluction Numerous studies show a strong relationship between physical distance and social structure (e.g., Merton (1948~; Festinger et al. (1950~; Caplow and Forman (1950~; Blake et al. (1956~; Whyte (1957~; Sommer (1969~; Snow et al. (1981~; Latane et al. (1995~; arguably, few other findings in the social sciences can claim such a degree of strength and generality. While this is an interesting and important result in and of itself, it begs a critical question: assuming *This material is based upon work supported by the National Science Foundation under Grant No. 0100999, as well as an NSF Graduate Research Fellowship. Additional support was provided by tlie Center for the Computational Analysis of Social and Organizational Systems and the Institute for Complex Engineered Systems at Carnegie Mellon University. tI would like to thank Kathleen Carley, David Krackhardt, Shelby Stewman, David Rode, and Katie Faust for their comments and input. "Institute for Mathematical Behavioral Sciences and Department of Sociology, University of California- Irvine, Irvine, CA 92697, buttsc@uci.edu DYNAMIC SOCIAL N-ETWORKMODELTNG~4ND A/JAl~YSIS 313

that the distance/edge probability relationship is as it appears to be, to what extent can this account for the variability of social structure writ large? In his 1984 comment, "Chance and Necessity in Sociological Theory," Bruce Mayhew makes the characteristically boIct claim that well over 905 of the variation in social structure is determined by physical spacei. If Mayhew's assertion is correct, then we would expect for network models based on vertex position to allow us to develop extremely credible predictions of large-scale network structure. Since individual positions can be inferred from population data, such a result should (in principle, at least) allow us to reduce the problem of macrostructural prediction to one of spatial demography. If the assertion is false, by contrast, then other approaches will be required to effectively mode! the structure of social macrostructure. While the "Mayhew question" is unlikely to be settled by a single paper, it is shown here that the requirements for predicting network structure from vertex layout are fairly modest. Fairly minimal constraints on the probability of edges between distant alters are sufficient to establish a lower bound on the predictive power of distance, where predictive power is defined in terms of reduction in the Shannon entropy of the total structure. Application of this bound to the special case of vertices in a planar region suggests that the requirements for strong distance effects (e.g., > 90'7o uncertainty reduction) are likely to be attainable in practice for moderate to large physical scales, and thus that it is reasonable to expect that large-scare spatially embedded social networks will be readily predictable from vertex position data. 2 Notation en c} Basic Assumptions For the results which follow, we will focus exclusively on the case of a loopless undirected graph G = (V, E) with known vertex set V and uncertain (i.e., random) edge set E. (For convenience, we denote the cardinality of V by N V.) It is assumed further that G is spatially embedded, in the sense that there exists some space $ and set V {v~,...,vN) such that V C $ and vi is the position of vertex vi. We assume that there exists some distance function, cl, on $, but do not require it to be a metric (i.e., it need not satisfy the triangle inequality). Substantively, the the most obvious interpretation of $ is as a social physical space (often called a "Blau" space (Blau, 1977~; such a space may include both physical and demographic dimensions, including gender, age, race, and primary language. For the purposes of our present application, we will focus on a physical space interpretation of $, but it should be emphasized that this is not required for the general result to apply. iThis, he quips, being an artificially low estimate due to measurement error. 314 DYNAMIC SOCIAL N~TWORKMODELI?JG AND ANALYSIS

3 Predictability of Spatially Embedded Networks What, precisely, is meant by the "predictability" of spatially embedded networks? In the context of this paper, predictability is understood to be the extent to which our initial uncertainty regarding network structure is reduced by the provision of new information. Specifically, we are interested in the extent to which knowledge of vertex positions within $ reduces our uncertainty regarding the edge set of G. A natural measure of uncertainty - and the one which we shall employ here - is the Shannon entropy, which can be interpreted as the expected length of an optimally encoded signal expressing the value of a random variable. Denoting the entropy function by I, we form the R2-like predictability measure P(G~V)~ _ i(G~V) (I) which expresses the extent to which knowledge of the vertex position set, V, can account for the tote] information content of G. With P as our notion of predictability, we can state the following general result: Theorem 1 (Predictability). Let G = (V,E) be a spatially embedded random graph with vertex position set V and distance function 1, and let G be distributed such that p({vi, vj) ~ E (G)) ~ ~ bt vi, vj : ~ (vi, vj) > rc, for some ~ < 0.5. Then P(G~V) > p(<d(vi,vj) > rc)~]IBM), and lime ~oP(G~V) > p~d(vi,vj) > rc), where P(G~V) = ~ _ ~v'), ~ is the Shannon entropy, and IBM) =clog26(16)iog2(161. Proof. For convenience in notation, let clij - cl(vi,vj) and let eij _ {vi,vj) ~ E(G). We begin by assuming that, for actors within radius rc, distance tells us nothing regard- ing edge probability; that is to say, p(ez'- Idi~ < rc) = 0~5. Then it trivially follows from the definition of the Shannon entropy that I (eij Idij < rc) - 1. For the complementary case' we begin by noting that I (eij |dij > rC) p (eij |dij > rC) log2 p (en |dij > rC ) - (1 p(eij Idij > rc)) log2(1p(eij |dij > rc)). The fact that p(eij Idij > rc) < ~ < 0.5 then implies that I (eij idle > rc) <~ log2 ~(1c) log2 (16) = IBM) We now consider the entropy of the entire graph. Using the well-known result that I(X, Y) < I(X) + I(Y) for (possibly dependent) random variables X, Y. we can bound the DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS 315

entropy of the graph by the sum of the (independent) edgewise entropies. Therefore we have I (G IV) ~ At, I (eel IV) {i,j} At, 1 + it, IBM) {i~j}:dii <rc {2,i}:dij >rc = (I (` )I)p(di; < rc) A (I ~ )I)P(dIj > rC)IB(6) = [p(dij < rc) +p(dij > rc) IB(~)1J (IV (\G)1) = tp (dij < rC) + (1p (dij < rC))]B(~)] (I ~ ) 1), Since the uninformative entropy of G is given by I(G) = (IV(2G)I)7 it follows that P(GIV) 1 _ I (G TV > 1 - [p (do < rc) + (1p (dij < rC))IB (I)] ( 2 ) 1p (do ~ rc) - (1p (A < rc))Is (a) p (dij > rc)p (dij > rc) IB (I) p (all; ~ rc) (1IB (I)) which demonstrates the first portion of Theorem 1. To complete the proof, we allow ~ ~ O and take the limit: imP(G~V) > limp (clij > rc) (1IB (I)) = limp (dij > rc) (1 + clog2 ~ + (1c) log2 (1c)) P (all; ~ rc) Lo This is a powerful and general result: it tells us that whenever we can place a reasonable upper bound on the marginal edge probability between distant vertices, we can use the quartiles of the distance distribution to place a lower bound on the predictive power of V. Furthermore, when this bound on marginal edge probability becomes small, the predictive power of the position set becomes bounded by the probability that the distance between two randomly selected vertices will exceed the critical threshold. Thus, where the threshold 316 DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS

distance is small relative to the overall distribution, we can guarantee that the total structure will be easily predicted from vertex position alone One important and somewhat counter-intuitive aspect of Theorem ~ is that it does not depend on N: the predictability of the total structure can be bounded by a function which depends only on the geometry of the population layout. Similarly we did not have to assume dyadic independence to obtain this result (only bounds on the edgewise marginals). These two facts greatly facilitate the application of Theorem ~ in the field. where population distributions and some crude estimates of the distanceiedge probability relationship may be all that is available. They also serve to reinforce the argument that the predictive power of distance is robust to varying assumptions about the precise determinants of network structure. 4 Uniform Population Distribution on the Plane Consider the special case in which a population of arbitrary size is placed uniformly within a square region of size ~ x {. Such a model may be thought of as a first approximation to a sparse population distribution in physical space, particularly over large areas. Here, we show the minimum threshold distances necessary to obtain a given level of predictive power for a structure on the plane, as a function of the linear scale fed of the embedding region. As will be shown, the only modest critical thresholds are required to guarantee high levels of predictability under uniform vertex placement. 4.] Distribution of Inter-point Distances In order to apply Theorem I, we must first know the distribution of inter-point distances for square planar regions. Under the assumption that ~ is the euclidean distance, we derive this distribution in the following lemma: Lemma 1. Let vi,vj designate two randomly selected points on a two-dimensional plane, with each coordinate being IID u(o,e). Then the density function of d (vi,vj) is given by ~ 2d [e2Ed + e2] f(d) = 1 2d (~sin 1 (I) _ e (I - i) + e4 ] ~ < d < It, O _ d < otherwise DYNAMIC SOCIAL N~:TWORK MODELING AND ANALYSIS 317

with associates! distribution function ~0 1 2 And _ 83 + d4] F<d) 4. ~ 2e2 3~3 4e4~ ~ 3 + id t12 (~ddq ~ ) ~1 ~ < 0 O _ d < e + sin ( d2 )] + ~ _ d4 e < d < Me d > Me Proof. For the tw~dimensional case, we may write the euclidean distance in terms of coor- dinate differences: l d (v2'vj);((Vi~l(Vj)l) + ((Vi)2 - (~i)2) By assumption, these coordinates are uniformly distributed on [O. I]. It can easily be shown that the difference between two such uniform deviates is distributed Triangular with lower boundi, upper bound e, and mode 0. Thus, we may simplify the distribution of ~ as follows: ~¢Vi~vi) ~ `/(Uto,e) - uto,e')2 + (uto.e) - u~o,ey)2 I ~ - ~/(T(-e~i,o))2+ (T(-t,e,o))2 Note that Tti,i,O) is symmetric about the origin; thus, it is a standard result that FT2(X) = 2FT (~1, where ET2 and FT are the cumulative distribution functions of variates T2 and T. respectively (Evans et al., 2000). For the distribution function of a Triangular deviate with lower bound a, upper bound b, and mode c we have FT (X) and hence 318 (O x<0 (ba)(ca) O < X < C (bx)2 C < X ~ b ' ~ 1 x > b FT2 (X) = 2FT (a 1 (~-Q)2 1 (ba) (ca) 12 (b ~) , (ba) (bc) ~1 4<0 O < ~ < C c < ~/7 < b 1/~> b DYNAMIC SOCIAL NETWORK HODEfI7JG AND ANALYSIS

which, by symmetry, can be collapsed to (for O < x < b2). 1 _ 2 (b>/~) 2bb b_ Aft I_ b2 To obtain the associated density function, fr2 (X), we simply differentiate: fT2 (X) C, FT2 (X) ~ (1_ (b fix ~ b2 J _ ~ o < X < b2 O otherwise Having derived the density of a single T2 variate, we now must consider the sum of two such (IID) variates. Since the variates in question are independent, their joint density is simply the product of their individual densities, and we may obtain the density of the sum via convolution. In this case, however, we note that the domain of T2 + T2 depends on the individual variates, and hence some care is needed in choosing the limits of integration We divide the density at xb2 (the point beyond which both variates must be greater than 0), taking the tower region first: rz fT2+T2 (z 1° < x ~ b ) ~ fT2 (y) fT2 (xy) dy o (bit b2)(b~ b2) b2 sin ( Y ~)1 _2~I~+2v~I~ y1= b2 [sin-1 (1)sin-1 ( - 1)] ~ ~ + b4 ~ 4~ + x b2 be b4 DYNAMIC SOCIAL NETWO=AJODELING AND ANALYSIS 319

Now, we consider the upper region: b2 fT2+T2 (X |b < x ~ 2b ) / fT2 (y) fT2 (x y) dy + Jx - b2 i_b2 (bit b2) (be - b2) dy 1 1 (2yz ) 162 24|b2 2~lb2 ~ lb2 b2 ~ X J |=b2 b3 1=b2 b3 Ixbe b4 1Xb2 1 [sin~1 (2b2 _ 2 ) _ sin~1 (2 - 2b2)] _ 2 (bA/ ) b3(~ b)+2b2_ 2 2 1 2b2 - _ 4 b - b2 + _ b2 ( 2 ) b3( ) b b3 1=h2 Finally, putting this together, we obtain the complete density for a sum of two squared triangular variates: fT2+T2 (X) ~ Or 4~ x . - _ ~ _ 1 b2 b3 1 b4 sin ~ (IBM ) ~ (b ·~) + 2b2_x LO O _ x < b2 b2 < x < 2b2 otherwise From the above, we may now derive the density of d. The last step in this process involves applying a positive square root transformation to the sum of squared triangular variates. This is a monotonically increasing one-tmone transformation, and we can thus derive the new density by a simple change of variables: fJ~(X) = fT2+T2 (X ) ~J~ (where ~~ is the Jacobian determinant of the transformation) fT2+T2 (X ) 2x {2z t~~ + ~b4 ~ O < ~ < b = 2 - 1 2b2_ 2 _ 4 2b2_ 2 2x it sin (I) ~ (b ~ + b4 ~ b < x < Mob LO otherwise To arrive at the distribution function of d, we need merely integrate fame over the desired 320 DYNAMIC SOCIAL N~:TWORK MODELING AND ANAl~YSIS

: - - - Scale P(GIV) (xt 90~0 95% 99~O O.IxO.l Next ~ lOxIO km lOOxlOO km 0.019 0.013 0.006 0 192 0.133 0.058 1.924 1.331 0.578 _ 19.236 13.311 5.923 . . All rC values given in kilometers. Table 1: Maximum Critical Radius as a Function of Scale and Uncertainty Reduction, Uni- form Vertex Placement range. F,7~(x) = [O X < 0 to 2X ~b2 4b3 + ~b4 ~ my O < x < b i0 2x ~b2 b3 + b4 ~ my + ib 2x tb2 sin ~ =2 ~ b3 (b~ + b ala ~ dY b < x < >/7b 1 TO 2 [2b2 3b3 + 4b4] 2~2 [i 2 (butb4) + sin 1 ( =2 )] +8(22_b2)2 =4 1 b3 b4 x > Mob Substitution of ~ for x and e for b completes the proof 4.2 Predictability x < 0 O ~ x < b b < x < Mob x > pub [1 Using Lemma ~ together with Theorem 1, we may determine the maximum rC value needed to guarantee that a given fraction of the uncertainty in G can be accounted for by the euclidean distances between vertex positions; these threshold values are shown in Table I. As the table indicates, a critical radius at or below approximately ~o20 of the linear scale of the region is adequate for a ~o90 uncertainty reduction, with a reduction of %99 possible for radii of 0.06t or less. DYNAMIC SOCKS NETWO~MODE~WG AND TRYSTS 321

How do these threshold values compare to empirical assessments of the distance/edge probability relationship? Fits of dyadic edge models to existing data sets suggest that a tow- probability threshold is attainable, but also clearly indicate that thresholds will depend upon relational content (Butts, 2002~. Recalling that P(G~V) > p(d (vi, vj) > rc) (lIB any), we can express the adequacy of the threshold approximation in terms of JIB (~) In this regard, ~ ~ 0.001 is sufficient to bring the predictability bound within approximately Ado of the limit; such a bound is not hard to achieve in practice. Based on the Butts models, thresholds of as little as 0.05km may be reasonable approximations for face-t~face contact, with larger thresholds of approximately 0.25km and 1Skm for social friendship and telephone contacts, respectively. Although it may be possible to obtain more predictive power using these or other models than Table ~ would suggest, the lower bounds alone indicate that physical layout has the potential to account for the overwhelming majority of network structure at even modest spatial scales. 5 Conclusion To summarize, then, it would seem that even a very modest null mode] based on physical distance (the threshold model) must account for the vast majority of network structure in large-scale networks, under quite minimal assumptions. Since fitted models have the capacity to be much more informative than the null model, they are expected to provide even more information about network macrostructure at even smaller scales. Thus, not only might one reasonably speculate that distance could account for most of the uncertainty in large-scale interpersonal networks, it almost has to do so. This would seem to vindicate the intuition of theorists such as Mayhew, who perceived that physical space was a critical structuring force, but who did not demonstrate the extent of that result. References Blake, R. R., Rhead, C, Wedge, B., and Monton, ]. (1956~. Housing architecture and social interaction. Sociometry, 19:133-139. Blau, P M. (1977~. Inequality and Heterogeneity. Free Press, New York. Butts, C. T. (2002~. Spatial Models of Large-scale Interpersonal Networks. Doctoral Disser- tation, Carnegie MeHon University. Caplow, T. and Forman, R. (1950~. Neighborhood interaction in a homogeneous community. American Sociological Review, 15:357-366. Evans, M., Hastings, N., and Peacock. B. (2000~. Statistical Distributions. John Wiley and Sons, New York, third edition. , 322 DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS

Festinger, L., Schachter, S., and Back, K. (1950~. Stanford University Press, Stanford, California. Social Pressures in Informal Groups. Latane, B., Liu J. H., Nowak, A., Bonevento, M., and Zheng, L. (1995~. Distance matters: Physical space and social impact. Personality and Social Psychology But7tetin, 21~:795- 805. Mayhew, B. H. (1984~. Chance and necessity in sociological theory. Journal of Mathematical Sociology, 9:305-339. Merton,R. (1948~. The socialpsychologyof housing. InDennis,W.,editor, Current Trends in Social Psychology, pages 163-217. University of Pittsburgh Press, Pittsburgh, PA. Snow, D. A., Leahy, P. ]., and Schwab, W. A. (1981~. Social interaction in a heterogeneous apartment: An investigation of the effects of environment upon behavior. Sociological Focus, 14~4~:309-319. Sommer' R. (1969~. Personal Space. Prentice-Hall, Englewood Cliffs, N]. Whyte, NAi. H. (1957~. The Organization Man. Doubleday, Garden City, NY. DYNAMIC SOCIAL N~TWO=MODE~NG ED PHYSIC 323