| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 79
Colloquium
Evolution of document networks
Filippo Menczer
School of Informatics, Indiana University, Bloomington, IN 47408
How does a network of documents grow without centralized
control? This question is becoming crucial as we try to explain the
emergent scale-free topology of the World Wide Web and use link
analysis to identify important information resources. Existing mod-
els of growing information networks have focused on the structure
of links but neglected the content of nodes. Here I show that the
current models fail to reproduce a critical characteristic of infor-
mation networks, namely the distribution of textual similarity
among linked documents. I propose a more realistic model that
generates links by using both popularity and content. This model
yields remarkably accurate predictions of both degree and simi-
larity distributions in networks of web pages and scientific
literature.
There are important social and economic implications in
explaining how the topology evolves in information networks
such as the World Wide Web. Although text analysis has been
used for a long time to analyze documents, extract their meaning,
retrieve information, and map knowledge domains (1-3), link
analysis is increasingly used by search engines and digital librar-
ies to estimate the importance or reputation of documents (4-6)
and to map documents into topical clusters (7-10~.
A number of models have been proposed to explain the growth
of complex networks exhibiting characteristics such as the small-
world property (high clustering and low diameter) and the
scale-free property (power-law distribution of degree). A few
representative models are reviewed in Background. Which of
these competing theories is more plausible? In Validating Prior
Models, I show that the answer is none: Although they all
correctly predict degree distributions, they fail at predicting
another observable feature of the information network, namely
the distribution of textual similarity among linked documents. In
Degree-Similarity Mixture Model, I propose a growth model to
explicitly capture the trade-off between an author's desire to link
related and popular documents. The model is validated against
two data sets: a network of web pages sampled from the Open
Directory Project (DMOZ; http://dmoz.org) and a collection of
scientific articles published in PNAS. Web pages are connected
by hyperlinks and articles by citations. Numerical simulations are
used to generate predictions of degree and similarity distribu-
tions for both data sets.
Background
Since the discovery of scale-free and small-world phenomena in
web pages (11-15) and bibliographic collections (16-18), phys-
icists and computer scientists have developed growth theories
that model the behavior of authors linking new pages to explain
the emergence of these critical network properties (9, 19~. Most
growth models are based on some form of preferential attach-
ment, whereby one node at a time is added to the network with
new edges to existing nodes selected according to some proba-
bility distribution.
In the best known preferential attachment model, a node i
receives a new edge with probability proportional to its current
degree, Phi) x kg) (13~. This so-called BA model generates
networks with power-law degree distributions, in which the
www.pnas.org/cgi/doi/10. 1 073/pnas.03075541 00
oldest nodes are those with highest degree. The copying model
and its extensions implement equivalent rich-get-richer pro-
cesses based on local walks, without requiring explicit knowledge
of degree (20-22~.
To give newer nodes a chance to compete for links, an
extension of the preferential attachment model is based on
linking to a node dependent on its degree with some probability
or to a uniformly chosen node with the remaining probability
(23, 24~. Such a mixture model generates networks that can fit
the power-law degree distribution of the entire web as well as the
different distributions observed in subsets of the web such as
university and business homepages (25~.
Some theories have explored similar mixture models in which
links are created according to a trade-off between graph degree
and metric distance measures, showing that certain trade-off
regimes lead to power-law degree distributions (26, 27~.
To study the decision process by which authors link docu-
ments, let us consider the relationship between the probability
that two documents are linked and their content (text) similarity.
Content similarity can be measured by the cosine metric tradi-
tionally used in information retrieval (1~:
-d2
Id ~ ~ ~ - ~ ~d2
[1]
where d is a vector space representation of the text in document
d. Link probability can be approximated by a link similarity (or
neighborhood) metric defined as the Jaccard coefficient:
~,fdi,d2) = U U U '
[2]
where U,~ is the set representing d's neighborhood, which consists
of inlinks and outlinks for web pages and citations and references
for articles (in which case ~ is akin to cocitation and biblio-
graphic coupling). To visualize the relationship between text and
link similarity, one can map the joint distribution of ok and (a
across pairs of documents. Fig. 1 shows two such maps, obtained
from collections of web pages and PNAS articles. These maps
demonstrate that for web pages as well as scientific papers,
authors connect documents in a way that is significantly (al-
though not strongly) correlated with the similarity of those
documents to their own. The Pearson correlation coefficient
between ok and ~ is 0.10 for web pages (3.8 x 109 pairs) and 0.12
for PNAS articles (7.5 x 106 pairs).
To quantify the dependence of the web's link topology on
content, I considered the conditional probability that the link
neighborhood between two web pages is above some threshold
A, given that the two pages have some content similarity a, as a
function of K:
This paper results from the Arthur M. Sackier Colloquium of the National Acaclemy of
Sciences, "Mapping Knowiecige Domains," held May 9-11, 2003, at the Arnold and Mabel
Beckman Center of the National Acaclemies of Sciences and Engineering in Irvine, CA.
E-mail: fil@?incliana.edu.
@) 2004 by The National Acaclemy of Sciences of the USA
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5261-5265
OCR for page 80
n-11
0.75
al 0.5
nab
0.5
0 25
, ._
·~m I~
~__~ -
·~ ~~,
;~
-
0 0.25
Arc
.
1
Fig. 1. Joint distribution maps for content and link similarity across pairs of
web pages (Upper) and scientific articles (Lower). Colors code the logo of the
number of pairs with the corresponding similarity coordinates. The web data
are based on a stratified sample of 109,648 pages from the DMOZ, with their
inlinks and outlines, crawled in 2002. The article data are based on the titles,
abstracts, and references of 15,785 articles published in PNAS between 1997
and 2002.
1~, q) Iced q) = K A at, q) > A|
IOD, q) (7c~, q) = K| ' ~ ~
wherep, q are two web pages. An interesting phase transition was
observed between two distinct regions around a critical distance
K* independent of A (28~. For K > K*, the probability that two
pages are neighbors does not seem to depend on their content
similarity; for K < K*, the probability decreases according to a
power-law Pr(A~K) ~ K), with the decay exponent ~ growing
linearly with A. This observation suggested a content-based
growth model in which an author tends to link a new page to the
most popular among related (similar) pages and with decreasing
probability to less similar ones. As in the BA model, at each step
t one new page t is added, and m new links are created from t to
m existing pages, each selected from {i, i < t) with probability:
Pry, t) = ~ mt if crC(i, t) > tic*
ccr>(i, t) otherwise
, [4]
where m, K*, and By are constants derived from the data and c is
a normalization factor. This degree-similarity phase model ac-
curately predicted the degree distribution of the web pages in the
5262 1 www.pnas.org/cgi/doi/10. 1 073/pnas.0307554100
Fl
l
all pairs DMOZ +
linked pairs DMOZ x
all pairs PNAS 3<
linked pairs PNAS
. ~
W+~"~`
arc
.~
a~ 0 ~ xXxx x
C13 C
OK—KT+++'~
~ ~~+~"`
., ~
3K ~ ·' ~ ~
Fig. 2. Content similarity distributions for web pages (DMOZ) and scientific
articles (PNAS). The distributions across linked documents are clearly different
from the background distributions. The latter are exponential for both data
sets (exponential fit curves appear as lines in the log-linear plot).
sample from which the data were derived and was the first model
to do so taking content into account (28~.
Validating Prior Models
Given that all of the models described in Background can predict
the degree distribution of web pages and scientific articles, which
is most plausible and/or powerful in explaining the emerging
topology of information networks? To answer this question, we
need an independent observation from the data, in addition to
degree. I turned to the distribution of content similarity between
linked (neighbor) pages. Fig. 2 demonstrates that this distribu-
tion is qualitatively different from the background similarity
distribution for both web pages and scientific articles, clearly
indicating that content must play a role in the evolution of
information networks. I then looked for a model capable of
predicting both the degree distributions and the similarity dis-
tributions among linked documents. Such a model would be
more plausible than models predicting degree alone.
0.14 Al
0.12
O.1
c
0.08
0.06
0.04
0.02 x
O .
i
linked pairs DMOZ
content-independent models
degree similarity phase model
. ,
0 0.05 0.1
i
. . .
.-
. .
1
1, .....
I--.- -- --: ;-. ..
, _ .. :
.... . ....
,,,,: , :
: ~ -- . ,--
, , :
:
.. x x, X x x x x x x . , x x x x x
x x x x x x x x x x x x x x x ~
, ,
, :
r~-~—- . - ~ _ _
0.15 0.2 0.25 0.3 0.35
Tic
Fig. 3. Content similarity distributions across linked pages generated by
simulating various growth models for web pages, compared with DMOZ data.
In all simulations the parameters are set to match or fit the DMOZ data: n =
109,648 nodes, m = 15 links, and, in the degree-similarity phase simulation, K*
= 0.25 and ~ = 1.7.
Menczer
OCR for page 81
-
~,L~
- .
NcN
Fig. 4. (a) In all preferential attachment models, a new page t is likely to be ~
linked to a page with high degrees such as i,. (b) In the degree-uniform c,
mixture model, there is also some probability to link to any random page such ~ o.oo'
as i2. (c) In the degree-similarity mixture model, t is more likely to be linked to
a page that has high degree but is close in content space, i.e., similar to t, such
as i3.
1
Out
0.01
0.0001
The models in Background were validated by numerical sim-
ulations, with content similarity drawn from the exponential
distributions obtained by fitting the background distributions in
Fig. 2: Prtcrc) ~ 10- where ,u = 7 for web pages and ,u = 8 for
PNAS articles. The degree distribution of the data are well
matched by those generated by the mixture model (25) and the
degree-similarity phase model (28~. On the other hand, as shown
in Fig. 3, for web pages the growth models that do not consider
content (13, 20, 25) predictably generate similarity distributions
across linked pages that mirror the background exponential
distribution. The degree-similarity phase model does somewhat
better, but it generates a distribution that goes to zero too rapidly
for small and large arc. Results are analogous for PNAS articles.
Thus, none of the growth models outlined above generates
distributions of textual similarity across linked documents in
qualitative agreement with the data.
Degree-Similarity Mixture Model
The class of mixture models has a free parameter that can be
tuned to fit the data. At each step, one new document is added,
and m new links or references are created from it to existing
documents. At time t the probability that the ith document is
selected and linked from the tth document is
kg)
Phi) = cat t + (1—or)Pr~i), [5]
where i ~ t and cat ~ t0,1] is a preferential attachmeniparameter
(Fig. 4a). In the degree-uniform mixture model (25) Pr (i) = 1/t,
the uniform distribution (Fig. 4b). Let us now introduce an
alternative degree-similarity mixture model in which
on ~ , ~
0.01
-
0.001
,..>
.....
0.0001
0.1
. . . . . . . . . . . . . . . . . . . . . . . . . .
DMOZ data ~
degree-uniform mixture -------
degree-similarity mixture · -- -
~--Q^-~- a~
me
I1 a.
OIL
le-05
. . . . . . .
1 1 0 1 Go 1 000
degree (k)
Fig. 6. Degree distributions of web pages predicted by simulating the two
mixture models. In the degree-uniform mixture simulation, cr = 0.3; in the
degree-similarity simulation, ~ = 0.2 and ~ = 1.7. All parameters were set by
fitting the DMOZ data.
( ) (~cki' t) )
, [6]
where my is a constant (Fig. 4c). Like the degree-similarity phase
model, this model is inspired by the idea that authors tend to link
new documents to popular and related ones and by the obser-
vation that link probability between two web pages decays with
decreasing similarity as a power-law Pr(A~K) ~ K) with ~ = 1.7
for A = 0.1 (Fig. 54. However, the free parameter cat in the
degree-similarity mixture allows us to explicitly model the trade-
off between linking to related (similar) versus popular (high-
degree) documents.
To validate the degree-similarity mixture model, the networks
of web pages and PNAS articles were built by simulation and
compared with those obtained by simulating the degree-uniform
mixture model. Figs. 6 and 7 show the predictions generated for
web pages. Although both models accurately predict the degree
distribution, only the degree-similarity mixture model reason-
ably approximates the similarity distribution.
The PNAS article data were analyzed analogously to the
DMOZ data, yielding a conditional citation probability with a
tail that scales as a power-law Pr(A = 0.1~K) ~ K) just as for web
0.16
0.14
0.12
O.1
<'
0.08
0.06
0.04
0.02
.
DMOZ data x
degree uniform mixture -------
degree similarity mixture ---- ---
V —_ J _ _ _ _ _ _ _ _ _ _ _ _ _ _ ~
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35
tic
Fig. 5. Tails of conditional link probability Pr(AIK) as a function of ~ for pairs
of web pages (DMOZ) and PNAS articles, with power-lawfit exponents ~ = 1.7 Fig. 7. Distribution of content similarity among linked web pages predicted
and ~ = 3.1 respectively. by simulating the two mixture models.
Menczer
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5263
OCR for page 82
1 0000
1 000
100
10
. . . . . . . . . . . . . . . . .
PNAS data o
degree uniform mixture -------
degree-similarity mixture ~
'a G
%.
;;;;,
· %1
.~
" to
. ~
., bin`
\ ~ "a
I: ,~ lo "%
lo' ~ ~ lo ON
Fig. 8. Distributions of degree (citation count) of PNAS articles, as predicted
bythetwo mixture models. In all ofthesimulations, n = 15,785 nodes end m =
1 reference. In the degree-uniform mixture simulation, c' = 0.5; in the degree-
similaritysimulation, c'= 0.1 and fly= 3.1.All parametersweresetbymatching
or fitting the PNAS data (on Iy references to other papers in the PNAS col lection
were considered).
pages, but with a larger exponent By = 3.1 (Fig. 5). Figs. 8 and 9
show the predictions generated by simulating the growth of the
PNAS article network according to the two mixture models.
Both models accurately predict the distribution of citation
counts, although the degree-similarity model fits the data better.
Again, the degree-similarity mixture model generates a similarity
distribution in remarkable agreement with the data.
Conclusion
In this paper I have shown that existing growth models for
document networks generate the wrong predictions for the
distribution of content similarity across linked documents. Mod-
els that do not take content into account yield distributions that
are heavily skewed toward low similarities because those are
exponentially more frequent in the data. On the contrary, if
authors tend to link and cite related documents, one would
expect a similarity distribution with a fatter tail and a peak for
CTC > 0, as displayed by both web pages and scientific articles. This
behavior is captured by the degree-similarity mixture model.
The similarity measures and document representations used in
the presented analysis and model are quite crude. Cosine
similarity here is based on simple term frequencies. The Jaccard
coefficient for link similarity not only is a rough approximation
of link probability for web pages but is further limited by
incomplete knowledge owing to the necessary reliance on search
1. Salton, G. & McGill, M. (1983)An Introduction to Modern Information Retrieval
(McGraw-Hill, New York).
_. Belew, R. (2000) Finding Out About: A Cognitive Perspective on Search Engines
and the WOW (Cambridge Univ. Press, Cambridge, U.K.).
3. Borner, K., Chen, C. & Boyack, K. (2003) Annul Rev. Infi Sci. Technol. 37,
179-255.
4. Brin, S. & Page, L. (1998) Comput. Networks ISDN Syst. 30, 107-117.
5. Kleinberg, J. (1999) J. Assoc. Comput. Mach. 46, 604-632.
6. Mendelzon, A. & Rafiei, D. (2000) IEEE Data Eng. Bull. 23, 9-16.
7. Kleinberg, J. & Lawrence, S. (2001) Science 294, 1849-1850.
8. Girvan, M. & Newman, M. (2002) Proc. Natl. Acad. Sci. USA 99, 8271-8276.
9. Henzinger, M. & Lawrence, S. (2004) Proc. Natl. Acad. Sci. USA 101,
5186-5191.
10. Hopcroft, J., Khan, O., Kulis, B. & Selman, B. (2004) Proc. Natl. Acad. Sci. USA
101, 5249-5253.
11. Albert, R., Jeong, H. & Barabasi, A.-L. (1999) Nature 401, 130-131.
12. Huberman, B. & Adamic, L. (1999) Nature 401, 131.
13. Barabasi, A.-L. & Albert, R. (1999) Science 286, 509-512.
5264 1 www.pnas.org/cgi/doi/10.1 073/pnas.0307554100
0.14
0.12
0.1
0.
car
a,
0.06
0.04
0.02
V _
o
..
)2 , ~ - ' ~ o o A ~ O D O I' I to ~ o
-a-- : oOO '-
U , , --- _ _ , , , T
0.1 0.2 0.3 0.4 0.5 0.6
l
PNAS data n
degree uniform mixture -------
degree-similarity mixture --- -
Ooze ~
·: MONO I: ~ ~ C 3
Fig. 9. Distribution of content similarity among titles and abstracts of articles
that cite one another, as predicted by the two mixture models.
engines for inlink information. One natural direction for future
work is to repeat the analysis and validate the degree-similarity
mixture model by using more sophisticated document represen-
tations and similarity measures (2, 3, 29, 30~. Another is to
extend the validation of the model to see whether it can predict
additional properties of the networks, such as clustering coeffi-
cient and degree correlation (18, 22~. Finally, further insight
must be gained by studying the relationship between the mech-
anism studied here (linking similar documents) and other pro-
cesses likely to play a role in the evolution of link/citation
networks, such as copying (22) and coauthorship (17, 18~.
The results presented here strongly suggest that page content
cannot be neglected when we try to understand the evolution of
document networks. The tension between referring to popular
versus related documents provides us with a plausible and
unified model of how authors link nodes in such different
networks as the web and the scientific literature. This model
generates remarkably accurate predictions of how such a process
can lead to the emergent link and content structure of document
spaces.
I thank Jon Kleinberg, Rob Axtell, David Aldous, Laszlo Barabasi,
Reka Albert, Mark Newman, Lada Adamic, Alessandro Vespignani,
and Katy Borner for reviewing an earlier draft of this manuscript;
Rich Shiffrin and two anonymous reviewers for providing helpful
suggestions; the Open Directory Project for the DMOZ data; and the
National Academy of Sciences for the PNAS data. This work was
supported by National Science Foundation Career Award IIS-0133124/
0348940.
14. Broder, A., Kumar, S., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R.,
Tomkins, A. & Wiener, J. (2000) Comput. Networks ISDN Syst. 33,
309-320.
15. Adamic, L. & Huberman, B. (2000) Science 287, 2115.
16. de Solla Price, D. (1965) Science 149, 510-515.
17. Newman, M. E. J. (2004) Proc. Natl. Acad. Sci. USA 101, 5200-5205.
18. Borner, K., Maru, J. T. & Goldstone, R. L. (2004) Proc. Natl. Acad. Sci. USA
101, 5266-5273.
19. Dorogovtsev, S. & Mendes, J. (2003) Evolution of Networks: From Biological
Nets to the Internet and WOW (Oxford Univ. Press, Oxford).
20. Kleinberg, J., Kumar, S., Raghavan, P., Rajagopalan, S. & Tomkins, A. (1999)
Lect. Notes Comput. Sci. 1627,1-18.
21. Kumar, S., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. &
Upfal, E. (2000) in Proceedings of the 41st Annual IEEE Symposium on
Foundations of Computer Science (IEEE Comput. Soc., Silver Spring, MD),
pp. 57-65.
22. Vazquez, A. (2003) Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip.
Top. 67, 056104.
Menczer
OCR for page 83
23. Dorogovtsev, S., Mendes, J. & Samukhin, A. (2000) Phys. Rev. Lett. 85,
4633-4636.
24. Cooper, C. & Frieze, A. (2001) in Proceedings of the 9th Annual European
Symposium on Algorithms, Lecture Notes in Computer Science, ed. Meyer auf
der Heide, F. (Springer, Berlin), Vol. 2161, pp. 500-511.
25. Pennock, D., Flake, G., Lawrence, S., Glover, E. & Giles, C. (2002) Proc. Natl.
Acad. Sci. USA 99, 5207-5211.
26. Fabrikant, A., Koutsoupias, E. & Papadimitriou, C. (2002) in Proceedings of the
29th International Colloquium onAutomata, Languages, and Programming, Lecture
Menczer
Notes in Computer Science, eds. Widmayer, P., Triguero, F., Morales, R.,
Hennessy, M., Eidenbenz, S. & Conejo, R. (Springer, Berlin), Vol. 2380, pp.
110-122.
27. Aldous, D. (2003) arXiv.:cond-mat/0304701
28. Menczer, F. (2002) Proc. Natl. Acad. Sci. USA 99, 14014-14019.
29. Ganesan, P., Garcia-Molina, H. & Widom, J. (2003) Assoc. Comput. Mach.
Trans. Inf: Syst. 21, 64-93.
30. Landauer, T. K, Laham, D. & Derr, M. (2004) Proc. Natl. Acad. Sci. USA 101,
5214-5219.
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5265
Representative terms from entire chapter:
mixture model