Colloquium
Mixed-membership models of scientific publications
Elena Erosheva*†, Stephen Fienberg‡§, and John Lafferty
*Department of Statistics, School of Social Work, and Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195, and
‡Department of Statistics, Computer Science Department, and §Center for Automated Learning and Discovery, Carnegie Mellon University,
Pittsburgh, PA 15213
PNAS is one of the world's most cited multidisciplinary scientific
journals. The PNAS official classification structure of subjects is
reflected in topic labels submitted by the authors of articles, largely
related to traditionally established disciplines. These include broad
field classifications into physical sciences, biological sciences, social
sciences, and further subtopic classifications within the fields.
Focusing on biological sciences, we explore an internal soft-
classification structure of articles based only on semantic decom-
positions of abstracts and bibliographies and compare it with the
formal discipline classifications. Our model assumes that there is a
fixed number of internal categories, each characterized by multi-
nomial distributions over words (in abstracts) and references (in
bibliographies). Soft classification for each article is based on
proportions of the article's content coming from each category. We
discuss the appropriateness of the model for the PNAS database as
well as other features of the data relevant to soft classification.
The Proceedings is there to help bring new ideas
promptly into play. New ideas may not always be right,
but their prominent presence can lead to correction. We
must be careful not to censor even those ideas which
seem to be off beat.
Saunders MacLane (1)
Are there internal categories of articles in PNAS that we can
obtain empirically with statistical data-mining tools based
only on semantic decompositions of words and references used?
Can we identify MacLane's "off-beat" but potentially path-
breaking PNAS articles by using these internal categories? Do
these empirically defined categories correspond in some natural
way to the classification by field used to organize the articles for
publication, or does PNAS publish substantial numbers of
interdisciplinary articles that transcend these disciplinary bound-
aries? These are examples of questions that our contribution to
the mapping of knowledge domains represented by PNAS
explores.
Mathematical and statistical techniques have been developed
for analyzing complex data in ways that could reveal underlying
data patterns through some form of classification. Computa-
tional advances have made some of these techniques extremely
popular in recent years. For example, 2 of the 10 most cited
articles from 1997-2001 PNAS publications are on applications
of clustering for gene-expression patterns (2, 3). The traditional
assumption in most methods that aim to discover knowledge in
underlying data patterns has been that each subject (object or
individual) from the population of interest inherently belongs to
only one of the underlying subpopulations (clusters, classes,
aspects, or pure type categories). This implies that a subject
shares all its attributes, usually with some degree of uncertainty,
with the subpopulation to which it belongs. Given that a rela-
tively small number of subpopulations is often necessary for a
meaningful interpretation of the underlying patterns, many data
collections do not conform with the traditional assumption.
Subjects in such populations may combine attributes from
several subpopulations simultaneously. In other words, they may
have a mixed collection of attributes originating from more than
one subpopulation.
5220-5227 | PNAS | April 6, 2004 | vol. 101 | suppl. 1
Several different disciplines have developed approaches that
have a common statistical structure that we refer to as mixed
membership. In genetics, mixed-membership models can ac-
count for the fact that individual genotypes may come from
different subpopulations according to (unknown) proportions of
an individual's ancestry. Rosenberg et al. (4) use such a model
to analyze genetic samples from 52 human populations around
the globe, identifying major genetic clusters without using the
geographic information about the origins of individuals. In the
social sciences, such models are natural, because members of a
society can exhibit mixed membership with respect to the
underlying social or health groups for a particular problem being
studied. Hence, individual responses to a series of questions may
have mixed origins. Woodbury et al. (5) use this idea to develop
medical classification. In text analysis and information retrieval,
mixed-membership models have been used to account for dif-
ferent topical aspects of individual documents.
In the next section, we describe a class of mixed-membership
models that unifies existing special cases (6). We then explain
how this class of models can be adapted to analyze both the
semantic content of a document and its citations of other
publications. We fit this document-oriented mixed-membership
model to a subcollection of the PNAS database supplied to the
participants in the Arthur M. Sackler Colloquium Mapping
Knowledge Domains. We focus in our analysis on a high-level
description of the fields in biological sciences in terms of a small
number of extreme or basis categories. Griffiths and Steyvers (7)
use a related version of the model for abstracts only and attempt
a finer level of description.
Mixed-Membership Models
The general mixed-membership model that we work with relies
on four levels of assumptions: population, subject, latent variable,
and sampling scheme. Population-level assumptions describe
the general structure of the population that is common to
all subjects. Subject-level assumptions specify the distribution of
observable responses given individual membership scores. Mem-
bership scores are usually unknown and hence can be viewed also
as latent variables. The next assumption is whether the mem-
bership scores are treated as fixed or random in the model.
Finally, the last level of assumptions specifies the number of
distinct observed characteristics (attributes) and the number
of replications for each characteristic. We describe each set of
assumptions formally in turn.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of
Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel
Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.
†To whom correspondence should be addressed. E-mail: elena@stat.washington.edu.
© 2004 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0307760101
Population Level. Assume there are K original or basis subpopulations
in the populations of interest. For each subpopulation k,
denote by $f(x_j \mid \theta_{kj})$ the probability distribution for response
variable $j$, where $\theta_{kj}$ is a vector of parameters. Assume that,
within a subpopulation, responses to observed variables are
independent.
Subject Level. For each subject, membership vector $\lambda = (\lambda_1, \ldots, \lambda_K)$
provides the degrees of a subject's membership in each of the
subpopulations. The probability distribution of observed responses
$x_j$ for each subject is defined fully by the conditional
probability $\Pr(x_j \mid \lambda) = \sum_k \lambda_k$ …
can be interpreted also as a latent classification process in which
an aspect of origin is drawn first for each word and for each
reference in a document, according to a multinomial distribution
parameterized by the document-specific membership scores $\lambda$,
and words and references then are generated from corresponding
distributions of the aspects of origin (6). Rather than a
mixture of K latent classes, the model can be thought of as a
"simplicial mixture" (13) because the word and reference
probabilities range over a simplex with corners $\theta_{1k}$ and $\theta_{2k}$,
respectively.
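The latent classification process described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_dirichlet`, `draw_item`, `generate_document`) and the toy parameters are hypothetical, and the per-aspect distributions are given as small dictionaries rather than estimated from data.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw membership scores from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def draw_item(dist, rng):
    """Draw one key from a dict that maps items to probabilities."""
    items = list(dist)
    return rng.choices(items, weights=[dist[i] for i in items])[0]


def generate_document(alpha, word_dists, ref_dists, n_words, n_refs, rng):
    """Generate one document: membership scores, then words and references."""
    lam = sample_dirichlet(alpha, rng)
    aspects = list(range(len(alpha)))
    words, refs = [], []
    for _ in range(n_words):
        k = rng.choices(aspects, weights=lam)[0]  # aspect of origin for this word
        words.append(draw_item(word_dists[k], rng))
    for _ in range(n_refs):
        k = rng.choices(aspects, weights=lam)[0]  # aspect of origin for this reference
        refs.append(draw_item(ref_dists[k], rng))
    return lam, words, refs


# Toy (hypothetical) parameters: two aspects with tiny vocabularies.
TOY_ALPHA = [0.5, 0.5]
TOY_WORDS = [{"kinase": 0.7, "receptor": 0.3},
             {"phylogeny": 0.6, "selection": 0.4}]
TOY_REFS = [{"ref_a": 1.0}, {"ref_b": 0.5, "ref_c": 0.5}]

if __name__ == "__main__":
    print(generate_document(TOY_ALPHA, TOY_WORDS, TOY_REFS, 8, 3, random.Random(0)))
```

Drawing an aspect and then an item from that aspect's distribution is equivalent to drawing the item directly from the convex combination of the aspect distributions, which is why the word and reference probabilities range over a simplex.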
The likelihood function is thus
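$$p(d \mid \theta) = \int \mathrm{Dir}(\lambda \mid \alpha) \prod_w p_\lambda(w)^{n(w,d)} \prod_r q_\lambda(r)^{n(r,d)} \, d\lambda$$
$$= \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \int \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1} \prod_w p_\lambda(w)^{n(w,d)} \prod_r q_\lambda(r)^{n(r,d)} \, d\lambda, \qquad [7]$$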
where integrals are over the (K - 1) simplex.
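To make the integral over the simplex concrete, the sketch below estimates the log-likelihood of a single document by naive Monte Carlo: it averages the product of mixture terms over draws of the membership scores from the Dirichlet. This is a swapped-in illustration, not one of the approximation schemes used in this article, and the names `mixture_prob` and `mc_log_likelihood` are hypothetical. It assumes every observed word and reference has positive probability under the sampled mixture.

```python
import math
import random


def mixture_prob(lam, dists, item):
    """Convex combination: sum over aspects k of lam[k] * dists[k][item]."""
    return sum(l * d.get(item, 0.0) for l, d in zip(lam, dists))


def mc_log_likelihood(alpha, word_dists, ref_dists, word_counts, ref_counts,
                      n_samples=5000, seed=0):
    """Naive Monte Carlo estimate of log p(d | theta) for one document."""
    rng = random.Random(seed)
    log_terms = []
    for _ in range(n_samples):
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        lam = [g / total for g in draws]          # one Dirichlet(alpha) draw
        term = 0.0
        for w, n in word_counts.items():
            term += n * math.log(mixture_prob(lam, word_dists, w))
        for r, n in ref_counts.items():
            term += n * math.log(mixture_prob(lam, ref_dists, r))
        log_terms.append(term)
    # Average the integrand in a numerically stable way (log-sum-exp).
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms) / n_samples)
```

With a single aspect the membership vector is always 1, so the estimate reduces to the ordinary multinomial log-likelihood; that special case is a convenient sanity check. For realistic vocabularies the variance of such an estimator is high, which is one reason deterministic approximations are preferred.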
It is important to note that the assumption of exchangeability
among words and references (conditional independence given
the membership scores) does not imply joint independence
among the observed characteristics. Instead, the assumption of
exchangeability means that dependencies among words and
references can be explained fully by the membership scores of
the documents. For an extended discussion on exchangeability in
this context, see ref. 16.
Alternative Model for References
For the analysis of PNAS publications in the next section, we
assume multinomial sampling of words and references. Although
multinomial sampling is computationally convenient, it is not a
realistic model of the way in which authors select references for
the bibliography of an article. We briefly describe an example of
more realistic generative assumptions for references.
Suppose an article focuses on a sufficiently narrow scientific
area. In this case, the authors may have essentially perfect
knowledge of the literature, and thus they would pay separate
attention to each article in their pool of references as they
consider whether to include it in the bibliography. Under these
circumstances, given that the pool of references contains R
articles, we assume that a document is represented as
$d = (\{x_1^{(r_1)}\}, x_2, x_3, \ldots, x_{R+1})$, where $x_1^{(r_1)}$ is a word in the abstract,
R is the number of references, and $x_2, \ldots, x_{R+1}$ are all references
in the pool. Reference counts do not change: they are given by
$n(r, d) = 1$ if the bibliography of d contains a reference to r and
by $n(r, d) = 0$ otherwise.
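Then our model for generating documents would be to sample
$\lambda$ and $x_1^{(r_1)}$ according to Eqs. 4 and 5, and sample $x_j$, $j = 2, \ldots, R + 1$,
according to
$$x_j \sim \mathrm{Bernoulli}[q_\lambda(x_j)], \qquad [8]$$
$$\text{where } q_\lambda(x_j) = \sum_{k=1}^{K} \lambda_k \theta_{jk}. \qquad [9]$$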
The likelihood function based on this alternative model would
not only take into account which documents contain which
references, but it also would incorporate the information about
which references documents do not contain.
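The alternative model can be sketched as follows. The code treats each reference in the pool as a separate Bernoulli trial whose success probability is the membership-weighted combination of per-aspect inclusion probabilities, and its log-likelihood counts absent references as well as cited ones. The function names and the toy numbers are hypothetical; here each `theta_ref[k][r]` is assumed to be an inclusion probability strictly between 0 and 1, not an entry of a distribution that sums to 1.

```python
import math
import random


def inclusion_prob(lam, theta_ref, r):
    """Probability that reference r is cited: sum_k lam[k] * theta_ref[k][r]."""
    return sum(l * t[r] for l, t in zip(lam, theta_ref))


def sample_bibliography(lam, theta_ref, pool, rng):
    """Decide independently, for each reference in the pool, whether to cite it."""
    return [r for r in pool if rng.random() < inclusion_prob(lam, theta_ref, r)]


def bernoulli_log_lik(lam, theta_ref, pool, cited):
    """Log-likelihood of a bibliography, including references that are absent."""
    cited = set(cited)
    total = 0.0
    for r in pool:
        q = inclusion_prob(lam, theta_ref, r)
        total += math.log(q) if r in cited else math.log(1.0 - q)
    return total


# Toy (hypothetical) example: two aspects, a pool of two references.
TOY_LAM = [0.5, 0.5]
TOY_THETA = [{"r1": 0.8, "r2": 0.2}, {"r1": 0.4, "r2": 0.6}]
TOY_POOL = ["r1", "r2"]

if __name__ == "__main__":
    print(sample_bibliography(TOY_LAM, TOY_THETA, TOY_POOL, random.Random(0)))
    print(bernoulli_log_lik(TOY_LAM, TOY_THETA, TOY_POOL, ["r1"]))
```

Unlike the multinomial version, the term `math.log(1.0 - q)` makes a reference that was available but not cited contribute to the fit.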
Both the basic model for references and any alternatives still
would need to reflect the time ordering on publications and
include in the pool of possible references only those that have
been published already, perhaps even with a short time lag.
However, even such changes are unlikely to produce a "correct"
model for citation practices.
Estimating the Model
The primary complication in using a mixed-membership model
such as is shown in Eqs. 4-6, in which the membership probabilities
are random rather than fixed, is that the integral in Eq.
7 cannot be computed explicitly and therefore must be approximated.
Two approximation schemes have been investigated
recently for this problem and the associated problem of fitting
the model. In the variational approach (12), the mixture terms
$p_\lambda(w) = \sum_{k=1}^{K} \lambda_k \theta_{1k}(w)$ are bounded from below in a product
form that leads to a tractable integral; the lower bound is then
maximized. A related approach, called expectation-propagation
(13), also approximates each mixture term in a product form but
chooses the parameters of the factors by matching first and
second moments. Either of these approximations to the integral
(Eq. 7) can be used in an approximate expectation-
maximization (EM) algorithm to estimate the parameters of the
models. It is shown in ref. 13 that expectation-propagation in
general leads to better approximations than the simple variational
method for mixed-membership models, although we obtained
comparable results with both approaches on the PNAS
collection. The results reported below use the variational
approximation.
The PNAS Database
The National Academy of Sciences provided the database for the
participants of the colloquium. We focused on a subset of all
biological sciences articles in volumes 94-98 (Julian years 1997-
2001) of PNAS, thereby ignoring articles published in the social
and physical sciences unless they have official dual classifications
with one classification in the biological sciences. The reason for
this narrowing of focus is 2-fold. First, the major share of PNAS
publications in recent years represents research developments in
the biological sciences. Thus, of 13,008 articles published in
volumes 94-98, 12,036 (92.53%) are in the biological sciences.
The share of social and physical sciences articles in volumes
94-98 is a much more modest 7.47%. Second, we assume that a
collection of articles is characterized by mixed membership in a
number of internal categories, and social and physical sciences
articles are unlikely to share the same internal categories with
articles from the biological sciences. We also automatically
ignore other types of PNAS publications such as corrections,
commentaries, letters, and reviews, because these are not tra-
ditional research reports. Among the biological sciences articles
in our database, 11 articles were not processed because they did
not have an abstract, and 1 article was not processed because it
did not contain any references.
PNAS is one of the world's most cited multidisciplinary scientific
journals. Historically, when submitting a research paper to
PNAS, authors have to select a major category from physical,
biological, or social sciences and a minor category from the list
of topics. PNAS permits dual classifications between major
categories and, in exceptional cases, within a major category. The
lists of topics change over time to reflect changes in the National
Academy of Sciences sections. PNAS, in its information for
authors (revised in June 2002), states that it classifies publica-
tions in biological sciences according to 19 topics; the numbers
of published articles and numbers of dual-classified articles in
each topic are shown in Table 1.
The topic labels provide a classification structure for pub-
lished materials, and most of the articles are members of only a
single topic. For our mixed-membership model, we assume that
there is a fixed number of extreme internal categories or aspects,
each of which is characterized by multinomial distributions over
words (in abstracts) and references (in bibliographies). Aspects
are determined from contextual decompositions in such a way
that a multinomial distribution of words and references in each
document is a convex combination of the corresponding distributions
from the aspects. The convex combination for each
article is based on proportions of the article's content coming
from each category. These proportions, or membership scores, …

Table 1. Biological sciences publications in PNAS volumes 94-98
by subtopic

      Topic                          n
  1   Biochemistry                   2,578 (33)
  2   Medical sciences               1,547 (13)
  3   Neurobiology                   …
  4   Cell biology                   …343 (9)
  5   Genetics                       980 (14)
  6   Immunology                     865 (9)
  7   Biophysics                     636 (40)
  8   Evolution                      510 (12)
  9   Microbiology                   498 (11)
 10   Plant biology                  488 (4)
 11   Developmental biology          366 (2)
 12   Physiology                     340 (1)
 13   Pharmacology                   188 (2)
 14   Ecology                        133 (5)
 15   Applied biological sciences    94 (6)
 16   Psychology                     88 (1)
 17   Agricultural sciences          43 (2)
 18   Population biology             43 (5)
 19   Anthropology                   10 (0)
      Total                          11,981 (179)

The numbers of articles with dual classifications are given in parentheses.
the "stop list" before fitting the model. If the distribution of stop
words is not uniform across the internal categories, this alternative
approach may potentially produce different results.
The following interpretations are based on examination of 50
high-probability words for each aspect. Note that enumeration of
the aspects is arbitrary. The first aspect includes words such as
Ca2+, kinase, phosphorylation, receptor, and G (protein) channel,
which pertain to cell signaling and intracellular signal
transduction. It is likely that, in this aspect, signal transduction
is considered as applied to neuron signaling as indicated by the
words synaptic, neurons, voltage. It is interesting that Ca2+ in the
first aspect is the highest-probability contextual word over all
the aspects. Frequent words for the second aspect indicate that
its context is related to molecular evolution that deals with
natural selection on the population and intraspecies level and
mechanisms of acquiring genetic traits. Words in aspect 3 pertain
mostly to the plant molecular biology area. High-probability
words in aspect 4 relate to studies of neuronal responses in mice
and humans, which identify this aspect as related to developmental
biology and neurobiology. Aspect 5 contains words that
can be associated with biochemistry and molecular biology.
Words in aspect 6 point to genetics and molecular biology.
Frequent words for aspect 7 contain such terms as immune, IL
(or interleukin), antigen, (IFN) gamma, and MHC class II, which
point to a relatively new area in immunology, namely, tumor
immunology. The presence of such words as HIV and virus
in aspect 7 indicates a more general immunology content.
For aspect 8, words such as increase or reduced, treatment,
effect, fold, and P (assuming it stands for P value) correspond to
general reporting of experimental results, likely in the area of
endocrinology.

Table 3. High-probability references by aspect

Aspect 1
Author             Journal, Year                    C
HAMILL OP          PFLUG ARCH EUR J PHY, 1981       72
LAEMMLI UK         NATURE, 1970                     322
HILLE B            IONIC CHANNELS EXCIT, 1992       58
BLISS TVP          NATURE, 1993                     54
SUDHOF TC          NATURE, 1995                     33
GRYNKIEWICZ G      J BIOL CHEM, 1985                31
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
SHERRINGTON R      NATURE, 1995                     33
ROTHMAN JE         NATURE, 1994                     27
SIMONS K           NATURE, 1997                     35
SOLLNER T          NATURE, 1993                     25
ROTHMAN JE         SCIENCE, 1996                    24
THINAKARAN G       NEURON, 1996                     23
TOWBIN H           P NATL ACAD SCI USA, 1979        86
BERMAN DM          CELL, 1996                       21

Aspect 2
Author             Journal, Year                    C
SAITOU N           MOL BIOL EVOL, 1987              96
THOMPSON JD        NUCLEIC ACIDS RES, 1994          147
ALTSCHUL SF        NUCLEIC ACIDS RES, 1997          160
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
ALTSCHUL SF        J MOL BIOL, 1990                 253
FELSENSTEIN J      EVOLUTION, 1985                  51
KISHINO H          J MOL EVOL, 1989                 31
STRIMMER K         MOL BIOL EVOL, 1996              31
KIMURA M           J MOL EVOL, 1980                 34
EISEN MB           P NATL ACAD SCI USA, 1998        60
SWOFFORD DL        PAUP PHYLOGENETIC AN, 1993       25
KIMURA M           NEUTRAL THEORY MOL E, 1983       28
KUMAR S            MEGA MOL EVOLUTIONAR, 1993       26
HASEGAWA M         J MOL EVOL, 1985                 24
NEI M              MOL EVOLUTIONARY GEN, 1987       28

Aspect 3
Author             Journal, Year                    C
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
LAEMMLI UK         NATURE, 1970                     322
ALTSCHUL SF        J MOL BIOL, 1990                 253
BRADFORD MM        ANAL BIOCHEM, 1976               209
SANGER F           P NATL ACAD SCI USA, 1977        140
MILLER JH          EXPT MOL GENETICS, 1972          102
ALTSCHUL SF        NUCLEIC ACIDS RES, 1997          160
THOMPSON JD        NUCLEIC ACIDS RES, 1994          147
CHOMCZYNSKI P      ANAL BIOCHEM, 1987               206
HARLOW E           ANTIBODIES LAB MANUA, 1988       129
BLATTNER FR        SCIENCE, 1997                    56
SCHENA M           SCIENCE, 1995                    40
KYTE J             J MOL BIOL, 1982                 51
MURASHIGE T        PHYSL PLANTARUM, 1962            33
TOWBIN H           P NATL ACAD SCI USA, 1979        86

Aspect 4
Author             Journal, Year                    C
HOGAN B            MANIPULATING MOUSE E, 1994       68
CHOMCZYNSKI P      ANAL BIOCHEM, 1987               206
TALAIRACH J        COPLANAR STEREOTAXIC, 1988       60
PAXINOS G          RAT BRAIN STEREOTAXI, 1986       38
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
NAGY A             P NATL ACAD SCI USA, 1993        39
MANSOUR SL         NATURE, 1988                     37
BRAND AH           DEVELOPMENT, 1993                46
HOGAN B            MANIPULATING MOUSE E, 1986       32
TYBULEWICZ VLJ     CELL, 1991                       46
KWONG KK           P NATL ACAD SCI USA, 1992        24
DUNLAP JC          CELL, 1999                       19
LI E               CELL, 1992                       35
ALTSCHUL SF        J MOL BIOL, 1990                 253
EISEN MB           P NATL ACAD SCI USA, 1998        60

Aspect 5
Author             Journal, Year                    C
KRAULIS PJ         J APPL CRYSTALLOGR, 1991         202
JONES TA           ACTA CRYSTALLOGR A, 1991         174
OTWINOWSKI Z       METHOD ENZYMOL, 1997             140
BRUNGER AT         ACTA CRYSTALLOGR D 5, 1998       118
LASKOWSKI RA       J APPL CRYSTALLOGR, 1993         96
NICHOLLS A         PROTEINS, 1991                   85
NAVAZA J           ACTA CRYSTALLOGR A, 1994         81
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
LAEMMLI UK         NATURE, 1970                     322
MERRITT EA         ACTA CRYSTALLOGR D, 1994         66
BRUNGER AT         NATURE, 1992                     48
BRADFORD MM        ANAL BIOCHEM, 1976               209
MERRITT EA         METHOD ENZYMOL, 1997             41
WUTHRICH K         NMR PROTEINS NUCL AC, 1986       40
KABSCH W           BIOPOLYMERS, 1983                39

Aspect 6
Author             Journal, Year                    C
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
SIKORSKI RS        GENETICS, 1989                   102
DIGNAM JD          NUCLEIC ACIDS RES, 1983          68
LEVINE AJ          CELL, 1997                       57
ELDEIRY WS         CELL, 1993                       54
HARLOW E           ANTIBODIES LAB MANUA, 1988       129
HARPER JW          CELL, 1993                       50
FRIEDBERG EC       DNA REPAIR MUTAGENES, 1995       58
ALTSCHUL SF        J MOL BIOL, 1990                 253
OGRYZKO VV         CELL, 1996                       41
WEINBERG RA        CELL, 1995                       40
KAMEI Y            CELL, 1996                       39
HOLLSTEIN M        SCIENCE, 1991                    41
FIELDS S           NATURE, 1989                     67
YANG XJ            NATURE, 1996                     37

Aspect 7
Author             Journal, Year                    C
DENG HK            NATURE, 1996                     46
DRAGIC T           NATURE, 1996                     45
DORANZ BJ          CELL, 1996                       45
FENG Y             SCIENCE, 1996                    43
ALKHATIB G         SCIENCE, 1996                    43
COCCHI F           SCIENCE, 1995                    41
CHOE H             CELL, 1996                       41
THOMPSON CB        SCIENCE, 1995                    38
ZOU H              CELL, 1997                       40
DARNELL JE         SCIENCE, 1994                    40
MUZIO M            CELL, 1996                       35
LI P               CELL, 1997                       36
XIA ZG             SCIENCE, 1995                    38
BOLDIN MP          CELL, 1996                       34
PEAR WS            P NATL ACAD SCI USA, 1993        57

Aspect 8
Author             Journal, Year                    C
CHOMCZYNSKI P      ANAL BIOCHEM, 1987               206
BRADFORD MM        ANAL BIOCHEM, 1976               209
LAEMMLI UK         NATURE, 1970                     322
LOWRY OH           J BIOL CHEM, 1951                73
ZHANG Y            NATURE, 1994                     31
KUIPER GGJM        P NATL ACAD SCI USA, 1996        27
SAMBROOK J         MOL CLONING LAB MANU, 1989       764
MONCADA S          PHARMACOL REV, 1991              25
PELLEYMOUNTER MA   SCIENCE, 1995                    23
CAMPFIELD LA       SCIENCE, 1995                    23
KUIPER GGJM        ENDOCRINOLOGY, 1997              22
HALAAS JL          SCIENCE, 1995                    21
BLIGH EG           CAN J BIOCH PHYSL, 1959          45
BROWN MS           CELL, 1997                       28
ZHANG SH           SCIENCE, 1992                    18

For each aspect, the top references are shown in order of decreasing probability, according to the model. The
count of each reference in the PNAS collection is shown in the right column (C).
As for words, multinomial distributions are estimated for the
references that are present in our collection. For estimation, we
only need unique indicators for each referenced article. After the
model is fitted, attributes of high-probability references for each
aspect provide additional information about its contextual
interpretation. Table 3 provides attributes of 15 high-probability
references for each aspect that were available in the database
together with PNAS citation counts (number of times cited by
PNAS articles in the database). Notice that, because the model
draws from the contextual decomposition, having a high citation
count is not necessary for having high aspect probability. In
Table 3, high-probability references for aspect 1 are dominated
by publications in Nature; references in aspect 7 are mostly
Nature, Cell, and Science publications from the mid-1990s.
Examining titles of the references (see Table 5, which is
published as supporting information on the PNAS web site,
www.pnas.org), we see that manuals, textbooks, and references
to methodology articles seem to be prominent for many aspects.
Thus, among the first 15 high-probability references, all 15 from
aspect 3 and more than half from aspect 4 are of this
methodological type. In contrast, most high-probability references for
aspect 7 are those that report new findings. Titles of the
references indicate neurobiology content for aspect 1, molecular
evolution for aspect 2, and plant molecular biology for aspect 3,
which is in agreement with our conclusions based on
high-probability words. For other aspects, titles of high-probability
references help us refine the aspects. Thus, aspect 4 mostly
pertains to the study of brain development, in particular, via
genetic manipulation of mouse embryo. Aspect 5, identified as
biochemistry and molecular biology by the words, can be
described as protein structural biology by the references. Aspect 6
may be labeled in a more detailed way as "DNA repair,
mutagenesis, and cell cycle." The references for aspects 7 and 8
shift their focuses more toward HIV infection and studies of
molecular mechanisms of obesity.

[Figure: histogram panels for Aspect 1 through Aspect 8, shown separately for Evolution and for Genetics; horizontal axes are membership scores with ticks at 0.0, 0.4, and 0.8.]
Fig. 1. Distributions by aspect of the posterior means of membership scores for articles published in evolution and genetics.

Table 4. Mean decompositions of aspect membership scores (Lower), together with a graphical representation of this
table (Upper)
[Upper: graphical representation of the table, with one row per topic from Biochemistry to Pharmacology.]

                         Aspect
Topic                    1       2       3       4       5       6       7       8
Biochemistry             0.0469  0.0347  0.1810  0.0178  0.3838  0.2057  0.0477  0.0823
Medical sciences         0.0244  0.0502  0.0938  0.1274  0.0181  0.1075  0.3286  0.2500
Neurobiology             0.2875  0.0398  0.0722  0.3768  0.0196  0.0296  0.0441  0.1304
Cell biology             0.1691  0.0165  0.1420  0.0684  0.1097  0.2423  0.1637  0.0884
Genetics                 0.0141  0.3056  0.1422  0.1532  0.0487  0.2621  0.0395  0.0347
Immunology               0.0127  0.0593  0.1003  0.0413  0.0422  0.0915  0.6244  0.0283
Biophysics               0.0507  0.0295  0.2398  0.0162  0.5496  0.0542  0.0176  0.0423
Evolution                0.0042  0.7679  0.0465  0.0913  0.0289  0.0378  0.0101  0.0133
Microbiology             0.0158  0.1725  0.3431  0.0335  0.0647  0.1174  0.1870  0.0661
Plant biology            0.1333  0.0983  0.4400  0.0360  0.0462  0.0954  0.0166  0.1344
Developmental biology    0.0475  0.0288  0.1071  0.3729  0.0274  0.2558  0.0974  0.0631
Physiology               0.3179  0.0275  0.0712  0.1123  0.0258  0.0116  0.0595  0.3743
Pharmacology             0.2883  0.0161  0.0772  0.1965  0.0299  0.0349  0.0537  0.3033

For clarity, the six lowest-frequency topics, which make up 3.4% of the biological sciences articles, are not shown.
Among frequent references for the eight aspects, there are
seven PNAS articles that share a special feature: they were all
either coauthored or contributed by a distinguished member of
the National Academy of Sciences. In fact, one article was
coauthored by a Nobel prize winner, and two were contributed
by other Nobelists. Although these articles do not have the
highest counts in the database, they are notable for various
reasons; e.g., one is on clustering and gene expression (2), and
it is also one of the two highly cited PNAS articles on clustering
that we mentioned in the Introduction. These seven articles may
not necessarily be off-beat, but they may be among those that
fulfill MacLane's petition regarding the special nature of PNAS.
From our analysis of high-probability words, it is difficult to
determine whether the majority of aspects correspond to a single
topic from the official classifications in PNAS biological science
publications. To investigate whether there is a correspondence
between the estimated aspects and the given topics, we examine
aspect loadings (means of posterior membership scores) for each
article. Given estimated parameters of the model, the distribu-
tion of each article's loadings can be obtained by means of Bayes'
theorem. The variational and expectation-propagation procedures
provide Dirichlet approximations to the posterior distribution
$p(\lambda \mid d, \theta)$ for each document d. We use the mean of this
Dirichlet as an estimate of the weight of the document on each
aspect. Histograms of these loadings are provided in Fig. 1 for
articles in evolution and genetics. Relatively high histogram bars
near zero correspond to the majority of articles having small
posterior membership scores for the given aspect. Among the
articles published in genetics, some can be considered as full
members in aspects 2, 3, 4, and 6, but many have mixed
membership in these and other aspects. Articles published in
evolution, on the other hand, show a somewhat different behav-
ior: the majority of these articles comes fully from aspect 2.
The sparsity of the loadings can be gauged also by the
parameters of the Dirichlet distribution, which are estimated as
$\alpha_1 = 0.0195$, $\alpha_2 = 0.0203$, $\alpha_3 = 0.0569$, $\alpha_4 = 0.0346$, $\alpha_5 = 0.0317$,
$\alpha_6 = 0.0363$, $\alpha_7 = 0.0411$, and $\alpha_8 = 0.0255$. The estimated
Dirichlet, which is the generative distribution of membership
scores, is "bathtub-shaped" on the simplex; as a result, articles
tend to have relatively high membership scores in only a few
aspects.
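The effect of such small Dirichlet parameters can be checked by simulation. The sketch below draws membership vectors using the eight estimated values quoted above and measures how often a single aspect accounts for more than half of a vector. The function names and the 0.5 threshold are illustrative assumptions, not part of the original analysis.

```python
import random

# Estimated Dirichlet parameters reported in the text (aspects 1 to 8).
ALPHA = [0.0195, 0.0203, 0.0569, 0.0346, 0.0317, 0.0363, 0.0411, 0.0255]


def sample_membership(alpha, rng):
    """Draw one membership vector from Dirichlet(alpha) via gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # redraw in the unlikely case that every draw underflows
            return [g / total for g in draws]


def share_dominated(alpha, threshold=0.5, n=2000, seed=0):
    """Fraction of sampled vectors whose largest score exceeds the threshold."""
    rng = random.Random(seed)
    hits = sum(max(sample_membership(alpha, rng)) > threshold for _ in range(n))
    return hits / n


if __name__ == "__main__":
    print(share_dominated(ALPHA))
```

Because every parameter is far below 1, most sampled vectors put the bulk of their mass on one aspect, which is the sparsity the "bathtub" description refers to.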
To summarize the aspect distributions for each topic, we
provide mean loadings and the graphical representation of these
values in Table 4 Upper. Larger values correspond to darker
colors, and the values below some threshold are not shown
(white) for clarity. As an example, the mean loading of 0.2883 for
pharmacology in the first aspect is the average of the posterior
means of the membership scores for this aspect over all phar-
macology publications in the database. Note that this percentage
is based on the assumption of mixed membership and can be
interpreted as indicating that 29% of the words in pharmacology
articles originate from aspect 1, according to our model.
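The computation behind a mean loading can be sketched as two steps: take the mean of each article's approximating Dirichlet, then average those means within a topic. The names `posterior_mean` and `mean_loadings_by_topic` and the example parameters are hypothetical; in practice the Dirichlet parameters would come from the fitted variational or expectation-propagation approximation.

```python
def posterior_mean(gamma):
    """Mean of a Dirichlet with parameters gamma: gamma_k / sum(gamma)."""
    total = sum(gamma)
    return [g / total for g in gamma]


def mean_loadings_by_topic(articles):
    """Average posterior-mean loadings over the articles of each topic.

    `articles` is an iterable of (topic, gamma) pairs, where gamma holds the
    parameters of the Dirichlet approximation for one article.
    """
    sums, counts = {}, {}
    for topic, gamma in articles:
        loading = posterior_mean(gamma)
        if topic not in sums:
            sums[topic] = [0.0] * len(loading)
            counts[topic] = 0
        sums[topic] = [s + l for s, l in zip(sums[topic], loading)]
        counts[topic] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


if __name__ == "__main__":
    # Two made-up articles with two aspects each.
    toy = [("pharmacology", [3.0, 1.0]), ("pharmacology", [1.0, 1.0])]
    print(mean_loadings_by_topic(toy))
```

Each row of Table 4 is a vector of this kind, so its entries sum to 1 up to rounding.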
Examining the rows of Table 4, we see that most subtopics in
biological sciences have major components from more than one
aspect (extreme or basis category). Examining the columns, we
can gain additional insights in interpretation of the extreme
categories. Aspect 8, for example, is the aspect of origin for a
combined 37% of physiology, 30% of pharmacology, and 25% of
medical sciences articles, according to the mixed-membership
model. The most prominent subtopic is evolution; it has the
greatest influence in defining an extremal category, aspect 2.
This is consistent with a special place that evolution holds among
the biological sciences by standing apart both conceptually and
methodologically.
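The row and column readings of Table 4 can be reproduced directly from its values. The sketch below copies the 13 rows of the table and reports, for any topic, its dominant aspect and, for any aspect, the topic that loads most heavily on it. The helper names are hypothetical.

```python
# Mean loadings from Table 4 (aspects 1 to 8, in order).
TABLE_4 = {
    "Biochemistry":          [0.0469, 0.0347, 0.1810, 0.0178, 0.3838, 0.2057, 0.0477, 0.0823],
    "Medical sciences":      [0.0244, 0.0502, 0.0938, 0.1274, 0.0181, 0.1075, 0.3286, 0.2500],
    "Neurobiology":          [0.2875, 0.0398, 0.0722, 0.3768, 0.0196, 0.0296, 0.0441, 0.1304],
    "Cell biology":          [0.1691, 0.0165, 0.1420, 0.0684, 0.1097, 0.2423, 0.1637, 0.0884],
    "Genetics":              [0.0141, 0.3056, 0.1422, 0.1532, 0.0487, 0.2621, 0.0395, 0.0347],
    "Immunology":            [0.0127, 0.0593, 0.1003, 0.0413, 0.0422, 0.0915, 0.6244, 0.0283],
    "Biophysics":            [0.0507, 0.0295, 0.2398, 0.0162, 0.5496, 0.0542, 0.0176, 0.0423],
    "Evolution":             [0.0042, 0.7679, 0.0465, 0.0913, 0.0289, 0.0378, 0.0101, 0.0133],
    "Microbiology":          [0.0158, 0.1725, 0.3431, 0.0335, 0.0647, 0.1174, 0.1870, 0.0661],
    "Plant biology":         [0.1333, 0.0983, 0.4400, 0.0360, 0.0462, 0.0954, 0.0166, 0.1344],
    "Developmental biology": [0.0475, 0.0288, 0.1071, 0.3729, 0.0274, 0.2558, 0.0974, 0.0631],
    "Physiology":            [0.3179, 0.0275, 0.0712, 0.1123, 0.0258, 0.0116, 0.0595, 0.3743],
    "Pharmacology":          [0.2883, 0.0161, 0.0772, 0.1965, 0.0299, 0.0349, 0.0537, 0.3033],
}


def dominant_aspect(topic):
    """Return the 1-based aspect with the largest mean loading for a topic."""
    row = TABLE_4[topic]
    return max(range(len(row)), key=row.__getitem__) + 1


def top_topic(aspect):
    """Return the topic with the largest mean loading on a 1-based aspect."""
    return max(TABLE_4, key=lambda t: TABLE_4[t][aspect - 1])


if __name__ == "__main__":
    for name in TABLE_4:
        print(name, dominant_aspect(name))
```

Reading the table this way makes the contrast visible: evolution concentrates on aspect 2, whereas a topic such as cell biology has no aspect above one quarter.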
1. MacLane, S. (1997) Proc. Natl. Acad. Sci. USA 94, 5983-5985.
2. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl.
Acad. Sci. USA 95,14863-14868.
3. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E.,
Lander E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. USA 96, 2907-2912.
4. Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K.,
Zhivotovsky, L. A. & Feldman, M. W. (2002) Science 298, 2381-2385.
5. Woodbury, M. A., Clive, J. & Garson, A. (1978) Comput. Biomed. Res. 11, 277-298.
6. Erosheva, E. A. (2002) Ph.D. thesis (Carnegie Mellon University, Pittsburgh).
7. Griffiths, T. L. & Steyvers, M. (2004) Proc. Natl. Acad. Sci. USA 101,
5228-5235.
8. Manton, K. G., Woodbury, M. A. & Tolley, H. D. (1994) Statistical Applications
Using Fuzzy Sets (Wiley Interscience, New York), p. 312.
9. Potthoff, R. G., Manton, K. G., Woodbury, M. A. & Tolley, H. D. (2000) J.
Classification 17, 315-353.
10. Pritchard, J. K., Stephens, M. & Donnelly, P. (2000) Genetics 155, 945-959.
Finally, we compare the loadings (posterior means of the
membership scores) of dual-classified articles to those that are
singly classified. We consider two articles similar if their
loadings agree in the first significant digit for every aspect.
One might interpret singly classified articles that are similar to
dual-classified articles as ones that should have had dual classification
but did not. We find that, for 11% of the singly classified articles,
there is at least one similar dual-classified article. For example,
three biophysics dual-classified articles with loadings 0.9 for the
second and 0.1 for the third aspect turned out to be similar to 86
singly classified articles from biophysics, biochemistry, cell bi-
ology, developmental biology, evolution, genetics, immunology,
medical sciences, and microbiology.
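The similarity comparison in this paragraph can be sketched as follows. This is an illustrative reading rather than the authors' procedure: it assumes that agreement in the "first significant digit" can be approximated by rounding each loading to one decimal place, and the names `signature` and `similar_singles`, the article identifiers, and the toy loadings are hypothetical.

```python
def signature(loadings, digits=1):
    """Round every loading to `digits` decimals (assumed reading of
    'equal for the first significant digit')."""
    return tuple(round(score, digits) for score in loadings)


def similar_singles(single, dual, digits=1):
    """Return ids of singly classified articles whose rounded loadings
    match those of at least one dual-classified article.

    `single` and `dual` map article ids to lists of loadings.
    """
    dual_signatures = {signature(l, digits) for l in dual.values()}
    return [
        article_id
        for article_id, loadings in single.items()
        if signature(loadings, digits) in dual_signatures
    ]


# Toy data (invented): one dual-classified article with loadings
# 0.9 and 0.1 on the first two of three aspects.
dual = {"d1": [0.9, 0.1, 0.0]}
single = {
    "s1": [0.88, 0.12, 0.0],
    "s2": [0.5, 0.5, 0.0],
    "s3": [0.93, 0.07, 0.0],
}
matches = similar_singles(single, dual)
share = len(matches) / len(single)
print(matches, share)
```

Hashing the rounded loadings into a set keeps the comparison linear in the number of articles instead of comparing every singly classified article against every dual-classified one.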
Concluding Remarks
We have presented results from fitting a mixed-membership
model to PNAS biological sciences publications, from 1997 to
2001, providing an implicit semantic decomposition of words and
references in the articles. The model allows us to identify
extreme internal categories of publications and to provide soft
classifications of articles into these categories. Our results show
that the traditional discipline classifications correspond to a
mixed distribution over the internal categories. Our analyses and
modeling were intended to capture a high-level description of a
subset of PNAS articles.
In an often-quoted statement, Box remarked: "all models are
wrong" (17). In our case, the assumption of a bag of words and
references in the mixed-membership model clearly oversimplifies
reality; the model does not account for the general structure
of the language, nor does it capture the compositional structure
of bibliographies. Many interesting extensions of the basic model
we have explored are possible, from hierarchical models of topics
to more detailed models of citations and dynamic models of the
evolution of scientific fields over time. Nevertheless, as Box
notes, even wrong models may be useful. Our results indicate
that mixed-membership models can be useful for analyzing the
implicit structure of scientific publications.
We thank Dr. Anna Lokshin (University of Pittsburgh, Pittsburgh) for
help with interpreting model results from a biologist's perspective. E.E.
was supported by National Institutes of Health Grants 1 R01
AG023141-01 and R01 CA94212-01; S.F. was supported by National
Institutes of Health Grant 1 R01 AG023141-01. J.L. was supported by
National Science Foundation Grant CCR-0122581 and Advanced Re-
search and Development Activity Contract MDA904-00-C-2106.
11. Hofmann, T. (2001) Machine Learn. 42, 177-196.
12. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3,
993-1022.
13. Minka, T. P. & Lafferty, J. (2002) Uncertainty in Artificial Intelligence:
Proceedings of the Eighteenth Conference (UAI-2002) (Morgan Kaufmann, San
Francisco), pp. 352-359.
14. Cohn, D. & Hofmann, T. (2001) Neural Information Processing Systems 13 (MIT
Press, Cambridge, MA).
15. Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M. & Jordan, M. I.
(2003) J. Machine Learn. Res. 3, 1107-1135.
16. Blei, D. M., Jordan, M. I. & Ng, A. Y. (2003) in Bayesian Statistics 7: Proceedings
of the Seventh Valencia International Meeting, eds. Bernardo, J. M., Bayarri,
M. J., Dawid, A. P., Berger, J. O., Heckerman, D., Smith, A. F. M. & West, M.
(Oxford Univ. Press, Oxford), pp. 25-44.
17. Box, G. E. P. (1979) in Robustness in Statistics, eds. Launer, R. L. & Wilkinson,
G. G. (Academic, New York), p. 202.
PNAS | April 6, 2004 | vol. 101 | suppl. 1 | 5227