| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 10
Colloquium
Mapping knowlecige clomains: Characterizing PNAS
Kevin W. Boyack*
Computation, Computers, Information and Mathematics Center, Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185
A review of data mining and analysis techniques that can be used
for the mapping of knowledge domains is given. Literature map-
ping techniques can be based on authors, documents, journals,
words, and/or indicators. Most mapping questions are related to
research assessment or to the structure and dynamics of disciplines
or networks. Several mapping techniques are demonstrated on a
data set comprising 20 years of papers published in PNAS. Data
from a variety of sources are merged to provide unique indicators
of the domain bounded by PNAS. By using funding source infor-
mation and citation counts, it is shown that, on an aggregate basis,
papers funded jointly by the U.S. Public Health Service (which
includes the National Institutes of Health) and non-U.S. govern-
ment sources outperform papers funded by other sources, includ-
ing by the U.S. Public Health Service alone. Grant data from the
National Institute on Aging show that, on average, papers from
large grants are cited more than those from small grants, with
performance increasing with grant amount. A map of the highest
performing papers over the 20-year period was generated by using
citation analysis. Changes and trends in the subjects of highest
impact within the PNAS domain are described. Interactions be-
tween topics over the most recent 5-year period are also detailed.
C cientists have always had the desire to do research of high
~impact. Part of this desire has been for so-called selfish
reasons such as to obtain tenure, increase one's salary, or to
enhance one's reputation. However, altruistic purposes also play
a large role. We desire to make a difference, to advance
knowledge for the benefit of our employers, our nations, or all
mankind.
This raises questions that all scientists face and that collec-
tively give rise to innovation and the advancement of science and
technology: "What should I work on?" "Are my ideas any good,
are they novel, or have they already been taken?" "What can I
learn from others?" "How can I improve on their work?" "Who
should I work with?" and "Who will fund this?"
Such questions accrue on an institutional level as well. Orga-
nizations that answer well are rewarded. Universities develop
reputations that drive research agendas and secure large
amounts of funding over many years. Successful companies drive
markets and consumer preference, maintaining their profitabil-
ity. Success often reflects an ability to stay on the leading edges
of science and technology curves.
In today's world, we have unparalleled access to information,
which should enable us to answer questions of a strategic nature
more readily than in the past. However, with this increased
information has come dilution. Fortunately, tools are now
becoming available that allow us to sift, condense, and associate
this information in ways that help us answer our questions.
This paper will start with a review of data mining and analysis
techniques for the mapping of literatures, including their best
uses and the types of questions that can be answered. Subsequent
sections will use some of these techniques to provide an indi-
cator-based characterization of the domain comprised by PNAS.
Specifically, multiple data sources are combined to give a unique
look at input-output (funding-impact) and import-export (dif-
fusion between disciplines) from the perspective of this multi-
5192-5199 1 PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1
disciplinary, but biomedically dominated journal. A map of the
highest impact research in PNAS is also introduced.
Techniques for Mapping Knowledge Domains
Mapping of scientific literature as a field has been in existence
for many decades. We are indebted to Eugene Garfield, Derek
de Solla Price, and others who, through their desire to under-
stand the structure and flow of scientific advancement (1-5),
started the work that has made the indexing and dissemination
of bibliographic information a commodity. Electronic sources
such as the Science Citation Index Expanded (SCIE), INSPEC,
and Medline contain entries for millions of scientific articles
providing us with information to help answer our questions.
Historically, answers have not come without great effort.
Given the lack of computing resources, early studies naturally
tended to focus on small subsets and were, with some exceptions,
academic in nature. With the recent availability of electronic
data, exponentially increasing computing power, advanced al-
gorithms, and visualization techniques, we are now at a point
where much less effort is required to get answers. Indeed, we can
almost routinely do large scale studies aimed at answering
significant questions of a strategic nature (6~.
Notable among recent advances is the development of the field
of information visualization. The past decade has seen rapid
growth in this field, and the application of many new techniques
to the visualization of literature, patents, genomes (cf. ref. 7),
and other information types (8, 9~. However, it must be remem-
bered that whereas visualization can be critical to understanding,
it is simply a window into the rigorous, often multidimensional,
analyses that have formed the basis of informatics for many
years. Thus, mapping, as a term, does not merely refer to the
visualization piece, but to the underlying data mining and
analysis techniques as well.
Mapping knowledge domains, then, takes as its input such
seemingly diverse subjects as network analysis (e.g., web, social
networks, scale-free networks, and metabolic pathways), linguis-
tics, concept or topic extraction, citation analysis, and science
and technology indicators, in addition to visualization tech-
niques. Similarly, knowledge domain can be more broadly defined
than the narrow "technical field" that is commonly associated
with the term. Genomes, communities, and networks are all
domains with multiple attributes from which one can derive
different types of knowledge. Although this paper focuses on
mapping of literatures, many of the same analysis and visual-
ization techniques have been and can be applied to other
domains.
The main purpose of mapping knowledge domains is to give
us knowledge, or answers to our questions. Mapping is useful for
This paper results from the Arthur M. Sackier Colloquium of the National Acacdemy of
Sciences, "Mapping Knowiec~ge Domains," heic] May 9-11, 2003, at the Arnoic! and Mabel
Beckman Center of the National Acacdemies of Sciences anc] Engineering in Irvine, CA.
Abbreviations: SClE, Science Citation Inclex Expanc~ecl; iSI, institute for Scientific Informa-
tion; AENR, articles, letters, notes, anc' reviews; PI, principal investigator.
*E-maii: kboyack@?sanclia.gov.
C) 2004 by The National Acaclemy of Sciences of the USA
www.pnas.org/cgi/cloi/10. ~ 073/pnas.0307509100
OCR for page 11
Table 1. Summary of commonly utilized literature mapping techniques and their uses
Questions related to
Unit of
analysis Fields and paradigms
Communities and networks
Research performance or
competitive advantage
Commonly used algorithms
Authors
Documents Field structure, dynamics,
paradigm development
Journals
Words
Indicators and
metrics
Science structure, dynamics,
classification, diffusion
between fields
Social structure, intellectual
structure, some dynamics
Cognitive structure, dynamics
Use network characteristics as
indicators
Use field mapping with indicators
Comparisons of fields, institutions,
countries, etc., input-output
Social network packages,
multidimensional scaling,
factor analysis,
Pathfinder networks
Cocitation, co-term, vector
space, latent semantic
analysis, principle
components analysis,
various clustering
methods
Cocitation, intercitation
Vector space, latent
semantic analysis, latent
dirichlet allocation
Counts, correlations
the subject matter expert and nonexpert alike. For the nonex-
pert, mapping provides an entry point into a domain, a means of
gaining knowledge on both the macro and micro levels. For the
expert, mapping provides validation of perceptions and a means
to quickly investigate trends and new information. Yet, even the
expert can be surprised by developments on the periphery of his
perception. Mapping and interactive exploration provide context
for such surprises.
Commonly utilized techniques for mapping literatures are
shown in Table 1 with their primary uses. Most questions of
interest fall into three categories: fields and paradigms, commu-
nities or networks, and assessment of performance or opportu-
nity. Coauthorship analysis is very similar to social network
analysis. Yet, whereas social network analysis is concerned with
global properties of large author databases (10), coauthorship
studies aim to answer specific questions about collaboration
groups (11~. Author cocitation analysis is particularly suited to
investigation of intellectual structure and history, and is often
used with factor analysis and multidimensional scaling (12~.
Pathfinder network scaling is particularly effective at preparing
these data for layout in a visualization program (9~.
Documents are the most often used unit of analysis because
they can be used to map a particular scientific or technical field
and its development. Cocitation and co-word are the two most
common types of document analysis, and often lead to different
groupings of documents. At the finest levels, cocitation tech-
niques cluster documents by scientific paradigm, or by the same
research question and hypotheses (9), whereas co-word docu-
ment clusters are more topical in nature. Alternatives to the
co-word method for generating document similarities include
Salton's vector space model (13) and latent semantic analysis (14,
15~. Journals are used less often, and are used for larger scale
studies, such as to view the relationships between different fields
(16~. They are also suitable for the study of diffusion between
disciplines (often called import-export) by using intercitation
rates (17~.
Mapping of words or indexing terms as networks reveals the
cognitive structure of a field (18~. There is some debate as to
whether co-word analyses should be used for studies of science
dynamics (19~. The most reliable approaches aim to combine
co-word techniques with citation analyses (20~. More advanced
Boyack
techniques using sophisticated algorithms to group and relate
topics show great promise for dynamic studies (21, 224.
Similar visualization methods are applied to the mapping
types mentioned above for the simple reason that authors,
documents, journals, and words (or groupings of these) all work
equally well as the mapping unit. Common visualizations include
traditional scatterplots and link-node diagrams, such as those
drawn by the PAJEK program (23~. Newer, more powerful
visualizations include self-organized maps (24), landscapes (25,
26), timelines and crossmaps (27), and 3D displays (9~. The best
of these have the capability of allowing the user to navigate the
information space and get detail on demand, which facilitates
analysis that helps the user to answer questions.
The power of visualization is enhanced when mapping types
are combined. Combining types adds more dimensions to the
information, which are more easily explored by using visualiza-
tion than with traditional analysis methods. For example, Chen
(9, 28) combines indicators (citation counts by year) with
document cocitation analysis in a 3D display to show the growth
of scientific paradigms.
Indicators have been used for as long as people have wanted
to compare things. Science and technology indicators were
largely developed from the 1950s through the 1970s (29) by the
Organization for Economic Cooperation and Development and
the National Science Foundation, and have resulted in publica-
tions such as National Science Foundation's biannual Science
and Engineering Indicators (30~. Although activity measures
(31), and specifically economic activity measures, have been the
dominant component of such reports, scientific output measures
such as counts of papers, patents, and citations have also played
a large role. Measures of converging partial indicators have been
used with the aim of identifying areas of science and technology
likely to yield the greatest benefits (32, 33~. Output measures
have been correlated to economic activity at a macro level to
show the relative strengths of countries, states, and/or technical
fields (30~. Several studies have reported correlation between
aggregated scientific outputs and funding (34-39), but none have
reported any such correlations at the individual grant level.
Characterization of PNAS
Data Sources. Data from four sources (see Fig. 1) were merged to
provide the basis for a characterization of PNAS. Most studies
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. ~ 1 5193
OCR for page 12
NIA GRANTS ISI/SCIE
MEDLINE PNASTOC 250
VOL ~ ~ VOL ~ ~ SOL 200
PAGE ~ ~ PAGE + ~ PAGE CQ
YEAR ~ ~ YEAR .°
PI ill ~ AUTHOR ~ 150
INST 4-- ~ INSf <-,
0 1
BRAIDING REFERENCES MeSht TOPICS ~ 100
-Amounts C=~3N -~ndf~g type ~
-D£~ratfor~s cuff Descriptors Z 50
Fig. 1. Data sources, field joins (arrows), and unique properties from each
source (italics).
merging databases do so to provide deeper coverage of a field
(40, 41~. However, this study merges multiple data sources to get
more detailed information on a single journal and its impact. The
base set to which other sources were merged was data from the
SCIE. These data consist of 47,073 records covering the 20 years
of PNAS from 1982 to 2001, including full reference lists and
citation counts to each paper as of December 31, 2002. Citation
counts were determined by matching of Institute for Scientific
Information (ISI) reference lists (journal name variations were
accounted for) with bibliographic data.l For this analysis, only
the 45,326 articles, letters, notes, and reviews (commonly re-
ferred to as ALNR) were considered. The balance of the records,
from editorials, corrections, book reviews, etc., contribute little
or no original research, and are commonly discounted in such
analyses.
PNAS records were also extracted from Medline, and were
joined to the SCIE records primarily for use of the MeSH (medical
subject heading) terms. MeSH terms are desirable for several
reasons: (i) SCIE keywords are sparse, uncontrolled, and available
only back to 1991; (ii) MeSH is a rich, controlled vocabulary added
by human indexers; and (iii) MeSH contains specific funding-
related terms. Joining MeSH terms to the ISI citation counts
enables input-output studies with respect to funding type.
PNAS has a topic structure that is clearly visible in both the
print and web versions of the journal Tables of Contents. First-
level topics are broad: Biological Sciences, Physical Sciences, and
Social Sciences. Within each of these first-level topics are
secondary topics, such as Biochemistry, Biophysics, and Cell
Biology within the Biological Sciences topic. First- and second-
level topics for each paper were extracted from the Tables of
Contents and added to the SCIE data. Joining of topics to the
other data enables import-export studies as well as the corre-
lation between impact and topic.
Finally, grant data from the National Institute on Aging (one
of the institutes of the National Institutes of Health) containing
principal investigator (PI) names, institutions, and funding
amounts by year were joined to the other data. These data were
obtained from the National Institute on Aging as part of a
previous study (39~. An effort was made to match grants to
PNAS papers that were likely to have resulted from specific
grants. For a paper to be linked to a specific grant the following
conditions were required (also see Fig. 14:
PNAS author = Grant PI (last name + first initial)
and PNAS author institution = Grant PI institution
and PNAS publication year—Grant initial year
and (PNAS publication year ' Grant initial year + 5
or PNAS publication year ' Grant final year + 2)
These data are extracted from Science Citation Index Expanded [Institute for Scientific
Information, Inc. (ISI), Philadelphia, PA; Copyright ISI]. All rights reserved. No portion of
these data may be reproduced or transmitted in any form or by any means without the
prior written permission of ISI.
5194 1 wWW.pnaS.Or9/C9i/doi/l 0.1 073/pnas.03075091 00
~ O O O ° ~ ~
1982 1987 1992 1997 2002
Publication Year
Fig. 2. Mean number of citations ( - ) to PNAS ALNR are compared with
several different percentiles: 90th ( O ), 75th (a), 50th or median (a), and 25th
(O). Citation counts are as of December 31, 2002.
A total of 1,862 PNAS papers were found to be probable
matches to specific grants. Although we cannot say with certainty
that these papers are from National Institute on Aging-funded
studies, they were authored by National Institute on Aging-
funded PIs and were written at a time consistent with their
National Institute on Aging funding. Joining of grant data to the
balance of the data enables correlation of impact to funding
amount, something that has to date been very difficult to
quantify.
In this study, impact is equated with a ranking measure derived
from citation counts. Papers were ranked by citation count for
each publication year. Absolute rankings were then converted to
percentile rankings. Percentile rankings are used for two rea-
sons. First, it provides normalization across time such that papers
from different years can be directly compared. This result is
particularly important for recent papers, because they have
typically not had enough time after publication to accumulate
large numbers of citations. Second, given the skewed nature of
citation count distributions, it keeps a few highly cited papers
from dominating citation statistics. For example, mean citation
counts for the PNAS papers range between the 64th and 70th
percentile from 1982 to 1999. Related data are shown in Fig. 2.
Whereas there are certainly factors other than citation mea-
sures in what constitutes a full definition of impact, and while the
validity of using citation measures has been debated (cf. refs. 42
and 43), they are widely used (44), and will be the basis for
impact in this study.
Impact and Funding. Medline MeSH terms contain three main
funding source designators: Support, U.S. Gov't, P.H.S., Support,
U.S. Gov't, Non-P.H.S., and Support, Non-U.S. Gov't. The first
two designators refer to publications funded by the U.S. Public
Health System (P.H.S.) and all other U.S. government agencies
(OG), respectively. In a practical sense, P.H.S. refers to the
National Institutes of Health. Support, Non-U.S. Gov't (nG)
could refer to either U.S. nongovernmental sources (e.g., indus-
try, nonprofit) or to foreign sources, but has not been segmented
further. Papers with no funding source designators are tagged as
Unknown. Very few papers in this category exclude a funding
acknowledgment inadvertently (45~. Thus, Unknown can be
considered as a distinct category.
Given that each paper is tagged with anywhere from none to
all three of the funding source designators, eight unique funding
categories can be constructed. Two of the smaller categories,
PHS+OG and PHS+OG+nG, have been combined to make a
category of sufficient size for statistical purposes. Thus, seven
soyack
OCR for page 13
14000
12000
~ 10000
¢ 800G
o
6000
4000
2000
o
60
100
90
50 ,~ 80
45 A, o~ 60
40 :S 0 40
.~ 30
° 20
10
. .
55
35
Fig. 3. Numbers of papers (ALNR) and impact (mean citation percentile) for
seven funding categories. Categories are shown in order of decreasing mean
percentile. Bars indicate the number of papers (Left); circles and standard
error bars indicate impact (Right). PHS, U.S. Public Health System; OG, other
U.S. government; nG, non-U.S. government (includes foreign).
funding categories are shown in Fig. 3 along with their numbers
of papers (ALNR) and mean percentiles. The highest ranked
category, with a mean percentile >55, is papers jointly funded by
the U.S. Public Health System and non-U.S. government
sources. By contrast, papers funded solely by the U.S. Public
Health System have a mean percentile of 49.2. Yet, this is still
higher than the mean percentile of 44.4 associated with papers
of Unknown funding source, indicating that PHS funding has a
positive impact with respect to a lack of U.S. Public Health
System funding. The differences between impacts of these three
categories are statistically significant at the P < 0.001 level by
using a Scheffe test (ref. 46 and Table 24.
Other studies have shown that the mean impact of a group of
papers increases with the number of authors, presumably due to
multidisciplinarity (36~. In general, the number of authors
increases with the increasing percentile in Fig. 3. However, there
are local differences that cannot be explained by number
of authors. For example, for categories 1 and 2 (4.82 and 5.04
authors, respectively), and categories 4 and 5 (3.99 and 4.11
authors, respectively), the mean number of authors is anti-
correlated with mean percentile.
Fig. 3 shows only mean percentiles for the entire 20-year
period of study. Mean percentiles by year are relatively stable for
the larger funding categories. Smaller categories showed much
more scatter by year.
Does Grant Size Matter? As previously mentioned, the correlation
between impact and the amount of funding has historically been
difficult to quantify. This correlation is largely due to the
difficulty of accurately linking funding information with the
publications resulting from those funds. Agencies and institu-
Table 2. Scheffe test results for comparisons between percentile
means of different funding categories (from Fig. 3)
Category 2 3 4 5 6
1 P< .001 PI .001 P< .001 PI .001 P< .001 P< .001
2 NS NS P< .001 P< .001 P< .001
3 NS P< .001 P< .001 P< .001
4 P ~ .001 P < .001 P < .001
5 NS P< .001
6 P=.085
NS, no significant difference between means.
Boyack
V
1,000 10,000
Annual grant amount (constant 1996 $ thousands)
Fig. 4. Correlation between impact (citation percentile) and grant amount.
Individual grant-paper pairs (small circles) and mean percentileswith standard
errors (large circles) are shown for the five grant size regions that are num-
bered l-V.
tions, although they track many things, are uniformly poor at
keeping track of input-output linkages.
A total of 1,862 PNAS papers were identified as likely having
resulted from National Institute on Aging funding. We assume
this to be a small fraction of the total number of National
Institute on Aging-funded papers, although the exact fraction is
not known. Yet, the number deduced here is consistent with the
relative sizes of the National Institute on Aging and the National
Institutes of Healthy Many of these papers can be matched to
multiple grants, and conversely, many of the grants seem to have
given rise to multiple papers. For these data, we have identified
3,059 grant-paper pairs. This finding corresponds well to what we
know to be true in research; in many cases, institutions receive
multiple grants in complementary areas, and certainly the work
from a single grant can spawn more than one publication.
Multiple linkages between papers and grants indicate a concen-
tration of activity at an institution. The more money received by
a particular PI from a focused organization such as the National
Institute on Aging, and the more that PI publishes, the more
likely it is that the funds and publications are truly linked.
Fig. 4 shows the correlation between citation percentile and
average annual grant amount for the 3,059 grant-paper pairs.
Dollar amounts were normalized by GDP deflators to remove
inflation biases (30~. Annual grant amounts were averaged over
the publication year of paper and the three previous years. Five
different grant amount ranges were identified: <$31,600,
$31,600 to $100,000, $100,000 to $316,000, $316,000 to
$1,000,000, and >$1,000,000. Mean citation percentiles and
grant amounts were calculated for the grant-paper pairs in each
of the five grant ranges. The mean citation percentiles remain
constant at 56-57 through the first three ranges (up to $316k),
then increase to 62 and 65.6 for ranges IV and V.
The number of authors was also considered here as a poten-
tially confounding variable. Cumulative probability density func-
tions of numbers of authors per paper are nearly identical for
funding ranges III-V. Thus, number of authors has little impact
on the mean percentiles in these funding ranges.
Several observations can be drawn from these data. First,
papers from large grants tend to outperform (in terms of mean
citation percentiles) those from smaller grants, with the average
$Nationai Institute on Aging funding is ~6% of the National Institutes of Health total
annually. The 1,862 National Institute on Aging papers are 7.4% of the total National
Institutes of Health papers in the PNAS data set.
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5195
OCR for page 14
Fig. 5. Three time periods in the PNAS high-impact map show the progression from the basic gene and protein work and techniques that dominated the 1 980s
to more diverse applications in the 1 990s. Maps were generated by using VXINSIGHT. Dots indicate individual papers. Wireframe mountains show the density of
papers in clusters. Cluster positions are shown in RightLowerfor comparison with the map panes. Clusters are numbered from oldestto youngest. Shapes indicate
the first third (circles), second third (squares), and last third (triangles) in the timewise progression. Dark shapes indicate the core clusters.
performance increasing with increasing grant amount above
$300,000. Second, even for small grants, papers funded by the
National Institute on Aging tend to outperform the average
PNAS paper; mean percentiles for each grant amount group are
well over 50. Third, a high level of funding does not guarantee
publication of a high impact paper. Fig. 4 shows many highly
funded papers with a low citation percentile. However, the
fraction of papers in the lowest quartile for ranges II-V de-
creases with range (0.199, 0.195, 0.130, and 0.095, respectively),
which is consistent with the general increase in mean percentile.
Fourth, the variance in individual paper impact appears to be
very orthogonal to impact. However, this is to be expected in a
single journal study of a high impact journal. If lower impact
journals were included in the study, the percentile ranking for
most PNAS papers would be shifted much higher.
These observations are specific to National Institute on Aging
funding and PNAS papers, and cannot be directly applied to
other funding sources or journals. Neither can we claim any
direct cause and effect between funding and impact in the results
shown here. However, this work shows a similar qualitative
correlation between government funding and impact to what has
been observed before. Early work by Narin and coworkers (34,
35) showed a positive correlation between National Institutes of
Health funding amounts and biomedical publication counts, but
did not address impact or quality. Lewison and Dawson (36) used
the U.K. Research Outputs Database to show that the mean
impact for groups of papers in gastroenterology increased with
5196 1 www. pnas.org/cg i/doi/ 10.1 073/pnas.0307509 100
increases in the number of authors and the number of funding
sources. They also found that papers acknowledging funding
sources had significantly higher impact than those without such
acknowledgments (37~. Butler (38) found that whereas acknowl-
edgment data on the whole accurately reflected the total re-
search output of a funding body, there was no ability to track
research back to the grant level.
This work goes further than any previous studies by correlating
impact with funding level. However, it is also clear that the data
are not yet sufficient to produce any definitive conclusions.
Government agencies will need to create a clean and maintain-
able database linking grants, supported publications, patents,
and policy changes to enable such analyses (39, 44~. Accurate
data would enable causal mechanisms to be addressed, given the
temporal nature of the grant-research-publication relationship,
and would also allow the overall impact (over all publications) of
individual grants to be calculated. Such data have the potential
to change the way research is funded.
Map of High-lmpact Research. To round out this characterization
of research published in PNAS, a map was generated to provide
information about the subjects of highest impact and related
trends. Mapping of all 45,326 ALNR based on their 1.52 million
references exceeded the resources available on a common desk-
top PC. However, a map based on the top quartile of papers from
each year, those with a citation percentile of 75 or greater (see
Fig. 2), could be easily generated using those same resources.
Boyack
OCR for page 15
Table 3. Diagnostic terms and dominant topics for the 50 largest (of 70) clusters from the PNAS high-impact map
Mean No. of
Cluster Year papers MeSH term 1 MeSH term 2
Dominant PNAS topic
1997-2001, %
4
5
6
7
8
9
10
12
13
14
16
17
18
19
22
23
24
25
26
28
29
31
32
36
38
40
41
42
43
46
47
48
49
50
52
53
57
59
60
61
62
63
64
65
66
67
68
69
70
1987.40
1987.79
1987.82
1988.17
1988.46
1988.80
1988.93
1988.96
1989.25
1989.29
1989.37
1 990.00
1990.74
1990.93
1991.03
1991.26
1991.41
1991.45
1991.99
1992.20
1992.44
1993.35
1993.77
1993.87
1994.58
1994.78
1995.05
1995.10
1995.10
1995.12
1995.32
1995.42
1995.54
1995.64
1995.65
1996.09
1996.10
1996.54
1996.86
1997.00
1997.21
1997.35
1997.45
1997.63
1997.92
1997.93
1997.94
1998.01
1998.31
1999.55
242
483
524
281
339
194
94
348
492
254
162
313
93
127
171
208
144
130
193
172
272
203
117
157
304
200
137
229
90
263
155
234
92
150
302
156
200
176
173
82
227
215
183
286
139
205
120
123
222
162
*Oncogenes
*Genes, structural
Cloning, molecular
Oxidation-reduction
Electrophoresis, polyacrylamide gel
Mutation
Buthionine sulfoximine
Nucleic acid hybridization
Cloning, molecular
Transforming growth factors
DNA restriction enzymes
Chromatography, affinity
. .
Sarcoma viruses, Alan
Neutra I ization tests
ADP-ribosylation factors
*DNA replication
P-glycoprotein
Autoradiography
Chromosome mapping
Receptors, fibroblast growth factor
Gene expression
Electric conductivity
*Nucleic acid conformation
HIV-I reverse transcriptase
Alzheimer's disease/*metabolism
Phosphotyrosine
Phylogeny
Comparative study
Magnetic resonance imaging
Nitric oxide/ *metabolism
Brain-derived neurotrophic factor
*Cell cycle
Photosynthetic Reaction Center, bacterial
*Protein folding
Molecular sequence data
Cytotoxicity, immunologic
Lymphocyte transformation
RNA, messenger/genetics/metabolism
DNA primers
clF-2 kinase
*DNA repair
Protein p53/*metabolism
Sirolimus
*Apoptosis
Ubiquitins/*metabolism
Models, molecular
Neoplasm transplantation
Adenomatous polyposis cold protein
Tumor cells, cultured
Gene expression profiling
DNA restriction enzymes
DNA restriction enzymes
Nucleic acid hybridization
Lipoproteins, LDL/*metabolism
Alzheimer's disease/* pathology
Collagen/metabolism
Bacteriorhodopsins/genetics/*metabol ism
Escherichia cold genetics
Sequence homology, nucleic acid
H-2 Antigens/*genetics
Tumor necrosis factor
Biochemistry (30.8)
Genetics (37.5)
Medical Sciences (36.4)
Biochemistry (33.3)
Medical Sciences (33.3)
Cell Biology (33.3)
Biochemistry (26. 7)
Microbiology (31.3)
Biochemistry (20. 7)
Medical Sciences (50.0)
Biochemistry (25.~)
Biochemistry (34.8)
Genetics (72.7)
Medical Sciences (36.4)
Biochemistry (34.4)
Cell Biology (21.4)
Biochemistry (32.4)
Genetics (16.7)
Biochemistry (33.3)
Biochemistry (34.5)
Neurobiology (62.5)
Biochemistry (25.5)
Biochemistry (39.5)
Neurobiology (35.7)
Medica/ Sciences (22.2)
Evolution (23.5)
Medica/ Sciences (18.0)
Neurobiology (44.9)
Medical Sciences (38.7)
Neurobiology (45.8)
Cell Biology (31.9)
Neurobiology (32.6)
Biophysics (69.4)
Biochemistry (22.8)
Immunology (53.4)
Immunology (33.0)
Biochemistry (19.3)
Biochemistry (22.8)
Immunology (33.3)
Medical Sciences (28.8)
Medical Sciences (24.6)
Cell Biology (27.3)
Cell Biology (24.5)
Cell Biology (29.6)
Biochemistry (49.0)
Medica/ Sciences (25.0)
Genes, APC Medica/ Sciences (22.4)
*Telomere Biochemistry (31 .7)
Oligonucleotide array sequence analysis Genetics (27.1)
HIV-1 /*immunology
Hemochromatosis/genetics/* metabolism
Drug resistance/*genetics
Receptors, opiod/*metabolism
Receptors, calcitriol
Gene library
Synapses/*physiology
*Reverse transcriptase Inhibitors
Amyloid,B protein/*metabolism
Protei n-tyrosi ne ki ease/* mete bol ism
Bone marrow cells
Sequence homology, amino acid
Photic stimulation
co-N-Methylarginine
Nerve tissue proteins/*pharmacology
*Genes, p53
*Bacterial proteins
*Protein conformation
*Genetic vectors
Killer cells, natural/ *immunology
Defensins
Tetracycline/*pharmacology
NF-~ B/*antagonists & inhibitors
Leptin
*Genetics, population
1-phosphatidylinositol 3-Kinase/metabolism
Protooncogene proteins c-bc1-2
Multienzyme complexes/*metabolism
Crystallography, x-Ray
Serine endopeptidases/*metabolism
Italics indicate topics with <30% dominance of a cluster.
This approach has the added benefit of focusing only on those
topics of highest impact over the years. The resulting map
contained 11,565 ALNR. Steps used in creating the map were as
follows: (i) Paper-to-paper similarities were calculated using
bibliographic coupling (47) and direct citations by application of
the formula of Small (48), which includes normalization. Coci-
tation and longitudinal coupling were not considered. 1,744,258
pairs of papers (or 2.61% of the possible pairs) were linked
through bibliographic coupling (i.e., having at least one common
reference). In addition, the 11,565 ALNR had 411,780 refer-
Boyack
ences, of which 24,346 were to other papers within the set. Such
direct citations were given a weight of 5. Groups of papers that
cite similar sets of references are thus positioned together using
this method. (ii) Paper positions were calculated from the
similarities using VXORD, a force-directed placement ordination
routine (49~. Ordination does not assign a cluster number to each
paper, but rather calculates positions for each paper on an x,y
plane. (iii) Papers were assigned to clusters by using the k-means
routine in MATLAB based on their x,y locations from step 2. The
number of clusters was arbitrarily set at 70, and whereas 70 is not
PNAS 1 Aprii 6, 2004 1 vol. 101 1 suppl. ~ 1 5197
OCR for page 16
Table 4. Summary of properties for PNAS topics, 1997-2001
Topic
No. of ALNR Mean percentile Times cited Independence
Medical Sciences(BS) 1,555 60.0 1,614 0.53
Cell Biology (BS) 1,239 57.5 1,206 0.43
Pharmacology (BS) 189 54.3 126 0~33
Plant Biology(BS) 489 53.3 486 0.69
Genetics (BS) 988 51.9 986 0.47
Microbiology (BS) 499 51.7 514 0.50
Neurobiology (BS) 1,358 51.5 1,098 0.72
Physiology (BS) 341 51.2 209 0.41
Immunology (BS) 865 51.0 730 0.67
Biochemistry (BS) 2,586 49.0 2,521 0.64
Developmental Biology(BS) 372 46.6 266 0.46
Applied Biological Sciences (BS) 95 46.5 67 0.15
Biophysics (BS) 640 46.3 798 0.59
Agricultural Sciences (BS) 44 45.3 39 0.64
Computer Sciences (PS) 10 42.5 5 0.00
Evolution (BS) 527 42.1 470 0.61
Chemistry(PS) 253 41.8 208 0~33
Population Biology (BS) 43 39.4 37 0.19
Psychology (SS) 124 33.9 80 0.56
Ecology (BS) 137 33.7 49 0.80
Applied Physical Sciences (PS) 42 33.3 11 0.36
Engineering (PS) 25 31.2 11 0.27
Geophysics (PS) 26 27.5 4 0.50
Anthropology (SS) 83 25.7 74 0.57
Social Sciences (SS) 11 25.1 4 0.75
Geology (PS) 49 24.6 9 0.44
Statistics (PS) 20 22.5 15 0.20
Physics (PS) 46 22.3 21 0.43
Applied Mathematics (PS) 54 16.4 22 0.50
Astronomy (PS) 14 11.2 3 1.00
Mathematics (PS) 42 7.0 5 1.00
Economic Sciences (SS) 15 4.3 3 0.67
BS, Biological Sciences; PS, Physical Sciences; SS, Social Sciences.
necessarily an optimum number, it is sufficient to show a
distribution of topics and trends. Relative cluster positions are
shown in Fig. 5. (iv) VXINSIGHT (50) was used to interactively
navigate and query the PNAS high-impact map. Fig. 5 shows
landscapes for three different time periods. When used inter-
actively, tools like VXINSIGHT can show the growth and decay of
research fronts in a visual way. (v) Diagnostic MeSH terms, i.e.,
those that differentiate one cluster from another, but that are not
necessarily the most common terms, were generated for each
cluster, and are given in Table 3. Dominant PNAS topics (from
the 1997-2001 Tables of Contents) were also found for each
cluster (see Table 3~.
The high-impact maps of Fig. 5 show two distinct features: a
core group of 20 close-knit clusters in the center, and the
remaining clusters that are dispersed and focus on individual
topics. The central position of the core clusters indicates their
centrality to the focus of PNAS over the 20-year period. This
core work had much to do with molecular cloning, hybridization,
sequencing, and other key techniques during the first 10 years,
shifting into more applied work on growth factors, cancers, and
gene expression in the middle years (see Fig. 5 and Table 3 to
match diagnostic terms to clusters and times). The most recent
work in this core area deals with molecular sequencing, RNA,
and cell metabolism.
The dispersed clusters do not have a common focus, but most
have strong links (through bibliographic coupling) to the core. In
general, the shift has been to more applied topics, often using the
revolutionary techniques associated with molecular cloning,
hybridization, and sequencing, but maintaining a focus on the
5198 1 www.pnas.org/cgi/doi/10.1073/pnas.0307509100
application. As a result, clusters of activity have focused on such
topics as brain-related research, specific gene and protein activ-
ity, protein folding, molecular models, and apoptosis, which was
identified as a hot topic from the same data by Griffiths and
Steyvers (21~.
Another interesting shift is shown by the dominant topics in
Table 3. One might assume that papers would tend to cluster
within PNAS topics, and that authors would cite heavily to
papers of the same topic. Over time, this occurrence has proved
to be less and less the case. The number of clusters with less than
30% of their papers belonging to a dominant topic has increased
over time. This finding indicates either that coupling between
PNAS topics is on the increase or that the perceived boundaries
between these topics are becoming more fuzzy.
It is also interesting to consider the characteristics of PNAS
topics. Topic assignments are made by authors rather than
editors, yet both may wish to see characteristics by topic in that
it may influence publishing choices. Second-level topics along
with their counts and mean percentile rankings are shown in
Table 4. The top 14 topics by percentile are all Biological
Sciences topics. Medical Sciences and Cell Biology, although
being two of the largest categories, rank highest. The largest
category, Biochemistry, has a mean percentile of 49. Physical
Sciences and Social Sciences categories all have mean percentiles
under 50, which is not surprising for a journal centered in
biochemistry.
Mapping of literatures in the ways shown here: i.e., generation
of visual maps, clustering, and analysis of the evolution of topics
over time, is amenable to discipline level or structural studies as
well as to the single journal study given here.
Boyack
OCR for page 17
Import-Export Within PNAS. Diffusion of information between
scientific disciplines is a relatively new topic of study. The largest
of these studies to date looked at 644,000 articles from the 1999
CDROM version of the SCIE (17~. Fifteen broad categories of
science were defined (e.g., Basic Life Sciences, Biology, Physics,
etc.), and the percentage of references from each category to the
others was calculated. Physics was found to be the most inde-
pendent, whereas Biology was nearly as dependent on Basic Life
Sciences as upon itself.
Import and export between fields can also be investigated
within a single multidisciplinary journal such as PNAS. Here, we
look at diffusion between PNAS topics as defined in Table 4. The
normalized (number of citations to topic divided by the number
of citations to all topics) diagonal of the citation matrix (data not
shown) is defined as an index of independence (17, 51), and is
given in Table 4. A higher independence value indicates a larger
fraction of references given to papers within topic. Indepen-
dence is thought to correlate with the basic or applied nature of
a field, with high independence indicating a basic science (17~.
A reordering of Table 4 by independence reveals that, in general,
the topics order themselves from basic to applied. Plant Biology,
Neurobiology, Biophysics, and Biochemistry are all more basic
fields than Genetics, Developmental Biology, Cell Biology, or
Physiology. For comparison, Rinia et al. (17) found that for the
entire Science Citation Index for 1999, Basic Life Sciences had an
independence value of 0.63, whereas the more applied Biology
had a value of 0.36. However, they also found that Clinical
Life Sciences had an independence of 0.67. The PNAS Medical
Sciences topic has a value of 0.53, indicating that PNAS Medical
Sciences papers may be more enabling (ability to export) than
medical sciences papers overall. The full citation matrix shows
that Medical Sciences receives >10% of the citations from 11 of
the other PNAS topics, including the nonbiological Computer
1. Garfield, E. (1955) Science 122,108-111.
2. Garfield, E. (1970) Nature 227, 669-671.
3. Price, D. J. D. (1963) Little Science, Big Science (Columbia Univ. Press, New
York).
4. Price, D. J. D. (1965) Science 149, 510-515.
5. Carpenter, M. P. & Narin, F. (1973) J. Am. Soc. Ink Sci. 24, 425-436.
6. Borner, K., Chen, C. & Boyack, K. W. (2003) Annul Rev. In; Sci. Technol. 37,
179-255.
7. Kim, S. K., Lund, J., Kiraly, M., Duke, K., Jiang, M., Stuart, J. M., Eizinger,
A., Wylie, B. N. & Davidson, G. S. (2001) Science 293, 2087-2092.
8. Card, S., Mackinlay, J. & Shneiderman, B. (1999) Readings in Information
Visualization: Using Vision to Think (Morgan Kaufmann, San Francisco).
9. Chen, C. (2003) Mapping Scientific Frontiers: The Quest for Knowledge Visual-
ization (Springer, London).
10. Newman, M. E. J. (2001) Proc. Natl. Acad. Sci. USA 98, 404-409.
11. Glanzel, W. (2001) Scientometrics 51, 69-115.
12. White, H. D. & McCain, K. W. (1998) J. Am. Soc. Ini Sci. 49, 327-356.
13. Salton, G., Yang, C. & Wong, A. (1975) Comm. ACM 18, 613-620.
14. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. & Harshman,
R. A. (1990) J. Am. Soc. Inf. Sci. 41, 391-407.
15. Landauer, T. K., Laham, D. & Derr, M. (2004) Proc. Natl. Acad. Sci. USA 101,
5214-5219.
16. Bassecoulard, E. & Zitt, M. (1999) Scientometrics 44, 323-345.
17. Rinia, E. J., van Leeuwen, T. N., Bruins, E. E. W., van Vuren, H. G. & van
Raan, A. F. J. (2002) Scientometrics 54, 347-362.
18. Callon, M. & Law, J. (1983) Social Science Information 22, 191-235.
19. Leydesdorff, L. (1997) J. Am. Soc. Ini Sci. 48, 418-427.
20. Noyons, E. C. M., Moed, H. F. & Luwel, M. (1999) J. Am. Soc. Inf: Sci. 50,
1 15-131.
21. Griffiths, T. L. & Steyvers, M. (2004) Proc. Natl. Acad. Sci. USA 101,
5228-5235.
22. Erosheva, E., Fienberg, S. & Lafferty, J. (2004) Proc. Natl. Acad. Sci. USA 101,
5220-5227.
23. Batagelj, V. & Mrvar, A. (1998) Connections 21, 47-57.
24. Lin, X. (1997) J. Am. Soc. Inf: Sci. 48, 40-54.
25. Boyack, K. W., Wylie, B. N. & Davidson, G. S. (2002) J. Am. Soc. Inf: Sc'.
Technol. 53, 764-774.
Boyack
Sciences and Applied Mathematics. The most enabling topic,
receiving large fractions of citations from multiple topics, is
Biochemistry, which is consistent with the common perception
that it forms the core of PNAS publications. Chemistry is
anomalous in that it cites heavily to Biochemistry and Biophysics,
with an independence of 0.33. The corresponding value from
Rinia et al. (~17) is 0.63. Thus, the PNAS Chemistry topic must
be an evolved brand of chemistry that has more to do with
application of biology than chemistry at large.
Diffusion between PNAS and other journals could also be
examined by using a similar analysis on the citations to and from
PNAS.
Conclusions
Impact and funding indicators and citation-based maps have
been used to provide a characterization of publication in PNAS
from 1982 to 2001. The types of maps and analysis shown here
can be applied at many levels: single journal, single discipline,
groups of disciplines, etc., given appropriate data. Accurate
funding data, and especially, accurate records of the relationship
between individual grants and papers is needed. Given these
data, similar analyses could be performed for large fields of
science, or perhaps, even the whole of science. The ultimate goal
is to provide an interactive means of exploring and evaluating
scientific and technical information (publications, grants, etc.) to
help us obtain answers to questions of strategic importance and
aid the innovation process.
I thank the Laboratory Directed Research and Development Program,
Sandia National Laboratories, and Katy Borner, Richard Shiffrin, and
several anonymous reviewers for insightful comments and suggestions.
This work was supported by the U.S. Department of Energy under
Contract DE-AC04-94AL85000.
26. Wise, J. A. (1999) J. Am. Soc. Inf. Sci. 50, 1224-1233.
27. Morris, S. A., Yen, G., Wu, Z. & Asnake, B. (2003) J. Am. Soc. In; Sci. Technol.
54, 413-422.
28. Chen, C. & Kuljis, J. (2003) J. Am. Soc. Inf. Sci. Technol. 54, 453-446.
29. Godin, B. (2003) Res. Policy 32, 679-691.
30. National Science Board. (2002) Science and Engineering Indicators 2002
(National Science Foundation, Arlington, VA).
31. King, J. (1987) J. Ini Sci. 13, 261-276.
32. Martin, B. R. & Irvine, J. (1983) Res. Policy 12, 61-90.
33. Irvine, J. & Martin, B. R. (1984) Foresight in Science: Picking the Winners
(Frances Pinter Publications, London).
34. Frame, J. D. & Narin, F. (1976) Fed. Proc. 35, 2529-2532.
35. McAllister, P. R. & Narin, F. (1983) J. Am. Soc. Inf. Sci. 34, 123-131.
36. Lewison, G. & Dawson, G. (1998) Scientometrics 41, 17-27.
37. Lewison, G. (1998) Gut 43, 288-293.
38. Butler, L. (2001) Res. Eval. 10, 59-65.
39. Boyack, K. W. & Borner, K. (2003) J. Am. Soc. Inf. Sci. Technol. 54, 447-461.
40. Ingwersen, P. & Christensen, F. H. (1997) J. Am. Soc. Inf Sci. 48, 205-217.
41. Hood, W. W. & Wilson, C. S. (2001) J. Am. Soc. Inf. Sci. Technol. 52, 1242-1254.
42. Seglen, P. (1997) Allergy (Copenhagen) 52, 1050-1056.
43. Seglen, P. (1997) Br. Med. J. 314, 498-502.
44. Narin, F. & Hamilton, K. S. (1996) Scientometrics 36, 293-310.
45. Lewison, G., Dawson, G. & Anderson, J. (1995) in 5th International
Conference of the International Society for Scientometrics and Informetrics,
eds. Koenig, M. E. D. & Bookstein, A. (Learned Information, Medford, NJ),
pp. 255-263.
46. Scheffe, H. (1953) Biometrika 40, 87-104.
47. Kessler, M. M. (1963) Am. Doc. 14, 10-25.
48. Small, H. (1997) Scientometrics 38, 275-293.
49. Davidson, G. S., Wylie, B. N. & Boyack, K. W. (2001) in 7th IEEE Symposium
Inform Visualization (Info Vis 2001), eds. Andrews, K., Roth, S. & Wong, P. C.,
(IEEE Computer Society, Los Alamitos, CA), pp. 23-30.
50. Davidson, G. S., Hendrickson, B., Johnson, D. K., Meyers, C. E. & Wylie, B. N.
(1998) J. Intell. Inform. Syst. 11, 259-285.
~1. Urata, H. (1990) Scientometrics 18, 309-319.
PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5199
Representative terms from entire chapter:
cell biology