| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 54
Colloquium
Mapping subsets of scholarly information
Paul Ginspargtt, Paul Houle, Thorsten Joachims, and Jae-Hoon Sul
Cornell University, Ithaca, NY 14853
We illustrate the use of machine learning techniques to analyze,
structure, maintain, and evolve a large online corpus of academic
literature. An emerging field of research can be identified as part
of an existing corpus, permitting the implementation of a more
coherent community structure for its practitioners.
The arXiv (see http://arXiv.org; for general background, see
ref. 1) is an automated repository of >250,000 full-text
research articles (as of mid-October 2003) in physics and related
disciplines, going back more than a decade and growing at a rate
of 40,000 new submissions per year. It serves >10 million
requests per month (2), including tens of thousands of search
queries per day and >20 million full-text downloads during 2002.
It is a significant example of a Web-based service that has
changed the practice of research in a major scientific discipline.
It provides nearly comprehensive coverage of large areas of
physics and serves as an on-line seminar system for those areas.
It also provides a significant resource for model building and
algorithmic experiments in mapping scholarly domains. To-
gether with the SLAC SPIRES-HEP database, it provides a
public resource of full-text articles and associated citation trees
of many millions of links, with a focused disciplinary coverage
and rich usage data. tThe Stanford Linear Accelerator Center
SPIRES-HEP database has comprehensively catalogued the
High Energy Particle Physics (HEP) literature online since 1974,
and indexes >500,000 high-energy physics-related articles in-
cluding their full citation tree (see ref. 3~.]
In what follows, we use arXiv data to illustrate how machine
learning methods can be used to analyze, structure, maintain, and
evolve a large online corpus of academic literature. The specific
application will be to train a support vector machine text classifier
to extract an emerging research area from a larger-scale resource.
The automated detection of such subunits can play an important
role in disentangling other subnetworks and associated subcom-
munities from the global network. Our notion of "mapping" here
is in the mathematical sense of associating attributes to objects,
rather than in the sense of visualization tools. Although the former
underlies the latter, we expect the two will increasingly go hand-
in-hand (for a recent review, see ref. 4~.
Text Classification
The goal of text classification is the automatic assignment of
documents to a fixed number of semantic categories. In the
"multilabel" setting, each document can be in zero or one or
more categories. Efficient automated techniques are essential to
avoid tedious and expensive manual category assignment for
large document sets. A "knowledge-engineering" approach,
involving hand-crafting accurate text-classification rules, is sur-
prisingly difficult and time-consuming (5~. We therefore take a
machine learning approach to generating text-classification rules
automatically from examples.
The machine learning approach can be phrased as a super-
vised learning problem. The learning task is represented by the
training sample Sn
(hi, yet, (x2, A), ~ (Xn' An)
5236-5240 1 PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1
of size n documents, where xi represents the document content.
In the multilabel setting, each category label is treated as a
separate binary classification problem. For each such binary
task, yi ~ ~—1,+11 indicates whether a document belongs to a
particular class. The task of the learning algorithm, L, is to find
a decision rule, h ~ x ~ ~ -1,+1 T. based on Sn that classifies new
documents, x, as accurately as possible.
Documents need to be transformed into a representation
suitable for the learning algorithm and the classification task.
Information retrieval research suggests that words work well as
representation units and that for many tasks their ordering can
be ignored without losing too much information. This type of
representation is commonly called the "bag-of-words" model, an
attribute-value representation of text. Each text document is
represented by a vector in the lexicon space, i.e., by a "term
frequency" (TF) feature vector TF(wi, x), with component values
equal to the number of times each distinct word, wi, in the corpus
occurs in the document x. Fig. 1 shows an example feature vector
for a particular document.
This basic representation is ordinarily refined in a few ways.
TF x Inverse Document Frequency (IDF) Weighting. Scaling the
components of the feature vector with their IDF(wi) (6) often leads
to improved performance. In general, IDF(wi) is some decreasing
function of the word frequency DF(wi), equal to the number of
documents in the corpus which contain the word, wit For example,
IDF(Wi) = log(DF`w )),
[2]
where n is the total number of documents. Intuitively, the IDF
assumes that rarer terms have more significance for classification
purposes and hence gives them greater weight. To compensate
for the effect of different document lengths, each document
feature vector xi is normalized to unit length: A = 1.
Stemming. Instead of treating each occurrence form of a word as
a different feature, stemming is used to project the different
forms of a word onto a single feature, the word stem, by removing
inflection information (7~. For example "computes," "comput-
ing," and "computer" are all mapped to the same stem "com-
put." The terms "word" and "word stem" are used synonymously
in the following text.
Stopword Removal. For many classification tasks, common words
like "the," "and," or "he" do not help discriminate between
This paper results from the Arthur M. Sackier Colloquium of the National Acaclemy of
Sciences, "Mapping KnowiecJge Domains," hetcl May 9-11, 2003, at the ArnoicJ and Mabel
Beckman Center of the National Acaclemies of Sciences and Engineering in Irvine, CA.
Abbreviations: SVM, support vector machine; TF, term frequency; IDF, inverse cJocument
frequency.
Presenter of invited talk at Arthur M. Sackler Colloquium of the National AcacJemy of
Sciences, "Mapping Knowledge Domains," heicJ May 9-1 1, 2003, at the Arnold and Mabel
Beckman Center of the National Acaclemies of Sciences ancJ Engineering in Irvine, CA.
tTo whom correspondence shouicJ be acJciressed. E-maii: ginsparg~corneii.ecJu.
~ 2004 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10. ~ 073/pnas.0308253 ~ 00
OCR for page 55
From: ~oa@sciences.sdsu~edu
Newsgroups: Romp,
Subject: Need specs on Apple QT
I need to get the specs, or at least a
very verbose interpretation of the specs,
for Quiche me. Technl=~n~
magazines~nces to books would
be nice, too. \
I also need the specs in ~ fromat usa~
on a Unix or Condos system. I can't
do much win the Qu~kb~ey
have on ...
document classes. Stopword removal describes the process of
eliminating such words from the document by matching against
a predefined list of stopwords. We use a standard stoplist of
~300 words.
Support Vector Machines (SVMs)
SVMs were developed by Vapnik and coworkers (8) based on the
structural risk minimization principle from statistical learning
theory. They have proven to be a highly effective method for
learning text-classification rules, achieving state-of-the-art per-
formance on a broad range of tasks (9, 10~. Two main advantages
of using SVMs for text classification lie in their ability to handle
the high-dimensional feature spaces arising from the bag-of-
words representation. From a statistical perspective, they are
robust to overfitting and are well suited for the statistical
properties of text. From a computational perspective, they can
be trained efficiently despite the large number of features. A
detailed overview of the SVM approach to text classification,
with more details on the notation used below, is given in ref. 11.
In their basic form, SVMs learn linear decision rules,
h ~) = sgn~w x + b i, [3]
Fig. 1. Representing text as a feature vector.
The margin of the resulting hyperplane is ~ = 1/~w~.
The constraints (Eq. 5) require that all training examples are
classified correctly up to some slack A. If a training example lies
on the "wrong" side of the hyperplane, we have the correspond-
ing hi ~ 1, and thus An= (i is an upper bound on the number of
training errors. The factor C in Eq. 4 is a parameter that allows
trading off training error vs. model complexity. The optimal
value of this parameter depends on the particular classification
task and must be chosen by means of cross-validation or by some
other model selection strategy. For text classification, however,
the default value of C = 1/maxi~i~2 = 1 has proven to be
effective across a large range of tasks (11~.
OP1 has an equivalent dual formulation:
Optimization Problem 2 [SVM (Dual)].
1
maximize: W(or) = 2ai - 2 ~ MY, yjaio~jixi xj)
i=l i=1j=T
n
subject to: ~Yi°`i = 0
described by a weight vector w and a threshold b, from an input
sample of n training examples, Sn = ((xi'y~), · ~ (xn'yn)~' Hi ~
AN, yi ~ ~ - 1, + 11. For a linearly separable Sn, the SVM finds the
hyperplane with maximum Euclidean distance to the closest
training examples. This distance is called the margin 8, as
depicted in Fig. 2. Geometrically, the hyperplane is defined by its
normal vector, w, and its distance from the origin, -b. For
nonseparable training sets, the amount of training error is
measured by using slack variables, A.
Computing the position of the hyperplane is equivalent to
solving the following convex quadratic optimization problem (8~:
Optimization Problem 1 [SVM (Primal)].
minimize: V(w, b, f) = 2W-w + C7fi [4]
subject to: V'=~:yc~w-xi + b]—1 - Hi [5]
Ala=] (i > 0- [6]
Ginsparg et a/.
Vi ~ [l n]:0 c (xi C C. [9]
From the solution of the dual, the classification rule solution can
be constructed as
n
w = ~cY~xi and b =Yusv - W.xusv' [10]
where (xusv'Yusv) is some training example with 0 < case < C. For
the experiments in this article, SVMIight (11) is used for solving
the dual-optimization problem. (SVMlight is available at http://
svmlight.joachims.org/.) More detailed introductions to SVMs
can be found in refs. 12 and 13.
An alternative to the approach here would be to use Latent
Semantic Analysis (14) to generate the feature vectors. Latent
Semantic Analysis can potentially give better recall by capturing
PNAS 1 April 6, 2004 1 vol. 101 1 suppl 1 1 5237
OCR for page 56
+
+ + +
+ +
~ __- 0.6
—C
Fig. 2. A binary classification problem (+ vs. -) in two dimensions. The
hyperplane h* separates positive and negative training examples with maxi- 0.2
mum margin 3. The examples closest to the hyperplane are called "support
vectors" (marked with circles).
partial synonymy in the word representation, i.e., by bundling
related words into a single feature. The reduction in dimension
can also be computationally efficient, facilitating capture of the
full-text content of papers, including their citation information.
On the other hand, the SVM already scales well to a large feature
space and performs well at determining relevant content
through suitable choices of weights. On a sufficiently large
training set, the SVM might even benefit from access to fine
distinctions between words potentially obscured by the latent
semantic analysis bundling. It is thus an open question, well :
worth further investigation, as to whether latent semantic anal-
ysis applied to full text could improve performance of the SVM
in this application without loss of computational efficiency.
Generative models for text classification provide yet another
alternative approach, in which each document is viewed as a
mixture of topics, as in the statistical inference algorithm for
Latent Dirichlet Allocation used in ref. 15. This approach can,
in principle, provide significant additional information about
document content but does not scale well to a large document
corpus, so the real-time applications intended here would not yet
be computationally feasible in that framework.
arXiv Benchmarks
Before using the machine learning framework to identify new
subject area content, we first assessed its performance on the
existing (author-provided) category classifications. Roughly
180,000 titles and abstracts were fed to model-building software,
which constructed a lexicon of ~100,000 distinct words and
produced training files containing the TDxIDF document vec-
tors for SVM~ight. "Although the SVM machinery could easily be
used to analyze the full document content, previous experiments
(11) suggest that well written titles and abstracts provide a highly
focused characterization of content at least as effective for our
document classification purposes.] The set of support vectors
and weight parameters output by SVM~ight was converted into a
form specific to the linear SVM, Eq. 3: a weight vector we and
a threshold be' where c is an index over the categories.
As seen in Fig. 3, the success of the SVM in classifying
documents improves as the size of a category increases. The
SVM is remarkably successful at identifying documents in large
(>10,000 documents) categories and less successful at identify-
ing smaller subject areas (<500 documents). A cutoff was
imposed to exclude subject areas with <100 documents. (Some
of the smaller subject areas are known to be less topically
focused, so the difficulty in recall, based solely on title/abstract
terminology, was expected.)
In experiments with small categories (N < 1,000), the SVM
consistently chose thresholds that resulted in unacceptably low
recall (i.e., missed documents relevant to the category). This is
in part because the null hypothesis, that no documents are in the
5238 1 www. pnas.org/cg i/doi/ 10.1 073/pnas.0308253 100
on
. .
0.4 _
08L
0.6 _
80.4 _
.
.
v ..... I . .. eP . I . . ~ . I
100 1000 10000
Number of documents in category
.
0.2
O-
.
.
L ~ .h ., 1 ., 1 . . 1
100 1000 10000
Number of documents in category
Fig. 3. Recall and precision as functions of category size for 78 arXiv major
categories and minor subject classes. Two thirds of the sample was used as a
training set and one third as a test set. The four largest categories, each with
>30,000 documents are cond-mat, astro-ph, hep-th, and hep-ph.
category, describes the data well for a small category; e.g., if only
1,000 of 100,000 documents are in a category, the null hypothesis
has a 99% accuracy. ("Accuracy" is the percentage of documents
correctly classified. We also employ other common terminology
from information retrieval: "precision" is the fraction of those
documents retrieved that are relevant to a query, and "recall" is
the fraction of all relevant documents in the collection that are
retrieved.) To counteract the bias against small categories, 10%
of the training data was used as a validation set to fit a sigmoid
probability distribution 1/[1 + exp(Af + B)] (16). This converts
the dot product xi we between document vector and weight vector
into a probability, P(i ~ c I xi), that document i is a member of
category c, given its feature vector xi. Use of the probability in
place of the uncalibrated signed distance from the hyperplane
output by the SVM permitted greater levels of recall for small
categories.
Other experiments showed that the use of TFx IDF weighting
as in Eq. 2 improved accuracy consistently over pure TF
weighting, so TFX IDF weighting was used in the experiments to
follow. We also used a document frequency threshold to exclude
rare words from the lexicon but found little difference in
accuracy between using a document occurrence threshold of two
and five. (Words that appeared in fewer than two documents
constituted ~50% of the lexicon, and those that appeared in
Ginsparg eta/.
OCR for page 57
4500
4000
3500
3000
a)
`,' 2500
o
an
.m
an
2000
1 500
1 000
500
o
1 1 1 1 1 1 1 1 1
| 121 cond-mat.(soft,stat-mech,dis-nn) + physics.bio-ph + nlin.AO ~
4:4
1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Fig. 4. The number of submissions per year from 1992 through April 2003 in the particular subsets of arXiv subject areas of cond-mat, physics, and nlin most
likely to have "quantitative biology" content. The percentage of q-bio content in these areas grew from ~1 % to nearly 10% during this timeframe, suggesting
a change in intellectual activity among arXiv-using members of these communities.
fewer than five documents, ~70~o. Ignoring rare and conse-
quently uninformative words hence reduces the computational
needs.) Increasing the weight of title words with respect to
abstract words, on the other hand, consistently worsened accu-
racy, indicating that words in a well written abstract contain as
much or more content classification import as words in the title.
Changes in the word tokenization and stemming algorithms did
not have a significant impact on overall accuracy. The default
value of C = 1 in Eq. 9 was preferred.
q-bio Extraction
Recent anecdotal evidence indicates an intellectual trend among
physicists toward work in biology, ranging from biomolecules,
molecular pathways and networks, gene expression, cellular and
multicellular systems to population dynamics and evolution. This
work has appeared in separate parts of the archive, in particular,
under "Soft Condensed Matter," "Statistical Mechanics," "Dis-
ordered Systems and Neural Networks," "Biophysics," and
"Adaptive and Self-Organizing Systems" (abbreviated cond-
mat.soft, cond-mat.stat-mech, cond-mat.dis-nn, physics.bio-ph,
and nlin.AO). A more coherent forum for the exchange of these
ideas was requested, under the nomenclature "Quantitative
Biology" (abbreviated"q-bio"~.
To identify first whether a real trend to nurture and amplify
indeed existed and to create a training set, volunteers were
enlisted to identify the q-bio content from the subject areas
above in which it was most highly focused. Of 5,565 such articles
received from January 2002 through March 2003, 466 (8.4%)
. .
were found to have one ot the biological topics above as its
primary focus. The total number of distinct words in these titles,
abstracts, plus author names, was 23,558, of which 7,984 were
above the DF = 2 document frequency threshold. (Author
names were included in the analysis because they have potential
"semantic" content in this context, i.e., are potentially useful for
document classification. The SVM algorithm will automatically
determine whether to use the information by choosing suitable
weights.)
A data-cleaning procedure was used, in which SVM~ight was
first run with C = 10. We recall from Eqs. 4 and 9 that larger C
Ginsparg et al.
penalizes training errors and requires larger ~ values to fit the
data. Inspecting the "outlier" documents (17) with the largest boil
then permitted manual cleaning of the training set. Ten docu-
ments were moved into q-bio, and 15 documents were moved
out, for a net movement to 461 q-bio (8.3%) of the 5,565 total.
Some of the others flagged involved word confusions, e.g.,
"genetic algorithms" typically involved programming rather than
biological applications. Other q-bio words with frequent non-
biological senses were "epidemic" (used for rumor propagation),
"evolution" (used also for dynamics of sandpiles), "survival
probabilities," and extinction. "Scale-free networks" were some-
times used for biological applications and sometimes not. To
Table 1. The most positive, a few intermediate, and the most
negative components of the q-bio classifying weight vector
protein
dna
biological
neuron
ma
gene
mutation
population
. . .
epidemic
biology
disease
cell
neural
brain
ecosystem
tissue
sequence
genetic
bacterial
blood
genome
peptide
infection
+8.57
+7.08
+5.06
+ 5.05
+3.30
+3.21
+3.1 5
+3.1 1
+3.05
+3.02
+2.93
+2.90
+2.89
+2.83
+2.56
+2.52 _
+2.51 d saakian -0.00
+2.51 conformity -0.00
+2.48 aware -0.00
+2.43 even though -0.00
+2.37 practitioner -0.00
+2.37 permittivity -0.00
+2.34
forward +0.00
minimalist +0.00
region +0.00
confinement +0.00
implies +0.00
96 +0.00
y togashi +0.00
n wingreen +0.00
mean free +0.00
narrower +0.00
shot -0.00
repton -0.00
kyoto -0.00
regular -0.00
ceneralization-O.OO
point
equation
boundary
social
n
relaxation
fluid
Indian
spin
spin glass
traffic
system
polymer
class
emerge
gradient
quantum
surface
synchronizatio
market
particle
polyelectrolyte
world
—1.04
—1.05
—1.06
—1.06
—1.09
—1.1 4
—1.15
—1.15
—1.17
—1.17
—1.18
—1.30
—1.33
—1.35
—1.36
—1.39
—1.43
—1.43
- 1 .45
—1.47
—1.52
—1.53
—1.57
PNAS 1 April 6, 2004 1 vol 101 1 suppl. 1 1 5239
OCR for page 58
help resolve some of these ambiguities, the vocabulary was
enlarged to include a list of most frequently used two-word
phrases with semantic content different from their constituent
words.
With a training set fully representative of the categories in
question, it was then possible to run the classifier on the entirety
of the same subject area content received from 1992 through
April 2003, a total of 28,830 documents. The results are shown
in Fig. 4. A total of 1,649 q-bio documents was identified, and the
trend toward an increasing percentage of q-bio activity among
these arXiv users is evident; individual authors can be tracked as
they migrate into the domain. Visibility of the new domain can
be further enhanced by referring submitters in real time by
means of the automated classifier running behind the submission
interface.
Some components of the weight vector generated by the SVM
for this training set are shown in Table 1. Because the distance
of a document to the classifying hyperplane is determined by
taking the dot product with the normal vector, its component
values can be interpreted as the classifying weight for the
associated words. The approach here illustrates one of the major
lessons of the past decade: the surprising power of simple
algorithms operating on large datasets.
Although implemented as a passive-dissemination system, the
arXiv has also played a social engineering role, with active
research users developing an affinity to the system and adjusting
their behavior accordingly. They scan new submissions on a daily
1. Ginsparg, P. (2001) in Electronic Publishing in Science II, Proceedings of Joint
ICSUPress/UNESCO Conference, Paris, France (ICSU Press, Paris) Available
at http://users.ox.ac.uk/icsuinfo/ginspargfin.htm and at http://arXiv.org/
blurb/pgOlunesco.html. Accessed January 22, 2004.
2. Ginsparg, P. (2003) Sci. Technol. Libraries 22, 5-18, preprint available at
http://arXiv.org/blurb/pgO2pr.html.
3. O'Connell, H. B. (2002) http://arXiv.org/physics/0007040.
. Borner, K., Chen, C. & Boyack, K. (2003) Annul Rev. Inf. Sci. Technol. 37,
179-255.
5. Hayes, P. & Weinstein, S. (1990) in Proceedings of L9AI-90, 2nd Conference on
InnovativeApplications of Artificial Intelligence, eds. Rappaport, A. & Smith, R.
(AAAI Press, Menlo Park, CA), pp. 49-66.
6. Salton, G. & Buckley, C. (1988) Inf. Processing Manage. 24, 513-523.
7. Porter, M. (1980) Program (Autom. Libr. Inf. Syst.) 14, 130-137.
8. Vapnik, V. (1998) Statistical Learning Theory (Wiley, Chichester, U.K.).
9. Dumais, S., Platt, J., Heckerman, D. & Sahami, M. (1998) in Proceedings of
5240 1 www.pnas.org/cgi/doi/10.1 073/pnas.0308253100
basis, assume others in their field do so, are consequently aware
of anything relevant that has appeared there (whereas anything
that doesn't may as well not exist), and use it to stake intellectual
priority claims in advance of journal publication. We see further
that machine learning tools can characterize a subdomain and
thereby help accelerate its growth by the interaction of an
information resource with the social system of its practitioners.
tSome other implications of these methods, including potential
modification of the peer review system, are considered else-
where (2~.]
Postscript
The q-bio extraction described above was not just a thought
experiment, but a prelude to an engineering experiment. The
new category went online in mid-September 2003 (see http://
arXiv.org/new/q-bio.html), at which time past submitters
flagged by the system were asked to confirm the q-bio content
of their submissions and to send future submissions to the new
category. The activity levels at the outset corresponded precisely
to the predictions of the SVM text classifier and later began to
show indications of growth catalyzed by the public existence of
the new category.
This work was supported by National Science Foundation Agreement
0132355 (to P.G.), funding from the Cornell University Library (to P.H.),
National Science Foundation Career Award 0237381 (to T.J.), and the
National Science Foundation KD-D Project (to T.J.~.
CIKM-98, Seventh ACM International ACM Conference in Information and Knowl-
edge Managment (Association for Computing Machinery, New York), pp. 148-155.
10. Joachims, T. (1998) in Proceedings of the European Conference on Machine
Learning (Springer. Berlin), pp. 137-142.
11. Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines:
Methods, Theory, and Algorithms (Kluwer, Dordrecht, The Netherlands).
12. Burges, C. (1998) Data Mining Knowledge Discovery 2, 121-167.
13. Cristianini, N. & Shawe-Taylor, J. (2000) An Introduction to Support Vector
Machines and Other Kernel-Based Learning Methods (Cambridge Univ. Press, NY).
Landauer, T. K. & Dumais, S. T. (1997) Psychol. Rev. 104, 211-240.
Griffiths, T. L. & Steyvers, M. (2004) Proc. Natl. Acad. Sci. USA 101,
5228-5235.
16. Platt, J. (1999) inAdvances in Large Margin Classifiers (MIT Press, Cambridge,
MA), pp. 61-74.
17. Guyon, I., Matic, N. & Vapnik, V. (1996) in Advances in Knowledge Discovery
and Data Mining (MIT Press, Cambridge, MA; AAAI Press, Menlo Park, CA),
pp. 181-203.
Ginsparg et a/.
Representative terms from entire chapter:
machine learning