
Mapping Knowledge Domains (2004) / Chapter Skim
Currently Skimming:

Finding scientific topics
Pages 46-53

The Chapter Skim interface presents the single chunk of text that has been algorithmically identified as the most significant on each page of the chapter.


From page 46...
... We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content. When scientists decide to write a paper, one of the first things they do is identify an interesting subset of the many possible topics of scientific investigation.
From page 47...
... We address this problem by using a Monte Carlo procedure, resulting in an algorithm that is easy to implement, requires little memory, and is competitive in speed and performance with existing algorithms. We use the probability model for Latent Dirichlet Allocation, with the addition of a Dirichlet prior on φ.
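For orientation, the full conditional distribution that drives this Monte Carlo (Gibbs sampling) procedure is usually written as follows. This is a reconstruction in standard LDA notation (α and β are the Dirichlet hyperparameters, W the vocabulary size, T the number of topics), not a verbatim quotation of the paper's equation:

    P(z_i = j \mid z_{-i}, w) \propto
        \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
        \cdot
        \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} counts how often word w_i is assigned to topic j, and n^{(d_i)}_{-i,j} counts how often topic j occurs in document d_i, both excluding the current position i.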
From page 48...
... Because the only information needed to evaluate this conditional (Eq. 5 in the paper) is the number of times a word is assigned to a topic and the number of times a topic occurs in a document, the algorithm can be run with minimal memory requirements by caching the sparse set of nonzero counts and updating them whenever a word is reassigned. After enough iterations for the chain to approach the target distribution, the current values of the z_i variables are recorded.
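A minimal sketch of such a collapsed Gibbs sampler, assuming the conditional above; the function and variable names (gibbs_lda, docs, nwt, ndt) are illustrative, not from the authors' code:

    import numpy as np

    def gibbs_lda(docs, W, T, alpha, beta, n_iter=500, seed=0):
        # docs: list of documents, each a sequence of word ids in [0, W)
        rng = np.random.default_rng(seed)
        D = len(docs)
        nwt = np.zeros((W, T))   # times word w is assigned to topic t
        ndt = np.zeros((D, T))   # times topic t occurs in document d
        nt = np.zeros(T)         # total words assigned to each topic
        z = [rng.integers(T, size=len(doc)) for doc in docs]
        for d, doc in enumerate(docs):           # seed the count caches
            for i, w in enumerate(doc):
                t = z[d][i]
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
        for _ in range(n_iter):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    t = z[d][i]                  # remove current assignment
                    nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                    # word-topic term times topic-document term; the
                    # document-length denominator is constant in t, so omitted
                    p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                    t = rng.choice(T, p=p / p.sum())
                    z[d][i] = t                  # record new assignment
                    nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
        return z, nwt, ndt       # final sample and cached counts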
From page 49...
... With scientific documents, a large value of β would lead the model to find a relatively small number of topics, perhaps at the level of scientific disciplines, whereas smaller values of β will produce more topics that address specific areas of research. Given values of α and β, the problem of choosing the appropriate value for T is a problem of model selection, which we address by using a standard method from Bayesian statistics (15).
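The excerpt does not show that method inline, but a common recipe of this kind approximates the log evidence log P(w | T) by the harmonic mean of the likelihoods P(w | z) over Gibbs samples, then picks the T with the largest estimate. A sketch under that assumption, with hypothetical names (log_pwz_samples would hold per-sample log likelihoods from the sampler above):

    import numpy as np

    def harmonic_mean_log_evidence(log_pwz_samples):
        # log of the harmonic mean of P(w | z) over samples,
        # computed stably in log space via the log-sum-exp trick
        x = -np.asarray(log_pwz_samples, dtype=float)
        m = x.max()
        return np.log(len(x)) - (m + np.log(np.exp(x - m).sum()))

    # e.g. compare estimates across candidate model sizes:
    # best_T = max(candidates, key=lambda T: harmonic_mean_log_evidence(samples[T]))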
From page 50...
... If understanding these dynamics is the goal of our analysis, we can formulate more sophisticated generative models that incorporate parameters describing the change in the prevalence of topics over time. Here, we present a basic analysis based on a post hoc examination of the estimates of θ produced by the model.
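Such a post hoc examination can be as simple as averaging each topic's estimated proportion over the documents published in a given year. A sketch, where theta (documents × topics) and years are hypothetical arrays, not the paper's variables:

    import numpy as np

    def mean_theta_by_year(theta, years):
        # average document-topic proportions for each publication year
        years = np.asarray(years)
        return {int(y): theta[years == y].mean(axis=0) for y in np.unique(years)}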
From page 51...
... that might be less obvious in analyses that consider only the frequencies of single words. To find topics that consistently rose or fell in popularity from 1991 to 2001, we conducted a linear trend analysis on θ̂ by year, using the same single sample as in our previous analyses.
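A sketch of that style of trend analysis, regressing each topic's yearly mean proportion on year (mean_by_year as produced in the sketch above; scipy's linregress supplies the slope and p-value):

    import numpy as np
    from scipy.stats import linregress

    def topic_trends(mean_by_year):
        years = np.array(sorted(mean_by_year))
        props = np.stack([mean_by_year[y] for y in years])  # (years, topics)
        # one ordinary least-squares fit per topic; positive slopes mark
        # rising ("hot") topics, negative slopes falling ("cold") ones
        return [linregress(years, props[:, j]) for j in range(props.shape[1])]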
From page 52...
... tagged according to topic assignment. The superscripts indicate the topics to which individual words were assigned in a single sample, whereas the contrast level reflects the probability of a word being assigned to the most prevalent topic in the abstract, computed across samples.
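A hedged sketch of how such a tagging could be computed, assuming z_samples holds one topic-assignment array per Gibbs sample for a single abstract (a hypothetical input, not the paper's data structure):

    import numpy as np

    def tag_abstract(words, z_samples):
        z = np.asarray(z_samples)              # shape: (samples, words)
        top = np.bincount(z.ravel()).argmax()  # most prevalent topic overall
        shade = (z == top).mean(axis=0)        # per-word P(assigned to top topic)
        tags = [f"{w}^{t}" for w, t in zip(words, z[0])]  # superscripts from one sample
        return tags, shade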
From page 53...
... We thank Josh Tenenbaum, Dave Blei, and Jun Liu for thoughtful comments that improved this paper, Kevin Boyack for providing the PNAS class designations, Shawn Cokus for writing the random number generator, and Tom Minka for writing the code used for the comparison of algorithms. Several simulations were performed on the BlueHorizon supercomputer at the San Diego Supercomputer Center.


This material may be derived from rough machine reading of page images, and so is provided only to facilitate research.