Transcription of the

Application presentation by Susan Dumais, Bellcore


DR. DUMAIS: As Daryl mentioned earlier, I'm going to focus on a completely different topic. I'm going to describe this morning some of the statistical challenges that we face in trying to help end users like ourselves better retrieve information from a variety of computer databases.

I think the promise of the information age is that we'll all have large volumes of relevant information at our fingertips. The reality of it, however, is that it is surprisingly difficult to find what you want when you want it. Librarians have known this problem for years. We as end users of systems like on-line library catalogs or the Worldwide Web are rediscovering this problem with alarming regularity.

What I want to focus on this morning is one specific approach that we have taken to solving this information retrieval problem. I think some of the statistical issues that come up here are also relevant to a wide variety of information filtering or routing problems, as well as to some information categorization problems.

Dave Lewis in his position paper has done a nice job of describing the relationship among these different problems, as well as some of the more statistical issues that arise. I'm going to focus on one particular information retrieval problem and solution that we have tried.

For present purposes, you can think of the information retrieval problem as exemplified by on-line library catalogs. You have a relatively stable database of objects and a wide variety of ad hoc user queries. Users are coming up constantly with new and changing queries and they want some good matches to those queries.

Most text retrieval systems are word-based. What I mean by that is, they match individual words and users' queries against words in the representation of the database objects. These kinds of word based matching systems are plagued by a variety of what we call verbal disagreement or vocabulary mismatch problems.

One aspect of the problem is that you get an incredible amount of junk back. Anybody who has done a search in a library catalog on the Worldwide Web is not surprised to hear that it is not unusual to get 50 percent junk back.

I mentioned at the outset that I am a psychologist. Our way of trying to understand these problems was to look at characteristics of how people use words. The same word can mean lots of different things to different users or in different situations. I think bank is the canonical example of this, meaning either financial matters or river matters. My own favorite example is the word chip, which is either an abbreviation for the California Highway Patrol or a person's first name, has to do with golf shots of food products of the chocolate kind or potato kind. It has to do with semiconductor technology, and so on. You can imagine the junk that you get back if you ask a query about chips.

The flip side of the coin is that you also miss an amazing amount of relevant information. This is much harder to know about, because you are missing the information. In carefully controlled information retrieval studies, people have typically found that you can miss 50 or 80 percent of the known relevant materials. This is an underestimate, because somebody else has found the material. So this underestimates the problem.

We believe that an important reason for the misses is that there is tremendous diversity in the words that people use to describe the same idea or concept. In some studies that we have done, we have found that the probability that two people agree on the same main descriptor word or object is about ten or 20 percent, depending somewhat on the main task.

Consider this: it is called an overhead or a Vu-graph or a transparency or a slide or a foil. What this means for retrieval is that if I as an author use one of those words and you as a searcher use another, you're going to fail to find my wonderful materials.

Now, another way of thinking about this last problem is that word-based retrieval systems treat words as if they are uncorrelated or independent. If I ask a query about automobiles, I am no more likely to find an article about cars than one about elephants or slide projectors. This is clearly untrue of the way that people use and generate words. It is also untrue of the way that human memory works. I think it is an undesirable characteristic of a lot of the automated systems that we now have.

Let me try to make these word based retrieval systems a bit more concrete. A text database can be described by means of a term by document matrix. Each of these cell frequencies indicates how often a particular term occurs in a document. For most modern methods, these cell frequencies are transformed in such a way to reflect the ability of individual words to discriminate among documents. So words that occur uniformly distributed in the database are given low weight, those that are very discriminating are given typically high weight.

The technique characteristic for today's purposes about these matrices is that in real applications, they are large and they are sparse. Most words don't occur in most documents.

Let me say a little bit about how a vector retrieval kind of word matching operates. In these systems, a user's query is represented as a vector in this same matrix. The similarity of that query vector to each of the database documents is computed, usually using a dot product or a cosine measure of similarity.

In this case, the query was about human-computer interaction. You can see that it matches a couple of document about human and computer interaction, which is good, but it also misses two others because those authors wrote not about humans and computers, but about users and systems.

Now, there are a number of approaches that have been used to try to overcome these problems of verbal disagreement. I'm not going to spend much time talking about them. Let me just mention a couple. One is to restrict the vocabulary that uses use for querying and that indexers use for indexing.

Another approach is to try to augment users' queries by means of automatically or manually derived thesauri. Other approaches involve AI, knowledge representation schemes that are typically hand crafted for each domain of interest, like medicine or law.

The approach that I want to talk about today to overcoming this problem uses instead an automatic and fairly general statistical technique in order to uncover what we hope are useful relationships among words and documents, and to improve peoples' ability to find what they want when they want it.

I'm going to do a lot of hand waving here in talking about the approach. People in the audience probably know an awful lot more than I do about the statistical details of the method, but let me give you an intuition about why we chose this method and some of the problems that it addresses.

Our latent semantic indexing method begins by viewing the observed term by document matrix as an unreliable estimate of the true associations between words and topics of interest or documents. We further assume that these words aren't generated randomly. That is, there is some underlying or what we call latent structure in the pattern of word usage across different documents. This regularity of structure is partially obscured by variability in word choice.

We use a reduced or truncated singular value decomposition in order to estimate this structure. The SVD is closely related to principal components analysis or factor analysis, Eigen analysis. It is in fact a linear net approach to this problem. Again, the intention is that we're going to try to approximate the observed term by document matrix using a much smaller number of statistically derived orthogonal indexing dimensions. So roughly speaking, we are trying to capture the regularity of how people use words in documents, but at the same time ignore the surface level variability in the particular word that they chose to use.

Let me point out that the number of dimensions we use in our information retrieval problems is typically on the order of several hundred. This is much fewer than the number of unique words. There can be hundreds of thousands or millions of words in the database. However, it is much larger than the number of dimensions that I was familiar with in factor analytic applications.

We then used the resulting reduced representation rather than surface level words for retrieval. We found in a number of formal tests that this latent semantic indexing method gives us about a 30 percent improvement over comparable key word methods.

Graphically, this is a graphical representation of the SVD. We have a rectangular term by document matrix. We re-describe it by means of orthogonal factors. In this case, you can think of a term matrix, a document matrix, each of which has orthonormal columns, and a diagonal matrix of single values. We want not to recreate this original matrix, but rather to approximate it in such a way that we overcome some of the problems of surface level word variability.

We think we can get a good approximation to this matrix, using what I have called a reduced or truncated singular value decomposition. The idea is, you keep only the K largest singular values and their associated left and right singular vectors.

This matrix here is the best ranked K approximation to our observed term by document matrix in the Lee square sense. You can think then of retrieval as operating either using this improved estimate of the word by document matrix or as exploring similarity neighborhoods in a K dimensional space. Each term and each document has a vectoring K space, and what you're doing is looking for nearby objects in K space.

Let me give you some characteristics of the data sets that we have looked at. Let's start with one that we started with about five or six years ago now. It has 1,033 documents, 5800 unique words. A couple of things to notice about it. It has less than one percent non-zero entries. We can compute the hundred largest singular values and vectors in about two minutes on a standard spark work station, using iterative sparse matrix methods. About five years ago, it took us 48 hours on somewhat slower machines, using standard dense methods.

The computational complexity however increases as the data size increases. We currently can handle databases of about 100,000 documents on standard work stations, albeit with reasonable amounts of core memory. For larger applications -- and in fact, these are the ones of most interest, we have had to resort to sub-sampling. Our limits involve memory IO as well as CPU to a lesser extent.

This trek(?) collection is a collection that is made available by NIST and ARPA to explore information retrieval with large data sets. This is the kind of collection we would like to be able to handle directly, rather than by sampling. It represents about three gigabytes of ASCII text. As you can see, there are 750,000 documents and 500,000 terms.

I should mention also that the number of terms here are the number of terms that occur in more than one document. We have omitted from the SVD terms that occur in exactly one document. You can about double this number if you include those terms.

The other thing is that the CPU times are for extracting the hundred largest singular values and vectors. For many of these problems, we have found that we need to increase this to compute 300 or 400 dimensions.

This increases the computational time by about a factor of ten. So to compute the SVD of 70,000 by 87,000 document matrix takes us about 24 hours on a standard spark board station. This is not unreasonable, because this is largely a one-shot process.

Let me end, as my time is running out, with some of the statistical issues that we have encountered and that I hope you have some hints about. The first is how we choose the number of dimensions in our reduced representation. We have done it largely by seat of the pants. You know when it doesn't work. You know when you have too few dimensions. We would like some better methods for doing this, things like the scree test don't seem to correspond very well to behavioral data that we have.

We would also like to be able to do larger SVDs faster. We are beginning to face issues about how to update SVDs, especially for changing databases. We can compute the SVD-1s for something like your library card catalog. For databases like running APU's wire, that change very frequently, we have issues of either recomputing the SVD to maintain orthogonality, or updating it efficiently.

We are interested in what is probably not a statistical issue, but one that is of great practical interest to us. That is finding neighbors in high dimensional spaces. Our queries are vectors in this 300 space, for example. Right now, in order to find near neighbors, the nearest documents, we as psychologists try the brute force method. We compute the similarities to all 700,000 documents, sort and rank.

We have tried a number of methods that don't seem to generalize very well from their original developments in two or three space to 200 or 300 space. We would also like to look at other models of associative structure. We chose the singular value decomposition dimensional model as a compromise between representational richness and computational tractability. I think there are a number of other models that perhaps mimic the similarity structure of word usage and memory better.

PARTICIPANT: Thank you, Susan. Questions from the floor?

PARTICIPANT: I'm a little nervous that if someone was browsing the Web and we hoped to put some of this material in the Web, that we're in trouble. We're talking about seat of the pants and underwear models, that people are going to get the wrong context for why we're here. But that is part of the big problem that Susan is talking about.

PARTICIPANT: I thought I would just mention an entirely different approach to this problem, with Joe (word lost) at EDS. What we're doing is --

PARTICIPANT: Can you get to a mike?

PARTICIPANT: We are using a poisson model for the word counts. Then we're interested in finding maximum likelihood estimates for the clustering, and we found various combinations of simulated annealing and markup chain Monte Carlo to work very well with funding these things.

One of the nice things in a model based approach is that you get natural measures of association rather than just SVD types of things, although it could be slower.

PARTICIPANT: I think one thing we will try to ask everyone after the conference is to send us electronically two or three references of relevant work that we can disseminate in this way, because we do hope to learn about new approaches and new methods and related work. So keep that in mind as the discussion progresses. We will send out E-mail requesting those in electronic form.

PARTICIPANT: (Comments off mike.)

DR. DUMAIS: It is first of all not clear that the 300 or 400 dimensions we have used for the trek databases is optimal. We find that performance is still increasing up to 400 dimensions; it may well increase beyond that.

In fact, I should mention that if you plot performance as a function of number of dimensions, what you get is an inverted U function that is heavily skewed. That is, performance increases dramatically as you go from 20 or 30 up to several hundred dimensions, and then it tails off gradually through the level of performance that you see with raw key word matching, which is the full dimensional solution.

We don't know that we have reached the peak. In problems where we know what the optimal number of dimensions is, we have found that the peak is not so sharp.


End Transcription