Information Resources, Inc.
Information Resources is a for-profit corporation that was founded in 1979. The software part of it was an acquired company, originally founded in 1969. Last year we sold about $350 million worth of data and software, with 3,500 employees. We collect consumer packaged goods data, the stuff that gets sold in grocery stores and drugstores and places like K-Mart and Walmart. We sell the data back to the manufacturers of those products so that they can track and assess their marketing programs. We also sell it to the retailers, so that they can see how they are doing.
We use this information for sales support (that is, how do I sell more product?), for marketing (that is, what products should I build and how should I support them and price them?), and also for logistical planning (that is, pushing stuff through the pipeline, and so forth.)
Our main database is called Infoscan. It was started in 1986. Our expanded version of Infoscan, called the RI Census, expanded our sample from 3,000 to 15,000 stores. That took place at the beginning of last year.
What we are in the business of doing is answering sales and marketing questions. Most of our products are developed and designed for interactive on-line use by end users who are basically sales people and planners. We do that because our margins are higher on that than on our consulting work. On consulting work, we basically do not make any money. On interactive on-line syndicated products where we have to ship out tapes or diskettes, we make fairly good margins.
We are basically in the business of helping people sell groceries. For example—we went through this with powdered laundry detergent—we helped Procter and Gamble sell more Tide. That is the only business that we are in. We do that by answering a group of questions for them. In a slightly simplified form, but not very simplified, I can classify them into four groups.
The first is tracking: how am I doing? What are my sales like? What are the trends like? How am I doing in terms of pricing? How much trade support am I getting? How many of my products are being sold with displays and advertising support?
The other three questions are aimed at some causal analysis underneath. The first is what we generally call variety analysis, that is, what products should I put in what stores? What flavors, what sizes? How much variety? Does variety pay off or is it just a waste of space? The second is, what price should I try to charge? I say "try to charge" because the manufacturers do not set the prices; the retailers do. The manufacturers have some influence, but they do not dictate it. The final area is what we call merchandising: how much effort should I put into trying to get a grocery store to put my stuff at the end of the aisle in a great big heap so that you trip over it and some of the items accidentally fall in your basket?
Anecdotally, displays are tremendously effective. For a grocery product we typically see that sales in a week when there is an end-of-aisle display will be four or five or six times what they are normally.
The main data that we collect is scanner data from grocery stores, drugstores, and mass merchandisers. Our underlying database is basically very simple. It has three keys that indicate what store a product was sold in, what UPC code was on the product, and what week it was sold. A lot of our data is now coming in daily rather than weekly.
We have only a few direct measures: how many units of a product did they sell and how many pennies' worth of the product did they sell, and some flags as to whether it was being displayed or whether it was in a feature advertisement, and some other kinds of odd technical flags.
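As a rough illustration of the record layout just described, the sketch below models one scanner record with the three keys and the direct measures. The field names, the sample UPC, and the `avg_price_cents` helper are assumptions made for illustration; they are not the actual Infoscan schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRecord:
    # Three keys (hypothetical names): store, product, and time period.
    store_id: int
    upc: str
    week: int            # week index; a daily feed would need a finer key
    # Direct measures.
    units: int           # units sold
    cents: int           # sales value in pennies
    on_display: bool     # display flag
    in_feature_ad: bool  # feature-advertisement flag

    def key(self):
        """Return the (store, UPC, week) key that identifies the record."""
        return (self.store_id, self.upc, self.week)

    def avg_price_cents(self):
        """Average price per unit in pennies, or None if nothing was sold."""
        return self.cents / self.units if self.units else None


# Placeholder values, not real data.
rec = ScanRecord(store_id=101, upc="012345678905", week=12,
                 units=40, cents=15960, on_display=True, in_feature_ad=False)
```

Keeping the record this narrow is consistent with the small per-record size quoted later in the talk; anything else (baselines, lift, aggregates) is derived from these fields.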
We then augment that with a few derived measures. We calculate a baseline of sales by a fairly simple exponentially weighted moving average with some correction for seasonality, so that we can see what the deviations are from the baseline. We calculate a baseline price also, so we can see whether a product was being sold at a price reduction. We calculate lift factors: if I sold my product and it was on display that week, how much of a rise above normal or expected sales did I get because of the display? We impute that. We do it in a very simple way by calculating the ratio of actual sales to baseline sales in weeks with displays. So you can imagine that this data is extraordinarily volatile.
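A minimal sketch of the baseline and lift calculation, under stated assumptions: the smoothing constant `alpha`, the choice to update the baseline only from non-display weeks, and the omission of the seasonality correction are all simplifications introduced here, not details given in the talk. Lift is taken as actual sales divided by baseline sales, so a value above 1 means a rise above normal.

```python
def ewma_baseline(units, displayed, alpha=0.25):
    """Return the baseline expected for each week, before seeing that week.

    Only non-display weeks update the baseline (an assumption), so promoted
    weeks do not inflate the estimate of normal sales.
    """
    baseline = []
    level = None
    for sold, disp in zip(units, displayed):
        if level is None:
            level = float(sold)  # seed with the first observation
        baseline.append(level)
        if not disp:
            level = alpha * sold + (1 - alpha) * level
    return baseline


def display_lift(units, displayed, baseline):
    """Average ratio of actual to baseline sales over display weeks."""
    ratios = [u / b for u, d, b in zip(units, displayed, baseline)
              if d and b > 0]
    return sum(ratios) / len(ratios) if ratios else None


# Made-up weekly unit sales with one display week.
units = [100, 100, 100, 500, 100]
displayed = [False, False, False, True, False]
base = ewma_baseline(units, displayed)
lift = display_lift(units, displayed, base)
```

Because the ratio is computed from only the handful of weeks in which a display ran, one unusual week moves the estimate a great deal, which is the volatility the speaker refers to.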
The data is reasonably clean. We spend an enormous amount of effort on quality assurance and we do have to clean up a lot of the data. Five to 15 percent of it is missing in any one week. We infer data for stores that simply do not get their data tapes to us in time.
From this raw data we aggregate the data. We aggregate to calculate expected sales in Boston, expected sales for Giant Food Stores in Washington, D.C., and so on, using stratified sampling weights. We also calculate aggregate products. We take all of the different sales of Tide 40-ounce boxes and calculate a total for Tide 40 ounce, then calculate total Tide, total Procter and Gamble, how they did on their detergent sales, and total category.
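The projection step can be sketched as follows. This assumes the simplest form of stratified weighting, in which each sampled store stands in for an equal share of the stores in its stratum; the strata, store names, and counts are invented for the example.

```python
from collections import defaultdict


def stratum_weights(universe_counts, sample_strata):
    """Weight per stratum = stores in the universe / stores in the sample."""
    sample_counts = defaultdict(int)
    for stratum in sample_strata.values():
        sample_counts[stratum] += 1
    return {s: universe_counts[s] / n for s, n in sample_counts.items()}


def project_sales(store_units, sample_strata, weights):
    """Projected market total: each store's units times its stratum weight."""
    return sum(units * weights[sample_strata[store]]
               for store, units in store_units.items())


# Hypothetical market: 2 sampled large stores stand in for 10,
# and 1 sampled small store stands in for 30.
universe_counts = {"large": 10, "small": 30}
sample_strata = {"s1": "large", "s2": "large", "s3": "small"}
store_units = {"s1": 200, "s2": 300, "s3": 40}

weights = stratum_weights(universe_counts, sample_strata)
projected = project_sales(store_units, sample_strata, weights)
```

Here the projected total is 200 × 5 + 300 × 5 + 40 × 30 = 3,700 units for the market.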
This is an issue that comes back to haunt us. There is a big trade-off. Because we do this precalculation ahead of time, it biases the analysis at run time: all of these totals are precalculated, and it is very expensive to get totals other than the ones that we precalculate.
We also cross-license to get data on the demographics of stores and facts about the stores. The demographics of stores is simply census data added up for some defined trading area around the store. These data are pretty good. Store facts—we cross-license these—include the type of store (regular store or a warehouse store). The data is not very good; we are not happy with that data.
Our main database currently has 20 billion records in it with about 9 years' worth of collected data, of which 2 years' worth is really interesting. Nobody looks at data more than about 2 years old. It is growing at the rate of about 50 percent a year, because our sample is growing and we are expanding internationally. We currently add a quarter of a billion records a week to the data set.
The records are 30, 40, 50 bytes each, and so we have roughly a terabyte of raw data, and probably three times that much derived data, aggregated data. We have 14,000 grocery stores right now, a few thousand nongrocery stores, generating data primarily weekly, but about 20 percent we are getting on a daily basis. Our product dictionary currently has 7 million products in it, of which 2 million to 4 million are active. There are discontinued items, items that have disappeared from the shelf, and so forth.
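The size estimate can be checked with back-of-the-envelope arithmetic using only the figures quoted in the talk. The decimal definition of a terabyte (10^12 bytes) and the 52-week year are the only assumptions added here.

```python
records = 20_000_000_000          # records in the main database
bytes_low, bytes_high = 30, 50    # stated range of record sizes
TB = 10**12                       # decimal terabyte (assumption)

raw_low_tb = records * bytes_low / TB     # lower bound on raw data size
raw_high_tb = records * bytes_high / TB   # upper bound on raw data size

weekly_records = 250_000_000              # a quarter of a billion per week
added_per_year = weekly_records * 52      # records added in a year
growth_at_50_percent = records // 2       # 50 percent of the current size
```

The raw data comes out at 0.6 to 1.0 terabyte, in line with "roughly a terabyte," and the weekly load rate implies about 13 billion new records a year, the same order as 50 percent growth on 20 billion records (10 billion).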
Remember, we are a commercial company; we are trying to make money. Our first problem is that our audience is people who want to sell Tide. They are not interested in statistics. They are not even interested in data analysis, and they are not interested in using computers. They want to push a button that tells them how to sell more Tide today. So in our case, a study means that a sales manager says, "I have to go to store X tomorrow, and I need to come up with a story for them. The story is, I want them to cut the price on the shelf, so I want to push a button that gives me evidence for charging a lower price for Tide." They are also impatient; their standard for a response time on computers is Excel, which is adding up three numbers. They are not statistically or numerically trained—they are sales people. They used to have support staff. There used to be sales support staff and market researchers in these companies, but they are not there anymore.
Analysis is their sideline to selling products, and so we have tried to build expert systems for them, with some success early on. But when we try to get beyond the very basic stuff, the expert systems are hard to do.
There are underlying statistical issues that, in particular, I need to look for. On price changes, we think that there is a downward-sloping demand curve. That is, if I charge more, I should sell less, but the data does not always say that, and so we have to do either some Bayesian calculations or impose some constraints.
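One way to impose the downward-sloping constraint can be sketched as a sign-constrained regression. The log-log demand model and the function name are assumptions chosen for illustration; the talk does not say which model or constraint is actually used, and this shows the constraint route rather than the Bayesian one.

```python
import math


def constrained_elasticity(prices, units):
    """Fit log(units) = a + b*log(price) by least squares, subject to b <= 0.

    Requires at least two distinct prices. Returns (a, b, b_ols), where
    b_ols is the unconstrained slope for comparison.
    """
    x = [math.log(p) for p in prices]
    y = [math.log(u) for u in units]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b_ols = sxy / sxx
    # With one slope and a free intercept, the constrained optimum is the
    # unconstrained slope clipped at zero.
    b = min(b_ols, 0.0)
    a = my - b * mx
    return a, b, b_ols


# Made-up data where sales happen to rise with price: the slope is clipped.
a_up, b_up, raw_up = constrained_elasticity([1.0, 1.1, 1.2, 1.3],
                                            [100, 104, 103, 108])
# Made-up data that behaves as expected: the constraint is inactive.
a_dn, b_dn, raw_dn = constrained_elasticity([1.0, 1.1, 1.2, 1.3],
                                            [100, 90, 80, 72])
```

When the constraint binds, the fitted elasticity is zero, which says only that the data do not support a price effect; a Bayesian prior would instead pull the estimate toward a plausible negative value.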
The databases are very large. As I alluded to earlier, we are doing all these precalculations, so we are projecting to calculate sales in Washington through a projection array. We are aggregating up an aggregation tree to get some totals for category and so forth. We do this because it saves a whole lot of time at run time, so we can get response times that are acceptable to people, and it saves a lot of space in the data, because we don't have to put in all of the detail. But it forces me to examine the data in the way we have allowed based on the precalculations. So we have a big trade-off here. The relevant subtotal is, what is the total market for powdered laundry detergent?
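The trade-off can be made concrete with a toy aggregation tree. The tree, the item names, and the unit counts below are hypothetical; the point is only that totals on the tree are cheap lookups once precalculated, while any other grouping needs a pass over the detail.

```python
from collections import defaultdict

# Hypothetical aggregation tree: each node points to its parent.
PARENT = {
    "Tide 40 oz box A": "Tide 40 oz",
    "Tide 40 oz box B": "Tide 40 oz",
    "Tide 40 oz": "Tide",
    "Tide 64 oz": "Tide",
    "Tide": "Procter and Gamble detergent",
    "Procter and Gamble detergent": "Powdered laundry detergent",
    "Brand X 40 oz": "Brand X",
    "Brand X": "Powdered laundry detergent",
}


def precalculate_totals(leaf_units):
    """Roll item-level units up to every ancestor, once, ahead of time."""
    totals = defaultdict(int)
    for item, units in leaf_units.items():
        node = item
        while node is not None:
            totals[node] += units
            node = PARENT.get(node)
    return dict(totals)


def ad_hoc_total(leaf_units, predicate):
    """A total that is not on the tree requires a scan of the detail."""
    return sum(u for item, u in leaf_units.items() if predicate(item))


leaf_units = {"Tide 40 oz box A": 10, "Tide 40 oz box B": 15,
              "Tide 64 oz": 5, "Brand X 40 oz": 20}
totals = precalculate_totals(leaf_units)
all_40_oz = ad_hoc_total(leaf_units, lambda item: "40 oz" in item)
```

"Total Tide" and "total category" are answered from `totals` directly, but "all 40-ounce items across brands" cuts across the tree and cannot be, so it either costs a scan or requires that the item-level detail be kept.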
Those are all the nominal problems. What are the real problems? The real problem is that I have only two programmers who work for me. The tools that we have at our disposal are pretty good, at least as a starting point for front ends. But on the back end, just using SQL Query against Oracle or something simple is not fast enough. I do not have enough programmers to spend a lot of time on programming special-purpose retrievers over and over again. I have competition for my staff from operational projects for relatively simple things that we know are going to pay off. So time to spend on these interesting projects is being competed for by other projects.
Response time, particularly, is always a problem because of the questions that people ask, such as, What is the effect of a price change going to be in Hispanic stores if I increase the price of Tide by 10 percent? They are guessing at what to ask, and so they are not willing to invest a great deal in these questions.
The database setup time and cost are a problem. The setup time on these databases is mostly a "people" cost; it is not so much the computing time. It is getting people to put in all of the auxiliary data that we need around the raw data. So I have difficulty with getting enough auxiliary information in there to structure the analyses.
Carolyn Carroll: When you say auxiliary data, what are you talking about?
John Schmitz: A lot of what we are doing is looking at competitive effects, for example. So when I am doing an analysis on Tide, I need to know who Tide's competitors are. To a very limited extent you can do that by looking at the data. To a large extent you have to have somebody go in and enter the list of competitive brands. That is one basic thing.
Another is figuring out reasonable thresholds for defining exception points. A lot of that is manual. We start with automated systems, but a lot of it has to be examined manually.
When I mention a lack of staff, it is not so much a lack of straight programmers, but people who know the programming technology, and people who also understand the subject matter well enough to not need a whole lot of guidance or extremely explicit specifications.
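The exception thresholds mentioned above could take a form like the following sketch, which flags weeks whose sales deviate from baseline by more than a fixed fraction. The relative-deviation rule and the default threshold of 0.5 are assumptions for illustration; in practice the speaker notes that such thresholds are tuned and reviewed by hand.

```python
def exception_weeks(units, baseline, threshold=0.5):
    """Return indices of weeks whose relative deviation exceeds threshold."""
    flagged = []
    for week, (actual, expected) in enumerate(zip(units, baseline)):
        if expected <= 0:
            continue  # no meaningful baseline to compare against
        deviation = (actual - expected) / expected
        if abs(deviation) > threshold:
            flagged.append(week)
    return flagged


# Made-up series: week 2 is 60 percent above baseline, week 3 is 60 below.
units = [100, 95, 160, 40, 105]
baseline = [100.0] * 5
flagged = exception_weeks(units, baseline)
```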
Stephen Eick: So with your data, what are the privacy issues? I have noticed the few times I go to the store that you now have store cards, and so the stores know everything I have bought; they know who I am; they know my historical buying pattern. I am sure they have squirreled all this data away in a database. I am expecting soon to show up at the store and be pitched with coupons as I walk in.
Schmitz: That has not happened to you yet? I am not being facetious; we do not currently run any programs like that, but there are programs of that nature.
Participant: I think since they know everything I have bought, they are going to start targeting me with coupons. I personally resist, because I refuse to have a card. But others use every little coupon they can get.
Schmitz: There are two privacy issues. The privacy issue with our store audit database involves a contract that we have with the grocery chains that we will not release data identified with specific individual stores. We will not release sales data. So when we put out reports and so forth, we have to make sure that we have aggregated to the extent that we do not identify individual stores and say how much of a product they have sold.
The second privacy issue concerns individuals. We do have a sample of 100,000 U.S. households that identified themselves, and from whom we have a longitudinal sample that goes back anywhere from 3 to 10 years. We release that data, but it is masked as to the individuals. We have demographics on the individuals, but we do not identify them.
Eick: The other aspect of privacy involves the security cameras—at some point they are going to start tracking where I go in the store and what I look at. Then when I buy it, they are going to know it was me. So they are going to know not only what I bought, but also what I thought about buying.
Schmitz: Security cameras are used not so much to track people through the stores as to indicate when people are unhappy or happy about things—at hotels and so forth. We are not doing any of that yet.
Lyle Ungar: Are all your computations done off-line, or do you do on-line calculations, and how complex are they? Do you do factor analysis? Do you run correlations with demographics?
Schmitz: We do factor analysis, or rather principal components, off-line in order to reduce the dimensionality of our demographic data. We do that off-line and keep just the component weights. About the most complicated things we do on-line are some fairly simple regressions and correlations, with a little bit, but not a whole lot, of attention to robustization.