what store a product was sold in, what UPC code was on the product, and what week it was sold. A lot of our data is now coming in daily rather than weekly.
We have only a few direct measures: how many units of a product did they sell and how many pennies' worth of the product did they sell, and some flags as to whether it was being displayed or whether it was in a feature advertisement, and some other kinds of odd technical flags.
We then augment that with a few derived measures. We calculate a baseline of sales by a fairly simple exponentially weighted moving average with some correction for seasonality, to indicate what the deviations are from the baseline. We calculate a baseline price also, so we can see whether a product was being sold at a price reduction. We calculate lift factors: if I sold my product and it was on display that week, how much of a rise above normal or expected sales did I get because of the display? We impute that. We do it in a very simple way, by calculating the ratio of actual sales to baseline sales in weeks with displays. So you can imagine that this data is extraordinarily volatile.
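The baseline and lift calculations described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the talk gives no code); the seasonality correction is omitted, the smoothing constant `alpha` is an assumption, and the lift is taken as the ratio of actual to baseline sales in display weeks:

```python
def ewma_baseline(sales, alpha=0.2):
    """Simple exponentially weighted moving average as a sales baseline.

    Each week's baseline is the smoothed level from prior weeks only,
    so a display spike does not contaminate its own baseline.
    """
    baseline = []
    level = sales[0]
    for s in sales:
        baseline.append(level)
        level = alpha * s + (1 - alpha) * level
    return baseline


def display_lift(sales, baseline, on_display):
    """Impute a lift factor: average ratio of actual to baseline sales
    over the weeks in which the product was on display."""
    ratios = [s / b
              for s, b, d in zip(sales, baseline, on_display)
              if d and b > 0]
    return sum(ratios) / len(ratios) if ratios else 1.0
```

For example, a week of 20 units against a baseline of 10 during a display would impute a lift factor of 2.0. With only one or two display weeks in the sample, the imputed ratio swings wildly, which is the volatility the talk warns about.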
The data is reasonably clean. We spend an enormous amount of effort on quality assurance and we do have to clean up a lot of the data. Five to 15 percent of it is missing in any one week. We infer data for stores that simply do not get their data tapes to us in time.
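One simple way to handle the 5 to 15 percent of store-weeks that arrive late or not at all is to substitute the baseline estimate for the missing value. The talk does not specify the inference method, so the sketch below is a hypothetical stand-in:

```python
def missing_fraction(week):
    """Fraction of stores in a given week with no reported data."""
    return sum(1 for v in week if v is None) / len(week)


def fill_missing(weekly_sales, baseline):
    """Replace missing store-week values (None) with the baseline
    estimate for that week -- a simplified stand-in for the inference
    done when a store's data tapes do not arrive in time."""
    return [b if s is None else s for s, b in zip(weekly_sales, baseline)]
```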
From this raw data we compute aggregates. We aggregate to calculate expected sales in Boston, expected sales for Giant Food Stores in Washington, D.C., and so on, using stratified sampling weights. We also calculate aggregate products. We take all of the different sales of Tide 40-ounce boxes and calculate a total for Tide 40 ounce, then calculate total Tide, total Procter and Gamble, how they did on their detergent sales, and a total for the category.
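The two aggregation steps described above, projecting sampled-store sales to market totals with stratum weights and rolling item totals up a product hierarchy, can be sketched as follows. All of the names, strata, and weights here are hypothetical illustrations, not the actual scheme:

```python
def project_sales(records, weights):
    """Estimate market totals from sampled stores.

    records: iterable of (store_id, stratum, product, units)
    weights: per-stratum projection factor (sample store -> universe)
    """
    totals = {}
    for store, stratum, product, units in records:
        totals[product] = totals.get(product, 0.0) + units * weights[stratum]
    return totals


def roll_up(item_totals, parent_of):
    """Aggregate item totals up a product hierarchy
    (e.g. Tide 40 oz -> Tide -> Procter and Gamble -> category)."""
    agg = dict(item_totals)
    for item, units in item_totals.items():
        node = parent_of.get(item)
        while node:
            agg[node] = agg.get(node, 0.0) + units
            node = parent_of.get(node)
    return agg
```

Precomputing and storing the output of `roll_up` is exactly the trade-off the next paragraph describes: the stored totals are cheap to read back, but any total outside the precomputed hierarchy has to be recomputed from the raw records.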
This is an issue that comes back to haunt us. There is a big trade-off: if we do this precalculation ahead of time rather than at run time, at analysis time, it biases the analysis, because all of these totals are precalculated, and it is very expensive to get totals other than the ones that we precalculate.
We also cross-license data on the demographics of stores and facts about the stores. The demographics of a store are simply census data added up for some defined trading area around the store; these data are pretty good. The store facts, which we also cross-license, include the type of store (a regular store or a warehouse store); that data is not very good, and we are not happy with it.
Our main database currently has 20 billion records in it with about 9 years' worth of collected data, of which 2 years' worth is really interesting. Nobody looks at data more than about 2 years old. It is growing at the rate of about 50 percent a year, because our sample is growing and we are expanding internationally. We currently add a quarter of a billion records a week to the data set.
The records are 30 to 50 bytes each, and so we have roughly a terabyte of raw data, and probably three times that much derived, aggregated data. We have 14,000 grocery stores right now and a few thousand nongrocery stores, generating data primarily weekly, but about 20 percent we are getting on a daily basis. Our product dictionary currently has 7 million products in it, of which 2 million to 4 million are active. There are discontinued items, items that have disappeared from the shelf, and so forth.
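The "roughly a terabyte" figure follows directly from the record counts and sizes just quoted. A quick back-of-envelope check, taking the upper end of the record-size range:

```python
records = 20e9          # records in the main database
bytes_per_record = 50   # upper end of the 30-50 byte range
raw_tb = records * bytes_per_record / 1e12   # raw data in terabytes

# weekly growth in gigabytes, from a quarter-billion new records a week
weekly_gb = 0.25e9 * bytes_per_record / 1e9
```

At 50 bytes per record the 20 billion records come to exactly 1 TB, matching the figure in the talk, and the quarter-billion weekly records add on the order of 12 to 13 GB of raw data per week.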
Remember, we are a commercial company; we are trying to make money. Our first problem is that our audience is people who want to sell Tide. They are not interested in statistics. They are not even interested in data analysis, and they are not interested in using computers. They want to push a button that tells them how to sell more Tide today. So in our case, a study means that a sales manager says, "I have to go to store X tomorrow, and I need to come up with a story for them. The story is, I want them to cut the price on the shelf, so I want to push a button that gives me evidence for charging a lower price for Tide." They are also impatient; their standard for a response time on