Massive Data Sets: Proceedings of a Workshop (1997)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

"Marketing." Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press, 1997.

what store a product was sold in, what UPC code was on the product, and what week it was sold. A lot of our data is now coming in daily rather than weekly.

We have only a few direct measures: how many units of a product did they sell and how many pennies' worth of the product did they sell, and some flags as to whether it was being displayed or whether it was in a feature advertisement, and some other kinds of odd technical flags.

We then augment that with a few derived measures. We calculate a baseline of sales using a fairly simple exponentially weighted moving average, with some correction for seasonality, so that we can see the deviations from the baseline. We calculate a baseline price also, so we can see whether a product was being sold at a price reduction. We calculate lift factors: if I sold my product and it was on display that week, how much of a rise above normal or expected sales did I get because of the display? We impute that. We do it in a very simple way, by calculating the ratio of actual sales to baseline sales in weeks with displays. So you can imagine that this data is extraordinarily volatile.
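As a rough illustration of the baseline and lift calculations, here is a minimal Python sketch. It is not the speaker's production method: the function names, the smoothing constant `alpha`, the choice to skip display weeks when updating the baseline, and the seeding from the first week are all assumptions, and the seasonality correction mentioned in the talk is left out. Lift is taken here as actual sales divided by baseline sales, so a value above 1 means a rise above expected sales.

```python
def ewma_baseline(sales, on_display, alpha=0.3):
    """Baseline for each week: an exponentially weighted moving average
    of the non-display weeks seen before that week.

    Hypothetical helper; alpha and the seeding rule are assumptions.
    """
    baseline = []
    level = None
    for units, displayed in zip(sales, on_display):
        if level is None:
            level = float(units)  # seed with the first observed week
        baseline.append(level)    # expectation formed before seeing this week
        if not displayed:
            # Only ordinary weeks move the baseline, so display spikes
            # do not inflate "normal" sales.
            level = alpha * units + (1 - alpha) * level
    return baseline


def display_lift(sales, on_display, baseline):
    """Ratio of actual to baseline sales, pooled over display weeks."""
    actual = sum(u for u, d in zip(sales, on_display) if d)
    expected = sum(b for b, d in zip(baseline, on_display) if d)
    return actual / expected if expected else None


weekly_units = [100, 120, 110, 240]          # made-up numbers
displayed = [False, False, False, True]
base = ewma_baseline(weekly_units, displayed, alpha=0.5)
lift = display_lift(weekly_units, displayed, base)
```

With only one or two display weeks per product and store, the denominator is a noisy estimate, which is one way to see why such ratios are volatile.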

The data is reasonably clean. We spend an enormous amount of effort on quality assurance and we do have to clean up a lot of the data. Five to 15 percent of it is missing in any one week. We infer data for stores that simply do not get their data tapes to us in time.

From this raw data we aggregate the data. We aggregate to calculate expected sales in Boston, expected sales for Giant Food Stores in Washington, D.C., and so on, using stratified sampling weights. We also calculate aggregate products. We take all of the different sales of Tide 40-ounce boxes and calculate a total for Tide 40 ounce, then calculate total Tide, total Procter and Gamble, how they did on their detergent sales, and total category.
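The two kinds of aggregation described here, projecting sampled stores up to a market and rolling items up a product hierarchy, can be sketched together. This is only an illustration under assumptions: the record layout, the field names, and the per-store `weight` (standing in for a stratified sampling weight) are hypothetical, and the numbers are made up.

```python
from collections import defaultdict


def rollup(records):
    """Precompute weighted totals for every level of the product hierarchy
    within each market.

    Each record is a dict with hypothetical keys: market, category,
    manufacturer, brand, item, units, weight.
    """
    totals = defaultdict(float)
    for r in records:
        projected = r["units"] * r["weight"]  # scale the sampled store up
        path = (r["category"], r["manufacturer"], r["brand"], r["item"])
        # Add the projected units to the item and to each of its ancestors.
        for depth in range(1, len(path) + 1):
            totals[(r["market"], path[:depth])] += projected
    return dict(totals)


sample = [
    {"market": "Boston", "category": "Detergent",
     "manufacturer": "Procter and Gamble", "brand": "Tide",
     "item": "Tide 40 oz", "units": 10, "weight": 2.0},
    {"market": "Boston", "category": "Detergent",
     "manufacturer": "Procter and Gamble", "brand": "Tide",
     "item": "Tide 64 oz", "units": 5, "weight": 3.0},
]
totals = rollup(sample)
```

Any total that is not one of the precomputed prefixes, for example all 40-ounce boxes across brands, would require another pass over the raw records.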

This is an issue that comes back to haunt us. There is a big trade-off. If we do this precalculation, then at run time, at analysis time, it biases the analysis, because all of these totals are already calculated and it is very expensive to get totals other than the ones that we precalculate.

We also cross-license to get data on the demographics of stores and facts about the stores. The demographics of stores is simply census data added up for some defined trading area around the store. These data are pretty good. Store facts—we cross-license these—include the type of store (regular store or a warehouse store). The data is not very good; we are not happy with that data.

Our main database currently has 20 billion records in it with about 9 years' worth of collected data, of which 2 years' worth is really interesting. Nobody looks at data more than about 2 years old. It is growing at the rate of about 50 percent a year, because our sample is growing and we are expanding internationally. We currently add a quarter of a billion records a week to the data set.

The records are 30, 40, 50 bytes each, and so we have roughly a terabyte of raw data, and probably three times that much derived data, aggregated data. We have 14,000 grocery stores right now, a few thousand nongrocery stores, generating data primarily weekly, but about 20 percent we are getting on a daily basis. Our product dictionary currently has 7 million products in it, of which 2 million to 4 million are active. There are discontinued items, items that have disappeared from the shelf, and so forth.
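The size figures quoted above can be checked with a little arithmetic. The sketch below uses only the numbers stated in the talk; the variable and function names are illustrative, and a terabyte is taken as 10^12 bytes, which is an assumption.

```python
TOTAL_RECORDS = 20_000_000_000      # "20 billion records"
RECORDS_PER_WEEK = 250_000_000      # "a quarter of a billion records a week"


def raw_terabytes(records, bytes_per_record):
    """Raw size in terabytes, with 1 TB taken as 10**12 bytes."""
    return records * bytes_per_record / 1e12


low = raw_terabytes(TOTAL_RECORDS, 30)    # smallest quoted record size
high = raw_terabytes(TOTAL_RECORDS, 50)   # largest quoted record size
records_per_year = RECORDS_PER_WEEK * 52

print(f"raw data: {low:.1f} to {high:.1f} TB")
print(f"records added per year at the current rate: {records_per_year:,}")
```

The range of 0.6 to 1.0 TB agrees with "roughly a terabyte of raw data", and three times that for derived data puts the total at a few terabytes.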

Remember, we are a commercial company; we are trying to make money. Our first problem is that our audience is people who want to sell Tide. They are not interested in statistics. They are not even interested in data analysis, and they are not interested in using computers. They want to push a button that tells them how to sell more Tide today. So in our case, a study means that a sales manager says, "I have to go to store X tomorrow, and I need to come up with a story for them. The story is, I want them to cut the price on the shelf, so I want to push a button that gives me evidence for charging a lower price for Tide." They are also impatient; their standard for a response time on
