Questions? Call 800-624-6242

| Items in cart [0]

## Statistical Analysis of Massive Data Streams: Proceedings of a Workshop (2004) Board on Mathematical Sciences and Their Applications (BMSA)

### Citation Manager

. "Session 5: Mining Commercial Streams of Data." Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press, 2004.

 Page 388

The following HTML text is provided to enhance online readability. Many aspects of typography translate only awkwardly to HTML. Please use the page image as the authoritative form to ensure accuracy.

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop

MR. DOMINGOS: Going back to the pairwise distances problem, they were all guaranteed to be within .4 or farther than .4. What if that never happens? Do you just go down to single problems?

You could always have two boxes where the smallest distance is less than .4 and the largest distance is more than .4.

MR. MOORE: In the exact version of the algorithm there will be some cases where you have to go down to individual points in the database. You can show that the number of times that you will have to do that is the square root of the number of pairs in the database. So, the square root of the number of pairs in the database is linear, so the whole thing ends up being linear in the database size.

If you tell the system that you are prepared to accept, say, a 1 percent error in your final count, usually it will never go down.

AUDIENCE: What about the geometric points are scaled to higher dimensions?

MR. MOORE: Good point, I should have mentioned that. These are based on a data structure called a KD tree, for which Jerry Friedman was one of the inventors. KD trees typically don’t work very well above about 10 or 20 dimensions. So, some of these algorithms, like the mixture of Gaussian ones, we get into computational problems with about 20 dimensions.

Some of the other things, like the kernel methods or the two point counts we have done, we actually these days run them in a different data structure called a metric tree, where everything is stored in balls instead of hyper-rectangles. Under some circumstances, we have been able to run those in thousands of dimensions. In one particular case, we ran it in a million dimensions. That is not generally possible. If our distribution of the data is uniform, you could prove that you should not be able to do this efficiently.

In empirical data, there are correlations among the covariates. Even if they are non-linear correlations, then you expect it to be practical, which is why we have been able to do this in thousands of dimensions.

 Page 388
 Front Matter (R1-R6) Table of Contents (1-3) Welcome and Overview of Sessions (4-9) Session 1: Atmospheric and Meteorological Data (10-65) Session 2: High Energy Physics (66-135) Luncheon Keynote Address (136-164) Session 3: Integrated Data Systems (165-209) After Dinner Address (210-222) Session 4: Network Traffic (223-294) Session 5: Mining Commercial Streams of Data (295-388) Concluding Comments (389-389)