The following HTML text is provided to enhance online
readability. Many aspects of typography translate only awkwardly to HTML.
Please use the page image
as the authoritative form to ensure accuracy.
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
MR. DOMINGOS: Going back to the pairwise distances problem, they were all guaranteed to be within .4 or farther than .4. What if that never happens? Do you just go down to single problems?
You could always have two boxes where the smallest distance is less than .4 and the largest distance is more than .4.
MR. MOORE: In the exact version of the algorithm there will be some cases where you have to go down to individual points in the database. You can show that the number of times that you will have to do that is the square root of the number of pairs in the database. So, the square root of the number of pairs in the database is linear, so the whole thing ends up being linear in the database size.
If you tell the system that you are prepared to accept, say, a 1 percent error in your final count, usually it will never go down.
AUDIENCE: What about the geometric points are scaled to higher dimensions?
MR. MOORE: Good point, I should have mentioned that. These are based on a data structure called a KD tree, for which Jerry Friedman was one of the inventors. KD trees typically don’t work very well above about 10 or 20 dimensions. So, some of these algorithms, like the mixture of Gaussian ones, we get into computational problems with about 20 dimensions.
Some of the other things, like the kernel methods or the two point counts we have done, we actually these days run them in a different data structure called a metric tree, where everything is stored in balls instead of hyper-rectangles. Under some circumstances, we have been able to run those in thousands of dimensions. In one particular case, we ran it in a million dimensions. That is not generally possible. If our distribution of the data is uniform, you could prove that you should not be able to do this efficiently.
In empirical data, there are correlations among the covariates. Even if they are non-linear correlations, then you expect it to be practical, which is why we have been able to do this in thousands of dimensions.