There is a lot of good work that a lot of people could do that would take us not only to moderate-sized data sets but, I think, get us all going on streaming data sets as well.

This second recommendation is slippery, and I don’t think I have made a great case for it. If you think about the complexity that ought to be inherent in some of these larger data sets, and how much trouble we have communicating even basic ideas, I believe we need to expend a lot of effort in that area. It is going to take a lot of people, in part because of all the types of people we are going to be communicating with, and in part because the statistics community isn’t going to have all of the answers.

MS. MARTINEZ: It is lunchtime. One question.

AUDIENCE: What type of clustering algorithms have you had the best luck with, and have you developed algorithms specifically to deal with streaming data?

MR. WHITNEY: I tend, just by default, to use K-means, and it works pretty well. It is simple, and it is fast. We have played with variants of it, including a recursive partitioning version, with various criteria for when to recurse.

The version that I ended up writing for that network problem wasn’t for streaming data. It was for data set up so that you could pass through it again and again. So, we had an explicit data structure that represented that type of operation. I hope to use that as a basis for lots of other algorithms, though.
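
[Illustrative sketch, not part of the talk. The speaker's code and data structure are not shown in the transcript, so this is only one plausible reading of "data you can pass through repeatedly": a standard K-means (Lloyd's algorithm) that never holds the data in memory and instead asks for a fresh pass each iteration. The names `kmeans_multipass` and `make_pass` are hypothetical, as is the reservoir-sampling initialization.]

```python
import random
from typing import Callable, Iterable, List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to `point` (ties go to the lowest index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans_multipass(make_pass: Callable[[], Iterable[Point]],
                     k: int,
                     max_passes: int = 20,
                     seed: int = 0) -> List[List[float]]:
    """K-means over a data source that can be re-read but not held in memory.

    `make_pass` is an assumed interface: each call returns a fresh iterator
    over the same points, in the same order.
    """
    rng = random.Random(seed)

    # Pass 0: choose k initial centers by reservoir sampling.
    centers: List[List[float]] = []
    for n, p in enumerate(make_pass()):
        if n < k:
            centers.append(list(p))
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                centers[j] = list(p)
    if len(centers) < k:
        raise ValueError("fewer than k points in the data source")

    dim = len(centers[0])
    for _ in range(max_passes):
        # One full pass: accumulate per-cluster coordinate sums and counts.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in make_pass():
            i = nearest(p, centers)
            counts[i] += 1
            for d in range(dim):
                sums[i][d] += p[d]

        # Move each center to its cluster mean; keep empty clusters in place.
        new_centers = [
            [s / counts[i] for s in sums[i]] if counts[i] else centers[i]
            for i in range(k)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return centers
```

For an in-memory list `points`, a call would look like `kmeans_multipass(lambda: iter(points), k=2)`. For data on disk, `make_pass` could reopen the file on each call. Because every iteration needs a complete pass, this approach does not apply to a true stream, which is the distinction the speaker draws.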
