The Kepler mission, for example, is a planet-finding mission. The missions on top are space-based, and those on the bottom are ground-based; some of them are going to produce very large amounts of data. At my institution, we collaborate directly or indirectly with most of these missions and deal in some way with the data they produce. How are we going to manage the large amounts of data from these missions?
Over the past 25 years, astronomy has been changing quite radically. Astronomers have been very good at building large telescopes to collect more data, and we have become much better at building larger mirrors. But we have been a hundred times better still at building detectors that allow us to collect extremely large amounts of data. Since the detectors roughly follow Moore’s Law, our data collection doubles every year or so, and this raises many important issues.
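To give a sense of what doubling "every year or so" implies, here is a minimal sketch in Python. The function name `growth_factor` and the default one-year doubling period are illustrative assumptions, not figures taken from any particular mission or archive.

```python
def growth_factor(years: float, doubling_period_years: float = 1.0) -> float:
    """Return the factor by which a quantity grows over `years`
    if it doubles once every `doubling_period_years`.

    The one-year default is an assumption based on the rough
    "every year or so" estimate in the talk.
    """
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Print the growth factor for a few illustrative time spans.
    for years in (1, 5, 10):
        print(f"after {years:2d} years: x{growth_factor(years):,.0f}")
```

Under that assumption, ten years of steady doubling multiplies the annual data volume by about a thousand, which is why storage and access strategies that work today may not work a decade from now.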
We realize that we are not alone in this area. We know that fields from earth science and biology to economics are dealing with massive amounts of data that must be handled. I am not particularly fond of using the term “e-science” to describe this field; I prefer to speak of “data-intensive scientific discovery,” because I think that is exactly where we should be heading: we are going to be driven by data.
However, while astronomy is similar to other fields in managing large volumes of data, it is special in one respect. This is not because the data are intrinsically different, but because they have no commercial value. They belong to every one of us, so they are an ideal test bed for trying out complex algorithms that work on data with a very large number of dimensions. Currently there are missions whose datasets have in excess of 300 dimensions, which can be very useful if a scientist wants a dataset with many dimensions to analyze.
We also have to be aware that things have been changing, for example, in the geographic information system world, where our perception of our planet has been changed by such tools as TerraServer, Microsoft’s Virtual Earth, Google Earth, and Google Maps. The way that we interact with our planet is now different than it was, and as we use all of these tools, we have no concept of what is really going on underneath, because we are focusing on the discoveries. Therefore, some of us are trying to do the same thing for data in other areas.
What is this new size paradigm? Why is so much data a problem?
In Figure 2-2, the red curve represents how much data we collect in our systems as a function of time. The big spike in the middle corresponds, for example, to the installation of new instruments for Hubble. This is a trend, but it is not a problem. The real problem is the curve above it, which is the amount of data that we serve to the community. This is a large multiple of the data that we collect, and that is where the potential problem lies.