layering, and the statelessness of the basic datagram[4] make it hard to identify some types of flows. Factors such as routing asymmetry and multipathing make it hard to gather necessary information even about self-describing flows such as TCP. Also, business concerns and increased sensitivity to privacy limit the willingness of many stakeholders to participate in data collection and constrain the release of data to a wider research community. The resulting paucity of sound or representative data has severely limited the ability to predict the effects of even incremental changes to the Internet architecture, and it has undermined confidence in more forward-thinking research.
Progress in measuring the Internet artifact will thus require the effort, ingenuity, and unified support of the networking community. In other fields, grand challenges (such as mapping the entire human genome) have served to expose and crystallize research issues and to mobilize research efforts. Along those lines, a challenge that could stimulate the necessary concerted effort is the following: (1) to develop and deploy the technology needed to record a day in the life of the Internet, that is, a data set containing the complete traffic, topology, and state across the Internet infrastructure, and (2) to take such a snapshot. Even a partial realization of this goal would provide the infrastructure for establishing a measurement baseline.
A “day in the life” should be understood as a metaphor for a more precise formulation of the measurement challenge. For example, the appropriate measurement period might not literally be a single 24-hour period (one might want to take measurements across a number of days to explore differences between weekdays and weekends, the effects of events that increase network traffic, and the like) and, as discussed below, the snapshot might sample traffic rather than record every single packet. To achieve many of the goals, one would also measure on an ongoing basis rather than as a one-time event.
This ambitious goal faces many hurdles that together form the foundation for a valuable research agenda in their own right. Although the overarching goal is the ability to collect a full snapshot, progress on each of the underlying problems discussed below would be a valuable step toward improving our understanding of actual network behavior.
Accommodating the growth in link speeds and topology is a significant challenge for large-scale Internet traffic measurement. Early versions of equipment with OC-768 links (40 gigabits per second) are already in trials, and the future promises higher speeds still. Worse yet, each individual router may have many links, which increases the overall computational challenge and makes per-link measurement platforms extremely expensive to deploy and difficult to manage. Addressing these problems presents both engineering and theoretical challenges. High-speed links demand new apparatus to measure their behavior, and efficient measurement capabilities must be incorporated into the routers and switches themselves to accommodate high port densities. Even with such advances it may be infeasible to collect a complete record of all communication in a highly loaded router, and we may be forced to sample traffic instead. To do so effectively will require developing a deeper understanding of how to soundly sample network traffic, which is highly correlated and structured. An especially important statistical question is how to assess the validity of a particular sampling approach (its accuracy, representativeness, and limitations) for characterizing a range of network behaviors.
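To make the scale and the sampling question concrete, the following Python sketch works through two small calculations. It is an illustration under stated assumptions, not a measurement method proposed here. First, it computes how many bytes a single link would carry in 24 hours if it ran continuously at its nominal rate (an upper bound, since real links are rarely saturated). Second, it applies independent 1-in-N packet sampling to a synthetic trace and scales the sampled byte count back up to estimate the total. The function names, the packet-size mix (40-byte and 1500-byte packets in equal proportion), and the sampling rate are all hypothetical choices made for illustration.

```python
import random


def bytes_per_day(link_gbps: float) -> float:
    """Bytes carried in 24 hours by a link running continuously at full rate."""
    return link_gbps * 1e9 / 8 * 86_400


def sample_one_in_n(packet_sizes, n, rng):
    """Keep each packet independently with probability 1/n."""
    keep_probability = 1.0 / n
    return [size for size in packet_sizes if rng.random() < keep_probability]


def estimate_total_bytes(sampled_sizes, n):
    """Scale the sampled byte count by n to estimate the unsampled total."""
    return n * sum(sampled_sizes)


def relative_error(estimate, truth):
    """Absolute error of the estimate as a fraction of the true value."""
    return abs(estimate - truth) / truth


def main():
    # Upper bound on one day's data volume for a single 40 Gb/s link.
    print(f"40 Gb/s link, 24 h at full rate: {bytes_per_day(40) / 1e12:.0f} TB")

    # Synthetic trace (assumption): independent packets, half small, half large.
    rng = random.Random(0)
    trace = [rng.choice((40, 1500)) for _ in range(200_000)]
    truth = sum(trace)

    # Compare the scaled estimate with the known total at one sampling rate.
    n = 100
    sampled = sample_one_in_n(trace, n, rng)
    estimate = estimate_total_bytes(sampled, n)
    print(f"kept {len(sampled)} of {len(trace)} packets")
    print(f"relative error of byte estimate: {relative_error(estimate, truth):.3%}")


if __name__ == "__main__":
    main()
```

At 40 gigabits per second the first calculation gives 432 terabytes per link per day, which suggests why a complete record may be infeasible. The second calculation is deliberately optimistic: packets in the synthetic trace are independent, so the scaled estimate behaves well. Real traffic is correlated and structured, and an estimator validated only on such a trace says little about how well sampling captures flow-level or burst-level behavior. That gap is the validity question raised above.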
One particular challenge in measuring the network today is incomplete knowledge about the internal configuration of parts of the network, a reflection of network operators’ reluctance to divulge information that may be of interest to their competitors. One way to cope with this impediment is the use of inference techniques that allow one to learn more about a network based