Measuring: Understanding the Internet Artifact
A remarkable creation, the Internet encompasses a diversity of networks, technologies, and organizations. The enormous volume and great variety of data carried over it give it a rich complexity and texture. It has proved difficult to characterize, understand, or model, both in its large-scale behavior and in the detailed behavior of its traffic. Moreover, because it is very difficult to prototype new networks, or even new networking ideas, on an interesting scale (see Chapter 4), data-driven analysis and simulation are vital tools for evaluating proposed additions and changes to its design.
Experimental science is an important approach in many areas of computer science and engineering, especially where the artifacts being studied are complex and have properties that are not well understood.1 Central to the experimental method is the repeated measurement of observed behavior. Without acquiring such data it is impossible to analyze and understand the underlying processes, let alone predict the impact of a change to the environment being observed. Further, data often help suggest new theoretical approaches. Measurement is at least in part driven by a particular question at hand, and changing questions over time may well lead to different measurement needs.
However, there are also strong arguments for collecting data in anticipation of future use. Citing the heavy dependence of our knowledge and understanding of global climate change on a record of atmospheric carbon dioxide measurements that Charles David Keeling started on Mauna Loa in 1958, workshop participant Jeff Dozier observed that “good data outlives bad theory.”2 Hence a data set with typical days from the next 10 years of the Internet might be a treasure chest for networking researchers just as the carbon dioxide record has been to earth scientists. Also, outsiders at the workshop observed that in other areas of computer science, older versions of artifacts (old microprocessors, operating systems, and the like) are important as bases for trend analysis and before/after comparisons of the impacts of new approaches.3 Archived Internet snapshots could provide an analogous baseline for evaluating the large-scale impact of both evolutionary and revolutionary changes in the Internet. Archived data could also be used by Internet researchers to determine if newly identified traffic phenomena (for example, a future equivalent of heavy-tailed behavior) existed in earlier instantiations of the Internet.
Unfortunately, the ability of network researchers or operators to measure the Internet is significantly limited by a number of interdependent barriers. The extreme scale of today’s Internet poses a challenge to acquiring a representative set of data points. The Internet architecture itself also makes measurement difficult. Factors such as the end-to-end design, layering, and the statelessness of the basic datagram4 make it hard to identify some types of flows. Factors such as routing asymmetry and multipathing make it hard to gather necessary information even about self-describing flows such as TCP. Also, business concerns and increased sensitivity to privacy limit the willingness of many stakeholders to participate in data collection and constrain the release of data to a wider research community. The resulting paucity of sound or representative data has severely limited the ability to predict the effects of even incremental changes to the Internet architecture, and it has undermined confidence in more forward-thinking research.
Progress in measuring the Internet artifact will thus require the effort, ingenuity, and unified support of the networking community. In other fields, grand challenges—such as mapping the entire human genome—have served to expose and crystallize research issues and to mobilize research efforts. Along those lines, a challenge that could stimulate the necessary concerted effort is the following: (1) to develop and deploy the technology to make it possible to record a day in the life of the Internet, a data set containing the complete traffic, topology, and state across the Internet infrastructure and (2) to take such a snapshot. Even if the goal were realized only in part, doing so would provide the infrastructure for establishing a measurement baseline.
A “day in the life” should be understood as a metaphor for a more precise formulation of the measurement challenge. For example, the appropriate measurement period might not literally be a single 24-hour period (one might want to take measurements across a number of days to explore differences between weekdays and weekends, the effects of events that increase network traffic, and the like) and, as discussed below, the snapshot might sample traffic rather than record every single packet. To achieve many of the goals, one would also measure on an ongoing basis rather than as a one-time event.
This ambitious goal faces many hurdles that together form the foundation for a valuable research agenda in their own right. Although the overarching goal is the ability to collect a full snapshot, progress on each of the underlying problems discussed below would be a valuable step toward improving our understanding of actual network behavior.
THE CHALLENGES OF SCALE
Accommodating the growth in link speeds and topology is a significant challenge for large-scale Internet traffic measurement. Early versions of equipment with OC-768 links (40 gigabits per second) are already in trials, and the future promises higher speeds still. Worse yet, each individual router may have many links, increasing the overall computational challenge as well as making per-link measurement platforms extremely expensive to deploy and difficult to manage. Addressing these problems presents both engineering and theoretical challenges. High-speed links demand new measurement apparatus to measure their behavior, and efficient measurement capabilities must be incorporated into the routers and switches themselves to accommodate high port densities. Even with such advances it may be infeasible to collect a complete record of all communication in a highly loaded router, and we may be forced to sample traffic instead. To do so effectively will require developing a deeper understanding of how to soundly sample network traffic, which is highly correlated and structured. An especially important statistics question is how to assess the validity of a particular sampling approach—its accuracy, representativeness, and limitations—for characterizing a range of network behaviors.
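As a rough illustration of the sampling-validity question, the sketch below compares a sampled estimate of total traffic volume against ground truth on synthetic packets. The packet-size mix, the sampling rates, and every function name here are assumptions made for illustration only; none of them come from the workshop.

```python
import random


def synthetic_packets(count, seed=0):
    """Generate `count` synthetic packet sizes in bytes.

    The size mix (mostly small packets, some large) is an assumption.
    """
    rng = random.Random(seed)
    return [rng.choice([40, 40, 40, 576, 1500]) for _ in range(count)]


def sample_one_in_n(packets, n, seed=1):
    """Keep each packet independently with probability 1/n."""
    rng = random.Random(seed)
    return [size for size in packets if rng.random() < 1.0 / n]


def estimate_total_bytes(sampled, n):
    """Scale the sampled byte count by n, the inverse sampling probability."""
    return n * sum(sampled)


def relative_error(estimate, truth):
    """Absolute error of the estimate as a fraction of the true value."""
    return abs(estimate - truth) / truth


if __name__ == "__main__":
    packets = synthetic_packets(100_000)
    truth = sum(packets)
    for n in (10, 100, 1000):
        estimate = estimate_total_bytes(sample_one_in_n(packets, n), n)
        print(f"1-in-{n}: relative error {relative_error(estimate, truth):.4f}")
```

The synthetic packets here are independent of one another, whereas real traffic is correlated and structured, so errors measured this way are likely to understate the difficulty. Assessing a sampling scheme against traces with realistic structure is exactly the open question the text describes.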
One particular challenge in measuring the network today is incomplete knowledge about the internal configuration of parts of the network, a reflection of network operators’ reluctance to divulge information that may be of interest to their competitors. One way to cope with this impediment is the use of inference techniques that allow one to learn more about a network based on incomplete, publicly accessible or observable information. For example, there has been research using Border Gateway Protocol (BGP) routing table information to infer the nature of interconnection agreements between Internet service providers (ISPs). Inference techniques will not, in general, provide complete information, and more work is needed on how to make use of such incomplete information. Workshop participants noted that these statistical issues (along with the modeling issues discussed in the next chapter) would benefit from the involvement of statisticians.
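To make the idea of inference from routing data concrete, here is a minimal sketch of a degree-based heuristic for guessing customer-provider relationships from observed AS paths. The heuristic itself is a simplifying assumption (the best-connected AS on a path is treated as its "top", with links leading up to it and down from it), and the AS numbers are hypothetical values from the private-use range. It is not presented as the method used in the research mentioned above.

```python
from collections import defaultdict


def as_degrees(paths):
    """Count the distinct neighbors of each AS across all observed paths."""
    neighbors = defaultdict(set)
    for path in paths:
        for a, b in zip(path, path[1:]):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {asn: len(adjacent) for asn, adjacent in neighbors.items()}


def infer_votes(paths):
    """Return {(customer, provider): count} votes from the AS paths.

    Assumption: the highest-degree AS on each path is the top of the
    path. Links before it point uphill (customer to provider) and
    links after it point downhill (provider to customer).
    """
    degree = as_degrees(paths)
    votes = defaultdict(int)
    for path in paths:
        top = max(range(len(path)), key=lambda i: degree[path[i]])
        for i, (a, b) in enumerate(zip(path, path[1:])):
            if i < top:
                votes[(a, b)] += 1  # a looks like a customer of b
            else:
                votes[(b, a)] += 1  # b looks like a customer of a
    return dict(votes)


def classify(votes):
    """Label each link; conflicting votes are flagged, not resolved."""
    labels = {}
    for (customer, provider) in votes:
        if votes.get((provider, customer), 0) > 0:
            labels[tuple(sorted((customer, provider)))] = "ambiguous"
        else:
            labels[(customer, provider)] = "customer-provider"
    return labels


# Hypothetical AS paths as they might appear in a routing table dump.
EXAMPLE_PATHS = [
    [64512, 64600, 64513],
    [64514, 64600, 64512],
    [64515, 64512, 64600, 64513],
]

if __name__ == "__main__":
    for link, label in classify(infer_votes(EXAMPLE_PATHS)).items():
        print(link, label)
```

Links flagged as ambiguous are left unresolved on purpose: the result is incomplete information of the kind the text says needs further work, and a single vantage point may see too few paths to label a link reliably.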
A snapshot of an Internet day would contain an immense amount of data. Like other scientific communities faced with enormous data sets (for example, astronomy or the earth sciences), the Internet research community must grapple with analyzing data at very large scales. Among these challenges are effectively mining large, heterogeneous, and geographically distributed datasets; tracking the pedigree of derived data; visualizing intricate, high-dimensional structures; and validating the consistency of interdependent data. An additional challenge posed by measuring Internet traffic, which is also found in some other disciplines such as high-energy physics, is that data arrive quickly, so decisions about data sampling and reduction have to be made in real time.
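One standard way to make keep-or-drop decisions in a single pass over a fast stream is reservoir sampling, which maintains a fixed-size uniform random sample without knowing the stream length in advance. The sketch below is illustrative only; the record format and function names are assumptions, and a production monitor would face memory and timing constraints not modeled here.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from `stream`.

    Each item is examined exactly once, and only k items are held in
    memory, so the keep-or-drop decision is made as data arrive.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


def packet_stream(count):
    """Yield hypothetical (sequence_number, size_in_bytes) records."""
    for seq in range(count):
        yield (seq, 40 if seq % 3 else 1500)


if __name__ == "__main__":
    sample = reservoir_sample(packet_stream(1_000_000), 5, random.Random(42))
    print(sample)
```

A uniform sample suits some questions (for example, the packet-size distribution) but not others, such as reconstructing complete flows, so the choice of reduction method depends on the analyses the archive is meant to support.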
In addition to the significant theoretical challenges, large-scale measurement of the Internet presents enormous deployment and operational challenges. To provide widespread vantage points for measuring network activity, even a minimal infrastructure will comprise hundreds of measurement devices. There is some hope that advances in remote management technologies will support this need, and lessons from several currently deployed pilot measurement projects could aid in the design of any such system. However, such an effort would also require funding and personnel able to deploy, maintain, and manage the large-scale infrastructure envisioned here. In the long run, the value of this investment will be the creation of a foundation for watching network trends over time and the establishment of an infrastructure available to researchers for new questions that are not adequately addressed by previous measurements.
Many of the challenges found in measuring today’s Internet could have been alleviated by improved design, which underscores the importance of incorporating self-measurement, analysis, and diagnosis as basic design points of future system elements and protocols. This is particularly critical to providing insight into failures that are masked by higher layers of abstraction, as TCP does by intentionally hiding information about packet loss from applications.
Although many of the challenges to effective Internet measurement are technical, there are important nontechnical factors, both within the networking community and in the broader societal context, that must be addressed as well. The committee recognizes that gathering these data will require overcoming very significant barriers. One set of constraints arises because the Internet is composed in large part of production commercial systems. Information on traffic patterns or details of an ISP’s network topology may reveal information that a provider prefers not to reveal to its competitors or may expose design or operational shortcomings. A related set of challenges concerns expectations of privacy and confidentiality. Users have an expectation (and in some instances a legal right) that no one will eavesdrop on their communications. As a consequence of the decentralized nature of the Internet, much of the data can only be directly observed with the cooperation of the constituent networks and enterprises. However, before these organizations are willing to share their data, one must address their concerns about protecting their users’ privacy. Users will be concerned even if the content of their communications is not being captured: recording just the source, destination, type, or volume of the communications can reveal information that a user would prefer to keep private.
If network providers could find ways of being more open while protecting legitimate proprietary or privacy concerns, considerably more data could be available for study. Current understanding of data anonymization techniques, the nature of private and sensitive information, and the interaction of these issues with accurate measurement is rudimentary. Too simplistic a procedure may be inadequate: Even if the identity of an ISP is deleted from a published report, particular details may still permit the identity of the ISP in question to be inferred. On the other hand, too much anonymity may hide crucial information (for example, about the particular network topology or equipment used) from researchers. Attention must therefore be paid to developing techniques that limit disclosure of confidential information while still providing sufficient access to information about the network to enable research problems to be tackled. In some circumstances, these limitations may prevent the export of raw measurement data, creating the need for configurable “reduction agents” that can remotely analyze data and return results that do not reveal sensitive details.
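The tension between disclosure and usefulness can be illustrated with a prefix-preserving address mapping: two addresses that share a k-bit prefix before anonymization still share exactly a k-bit prefix afterward, so some topological structure survives while the real addresses are hidden. The sketch below is an illustration under stated assumptions, not a vetted privacy mechanism. The key, the function names, and the use of HMAC-SHA256 as the pseudorandom function are all choices made here for the example.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key: bytes, prefix_bits: str) -> int:
    """Derive one pseudorandom bit from the key and a bit-string prefix."""
    digest = hmac.new(key, prefix_bits.encode("ascii"), hashlib.sha256).digest()
    return digest[0] & 1


def anonymize_ipv4(addr: str, key: bytes) -> str:
    """Map an IPv4 address to a pseudonym, preserving shared prefixes.

    Each output bit is the input bit XORed with a pseudorandom bit
    that depends only on the preceding input bits.
    """
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = []
    for i, bit in enumerate(bits):
        flip = _prf_bit(key, bits[:i])
        out.append(str(int(bit) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    key = b"hypothetical-secret-key"  # assumption: held only by the data owner
    a, b = "192.0.2.17", "192.0.2.200"  # documentation-range addresses
    pa, pb = anonymize_ipv4(a, key), anonymize_ipv4(b, key)
    print(common_prefix_len(a, b), common_prefix_len(pa, pb))
```

The same property that keeps the data useful also leaks structure: anyone who learns one real address behind a pseudonym learns something about its neighbors. That is the kind of trade-off between anonymity and research value the text describes.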
Finally, realizing the “day in the life” concept will require the development of a community process for coming to a consensus on what the essential measurements are, the scope and timing of the effort, and so forth. It will require the efforts of many researchers and the cooperation of at least several Internet service providers. The networking research community itself will need to develop better discipline in the production and documentation of results from underlying data. This includes the use of more careful statistical and analytic techniques and sufficient explanation to allow archiving, repeatability, and comparison. To this end, the community should foster the creation of current benchmark data sets, analysis techniques, and baseline assumptions. Several organizations have engaged in such efforts in the past (on a smaller scale than envisioned here), including the Cooperative Association for Internet Data Analysis (CAIDA)5 and the Internet Engineering Task Force’s IP Performance Metrics working group (ippm).6 Future measurement efforts would benefit from the networking community at large adopting analogous intergroup data-sharing practices.