Measuring the Internet
The information collected by the committee provided a rough picture of what occurred on and after September 11 with respect to Internet performance. The two key sources were reports from people and direct measurements of Internet systems (such as links, routers, and hosts).
The information that was available indicated some interesting contrasts between September 11 and a “typical” day. Network traffic loads measured in several ISP networks the day of the attacks were generally lighter than normal. However, demand on servers at the major news Web sites was unprecedented—to such an extent that several of these systems were rendered inoperative for a period of hours. At the same time, several measurements suggested that the impact of the events of September 11 on the Internet was modest. The effects on the network infrastructure caused by physical damage in Lower Manhattan and at the Pentagon were quite limited, and they appeared no worse than the effects of other kinds of incidents. For example, with respect to data on the reachability of a particular set of Internet addresses, September 11 was more or less equivalent to a fiber cut—a nontrivial but relatively routine event. These data were supplemented by other information—polls of Internet users, for example.
Measurements of Internet systems from the September 11 crisis, however, were quite limited—in part because sources had usually discarded the data before the committee’s analysis began (some five months after the attacks) and in part because of inherent limitations in the data that were collected and retained.
The ability to report comprehensive details of the Internet’s response during September 11, or during any crisis for that matter, is further constrained by a number of factors. One of the consequences of the Internet’s fragmented and often proprietary measurement infrastructure is that data are taken piecemeal in diverse ways and stored in various formats. As a result, information that was available to the committee generally permitted only rough comparison with a normal or typical day in the context of a particular set of data. Measurement difficulties also arise from the size, complexity, and diversity of the Internet and from the fact that a great deal of the data that do exist are considered proprietary by the companies that collect them.
In the course of the committee’s work, it became clear that a number of questions could not be answered with the available information. These included:
How did Internet traffic vary from normal activity during and after the attacks? Some traffic data were available from individual ISPs, but it was not always clear how to extrapolate from these localized observations to a more generalized view.
What was the mix of applications used before, during, and after the attacks? Again, some local data were available from some ISPs, but it was unclear, as above, whether they constituted a collective picture.
How much demand was there on news services before, during, and after the attacks? Some news services were so overwhelmed by demand that their monitoring systems shut down.
How much connectivity was lost as a result of the attacks? How many users were affected, and for how long? How quickly was connectivity restored? Answering these questions would require data from a large number of ISPs or from a carefully targeted sample of ISPs.
These unanswered questions suggest that a more robust assessment of crisis events in the future will require new approaches to gathering network measurement data. In addressing how measurement of the Internet may be improved, this chapter discusses methods and tools for measurement; the data available from September 11; types of measurements required to fully assess the Internet under crisis; challenges to be faced in gathering and analyzing these measurements; and suggestions for the future.
NETWORK MEASUREMENT METHODS AND TOOLS
Since the Internet’s inception, measurement has been a significant element of networking research, starting with the Network Measurement
Center at the University of California, Los Angeles, in 1970. Early on, when the system was operated as a government-funded research network, measurement was simpler owing to the explicit research character, relatively modest scale, and simple topology of the network, and to the absence of proprietary constraints. With commercialization and sustained rapid growth, the network has become much larger and more complex—making comprehensive measurement harder and more expensive—and a host of commercial interests have further limited how and where measurements can be made and who can make them. At the same time, measurement remains an important activity, particularly from the perspective of network operations.
These constraints notwithstanding, the continued interest in measuring a wide range of Internet characteristics both for operation and research has led to the development of an array of tools (though researchers’ access to them has not been unfettered).
Active Measurement Tools
Active measurement tools are based on the concept of sending probe packets into the network and measuring their behavior as they flow through it. The probe packets are typically emitted from a general-purpose end-host such as a personal computer. Probe packets are sent toward a destination host by providing a target IP address (or domain name) to the measurement tool. The injection of probe packets into the network provides an indication of the routing behavior, propagation delay, queuing delay, and loss that would be experienced by normal data packets. When (and if) the probes arrive at a destination, either their arrival is logged or response packets are returned to the sender. When a response packet is returned, its arrival back at the original sender is logged, constituting the conclusion of one measurement. Active probing can also be done by approximating the behavior of typical applications, such as sending a request for a Web page.
Active probes are important because one can gain crucial insight into network conditions for a specific end-to-end path at a specific time, which may not be possible if monitoring occurs at only a single point. Furthermore, active measurements generally do not require special participation by intermediate nodes, making them easy to deploy and execute.
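As a concrete illustration, the sketch below measures the time to complete a TCP handshake with a target host—a simple form of active probing, akin in spirit to tools such as ping and traceroute. The localhost listener exists only to make the example self-contained; in practice the probe would target a remote host, and the timing details here are illustrative.

```python
import socket
import time

def tcp_connect_rtt(host, port, timeout=2.0):
    """Measure the time to complete a TCP handshake with host:port.

    The handshake round trip is a rough proxy for the network RTT
    along the path to the target.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed

# Demonstrate against a local listener so the sketch is self-contained.
server = socket.socket()
server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

rtt = tcp_connect_rtt("127.0.0.1", port)
print(f"handshake time: {rtt * 1000:.3f} ms")
server.close()
```

Repeating such a probe at a low rate toward many destinations yields the kind of delay and reachability data described above.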
While active probe tools provide important data about specific end-to-end conditions, there are a number of drawbacks to their use. First, the act of placing a probe into the network causes a perturbation (dubbed the “Heisenberg effect” by analogy to the uncertainty principle in physics) that may lead to a change in the network’s operating conditions. Because of this problem, common practice is to use active measurement tools to
sample the network at sufficiently low rates so as not to significantly perturb the network—avoiding, for example, significant additions to congestion. However, the resultant measurement data are limited in their ability to capture events at time scales finer than the sampling rate and are constrained by the necessarily small number of source and receiver locations. A second drawback is that any one system used to conduct active measurements is limited by routing protocols and Internet topology to measuring only a portion of the Internet. Finally, active measurement tools are limited in their ability to assess aspects of volume (for example, the total amount of traffic flowing along a given path). Some of these limitations of active probes can be addressed by passive measurement tools.
Passive Measurement Tools
Passive measurement methods are based on logging different aspects of traffic observed at specific vantage points in the network. The data that can be collected by passive means may have a variety of forms, from access logs to packet traces to detailed activity counters on routers. These data can be collected either at end-hosts or at nodes within the network. Passively collected data can be displayed in real time (as is often done by network operators) or placed in a repository for offline analysis.
Passive measurement data can provide great insight into the activities on a link or at a node. However, they have some significant drawbacks. Such data are almost always considered proprietary and are rarely made available for general analysis. Passive collection of network data can result in extremely large data sets, which greatly complicate archiving and analysis. Passive measurement tools are also prone to various types of errors that require careful attention. The subsections below describe several common passive measurement tools.
Web Access Logs
Logging access activity is a standard feature in Web server software that is usually enabled by content providers. Log entries contain the time at which a particular Web file was requested, the IP address of the requester, the name and size of the requested file, and the status code returned to the requester. Content access logs can be used to assess many aspects of server behavior, including load, content being requested, and the sources of requests.
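As an illustration, the sketch below parses one hypothetical entry in the widely used Common Log Format; the IP address, file name, and sizes are invented for the example.

```python
import re

# Common Log Format, the de facto standard for Web server access logs:
#   host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_entry(line):
    """Return the fields of one access-log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('192.0.2.41 - - [11/Sep/2001:09:05:12 -0400] '
          '"GET /index.html HTTP/1.0" 200 5120')
entry = parse_entry(sample)
print(entry["host"], entry["status"], entry["size"])
```

Aggregating such entries over time gives the request rates and traffic volumes used to assess server load.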
Packet Traces

Packet traces can be a summarization of traffic (IP flow measurements) or the details of individual packets on a given link. Such measurements require access to a network device (such as a router, switch, or link splitter) or access to a broadcast local area network. A standard tool for logging individual packets is “tcpdump,”1 which uses packet filters to capture selected packet activity from the network interface. A typical log entry from tcpdump consists of a time stamp, the source and destination IP addresses and port numbers, the transport protocol name, details from the packet header, and details of the packet payload. Collection of this information, especially the packet payload itself, provides valuable insights into network use. However, almost all organizations that collect such detailed traces are unwilling or unable to share the traces with other parties, owing to privacy and confidentiality concerns.
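The flow-summarization side of packet tracing can be sketched with a short example that aggregates per-source byte counts from a handful of hypothetical flow records (the addresses and volumes are invented; real IP flow export formats carry many more fields):

```python
from collections import defaultdict

# Hypothetical flow records: (source, destination, protocol, bytes),
# as a flow summarizer might export them.
flows = [
    ("192.0.2.1", "198.51.100.7", "tcp", 48_000),
    ("192.0.2.1", "203.0.113.9", "tcp", 1_200),
    ("192.0.2.8", "198.51.100.7", "udp", 9_500),
]

# Aggregate traffic volume by source address.
bytes_by_src = defaultdict(int)
for src, dst, proto, nbytes in flows:
    bytes_by_src[src] += nbytes

for src, total in sorted(bytes_by_src.items()):
    print(src, total)
```

This kind of aggregation trades per-packet detail for much smaller data sets, which is one reason flow records are easier to archive than raw traces.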
Border Gateway Protocol
Because of the Internet’s distributed and very dynamic operation, the individual ISPs must continuously keep each other informed about their own network’s reachability. The protocol that they use for this purpose is called the Border Gateway Protocol (BGP). By examining changes in the routing information provided by BGP, one can trace changes in the status of the Internet. Each commercial ISP (e.g., UUNET or AT&T) or network of a major organization (e.g., the National Science Foundation or the Massachusetts Institute of Technology) uses BGP to inform all other ISPs and network operators that it provides connectivity for particular sets of addresses and that packets destined for those addresses should be sent to it.
Such advertising of connectivity is called a BGP route announcement. Thus, ISPs adjacent to UUNET would repeat UUNET’s route advertisement to their neighbors, with the added information that the relaying ISP had connectivity and thus could relay packets through to UUNET, if needed. If a neighbor’s connectivity to UUNET fails for some reason, it must tell its own neighbors that it can no longer relay packets through to UUNET; this information is advertised using a BGP route withdrawal. BGP update messages are logged for public use at a number of “looking glass” sites, such as Route Views.2
The size of a BGP routing table, which indicates how many announced paths are available, gives an overview of network status. As of June 2002,
a typical core BGP table contained roughly 100,000 entries (the exact size depends on the vantage point). A significant drop in the size of a core routing table is an indication of some sort of connectivity loss. Observing the route advertisements and withdrawals also provides information on the Internet’s health. If a route is withdrawn for an extended period of time, one may assume that some form of network outage has taken place. This failure may result from infrastructural damage, misconfiguration by an ISP, or simply scheduled maintenance. The withdrawal of all routes to a particular part of the network indicates a significant loss of connectivity. Routes that are repeatedly withdrawn and announced are an indication either of unstable links or of instability in the routing system itself.
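The logic of inferring reachability from announcements and withdrawals can be sketched with a toy routing table; the prefixes and the update sequence below are invented for illustration and are far simpler than real BGP update streams:

```python
# Toy model of tracking reachability from a stream of BGP updates.
# Each update is (type, prefix).
updates = [
    ("announce", "192.0.2.0/24"),
    ("announce", "198.51.100.0/24"),
    ("announce", "203.0.113.0/24"),
    ("withdraw", "198.51.100.0/24"),   # e.g., a failed link
    ("announce", "198.51.100.0/24"),   # ...later restored
]

table = set()    # the routing table as seen from one vantage point
sizes = []       # table size after each update
for kind, prefix in updates:
    if kind == "announce":
        table.add(prefix)
    else:
        table.discard(prefix)
    sizes.append(len(table))

print(sizes)     # a dip in table size marks a loss of reachability
```

In this toy stream the table dips from three entries to two and then recovers; at the scale of a real core table, sustained dips of this kind are the connectivity-loss signal described above.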
BGP tables are constructed and updated through exchanges among peer networks. However, each table only provides information on the network as seen from a given vantage point. A drop in connectivity seen at that point might, therefore, represent a local failure rather than something more widespread. Assessing the overall status of the network thus requires examining many, carefully selected BGP tables that in aggregate reflect the shape of the entire network.3
Simple Network Management Protocol
The Simple Network Management Protocol (SNMP)4 is an important component in the daily operation of large-scale networks. It is the protocol used by network management systems to communicate with network elements such as routers and switches. SNMP enables network management systems both to query network elements for data and to send data to network elements. Data that are maintained and available from network elements through SNMP are specified by a Management Information Base (MIB). This data set is gathered passively by network elements. Most of the items in the MIB data set are simple activity counters, such as the number of packets transferred on a specific link. One of the main uses of SNMP MIB data is to ensure that a network is performing within acceptable operational limits. Management systems are configured to provide multiple “views” into the network based on its topological configuration, enabling network managers to assess in nearly real time the state of their systems.
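For example, a common use of such counters is to estimate link utilization from two successive polls of an interface’s octet counter (in the standard IF-MIB this is the ifInOctets object). The sketch below shows the arithmetic, including the modular subtraction needed when a 32-bit counter wraps; the poll values and link speed are illustrative:

```python
def link_utilization(octets_t0, octets_t1, interval_s, link_bps,
                     counter_bits=32):
    """Estimate link utilization from two samples of an SNMP octet
    counter, tolerating at most one counter wrap between polls."""
    wrap = 1 << counter_bits
    delta = (octets_t1 - octets_t0) % wrap   # octets seen in the interval
    bits = delta * 8
    return bits / (interval_s * link_bps)    # fraction of capacity used

# Two polls 300 s apart on a 10 Mbit/s link (values are illustrative).
u = link_utilization(1_000_000, 76_000_000, 300, 10_000_000)
print(f"{u:.2%}")
```

Note that the calculation is sound only if the poll interval is short enough that the counter cannot wrap more than once; this is one of the measurement errors that, as noted earlier, require careful attention.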
SNMP MIB data are ubiquitous in a network and could be very useful
in assessing the state of the Internet during a crisis. However, they are typically considered proprietary and are only available to the operators running a specific network.
As indicated above, business and legal considerations mean that most data about Internet behavior during crisis conditions may never be made public. If these data were available, the assessment of Internet behavior during a crisis, or, indeed, at any other time, would be greatly simplified. There would be challenges in organizing and normalizing the data, but these procedures would readily lend themselves to scientific methods. However, convincing large network providers to make their data publicly available is at best an uphill battle and at worst a pipe dream. An alternative approach would be to mandate reporting by ISPs to an agency such as the Federal Communications Commission (indeed, reports of certain types of outages in the public telephone network must be so filed under present rules).
Consistency in Data and Analysis
There is no guarantee that data gathered at different sites are consistent. Time stamps, units, and field descriptions for data can all be different. Owing to sampling and the possibility of measurement errors, there are also issues of the basic accuracy of particular measurements. Furthermore, even if the data are consistent, the tools and data analysis methods must also be consistent in order to evaluate and validate results.
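A small example of the normalization problem: three sites record the same instant as Unix epoch seconds, as an ISO 8601 string with a UTC offset, and as a zone-less syslog-style string. The sketch below converts all three to UTC epoch seconds, under the stated (and, in practice, often unrecorded) assumption that the zone-less timestamp is in US Eastern daylight time (UTC-4); all values are invented for the example.

```python
from datetime import datetime, timezone, timedelta

def to_utc_epoch(value):
    """Normalize a timestamp in one of three formats to UTC epoch seconds."""
    if isinstance(value, (int, float)):          # already epoch seconds
        return float(value)
    try:                                         # ISO 8601 with UTC offset
        return datetime.fromisoformat(value).timestamp()
    except ValueError:                           # zone-less, assumed UTC-4
        local = datetime.strptime(value, "%b %d %Y %H:%M:%S")
        eastern = timezone(timedelta(hours=-4))
        return local.replace(tzinfo=eastern).timestamp()

samples = [
    1000213512,                      # epoch seconds
    "2001-09-11T13:05:12+00:00",     # ISO 8601, UTC
    "Sep 11 2001 09:05:12",          # syslog-style, local time
]
normalized = [to_utc_epoch(s) for s in samples]
print(normalized)   # all three values agree after normalization
```

Even this tiny example requires an out-of-band assumption about one source’s time zone, which illustrates why cross-site consistency cannot be achieved from the data alone.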
The heterogeneity of the Internet’s infrastructure, users, applications, protocols, and media renders it difficult if not impossible to make representative statements about overall Internet behavior on the basis of a small number of measurements. This heterogeneity manifests itself in several ways:
Available bandwidth. Wireless users with a low-bandwidth connection to the Internet exhibit dramatically different behavior from users with corporate high-bandwidth connections. High-bandwidth users are much more likely to access multimedia content such as video streams.
Network congestion. The levels of congestion in the Internet vary
dramatically. Many parts of the United States have high-bandwidth links with relatively low utilization. In contrast, other parts of the network have modest capacity and high utilization, which in turn result in high loss rates for packets traversing them.
Connectivity. Some parts of the network are richly connected with many alternate paths, while other parts of the Internet are dependent on only a single link for connectivity.
Such factors make it virtually impossible to assess the health of the Internet without measurement data from a large and diverse set of vantage points.
THE FUTURE: TARGETED ASSESSMENT DURING A CRISIS
This section discusses what data would be required for a more robust assessment of Internet characteristics during crisis events (or any other time) and how these data might be gathered.
Global Network Monitoring
A thorough analysis of Internet behavior during crisis events requires clean, consistent data from a number of vantage points across all network layers. In a general sense, this means that the following data are required from sufficient numbers and types of protocols, networks, geographic points, and time scales:
Application and service-level data such as Web server logs,
End-to-end connectivity, delay, and loss data such as those gathered by active probes,
Packet traffic data such as IP flow or router Management Information Base logs, and
Global interdomain routing data.
Only modest quantities of data from each category in this list were available for September 11. Better understanding of future events will depend on the consistency, perspective (geographic and topological location), and time scale of measurement data.
Perhaps the most extreme means for gathering data robustly during a crisis would be to construct a measurement infrastructure targeted for this specific purpose. But a more practical approach would be the creation of a well-defined data repository into which network operators could submit data collected throughout the event. This approach would
have the significant benefit of not requiring the facilitator of the repository to deploy and manage measurement systems. It might also enable data gathering from areas of the Internet that would otherwise be inaccessible. The drawbacks of this approach would be the difficulties associated with maintaining consistency in submitted data and relying on others to choose where the data are gathered. It would also require the establishment of well-defined policies on submission, privacy, and the use of data. Another challenge would be in calibrating methods of analysis for comparing or aggregating different data sources.
Maintaining a robust set of network data would also provide a firmer basis for simulating Internet behavior. Models could be used to assess how the Internet might perform in different failure modes. This capability could provide key insights into Internet vulnerabilities and potentially alleviate circumstances in which connectivity was lost, as occurred in several instances on September 11.
Targeted Measurement During a Crisis
Effective assessment of Internet behavior during a crisis would be greatly enhanced by the ability to adjust the scope of what is being measured in accordance with the specific situation. This kind of targeted assessment would be facilitated by the establishment of a general repository of contact information for network operators, content providers, and groups that run network-monitoring infrastructures. Two examples of such lists are Jared Mauch’s compilation of information on network operations5 and CAIDA’s compilation of Internet measurement activities.6 When a crisis arises, measurement data could quickly be solicited from groups on this list in areas that are topologically close (from an Internet perspective) to the geographic location of the crisis. Maintaining such a repository would require resources; however, restricting the objective to targeted measurement of medium- to large-scale events would make this effort much more manageable. Making sense of measurements taken during particular network events also requires the capture of a baseline “normal day.”7