
Suggested Citation:"4. DATA STREAMS AND RIVERS." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

particular stream of traffic at a point in time. Internet session MEs are dynamic and usually create an association between an IP address (or cookie) and a responsible account ID or subscriber ID. Depending on the service definition, a session ME may also provide information about session state, for example, logged on or off, authorization level, location, and so on. In telephony, the usage information and session information are often combined into the same record. In the Internet, however, these data are acquired from different sources and must be time-correlated in near-real time. Sources of session data for Internet services include DHCP, DNS, and DDNS, as well as authentication, authorization, and accounting (AAA) services such as those provided by RADIUS.

3.3 REFERENCE DATA

Reference data, defined by the NSP, are merged with the real-time streams of incoming MEs in order to facilitate additional downstream processing and analysis. Network operational examples include network topological, physical, or routing information (e.g., Autonomous System Numbers). Business examples include subscriber segmentation and classification information useful for product planning. Security examples include thresholds or patterns useful for identifying abuse, fraud, or hostile or attack traffic.

The actual collection and interpretation of MEs from real devices is complex because of the diversity of those devices and the legacy of old devices still in use. In the future, this arcane and time-consuming development process could be facilitated by the adoption of abstract event and services models such as those being developed by Jeff Meyer (see http://www.circumference.org) for the IPDR (http://www.ipdr.org).

The proposed model has a simple three-layer structure. The top layer is the data model, which defines the service or data represented in the ME.
This is preferably a machine-readable file (e.g., W3C XML-Schema); however, for legacy reasons, a human-readable document can do the job. The middle layer is the data encoding model, which defines how the data are represented as a serialized stream of bits. The bottom layer is the transport model, which defines how to get the data from point A to point B. The transport model is often a hierarchy of protocol layers, but it also includes concepts of file-based exchange, streaming data, and other transports.

4. DATA STREAMS AND RIVERS

The data streams of the Internet are huge. Even though usage MEs will be a couple of orders of magnitude smaller, the volume of usage events can still present significant design challenges for general purpose collection systems. We have begun to characterize these ME flow volumes from data that several NSPs have shared with us. We have measured ME streaming rates of 0.2 to 0.5 MEs/subscriber/sec for routers enabled with Cisco IOS® NetFlow (http://www.cisco.com/go/netflow). For a moderate-sized network supporting 1 million subscribers, 0.3 ME/sub/sec represents an input rate of 300K MEs/sec. NetFlow version 5 uses UDP packets of 30 ME records of 48 bytes each plus a single header of 24 bytes (1,464 bytes per packet, or about 49 bytes per ME). At 50 bytes average per ME this is a line speed of about 120 Mb/s. But stored into a database it can represent approximately 4 terabytes per day (allowing 3X for the inefficiencies of relational DB storage). This is assuming, of course, that you are willing to pay for a database that can handle a continuous input stream of 300K records per second and perform useful analysis work at the same time.

Figure 1. Simplistic approaches of store first then analyze later when applied to high-momentum streams can lead to very large storage requirements (high infrastructure costs) and long analysis latencies.

How long do you wish to keep these records? The left side of Figure 1 is a simple back-of-the-envelope calculation (BOTEC) that computes the database storage requirements as you scale up in number of subscribers and in length of storage time. The heavier line represents the 1 million subscriber case above. Three months of storage at 3 million subscribers is already a petabyte!

The other consequence of these large datasets is the time required to process them. The chart on the right side of Figure 1 is another BOTEC that illustrates the time it would take to do a single pass over a dataset as a function of the dataset size in terabytes and the record processing speed of the database. Another way to think about this is to consider the ratio of the continuous input record rate to the record processing rate once the data have been stored into the database. If the query requires complex processing during its scan, the processing rate may not be much faster than the input data rate. At a ratio of 1:1 it will take as long to perform a single pass on the data as it took to capture them. This could be an unacceptably long wait for some of the key results hidden in the data.

For high-momentum streams it is common to have hard-coded pre-processors that perform either simple aggregation or sampling to reduce the data down to a rate that can be absorbed by conventional databases.
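
As a rough illustration of what such pre-processors do, the following Python sketch shows one simple aggregation step and one uniform sampling step. It is not taken from the paper: the record fields `subscriber_id` and `bytes`, the function names, and the sampling scheme (independent per-record selection) are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def aggregate(records, key_fields=("subscriber_id",)):
    """Simple aggregation: total bytes and record count per key.

    The field names are hypothetical; real MEs carry many more fields.
    """
    totals = defaultdict(lambda: {"bytes": 0, "records": 0})
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        totals[key]["bytes"] += rec["bytes"]
        totals[key]["records"] += 1
    return dict(totals)


def sample(records, rate, seed=None):
    """Uniform sampling: keep each record independently with probability `rate`."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < rate]


# Illustrative input only; these values are made up for the example.
usage = [
    {"subscriber_id": "a", "bytes": 100},
    {"subscriber_id": "b", "bytes": 250},
    {"subscriber_id": "a", "bytes": 50},
]

per_subscriber = aggregate(usage)      # one output row per subscriber
one_in_ten = sample(usage, rate=0.1)   # roughly 10% of the input records
```

In this sketch the grouping key and the sampling rate are both fixed before any data arrive, and any field not named in `key_fields` is lost once the records have been aggregated.
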
However, the use of either of these techniques involves making major a priori assumptions about the nature of the data and what potential queries will be made on the reduced data.

Assume, for a moment, that the NSP can afford the DBMS infrastructure required to capture all of these raw data. Advanced data reduction techniques have been developed for obtaining quick approximate answers from large databases. The paper edited by Hellerstein (Hellerstein et al. 1997) provides an excellent survey of these techniques, which include singular value decomposition, wavelet, regression, log-linear, clustering, index tree, and sampling. But the choice of these techniques also heavily depends on the nature of the data
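
The back-of-the-envelope figures quoted in this section can be reproduced with a few lines of arithmetic. The Python sketch below is an illustration only: the function names are invented for the example, and the defaults (0.3 ME/sub/sec, 50 bytes per ME, a 3X storage overhead, decimal units) are the values quoted in the text rather than measurements of any particular system.

```python
SECONDS_PER_DAY = 86_400

# NetFlow v5 packet layout as described in the text:
# 30 records of 48 bytes plus one 24-byte header.
NETFLOW_V5_BYTES_PER_ME = (30 * 48 + 24) / 30  # about 48.8 bytes


def input_rate(subscribers, mes_per_sub_per_sec=0.3):
    """ME input rate in records per second."""
    return subscribers * mes_per_sub_per_sec


def line_speed_mbps(rate, bytes_per_me=50):
    """Line speed in megabits per second for a given ME rate."""
    return rate * bytes_per_me * 8 / 1e6


def storage_tb(subscribers, days, bytes_per_me=50, overhead=3.0,
               mes_per_sub_per_sec=0.3):
    """Database storage in terabytes, including the storage overhead factor."""
    rate = input_rate(subscribers, mes_per_sub_per_sec)
    raw_bytes = rate * bytes_per_me * SECONDS_PER_DAY * days
    return raw_bytes * overhead / 1e12


def scan_time_days(capture_days, input_rate_per_sec, processing_rate_per_sec):
    """Time for one pass over the stored data.

    At an input-to-processing ratio of 1:1, the scan takes as long
    as the capture did.
    """
    return capture_days * input_rate_per_sec / processing_rate_per_sec


rate = input_rate(1_000_000)
print(f"input rate: {rate:,.0f} MEs/sec")
print(f"line speed: {line_speed_mbps(rate):.0f} Mb/s")
print(f"storage, 1M subscribers, 1 day: {storage_tb(1_000_000, 1):.1f} TB")
print(f"storage, 3M subscribers, 90 days: {storage_tb(3_000_000, 90) / 1000:.2f} PB")
print(f"scan time at 1:1 for 90 days of data: {scan_time_days(90, rate, rate):.0f} days")
```

Under these assumptions the sketch gives roughly 120 Mb/s, about 3.9 TB per day for 1 million subscribers, and about 1 PB for 3 million subscribers over 90 days, in line with the figures quoted above.
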
