A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

9.2 CAPTURE MODEL AGGREGATION

A large stream with many variables can create many models, depending on how you choose to configure the DNA Collector. A single distribution model with 100 bins consumes about 2 KB, so monitoring the bandwidth distribution characteristics of stream flows at 10,000 points in your network amounts to only 20 MB of memory. But how does one examine 10,000 distributions, or 100,000? This leads to the concept of model aggregation.

Returning to the navigation discussion, the tree structure can be leveraged again to produce one type of model aggregation, in which capture models reside at each interior node of the tree in addition to the leaf nodes. The interior models form a hierarchy: an interior model at an upper level of the tree represents the aggregate statistics of all the child nodes below it. Order is important, however. Using "*" to represent the aggregation of all coordinate values for a dimension, you could create navigation coordinates like (a1, *, *, *), (a1, a2, *, *), or (a1, a2, a3, *), where the coordinates are in top-down order. This does not allow aggregations of the form (*, a2, a3, a4), which diminishes the strategy's usefulness. Instead, we have provided an internal query capability within the DNA Collector server that can traverse the in-memory tree and collect data from nodes matching an arbitrary query of the form (*, a2 op x, *, a4 op z), where op is a qualifying operator. A sketch of such a traversal appears below.
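The node layout and query syntax are internal to the DNA Collector and are not given here, so the following is only a minimal Python sketch of the idea: wildcard dimensions pass every branch, while qualified dimensions prune the traversal. The Node class and the predicate-list encoding of the query are illustrative assumptions.

    # Minimal sketch of the in-memory tree query described above. The
    # Node layout and the predicate-list encoding of a query such as
    # (*, a2 op x, *, a4 op z) are illustrative assumptions, not the
    # DNA Collector's actual internals.

    class Node:
        def __init__(self, model=None):
            self.children = {}   # coordinate value -> child Node
            self.model = model   # capture model stored at this node

    def query(node, predicates, depth=0, results=None):
        """Collect models from nodes whose coordinates satisfy
        'predicates': one entry per dimension, either None (the '*'
        wildcard) or a function value -> bool (the 'op x' test)."""
        if results is None:
            results = []
        if depth == len(predicates):      # full coordinate matched
            results.append(node.model)
            return results
        pred = predicates[depth]
        for value, child in node.children.items():
            if pred is None or pred(value):
                query(child, predicates, depth + 1, results)
        return results

    # Example: the query (*, a2 > 10, *, a4 == "http") becomes
    #   query(root, [None, lambda v: v > 10, None, lambda v: v == "http"])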
So far these aggregations have been inside a particular Collector, but a large deployment may have hundreds of Collector agents widely distributed geographically. The second mechanism we have developed for model aggregation allows model data from widely dispersed DNA Collectors to be merged, as long as some basic rules are followed. Whether a model can be aggregated with other models depends on the model's definition: History Models and Distribution Models can be combined as long as the data was collected during the same aggregation time interval and the events of the different models are independent. An example is two sets of hourly subscriber usage distributions collected in San Francisco and Los Angeles. As long as the subscribers generating traffic in San Francisco are not the same subscribers generating traffic in Los Angeles, and both datasets are from the same day and hour of the day, the distributions can be aggregated. To facilitate this kind of model aggregation, the dimensions of statistics data collection are marked with an independence parameter; this simple facility protects the user from accidentally creating model aggregations that would be meaningless. A sketch of such a merge appears below.
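As a concrete illustration, here is a hedged Python sketch of merging two Distribution Models. The DistributionModel fields and the checks in merge mirror the interval and independence rules above, but the names are assumptions, not the Collector's actual data layout.

    # Hedged sketch of cross-Collector merging of two Distribution
    # Models under the rules above. Field names are assumptions.

    from dataclasses import dataclass

    @dataclass
    class DistributionModel:
        interval: str        # aggregation interval, e.g. "2004-07-01T14"
        independent: bool    # independence parameter set at collection
        bins: list           # per-bin counts; same binning everywhere

    def merge(a, b):
        """Merge two distribution models by summing bin counts,
        refusing combinations the rules would make meaningless."""
        if a.interval != b.interval:
            raise ValueError("different aggregation time intervals")
        if not (a.independent and b.independent):
            raise ValueError("event sources not marked independent")
        if len(a.bins) != len(b.bins):
            raise ValueError("models use different binnings")
        return DistributionModel(a.interval, True,
                                 [x + y for x, y in zip(a.bins, b.bins)])

    # San Francisco and Los Angeles, same day and hour:
    sf = DistributionModel("2004-07-01T14", True, [5, 9, 2])
    la = DistributionModel("2004-07-01T14", True, [3, 4, 8])
    combined = merge(sf, la)   # bins -> [8, 13, 10]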
9.3 DRILL FORWARD

One of the important capabilities of these capture models is that they can be dynamically configured by the user, or by some other agent, including the type of model and all of its configuration parameters. This leads to an important concept in stream analysis that I call Drill Forward. Most of us are familiar with the concept of Drill Down in multidimensional online analytical processing (MOLAP) or relational online analytical processing (ROLAP) systems: clicking on a bar of a bar chart opens a new window of historical detail, displaying the next-level components that make up the selected bar. Note the word historical, because the presumption of drill down is that a database of history sits behind the data you see. Unfortunately, constructing such a history for massive data streams may not be economically practical, or may take too long, for the reasons discussed previously.

Drill forward is simply a different name for what we all do when something draws our attention: we focus in and look more closely, discarding the vast majority of the other data pummeling our senses. But we are moving forward in time, not backward. When dealing with massive data streams, the same technique can be used to investigate patterns. In a stream-processing context, a few key variables could be monitored to establish normative behavior. If there is a sudden change (exceeding a percentile threshold, a change in the shape of a distribution, etc.), the rule logic can be dynamically restructured to collect more detailed data about a reduced but focused subset of the stream where the exception occurred. For example, the appearance of certain traffic patterns may be a precursor to a hostile attack on the network; if such a pattern occurs, it is desirable to collect additional detail on that substream.

A simple example of this can be accomplished with a conditional hash rule, a variation of the hash rule above. If a single event flowing to (or from) a particular IP address shows traffic activity on one of a list of Trojan ports, a flag triggers aggregation of traffic by port in addition to aggregation by IP address. Once the port has been hashed into the table, data continues to be collected for it for a defined interval of time, because the entry already exists in the hash table. This avoids collecting high-granularity data all the time for all substreams, resulting in significant data reduction and efficient processing of these data in downstream systems. A sketch of such a rule appears below.
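The rule listing itself is not reproduced in this text, so the following is only a hedged Python sketch of the behavior just described; TROJAN_PORTS, the one-hour detail window, and the event field names are illustrative assumptions rather than the actual rule syntax.

    # Hedged sketch of a conditional hash rule as described above.
    # Ports, window length, and event layout are assumptions.

    import time

    TROJAN_PORTS = {1243, 12345, 27374, 31337}   # illustrative list
    DETAIL_WINDOW = 3600                         # seconds of extra detail

    by_ip = {}        # ip -> byte count (always-on aggregation)
    by_ip_port = {}   # (ip, port) -> [byte count, expiry time]

    def process(event):
        """event: dict with 'ip', 'port', 'bytes' keys (assumed layout)."""
        by_ip[event["ip"]] = by_ip.get(event["ip"], 0) + event["bytes"]

        key = (event["ip"], event["port"])
        now = time.time()
        if event["port"] in TROJAN_PORTS and key not in by_ip_port:
            # The flag fires: start aggregating this substream by port.
            by_ip_port[key] = [0, now + DETAIL_WINDOW]
        entry = by_ip_port.get(key)
        if entry is not None:
            if now < entry[1]:
                # Already hashed into the table: detail keeps
                # accumulating for the defined interval of time.
                entry[0] += event["bytes"]
            else:
                del by_ip_port[key]   # detail window elapsed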
As another example, assume a capture model has been configured to routinely measure the distribution of the number of unique destination addresses per subscriber for outgoing traffic. A large spike of activity at the 99th percentile may signal a subscriber performing address scans on the network. Based on this abnormal event, the capture models can be reconfigured with filters to focus only on the portion of the distribution where the spike occurs and then start exporting additional information about the suspect traffic, such as protocol and destination port, which helps identify the type of traffic. A sketch of such a trigger appears below.
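A hedged sketch of that drill-forward trigger follows; the reconfigure callback, the doubling threshold, and the filter parameters are illustrative assumptions about how a focused capture model might be installed.

    # Hedged sketch of the drill-forward trigger described above.
    # 'reconfigure', the threshold, and parameter names are assumed.

    def percentile(values, p):
        """Nearest-rank percentile of a list of numbers."""
        ordered = sorted(values)
        idx = max(0, int(round(p / 100.0 * len(ordered))) - 1)
        return ordered[idx]

    def check_drill_forward(unique_dest_counts, baseline_p99, reconfigure):
        """unique_dest_counts: this interval's per-subscriber counts of
        unique destination addresses. baseline_p99: the normative 99th
        percentile. reconfigure: installs a focused capture model."""
        current = percentile(unique_dest_counts, 99)
        if current > 2 * baseline_p99:   # abnormal spike in the tail
            # Filter to the portion of the distribution with the spike
            # and export protocol and destination port for that traffic.
            reconfigure(filter_min_destinations=baseline_p99,
                        export_fields=("protocol", "dest_port"))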
The ability to establish normative distributions of various characteristics of a stream, and then to dynamically explore deviations from those norms, adds considerable analysis capability. This technique is ideal for detecting and exploring patterns in a stream, but not for discovering once-in-a-lifetime events.