A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA 320

9.2 CAPTURE MODEL AGGREGATION

A large stream with many variables can produce a large number of models, depending on how the DNA Collector is configured. A single distribution model with 100 bins consumes about 2KB, so monitoring the bandwidth distribution characteristics of stream flows at 10,000 points in a network amounts to only 20MB of memory. But how does one examine 10,000 distributions, or 100,000? This leads to the concept of model aggregation.

Returning to the navigation discussion, the tree structure can be leveraged again to produce one type of model aggregation, in which capture models reside at each of the interior nodes of the tree in addition to the leaf nodes. These interior models form a hierarchy in which an interior model at an upper level of the tree represents the aggregate statistics of all the child nodes below it. Order is important, however. Using "*" to represent the aggregation of all the coordinate values for a dimension, you could create navigation coordinates like (a1.*.*.*), (a1.a2.*.*), or (a1.a2.a3.*), where the coordinates are in top-down order. This does not allow aggregations of the form (*, a2, a3, a4), which diminishes the strategy's usefulness. Instead, we have provided an internal query capability within the DNA Collector server that can traverse the in-memory tree and collect data from nodes with an arbitrary query of the form (*, a2 op x, *, a4 op z), where op is a qualifying operator.

So far these aggregations have been inside a particular Collector. A large deployment may have hundreds of Collector agents widely distributed geographically. The second mechanism we have developed for model aggregation allows the model data from widely dispersed DNA Collectors to be merged, as long as some basic rules are followed.
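A minimal sketch of such a wildcard tree traversal may clarify the idea. The node structure, model names, and query representation below are illustrative assumptions, not the DNA Collector's actual internals; a per-dimension predicate of None plays the role of "*", and a lambda plays the role of "a op x".

```python
# Hypothetical sketch of an in-memory tree query of the form
# (*, a2 op x, *, a4 op z); all names here are illustrative.

class Node:
    def __init__(self, value=None, model=None):
        self.value = value      # coordinate value for this dimension
        self.model = model      # capture model stored at this node
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

def query(node, predicates, depth=0):
    """Yield models from nodes whose coordinate path satisfies the
    per-dimension predicates; None stands for the '*' wildcard."""
    if depth == len(predicates):
        if node.model is not None:
            yield node.model
        return
    pred = predicates[depth]
    for child in node.children:
        if pred is None or pred(child.value):
            yield from query(child, predicates, depth + 1)

# Build a tiny two-dimension tree: (region, port) -> model name.
root = Node()
west = root.add(Node("west"))
east = root.add(Node("east"))
west.add(Node(80, model="west:80"))
west.add(Node(443, model="west:443"))
east.add(Node(80, model="east:80"))

# Query (*, port == 80): wildcard on region, qualifying operator on port.
matches = sorted(query(root, [None, lambda p: p == 80]))
print(matches)  # ['east:80', 'west:80']
```

Because the predicates are evaluated level by level during the traversal, a wildcard in any position simply fans the search out across that dimension, which is what the top-down coordinate scheme alone could not express.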
The ability of a model to be aggregated with other models depends on the definition of the model. History Models and Distribution Models can be combined as long as the data was collected during the same aggregation time interval and the events of the different models are independent. An example is two sets of subscriber usage distributions collected per hour in San Francisco and Los Angeles. As long as the subscribers generating traffic in San Francisco are not the same subscribers generating traffic in Los Angeles, and both datasets are from the same day and hour of the day, the distributions can be aggregated. To facilitate this kind of model aggregation, the above dimensions of statistics data collection are marked with an independence parameter. This simple facility protects the user from accidentally creating model aggregations that would be meaningless.

9.3 DRILL FORWARD

One of the important capabilities of these Capture Models is that they can be dynamically configured by the user, or some other agent, including the type of model and all of its configuration parameters. This leads to an important concept in stream analysis I call Drill Forward. Most of us are familiar with the concept of Drill Down when dealing with multidimensional online analytical processing (MOLAP) or relational online analytical processing (ROLAP) systems: clicking on a bar of a bar chart opens a new window of historical detail, displaying the next-level components that made up the selected bar.
Note the word historical: the presumption of drill down is that there is a database of history behind the data you see. Unfortunately, constructing such a history for massive data streams may not be economically practical, or may take too long, for the reasons discussed previously.

Drill forward is simply a different name for what we all do when something draws our attention: we focus in and look more closely, discarding the vast majority of the other data pummeling our senses. But we are moving forward in time, not backward. When dealing with massive data streams, the same technique can be used to investigate patterns. In a stream-processing context, a few key variables could be monitored to establish normative behavior. If there is a sudden change (a percentile threshold being exceeded, a change in the shape of a distribution, etc.), the rule logic could be dynamically restructured to collect more detailed data about a reduced but focused subset of the stream where the exception occurred. For example, the appearance of certain traffic patterns may be a precursor to a hostile attack on the network. If such a pattern occurs, it is desirable to collect additional detail on that substream. A simple example can be accomplished with a conditional hash rule, a variation of the hash rule described above. In this example, if a single event flowing to (or from) a particular IP address shows traffic activity on one of a list of Trojan ports, a flag triggers aggregation of traffic by port in addition to aggregation by IP address. Once a port has been hashed into the table, data continues to be collected for that port for a defined interval of time, because it already exists in the hash table.
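A conditional hash rule of this kind might be sketched as follows. The port watch list, event field names, and table layout are illustrative assumptions, not the product's actual rule syntax; expiry of a triggered port after the defined interval is noted in a comment rather than implemented.

```python
# Illustrative conditional hash rule: events are always aggregated by IP
# address, but once an event touches a suspect ("Trojan") port, per-port
# aggregation is switched on for that port. All names are hypothetical.

TROJAN_PORTS = {31337, 12345}   # example watch list, not authoritative

ip_table = {}      # IP address -> byte count (unconditional aggregation)
port_table = {}    # port -> byte count (populated only after a trigger)

def conditional_hash_rule(event):
    ip, port, nbytes = event["ip"], event["port"], event["bytes"]

    # Unconditional part: always aggregate traffic by IP address.
    ip_table[ip] = ip_table.get(ip, 0) + nbytes

    # Conditional part: a suspect port "drills forward" into the port
    # table; once present, the port keeps accumulating for later events
    # because it already exists in the hash table. A real rule would also
    # expire the entry after a defined interval of time.
    if port in TROJAN_PORTS or port in port_table:
        port_table[port] = port_table.get(port, 0) + nbytes

events = [
    {"ip": "10.0.0.1", "port": 80,    "bytes": 500},
    {"ip": "10.0.0.2", "port": 31337, "bytes": 40},   # triggers the rule
    {"ip": "10.0.0.3", "port": 31337, "bytes": 60},   # keeps accumulating
]
for e in events:
    conditional_hash_rule(e)

print(port_table)  # {31337: 100}; port 80 never triggered, so it is absent
```

The design point is that the detailed table is keyed lazily: ordinary traffic pays only the cost of the IP aggregation, and per-port detail exists only for the substream that tripped the condition.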
This avoids having to collect high-granularity data all the time for all substreams, resulting in significant data reduction and efficient processing of these data in downstream systems.

As another example, assume a capture model has been configured to measure, on a routine basis, the distribution of the number of unique destination addresses per subscriber for outgoing traffic. A large spike of activity at the 99th percentile may signal a subscriber performing address scans on the network. Based on this abnormal event, the capture models can be reconfigured with filters to focus on only the portion of the distribution where the spike occurs, and then start exporting additional information about the suspect traffic, such as protocol and destination port, which will help identify the type of traffic.

The ability to establish normative distributions of various characteristics of a stream and then dynamically explore deviations from those norms adds considerable analysis capability. This technique is ideal for detecting and exploring patterns in a stream, but not for discovering once-in-a-lifetime events.
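The address-scan example above can be sketched as a drill-forward loop: monitor a coarse distribution, and when its 99th percentile jumps, emit a narrowing filter for the suspect tail. The percentile method, threshold factor, and filter fields below are illustrative assumptions, not the product's actual reconfiguration interface.

```python
# Hypothetical drill-forward trigger: compare the 99th percentile of the
# unique-destinations-per-subscriber distribution against a baseline and,
# on a spike, return a filter that narrows collection to the suspect tail
# and exports extra detail. All names and thresholds are illustrative.

def p99(samples):
    """99th percentile of a list of samples (nearest-rank method)."""
    ordered = sorted(samples)
    rank = max(0, int(0.99 * len(ordered)) - 1)
    return ordered[rank]

def drill_forward(baseline, current, spike_factor=3.0):
    """Return a narrowing filter when p99 spikes, else None."""
    base, cur = p99(baseline), p99(current)
    if cur > spike_factor * base:
        # Reconfigure: keep only subscribers above the old p99, and export
        # protocol and destination port for the suspect traffic.
        return {"min_unique_dests": base,
                "export_fields": ["protocol", "dst_port"]}
    return None

# Normal hour: unique destination counts per subscriber cluster low.
baseline = [3, 4, 5, 4, 3, 6, 5, 4, 5, 4] * 10   # 100 subscribers
# Current hour: a few subscribers scanning addresses drag up the tail.
current = baseline[:-5] + [900] * 5

flt = drill_forward(baseline, current)
print(flt)  # {'min_unique_dests': 6, 'export_fields': ['protocol', 'dst_port']}
```

Note that nothing here looks backward at stored history: the spike is detected in the live distribution, and the filter only changes what is collected from that point forward, which is the essence of drill forward.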