A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA 318

Figure 7. Capture Models are sophisticated rules that can perform complex binning, filtering, and associative processing. These capture models can be dynamically configured to examine different statistical properties of the incoming stream.

The tree structure can also be used to create a result stream that is an associative merge of two different input streams. Suppose we have two streams, A and B. For stream A we mark each NME so that, instead of traveling through the entire tree, an A-NME travels to the coordinate node of the tree specified by A's hash attribute values and drops off its associative attribute values at that node, where they are stored until they are replaced with more current data for that coordinate, again from stream A. Stream B is processed as discussed earlier. As a B-NME travels through the tree it is routed to the coordinate node specified by B's hash attribute values. As it passes by, the B-NME picks up the associative data previously dropped off by the A-NME. In this way, a multidimensional set of associations between streams can be computed in real time. Other forms of association can be performed by look-up rules, which implement fast, specialized lookup algorithms for associating stream data with reference data. An example of this is finding the longest qualified prefix of an IP address given a reference routing table.

9. STATISTICS FROM STREAMS

DNA builds on the platform discussed above and extends its stream processing capabilities in several ways.

9.1 CAPTURE MODELS

As the stream of NMEs passes through a node in the tree described above, it is possible to collect richer statistics as well (Figure 7). The motivations are several, but a capture model of a few KB can extract selected characteristics of a large stream very economically. Capture Models (or just Models) are similar to Rules.
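The text does not show the platform's actual interfaces, but the Capture Model concept can be sketched in Java roughly as follows. This is a hypothetical illustration: the interface, method names, and the trivial counting model are my own, not the DNA platform's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Capture Model concept: a model observes NMEs
// (attribute vectors) as they pass through a tree node and, at the end of
// its integration interval, flushes a compact result object.
interface CaptureModel {
    // Observe one NME as it passes through the node.
    void observe(Map<String, Object> nme);

    // Return the accumulated result and reset internal state;
    // called at the end of the integration interval.
    Object flush();
}

// A trivial illustrative model: count NMEs per value of one attribute.
public class CountByAttribute implements CaptureModel {
    private final String attribute;
    private Map<Object, Long> counts = new HashMap<>();

    public CountByAttribute(String attribute) {
        this.attribute = attribute;
    }

    @Override
    public void observe(Map<String, Object> nme) {
        counts.merge(nme.get(attribute), 1L, Long::sum);
    }

    @Override
    public Object flush() {
        Map<Object, Long> result = counts;
        counts = new HashMap<>(); // reset for the next integration interval
        return result;
    }
}
```

A few KB of state per node, as the text notes, is enough for a model of this shape to summarize a very large stream.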
Models are contained in a special Modeling Rule that acts as a manager and container for multiple models. When a Modeling Rule is inserted into a rule chain, it spawns capture models into the associated data nodes of the tree as those nodes are created. Consider a node of the hash tree as a representation of the intersection of a set of business coordinates such as customer, service, and geography. Each
node can contain multiple Capture Models, which can collect different views of the data passing by. One Capture Model might be an adaptive histogram on one variable; another could be a TopN model of a different variable from the stream. One way to think of a Capture Model is that its input is a stream of vectors (NMEs) and its output is a matrix of values defined by the Capture Model.

Capture models can be configured with an integration interval (minutes to days) that defines the amount of time over which statistics are collected. At the end of the integration interval, the result matrix is usually flushed to a persistent store or to another downstream rule engine. There is no fundamental restriction on how a capture model is designed, as the output can be any data structure that can be contained in a Java Object. For example, creating a correlation matrix model would be relatively straightforward. Defining conventions, like the matrix form above, has allowed us to create additional functionality such as the model aggregation described later. The most common capture models include log distributions, linear distributions, TopN, History (time series), and other specialty models for security and capacity-planning flow analysis.

As an example, the distribution capture model performs dynamic binning on the values that fly by for a configured attribute. For improved accuracy, particularly for rare events, the model defines two vector variables, sum and hits, both of which are dimensioned by the number of bins. The order of the result matrix is thus n × 2, where n represents the current number of bins. The bins need not be contiguous and are created only for data values that actually appear in the stream. The result of the distribution capture model, when queried, is again an NME.
The first several attributes define the coordinates of the model, followed by a single object attribute that is a compact form of the empirical distribution of the variable. This result NME can be output either at flush time of the aggregation tree or obtained by a real-time query from the DNA client application.
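The sum/hits bookkeeping of the distribution capture model can be illustrated with a minimal sketch. The binning policy here (power-of-two buckets for a log distribution) and all names are assumptions for illustration; the text does not specify the platform's actual binning rule or API.

```java
import java.util.TreeMap;

// Hypothetical sketch of a log-distribution capture model: values are
// dynamically binned (here by power-of-two bucket), and two accumulators,
// sum and hits, are kept per bin so that rare events retain accuracy
// (the per-bin mean sum/hits is exact even for a single hit).
public class LogDistributionModel {
    // bin index -> {sum of values, hit count}; bins are created lazily,
    // so they need not be contiguous.
    private final TreeMap<Integer, double[]> bins = new TreeMap<>();

    public void observe(double value) {
        double[] cell = bins.computeIfAbsent(binOf(value), k -> new double[2]);
        cell[0] += value; // sum
        cell[1] += 1;     // hits
    }

    // Power-of-two binning: values in [2^k, 2^(k+1)) share bin k.
    static int binOf(double value) {
        return (int) Math.floor(Math.log(value) / Math.log(2));
    }

    // Flush the n x 2 result matrix (one row per populated bin,
    // columns sum and hits) and reset for the next interval.
    public double[][] flush() {
        double[][] matrix = new double[bins.size()][];
        int i = 0;
        for (double[] cell : bins.values()) {
            matrix[i++] = new double[]{cell[0], cell[1]};
        }
        bins.clear();
        return matrix;
    }
}
```

The flushed n × 2 matrix matches the form described above; in the real platform it would be wrapped in a result NME whose leading attributes carry the model's coordinates.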