A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

[Figure 7. Capture Models are sophisticated rules that can perform complex binning, filtering, and associative processing. These capture models can be dynamically configured to examine different statistical properties of the incoming stream.]

The tree structure can also be used to create a result stream that is an associative merge of two different input streams. Suppose we have two streams, A and B. For stream A we mark each NME so that, instead of traveling through the entire tree, an A-NME travels to the coordinate node of the tree specified by A's hash attribute values and drops off its associative attribute values at that node, where they are stored until replaced with more current data for that coordinate, again from stream A. Stream B is processed as discussed earlier. As a B-NME travels through the tree it is routed to the coordinate node specified by B's hash attribute values. As it passes by, the B-NME picks up the associative data previously dropped off by the A-NME. In this way, a multidimensional set of associations between streams can be performed in real time (a sketch of this merge appears below).

Other forms of association can be performed by look-up rules, which are designed to perform fast, specialized lookup algorithms for associations with reference data. An example is finding the longest qualified prefix of an IP address given a reference routing table (see the second sketch below).
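Because the paper notes that model outputs can be any data structure contained in a Java Object, Java is a natural language for a sketch. The following is a minimal, hypothetical illustration of the drop-off/pick-up merge just described; the class, method, and key names (AssociativeMergeNode, dropOff, pickUp, the string coordinate) are assumptions for illustration, not the product's actual API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the associative merge of streams A and B.
 * An NME is modeled as a map of attribute names to values, and the
 * coordinate key stands in for the node reached via the hash tree.
 */
public class AssociativeMergeNode {
    // Most recent associative attributes dropped off by stream A,
    // keyed by coordinate (e.g., "customer|service|geography").
    private final Map<String, Map<String, Object>> dropOffs = new HashMap<>();

    /** Stream A: store (or overwrite) associative attributes at a coordinate. */
    public void dropOff(String coordinate, Map<String, Object> associativeAttrs) {
        dropOffs.put(coordinate, associativeAttrs); // newer A data replaces older
    }

    /** Stream B: enrich a passing B-NME with whatever A last left here. */
    public Map<String, Object> pickUp(String coordinate, Map<String, Object> bNme) {
        Map<String, Object> stored = dropOffs.get(coordinate);
        if (stored != null) {
            bNme.putAll(stored); // B-NME picks up A's associative data in passing
        }
        return bNme;
    }
}
```

In the real system a node is reached by routing on hash attribute values rather than by a flat string key; the map here simply makes the replace-on-drop-off and enrich-on-pick-up semantics concrete.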
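The longest-prefix look-up rule can be sketched as well. The paper does not say which algorithm the product uses, so the scan over mask lengths below is only one plausible approach (a trie would scale better); all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical look-up rule: longest-prefix match of an IPv4 address
 * (as a 32-bit int) against a reference routing table.
 */
public class LongestPrefixLookup {
    // mask length -> (masked network address -> next hop)
    private final TreeMap<Integer, Map<Integer, String>> table = new TreeMap<>();

    public void addRoute(int network, int maskLength, String nextHop) {
        table.computeIfAbsent(maskLength, k -> new HashMap<>())
             .put(network & mask(maskLength), nextHop);
    }

    /** Try the most specific (longest) prefixes first. */
    public String lookup(int ip) {
        for (int len : table.descendingKeySet()) {
            String hop = table.get(len).get(ip & mask(len));
            if (hop != null) return hop;
        }
        return null; // no qualified prefix
    }

    private static int mask(int len) {
        return len == 0 ? 0 : -1 << (32 - len); // e.g., len 8 -> 0xFF000000
    }
}
```

For example, addRoute(0x0A000000, 8, "coreA") installs 10.0.0.0/8, and a lookup of any 10.x.x.x address returns coreA unless a longer matching prefix exists.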

9. STATISTICS FROM STREAMS

DNA builds on the platform discussed above and extends the stream processing capabilities in several ways.

9.1 CAPTURE MODELS

As the stream of NMEs passes through a node in the tree above, it is possible to collect richer statistics as well (Figure 7). The motivations are several, but a capture model of a few KB can extract selected characteristics of a large stream very economically.

Capture Models (or just Models) are similar to Rules. Models are contained in a special Modeling Rule that acts as a manager and container for multiple models. When a Modeling Rule is inserted into a rule chain, it spawns capture models into the associated data nodes of the tree as those nodes are created.

Consider a node of the hash tree as representing the intersection of a set of business coordinates such as customer, service, and geography. Each node can contain multiple Capture Models, each collecting a different view of the data passing by. One Capture Model might be an adaptive histogram on one variable; another could be a TopN Model of a different variable from the stream. One way to think of a Capture Model is that its input is a stream of vectors (NMEs) and its output is a matrix of values defined by the Capture Model (for instance, one row per bin or tracked item and one column per collected statistic).

Capture models can be configured with an integration interval (minutes to days) that defines the span of time over which statistics are collected. At the end of the integration interval, the result matrix is usually flushed to a persistent store or to another downstream rule engine. There is no fundamental restriction on how a capture model is designed, as the output can be any data structure that can be contained in a Java Object; creating a correlation matrix model, for example, would be relatively straightforward. Defining conventions such as the matrix form above has allowed us to create additional functionality, such as the model aggregation discussed below. The most common capture models include log distributions, linear distributions, TopN, History (time series), and other specialty models for security and capacity-planning flow analysis (a generic sketch follows this section).

As an example, the distribution capture model performs dynamic binning on the values that fly by for a configured attribute. For improved accuracy, particularly for rare events, the model defines two vector variables, sum and hits, both dimensioned by the number of bins; the result matrix is thus n×2, where n is the current number of bins. The bins need not be contiguous and are created only for data values that actually appear in the stream.

The result of the distribution capture model, when queried, is again an NME. The first several attributes define the coordinates of the model, followed by a single object attribute that is a compact form of the empirical distribution of the variable. This result NME can be output either at flush time of the aggregation tree or obtained by a real-time query from the DNA client application.
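To fix ideas, here is one hypothetical shape for a Capture Model in Java, following the vector-in, matrix-out convention described above: observe NMEs as they pass, expose a result matrix, and flush at the end of the integration interval. None of these names come from the paper.

```java
import java.util.Map;

/**
 * Hypothetical Capture Model contract: input is a stream of NMEs
 * (modeled as attribute maps), output is a result matrix.
 */
public interface CaptureModel {
    /** Observe one NME as it passes through the tree node. */
    void observe(Map<String, Object> nme);

    /** The matrix accumulated so far in this integration interval. */
    double[][] resultMatrix();

    /** Clear state for the next integration interval. */
    void reset();

    /** End of integration interval: emit the matrix and start over. */
    default double[][] flush() {
        double[][] result = resultMatrix();
        reset();
        return result;
    }
}
```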
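A matching sketch of the distribution capture model's n×2 bookkeeping follows. The paper specifies the sum and hits vectors and lazily created, possibly non-contiguous bins, but not the binning rule itself, so the base-10 logarithmic binning used here is an assumption.

```java
import java.util.TreeMap;

/**
 * Hypothetical distribution capture model: dynamic, sparse binning of
 * one attribute, keeping sum and hits per bin. The n x 2 result matrix
 * holds [sum_i, hits_i] for each of the n bins created so far.
 */
public class DistributionModel {
    private static class Bin { double sum; long hits; }

    // Bin index -> accumulators; bins need not be contiguous.
    private final TreeMap<Integer, Bin> bins = new TreeMap<>();

    /** Observe one value of the configured attribute. */
    public void observe(double value) {
        // Assumed binning rule: order of magnitude of the value
        // (guarded so the log is defined for tiny/non-positive inputs).
        int index = (int) Math.floor(Math.log10(Math.max(value, 1e-12)));
        Bin b = bins.computeIfAbsent(index, k -> new Bin());
        b.sum += value;  // keeping the sum preserves accuracy for rare events
        b.hits++;
    }

    /** Flush at the end of the integration interval: emit n x 2 and reset. */
    public double[][] flush() {
        double[][] matrix = new double[bins.size()][2];
        int row = 0;
        for (Bin b : bins.values()) {
            matrix[row][0] = b.sum;
            matrix[row][1] = b.hits;
            row++;
        }
        bins.clear(); // bins are recreated from the next interval's data
        return matrix;
    }
}
```

Queried between flushes, a real model would wrap this matrix, together with the node's business coordinates, into a result NME as described above.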
