National Academies Press: OpenBook

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop (2004)

Chapter: 8. RULE CHAINS AND ASSOCIATED DATA STRUCTURES

« Previous: 7. STREAM RULE PROCESSING
Suggested Citation:"8. RULE CHAINS AND ASSOCIATED DATA STRUCTURES." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.
×
Page 316
Suggested Citation:"8. RULE CHAINS AND ASSOCIATED DATA STRUCTURES." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.
×
Page 317

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

About this PDF file: This new digital representation of the original work has been recomposed from XML files created from the original paper book, not from the original typesetting files. Page breaks are true to the original; line lengths, word breaks, heading styles, and other typesetting-specific formatting, however, cannot be retained, and some typographic errors may have been accidentally inserted. Please A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA 316 produce aggregates of particular attributes. At the end of the aggregation time interval the aggregates are flushed either to a persistent store for recovery operations, or the aggregates can be fed directly to a following Collector for further processing (or both). (3) Certain rules allow a real time query of current results. This is particularly valuable to gain insight into ongoing statistics. For example, examining the current empirical distribution of a variable. This kind of functionality is enabled with DNA discussed later on. Figure 5. The rule engine can be viewed as an in-line agent that performs the work of merging, analyzing, and splitting streams. The output streams of a rule engine use the same NME structure, which enables the creation of a fabric of chained stream processors. 8. RULE CHAINS AND ASSOCIATED DATA STRUCTURES A rule chain is sequence of configured operations on the flowing NMEs of an input stream. Example rules include standard three-address arithmetic and logical operators of the form op(a1, a2, a3) where the first two parameters represent NME attributes as operands, and the third attribute is the result. Flow control rules include conditionals (e.g., if then else) that can change the path of the rules based on the NME attributes, but no looping rules—not a good idea in a stream! In addition we have developed a rich set of lookup rules, filter rules, adornment rules, and other special rules for more complex operations. The rules are written in Java and there is a developer's kit that enables users to create their own rules. However, the current rule library is extensive with more than 100 rules, which is satisfactory for most applications. The right side of Figure 6 illustrates a simple rule chain on the left with a single if then else rule followed by some sequential operations. Although the processing logic applied to an NME as it flows through is contained in the rule, there is also a reference to a data structure that travels with the NME and can also be changed by the action of the rule logic. This data structure can hold state information that can interact with subsequent NMEs traveling through the Rule Engine and can be flushed to a datastore at regular intervals. This parallel data structure is usually organized as a tree shown on the right. Hash rules are an example of the special rules that operate on the references to the data tree. Each node of a level of the tree associated with a Hash rule can contain a hash table (Cormen, Leiserson, and Rivest 1999; Knuth 1973) where the entries contain a key and a pointer to a child node. A hash rule is configured to use an attribute of the NME as a key and then either chooses the successor data node based on the result of the hash if it exists, or create one if it use the print version of this publication as the authoritative version for attribution. does not exist. Each data node can be an n-way branch (a binary tree

A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA 317 structure is used in the diagram for graphical simplicity). New nodes of the tree get created as new values appear in the data. If the last rule in the chain is configured as an aggregation rule, and the Collector is configured to flush the leaf nodes flushed at periodic intervals, the resulting dataset would be the same as an SQL aggregating group-by operation except the grouping (essentially routing) occurs as the NMEs flow, not as a batch operation. The following table contrasts the stream processor rules that produce the same operation as the batch SQL statement on the right. Figure 6. A simplified view of how rule chains can interact with dynamic data structures. A single rule engine can have multiple of these rule chains in various series and parallel configurations. Stream Processor Rules Batch SQL Statement (in a specific rule chain context) Select SrcIP, DstIP, sum(usage) Hash SrcIP, DstIP; From <table> Aggregate sum(usage); Group By SrcIP, DstIP Because there can be multiple rule chains, as explained earlier, the specific rule chain where the hash rules appear is analogous to the “From <table>” clause in SQL. However, in a stream processing context we have the opportunity to perform operations that would be much more cumbersome in an SQL environment. This will be further discussed in Section 9.3 An n-level hash in this structure is analogous to a hyper-cube in the sense that the leaf nodes of the tree can be mapped to coordinates of an n-dimensional hyper-cube. However, there are differences. One is that the tree structure can be sparse in that nodes are created only when the combination of data actually exists in the input stream. Another difference is that the cell of a hyper-cube usually contains only one value. In this data structure the nodes are containers that can contain their own complex data structures internally.

Next: 9.1 CAPTURE MODELS »
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Get This Book
×
 Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!