A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

and queries anticipated. Unfortunately, Internet data can have a high number of dimensions, the variables can be highly skewed in both frequency and value, and some of the events or patterns of high interest can be very rare (e.g., a slow address scan by a potential intruder). To make matters worse, with the constant evolution of viruses and worms, the priority of what is important to examine is constantly changing. These complicating factors make the selection of data reduction techniques something of an art form.

Broadband service providers find themselves between a rock and a hard place. They need much richer information about their subscribers' usage behavior, with strong business rationale on both the revenue and the cost sides. The rock is the very high cost of building and managing these large datasets. The hard place is that most general-purpose data analysis tools presume that the data to be analyzed exists, or will exist, in a database. No database, no analysis.

What if you could extract some meaningful information about a data stream before you had to aggregate and commit it to hard storage? This idea, by itself, is not exactly new. But what is needed in a number of these high-momentum, complex data stream situations is a high-performance, flexible, and adaptive stream processing and analysis platform as a pre-processor to long-term storage and other conventional analysis systems. In this context, high performance means the ability to collect and process data at speeds much faster (>10X) than most common database systems; flexible implies a modular architecture that can be readily configured with new or specialized components as needs evolve; and adaptive implies that certain key components can change their internal logic or rules on the fly.
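As a rough illustration of what such an adaptive pre-processor implies, the following sketch shows a minimal stream filter whose rule components can be replaced while the stream is flowing, and which performs a cheap in-stream aggregation before anything reaches long-term storage. All names here (`StreamPreprocessor`, `set_rule`, the record fields) are hypothetical; this is not the actual platform described in this chapter, only an analogy under stated assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

Record = dict
Rule = Callable[[Record], bool]  # predicate deciding whether a record passes

class StreamPreprocessor:
    """Reduce a high-volume record stream before it reaches hard storage.

    Hypothetical sketch: rule components are held by name and can be
    replaced on the fly, mimicking the 'adaptive' property described above;
    holding them as pluggable named components mimics the 'flexible' one.
    """

    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}
        self.counts: Counter = Counter()  # cheap in-stream aggregation

    def set_rule(self, name: str, rule: Rule) -> None:
        # Swap rule logic without stopping the stream.
        self.rules[name] = rule

    def process(self, stream: Iterable[Record]) -> Iterator[Record]:
        for record in stream:
            self.counts[record.get("src", "?")] += 1
            if all(rule(record) for rule in self.rules.values()):
                yield record  # only the reduced stream continues downstream

# Usage: keep only large flows, then tighten the rule mid-run.
pre = StreamPreprocessor()
pre.set_rule("big", lambda r: r["bytes"] >= 100)
batch1 = list(pre.process([{"src": "a", "bytes": 50},
                           {"src": "b", "bytes": 500}]))
pre.set_rule("big", lambda r: r["bytes"] >= 1000)  # adaptive: rule updated on the fly
batch2 = list(pre.process([{"src": "b", "bytes": 500}]))
```

In this toy version, `batch1` retains only the 500-byte record and `batch2` is empty after the rule is tightened, while `counts` still reflects every record seen; the real platform, of course, operates at far higher rates and with far richer rule sets.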
Such rule changes could come in response to a change in the input stream, a detected change in the reference data from the environment, or a command from an analyst's console. Starting in 2000, we set out to build a platform with these goals in mind. The remainder of this article discusses the progress we have made.

5. IUM HIGH-LEVEL ARCHITECTURE

Figure 2 is a high-level view of the IUM architecture. Streams of data flow left to right. The purple boxes on the left represent different sources of raw data within a service provider's network infrastructure. The blue boxes on the right represent the target business applications or processes that require distinctly different algorithms or rule sets applied to the streams of data. The gold triad of a sphere, rectangular prism, and cylinder represents a single instance of an IUM server software agent that we call a Collector. Each Collector is capable of merging multiple streams of input data and producing multiple output streams, each of which can be processed by a different set of rules.

The Collector is the basic unit of scalability. The first dimension of scaling is horizontal (actually front to back in the graphic), in that different input streams can be processed in parallel by different Collectors on the left. The second dimension of scale can be achieved through the processing speed of the hardware hosts. The third dimension of scale can be achieved by using pipelining techniques that partition the overall processing task for the various target applications into smaller sequential tasks that can execute in parallel. The