National Academies Press: OpenBook
Suggested Citation:"1. INTRODUCTION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

Lee RHODES

The data streams of the Internet are quite large and present significant challenges to those wishing to analyze these streams on a continuous basis. Opportunities for analysis for a Network Service Provider include understanding subscriber usage patterns for developing new services, network demand flows for network operations and capacity planning functions, and early detection of network security breaches. The conventional analysis paradigm of store first, then analyze later has significant cost and latency issues when analyzing these high-momentum streams. This article presents a deployed architecture for a general purpose stream processor that includes dynamically configurable Capture Models that can be tailored for compact collection of statistics of the stream in real time. The highly configurable flow processing model is presented with numerous examples of how multiple streams can be merged and split based on the requirements at hand.

Key Words: DNA; IUM; Real-time statistics; Statistical pre-processing.

1. INTRODUCTION

In 1997 a small R&D group was formed inside of Hewlett-Packard's Telecommunications Business Unit to develop Internet usage management software for Network Service Providers (NSPs). The services offered by these NSPs ranged from Internet backbone to Internet access. The range of access services included residential and commercial broadband (cable and xDSL), dial-up, mobile data, as well as numerous flavors of hosting and application services. Early on our focus was the processing of usage data records (e.g., NetFlow® or sFlow®) produced by Internet routers. However, it quickly broadened to include convergent voice Call Detail Records (CDRs) as well as the ability to collect and process data from a very broad range of sources such as log files, databases, and other protocols. The diverse technological histories (and biases) of the different segments of the communications industry created for us interesting challenges in creating a software architecture that was

Lee Rhodes is Chief Scientist/Architect, IUM/DNA, Hewlett-Packard Co. (E-mail: lee.rhodes@hp.com).

©2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 927–944. DOI: 10.1198/1061860032706


Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently, development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.
