National Academies Press: OpenBook

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop (2004)

Chapter: 9.4 USER INTERACTION WITH STREAMING MODELS

Suggested Citation: "9.4 USER INTERACTION WITH STREAMING MODELS." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


A STREAM PROCESSOR FOR EXTRACTING USAGE INTELLIGENCE FROM HIGH-MOMENTUM INTERNET DATA

Figure 8. Capture Models can be configured for real-time queries, which enable interactive snapshot views of the statistical data captured in memory. The screenshot above reveals the lognormal distribution of subscriber usage.

9.4 USER INTERACTION WITH STREAMING MODELS

The collection and processing of these streams forms the foundation, but users need graphical and visual tools for exploring this space. Wilkinson (1999) has done some extraordinary work in this area. This is a challenging area in its own right, and one where we will be investing more R&D going forward.

The DNA technology suite includes both a browser-based client and a Java application client for more sophisticated viewing and analysis. Figure 8 shows a real-data example of the analysis screen, examining a subscriber usage distribution. This kind of data can be pulled up from a DNA server using the real-time query mechanism mentioned earlier. What is interesting is that subscriber usage follows a lognormal distribution over five orders of magnitude (90 KB/month to 22 GB/month), with a shape factor of approximately 0.67.

Transforming this into a cumulative distribution function (CDF) is trivial (Figure 9, top), and the CDF gives marketing folks information on how to segment their subscribers based on usage. The graph on the bottom is a percentile-percentile plot showing what percentage of subscribers generates what percentage of the overall traffic. This graph shows that the distribution follows the 80:20 rule: the top 20% of subscribers generate 80% of the traffic. The top 5% generate 50% of all traffic!

To demonstrate how capturing statistics from a stream can generate valuable business insight, Figure 10 is taken from the DNA financial modeling tool, which uses empirical distribution data collected from the DNA server to compute the estimated dollar value of subscriber traffic under different pricing scenarios. Given

b = bytes of usage per month
s(b) = density function: number of subscribers at b
$(b) = pricing function: dollars paid by a subscriber with total usage b for the month

the revenue in dollars for all subscribers with monthly usage between b0 and b1 is

\[ \int_{b_0}^{b_1} s(b)\,\$(b)\,db \]

Figure 9. From the empirical distribution, multiple parameters can be derived and various transforms applied.
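The transforms described in this section (the CDF, the share of traffic generated by the heaviest users, and revenue under a pricing function) can be illustrated with a minimal Python sketch. It is not the DNA tool's implementation; all function names, the simulated data, the 1 GB median, and the tiered pricing function are hypothetical. It also assumes that the reported shape factor of about 0.67 is the standard deviation of log10(usage). That reading is an assumption, chosen because it produces traffic shares close to the 80:20 figures quoted in the text; the source does not state the base. The revenue function is a discrete analogue of the revenue expression: it sums the price over individual subscribers instead of integrating s(b) times $(b).

```python
import math
import random
from statistics import NormalDist


def simulate_usage(n, median_bytes, sigma_log10, seed=0):
    """Draw n hypothetical monthly usage values (bytes) from a lognormal."""
    rng = random.Random(seed)
    mu = math.log(median_bytes)
    # Assumption: the shape factor is a log10 standard deviation,
    # so convert it to the natural-log sigma that lognormvariate expects.
    sigma = sigma_log10 * math.log(10)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def empirical_cdf(usage, b):
    """Fraction of subscribers whose monthly usage is <= b."""
    return sum(1 for u in usage if u <= b) / len(usage)


def top_share(usage, top_fraction):
    """Fraction of total traffic generated by the heaviest top_fraction of subscribers."""
    ordered = sorted(usage, reverse=True)
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


def lognormal_top_share(top_fraction, sigma_log10):
    """Same quantity as top_share, in closed form for an exact lognormal."""
    sigma = sigma_log10 * math.log(10)
    z = NormalDist().inv_cdf(1.0 - top_fraction)
    return 1.0 - NormalDist().cdf(z - sigma)


def tiered_price(b, flat_fee=20.0, included_bytes=1e9, per_gb=2.0):
    """Hypothetical pricing function $(b): flat fee plus a per-GB overage charge."""
    overage_gb = max(0.0, b - included_bytes) / 1e9
    return flat_fee + per_gb * overage_gb


def revenue(usage, price, b0, b1):
    """Sum price(b) over subscribers with b0 <= b < b1 (discrete form of the integral)."""
    return sum(price(b) for b in usage if b0 <= b < b1)


if __name__ == "__main__":
    usage = simulate_usage(50_000, median_bytes=1e9, sigma_log10=0.67)
    print("CDF at 1 GB:", round(empirical_cdf(usage, 1e9), 3))
    for q in (0.20, 0.05):
        print(f"top {q:.0%}: simulated share {top_share(usage, q):.3f},"
              f" closed form {lognormal_top_share(q, 0.67):.3f}")
    print("revenue, 0 to 100 GB:", round(revenue(usage, tiered_price, 0, 1e11), 2))
```

Under the log10 assumption, the closed form gives the top 20% of subscribers roughly three quarters of the traffic and the top 5% a little under half, which is broadly in line with the figures read off the percentile-percentile plot. Swapping in a different pricing function for `tiered_price` is how alternative pricing scenarios would be compared in this sketch.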


Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.
