Mark Hansen

Untitled Presentation

Transcript of Presentation

BIOSKETCH: Mark Hansen is a professor of statistics at the University of California at Los Angeles, with a joint appointment in design and media arts. His fields of expertise include statistical methods for data streams, text mining and information retrieval, information theory, and practical function estimation.

Before joining the faculty at UCLA, Dr. Hansen was a member of the technical staff at Bell Laboratories. He specialized in Web statistics and other large databases, directing a number of experiments with sound in support of data analysis.

He has five patents and is the author of numerous publications, and he has served as an editor for the Journal of the American Statistical Association, Technometrics, and the Statistical Computing and Graphics Newsletter. He has received a number of grants and awards for his art installation Listening Post. Listening Post produces a visualization of real-time data by combining text fragments sampled in real time from thousands of unrestricted Internet chat rooms, bulletin boards, and other public forums. The fragments are then read (or sung) by a voice synthesizer and simultaneously displayed across a suspended grid of more than 200 small electronic screens.

Dr. Hansen received his undergraduate degree in applied mathematics from the University of California at Davis and his master’s and PhD degrees in statistics from the University of California at Berkeley.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





MR. HANSEN: [Speech in progress]. That involves artists like Rauschenberg and even Andy Warhol. The idea was to try to pair, then, mostly engineers and artists together to see what kind of useful forms of artistic expression might come out. In a very self-conscious way, I think, the approach was to try to revive this tradition with the arts and, hence, was born this arts and multimedia program.

It was actually an interesting event. The idea, as I said, was to very self-consciously pair artists and researchers together, and this will actually get to streaming data in a moment, I promise. So, what happened was, they organized a two-day workshop where 20 or so media artists from New York City and 20 or so invited researchers in the labs met in the boardroom of Lucent, and each got 10 minutes to describe what they do. I had 10 minutes to kind of twitch and talk about what I do, and the artists got some time to show some very beautiful slides, and use a very big vocabulary, and talk about what they do. Then we were supposed to pair up, somehow, find somebody and put a proposal together, and they would fund three residency programs where the project would get funded.

Ben and I put together perhaps the simplest thing given our backgrounds, him being a sound artist and me being a statistician. We put together a proposal on data sonification, which is a process by which data is rendered in sound, for the purpose of understanding some of its characteristics that may not be immediately obvious in the visual realm. So, instead of visualizing a data set, you might play a data set and get something out of it. This is sort of an old idea, and it seems like everything I have done John Chambers has done many, many years ago. So, I have kind of given up on trying to be unique or novel in any way.
He was working with perhaps the father of electronic music, Max Mathews, at Bell Labs. This was back in 1974. He developed something that Bell Labs at the time gave the title MAVIS, the Multidimensional Audiovisual Interactive Sensifier. The idea was that you would take a data set, take a matrix, and you would map the first column to, say, the pitch, the second column to the timbre, the third column to the volume. Then there would be some order to the data somehow and you would just play it. John said you got a series of squeaks and then a squawk, perhaps, if there was an outlier, and that was as far as it went. He said it wasn't particularly interesting to listen to, but maybe there was something that could be done to kind of smoke out some characteristics in the data.

Actually, when Ben and I were talking, we thought that this kind of mapping might be able to distinguish underground bomb blasts from earthquakes. Apparently, this problem was motivated by Suki, who was involved in the Soviet test ban discussions. At least, I am getting this now all from Bill. I thought I could give you an example of what some of this early sonification sounds like. A friend of mine at the GMD has developed a program on earthquake sonification, and here is what the Kobe quake sounds like, if you speed it up 2,200 times its normal speed at a recording station in California. [Audio played.]

It sort of gets amusing if you listen to other places. Here is what it sounds like where a few plates come together, but it also happens at the edge of the water. [Audio played.] I am told that it has nothing to do with the fact that there is water all around and has everything to do with the fact that you have three plates coming together there, but I am not going to talk this evening about sort of those early examples of sonification.

Instead, I am going to start to talk about some recent work we have been doing in analyzing communication streams. Most of this work was done, again, through this Bell Labs/Brooklyn Academy of Music program. Not surprisingly, then, a lot of it was inspired by communication processes or network-mediated transactions. Our group had been looking at things like call detail records. I am sure Daryl must have talked about that earlier today. I have personally been looking at Web log data, or proxy log data. So, you get records of who requested what file when from the network. Then we sort of slipped in the end into online forums, chat rooms, bulletin boards, that sort of thing, where the fundamental data consisted of who posted, what did they post, which room, what kind of content.

If you think about it, in some sense, the model of the Web as being this place where you kind of go out and you download a page is sort of fading in favor of these more user-generated, or sort of connection-based, communication processes. If you think about the number of e-mails per year, Hal Varian at Berkeley estimated that something like 610 billion e-mails are sent a year, versus only about 2 billion new Web pages made every year.

AUDIENCE: That is the wrong thing to count. You count how many times they are downloaded.

MR. HANSEN: I guess you are right.

AUDIENCE: Half of them are spam.

MR. HANSEN: I am not going to go into my favorite spam story. It is kind of a mixed audience. Then, there are other ubiquitous public forums. IRC cracked its half-million user mark this year. In a sense, there is something to sort of these user-generated communication streams, and that is what we are going to try to get at with this project with Ben.

So, our initial work, which seemed, I guess, a little wonky at the time, but at least produced some cool music—I don't know how practical it was—focused just on Web traffic. The idea was that we were going to develop sort of a sound that you could use to passively monitor a system. So, we were going to look at people's activity on a Web site. We looked at Lucent.com, which was then organized into a large number of businesses, now not so many businesses. Below each business directory there was a series of product directories and then, below those product directories, there were white papers and then, below those white papers, you would have technical specifications. So, the deeper you went in the directory structure, the more detailed material you were downloading. So, the idea was to create a sound that somehow became more expressive as more people on the site were browsing deeper and getting more interesting things.

So, we created a kind of mapping that generated drones for sort of high-level browsing. So, if you were somewhere in the micro-electronics area—well, you won't be now, but if at the time you were in the micro-electronics area—you would be contributing to the volume of some overall drone. As you went deeper, you would be contributing to, let's say, a pulse or some other sound that was all at the same pitch. Then, the tonal balance of the sound would vary based on the proportion of people who were at different parts of the site. So, here is the kind of mapping that we used, just to cut to the chase. This is what Lucent.com sounds like at 6:00 in the morning. [Audio played.] Just to give you kind of a lonely feeling. At 6:00 in the morning, there are probably 15 people rattling around the site. At 2:30 in the afternoon, we get most of our visitors and it sounds more like this. [Audio played.]

So, the deal was to make that somehow kind of pleasant and easy to listen to, and it might inform you of something. The unique feature of sound at some level—I see a lot of skeptical faces. This is a crowd of statisticians and you are supposed to have a lot of skeptical faces. I was with David Cox at the spring research conference, called The Cautious Empiricists or something. The "cautious" is the important thing. The idea is that sound somehow gives us a capability: you can attend to sound in a way that you don't attend to the visual system, or you can background sound, and you can't really do that with the visual system. So, you can have something going on in the background and you can attend to changes in that something, in the musical track, without really having to listen to it hard. The visual system requires you to watch a display or watch a plot.
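The MAVIS-style column-to-parameter mapping described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual 1974 program: the function names, the choice of pitch and volume as targets, and the linear scaling are all assumptions.

```python
# A minimal sketch of a MAVIS-style mapping: each row of a data matrix
# becomes one sound event, with columns scaled linearly onto sound
# parameters (pitch in Hz, volume in 0..1). All names are illustrative.

def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] onto [out_lo, out_hi]."""
    if hi == lo:
        return (out_lo + out_hi) / 2.0
    t = (value - lo) / (hi - lo)
    return out_lo + t * (out_hi - out_lo)

def sonify(matrix, pitch_range=(220.0, 880.0), volume_range=(0.1, 1.0)):
    """Map column 0 to pitch and column 1 to volume, row by row."""
    col0 = [row[0] for row in matrix]
    col1 = [row[1] for row in matrix]
    events = []
    for row in matrix:
        pitch = scale(row[0], min(col0), max(col0), *pitch_range)
        volume = scale(row[1], min(col1), max(col1), *volume_range)
        events.append((round(pitch, 1), round(volume, 2)))
    return events

# Playing the rows in order gives the "series of squeaks" described
# above; an outlier lands at the edge of the range and becomes the squawk.
events = sonify([[0.0, 5.0], [1.0, 10.0], [0.5, 0.0]])
```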
So, we came up with this sort of general map of the Web site activity. Then a graduate student at Tufts wrote me that he didn't really like the tonal balance that we got; he thought it was maybe a little too ravey or a little too something, and he preferred more natural sounds. So, he created sounds like this—[Audio played.]—which is telling you something about the network traffic. The patter of the water increases in volume with the number of users who are on the system. One of the bird sounds is incoming mail. So, you can kind of get a sense of what is going on. Anyway, he seemed to think that was more listenable.

At some level we decided that these experiments in sonification were interesting, and were certainly creating some music that we didn't mind listening to, but they weren't particularly practical. Also, they didn't speak to many people, because very few people care about any one given Web server. I mean, the number of people who would care about the traffic on Lucent.com is quite small. If you think about most Web servers, that is going to be the case. So, we decided that we needed to find something that had perhaps a little more social relevance.

So, we decided we would keep to kind of the communications realm and look at online communications. In some sense, as I pointed to before, with the amount of e-mail traffic and such, the Web is really just a big communications channel. Our thought was, perhaps aggressively, perhaps we were a little too whatever, but could we characterize the millions of conversations that were taking place right now? You had people who were in chat rooms, hundreds of thousands of people in chat rooms, people posting to bulletin boards. Can you say something about what all these people are talking about?

In some sense, these chat rooms and bulletin boards represent new spaces for public discourse. If you take them together, they represent a huge outpouring of real-time data, which is kind of begging to be looked at. There is a lot of structure here. There are sorts of chat sessions that are kind of day-to-day things. In the morning, it is what are you having for breakfast, and in the middle of the day it is, my boss is riding my back. At the end of the day it is, this was a great day, I am off to bed. In between, you have got lots of sort of, not just cycles about daily things, what is going on this morning or what is going on at work, but political arguments about terrorism or Afghanistan or something like that.

So, our thought was that we would try to create some kind of tools to give us a better understanding, or tap into this big stream and sort of experience it in some way that perhaps is a little bit more accessible to the general public than just a plot or a graph or something like that. So, here is the kind of data that we are basically looking at, and we have lots of it. So, you get some sense of a source; in this case, suppose all of these are IRC chat rooms. So, you get the name of the room, and then you get the user name and what they posted. We have agents, and I will talk about that in a little bit, who go out and sort of sample from these rooms. So, we will attach to a network and sample from tens of thousands of rooms, to kind of get an overall sense of what people are talking about.
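The sampling described above—keeping a representative picture of tens of thousands of rooms without storing every post—can be sketched with reservoir sampling over (room, user, text) records. This is an illustrative stand-in for the agents, not their actual code; the record layout and function names are assumptions.

```python
import random

# A sketch of sampling from a high-volume post stream, assuming each
# record is a (room, user, text) tuple as in the data described above.
# Reservoir sampling keeps a fixed-size uniform sample of the stream
# seen so far, without ever holding the whole stream in memory.

def reservoir_sample(stream, k, rng=random):
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                sample[j] = record         # replace with probability k/(i+1)
    return sample

posts = [("#iguanas", "user%d" % i, "hi") for i in range(10000)]
picked = reservoir_sample(posts, 100)
```

Uniformity holds because the i-th record survives with probability k/(i+1) and each later arrival displaces it with the right probability, so every post is equally likely to be in the final sample.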
So, the interesting thing about this project is that not only has there been some statistical content—and I have given talks about some of this stuff—but there has also been the opportunity for public performances or public events around it. The first thing we did with this chat space was a performance at a place in New York City called The Kitchen, which is fairly well known—people like Laurie Anderson got their start at The Kitchen. It is in Chelsea, and at that point we were looking at about 100 chat rooms and bulletin boards. We were looking at sort of news chat, sports, community. There was a beautiful room on the care and feeding of iguanas. I have told this story a lot. The beautiful thing about this room is that, after sort of monitoring it off and on for three or four months, they only used the word iguana like five or six times in that period. So, they don't refer to iguanas as iguanas. It is like baby or honey or my little something or other. I found it sort of amusing that you couldn't really tell what people were talking about if you didn't know something about the room itself. From there, we were monitoring for topics, looking at activity levels, that kind of thing.

Now, about the display that we put out: because this was performance based, it was part of their digital happy hour series. So, we got a big room, perhaps a bit bigger than this, with extraordinarily tall ceilings, and we had the picture at the bottom as a sort of layout of the rooms. There were round tables that people sat around, because it was a digital happy hour. There were four speakers, one in each corner of the room, and in the white bar at the top was a very large screen, about 20 feet tall, 20 feet wide. So, the display that we picked involved streaming the text along four lines at the top, and then each line of text was being read by one of the speakers in the room.

I can tell you what that looks like. I have a small movie of very bad quality, but we get better stuff later. Here you get the full lines of text. [Tape played.] The sound is a tone, and then there is a voice that is speaking at that same tone, in a monotone, so you get sort of a chant effect. There is an algorithmic structure that is used to generate the pitches, and the pitch was picked according to the length of the post. We wanted to have it clear: if the post was very short, it would take the voice only a short amount of time. First of all, the text-to-speech was horrible. That was kind of the standard Mac text-to-speech voice. Also, like I said, we only had about 100 rooms, but we thought the structure was nice, having the text up there, having the text-to-speech to guide you, and having the compositional element helped to keep people's attention. They were sort of watching this thing.

At that point, there wasn't a lot of organizational structure put to it. We just sort of randomly selected representative phrases from the individual chat rooms and let whatever collide, collide, in terms of meaning or content or whatever. So, that seemed to work out reasonably well. So we posed for ourselves another challenge, which was, could we in fact display sort of large-scale activity in real time from not just 100 rooms, but tens of thousands of rooms? As Ben keeps saying, we aspire to listen to it all. The "aspire" word means that we don't actually have to achieve it, but that is where we are headed.
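The idea above—deriving each voice's pitch and duration from the length of the post—can be sketched deterministically. The scale degrees, base note, and timing constant here are invented for illustration; the installation's actual compositional rules are not specified in the talk.

```python
# A hedged sketch of tying a post's length to its spoken pitch and
# duration, as described above. The pentatonic degrees, base MIDI note,
# and seconds-per-character constant are illustrative assumptions.

PENTATONIC = [0, 2, 4, 7, 9]           # semitone offsets within an octave

def pitch_for(post, base_midi=48):
    """Pick a scale degree deterministically from the post length."""
    degree = PENTATONIC[len(post) % len(PENTATONIC)]
    octave = (len(post) // 40) % 3     # longer posts sit in higher octaves
    return base_midi + 12 * octave + degree

def duration_for(post, seconds_per_char=0.06):
    """Short posts take the voice only a short amount of time."""
    return round(len(post) * seconds_per_char, 2)

# "hi" gets a low, quick note; a 45-character post climbs an octave
# and takes several seconds to speak.
short_note = (pitch_for("hi"), duration_for("hi"))
```

Making the mapping a pure function of the text means the same post always sounds the same, which keeps the chant effect coherent across voices.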
Again, because we were, for some reason or another, extraordinarily lucky with public performances to keep moving this along, we were part of the Next Wave Festival sponsored by the Brooklyn Academy of Music last year, in 2001. I will show you some pictures of this. Here, instead of having the one large screen with four lines of text, we created a 7-foot-tall, 10-foot-wide grid of small text displays, vacuum fluorescent displays, each about the size of a Hershey bar. There were, like I said, 110 of them, and they could show text that we were generating. Instead of having just four voices at a time, we used a speech engine, which would allow us to have up to 45 voices speaking in the room at a time, on eight channels of audio.

So, this was the little sign that they put out in front. This is what the installation space looked like. The Rockefeller Foundation kicked in and we were able to build an installation space. You see the hanging grid of small displays, and then the room itself; the silver panels conceal, in some cases, speakers, and in other cases just acoustic insulation, so you don't get a lot of flutter echo from the walls. Here is what each of the little gizmos looks like. This is a standard Noritake display, and we had a printed circuit board designed with a microcontroller on board, so that we could communicate with it. This is RS-485, for those who care. The two wires on the left are carrying communication and the two wires on the right are carrying power. So, you see that these two things are hanging from the same wires that we are talking to them on and powering them with.

So, here is another view. We have this very tight text mode, sort of four lines of 20 characters, and then we also have this big character mode where we can stream text across the screens. Here are some pictures that were taken at the time. This is the back. The back has an LED on it. In fact, in the BAM room, the Brooklyn Academy of Music room, you enter from the back. This wedge over here is the doorway, and you would enter in the back. What you would see is the pattern of the LEDs. So, you come into a very sort of abstract space, and then you move around and see all the text.

The piece itself was organized in a series of scenes or phases that were all meant to highlight some different aspect of the stream. In some cases, they are quite simple. All they are doing is giving you a tabulation of all the posts by length, streaming by, so you not only get a sense of scale, like how much conversation is going on, because things are streaming by fairly quickly, but you also get a chance to see the short posts, a lot of hi, hey, hi, and the longer posts. At that time, there was a whole lot of talk about Afghanistan and John Walker. I think there was one Friday when we were up at BAM when Winona Ryder was arrested. If only we had timed it better, this time we would have been able to see her being sentenced. Anyway, that was a very simple one, but the second scene tries to organize things by content and creates kind of a dynamic version of a self-organizing map. You get large regions that are all referring to the same topic, and the regions grow in response to the proportion of people who are talking about that particular topic. So, if Afghanistan is popular in the news and lots of people are talking about it, that region will grow quite large, and that depends quite heavily on the text-to-speech that you are using.
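The growing-regions behavior described above can be sketched as dividing a fixed grid of display cells among topics in proportion to their share of recent posts. The largest-remainder allocation below is an illustrative choice, not the piece's actual algorithm, and the topic names are made up.

```python
# A sketch of growing display regions in proportion to topic popularity,
# as in the map scene described above: a fixed grid of cells is divided
# among topics by their share of recent posts (largest-remainder method).

def allocate_cells(topic_counts, total_cells):
    total = sum(topic_counts.values())
    shares = {t: total_cells * c / total for t, c in topic_counts.items()}
    cells = {t: int(s) for t, s in shares.items()}   # floor of each share
    leftover = total_cells - sum(cells.values())
    # Hand remaining cells to the topics with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda t: shares[t] - cells[t],
                          reverse=True)
    for t in by_remainder[:leftover]:
        cells[t] += 1
    return cells

# With the 110-screen BAM grid, a topic's region tracks its share of talk.
grid = allocate_cells({"afghanistan": 50, "sports": 30, "pets": 20}, 110)
```

Recomputing the allocation as new posts arrive is what makes the regions appear to grow and shrink with the conversation.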
Then there are other things I won't have time to illustrate. This is the kind of thing that ends up coming out of the map scene that I will show you in a minute. To generate all this, I guess I should talk a little about the stream itself. We have a series of Java and Perl clients that are on the protected side of the Lucent firewall, that are going out and pulling things from chat rooms and bulletin boards. Then, on the other side, in the display space, we have four computers: one running the sounds in the room, one running a text-to-speech engine, one running the individual displays themselves, and then one kind of overseeing all of it. The unfortunate thing is that each of those machines is running a different operating system. If you can think of another operating system, we would be happy to include it. So, it is all about interprocessor communication somehow.

On the Lucent firewall side, we have two Linux servers and two class C networks that give us the capacity to look like about 500 different IP addresses. The chance that we are going to get somehow spotted and caught and turned off seems small although, every time I say that, I am a little nervous. We have upgraded the text-to-speech engine as well. We are using Lucent's commercial heavy-duty speech engine, which can give us access to about 100 voices in the room, for this Whitney exhibit that I will show in a minute. [Audio played.]

Can we show the DVD now? I am going to show a couple of examples of the installation. Then I have some production pictures. The construction just started at the Whitney. I apologize in advance that this is going to be a little hard to see. [DVD shown.] This is just that simple tabulation where, when things are very short, the screens are less bright and the sounds are very soft. I will just show a little of the next scene. This is the one that has the content, or builds up a map dynamically, and here we will have some voices. [DVD shown.]

So, as I said, to get to that point, there is a series of scenes that this thing alternates through. Each time, because the stream is live, the text that is spoken and the scenes you experience are different, because the data are always changing. We are trying to get the buffering now to about 15 minutes, so that everything you see will have been typed just 15 minutes ago. So, there was a series of things that had to happen to pull things from the chat stream and, given the time, I am not going to go into too much detail. There are things like trying to characterize topic, and track features by person and room, and do some clustering and what have you.

So, from the Brooklyn Academy of Music we went on to—actually, the Friday after we closed at the Brooklyn Academy of Music, we were sort of summoned to a morning at the Whitney and an afternoon at the Museum of Modern Art in New York, where we met with curators who said, well, let's talk about having your piece here, which was a little humbling and frightening. In the end, we were fortunate enough—well, an expanded version of the piece is opening at the Whitney Museum of American Art in just a few days. It is an enhanced display space. So, instead of having 10 feet by 7 feet with 110 displays, which was 11 rows and 10 columns, we now have 11 rows and 21 columns, which spans 21 feet.
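The 15-minute buffering mentioned above—everything displayed was typed a fixed time ago—can be sketched as a time-stamped queue that releases posts only once they have aged past the delay. The class and its names are illustrative, not the installation's code.

```python
from collections import deque

# A sketch of fixed-delay buffering: posts enter a queue stamped with
# their arrival time (in seconds) and are released only once they are
# `delay` seconds old, so everything shown was typed a set time ago.

class DelayBuffer:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.queue = deque()

    def push(self, timestamp, post):
        """Record a post with its arrival time."""
        self.queue.append((timestamp, post))

    def pop_ready(self, now):
        """Release every post that is at least `delay` seconds old."""
        ready = []
        while self.queue and now - self.queue[0][0] >= self.delay:
            ready.append(self.queue.popleft()[1])
        return ready

buf = DelayBuffer(delay_seconds=15 * 60)
buf.push(0, "hi")
buf.push(600, "my boss is riding my back")
early = buf.pop_ready(now=700)    # nothing is 15 minutes old yet
later = buf.pop_ready(now=900)    # the first post was typed 15 minutes ago
```

Because arrivals are appended in time order, a deque gives O(1) pushes and releases, which matters when the stream is tens of thousands of rooms.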
So, the thing is big, and it is in a soft curve. We had a workshop residency, or development residency, with a performing arts organization in Seattle called On the Boards. They gave us a studio, a kind of stage. You can see part of this grid, the curved grid, and then we got to sort of litter the place with our computers. This is what the thing wound up looking like in the end when it was up in Seattle.

We started construction at the Whitney as of Monday. This is the Whitney lobby. This, right there, is where we get to go, which again is an awesome thing, to think that there will be people walking by. So, here is the inside of the space as construction is going on. The first thing they had to do was put up a wall that is going to hold all of our panels, the concealed panels. This is the curved beam from which the text displays are going to be suspended. This is my collaborator, Ben. I am very happy that the beam is up in the air. It was a non-trivial problem to get it attached to the ceiling. This is part of the process. Now, speakers are mounted behind the beam, so that voices are spatialized. This is where I left it when I got on the plane earlier this evening. The carpet guys had just arrived and were putting carpeting in the room. So, it was kind of a lonely little place.

As for where this project goes: we have been looking at a stream of data, a stream of text data. I don't know how much text has been talked about here, but it is messy. Our audience isn't a technical expert, per se, but the general public, and the question is how you create data analyses that, in a way, speak immediately to the general public.

Some other applications we have been working on: we have begun a joint project with Bill Seaman, who was at UCLA and now is at RISD, jointly with UCLA. We are tiling a large room at UCLA with these sorts of condensed networks. So, it is a sensor network, which I have heard people already talk about, but we will have sort of wonky things like speech recognition and temperature and such on these sensors, for the inside of the room. Then, all of them will report wirelessly back to a central place, where we will be dealing with the streams. We have also looked with some Lucent people at perhaps network operations. When we first started this, we were talking to some people on the manufacturing lines.

An interesting application: we were approached by an architect, Rem Koolhaas, to help with a building he was putting up at IIT. The idea was, he was going to give us access to streams of facilities data, occupancy sensors, data from boilers and what have you, and we would create a sound in the foyer of this building. With repeated exposure to the foyer of this building, people would be able to know exactly what was going on in the building, just by the sound when they walked in. This is sort of a wonky artistic look at what the building is supposed to look like. There is a bowling alley in the space and a store, and we were going to get access to all of those data in real time.

Anyway, I guess I should summarize, because I need to get on a plane, because tomorrow we have to start putting up, now that the carpet people are done. So, it began as a collaboration between, somehow, the arts and sciences, and there was an interesting interplay of viewpoints between the two.
What I am finding is that there is a community of artist folks who are trying to reach out for collaborations with scientists. The collaboration is really highly valued, and they are looking at and studying how collaboration works and why it does, and how you can make it successful, and how you promote both art and science through that collaboration, and not have just one side of things. My work with Ben has been motivated, in a way, by new and complex data sources, large quantities of data—I suppose that is the massive part—with a strong time component—I guess that is the streaming part. Ultimately, the goal is to create new experiences with data and, in particular, to create public displays of data that somehow speak to the general public. With that, I will thank you for spending your digesting time with me.

AUDIENCE: How long are you at the Whitney?

MR. HANSEN: Three months. We open the 17th and we are up until March 9. I have invites to the opening, if anyone would like to come. The opening party is on the 20th.

AUDIENCE: [Question off microphone.]

MR. HANSEN: That is exactly the thing that kept coming up again and again. We had a formal critique, actually, with some curators at MoMA and some artists, and that was extremely important to them: that it be live, and that the process from data to display be legible, and that you weren't kind of tampering with it. There was some notion of unbiasedness that came out there that they didn't really have the words for, but it was definitely there. There is no filtering.

It is funny. Even with no filtering, it isn't the bad words that get people. It is the really bad thoughts, somehow. Chat is like really bad talk radio, but it is even more anonymous, because you don't even give your voice. So, you can type just any horrible thing that is off the top of your head. That is what seems to get people, when something comes across that is just really hateful. It is not even clear to me how you would filter for that kind of hateful stuff, because it is not just the four-letter words or whatever, which would be easy.

AUDIENCE: In terms of interaction with the public, what sort of things have—[off microphone.]

MR. HANSEN: There were some things, actually. We have another scene that we have just put in. So, this documentation video has four scenes: one that goes dee, dee, dee, dee, dee; one that has the spinning and the talking; another where we kind of blast along the streams; and then another that just gives a listing of the user names. We have added a few more. One of them looks, every few hours, at the most frequent ways in which people start their posts. Inevitably, aside from welcome back, which we kind of toss out—everyone gets stop lists, so we toss out welcome back—after that, I'm or I am is the most frequent way that people start their posts. So, we have started kind of a litany of those as one of the scenes. Our host in Seattle, who is this sort of semi-jaded art curatorial type, was in tears over this scene. I wasn't prepared for it.
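The openings tally described above—most frequent ways people start their posts, with stock phrases dropped via a stop list—can be sketched as a simple counter. The stop list, two-word opening, and tokenization here are illustrative assumptions, not the piece's actual rules.

```python
from collections import Counter

# A sketch of the scene described above: tally the most frequent ways
# people start their posts, after dropping stock phrases via a stop list.

STOP_OPENINGS = {"welcome back"}      # illustrative stop list

def top_openings(posts, n=3, words=2):
    openings = Counter()
    for post in posts:
        opening = " ".join(post.lower().split()[:words])
        if opening and opening not in STOP_OPENINGS:
            openings[opening] += 1
    return openings.most_common(n)

posts = ["I am so tired", "welcome back", "I am here", "I am off to bed",
         "welcome back everyone", "my boss sucks"]
print(top_openings(posts))
# → [('i am', 3), ('my boss', 1)]
```

Re-running the tally every few hours over the recent window is what turns the counts into the "litany" scene.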
You know, you present this in a kind of human way because, at the end of the day, it is about people communicating. If you present this in a reasonably straightforward way, I think it has an impact, and that sort of surprised me. I should say, in Seattle, a very strange thing happened. We were there for three weeks. The first week we were just setting up, and the second and third weeks we were open to the public. The third week, we got written up in the Seattle Times and what have you, and we started marching up this crazy attendance curve. The Wednesday we had like 50 people, the Thursday it was 90, the Friday was 150, the Saturday was 311, and the Sunday it was 652, who came and just sat in this room for an hour, hour and a half, at a time. It blew my mind. People would come up afterwards and tell me what they thought it was doing and what they thought about it, and that was very surprising to me, that it would be so well received. It made me very nervous, at the same time.

AUDIENCE: [Comment off microphone.]

MR. HANSEN: I heard someone mention anomaly detection earlier. People have asked, could you use this to scoop up lots of chat and then find a terrorist or something. I think our approach has been to sample and give a big picture, like what is sort of the broad—I don't know that there is any practical use for it, really. I mean, there is a lot of

data analysis that I think is interesting to pursue but, like, practical, I don't think so.

AUDIENCE: I guess I disagree with that. If you think of an analyst who has to try to assimilate all the information that is coming in, if you are actually moving in a direction—[off microphone]—options that they have, to make it easy for—[off microphone.]

MR. HANSEN: We thought about it. For those sorts of systems, we thought the sonification would be a natural, because you could hear a process drifting before an alarm would be set off in a statistical process, like in a control chart of some kind. So, we thought about that and we toyed with it. Then we were quickly derailed with this text stuff and went off in a different direction. I think that is an application: you will be able to hear shifts in the background, even in something you are not directly attending to. So, for real-time process monitoring, I think it has got some applications for sure.

AUDIENCE: [Question off microphone.]

MR. HANSEN: We do that all the time with our laptops and say, oh, this is a problem.

AUDIENCE: I would point out that the accelerator—[off microphone]—if it deviates one little bit, you notice it. If it is something phenomenal, you would hear that.

MR. HANSEN: I mean, we do a lot of information gathering with our auditory system in our regular life. We start up the car and immediately we know if there is a problem. We are doing that all the time. The question is, can you apply that in a data sense.

AUDIENCE: I was wondering if you had spoken with some of the people who are working on sonification of things like Web pages, and mathematics.

MR. HANSEN: We have been to a couple of these ICAD meetings.
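The drift-before-alarm point can be sketched as a toy sonification of a Shewhart-style control chart. Everything here is an assumption for illustration (the base pitch, the half-octave-per-sigma mapping, the 3-sigma alarm threshold); it is not the installation's code.

```python
BASE_HZ = 440.0  # assumed base pitch (A4) for an in-control process

def sonify(stream, mean, sigma, alarm_z=3.0):
    """Map each observation's z-score to a pitch, so a drifting process
    is audible as a rising tone well before the 3-sigma alarm fires."""
    for x in stream:
        z = (x - mean) / sigma
        pitch = BASE_HZ * 2 ** (z / 2)  # half an octave per sigma of drift
        yield pitch, abs(z) >= alarm_z

# A slow upward drift: the pitch climbs for several steps before any alarm.
drift = [0.1, 0.4, 0.9, 1.5, 2.2, 3.4]
for pitch, alarmed in sonify(drift, mean=0.0, sigma=1.0):
    print(f"{pitch:6.1f} Hz  alarm={alarmed}")
```

The listener hears the tone rising from the first step, while the chart itself stays silent until the final observation crosses the threshold.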
So, there is an international community for auditory display, and they have a meeting every year in some very exotic places. When it was in Finland, I remember there was a session—it was crushing to see how primitive that technology was, how blind people were forced to mouse over a page until they hit the radio button or something. It was horrifying to see what the state of the art was at that point. That is a huge area for work. David and I were at a digital libraries meeting at DIMACS, where we were supposed to propose things that people should be doing as more and more libraries become keepers of more and more digital data. One of the things we were pushing for was assistive technologies like that. Horrifying is the word for the kinds of things that people have to do. Maybe I am more sensitive to it, because my mom is slowly losing her sight. I am trying to get her to use speech recognition and things like that, and it seems like a really good research area.

AUDIENCE: Different kinds of voices, I didn't hear any different voices—[off microphone]—voices in different languages.

MR. HANSEN: It is the typical kind of male engineer response when you go, well, why aren't there any women's voices.

So, we asked the people who make the text-to-speech engines, and they said, well, we have had a lot of problems with her; they can't quite get it. There are two sort of voice qualities. At the high-quality end, they only have the male voice. At the low-quality end, they have several males and several females. We really wanted the high-quality one, because it just sounds so much better. They have one now that we just started getting working as we went to Seattle, and we are hoping we can get it wedged into the Whitney show. That was one criticism that we had: when you get these 45 voices going at once—even though they are British inflected—it is a very male space, and sometimes it can be very heavy. The female voice is quite nice; it sounds something like an NPR announcer. She just keeps crashing, though.

We had a problem with the male voice initially—actually, this is a nice story. We had a set of test data; we weren't running it on live data. Inevitably, after two hours it would crash. Just before we were going to Seattle, this kept us debugging and working and figuring out. The engineers from the Lucent speech group were in the studio until 2:00 and 3:00 in the morning. In the end, it was the word "abductor." There was something about the word abductor that brought the whole thing down. They had to bring in some new thing or whatever. I thought it was beautiful that it was the word abductor. It kept us in the studio for a very long time. There was a fix, and we think something like that can fix the female voice, but as of last Thursday—because these things always happen on Thursday—the last of the text-to-speech people at Bell Labs were laid off.
We are hoping that we will be able to get something like that going. They have French, they have Italian, they have Spanish. We stayed away from other languages, because I can barely speak English. I can barely do what I need to do and see that it is working right in English, much less in these foreign languages. There is that capacity, though. If, somehow, we find someone with a better grasp of languages, we can try that.

AUDIENCE: This being a collaboration with artists, was there any piece that made it really difficult to understand certain parts, given the real mathematical sorts of things—

MR. HANSEN: We are a slightly weird pair. I took a lot of art classes as an undergraduate; my collaborator took a lot of computer science classes as an undergraduate. To the extent that the stats and the computer science overlap, I can find someplace to meet. We did have some very difficult discussions around the concept of sampling. In fact, this came up at the critique, where the curators kept using the word "privilege": why are some data points privileged over others? It is not that they are privileged—it is sort of a hard thing to get over. We had some really rocky evenings where we had to explain that we don't need to take all of the data and throw it at some collection port. The UDP protocol has no delivery guarantee, so packets will get dropped all over the place. Rather than sending a million messages at this poor port and just grabbing whatever sticks, we could send as many as we need—that was a concept

that was just really hard to get through. Then, like I said, there was this privileging concept, and legibility seems really important. Initially, there wasn't a lot of trust—not from my collaborator, but from the curators and such—about what this is all doing. Something like this would be too easy to twist. If we only went to Aryan Nation Web sites, the thing would have a totally different character than it does now.

The other thing I have noticed—and I am sorry to be kind of yammering—is that these media artists are a lot more savvy technically than we give them credit for. Maybe not statistically, but software- and hardware-wise, a lot of them will run circles around us. That is kind of why my position at UCLA, which I am starting in April, is joint between media arts and statistics, and I will be teaching joint classes. I think it will be interesting to have media arts students with stats students, in the sense that the stats students aren't going to have the same kind of computing skills that the media arts students will, and the art students just won't know what to compute. So, it is going to be an interesting interplay of approaches to problems. Introducing both a group of media arts students and statistics students to the concept of a database and how a database works, there is a big thrust now about database aesthetics, not just the politics of databases, but an aesthetic to them as well. So, I think that is going to be kind of interesting.

I suppose the last comment I want to make is that my collaborator has this interesting attitude that everything should be doable, and that pushes me a little farther.
Of course, we should be able to string together these 231 displays and, of course, we should be able to update the entire grid 25 times a second. That has been the other thing: of course we can do it, and we just haven't hit the limits yet. Thank you for your time.
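The sampling argument from the UDP discussion earlier (sample deliberately rather than flood a lossy port and grab whatever happens to stick) can be sketched with reservoir sampling, which keeps a fixed-size uniform sample from a stream of unknown length. This is an illustrative sketch, not the installation's actual code.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep k items from a stream of unknown length, each retained with
    equal probability, without ever storing the whole stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive upper bound
            if j < k:
                sample[j] = item
    return sample

messages = (f"post-{i}" for i in range(10_000))
print(reservoir_sample(messages, k=5))
```

Unlike letting the network drop packets arbitrarily, this gives a sample whose statistical properties are known, which is the distinction the curators' "privilege" question was circling.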