Suggested Citation:"Data Mining, Unsupervised Learning, and Pattern Recognition." National Research Council. 2004. The Mathematical Sciences' Role in Homeland Security: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/10940.



James Schatz
"Introduction by Session Chair"

James Schatz is the chief of the Mathematics Research Group at the National Security Agency.

DR. SCHATZ: Thank you, Peter. We had a session in February down at the Rayburn Building in Washington, sponsored by the American Mathematical Society, to talk about homeland security, and as an introduction to our first session here I would like to give some of the remarks we made down there, which I think are relevant here.

It is a wonderful privilege to be here today. In these brief remarks I would like to describe the critical role that mathematics plays at the National Security Agency and explain some of the immediate, tangible connections between the technical health of mathematics in the United States and our national security. As you may know already, the National Security Agency is the largest employer of mathematicians in the world. Our internal mathematics community is a dynamic professional group that encompasses full-time agency employees, three world-class research centers at the Institute for Defense Analyses that work exclusively for NSA, and a network of hundreds of fully cleared academic consultants from our top universities. As the chief of the Mathematics Research Group at NSA and the executive of our mathematics hiring process, I have been the agency's primary connection to the greater US mathematics community for the past 7 years. The support we have received from the mathematicians in our country has been phenomenal. Our concern for the technical health of mathematics in our country is paramount.

Perhaps the most obvious connection between mathematics and intelligence is the science of cryptology. This breaks down into two disciplines: cryptography, the making of codes, and cryptanalysis, the breaking of codes. All modern methods of encryption are based on very sophisticated mathematical ideas. Coping with complex encryption algorithms requires at the outset a working knowledge of the most advanced mathematics being taught at our leading universities and, at the higher levels, a command of the latest ideas at the frontiers of research.

Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research, areas of research that are directly related to the mission of the National Security Agency. While advances in telecommunications science and computer science have produced the engines of the information age, that is, the ability to move massive amounts of digital data around the world in seconds, any attempt to analyze this extraordinary volume of data to extract patterns, to predict behavior of the system or to recognize anomalies quickly gives rise to profound new mathematical problems.

If you could visit the National Security Agency on a typical work day, you would see many, many groups of mathematicians engaged in lively discussions at blackboards, teaching and attending classified seminars on the latest advances in cryptologic mathematics, arguing, exchanging and analyzing wild new ideas, mentoring young talent and, most importantly, pooling their knowledge to attack the most challenging technical problems ever seen in the agency's history. You would hear conversations on number theory, abstract algebra, probability theory, statistics, combinatorics, coding theory, graph theory, logic and Fourier analysis. It would probably be hard to imagine that out of this chaotic flurry of activity and professional camaraderie anything useful could emerge. However, there is a serious sense of urgency underlying every project, and you would soon realize that the mathematicians of NSA are relentless in their pursuit of tangible, practical solutions that deliver critical intelligence to our nation's leadership. The mathematicians of NSA, the Institute for Defense Analyses and our academic partners are the fighter pilots in a war that takes place in the information and knowledge layer of cyberspace. As Americans you would be very proud of their achievements in the war on terrorism.

The National Security Agency's need for mathematicians is extreme right now. Although we hire approximately 50 highly qualified mathematicians per year, we actually require more than that, but the talent pool will not support more. Over 60 percent of our hires have a doctorate in mathematics, about 20 percent a master's and 20 percent a bachelor's degree. We are very proud of the fact that 40 percent of our mathematics hires are women and that 15 percent are from underrepresented minority groups. Of course, the agency depends solely on the greater US mathematics community to educate each new generation of students, but we also depend on the professors at universities across the country to advance the state of mathematics research.

If the US math community is not healthy, the National Security Agency is not healthy, and I always like to use an occasion like this to thank everybody here for all they have done for math in this country, because our agency benefits so greatly.

Okay, this first session here is on data mining, unsupervised learning and pattern recognition. This is a very exciting, very active area of research for my office and for the agency at large. We recently attended the SIAM Conference on Data Mining, just about a week ago here in Washington, and had a great presence there. It is a wonderful topic. There is absolutely nothing going on in this conference that isn't immediately relevant to NSA and homeland security for us, and this first topic is an area of research in which I think we had a bit of a head start. We have been out there doing this for a few years, but there is a whole lot to learn. It is a young science. So, let me without further ado bring up our first speaker for this session, and that is Professor Jerry Friedman from Stanford University.

Introduction by Session Chair
James Schatz

Perhaps the most obvious connection between mathematics and intelligence is the science of cryptology. This breaks down into two disciplines: cryptography, the making of codes, and cryptanalysis, the breaking of codes. All modern methods of encryption are based on very sophisticated mathematical ideas. Coping with complex encryption algorithms requires at the outset a working knowledge of the most advanced mathematics being taught at our leading universities and, at the higher levels, a command of the latest ideas at the frontiers of research.

Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research. While advances in telecommunications science and computer science have produced the engines of the information age (that is, the ability to move massive amounts of digital data around the world in seconds), any attempt to analyze this extraordinary volume of data to extract patterns, to predict behavior of the system, or to recognize anomalies quickly gives rise to profound new mathematical problems.

Although the National Security Agency hires approximately 50 highly qualified mathematicians per year, it actually requires more than that, but the talent pool will not support more. Of course, the agency depends on the greater U.S. mathematics community to educate each new generation of students, and it also depends on the professors at universities across the country to advance the state of mathematics research. If the U.S. math community is not healthy, the National Security Agency is not healthy.

Jerry Friedman
"Role of Data Mining in Homeland Defense"

Jerry Friedman is a professor in the Statistics Department at Stanford University and at the Stanford Linear Accelerator Center.

PROF. FRIEDMAN: Jim asked me to talk about the role of data mining in homeland defense, and so in a weak moment I agreed to do so, and in looking it over I discovered that, unlike many other areas of the mathematical sciences, there is a well-perceived need among decision makers for data mining in homeland defense. So, I have a few examples.

Here is an excerpt from a recent speech by Vice President Cheney. He said, "Another objective of homeland defense is to find connections with huge volumes of seemingly disparate information. This can only be done with computers and only then with the very latest in data linkage analysis."

Here is a slightly higher decision maker. This is from the President's Office, Homeland Security Presidential Directive 2, in a section on the use of the best opportunities for data sharing and enforcement efforts. It says, "Recommend ways in which existing government databases can best be utilized to maximize the ability of the government to identify and locate and apprehend terrorists in the United States. The utility of advanced data-mining software should be addressed."

Here is the trade journal, the Journal of Homeland Security: Technologies such as data mining, as well as regular statistical analysis, can be used at the back end of biodefense communication networks to help analyze implied signatures for covert terrorist events. In addition, data mining applications might look for financial transactions occurring as terrorists prepare for their attacks.

Okay, here is the popular press: Computer databases are designed to become a prime component of homeland defense. Once the databases merge, the really interesting software kicks in: data mining programs that can dig up needles in gargantuan haystacks.

Okay, as many of you know, DARPA has set up an Information Awareness Office, and it was charged, among other things, with looking into biometrics, speech recognition and machine translation tools, data sharing among the agencies for quick decisions, and knowledge discovery technology (knowledge discovery is another code word for data mining) that uncovers and displays links among people, content and topics.

Here is my favorite. It is not quite germane, but this is a comment by Peter W. Huber, not Peter Huber the statistician but the engineer from MIT, and he said that in this new era of terrorism it will be their sons versus our silicon. That is a rather startling point, but I think part of our silicon will be data mining algorithms running on our computers. And the data mining bureau, Interpol and the DARCY(?) coined the phrase MacInt for machine intelligence. It sounds like something that might come from either Apple Computer or a hamburger chain, but we need MacInt, or machine intelligence, capabilities to provide cuing or early warning from data pertaining to national security.

So, there doesn't seem to be a need to convince decision makers that data mining is relevant to national security issues. Many think it is central to national security issues. So, I think the problem here is not in convincing decision makers of the need for data mining technology but in living up to the current expectations of its capabilities, and that I think is going to be a big job.

Now, what is data mining? Well, data mining is about data and about mining. Okay, let us talk about data. What are the kinds of data we are going to see in homeland security applications? Well, there would be real-time, high-volume data streams; massive amounts of archived data; distributed data that may or may not be centrally warehoused (hopefully it will be centrally warehoused, but you can't centrally warehouse it all); and of course many different data types that will have to be merged and analyzed simultaneously, like signals, text, images, transactional data, streaming media, computer-to-computer traffic, web data and of course biometric data. So, that is the kind of data we will be seeing. I will have more to say about that in a moment.

Where is that data going to come from? Well, there is a big effort, as most of you know, in the government for data and information sharing. The buzzword is to tear down the information stovepipes between the various agencies, who apparently guard rather carefully all their own databases and information. Now, the big effort is to merge them. So, here is a quote from an unnamed official in the Global Security Newswire. It says, "There are many community-wide data mining architectures that are being looked at to allow information sharing among intelligence and law enforcement communities," and a big example of that is going on right now. It is the merging of the CIA and the FBI databases for forensics, and that will be made available to state and federal agencies so that they can query it. This year the federal government is spending $155 million on this effort of tearing down information stovepipes, and next year they have requested over $700 million for that.

So, that is where the data is going to come from. Here is another example, again one that I like, of where the data is going to come from. I got this from www.privacy.org: Federal aviation authorities and technology companies will soon begin testing a vast air security screening system linking every reservation system in the US to private and government databases, designed to instantly pull together every passenger's travel history and living arrangements, plus a wealth of other personal and demographic information. The network uses data mining and prediction software to profile passenger activity and intuit obscure clues about potential threats even before the scheduled date of the flight. That is a tall order. Now, www.privacy.org doesn't necessarily consider this a positive development, but it certainly is an example of the kind of data that we are going to be seeing.

Now, what are the applications? Again, when you talk about data mining, the first thing you have to think about is where the data is going to come from, and the second is what you want to do with it: what are the applications? Okay, well, the applications read like the titles of the sessions of this conference. Obviously bioterrorism: detect early outbreaks of diseases and look for characteristic signatures of bioterrorism versus some other, natural outbreak. Cybersecurity, which of course is being studied on the commercial front quite a lot: intrusion detection. Protection of physical infrastructure with surveillance systems: lots of data are going to come from security cameras everywhere, and the protection of other infrastructure such as transportation, water, energy and food supplies. Of course, document classification; as was mentioned before, we have to recognize coded messages.

There are some more applications. Again, they are all going to be discussed here, and the mathematical sciences are relevant to all of them. Behavior recognition: travel behavior, financial behavior, immigration behavior. Automatic screening of people, mail and cargo. And of course forensics, post-apprehension: if there is an attack, you want to catch the perpetrators and all those who planned it so they can't plan another one.

Now, what all of these problems have in common is that they are needle-in-a-haystack problems; namely, you are looking for a very, very small signal in a very, very big background.
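
To make the "small signal in a big background" point concrete, here is a minimal sketch using Bayes' rule. The function name and all three numbers are hypothetical, chosen only for illustration; none of them comes from the talk.

```python
def posterior_given_alarm(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: probability that a flagged case is a real target."""
    true_alarms = prevalence * sensitivity
    false_alarms = (1.0 - prevalence) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical numbers: one needle per million records, a detector that
# catches 99% of needles and wrongly flags 1% of the background.
p = posterior_given_alarm(prevalence=1e-6,
                          sensitivity=0.99,
                          false_positive_rate=0.01)
print(f"P(needle | alarm) = {p:.6f}")
```

Under these assumed rates, only roughly one alarm in ten thousand points at a real needle, even though the detector looks accurate on paper. The smaller the needle relative to the haystack, the more the false alarms from the background dominate.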

Now, what is the current state of the art in data mining? Well, to talk about that in 20 minutes is kind of hard, but let me give you just a very rough overview. The most successful applications of data mining have been commercial applications, which you can basically characterize as behavior recognition using transactional data. Classic examples are fraud detection: fraudulent phone calls, fraudulent credit card usage, tax evaders. Those are classic applications in the commercial world. Churn analysis, where you look for customers who are likely to quit your services for your competitor's; this is very important among competitors who offer virtually the same service, like telephone companies. They do a lot of that. Purchasing patterns: recommender systems have been reasonably successful, along with other ways of trying to detect consumer purchasing patterns so you can promote very specific things and hopefully make more money. Direct marketing: who do I send my brochures to? And in general customer relationship management; many texts on data mining actually have customer relationship management in their title. That has been the main focus and the main success so far in data mining, and of course, stretching a little bit, document search over large corpora, like web and library search engines, has obviously been very successful.

Now, homeland security problems, if you look at them, are very similar qualitatively to many of the commercial applications. They are of the same kind but, in my view at least, are different in degree, because I think they are much harder, and here are the reasons why. First of all, the haystacks are bigger. We are looking for needles in haystacks. Although data mining has been applied to some very large commercial databases, the government databases are going to be even larger, and they are going to be more diffuse in the kinds of information, and the needles are going to be smaller. I mean, hopefully there are fewer terrorists than people who are trying to make fraudulent phone calls or trying to cheat on their taxes, and so the needles are going to be really small, and the stakes are much higher. If your data mining algorithm on a consumer database doesn't quite get all the customers who might be attracted to your product, well, your client is going to realize slightly less profit and probably won't even know it.

Here, if we make a mistake, the mistakes are highly visible and can lead to catastrophes, and so the stakes are very much higher. We need fast response. We can't take time to print up our brochures. We don't have any luxury of time. We have to worry about what the machine learning people call concept drift; namely, the underlying joint distribution of the variables, if you like, is changing with time, for both the signal, because the terrorists aren't stupid, and the background. People's shopping patterns are likely to change, or they are changing because of our intervention in homeland security. And we have an intelligent adversary, and I think that is one of the things that makes these kinds of problems unique; not that consumers aren't intelligent, but they generally don't try to evade you.

So, the data that we are going to see I characterize as maliciously observational data. First of all, it is observational. You can't design experiments, and all data mining is applied to observational data, but more importantly, they know we are watching. They are going to try to disguise their transactions, and as they realize we are detecting them, they are going to change the signatures in real time, and they can also jam us with deliberate decoys. They can go through the motions of planning an attack, do all the things we know they would do, and then not attack, and do that repeatedly, just like car alarms that you ignore now, until finally you say, "Well, this isn't working. We keep predicting an attack and it is not happening. Obviously we are making many false positive errors. So, something is wrong with our system," and maybe nothing is wrong with our system. So, this maliciously observational data I think is a new twist. Certainly existing data mining algorithms may be valuable to tell us about fraudulent phone calls, but I think this is a really new twist to this kind of thing.

So, what are the properties that our methodology is going to have to have? It is going to have to be fast both in training, that is, in developing the data mining detection algorithms, and in predicting, and that will leave out a lot of traditional data mining, because the computing requirements are going to be intense. The methods are going to have to be real time, of course, timely, accurate, robust and adaptive, because everything will be changing with time.

What are some of the other attributes of the methodology that we are going to have to have? It is going to have to be dynamically updatable. Again, with lots of data mining algorithms, you run them on the data; if you change the data, you have to run them on the data again, and we can't tolerate that. You are going to have to model complex, changing, surely nonlinear relationships, and those relationships, as I mentioned before, will be changing for both the signal and the background. To the extent we can do any inference at all, it is going to have to be assumption free, and I think most important is that we will have to effectively incorporate new knowledge, which in most traditional data mining we haven't done, with the possible exception of Bayesian belief networks, which, if we can ever get them fast enough, might really have a lot to offer here. In my view we are not there yet.

Now, of course, if all you have is a hammer, every problem looks like a nail, and I think that all of us working on homeland defense issues will bring our own favorite set of tools to these problems. In the data mining area, popular techniques are decision trees, rule induction, support vector machines, neural networks and a lot of other things, and it is not clear, at least in my mind, which if any of these will provide success. They all have advantages and disadvantages, and none of them has enough of the advantages that are going to be required for the kinds of accuracy, speed and other properties that will be needed, and all of them, if they are used, will need a big increase in their performance, both computational and statistical.

I don't think all is lost. This allows us to do research into new methodology, and often new good things can happen. First of all, as I said, we will have new kinds of data. The data itself won't be of new kinds, but the fact that we have to merge it and analyze it simultaneously I think is fairly new: the fact that a single database will have transactional data, audio, video and text data all in it, and we will have to treat it all simultaneously. Nearly everything we measure will be irrelevant. Take, for instance, the reservation database, the travel database: of all the things that they are going to pull out of there about the nature of a trip, very few will be really relevant. Very few attributes will be relevant to detecting a terrorist attack, and the performance requirements, as I mentioned before, both computational and statistical, are very high: at best severely challenging, and probably exceeding the present state of the art.
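
As a toy illustration of what "dynamically updatable" and "adaptive" can mean in practice, the sketch below keeps an exponentially weighted running mean and variance and flags values far from the current estimate. It is a generic textbook device, not a method from the talk; the class name and the `alpha`, `threshold` and `warmup` parameters are hypothetical choices, and it makes no attempt to handle an adversary who deliberately games the adaptation.

```python
import math


class StreamingAnomalyDetector:
    """Exponentially weighted mean/variance with a z-score style alarm.

    Hypothetical sketch: alpha (forgetting rate), threshold and warmup
    are illustrative tuning knobs, not values from the talk.
    """

    def __init__(self, alpha=0.05, threshold=4.0, warmup=10):
        self.alpha = alpha
        self.threshold = threshold
        self.warmup = warmup
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def update(self, x):
        """Score x against the current model, then fold it in (O(1) work)."""
        self.count += 1
        if self.count == 1:
            self.mean = x
            return False
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0.0
            and abs(x - self.mean) > self.threshold * std
        )
        # Incremental update: no pass over old data; old data fades at rate alpha.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        return is_anomaly


detector = StreamingAnomalyDetector()
background = [detector.update(10.0 + (i % 2)) for i in range(50)]
spike_flagged = detector.update(50.0)
print(any(background), spike_flagged)
```

Each new observation costs constant time and memory, so the model never has to be rerun on the full data, and because old observations are forgotten at rate `alpha`, the estimate follows a drifting background. The same forgetting is also a weakness against an intelligent adversary, who could shift behavior slowly enough to stay under the threshold.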

So, there are huge expectations. A lot of these databases are being merged and a lot of money is being spent on the expectation that data mining algorithms will find things in them; that is the raison d'etre for merging all of this data and spending all that money, and so I think that we are going to have to perform and figure out ways to increase computational and statistical performance so that we can really do something with these very large and complex databases when we see them.

However, I think that, as is often the case, wartime requirements lead to scientific advances, and in World War II there were a ton of them in the mathematical area. Of course, it played a huge role, and many aspects of World War II also led to advances in mathematics that led to the invention of, and advances in, computers, which of course have changed our lives in many areas other than code breaking. Advances are certainly needed in data mining if we are going to successfully address the expectations of what data mining can do in homeland security, and if we are successful, even remotely so, we will also have an impact on a wide variety of other areas which are not quite so difficult. So, I think it is a wonderful opportunity and of course very important. Thank you.

(Applause.)

DR. SCHATZ: Thank you, Jerry. Technical note: we request that the speakers speak here at the podium, and if assistance is needed with the changing of transparencies, etc., we have an assistant who will do that. For the camera, though, they would like you up here.

Just a couple of points I wanted to make about our last talk. A wonderful talk. It is very much to the point of my daily life right now. The government problems are typically much bigger than what you are seeing anywhere else, and the needles are very tiny. We talk about volume, variety and velocity where I come from, and one of the good points also made there was the robustness issue. In certain databases the data is very clean. It is very uniform. Certainly for the problems we deal with, we have fragmented data, very badly garbled data, data from different sources that we are trying to pull together, and the scalability of algorithms is absolutely a critical point to be made in all this.

Certainly other fine points we have here are trees, rule-based algorithms, support vector machines and neural nets; we are using all those techniques right now. They all are useful. They all help to a degree. They are all not enough. So, there is really a wonderful opportunity here, I think, to bring some advanced statistics and mathematics into these problems.

Also, a wonderful point about the fact that wartime has certainly in the past inspired some amazing advances in science, the Manhattan Project of course being one, and the cryptography effort; modern cryptography really began, of course, during World War II in the serious sense, and I agree with Jerry's point there exactly. I think if we look at this war on terrorism, the big advance is going to be in information processing and data extraction, because that is the edge we need at this point.

Okay, well, we are a little ahead of schedule, which is fine, and I think Diane is ready. So, our next speaker is Diane Lambert from Bell Labs.

The Role of Data Mining in Homeland Defense
Jerry Friedman

The problem is not in convincing decision makers of the need for data-mining technology but in living up to their expectations. We must be able to handle a variety of data formats, including real-time, high-volume data streams, massive amounts of archived data, and distributed data that may not be centrally warehoused. We also must be able to handle various data types, including signals, text, images, transactional data, streaming media, computer-to-computer traffic, Web data, and biometric data. In addition, data-mining methodologies must operate in real time and be accurate, robust, and adaptive.

In an effort to bring compatibility to various government databases, the government has begun work on a project to merge the CIA and FBI forensic databases. The product will be made available to other agencies, federal and state, for querying. In 2002, the federal government invested $155 million in this effort, and it plans to increase the 2003 budget to more than $700 million.

Data mining has several applications to homeland security: air travel, characteristic signatures of bioterrorism versus natural outbreaks, cybersecurity, intrusion detection, surveillance, and travel, financial, or immigration behavior. Currently, the federal government plans to create a vast air-security screening system linking every reservation network in the United States to private and government databases. The system is designed to instantly pull together every passenger's travel history and living arrangements, plus a wealth of other personal and demographic information, and to use data-mining and prediction software to profile passenger activity and intuit obscure clues about potential threats.

Diane Lambert
"Statistical Detection from Communications Streams"

Diane Lambert is head of the Statistics Research Department of Bell Laboratories, Lucent Technologies, and a Bell Labs fellow. She is also a fellow of the American Statistical Association and the Institute of Mathematical Statistics and has served on NSF and National Academy of Sciences panels and as editor of the Journal of the American Statistical Association. She has also served on ACM and SIAM program committees for conferences on data mining. Her research for the past several years has focused on ways to analyze and model the kinds of massive, complex data streams that arise in telecommunications.

DR. LAMBERT: It is actually a little overwhelming to be here, since I don't work in security, and trying to guess what someone's problems are when you don't work in the area always seems a little bit dangerous. So, what I am going to stick with is communications data. One thing that we have a lot of at Bell Labs is communications data, and you can imagine that communications data has a lot of information in it that would be of interest to people who are trying to track what is going on.

So, there are all kinds of communications records that are kept now about all different kinds of communications, much more than maybe the average person is aware of. Every call, of course, has to be billed. So, for a very, very long time you have had a record of every call that is placed. You know who placed the call, when the call was placed and who was called, and if it is a mobile phone you know where they were when the call was placed, at least generally, because you know which cell site handled the call. You know how they paid for it. There is just a lot of information in this record. Okay, now, there is actually a lot of information about online communications, too.

37 If you request a file, if you want to download a file, then in a log file there is a record of that transfer. It says who, where here "who" depends on whether you allow cookies to be used or not, and "who" might be just an IP address. It is not too meaningful or it may actually be meaningful, but you know when and what was downloaded, which images and everything else. Going beyond that a bit, with online chatrooms, the way they are structured, there is actually a record kept in a log file of everything that is posted. That tells you who did the posting, where again "who" might be an IP address. "Who" might partially be a user name that somebody uses consistently. "Who" might be an actual person in some cases. You know which room they were posting in. You know when they posted and what they posted. All of this is kept in the file. So, there is this wealth of information about what people are doing both generally and in particular. The question is, can you get any useful information out of that? If you want to learn something about users and about people or about groups, you know, what can you actually do with this data? Now, it used to be believed that if you have data then all you have to do is think about it enough and you will

38 be able to extract information from it, and I am not sure we have made a lot of progress here yet, but let me tell you what we have done. Next slide, please? The first question is what kind of information do we want, and you know, it is always a little hard to do particular examples because in any particular example you can say, "I don't want that information," but we will start with the easy one, which is that the easiest flows of all are really just call records. So, they are records of the calls. So, this, for example, is one user. Now, you don't get this user's or this caller's records, in this case it is a cell phone, you don't get the cell phone records all nicely delineated like this. They come all mixed and interleaved together, but for the moment imagine you have got a process, which isn't too hard to imagine, that sort of filters them all for use. You get them all together. Maybe you have just pulled them all off, and you look at them. Now, what you can see here is each one of these vertical lines represents a call. Longer calls are represented by longer lines. You have time going on the horizontal axis, and then you get sort of categorical information by color.

39 Here it is terminating number, and you get things like which calls are incoming, which calls are outgoing, and then from that you can think of describing what the typical behavior of that caller is, and then you can also think of how do I find something that is atypical; how do I detect there is a change in the pattern? Now, if we look at it you can see that all of a sudden on the right there is all this purple stuff that starts happening that has longer lines and there is more of it. So, it looks in fact if you look at it as if there is a change, but actually that is a month's worth of data in the purple part over there, and you don't want to wait a month in order to say, "Oh, yes, there is something interesting going on." Also, you don't get to look at all the past data when you want to make a decision about this one. So, as soon as it turns purple you would like to say something interesting, and you would, also, like to be able to do that without looking at all the past data. So, the question is how do you describe behavior in an ongoing way? You can keep updating and you can start from new callers, okay, people you have never seen before 39

40 because that happens all the time. In such a way you can tell when something interesting is happening. Now, as Jerry said, you can't hope to be perfect here. All we can hope in some sense is to provide a filter and to give a rich set of calling, of numbers to some human analyst that then will look at it to figure out what really is interesting. So, all we are thinking of here is sort of filtering the data to look for interesting behavior but even in that one it is fairly hard. So, what do we have to think about behavior? If you are going to think about behavior and you are going to do it all automatically, all right, you can't think about behavior as I don't know, we can't think about behavior as a picture. We have to think about it as something more structured than that. So, how are we going to represent behavior? What we are going to do is say that behavior is just a probability distribution, that what it means is what are you likely to do and what are you unlikely to do and both of them may be of interest to you. In some cases you want to know what typical behavior is just to know how something is being used, but if you want to detect a change in behavior then you, also, 40

41 have to know what is typical because you have to know its baseline. If something is typical all the time, then you can't actually, you know, if what you are looking for is typical then there is no hope of finding it basically. So, we are looking for this change in behavior. Now, what we have are records for people. Either a record is the phone call record or it may be the posting in the chatroom, but basically you have got records and so that record just has some information in it that you can then extract and put into variables. In some cases it is pretty easy to see what a variable might be. So, for example, a variable might be the person who posted it, the time the post was made, which room it was posted in and then you can, also, get content information, too. You get this record that you have and then that has some distribution that varies widely across people. So, what people, the kinds of posts people make, how often they do it, what the different rooms look like all that is changing. Now, if it is changing too quickly you have no hope of tracking it, but on the other hand most things don't change too quickly. So, the hope is that you can evolve the probability distribution as time goes on. 41
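
The idea of evolving a probability distribution record by record, without revisiting old records, can be sketched as follows. This is only an illustration under assumptions not stated in the talk: the class name `EvolvingHistogram`, the `weight` parameter, and the choice of exponential weighting are all hypothetical, and the first record simply receives all of the probability mass rather than a group-based starting point.

```python
from collections import defaultdict


class EvolvingHistogram:
    """Categorical distribution updated one record at a time (illustrative sketch)."""

    def __init__(self, weight=0.05):
        # weight: how much probability mass each new record receives (assumed value)
        self.weight = weight
        self.probs = defaultdict(float)

    def update(self, category):
        if not self.probs:
            # First record seen: all mass goes to the observed category.
            self.probs[category] = 1.0
            return
        # Shrink every existing bin, then give the observed bin the freed mass,
        # so the bins keep summing to one and old behavior fades gradually.
        for key in self.probs:
            self.probs[key] *= 1.0 - self.weight
        self.probs[category] += self.weight

    def probability(self, category):
        return self.probs.get(category, 0.0)


hist = EvolvingHistogram(weight=0.5)
for time_of_day in ["night", "day", "day"]:
    hist.update(time_of_day)
print(dict(hist.probs))
```

A larger `weight` tracks fast changes but forgets the baseline sooner; a smaller one gives a steadier baseline that adapts slowly.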

42 So, what we want to do is to track those probability distributions. Okay, now, these bulletin boards and chatrooms, there are lots of them out there. There are at least 35,000 web sites that support specialty forums, and some of those have thousands of forums, aol.com, yahoo, right? There are all kinds of forums people can join, and then you know some of them are new. Some of them are technical, just anything that you can imagine and these discussions can be fractured and free flowing. So, most of the time people will post something and no one will answer. It will just go flat, and then about half of the time if there is an answer it is something completely unrelated to anything that has been said before, and probably about 80 percent of it is 18-year-old males seeking something or other. (Laughter.) DR. LAMBERT: So, there is a lot of it that may or may not be interesting. Next slide, please? So, this is just a small example of what a chat record would look like with stripping off who is making the post, that this is September 17. This is the day that the stock market opened after September 11, from the www.financialchat.com, and you can see there are two different

43 posters here. One is purple, and one is white, and they are sort of talking to each other in some sense. So, the question if someone all of a sudden starts saying something interesting are you going to be able to pick it up. We don't actually know that, but what we are actually trying to do first is to understand whether we can understand what, you know, kind of represent what people are doing because if you can't do that, it is going to be much harder to find anything that is interesting. Next slide, please? Okay, so what we are actually doing here, and I am not trying to say that this is the best way to do it at all. I am just trying to say that this is the way we have implemented it, and actually Mark has implemented this. This is beyond my understanding, unfortunately, but I can at least explain this picture. So there are all these discussions going on and these discussions use something called the IRC, the Internet Relay Chat protocol and so they are just writing files kind of like you would write in a log file for a web download. Okay, so, we have a fire wall and on one side of the fire wall -- Next? 43

44 On one side of the fire wall we have our system which is doing the data collection, which is just a bunch of Linux machines basically, and then on the other side of it -- Can you hit the next slide, please? On the other side of it is all the chatrooms, which are all running different kinds of protocols. So, basically you have to be able to deal with all that kind of stuff, but you know that is a hard problem, but that is something that you can at least come up with solutions for. So, at the moment what we are doing is on the right where the chatrooms are we are monitoring somewhere between 5 and 10 thousand chatrooms. To monitor a chatroom doesn't mean that you are necessarily taking everything from it all the time because most of the time chatrooms are actually inactive except for very large ones, and so what you might do is you have to decide what you want to watch. So, if there is only, you know, if there is only a little bit of activity like five posts in the past hour then you may decide only two posters are active right now. You may decide that you don't want to sample that and move on to the next one.

45 So, there is some question about where you want to do it. Now, suppose you actually had some chat streams that you have collected. You bring them up beyond the fire wall and they are all in sort of a funny kind of time order, not exactly time order but they are not in the order of users. They are not in the order of topic, and they are not in the order of even room anymore. It is kind of all mashed together. So, what you have to do, if that is what you are interested in, is use a different unit other than time, and typically time is not interesting when you aggregate over everything because you have to somehow extract from this information about users, information about the group itself. So, let me just go on. So, some of what we are going to get here are things that are variables that you read out directly and some of it is things that you have to extract. So, for example, rate, you have to derive that from knowing the time of this post and the time of the last post and just dynamically adapting a rate estimate over time. Okay, and from other parts of it you are going to have to get out topic information. Next slide, please? Okay, so now what we want to do is to take each user for example or each room and what we are going to say

46 is what is the probability distribution that would allow you to predict what is going to happen next for that person or for that room. Now, you are not going to be able to predict exactly. So, that is where we end up that you need to do this probability distribution. Now, since everybody has a probability distribution, every room has a probability distribution and every topic has a probability distribution all over different kinds of variables what you have to do is to make sure that you are not trying to be too aggressive here. So, you have to have a distribution that is non-parametric because the variables vary so much from one user to another there is no hope of choosing one family with a small number of parameters and just estimating that. It, also, has to be small because you are keeping so many of these. You don't get to have much space or much detail on any one person. I mean maybe something that is not quite so obvious is you want to have it simple and that is because it turns out it is often useful to let people know if you are going to show them something what the state of knowledge about the person that you are showing them was at the time you made your decision. 46
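
The rate variable mentioned earlier, derived from the time of this post and the time of the last post and adapted dynamically, fits the same constraints of being small and simple. A minimal sketch follows, assuming an exponentially weighted average of the gaps between posts; the name `RateEstimator` and the `weight` parameter are hypothetical and are not taken from the system described in the talk.

```python
class RateEstimator:
    """Running estimate of posts per unit time from successive timestamps (sketch)."""

    def __init__(self, weight=0.1):
        self.weight = weight      # influence of the newest gap (assumed value)
        self.last_time = None     # timestamp of the previous post
        self.mean_gap = None      # smoothed time between posts

    def update(self, timestamp):
        if self.last_time is not None:
            gap = timestamp - self.last_time
            if self.mean_gap is None:
                self.mean_gap = gap
            else:
                # Blend the newest gap into the running average.
                self.mean_gap = (1 - self.weight) * self.mean_gap + self.weight * gap
        self.last_time = timestamp

    def rate(self):
        # No estimate until two posts have been seen (or if the gap is zero).
        if not self.mean_gap:
            return None
        return 1.0 / self.mean_gap


est = RateEstimator(weight=0.5)
for t in [0, 10, 20, 40]:   # post times in seconds
    est.update(t)
print(est.rate())
```

Only two numbers are stored per user or room, which matters when many such estimates are kept at once.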

47 So, it can't be so complicated that someone can't look at it and make sense out of it, and then the last two are some technical problems that arise. New users are coming into the system all the time. Somehow you have to be able to start off their probability distribution in a reasonable way and so this makes kind of odd clustering kinds of problems in a way or indexing problems because you could look at a few records for an individual and say,"Oh, this individual looks like they are going to behave like that group of people," and take an average for that group of people to start them off with, and then once you initialize you know that is not very good. It is better than just sort of guessing average over everything in the world, but it is not very good. So, you have to have ways to update it, and this has to be done record by record because you never get to go back and look at all the old records. Once you see them you have to move on. It just takes you too long to go back. Can you go back, please? So, there are many ways to do this. I am not saying that this is the best by any means but just to let you know what we have done. We say that this is a big joint 47

48 distribution. We are going to model it by a set of marginal and conditional histograms so that they don't take up too much space and if you had all of the marginals and conditionals then you could always get back the full joint distribution, but clearly there are some of these that are not so relevant. At any rate, you can come up with ways for choosing marginal and conditional histograms and maybe I will say a bit more about that later, and again you can do this for topics well. I mean not well. You can do these topics as though, you know, you have a topic; you have just standard information retrieval processes that get you started here. Next slide, please? Okay, so, one problem that comes up here that we run into is how are we going to visualize all of this stuff. So, you make a probability distribution up for everyone and then you are trying to track everyone. Is there any useful information you get out about what is going on out there in the world using all those probability distributions that you have collected? You have been just sort of using all that data. Well, you know if we have just the call records we can visualize that, right? You can make that plot and then you can look at it, and you can say, 48

49 "Oh, yes, I can see what is going on here. This person makes calls in between ~ and 2 minutes about such and such a rate. They never make them in the middle of the night." You can look at that and you can visualize that and then you can, also, get a sense of whether you think there is a reasonable pattern there. You can, also, if it spits it out and says, "This is something where fraud is being committed," you can then go back and look at it and say, "Oh, yes, the fraud started here. I believe this," or also equally likely probably the fraud, there is no fraud here. This is just that the system has been confused. Now, in this chatroom thing how are we going to actually visualize because it is much more complex than in the call record world. First of all there is this topic, this text stuff. Well, there is kind of a history in the Bell system of using sound, maybe because of the phone. I don't know. There is kind of a history that comes back from time to time about using sound as well as visualization in order to understand high-dimensional data. So, going back to the 1960s Tukey was involved in verifying atomic test ban treaties and one thing they did was they discriminated earthquakes and atomic blasts using sound, what they called seismometer sounds and so then

50 they put on a map, okay, so you take the data; you would map that to a sound which either represented an earthquake of a certain magnitude or it represented an atomic blast, but you didn't know which. You just had the sound. Okay, then you played the sound that went with that. So, you took this time series of seismometer readings. You map that into sound and then you play that back and then the analyst can tell. I didn't bring it with me, but it is pretty easy to tell which is the atomic blast and which is the earthquake. So, that was one case where sound was very successful. There are other works about a decade later, going back to Max Mathews, who is a musician among other things, and John Chambers, about using sound to represent point clouds. Now, more recently Mark Hansen, who is at Bell Labs, and Ben Rubin, who is at something called the Ear Studio, had a grant from Lucent in order to see how sound could be used to visualize web data and so what they do is they use sight and sound to represent the evolution of chatrooms. They do this with a huge number of speakers, and little displays. Next one, please? Okay, so this would be like one of what a little display would be. So, basically you have got this text coming and right now you can think of doing it in terms of

51 how the topics are evolving in time and so, as these things come in you can cluster the topics because these topics are represented by a bag of words, and you can cluster those, and if you use some kind of online algorithm which is again, you know, there are ways to do it but not necessarily good ways, you can do dynamic clustering and see what is coming out over time. You can then map those to displays that are high dimensional, and so this is actually one which I had no idea what it was about, but I think it is about a gecko is what I have been told and so people, there is a bulletin board or a chatroom for people who have geckos as pets and this is a topic for them which I guess is the only thing on about that time of day. Next slide, please? So, what we would like to do is to be able to visualize what is the typical activity. Then what you can do is you can tabulate these posts by their length so that longer posts appear as bright pixels in the display. You can look for typical posts. Okay, so before we might look for long calls. Now, we are looking for sort of long posts or long trains of posts. You can look for what is a typical post. You can look for what is a typical topic. The clustering can be 51

52 used as the display changes as you keep updating these cells and as they change. That means a new topic has come in or the topic has changed somewhat and what is shown in that display is you pick a cluster and then you pick a representative phrase from that cluster. You can also do this for users. Next slide, please? Now, that is really as far as we have gotten to be honest, but now that you have these probability distributions at least in principle you can think about trying to detect interesting changes in behavior. Now, things that are subtle like Jerry said there is no, you know, it is not really going to be too easy to find it. I suppose it is too pessimistic maybe to say that there is no hope, but what we would like to do is to just find out when there has been an interesting change here. So, we have again this huge distribution. You have got a record. You have put it into, you know, now it has defined variables into it, whether those are topics in terms of clusters of words, whatever or just who did it. You have got all these variables. You can make that up into a probability distribution. You are going to keep all the marginals and some of the conditionals and the first technical step that you 52

53 come into here is deciding what to keep because you can't keep everything and the twist here from maybe the usual data mining procedure is that what you want to do, you would like it to be the best possible choice of variables for each individual. So, maybe time of day is important for somebody but it isn't important for someone else, but what is the best possible set of variables changes a lot from one person to another. So, you can't actually optimize for everyone at the same time, but you can take some large set of users and train from them and then what you can try to do is to figure out a set of variables which isn't too hard that is useful for a large fraction of users and then you would say that if it is useful for a large fraction of people then you will keep it, but I think this is one place where things get a little murky, one of many, many places, but now if you have that you get probability distribution for what the person usually does say or what the topic usually is in this chatroom. Now, at the same time you can have a probability distribution for what you are looking for. So, if you are looking for fraud you have some kind of training data; usually it is not very good training data but you have

54 some, and you can use that to make another probability distribution which is what is sort of likely under a fraud scenario and what is unlikely under a fraud scenario and you can use that to direct what is an interesting departure from normal behavior because not all departures from normal behavior are interesting. If someone always makes a 10-minute call and then all of a sudden they make a 3-minute call, well, that is change, but it is just not very interesting most likely. So, what we want to do here is to direct it by comparing the probability distribution of what usually goes on with the probability distribution of what you are looking for. Now, if you do this in a log likelihood ratio then big changes in either one should be flagged here. Now, as I said, this is easy to do in a call record case. So, you know, just giving you a sense here, what happens here is just plotting the cumulative log likelihood ratio of the positive contributions. You don't want to put the negative ones in because there is some fraud superimposed over the legitimate behavior and you don't want legitimate behavior to cancel out the fraud. So, at any rate that is just what the phone picture is.
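
The scoring just described, a cumulative log likelihood ratio that keeps only the positive contributions, can be sketched in a few lines. The sketch assumes both distributions are simple dictionaries of category probabilities and that the ratio is taken as target behavior over typical behavior; the function name `positive_llr_score` and the `floor` used to avoid dividing by zero are hypothetical choices, not details from the talk.

```python
import math


def positive_llr_score(records, typical, target, floor=1e-6):
    """Return the running sum of positive log likelihood ratios (sketch).

    typical: probabilities of each value under the user's usual behavior
    target:  probabilities of each value under the behavior being sought
    floor:   assumed minimum probability for values missing from a histogram
    """
    score = 0.0
    trace = []
    for value in records:
        p_target = max(target.get(value, 0.0), floor)
        p_typical = max(typical.get(value, 0.0), floor)
        llr = math.log(p_target / p_typical)
        # Negative contributions are dropped so that legitimate records
        # cannot cancel out evidence from the suspicious ones.
        score += max(llr, 0.0)
        trace.append(score)
    return trace


typical = {"short": 0.9, "long": 0.1}
target = {"short": 0.2, "long": 0.8}
print(positive_llr_score(["short", "long", "long"], typical, target))
```

Plotting the returned trace against time gives a curve that stays flat during ordinary records and climbs when records look more like the target than like the baseline.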

55 Next one, please? So, I think that is kind of all I have to say about communication data streams, that they are large; they are evolving; they have incredibly complex structure, but in fact it is probably not impossible to get some useful information. Now, whether you can detect really small signals I don't know, but you can certainly understand what is typical. You can take this information and you can formalize it and structure it in such a way that it is actually useful, but we don't necessarily know how to do it well. There are processing challenges. There are topic classification and evolution challenges. There are algorithm challenges for updating these things, and then for filtering, how you decide whether something is worth showing to a human analyst, because none of these systems as yet, as Jerry said, are foolproof or ready to just set off some long list of actions based on an automated detection scheme. Thank you. (Applause.) DR. SCHATZ: Are there any questions? We are running ahead a little bit. If anybody had any questions for Diane and we could even get Jerry back up

56 here I am sure if there are questions. Does anyone have anything right at the moment? PARTICIPANT: I was intrigued by your comment about using sound data. Generally we find little response; they feel it is uninteresting. Do you agree with that decision? DR. LAMBERT: I think it is too harsh to say that it is not interesting. I think that when you have, you know, I think for the point cloud work it is hard to find examples where it adds a lot of value. I think here where there is actually text and words it may in fact help. Somehow your ear can pick out different words in conversations better than maybe your eyes can pick different words out of a page. So, I don't know whether it is an answer to anything, but I think it helps here. PARTICIPANT: Let me ask a follow-up question on that. This is not necessarily a question for the speaker but maybe someone in the audience. My impression is that there is at least an order of magnitude difference between sight and sound. That might have a significant bearing on the data. DR. LAMBERT: I think you don't want to replace visualization with sound but you might want to augment it. 56

57 PARTICIPANT: Presumably that would be much more useful. Presumably the phone company of course would have conversations in English. Now can you tell from the patterns what certain; you know, do patterns change in different ways? DR. LAMBERT: I have no idea. This system, quote, system, I wouldn't call it a system. This software we wrote, that Mark wrote, doesn't know anything about English. What it has is a dictionary, right, and then it has to pre-process the chat stream to get rid of things like pronouns, which have very low content, and conjunctions, which have very low content, and some other words. Then it gets basically a bag of words and it doesn't really care whether they are Urdu words. It gets rid of these symbols you know and kind of the web speak stuff like LOL, but it doesn't really have any sense of meaning, not at the moment. It probably would be reasonable to put that in, but we did something that is much cruder than that. PARTICIPANT: I. J. Good had a suggestion years ago. He was imagining the Allies in World War II built a broadcast tower and broadcast 24 hours a day pure noise except at predetermined times something came through in

58 something that is fairly plain text. In the modern communications era it seems to me that terrorists could publish noise on the Internet all the time and just overwhelm the detection capabilities except for at predetermined times, you know, 12:03 a.m. on Fridays there is something that is actually important and that is how communication happens. DR. LAMBERT: If there is, I mean that depends. If somehow the nonsense is pure nonsense and the vocab changes, the vocabulary changes when there is something important to say then you have some hope, but if everything has been seen before then you are right. You are going to need something more sophisticated than this. DR. SCHATZ: A couple of other points I wanted to make on Diane's presentation. She made a very good point for our work which is the idea that a database isn't just a pile of stuff that is sitting there statically. There is really a very important challenge in tracking behavior over time and really trying to keep track of things. Some bit of data may have arrived a year ago that is relevant today, and that is a big challenge. It is not just what came in in the last 5 minutes. The other thing which may be covered here in the conference at some point but I couldn't exactly see where 58

59 it fit in which is very important to us and might not seem all that critical to the high-end math community when you think of it at the beginning is data visualization tools. At the end of the day we do all of our analysis and there could be some huge algebra computation or Markov models, who knows what is going on behind the scenes, but the people who are the first line customers of our algorithms, at our place we call them analysts, they need to be able to get what they need to get out of these things very quickly, and the data visualization aspect of this is critical for us, and so, I just point that out as an area where we might not have thought about that aspect of our work, but that presentation at the end, somebody has to act on something. The other thing of course that came up in Diane's talk, a couple of points was just good old fashioned clustering algorithms, very important to us, and I think when we start talking about data sets of these magnitudes that we are all up against change point algorithms, clustering algorithms very, very important. Okay, I think we are all set. Our next speaker is Rakesh Agrawal from IBM-Almaden.

60 Statistical Detection from Communications Streams
Diane Lambert

At Bell Labs, records are kept about many types of communications, including phone calls and online communications. One useful way to analyze behavior from communications data is to derive probability distributions that indicate what a user is likely and unlikely to do, either of which may be interesting to the analyst. First, it is important to define "typical behavior." Then, it is possible to detect changes relative to the baseline.

A current research project at Bell Labs involves monitoring several thousand chat rooms in order to extract from the resulting data a probability distribution for each user and each room. It is also possible to devise a joint distribution and model it by a set of marginal and conditional histograms. The value of organizing the information this way is that we can compare the probability distribution of what usually goes on with the probability distribution of what we are looking for.

Processing challenges include topic classification, algorithms for rapidly and accurately updating information, and decisions regarding what information should be filtered to an analyst.
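
As a rough illustration of modeling a joint distribution with marginal and conditional histograms, the sketch below keeps one marginal histogram and one conditional histogram for a pair of variables. The class name `HistogramModel`, the two example variables (room and time of day), and the use of raw counts are assumptions made for the example; the summary does not specify which histograms the Bell Labs project keeps.

```python
from collections import Counter, defaultdict


class HistogramModel:
    """Joint distribution of (x, y) stored as P(x) and P(y | x) histograms (sketch)."""

    def __init__(self):
        self.marginal = Counter()                # counts of x
        self.conditional = defaultdict(Counter)  # counts of y for each x

    def add(self, x, y):
        self.marginal[x] += 1
        self.conditional[x][y] += 1

    def p_x(self, x):
        total = sum(self.marginal.values())
        return self.marginal[x] / total if total else 0.0

    def p_y_given_x(self, y, x):
        total = self.marginal[x]
        return self.conditional[x][y] / total if total else 0.0

    def p_joint(self, x, y):
        # Chain rule: P(x, y) = P(x) * P(y | x)
        return self.p_x(x) * self.p_y_given_x(y, x)


model = HistogramModel()
for room, time_of_day in [("finance", "morning"), ("finance", "morning"),
                          ("finance", "evening"), ("pets", "evening")]:
    model.add(room, time_of_day)
print(model.p_joint("finance", "morning"))
```

With more variables, keeping every conditional becomes too large, which is why a choice has to be made about which conditionals are worth storing.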

61 Rakesh Agrawal
"Data Mining: Potentials and Challenges"

Rakesh Agrawal is an IBM fellow whose current research interests include privacy technologies for data systems, Web technologies, data mining, and OLAP. He leads the Quest project at the IBM Almaden Research Center, which pioneered key data mining concepts and technologies. IBM's commercial data mining product, Intelligent Miner, grew out of this work. His research has been incorporated into other IBM products, including DB2 Mining Extender, DB2 OLAP Server, and WebSphere Commerce Server. His technical contributions have also influenced several external commercial and academic products, prototypes, and applications. He has published more than 100 research papers and he has been granted 47 patents. He is the recipient of the ACM-SIGKDD First Innovation Award, ACM-SIGMOD 2000 Innovations Award, as well as the ACM-SIGMOD 2003 Test of Time Award. He is also a fellow of IEEE. Dr. Agrawal received the M.S. and Ph.D. degrees in computer science from the University of Wisconsin-Madison in 1983. He also has a B.E. degree in electronics and communication engineering from the University of Roorkee, and a 2-year postgraduate diploma in industrial engineering from the National Institute of Industrial Engineering (NITIE), Bombay. Prior to joining IBM Almaden in 1990, he was with Bell Laboratories from 1983 to 1989.

62 DR. AGRAWAL: Good morning. This is what I thought I would do. So, this is my piece of the talk. Basically it is going to echo what Jerry said earlier. In fact, Jerry and I were going to share a cab ride and somehow we missed each other from the airport. It is good we missed each other because my talk is going to be like Jerry's talk in some sense. So, my first feeling is that data mining has started to live up to its promise in the commercial world, particularly applications involving structured data. By structured data I mean things which can be put in nice relational or object oriented databases. You know, they have fields. They have columns and so on, and Jerry gave a lot of examples of such applications. My sense is that at this time we are beginning to see data mining applications in non-commercial domains and I guess homeland defense definitely falls into this category of non-commercial domain, and these applications involve use of both structured and unstructured data, but my post hoc sense is that we are just seeing the beginning of it. I am pretty hopeful that if we have further research in this particular topic we might be able to have

63 bigger successes in this area, and this is going to be very important as we go in future. So, what line is that? I think of somebody saying that you know we have off-the-shelf data mining technology available which can solve all the homeland defense problem but have got, you know, that is too much of an exploration at this stage, but there is promise here, and it is worth exploring that promise. Okay, so, this is what I am going to do. I am going to give some examples of some applications of non- commercial data mining applications and sort of completely ignore all the commercial applications which you know Jerry gave a very nice introduction of that and you might have read that in two data mining conferences and so on, and I will sort of conclude by pointing out some of the important things as I see them. I like the talk to be interactive. So, we have some time. So, feel free to interrupt me at any time to ask questions. Next? I will begin by giving the first example of an application. So, this essentially is showing, you know, I guess Jerry mentioned identifying social links or linkage analysis kind of things. 63

So, this is showing some person and what this person's social network looks like, and in some sense you mentioned data visualization. This is using a fisheye kind of visualization, which is showing that in some sense these are the people who are closely linked with this person, and this is the second layer of association and so on. You can walk around it and zoom in on any person and get to see what these relationships look like, how these social linkages look. This is just an example we created, and the interesting question is how this was done. Basically this was based on a crawl of about a million pages, using underneath it our data mining algorithm, so let me just tell you what that algorithm is and then I will come back to this page and show how this was done. So, this is something that in the data mining world is called association rules. Basically you have transactions. A transaction is just a set of literals, and a database consists of a set of such transactions. So, in this particular example a page would be called a transaction, and you run some sort of a spotter on it and you find the names in

that particular page. So, those names would become the items of that particular transaction. Okay, so, a page is a transaction. Names appearing in that page are the items in that particular transaction, and you are trying to find rules of the kind: when A, B, and C happen, then U, V, and W also happen. Then there are the notions of support and confidence, and people in this community have spent a whole lot of time trying to figure out computationally what are the good ways of doing this. Can we do it, and I think you mentioned this earlier, can we do this task efficiently? It is clearly important. The number of items is extremely large. The number of transactions is extremely large. We are talking about millions or billions of pages, and for the number of items you are talking about all the possible names that can become items. You are trying to find relationships that might exist. So, it is a huge combinatorial effort, and the question is whether we can do this really well and really fast. If you look into the data mining proceedings, people have spent the last few years figuring out how to do this computation very well.
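The page-as-transaction setup described above can be sketched in a few lines. This is a minimal illustration, not the algorithm used for the demonstration in the talk: it only looks at pairs of names, it counts by brute force, and the page contents, the function name `pair_rules`, and both thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for pairs of names."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for names in transactions:
        unique = sorted(set(names))          # each name counts once per page
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of pages containing both names
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # P(rhs on page | lhs on page)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical pages: each page is one transaction whose items are the names found on it.
pages = [
    ["Alice", "Bob", "Carol"],
    ["Alice", "Bob"],
    ["Alice", "Dave"],
    ["Bob", "Carol"],
]
rules = pair_rules(pages)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Brute-force counting like this does not scale to the millions of pages and names mentioned in the talk; the efficiency question raised there is exactly about avoiding this kind of exhaustive enumeration.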

So, once you have found these rules, then let us go back to the previous one. This is simply a visualization which has been put on top of these rules, or what are called frequent itemsets. So, this is just a visualization on top of a known problem. This reminds me of what Jerry was saying earlier, that a lot of these problems look like major problems, but they may map onto problems we already know how to handle. So, here is another example, and here what we are trying to do is essentially give a profile for a web site. What does a web site look like? In this case this is asgonvenga.com(?). This is a site in Pakistan, and these are the different characteristics that we are interested in. It has things like financial support, Islamic leaders and so on, and the idea is the following: can we, by looking at a site, give, quote, unquote, a profile of this particular site in terms of these features that we are interested in? Okay, and you can see what the profile looks like for this particular site. So, how is this done? Again, next slide?

This is done by what Jerry was alluding to earlier. This has been done using classification, which again is a technique where, you know, Jerry was one of the people who wrote the first decision tree classifiers. This is how the input looks. In this case these are records, and this is the class variable: these are people who are high credit risks. From these data we are trying to develop rules for who are the people who would be high credit risks and who would be low credit risks, and this is how a decision tree might look. The idea is that once you have built this decision tree, a new person comes in, and we don't know whether this person is a high or a low credit risk, so we run this person's record through the decision tree and we are able to make a prediction. So, this is what people have done in the past. Can we just go back to the previous page? This is simply an application of the same thing here. What is happening is that we have pages which are examples of pages that fall into a particular category. So, we have a good training set, and once we have a good training set we are building the classifier and

once we have built the classifier we can take any site and pump it through that, you know, give it the pages of that particular site to develop this kind of a profile. Yes?

PARTICIPANT: I am just curious whether your example is realistic. It brings up the question of whether you wanted an effective algorithm or a legal algorithm.

DR. AGRAWAL: This is an illustrative example, basically.

PARTICIPANT: Of course, in market data that question will repeatedly arise in one form or another.

DR. AGRAWAL: Yes. Okay, next one? This is the third example I wanted to give, which is about discovering trends. Again, this is a small application built on top of a database, and what is happening is the following. The input here is the patent applications which were filed from 1990 through 1994, and the idea is that you specify the kind of trend you are interested in. This is an example of a resurgence trend, where a technology was popular.

Then it started losing popularity, and it is coming back into popularity again. So, the input in this case is the trend you are interested in, and the underlying input is all the documents, in this case patents, and you want to find a result which looks like this. Notice in this case we did not provide things like e-tools and pooling as inputs. This was something which was figured out by the underlying system. So, what is happening? Next slide, please? In this case the underlying technology is essentially two things. One is sequential patterns, and the other is shape(?) queries. For sequential patterns, the idea is that the input consists of a database of sequences. A transaction, as I said earlier, is a set of items; a sequence is an ordered list of transactions; and you find all the sequential patterns which are present in this database. Then you take these patterns and the support for them over a period of time, and you query the shape of the support history for each one, okay? So, this is the underlying technology beneath it.
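The two ingredients just described, a support history per pattern and a query on the shape of that history, can be sketched as follows. This is a simplified stand-in under stated assumptions: the "patterns" here are just pairs of consecutive words rather than general sequential patterns, the resurgence test is a crude down-then-up check, and the document texts and the names `support_history` and `is_resurgence` are hypothetical.

```python
from collections import Counter

def support_history(docs_by_period):
    """Map each two-word phrase to its support (fraction of documents) per period."""
    periods = sorted(docs_by_period)
    history = {}
    for i, period in enumerate(periods):
        docs = docs_by_period[period]
        counts = Counter()
        for text in docs:
            words = text.lower().split()
            counts.update(set(zip(words, words[1:])))   # count a phrase once per document
        for phrase, count in counts.items():
            history.setdefault(phrase, [0.0] * len(periods))[i] = count / len(docs)
    return periods, history

def is_resurgence(series):
    """Shape test: support falls to an interior minimum and then rises again."""
    low = min(range(len(series)), key=series.__getitem__)
    return 0 < low < len(series) - 1 and series[0] > series[low] < series[-1]

# Hypothetical patent titles grouped by filing year.
patents = {
    1990: ["heat removal device", "heat removal system"],
    1991: ["heat removal device", "laser printer"],
    1992: ["laser printer", "laser printer head"],
    1993: ["heat removal device", "laser printer"],
    1994: ["heat removal unit", "heat removal device"],
}
years, history = support_history(patents)
resurgent = sorted(p for p, series in history.items() if is_resurgence(series))
print(resurgent)
```

Note that the phrases are not supplied as input; they are discovered from the documents, and only the shape of interest is specified, which mirrors the point made in the talk.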

Let us go to the previous page. These things were essentially the sequential patterns which were found in the data, and for each of these sequential patterns there is a support history, and that is what is being shown here. Next slide? The next thing I want to mention is this: given all the web pages, we certainly know the big communities that exist, things like AOL and so on, but the interesting question was, can we find microcommunities that might exist? The idea was that these microcommunities would be of certain types, if you will, in which pages are frequently related. Pages with large geographicals(?) were not related. These are some of the communities that you get by trying to find these microcommunities; you find things like Japanese elementary schools, Australian fire brigades and so on. The labeling of the communities was done outside, but you could find that these communities existed. You could look at one and then provide a label for it. Again, underneath it, if you think about it, the input was a large number of pages with the links on them, and we were trying to find these trends.

Initially it was enough to use the standard algorithm, but then a variation of it was used to discover these microcommunities from the web. Okay, next one? Okay, this is the third application I wanted to mention, which is again a very non-conventional application. It is similar to what Diane was mentioning earlier, except that this is not using chat rooms; it is using newsgroup postings. In some sense we were trying to do a similar thing: can we get from the newsgroup postings, quote, unquote, the pulse on a particular topic? If you go to groups.google, they have all the Usenet postings. So, basically on any particular topic you can find what has been posted over the last 20 years, and it is available to you. The idea was that you take these postings and you have some sort of a response analyzer, and this response analyzer will show you the results on a particular topic over time. It turns out, without going into details, that standard data mining algorithms don't work very well here, because if you are trying to find people who are against a topic or for a topic, let us say finding out about

immigration, how people are feeling about it, then the positive and the negative postings tend to have almost identical vocabulary. So, if you use standard classifiers they won't do well. So, the analyzer works essentially just using some sort of computation of a class which is on the scent of what people and hits are doing there(?). Let me just quickly show you some results. We took some newsgroup postings on State Farm, and we tried to see if we can, from these postings, answer the following question: What do people think of State Farm's handling of claims? You can see that there is a person here who doesn't like State Farm, but a lot of people, a surprisingly large number of people, just come to the defense of this company and answer every person. We found it pretty interesting. You can click on any of these things, see what these people are saying and so on.

DR. CHAYES: Can you tell where they come from?

DR. AGRAWAL: Sorry?

DR. CHAYES: Where these positive people come from, do they come from statefarm.com?

(Laughter.)

DR. AGRAWAL: So, clearly some of them might be agents, also. On the other hand, that is the kind of noise you get, and to stress a point which was made earlier, somebody can go in and post things to confuse the analysis; it is a classic misinformation kind of thing. All that we were trying to do was to say, technically, can something be done, and there is a lot which has to be done on top of it. If somebody just takes these results and tries to use them as they are, that is going to be problematic. I just wanted to show you another one. State Farm decided to leave New Jersey, and there was this concern about how people were going to react to the fact that they were quitting New Jersey. Again, it is kind of interesting, and this was the point Diane made about how in these newsgroups there is a topic and lots of people respond off topic and so on, but it was very easy to see, once you went through this particular algorithm, that in this instance people didn't think State Farm was at fault. They blamed either New Jersey or no-fault insurance, which was pretty interesting to see. Again, you know, it is

opinion, and of course it can all be biased. It can all be done in different ways, but it was interesting to see, at least by looking at it. There is a tool available with which you can get a good sense of what is happening. Next?

PARTICIPANT: All these data mining algorithms we have seen involve large numbers. Are there algorithms where people can be solicited in several states and processed?

DR. AGRAWAL: That is an excellent question, and can I just defer it, because I want to come back to that question? There is a lot of interesting work that has happened, and a lot of people are thinking in exactly that direction, but that is a very good point. So, what I tried to do in this part was to give you examples of some interesting applications, and I thought I would mention at least two things which I think are important to also think through. I call them technical taboos(?) because they might really obstruct what people think data mining can do. One is what I broadly call the privacy concern, and there is some work on privacy in data mining that I want to mention here, and the second thing is where this data is going to come from. This is what Jerry was calling the

stovepipe, essentially. You know, how are you going to do data mining over compartmentalized databases? I, personally, don't believe that you will ever have one place where all the data will be collected. It just can't be done, and I personally believe it is not even desirable to do that. So, how do we do data mining over compartmentalized databases? That is going to be an interesting technical challenge, and for both of these things, again, I will try to point out what people have been thinking, and again think of the solutions as meant to illustrate the direction. So, this is the area of privacy in data mining, and it arose from the fact that people were really concerned that data mining is too powerful, too invasive. I think Jerry mentioned the example of privacy.org, where the view is that what people are trying to do is not the best thing that should happen. So, here is a thought again. You can disagree with it, but that is okay. The idea was this: here is an example of one person's record. This is this person's age and this is this person's salary, and here is another person whose age is 50 and whose salary is 40K, and the goal is that using this particular data you want to build some kind

of classification model. We want to build a decision tree, and the question is, can we do that particular task without violating individual privacy? If you think about it, what is really private about us is in the values, and what we tried to do here was essentially capture the intuition of what happens everywhere on the web today. Basically somebody asks you to fill out a form. People go and lie. You know, they see the age field and they say, "I will pick a number and put that." So, we basically said, "Hey, can we somehow institutionalize this lying? Can we in some sense make the lying scientific?" That is all we were trying to do. So, instead of you putting in an arbitrary number, what you do is take your true value and add to it a random value. So, you throw a coin, and this coin could be coming from some other distribution, and whatever value you get, you add to the true value. So, the value seen here is x plus r. Okay, so this person's age was 30.

PARTICIPANT: R is positive or negative or only -

DR. AGRAWAL: In this case this was negative. So, what is going to happen is the following. So, here is an

example showing a bimodal distribution. Once you do this randomization, you know, things might look pretty bad. So, you can say, "What have you done?" Next, please? So, in practice, you can take this randomized distribution, and you can factor in how it was randomized, and using these pieces of information you can try to reconstruct what the original distribution might have looked like. So, the idea is reconstructing distributions; you are not reconstructing individual values. You will never be able to reconstruct precisely what each record looked like, but you might be able to reconstruct the distributions, and most of the data mining algorithms really only require working at the distribution level. So, you can use these reconstructed distributions to go and build the models. Next? This is the reconstruction problem. You know the distribution of Y, and you want to use it to get back to the distribution of X, and so, I will skip the details; skip this, please? I just want to show quickly that this seems to work quite well. I mean this is how the distribution will

look, and this is what this might look like, and this green one is how the reconstructed distribution looked. Next? And if you use these reconstructed distributions, this is an example showing that for a particular task, at different levels of randomization, you are losing a certain amount of accuracy but you might gain more privacy, and there is this notion of how much randomization there is. If, say, the age came from the range here of 0 to 100(?), and you tell me your age is 65, and I can't improve on that estimate except to say that your age is somewhere between 0 and 100, then you have 100 percent privacy. That is the notion of privacy. So, it is basically to say that for randomization you don't lose much privacy. Okay, next? So, I just want to quickly mention again that I talked to you about how to do this for numeric data. This is how you do it with categorical or nominal data. Here would be a transaction. You replace an item in this transaction with an item which is not in the transaction. So, say these are the books you have read. You take a book and replace it with some other title which

is not present in the transaction. So, this is how you are doing the randomization. There is another way of doing the randomization; it might be a slightly more complex randomization, but it addresses the problem of privacy where I look at the result and, based on the result, I can infer something more. Next, please? Here are some results. These are two different data sets, and the exercise is finding frequent itemsets, that is, the things which occur together. Here we have taken the true data; this is how many true itemsets there are. After randomization, using the randomized data, I show how many true positives you could recover and what the number of false positives was, and you can see that you can still do extremely well and you get a fairly high level of privacy. Okay? So, once again, another topic: I will talk a little bit about computation over compartmentalized databases. This is again the frequent traveler example that Jerry was pointing out. You know, people are very frustrated with the system. The data for some sort of a frequent traveler rating model is going to come from all kinds of different sources, and I personally

believe it is not even desirable that these sources go and share the data. You know, there are lots of reasons for not sharing the data, and I firmly believe they should not share the data unless it is necessary to do so. So, there has to be some notion of on-demand, minimal sharing of whatever data is necessary. It is relevant to the topic of firewalls around these bodies of data. If you think about it, the idea that I presented to you about randomized data applies here also, because one way of sharing information would be to randomize it before we share it. The other approach would be that each database builds some sort of local model and then you try to combine the models. So, you do partial computations and then try to combine the partial results. The third would be to do some sort of on-demand data shipping and data composition kind of thing. I will briefly point out that this can be done. I just want to cite some work which was not done at IBM Almaden, but I think it is worth mentioning. Next one?
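The randomize-then-reconstruct idea mentioned here and in the privacy discussion earlier (add a random value from a known distribution to each true value, then recover the distribution rather than the individual values) can be sketched as below. This is an illustration under stated assumptions, not a reproduction of what was on the slides: values and noise are integers, the noise is uniform on a known interval, and the reconstruction is an iterative Bayes-style update. The names `randomize` and `reconstruct`, the age ranges, and the noise spread are all hypothetical.

```python
import random
from collections import Counter

def randomize(values, spread, rng):
    """Add integer noise drawn uniformly from [-spread, spread] to each true value."""
    return [v + rng.randint(-spread, spread) for v in values]

def reconstruct(observed, lo, hi, spread, iterations=30):
    """Estimate P(X = x) for x in lo..hi from randomized values x + noise."""
    support = range(lo, hi + 1)
    estimate = {x: 1 / len(support) for x in support}    # start from a uniform guess
    counts = Counter(observed)
    n = len(observed)
    for _ in range(iterations):
        updated = dict.fromkeys(support, 0.0)
        for w, c in counts.items():
            # True values that the known noise could have mapped to w.
            window = range(max(lo, w - spread), min(hi, w + spread) + 1)
            total = sum(estimate[x] for x in window)
            if total == 0:
                continue
            for x in window:
                # Posterior weight of x given w; the uniform noise density cancels.
                updated[x] += c * estimate[x] / (total * n)
        estimate = updated
    return estimate

rng = random.Random(0)
# Hypothetical bimodal ages, in the spirit of the example in the talk.
ages = [rng.randint(20, 35) for _ in range(500)] + [rng.randint(60, 75) for _ in range(500)]
noisy = randomize(ages, 20, rng)
estimate = reconstruct(noisy, 0, 100, 20)
young = sum(p for x, p in estimate.items() if x < 48)
print(f"estimated mass below age 48: {young:.2f}")
```

Only `noisy` and the noise parameters are needed by `reconstruct`; the true `ages` never leave the data owner, which is what makes the same device usable as one of the sharing options across compartmentalized databases.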

Here, in this particular model, and this is for people who do cryptographic kinds of things, there are two parties, a phone conversation(?) here. The goal is to build a decision tree on the union of the databases without revealing any unnecessary information. Next? This is a two-party protocol, basically how to compute a function f of X and Y without revealing what X and Y are. Next? And basically, for a decision tree, particularly if you are using an ID3 kind of classifier, the key step is to compute the information gain, which boils down to this kind of computation, and in this case B1 and B2 could be coming from different sources, and using the protocol you can combine that information. That is basically what was done in that paper. The question, the challenge, is, you know, do these results generalize, and can they be applied to other data mining tasks? There is interesting work to be done here. Okay, so, this is my sense of what I thought would be some interesting new directions in data mining toward attacking these so-called "non-conventional" problems, and I just thought I would mention some of the problems which I, personally, find extremely hard. When I think about them, I sometimes don't even know how to approach them. Most of data mining is based on the assumption that the past is connected to the future, but what if history doesn't repeat? What if history changes abruptly? It is not a gradual kind of change which is happening; there are abrupt changes, and how do we detect those changes and know that whatever models we have will be totally inapplicable? If you think about it, this is exactly what has happened on September 11 and since then. The profile of the current suicide bombers has completely changed from the earlier examples. So, anybody who was building a data mining model using the examples collected from the past was fighting an old war. So, how do we learn? How do we detect that we are fighting an old war now, and the war has completely changed? Reliability and quality of data: this point is extremely relevant. We all know that. The kind of model that you are going to build depends on the quality of data

you have got, and how are we going to detect which part of the data is good? Actually it is kind of interesting. Let me give you an example of what data mining tools do. Clustering: Intelligent Miner is an IBM product. When it was being developed, people had taken an algorithm which was used in some place and were trying to recode this algorithm, and they found that the results they were getting after this reimplementation were different from the results they were getting before. So, the question arose, what is happening here, and they found that the reason the results were different was that the algorithm had a bug in it. So that is one part of this. The interesting part of the story is the following: the group which was providing the algorithm, and more important, the consultants who were using that algorithm in engagements with real companies, insisted that we should put that bug back in, because they had convinced customers that what they were getting out of these algorithms were good results. Customers were happy. So, think about that.

(Laughter.)

DR. AGRAWAL: So, the other part is the following, that I personally believe that data mining,

because these tools are so general, can only provide first-level information, and that has to be fed into something afterwards. So, how do we take these patterns and make them actionable? Some way we have to bring domain knowledge in, and I still don't know a principled way to bring domain knowledge into this whole process. I still don't know that. I think everybody has said that data mining is like finding needles in haystacks. In fact, it is not. I mean, most of data mining right now doesn't find needles in haystacks. It finds patterns which are dominantly present, and how do we save ourselves from overfitting in our quest for finding real nuggets? There is a huge danger that we might be just finding noise. In all the applications I know of, people say that it is about finding the really rare events, and generally most of the data mining algorithms today are extremely weak at finding rare events. Patterns? We can find some sets. We can find some sequences. I look at medical applications, and there people really wanted to find deaths(?), or at least some sort of problem of deaths(?), because that is the kind of pattern they are looking for, and we don't know how to do this. Data types: I mean, I have given you examples of text

kind of data, structured data, but you have many more data types on which you want to do mining. When to use which algorithm: I just resonated with what Jerry said. You have a hammer and everything looks like a nail. At IBM, it was very interesting. You know, I worked with a bunch of consultants. One consultant was a specialist in Uronet(?). Every problem he would convert into a Uronet(?) problem, right? Another guy knew accession(?). He would take every problem and convert it into an accession(?) problem. That was the best algorithm. That was the solution for every problem, and we still don't know which algorithm to use where, and these algorithms are tricky. A lot of them have a lot of wrinkles, and is there hope of getting this done without the help of somebody behind them tuning these algorithms? Is there a hope of building things which in some sense go and figure out, in a data-dependent way, what parameters to use for what particular algorithm? So, this is just a set of things I thought I would like to mention, and this is my summary slide, saying that I do believe that data can be a strong ally. There is no doubt about that. I do believe that data mining has shown some promise. I don't think it has matured enough to be used. It requires lots of work, but it

has shown promise and it is worth investing in it, and if we do invest in it, we might be able to realize its potential. That is it. Thank you.

(Applause.)

PARTICIPANT: Several months ago Karen Casadar suggested that instead of profiling passengers for airplanes what one might do is profile airplanes. It seems to me that you have offered an idea in another domain. I gather you can do social network analysis to find cliques and you have a very long list of cliques based on Internet data. The next step would be to try to characterize or profile the cliques. There would probably be more information at the clique level than at the individual level.

DR. AGRAWAL: I cannot agree more with what you just said. Jerry and I were coming. To stop us, you know, we were talking together. They had to cycle some people, right? Who would be the best two people, two nice people talking about this conference. They are going together. Come on, you know.

PARTICIPANT: Sometimes common knowledge points to people who act as a bomb or something like that. Are

there any kind of data mining analyses to discover these kinds of common knowledge?

DR. AGRAWAL: I think that is a really good point. I was talking to someone who was saying exactly what you are saying, that, hey, you guys are looking too hard. If you look for common knowledge you improve the quality.

DR. SCHATZ: Okay, thanks.

DR. AGRAWAL: I forgot to answer a question that was raised. So, I think there is something happening right now. In fact, there is this point that we don't build a big model in one shot. You take some data, build some models, and then refine them, and that is how the model gets built up.

DR. SCHATZ: Thanks. A couple of points I just wanted to make before we break here, on Rakesh's talk: the randomization and privacy issue I think is very important for us, too, and the challenge of working across what you are referring to as compartmentalized databases is a big one. It is hard to know how to get a grasp on those, but they are very big problems, important for us, too. Okay, we have got a 15-minute break. We are back on schedule, and we will see you here at ten-thirty-five. Save your questions up for Don McClure and Werner Stuetzle.

(Brief recess.)

Data Mining: Potentials and Challenges
Rakesh Agrawal

There are several potential applications of data-mining techniques to homeland security. These include linkage analysis, Web site profiling, trend discovery, and response analyzer algorithms. Linkage analysis consists of forming a graph that illustrates a person's social network, storing this information in a database, and recording "association rules" and "transactions" for each relationship in order to permit the use of data-mining algorithms. Web site profiling uses a decision-tree-based technique called "classification." A classifier is built in order to classify a set of people and may be used to distinguish people who are high credit risks from those who are low credit risks. The technique of trend discovery is used to exploit a database of sequences by recording sequential patterns and shape(?) queries. For example, trend discovery could be used to review patent applications filed over a number of years in order to detect a possible resurgence of a certain technology's popularity for inventions. Response analyzer algorithms can be used to identify trends in news group postings on a particular topic over time. However, it is often difficult to distinguish between responses for and against a particular topic, given that positive and negative postings tend to have almost identical vocabulary. There are, however, several concerns. It is impractical and undesirable to merge different organizations' databases. Therefore, it is important to create local models from independent databases and then develop techniques to combine these models into an overall model. There are also concerns about privacy: a known distribution can be added to a data set, resulting in a randomized distribution from which the analyst may preserve the original distribution without knowing the original data values. Randomized distributions should not affect data-mining algorithms, since most data-mining algorithms just require working at distribution levels. One challenge facing the field is the fact that we cannot always rely on data from the past. For example, the profile of a suicide bomber has completely changed from what it once was. Another challenge is that data mining requires data that are reliable and of good quality.
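The credit-risk classification mentioned in the summary can be illustrated with the first step of building a decision tree: choosing the single split that best separates the classes. The sketch below uses information gain, a common criterion in ID3-style trees; it is an assumption that this matches the classifier shown in the talk, and the records, labels, and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(records, labels):
    """Return (gain, attribute, threshold) for the split 'attribute <= threshold'
    with the highest information gain."""
    base = entropy(labels)
    n = len(labels)
    best = (0.0, None, None)
    for attr in records[0]:
        for threshold in sorted({r[attr] for r in records}):
            left = [y for r, y in zip(records, labels) if r[attr] <= threshold]
            right = [y for r, y in zip(records, labels) if r[attr] > threshold]
            if not left or not right:
                continue
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            gain = base - remainder
            if gain > best[0]:
                best = (gain, attr, threshold)
    return best

# Hypothetical training records: age in years, salary in thousands.
records = [
    {"age": 23, "salary": 50}, {"age": 17, "salary": 30}, {"age": 43, "salary": 40},
    {"age": 68, "salary": 50}, {"age": 32, "salary": 70}, {"age": 20, "salary": 20},
]
labels = ["high", "high", "low", "low", "low", "high"]

gain, attr, threshold = best_split(records, labels)
print(f"split on {attr} <= {threshold} (gain {gain:.3f})")

# A new record is classified by running it through the chosen split.
new_person = {"age": 21, "salary": 65}
prediction = "high" if new_person[attr] <= threshold else "low"
```

A full tree would apply `best_split` recursively to each side until the labels are pure or too few records remain; because the computation depends only on class counts, it is also the step that can work from reconstructed distributions or from counts combined across separate databases.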

Donald McClure
"Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition"

Donald McClure is a professor of applied mathematics at Brown University.

DR. McCLURE: Thank you. I would like to remind people of one point that Peter Bickel made at the beginning. The discussants are in general people who come from outside the areas of the main presenters in these sessions, and I am certainly an outsider to the area of data mining. It is not an area in which I have been active. On the other hand, I do have a fair understanding of problems in the area of trying to extract information from massive amounts of data, generally image data. I will mention a couple of examples of that type in the comments that I make. I also have some experience with pattern recognition. I had no idea exactly what the content of the talks this morning was going to be. I enjoyed them all very much. I did take a peek at a couple of papers that Jerry and Diane had written recently, so I had some idea of what their perspective was going to be. So, there are three or four points that I wanted to make that relate directly to the talks. The first two are sort of related. One is an observation that Jerry Friedman made in his talk, that advances in the area of data mining will rely on a combination of expertise, a coordination among disciplines and the things that different people can bring to bear on these problems. I don't know whether data mining is regarded as an area that is owned by

computer science or owned by statistics, and it should probably be neither. It should probably be an area that attempts to draw on the things that people from different disciplines can bring to bear. I, personally, feel that one of the challenges in new areas that are poised for advances is technology transfer and this works in both directions. How do we stimulate cross fertilization between scholarly research that is going on, on the one hand, and the expertise of people who are designing and integrating systems, on the other hand? This really has to occur in both directions. I felt very strongly in my academic career that people doing scholarly research (I am an applied mathematician, so the applied part is part of my own discipline), that the scholarly basic research benefits greatly from having an understanding of what the real problems are. So, the academic disciplines can benefit from a better understanding of what the real problems are, and in the other direction, when good research is done, how do we then do what most people refer to as technology transfer and see that that gets integrated into the systems that are being developed, so that systems design and integration is not a process of people taking things from

93 standard toolboxes and assembling those to assist them but instead drawing on new advances in research. There were two remarks made that relate to a third dimension, my own personal bias relating to the need or use of models, Jerry commented that, and I may have the words wrong. I was jotting down notes during the talk but one of the needs and important aspects of the way in which statistics might have a bearing on data mining would be in the area of assumption free inference. We need inference methods that are assumption free. I don't know if we completely agree or if we perhaps are saying the same things in different words or in different perspectives. Diane made the comment to pick up from the comments from the talks first that behavior should be modeled as a probability distribution, and I emphasize the word "modeled," that it is important to have a model when we are trying to develop decision procedures. She, also, emphasized that the models need to be non-parametric or distribution free. So, they are very flexible and can be adapted to all kinds of environments and I think distribution free models may be in the direction of what Jerry had in mind in referring to assumption free inference, models that are very flexible as opposed to ones that are parametric and more restrictive. 93

My own personal bias that relates to these is that if we are looking for either a needle or a nugget in a haystack we need to do something about modeling the needle or the nugget. We need to know something about what it is we are looking for. What can mathematics and statistics contribute to problems of identifying and extracting information from massive amounts of data? I believe that there are many contributions that mathematical sciences can make. Let me, I know I am a discussant, but I couldn't resist bringing a couple of pictures. So, let me put up a couple of pictures to show you. This is a picture provided by Yali Amit of the Department of Statistics at the University of Chicago. He and Donald Geman have cooperated for several years on research on trying to identify generic object classes in digital images and still images and as a prototype of the problem of detecting generic object classes, not ones where I have a fixed rigid model for what I am looking for but where there is something that in different instances can take many different forms. For a specific problem to focus on they focused on the problem of detecting faces in video imagery. So, the

triangles up here are hits and in this image these are the faces that have been identified. This is not a particularly complicated example. It is not a representative example but their methods are really quite impressive in what they have achieved, and they have relied heavily on ideas and methods from probability and statistical inference, and I can give people, I don't have titles of papers, but I can give people pointers to some of this work. Last Sunday morning CBS news had a segment on, the key point was the new surveillance cameras being installed in Washington, DC, and how were these going to be used and concerns that should be debated in public debate about the needs to balance surveillance protection and privacy. They commented that in London where cameras are already in very widespread use on average every person in London has their face on camera 300 times a day. Now, when we think about trying to identify faces in video imagery this is I think a problem of trying to identify image information from massive amounts of data. The data rate for standard resolution, digital video, I mean like NTSC or PAL video is, for color imagery is about 25 megabytes per second. Now, all of that

isn't interesting, but if we are going to find anything interesting in it, I believe it is going to be based on having a model for what it is that we are looking for. There is a line under each face pointing at it, and in this picture it says that it is probably good at recognizing brain wave membrane ties. (Laughter.) DR. MC CLURE: It is actually looking for facial features. So, this is sort of a postage stamp that I have blown up in order to present it at that size. So, here is one that doesn't look terribly exciting in terms of scientific content but the problem is to find and read the license plate. This is actually work that was going on in connection with developing a system to be used in parking lots at Logan Airport. The work has been going on for a couple of years. License plate reading is a problem in computer vision that has been studied for decades. In the sixties people were developing systems to read license plates. I took a look last week at what is available today in commercial systems, and I found something like 50 different systems on the market for reading license plates, and my feeling for the reason that there are so many systems is that none of them is really a superior

97 solution, and there is still room for improvement in this problem. This is a hard problem as a pattern recognition problem to find the plate and to read it. Simple instances of the problem, it is not rocket science to specify a system that is going to be able to do it, but even in this image we can find the plate very easily but there is a complication in the image and it is not -- there is illumination for example and a shadow creates a difficulty in terms of a vision problem, but the real challenges come from just the enormous variety of ways that this problem can manifest itself. So here are some challenging license plates to read. These come from a web page, www.photocomp.com where there is actually a fairly interesting summary or review from a consultant system integrator about systems that will read plates. License plates as an optical character recognition problem are challenging because of the many different forms in which the problems can be presented. So, the small characters with large characters, this is actually pretty common if you watch plates as much as I do when your car is riding down the street. The graphics are very common on plates and license plate frames that have a 97

habit of invading the bottom strokes of the characters so that we can't simply say that the characters are dark characters on a light plate background. License plate reading is a more difficult problem than a character recognition problem. We can report that statistical methods from statistical inference have contributed to a very successful algorithm for reading these characters. This is relative to the plates. This particular idea is an easy problem. It is a very easy problem to contrast the light characters on a dark background. It is not like the noise. It isn't white noise in this image. The background has structure in it, and the characters come from a fixed font, from a font specification. So, we know exactly what the model for a character is before we start trying to do the reading, but in these problems, these are characters that are etched on wafers and as wafers go through the manufacturing process things happen to make this a more interesting vision problem. So, there is according to the specs supposed to be clear space around the identifying marks on a wafer but the semiconductor fabs try to use every square millimeter of space on that wafer so they will etch over the region where the ID occurs.

Now, I won't put up any more examples of this, but I will return to my general remarks for a minute just about how mathematics can play a role in a problem like this, and more generally how mathematics and statistics can play a role in problems of extracting information from nonconventional massive data sets. In addressing this problem, an approach that we have used is, first of all, putting on our statistician's hat, to regard it basically as a hypothesis testing problem. I want to decide, at every location in this image in the whole field of view, is there a character present at that location: character present versus character not present. There are typically 36 different characters, alphabetic characters and the 10 digits. So I can refine that hypothesis of character present into a composite hypothesis formed from the different characters that might be present. So, at any rate we view this as a hypothesis testing problem and try to model what it is we are looking for and try to model the variation and try to model the probability distributions. This involves identifying invariants of what really defines what it is that we are looking for and using

distribution free methods, using non-parametric methods in order to have robust methods for forming a basis for decision tree procedures, articulating the model in the form of probability distributions, so that when we need to decide among the hypotheses we can use methods that are based on time-honored principles of statistics such as, in this problem, likelihood ratio tests in particular. At any rate I will conclude my comments. I found all three presentations this morning to be very, very interesting. DR. SCHATZ: Thank you, Don.
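The hypothesis-testing view described above (character present versus not present, with "present" treated as a composite hypothesis over the possible characters) can be sketched as a generalized likelihood ratio test. Everything specific here is an assumption made for illustration: the three tiny 5x3 templates, the pixel-flip probability, the background model, and the threshold. The sketch also uses a simple parametric noise model for brevity, whereas the remarks argue for nonparametric, distribution-free models, so it should not be read as the speaker's method.

```python
import math

# Hypothetical 5x3 binary templates ("1" = dark pixel); a real system would
# need all 36 characters at a realistic resolution.
TEMPLATES = {
    "I": ("010", "010", "010", "010", "010"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}
FLIP_PROB = 0.1      # assumed chance that a pixel disagrees with its template
BACKGROUND_ON = 0.5  # assumed chance that a background pixel is dark

def log_lik_character(patch, template):
    """Log-likelihood of the patch if the given character is present."""
    total = 0.0
    for patch_row, template_row in zip(patch, template):
        for p, t in zip(patch_row, template_row):
            total += math.log(1 - FLIP_PROB if p == t else FLIP_PROB)
    return total

def log_lik_background(patch):
    """Log-likelihood of the patch if no character is present."""
    total = 0.0
    for row in patch:
        for p in row:
            total += math.log(BACKGROUND_ON if p == "1" else 1 - BACKGROUND_ON)
    return total

def glr_test(patch, threshold=0.0):
    """Maximize over characters, then compare with the background model.

    Returns (best character or None, log-likelihood ratio).
    """
    best_char, best_ll = max(
        ((c, log_lik_character(patch, t)) for c, t in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    llr = best_ll - log_lik_background(patch)
    return (best_char if llr > threshold else None), llr

noisy_l = ("100", "100", "110", "100", "111")   # an "L" with one flipped pixel
clutter = ("101", "010", "101", "010", "101")   # checkerboard background
print(glr_test(noisy_l))
print(glr_test(clutter))
```

Raising the threshold trades missed characters for fewer false detections, which is the same specificity trade-off that matters whenever the test is run at every location of a large image.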

Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition
Donald McClure

Advances in data mining rely on coordination among disciplines, with dialogue between people doing scholarly research and those who are designing and integrating systems. Models used in data mining should be very flexible as opposed to parametric and restrictive. If we are looking for a needle in a haystack we need to do something about modeling the needle. We need to know something about what it is we are looking for. In problems of extracting information from massive data sets, nonparametric procedures, in particular, should permit robust decision-tree analyses that articulate the model in the form of probability distributions. In that way, when we need to make a decision among the hypotheses, we can use methods that are based on time-honored principles of statistics such as likelihood ratio tests.

Several applications of data mining are being used in computer-vision research related to homeland security. One example is finding and reading license plates. Although there are many commercial systems to do this, there is still room for improvement in the areas of illumination and its resulting shadows; character size, type, and color; graphics; contrasts between characters and background; and the obscuring of information by license plate frames. In addition, data mining will help in the detection of patterns on silicon wafers as they go through the manufacturing process. Although there is supposed to be clear space around the identifying marks on a wafer, semiconductor makers try to use every square millimeter of space on that wafer, often etching over the regions where the identification would occur.

Werner Stuetzle "Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition"

Werner Stuetzle was born on September 14, 1950, in Ravensburg (Germany). He completed his undergraduate degree at Heidelberg University and at the Swiss Federal Institute of Technology (ETH) in Zurich. Dr. Stuetzle completed a master's degree (Diplom) in mathematics (1973) and a Ph.D. in mathematics (1977), both from ETH. He studied estimation and parameterization of growth curves under the direction of P.J. Huber. Dr. Stuetzle was an assistant professor in the Department of Statistics at Stanford University from 1978 to 1983, with a joint appointment in the Computation Research Group of the Stanford Linear Accelerator Center. In 1981, he was a visiting professor in the Department of Applied Mathematics and Center for Computational Research in Economics and Management Science at MIT. In 1983 and 1984 he was a research staff member at the IBM Zurich Research Lab. Since 1984, Dr. Stuetzle has served on the faculty of the Statistics Department, University of Washington, Seattle, with an adjunct appointment in the Department of Computer Science and Engineering. He was chair of the Statistics Department from 1994 to 2002 and spent 1999 and 2000 on sabbatical in the Research Division of AT&T Labs.

DR. STUETZLE: All right, I just made a couple of slides here. So, one thing that struck me this morning or when I actually looked at the transparencies for Jerry's talk that he sent me ahead of time, one thing that certainly struck me is that there are very high expectations for the usefulness of data mining for homeland security. So, for example, when the Presidential Directive says, "Using existing government databases to detect, identify, locate and apprehend potential terrorists," so not people who did something but, you know, who might do something in the future, that is certainly I think an extremely ambitious goal, the same way with locating financial transactions occurring as terrorists prepare for their attack. I mean given that it doesn't take a lot of money actually to commit terrorist activities and then finally, the use of data mining to pick out needles in a gigantic haystack. So, as to how realistic that all is I remember when in the late sixties, early seventies there was a terrorist group in Germany called the Baader-Meinhof gang. So, Germany is a very highly regulated society compared to the United States.

For example, the government knows where everybody is. I mean when you live someplace you have to go to city hall and you have to be registered. When you rent a room you have to go there. So, Germany has much more complete control over its citizenry than the US has or I am sure that US citizens would ever tolerate. That is point one. Point two is the bottom line is knowing who these people were. So, they wouldn't go looking for people who were going to commit activities. They had actually done stuff. I mean they had murdered leading bankers, politicians, judges and so on, blown up things and so even despite those two factors, the highly regulated society and that it was already known who the perpetrators were, it took years to actually track these people down. So that is one thing. I think that is an important thing to keep in mind. They basically looked for certain characteristics. So, for example, when landlords reported the rental of apartments they would check for certain things like how old was the individual. Do they drive BMWs or other fast cars? Did they make the deposit in cash, etc. So, that is in a sense a data mining technique but despite all that it actually took years to track these people down.

105 So, I am not saying that it is impossible. I am just saying that even under very highly regulated circumstances when you know who you are looking for this is not a trivial matter at all. Okay, so then when you think about applying data mining to this problem there is first of all what I would call the simple although not easy. Some of the ideas have the super big databanks. So, it is actually not physically one database. It is the collection of all information available so that it could be travel records, bank records, credit card records, previous travel patterns, demographics of the people, etc. So, you have properties of people and you also might have network data like who interacts with whom. Who calls whom. Who sends e-mail to whom, etc., and so forth. So, you have this large collection of information. In principle you could have that amongst all the people in the US and then the idea is to apply data mining tools and then from that you estimate the probability that the person is a terrorist given all these properties. So, that is a little part of what one would do, and so you look at all data. You apply data mining tools and you basically classify people into potential terrorists or not potential terrorists. 105

106 So, this is actually, so something related to that was what Diane was talking about. You take the calling card fraud. So, that is what AT&T does to detect calling card fraud. They know the people who have AT&T long distance and then they collect call records of these people. For every call in the long distance network there is a call record which gives you the calling number, the number that was called, who the call was billed to, how long the call was, etc. They have a long record of that. They build up a profile for all the users and then based on that profile and some training data they try to make a rule that sort of flags people who might or who are suspected of committing calling card fraud, and that is a much easier problem. First of all, you have a universe of people that you are interested in and those people have AT&T long distance. So, that is the first thing why it is easier. The second reason why it is easier is the consequences. So, the consequences of missing somebody or erroneously identifying somebody are quite different. So, if they think that somebody might be committing calling card fraud or if they think somebody is abusing my calling card they will just give me a call and say, "Well, did you really call Nigeria 15 times the last week?" And if I say, 106

"No," well, then they are confident they have found something and if I say, "Yes," well, then that is not a big loss and it is different from being apprehended at the airport. So, if they miss somebody they will just lose some money which is painful, you know, but not like an airplane that has gone wrong. So, this is a much easier problem but even that is already quite, I mean people have put a lot of work into that. So, now getting back to the original problem of estimating that someone is a terrorist given the properties, the problems are first of all you have a very small training set. Jerry alluded to that. Actually you don't have a small training set. You have a small number of ones in the training set. You have 280 million zeroes and 16 ones in that training set. I mean one example, Ted Kaczynski and Tim McVeigh you don't really see the attackers. Mohammed Atta and accomplices, but say you have the useful data in the training sample. So, that is one problem. Then the second problem is what anybody who has ever taken any basic statistics course knows, the problem with screening. If you have a tiny incidence in the population, which we are going to have, you need methods with very high specificity. Otherwise all

108 your alarms are going to be false alarms. So, that is another. So, that was something that Rakesh pointed out. You are not looking for things which are dominantly present. That is what data mining is good at, but you find out the people, that most people who buy beer also buy potato chips. So, that is something that is just dominantly present. So, you are looking for things which have a tiny incidence and so unless you have very high specificity you get a huge number of false hits, and finally, another problem is that the good predictors might not necessarily be legal predictors. Good predictors might be if you listen in on everybody's phone calls and automatically process those, and you can look at everybody's e-mail, etc., and that might not be possible or you might not want to do that. It might be illegal or just infeasible because of the sheer mass of the data. Okay, so, even in its highly idealized form where you have this universe of people and you want to make this rule and if you could put all these databases together and run these data mining algorithms, even in the idealized form this is very difficult. 108
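The screening problem raised in these remarks is a matter of simple arithmetic, and a short sketch makes the scale visible. The population of 280 million and the 16 positives are the figures quoted in the talk; the sensitivity and specificity values are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def screening_outcomes(population, positives, sensitivity, specificity):
    """Expected results of screening everyone once.

    Returns (expected true hits, expected false alarms, fraction of alarms
    that are real), using
        real fraction = hits / (hits + false alarms).
    """
    negatives = population - positives
    expected_hits = sensitivity * positives
    expected_false_alarms = (1 - specificity) * negatives
    real_fraction = expected_hits / (expected_hits + expected_false_alarms)
    return expected_hits, expected_false_alarms, real_fraction

# 280 million people and 16 positives, as quoted in the talk.
# Perfect sensitivity is assumed so that only specificity varies.
for spec in (0.999, 0.99999, 0.9999999):
    hits, false_alarms, real_fraction = screening_outcomes(280_000_000, 16, 1.0, spec)
    print(f"specificity {spec}: about {false_alarms:,.0f} false alarms, "
          f"{real_fraction:.4%} of alarms are real")
```

Under these assumptions, even a rule that is wrong about only one innocent person in a thousand raises roughly 280,000 false alarms for 16 real cases, which is why very high specificity is needed when the incidence is tiny.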

Now, however, this is much too simplified. This is much too simplistic. First of all there is going to be a lot of individuals outside the system. So, the system, meaning the people who are in the US and are in these databases, well, first of all there are foreign travelers, people who come in as travelers who are basically not in the system of immigrants. As far as I know the US doesn't even know how many millions of illegal Mexicans you have. I mean it is a matter of millions of people that one is not sure about. So, it is not at all clear; that universe is not clearly delineated. PARTICIPANT: They have attempted to build a system in the US. DR. STUETZLE: I don't know much. I just hear rumors that it is not that effective. I might be wrong. Okay, so then there is the other issue of identity. So, even if you had the universe of people and you can run that decision rule and then I show up at the airport then the rule says that I am not a potential terrorist. Now, that only makes sense if I am really who the system thinks I am. So, therefore, if you can't prevent identity theft that kind of system is not going to be very useful because you have to be sure that the person that you are making the prediction about really is the

person standing at the gate. If you can't guarantee that then this prediction isn't going to be very useful. Then of course there is terrorism on foreign soil which seems very hard to prevent. So, now, I am not a national security expert but realistically it seems to me what you can hope to achieve is some access control. So, you can hope to deny access to some critical places, and places I put in quotation marks because there could be virtual things like access to certain machines on the network, etc. So, you can maybe hope to deny access to some individuals who are either outside of the database, so they are not in your universe; you don't know anything about them. You could just say, "Well, I am not going to let them do X or let them go in." So, that is something you might realistically be able to do or you could deny access to people who are inside the system on whom you have data but you have bad indicators. So, you can run the specification procedure on your universe and you can try to see, well, you can try to make such a prediction of the likelihood of being a terrorist and if you think that is high you might deny access. So, finally, it seems to me that for any such thing establishing identity is critical and so it is not easy it seems to me and these biometric identification

methods are really crucial, really have to be a crucial part of any strategy because that is the only way to totally reliably establish identity. All right, that is all I have to say. Thanks. (Applause.) DR. SCHATZ: Don, Werner, thanks very much. We have a little time for discussion. I do want to comment that the biometrics problem is very important to us at NSA, too. We have a large group of people who have been studying that recently. Just to get some discussion going here, if I could take one second before we jump in, at NSA we have an activity we call the Advanced Research and Development Activity and the director of that activity, Dean Collins, is here, and I know that no one in this room is the least bit interested in funding or money or anything like that, but Dean actually provides funding for external research, and if he could just take a second to talk about his program I think you will be interested from a number of points of view. So, Dean, would you just give us a couple of minutes here on that? DR. COLLINS: Thank you, Jim.

My name is Dean Collins, and I am the director of a very small organization called ARDA, and has anybody ever heard of us? Oh, a couple of people, okay, my colleagues. (Laughter.) DR. COLLINS: So, there is a comment that I think that Jim is doing a very nice job of describing after each talk the things that NSA is interested in, and I represent not only NSA but our little activity is an intelligence community activity. So, to a certain extent I go across all the three letter agencies. The thing that Jim mentioned was that his customers are the analysts and so after all the data comes in then the analyst looks at it and I have been in this business, this particular job for a little over 2 years, and I have spent a lot of time with analysts, and I would like to tell you that a lot of the tools that are provided to them are not used, and so, we have only four particular areas we work in and one area that we are starting up is an area called novel intelligence from massive data, and that kind of clicks back in here, and it is hopefully of some interest to you not necessarily for the funding although if you would like to participate in the funding we would certainly welcome that, but we have spent about 6 months

working with analysts and saying, "What do you want?" as opposed to what can we give you, and I thought that you know if you have a hammer, you know, every problem is a nail. It is sort of like what are the problems in the form that you would like, and so, one of the things that we spent the 6 months on is a formulation of what are the analysts looking for, and they are quite different than you might think. Since I think the patriotic duty which a lot of us ascribe to in the fact that you are here you might be interested in looking at the, you know, what the analysts really want. This program is ongoing as we speak. It is unclassified. It was put out in what I used to call a Commerce Business Daily but it is now called FedBizOpps and I will write down the URL. You can look at it. As of Wednesday there were 145 white papers. Obviously not a lot of people in this audience are aware of this opportunity and maybe you decided you didn't want to do it. I would just like to bring that to your attention but primarily from this standpoint. Look at the questions. You know, that is the important thing, you know, the funding put aside. Look at the questions about what they are really interested in. You may get a totally different

viewpoint of what is important to an analyst, and they are looking for that rare occurrence and so that the gentleman from IBM I think put it very succinctly. That is a very difficult problem, and all of the work is unclassified, and there will be conferences involved. So, if you don't get into the program you might like to at least attend some of the conferences. Jim, thank you very much. DR. CHAYES: I have got one question for you. Who are the people you are working with? DR. COLLINS: We are not working with anybody at the present time. DR. CHAYES: But in universities there are people, I mean what areas are they from, math or -- DR. COLLINS: We are problem focused not discipline focused specifically. We can't tell you who we are working for because nobody is under contract. DR. MC CLURE: Who is submitting the white papers? Where are they coming from? DR. COLLINS: I do not know. I have not looked at any of them. They are just coming in and that is something I couldn't tell you if I did know because that would be interfering with the procurement process.

DR. CHAYES: And this URL has the description of what you are looking at? DR. COLLINS: It refers you to an announcement for the questions. DR. AGRAWAL: Can you give us one example? DR. COLLINS: You mean one of the problems? Okay, I think that one of the problems for the analysts, and this is something that is a very big problem, this issue of passive knowledge. If you do a database, a data mining search and you come out and say, "Gee, there is a strong correlation between the White House and George Bush," this is not very interesting to an analyst and he is probably going to take your software and throw it in the trash. So, now this is a very difficult problem but it is a very crucial one to an analyst. DR. SCHATZ: Okay, Dean, thanks. DR. COLLINS: I will write this down afterwards so I don't interfere. DR. SCHATZ: So, I will open the discussion up here to anybody who would like to start. PARTICIPANT: I thought the issues about essentially screening which is what we are talking about were very important issues and we can apply all the mathematical techniques we want to but if there are only a

half dozen positive cases in a massive database you have to use your domain knowledge. You have to figure out how the domain knowledge, but we are also going to have to figure out policy and so I think that one of the things that this community should be, we are looking at the question of how to find the terrorists but we also have to bring to bear knowledge about, or we also have to bring to bear policy issues like screening programs. If we have the metal detectors that keep people from doing things, how does that play in? Are we reducing the incidence of these things? We have to bring to bear behavioral analysis. Behavioral and social science research has to be brought to bear on this. How can we quantify that and put it into the same framework so that it can be integrated in the statistical models? DR. SCHATZ: Good points. Did I miss a question in there? PARTICIPANT: It was a point that I wanted to make and I wanted to ask people to make comments on that if you would. DR. STUETZLE: I really have got not much to say to that. I mean I agree that that is true, and that is a problem I think that classification procedures are not necessarily, it is not always so clear, for example, how to

integrate networking information into those. So, that probably would not be properties of a single person that allow you to make diagnosis but properties of how a person relates to other people in the database, and that is not always totally obvious and how you would integrate that, how you would use current methodology to do that. DR. CIMENT: I would like to hear a little bit more from people with a vision about the future that relates to possibilities of using mathematics in the context of new architectures not just looking at the world the way it is today but the way the world might be in years to come based on for example, the proliferation of supply chains based on computer integrated systems, sort of the WalMartization of the world you might say, right? The story I heard relative to 9/11 which makes it significant is that certain citizens understood immediately after the impact of 9/11 that there were certain requirements. Motorola sent truckloads of batteries for New York City with police escort, understanding that people with cell phones were going to go out more because they weren't recharging them. Home Depot I think was the one that shut down 20 of its stores understanding that people were going to need their appliances and services and things like that.

What I am getting at is the extreme events of the future which we never know what they will be. They won't repeat the past, but it seems like they will require all sorts of integration, not just from governmental agencies but from private sources, private organizations, industry, etc. The challenge for us in our society, which as Werner well pointed out will not tolerate privacy invasion, is to develop ideas that will preserve the privacy, allowing data sharing and I think mathematicians are probably the most suitable people to think these abstractions through and not worry about what the lawyers tell you you can't do, what the policy people tell you you can't do, create these models and show that there are maybe possibilities here if the policy would change and if the architecture would change, and we might have a society that preserves both security and privacy. DR. MC CLURE: I would reiterate what Jim just said. That is an interesting point. I have given absolutely no thought to this. It sounds like you have given a good deal of thought to this. So, I don't really have anything to add. DR. SCHATZ: These are certainly critical issues, the privacy issue versus trying to actually do something

119 for homeland defense, and they are hard issues that we at NSA face every day because we go through extremes to ensure privacy for citizens. We are there for the citizens and yet you are trying to do these data mining things. We have had more discussions with lawyers at our agency in the past 8 months than probably in the preceding 40 years, but I think you are right. I mean just the randomization ideas that Rakesh pointed out are very important for this kind of work. I mean so much of the time, especially in the industry setting the information people want you don't need to have individual names. You are looking for trends and so forth. PARTICIPANT: This is both a comment and a question and I think related to one of the points that Don McClure mentioned. This issue of data mining I think is sometimes too general. Much of the action I think has to be at the level of domains. A particular example he mentioned which is what I know about, to say, "Oh, yes, this is a classification problem. It gets you maybe 10 percent of the way or 20 percent of the way," and there are studies showing that we are not suggesting taking a general off-the-shelf thing and running it. You have to look at the domain. There are particular questions how things are found, what the fusions are, what life forces, etc. It 119

120 seems to me that this theme may actually be doing a little bit of this. I just take an extreme position here, by forcing people to concentrate at this very broad level, whereas if they came down to the more specific level they might actually make more progress. DR. MC CLURE: This relates to the point that I mentioned that Jerry Friedman made also about problems in this domain just requiring the coordination of many different kinds of expertise. It is the missions of data mining but when you get down to the specifics it doesn't seem like everything fits within the kind of highest level definitions. DR. SCHATZ: Certainly a good general point. I mean at the National Security Agency the one thing we have that not everyone has is data and because of that it really forces us to, we have many conferences where we try to think big picture, but we have many very specific problems and there is nothing like very specific problems to just get you down to business in designing algorithms and pushing forward and you are right. There is only so much progress you can make at an abstract level and data mining the concept is vague and broad and so on. Yes, Peter? 120

DR. BICKEL: I would like to follow up on that in a somewhat facetious way. Jerry had two things. He had assumption free, and then he had domain knowledge. There is really only one assumption-free statistical model, that is, that you have all of the things that you observed on certain -- I am talking unknown distribution -- and that doesn't get you very far. So, domain knowledge already implies that you are starting to put in structure, and of course the more you know, the more you can put in.

DR. PAPANICOLAOU: I am George Papanicolaou of Stanford University. This is about the comment that Diane Lambert made about the sonification of seismograms during the sixties to determine whether they come from natural earthquakes or from explosions. I know a little bit about that problem, and I would like to see how you feel about this issue. In fact, in the early sixties the discrimination problem was very much an unsolved problem. There was a very large community of maybe a couple of hundred seismologists, applied mathematicians, and others who were involved in this, and the problem wasn't solved until about 20 years later, in the eighties, and it turned out to be a very complex problem. It wasn't an issue of whether you could hear, whether the seismograph people could discriminate with the ear, but it was an understanding of how waves that are generated propagate along the surface of the earth over long distances. These are important. Yes, we are collecting information about seismograph activity down to a very small earthquake level, natural or otherwise. This information is collected. We have huge databases. International agencies are collecting. There is a data mining issue there, but the real issue is, when you go and use the data, how do you discriminate; what is the basic science that goes in there that would tell you what to do with the data that you have? Speaking also in the direction of some of the criticisms mentioned earlier, exposing data mining in such very broad terms hides some extremely important long-term issues in basic science, like computational wave propagation and various other issues, for example those that have to do with imaging in a complex environment, and these are every bit as important, let us say, for the detection problems that are going to nuclear proliferation. Very small tests are going to be made and have to be discriminated as to the role the data mining problem starts to extract some general information, some popular information, and I think that is something that this community, this small group here, should attempt to put a fix on. Exactly what will the general, in order to better fix, what will the general ideas do; how far will they go and that they are applied and that the underlying basic science needs to be emphasized very seriously.

DR. SCHATZ: Disagreement with that? Good point. Yes?

PARTICIPANT: We talked in an earlier session about techniques and terrorists changing their tactics, and focusing on false negatives. What about terrorists who try to force the system to make findings in people? Any comment about current techniques and that kind of thing? What kind of techniques might there be?

DR. MC CLURE: I don't know. I think about the problem as it has been approached over the years in automatic or aided target recognition. It is always easier to try to foil the system than it is to design it to be effective, and I think that is a problem in this area.

DR. SCHATZ: Yes, one of the problems that struck me at the SIAM data mining conference a couple of weeks ago is that our community, of which we have a cross section here -- we have academia; we have industry; we have the government -- would like to share information among ourselves about how we are doing this data mining, but in truth it is a little bit difficult because companies have proprietary information. They are trying to put out products, academic folks need to be tenured, and government agencies keep what they are doing secret, and it is all for very valid reasons. If people know how AT&T is going to detect phone fraud, they will get around that. So, it makes it difficult to exchange the science sometimes, I think, because of all the proprietary information, and that is a difficult situation here, but at the same time if we don't keep our sources and methods quiet when we need to, they won't be effective either.

DR. STUETZLE: Getting around things is only one problem. Once you know how the system works, you can also flood the system, so you have both options: you can flood it or get around it. That is, I guess, why the airlines don't want to tell you exactly what they are looking for when they profile you at the gate.

PARTICIPANT: The reason I asked the question is that even with many false negatives you feel the positives are still useful. With many false positives it is untenable, and the terrorists may force us to remove the system entirely. That is why I asked the question.

DR. SCHATZ: Good point.

PARTICIPANT: I don't think it is a question. It is more of a comment. It is not so hard to find a needle in a haystack. The point is that sometimes the thing that you are looking for is a special feature that occurs in a population, not necessarily for a single individual. In the project that we had with fraudulent access to a computer system, sometimes you can just ignore the data and look for some movements or command that is not typical. Very much to what Werner mentioned, your suspicions of the sixties and seventies looking for a robust method, sort of trying hard to avoid, how to evolve more for extreme events, sort of reverse thinking in trying to find these things in the bulk of the data, and then from there on you can sort of try to find the individual. In fact, even in a synthetic model, if you take Diane's data and say, "Can I detect fraudulent usage of phones?" having the phone bills of individuals for, say, a year or two, taking just random streams will give you this phone and my phone and somebody else's phone bill, but just for a week; you can observe for a week there is ongoing data. Could you find fraud in that type of data? The answer is probably you could, because, like Werner said, if you see calls to Nigeria that may be a good start.

DR. LAMBERT: Maybe I should say something. I wasn't actually trying to say that you can use these for terrorists. This is far beyond what we ever try to do.

The other thing is maybe we are focusing too much on trying to accomplish the final goals, whereas it might be useful just to give people a filtered set of information so that they have less than actually puts that by hand, which is, you know, it is not that we are trying to accumulate analysts. We are so far from that; we are not trying to do that at all. Another thing is that, actually, in detecting fraud you don't have long histories on people, because if people are going to commit fraud they don't go into the system that they have long-distance service with, for example. They make a call and access somebody else's system, where you have no history whatsoever. So, you are right, being able to handle people is very important, and I will have to defer to the comment about earthquakes. That is just some math that I had which had little symbols on it. I actually don't know. It could have been that they developed the signals and used the signal extraction from 20 years after the original application. You do have to take all the information you have. The trick is to figure out how you handle it.

DR. CHAYES: Just as a summary, I am not someone who knows about any of this, but what I am hoping is that what we are going to get out of all of these sessions are some questions that mathematicians can approach, and so I have just been writing down some more mathematical questions that have been coming up. That is also one of the things that we want the final report to do, to come out with a list of questions that mathematicians can look at. I guess the one that has been coming up the most is how do we focus on extreme events, and what I heard from everybody is that we really have to know how to model extreme events properly. So, I am not sure how much of that is the mathematical question.

DR. AGRAWAL: That assumes that you know something.

DR. CHAYES: That assumes that you know something. So, on a general level, how do you get extreme events, and it sounds like we are very far away from that. Another one that Jim mentioned was, if you have a lot of data, how do you visualize the data, and I know that there are people working on this. I am certainly not one of them. I am not sure if there is anyone here who can speak to the question of how you visualize data.

PARTICIPANT: Not just visualize.

DR. CHAYES: Yes, I mean in a metaphorical sense how do you visualize data. Then there is also the question that seems to me the one that we are furthest along on, which a number of people talked about, which is how do we randomize data to try to ensure privacy along with security. However, being further along doesn't mean that we are very far. So, it struck me that those are three areas that could set a mathematical agenda, and if anybody has any comments on any of those?

PARTICIPANT: I have one comment. If we are talking about addressing terrorism, are we talking about preventing a small number of events, and data mining to prevent these events, to make sure that you have got every single individual? On the other hand, there are a number of organizations like Al Qaeda, and maybe we could concentrate more on the structures of these organizations, and then you are not talking about identifying every individual, identifying every possible conspiracy, but identifying plans of the organization.

DR. KARR: I am Alan Karr from the National Institute of Statistical Sciences, and I would just like to point out that there is a wealth of techniques associated with preserving privacy in data other than randomization. Randomization has some well-recognized shortcomings in other cases, but I think this point is a lot broader, and there is a whole area of statistical disclosure that ought to be brought into this.
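To make the randomization question raised above concrete, here is a minimal sketch in Python of the classical randomized-response idea: each individual's answer is flipped with a known probability before it is recorded, so no single record can be trusted, yet the population rate can still be estimated. This is an illustration chosen for this note, not necessarily the method any speaker had in mind; the function names, the 10 percent true rate, and the 0.75 truth probability are all hypothetical.

```python
import random

def randomize(truth, p_truth, rng):
    """Report the true answer with probability p_truth, otherwise its opposite."""
    return truth if rng.random() < p_truth else not truth

def estimate_rate(reports, p_truth):
    """Estimate the true 'yes' rate from randomized reports.

    If pi is the true rate, the expected observed rate is
    p_truth * pi + (1 - p_truth) * (1 - pi); solving for pi gives the formula below.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Hypothetical population: 10 percent have the sensitive attribute.
rng = random.Random(0)
true_rate = 0.10
p_truth = 0.75
population = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomize(x, p_truth, rng) for x in population]
estimate = estimate_rate(reports, p_truth)
print(f"estimated rate: {estimate:.3f}")
```

The price of the privacy protection is variance: the closer `p_truth` is to 0.5, the less each record reveals and the noisier the estimate becomes, which is one simple instance of the shortcomings of randomization that Dr. Karr alludes to.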

DR. LASKEY: To the issues that have been brought up I would like to add one more, which is combining -- by the way, I am Kathy Laskey from George Mason University -- combining human expertise with statistical data. That does in fact have mathematical issues associated with it, because of methods where you represent the human knowledge in probability distributions to combine it with data, and there are lots of important innovations in that area. I would also like to point out, on the varied events: the importance of outliers, of rare events, has been mentioned a lot, but there is also the importance of multivariate outliers, data points that are not particularly unusual on any one feature. It is in combination that they are unusual. In fact, in the events leading up to September 11, these people blended in with the society, but if you look at the configuration of their behaviors, if somebody had actually been able to home in on those individuals and say, "Okay, you know, they paid cash for things, plus they were taking flying lessons, plus they were from the Middle East, plus this, plus," and then you discover that an Al Qaeda cell was planning to use airplanes as bombs, there were enough pieces that could have been put together ahead of time. Not that I am saying that it would be easy, but the pieces were not individually significant enough to set off anybody's warning system. It was the combination that was the issue.

DR. KAFADAR: I am Karen Kafadar from University of Denver. I think I heard someone from the FAA say that actually the airlines did identify something unusual about at least one of those. The response was to recheck the check level, rescreen the check level. There was another variable there. They didn't know that.

DR. SCHATZ: I don't want to cut off any discussions, although it looks like we are getting into the lunch break. We will take a couple more quick ones, and then I am sure there will be lots of time later to talk.

PARTICIPANT: I am trying to put together a couple of things that seem related. One is that we classify this and we can see this and this and this, and that ought to be intuitively meaningful, and it is information that ought to be the model used, but I also have a sense that we are looking for the kinds of things you can see looking back but not forward so easily. In retrospect every newscaster would know what was coming. So, what is the potential for these folks who are analysts, who are in the business of knowing how the targets are changing? Are we talking about being experts in real time participating in the development of a system that might have to change in time as well, and what are the odds that the system can say, "Is this interesting?" and have them say, "Yes," or "No," and then the system from what the analysts thought of it, with the perspective of the analyst looking at it Tuesday of this week instead of a month ago, and with all the complexity that you are not going to be able to deal with in rules no matter how careful you are? I know analysts are probably overworked like everybody else, but maybe you could participate in something like this.

DR. SCHATZ: We do a fair amount of analyzing the analysts, if that is what you are asking. I get in trouble at our agency when I talk about rebuilding the analysts because they don't like that, but we do; a lot of our activities and algorithms have to do with, on the one hand, helping them prioritize data that we think they are interested in based on what they have been doing, and trying to predict things they should have looked at that they are not getting time to get to. Modeling analyst behavior is something that we do all the time, and it will be more and more important for us, absolutely.

PARTICIPANT: The third time that a rep came to us and said, "Bush is linked to the White House," you know the system should be one because the analyst knows well that that is not interesting.
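Dr. Laskey's point about multivariate outliers, records that look ordinary on each feature but are unusual in combination, can be sketched with the Mahalanobis distance, a standard statistic that measures distance from the mean while accounting for correlation between features. The sketch below is illustrative only and uses synthetic two-feature data; the function name, the data-generating model, and the test points are assumptions made for this example.

```python
import random
import statistics

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the sample mean of data."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(data)
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = v^T C^{-1} v, using the closed-form inverse of the 2x2 covariance C.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5

# Synthetic data: two features that almost always move together.
rng = random.Random(1)
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    data.append((x, x + rng.gauss(0, 0.1)))

typical = (1.5, 1.5)    # follows the usual pattern
unusual = (1.5, -1.5)   # each coordinate is ordinary; the combination is not

sd_x = statistics.stdev(x for x, _ in data)
sd_y = statistics.stdev(y for _, y in data)
z_unusual = (abs(unusual[0]) / sd_x, abs(unusual[1]) / sd_y)

d_typical = mahalanobis_2d(typical, data)
d_unusual = mahalanobis_2d(unusual, data)
print(z_unusual, d_typical, d_unusual)
```

A per-feature screen would pass both points, since each coordinate is within about one and a half standard deviations of the mean, while the Mahalanobis distance of the second point is many times larger than that of the first. With many features the same idea applies, although estimating the covariance reliably becomes the hard part.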

DR. SCHATZ: Yes, there certainly is. For us, again, we enjoy a population of people to study in that regard that other people don't have access to, but certainly when we do have access to it, and we do, a lot of what we do is studying analyst behavior and trying to correlate: did they pull the document; did they look at a document; did they act on a document? We try to maximize our advantage there, because at the end of the day, no matter how many individuals you have, it is a minuscule epsilon number compared to the data size. So, what they actually do and act on is critically important. One more, Rakesh?

DR. AGRAWAL: It is not a question. Many times I like to go and look at things, but sometimes I think I wish there was more computational aspect to it. So, in decisions and so on, the interesting thing is like the combination of things. There is a lot of very interesting work happening, and it is interesting for somebody in this Committee to understand what happened and to look at it, to understand the computational people, and something I very strongly believe that we don't have hope for doing some of the massive common warehouse kind of things that somebody would pay for. I don't have the experience to look at the kind of data you have they are critical for commercial testing in the field and they can be done. So, how would you solve all the complications that you have, which essentially assume that there is one data source, but think how would you do all the computations you wanted to do where you have these data sources which are kind of ready to share something through a mode of computation, and these are some of the kinds of data points here which would be useful.

DR. SCHATZ: Very relevant, absolutely. Okay, Andrew?

DR. ODLYZKO: If you look at the broad technology, what we have to know in the next few years is storage capacity.

DR. SCHATZ: Good wrap up. Thanks, everybody. Thanks to the speakers for the morning.

(Applause.)

DR. SCHATZ: Twelve-thirty, here.

(Thereupon, at 11:50 a.m., a recess was taken until 12:40 p.m., the same day.)

Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition

Werner Stuetzle

There appear to be unrealistically high expectations for the usefulness of data mining for homeland security. When a Presidential Directive refers to "using existing government databases to detect, identify, locate, and apprehend potential terrorists," that is certainly an extremely ambitious goal. For example, pinpointing the financial transactions that occur as terrorists prepare for their attack is difficult given that it doesn't take a lot of money to commit terrorist acts.

Using data-mining systems to combat terrorism is more difficult than applying data mining in the commercial arena. For example, to flag people who may be committing calling card fraud, a long-distance company has extensive records of usage. As a result, there are profiles of all users. However, such convenient data are nonexistent when detecting people who might be terrorists. In addition, errors and oversights in the commercial arena are, in general, not terribly costly, whereas charging innocent people with suspected terrorism is unacceptable.

Biometrics will have to be a crucial part of any strategy in order to combat attempted identity theft.
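The concern about errors in this setting can be made quantitative with Bayes' rule: when the condition being screened for is extremely rare, even an accurate detector produces flags that are overwhelmingly false. The sketch below is a worked illustration only; the prevalence, detection rate, and false-positive rate are hypothetical numbers chosen for the arithmetic and do not come from the workshop.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(true case | flagged), by Bayes' rule."""
    true_flags = sensitivity * prevalence
    false_flags = false_positive_rate * (1 - prevalence)
    return true_flags / (true_flags + false_flags)

# Hypothetical screening scenario (all three numbers are assumptions):
prevalence = 1e-6            # one true case per million people screened
sensitivity = 0.99           # the detector flags 99 percent of true cases
false_positive_rate = 0.001  # and wrongly flags 0.1 percent of everyone else

ppv = positive_predictive_value(prevalence, sensitivity, false_positive_rate)
false_alarms_per_true_case = (1 - ppv) / ppv
print(f"P(true case | flagged) = {ppv:.6f}")
print(f"false alarms per true case = {false_alarms_per_true_case:.0f}")
```

Under these assumed rates roughly one flag in a thousand points to a true case. A commercial fraud system can absorb that ratio because a false alarm costs little, whereas in the setting described above each false alarm falls on an innocent person.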


The Mathematical Sciences' Role in Homeland Security: Proceedings of a Workshop (2004)

Chapter: Data Mining, Unsupervised Learning, and Pattern Recognition
