Data Mining, Unsupervised Learning, and Pattern Recognition
Pages 11-134


From page 11...
... James Schatz, "Introduction by Session Chair": Transcript of Presentation, Summary of Presentation, Video Presentation. James Schatz is the chief of the Mathematics Research Group at the National Security Agency.
From page 12...
... Our internal mathematics community is a dynamic professional group that encompasses full-time agency employees, three world-class research centers at the Institute for Defense Analyses that work exclusively for NSA, and a network of hundreds of fully cleared academic consultants from our top universities. As the Chief of the Mathematics Research Group at NSA, and the executive of our mathematics hiring process, I
From page 13...
... Coping with complex encryption algorithms requires, at the outset, a working knowledge of the most advanced mathematics being taught at our leading universities and, at the higher levels, a command of the latest ideas at the frontiers of research. Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research, areas of research that are directly related to the mission of the National Security Agency.
From page 14...
... However, there is a serious sense of urgency underlying every project, and you would soon realize that the mathematicians of NSA are relentless in their pursuit
From page 15...
... We are very proud of the fact that 40 percent of our mathematics hires are women and that 15 percent are from underrepresented minority groups. Of course, the agency depends solely on the greater US mathematics community to educate each new generation of students, but we also depend on the professors at universities across the country to advance the state of mathematics research.
From page 16...
... It is a wonderful topic. There is absolutely nothing going on in this conference that isn't immediately relevant to NSA and homeland security for us, and this first topic is an area of research that I think we had a bit of a head start on.
From page 17...
... Beyond cryptology, the information age that is now upon us has opened up a wealth of new areas for pure and applied mathematics research. While advances in telecommunications science and computer science have produced the engines of the information age, that is, the ability to move massive amounts of digital data around the world in seconds, any attempt to analyze this extraordinary volume of data to extract patterns, to predict behavior of the system, or to recognize anomalies quickly gives rise to profound new mathematical problems.
From page 18...
... Jerry Friedman, "Role of Data Mining in Homeland Defense": Transcript of Presentation, Summary of Presentation, PDF Slides, Video Presentation. Jerry Friedman is a professor in the Statistics Department at Stanford University and at the Stanford Linear Accelerator Center.
From page 19...
... This is from the President's Office on Homeland Security, Presidential Directive 2, and it is a section on the use of the best opportunities for data sharing and enforcement efforts. It says, "Recommend ways in which existing government databases can best be utilized to maximize the ability of the government to identify and locate and apprehend terrorists in the United States."
From page 20...
... Okay, here is the popular press. Computer databases are designed to become a prime component of homeland defense.
From page 21...
... What are the kinds of data we are going to see in homeland security applications? Well, there would be real-time, high-volume data streams; massive amounts of archived data; distributed data that may or may not be centrally warehoused (hopefully it will be centrally warehoused, but you can't centrally warehouse it all); and of course many different data types that will have to be merged and
From page 22...
... Well, there is a big effort, as most of you know, in the government for data and information sharing. The buzzword is to tear down the information stovepipes between the various agencies, who apparently guard rather carefully all their own databases and information.
From page 23...
... I got this from www.privacy.org, another example of where the data is going to come from. Federal aviation authorities and technology companies will soon begin testing a vast air security screening system linking every reservation system in the US to private and government databases designed to instantly pull together every passenger's travel history and living arrangements, plus a wealth of other personal and demographic information.
From page 24...
... Behavior recognition: travel behavior, financial behavior, immigration behavior; automatic screening of people, mail, cargo; and of course forensics, post-apprehension. If there is an attack you want to catch the perpetrators and all those who planned it so they can't plan another one.
From page 25...
... The most successful applications of data mining have been commercial applications, which you can basically characterize as behavior recognition using transactional data. Classic examples are fraud detection: fraudulent phone calls, fraudulent credit card usage, tax evaders; those are classic commercial applications.
From page 26...
... I mean, hopefully there are fewer terrorists than people who are trying to make fraudulent phone calls or trying to cheat on their taxes, and so the needles are going to be really small, and the stakes are much higher. If your data mining algorithm on a consumer database doesn't quite get all the customers who might be attracted to your product, well, your client is going to realize slightly less profit and probably won't even know it.
From page 27...
... We have to worry about what the machine learning people call concept drift, namely, that the underlying joint distribution of the variables, if you like, is changing with time, for both the signal, because the terrorists aren't stupid, and the background. People's shopping patterns are likely to change, or they are changing because of our intervention in homeland security, and we have an intelligent adversary, and I think that is one of the things that makes these kinds of problems unique; not that consumers aren't intelligent, but they generally don't try to evade you.
From page 28...
... So, this maliciously observational data, I think, is a new twist; certainly existing data mining algorithms may be valuable to tell us about fraudulent phone calls, but I think that is a really new twist to this kind of thing. So, what are the properties that our methodology is going to have to have?
From page 29...
... In the data mining area, popular techniques are decision trees, rule induction, support vector machines, neural networks, and a lot of other things, and it is not clear, at least in my mind, which if any of these will provide success. They all have advantages and
From page 30...
... The data itself won't be new kinds, but the fact that we have to merge it and analyze it simultaneously I think is fairly new: the fact that a single database will have transactional data, audio, video, and text data all in it, and we will have to treat it all simultaneously. Nearly everything we measure will be irrelevant.
From page 31...
... However, I think that, as is often the case, wartime requirements often lead to scientific advances, and in World War II there were a ton of them in the mathematical area. Of course, it played a huge role, and many aspects of World War II also led to advances in mathematics that led to the invention of and advances in computers, which of course have changed our lives in many areas other than code breaking. Advances are certainly needed in data mining if we are going to successfully address the expectations of what data mining can do in homeland security, and if we are successful, even remotely so, we will also impact a wide variety of other areas which are not quite so difficult.
From page 32...
... Certainly, for the problems we deal with, we have fragmented data, very badly garbled data, data from different sources that we are trying to pull together, and the scalability of algorithms is absolutely a critical point to be made in all this. Certainly, other fine points we have here are trees, rule-based algorithms, support vector machines, neural nets; we are using all those techniques right now.
From page 33...
... Okay, well, we are a little ahead of schedule, which is fine, and I think Diane is ready. So, our next speaker is Diane Lambert from Bell Labs.
From page 34...
... Data mining has several applications to homeland security: air travel, characteristic signatures of bioterrorism versus natural outbreaks, cybersecurity, intrusion detection, surveillance, and travel, financial, or immigration behavior. Currently, the federal government plans to create a vast air-security screening system linking every reservation network in the United States to private and government databases.
From page 35...
... She has also served on ACM and SIAM program committees for conferences on data mining. Her research for the past several years has focused on ways to analyze and model the kinds of massive, complex data streams that arise in telecommunications.
From page 36...
... So, what I am going to stick with is communications data. So, one thing that we have a lot of at Bell Labs is communications data, and you can imagine that communications data has a lot of information in it that would be of interest to people who are trying to track what is going on.
From page 37...
... It is not too meaningful, or it may actually be meaningful, but you know when and what was downloaded, which images and everything else. Going beyond that a bit, online chatrooms, the way they are structured, there is actually a record kept in a log file of everything that is posted.
From page 38...
... The first question is what kind of information do we want, and, you know, it is always a little hard to do in particular examples, because in any particular example you can say, "I don't want that information," but we will start with the easy one, and the easiest flows of all are really just call records. So, they are records of the calls.
From page 39...
... Here it is terminating number, and you get things like which calls are incoming, which calls are outgoing, and then from that you can think of describing what the typical behavior of that caller is, and then you can also think of how do I find something that is atypical; how do I detect there is a change in the pattern? Now, if we look at it you can see that all of a sudden on the right there is all this purple stuff that starts happening that has longer lines and there is more of it.
From page 40...
... So, how are we going to represent behavior? What we are going to do is say that behavior is just a probability distribution, that what it means is what are you likely to do and what are you unlikely to do and both of them may be of interest to you.
From page 41...
... So, the hope is that you can evolve the probability distribution as time goes on.
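One concrete way to maintain such an evolving distribution is to keep event counts that decay over time, so recent behavior gradually outweighs old behavior. The sketch below is illustrative only; the class name, event types, and decay factor are assumptions for the example, not details from the talk.

    from collections import defaultdict

    class BehaviorProfile:
        """Discrete probability distribution over event types (e.g., call
        categories) that evolves as new events arrive."""

        def __init__(self, decay=0.95):
            self.decay = decay          # how quickly old behavior is forgotten
            self.weights = defaultdict(float)
            self.total = 0.0

        def update(self, event):
            # Exponentially down-weight all history, then credit the new event.
            for key in self.weights:
                self.weights[key] *= self.decay
            self.total *= self.decay
            self.weights[event] += 1.0
            self.total += 1.0

        def prob(self, event):
            # Probability currently assigned to this event type.
            return self.weights[event] / self.total if self.total else 0.0

    # A caller who mostly made short calls, then shifted to long ones:
    profile = BehaviorProfile()
    for call in ["short"] * 50 + ["long"] * 10:
        profile.update(call)
    print(round(profile.prob("long"), 2))   # rises as recent behavior changes

With the decay factor below 1, the profile "forgets" old events at a controlled rate, which is one common way to let a distribution track slowly changing behavior.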
From page 42...
... So, what we want to do is to track those probability distributions. Okay, now, these bulletin boards and chatrooms, there are lots of them out there.
From page 43...
... So there are all these discussions going on, and these discussions use something called IRC, the Internet Relay Chat protocol, and so they are just writing files, kind of like you would write in a log file for a web download. Okay, so, we have a firewall, and on one side of the firewall. Next?
From page 44...
... So, if there is only, you know, if there is only a little bit of activity like five posts in the past hour then you may decide only two posters are active right now. You may decide that you don't want to sample that and move on to the next one.
From page 45...
... You bring them up beyond the firewall and they are all in sort of a funny kind of time order, not exactly time order, but they are not in the order of users. They are not in the order of topic, and they are not in the order of even room anymore.
From page 46...
... So, that is where we end up that you need to do this probability distribution. Now, since everybody has a probability distribution, every room has a probability distribution and every topic has a probability distribution all over different kinds of variables what you have to do is to make sure that you are not trying to be too aggressive here.
From page 47...
... Somehow you have to be able to start off their probability distribution in a reasonable way, and so this makes kind of odd clustering kinds of problems in a way, or indexing problems, because you could look at a few records for an individual and say, "Oh, this individual looks like they are going to behave like that group of people," and take an average for that group of people to start them off with, and then once you initialize, you know that is not very good. It is better than just sort of guessing an average over everything in the world, but it is not very good.
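A toy illustration of that cold-start idea follows: assign a new user to the nearest group profile and use the group's average distribution as a pseudo-count prior. The names, bin counts, and prior strength are all invented for the sketch.

    import numpy as np

    # Hypothetical group profiles: average behavior distributions (rows sum
    # to 1) for clusters of existing users, over three event-type bins.
    group_profiles = np.array([
        [0.7, 0.2, 0.1],   # e.g., mostly short calls
        [0.1, 0.3, 0.6],   # e.g., mostly long calls
    ])

    def initialize_profile(first_events, group_profiles, prior_strength=20):
        """Start a new user's distribution from the closest group average,
        treated as pseudo-counts, blended with their first few events."""
        counts = np.bincount(first_events, minlength=group_profiles.shape[1])
        empirical = counts / max(counts.sum(), 1)
        # Pick the group whose average is closest to the early evidence.
        nearest = np.argmin(((group_profiles - empirical) ** 2).sum(axis=1))
        blended = group_profiles[nearest] * prior_strength + counts
        return blended / blended.sum()

    # Three early events, mostly in the "long" bins: start near group 2.
    print(initialize_profile([2, 2, 1], group_profiles))

As the transcript notes, such an initialization is better than a global average but still crude; it only gives the per-user distribution a reasonable place to start.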
From page 48...
... Is there any useful information you get out about what is going on out there in the world using all those probability distributions that you have collected? You have been just sort of using all that data.
From page 49...
... Well, there is kind of a history in the Bell system of using sound, maybe because of the phone. I don't know.
From page 50...
... Now, more recently, Mark Hansen, who is at Bell Labs, and Ben Rubin, who is at something called the Ear Studio, had a grant from Lucent(?)
From page 51...
... how the topics are evolving in time, and so, as these things come in you can cluster the topics, because these topics are represented by a bag of words, and you can cluster those, and if you use some kind of online algorithm, which is, again, you know, there are ways to do it but not necessarily good ways, you can do dynamic clustering and see what is coming out over time. You can then map those to displays that are high dimensional, and so this is actually one which I had no idea what it was about, but I think it is about a gecko, is what I have been told, and so there is a bulletin board or a chatroom for people who have geckos as pets, and this is a topic for them, which I guess is the only thing on about that time of day.
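The dynamic clustering the speaker gestures at could be as simple as sequential (online) k-means over bag-of-words vectors, where each arriving topic vector nudges its nearest centroid. This is only one possible reading of the talk; the vocabulary size, k, and data are invented.

    import numpy as np

    def online_kmeans(stream, k, dim, seed=0):
        """Sequential k-means: assign each incoming bag-of-words vector to
        the nearest centroid, then move that centroid toward it."""
        rng = np.random.default_rng(seed)
        centroids = rng.random((k, dim))
        counts = np.zeros(k)
        labels = []
        for x in stream:
            j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            counts[j] += 1
            centroids[j] += (x - centroids[j]) / counts[j]   # running mean
            labels.append(j)
        return centroids, labels

    # Toy stream over a 3-word vocabulary, with two rough topics.
    stream = [np.array(v, dtype=float) for v in
              [[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3], [6, 0, 0]]]
    centroids, labels = online_kmeans(stream, k=2, dim=3)
    print(labels)

As the speaker says, there are ways to do this online "but not necessarily good ways"; sequential k-means is sensitive to initialization and to arrival order.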
From page 52...
... Now, that is really as far as we have gotten, to be honest, but now that you have these probability distributions, at least in principle you can think about trying to detect interesting changes in behavior. Now, for things that are subtle, like Jerry said, it is not really going to be too easy to find them.
From page 53...
... So, you can't actually optimize for everyone at the same time, but you can take some large set of users and train from them, and then what you can try to do is to figure out a set of variables, which isn't too hard, that is useful for a large fraction of users, and then you would say that if it is useful for a large fraction of people then you will keep it, but I think this is one place where things get a little murky, one of many, many places; but now, if you have that, you get a probability distribution for what the person usually does or says, or what the topic usually is in this chatroom. Now, at the same time you can have a probability distribution for what you are looking for.
From page 54...
... So, what we want to do here is to direct it by comparing the probability distributions of what usually goes on compared to the probability distribution of what you are looking for. Now, if you do this in a log likelihood ratio then big changes in either one should be flagged here.
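A minimal version of that comparison is a log-likelihood ratio score summed over observed events: positive scores favor the "what you are looking for" distribution over the typical one. The distributions, threshold, and smoothing constant below are illustrative assumptions.

    import math

    def log_likelihood_ratio(events, p_typical, p_target, eps=1e-9):
        """Sum of log(P_target / P_typical) over observed events; large
        positive values mean the stream looks more like the target."""
        return sum(math.log(p_target.get(e, eps) / p_typical.get(e, eps))
                   for e in events)

    p_typical = {"short": 0.8, "long": 0.2}   # what usually goes on
    p_target  = {"short": 0.3, "long": 0.7}   # what you are looking for

    score = log_likelihood_ratio(["long", "long", "short", "long"],
                                 p_typical, p_target)
    if score > 2.0:   # threshold tuned to an acceptable false-alarm rate
        print("flag for review, score =", round(score, 2))

Because the ratio is sensitive to change in either distribution, big shifts in the baseline or in the target pattern both move the score, as the speaker notes.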
From page 55...
... We are running ahead a little bit. If anybody has any questions for Diane, we could even get Jerry back up
From page 56...
... PARTICIPANT: I was intrigued by your comment about using sound data. Generally we find little response; they feel it is uninteresting.
From page 57...
... He was imagining that the Allies in World War II built a broadcast tower and broadcast pure noise 24 hours a day, except at predetermined times something came through in
From page 58...
... something that is fairly plain text. In the modern communications era it seems to me that terrorists could publish noise on the Internet all the time and just overwhelm the detection capabilities, except at predetermined times, you know, 12:03 a.m.
From page 59...
... Our next speaker is Rakesh Agrawal from IBM-Almaden.
From page 60...
... First, it is important to define "typical behavior." Then, it is possible to detect changes relative to the baseline. A current research project at Bell Labs involves monitoring several thousand chat rooms in order to extract from the resulting data a probability distribution for each user and each room.
From page 61...
... , Bombay. Prior to joining IBM Almaden in 1990, he was with Bell Laboratories from 1983 to 1989.
From page 62...
... They have columns and so on, and Jerry gave a lot of examples of such applications. My sense is that at this time we are beginning to see data mining applications in non-commercial domains and I guess homeland defense definitely falls into this category of non-commercial domain, and these applications involve use of both structured and unstructured data, but my post hoc sense is that we are just seeing the beginning of it.
From page 63...
... I think of somebody saying that, you know, we have off-the-shelf data mining technology available which can solve all the homeland defense problems, but, you know, that is too much of an extrapolation at this stage; there is promise here, though, and it is worth exploring that promise. Okay, so, this is what I am going to do.
From page 64...
... You can sort of work it out and zoom in on any person and get to see what these relationships look like, how these social linkages look. The interesting question is, you know, this is just an example we created, and the interesting question is how this was done, and basically this was just based on data about the callers, for about a million pages, using underneath it our data mining algorithm, and so let me just tell you what that algorithm is and then I will come back to this page and show how this was done.
From page 65...
... If you look into the data mining processes people have spent the last few years figuring out how to do this computation very well.
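The core of that computation is support counting for frequent itemsets, the counting step behind association rules. Here is a stripped-down sketch for pairs only (real systems use Apriori-style pruning and much more efficient data structures); the call data is invented.

    from itertools import combinations
    from collections import Counter

    def frequent_pairs(transactions, min_support):
        """Count co-occurring pairs across transactions and keep those
        meeting a minimum support threshold."""
        counts = Counter()
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                counts[pair] += 1
        return {pair: c for pair, c in counts.items() if c >= min_support}

    # Each "transaction" is a set of parties appearing in one call record.
    calls = [{"alice", "bob"}, {"alice", "bob", "carol"},
             {"bob", "carol"}, {"alice", "bob"}]
    print(frequent_pairs(calls, min_support=2))
    # {('alice', 'bob'): 3, ('bob', 'carol'): 2}

Frequently co-occurring pairs (and larger itemsets) are then what gets drawn as edges in the social-linkage graph.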
From page 66...
... It has things like financial support, Islamic leaders, and so on, and the idea is the following: can we, by looking at a site, give, quote, unquote, a profile of this particular site in terms of these features that we are interested in? Okay, and you can sort of see what the profile looks like for these particular sites.
From page 67...
... This is people who have high credit risk, and these are the data, and they are trying to develop rules for who are the people who are high credit risks and who would be low credit risks, and this is what a decision tree might look like. The idea is that once you have built this decision tree, a new person comes in, and we don't know whether this person is a high or a low credit risk, and we run this person's record through the decision tree and we are able to make a prediction. So, this is what people have done in the past. Can we just go back to the previous page?
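The workflow the speaker describes, training a tree on labeled records and running a new record through it, looks like the following sketch. The use of scikit-learn is an assumed library choice, and the features and data are invented.

    from sklearn.tree import DecisionTreeClassifier

    # Toy training records: [income_in_thousands, years_at_job, prior_defaults]
    X = [[25, 1, 2], [80, 10, 0], [40, 3, 1],
         [120, 8, 0], [30, 2, 3], [90, 6, 0]]
    y = ["high", "low", "high", "low", "high", "low"]   # credit-risk labels

    # Fit a shallow tree; each internal node is a threshold test on a feature.
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # A new applicant's record is run through the learned tree.
    print(tree.predict([[55, 4, 1]]))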
From page 68...
... once we have built the classifier we can take any site and pump it through that, you know, sort of give the pages of that particular site to develop this kind of a profile.
From page 69...
... In this case essentially the underlying technology is sort of two things. One is what are called sequential patterns, and the other is what are called shape queries.
From page 70...
... Let us go to the previous page. These things were essentially the sequential patterns which were found in the data, and for these sequential patterns there is a history, and that is what is being shown here. Next slide?
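A sequential pattern, in the simplest case, is an ordered pair of events that recurs across many histories. The sketch below counts length-2 patterns only; full sequential-pattern miners (e.g., the GSP family) handle longer patterns with support-based pruning. The event names are invented.

    from collections import Counter

    def frequent_ordered_pairs(sequences, min_support):
        """Count ordered pairs (a before b, not necessarily adjacent) across
        sequences; each sequence contributes a given pair at most once."""
        counts = Counter()
        for seq in sequences:
            seen, pairs = set(), set()
            for event in seq:
                for earlier in seen:
                    pairs.add((earlier, event))
                seen.add(event)
            counts.update(pairs)
        return {p: c for p, c in counts.items() if c >= min_support}

    histories = [["visa", "flight", "hotel"], ["visa", "hotel"],
                 ["flight", "hotel"], ["visa", "flight"]]
    print(frequent_ordered_pairs(histories, min_support=2))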
From page 71...
... So, basically, on any particular topic you can find what has been posted over the last 20 years, and it is available to you. So, the idea was that you take these postings and you have some sort of a response analyzer, and this response analyzer will show you the results on a particular topic over time, and it turns out, without going into details, that if you are trying to use standard data mining algorithms they don't work very well, because if you are trying to, let us say, find out people who are against a topic or for a topic, then just, let us say, finding out about
From page 72...
... And you can sort of see that there is a person here who doesn't like State Farm, but a lot of people, a surprisingly large number of people, just comment in defense of this company and sort of answer every person. We found it pretty interesting.
From page 73...
... I just wanted to show you another one. State Farm decided to leave New Jersey, and there was this concern about how people were going to react to that particular fact, that they were quitting New Jersey, and again it is kind of interesting, which was the point Diane mentioned about how in these newsgroups there is a topic and lots of people respond off topic and so on, but you can see what really happened in this case; it was very easy to see, once you went through this particular algorithm, that in this instance people didn't think State Farm was at fault.
From page 74...
... Next? PARTICIPANT: All these data mining algorithms we have seen involve large numbers.
From page 75...
... So, how do we do sort of data mining over compartmentalized databases? That is going to be an interesting technical challenge, and in both of these things, again, I try to point out what people have been thinking, and then again the solutions are sort of meant to illustrate the direction.
From page 76...
... of a classification model. We want to build a decision tree, and the question is, can we do that particular task without violating individual privacy? If you think about it, what is really private are the values, and what we tried to do here was essentially capture the intuition of what happens anywhere on the web today.
From page 77...
... So, my idea is reconstructing distributions. You are not reconstructing the values; you will never be able to reconstruct precisely what the data looked like, but you might be able to reconstruct the distributions, and once you have reconstructed the distributions, most of the data mining algorithms really only require working at the distribution level.
From page 78...
... Next? And if you use these reconstructed distributions, this is an example, for a particular task, sort of showing that at different levels of randomization you are losing a certain amount of accuracy but you might gain more privacy, and there is this notion of how much randomization there is.
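A toy version of the randomize-then-reconstruct idea: each sensitive value is perturbed with noise from a publicly known distribution, only the noisy values are shared, and an iterative Bayesian update then estimates the original distribution from the noisy ones. The domain, noise model, and iteration count are all assumptions for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    # True (sensitive) values on a small discrete domain 0..9, skewed low.
    true_vals = rng.choice(10, size=5000,
                           p=np.r_[np.full(5, 0.14), np.full(5, 0.06)])

    # Randomize: add uniform noise in {-2,...,2}; only `noisy` is shared.
    noise_support = np.arange(-2, 3)
    noisy = true_vals + rng.choice(noise_support, size=true_vals.size)

    def reconstruct(noisy, domain, noise_support, iters=20):
        """Iteratively re-estimate the original distribution, assuming the
        noise distribution is public knowledge."""
        f = np.full(domain.size, 1.0 / domain.size)     # start uniform
        p_noise = 1.0 / noise_support.size
        for _ in range(iters):
            new_f = np.zeros_like(f)
            for w in noisy:
                # Likelihood of each original value having produced w.
                lik = np.where(np.isin(w - domain, noise_support), p_noise, 0.0)
                post = lik * f
                new_f += post / post.sum()
            f = new_f / noisy.size
        return f

    est = reconstruct(noisy, np.arange(10), noise_support)
    print(np.round(est, 2))   # roughly recovers the skew toward low values

Wider noise gives individuals more privacy but makes the reconstruction, and hence the mining, less accurate, which is the trade-off the speaker describes.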
From page 79...
... Okay? So, once again another topic and so I will talk a little bit about this computation of compartmentalized databases.
From page 80...
... So, you do partial computations and then try to combine partial results, and a third option would be to do some sort of on-demand data shipping and data composition kind of thing. I will briefly point out that this can be done.
From page 81...
... Next? And basically, for decision trees, particularly if you are using an ID3 kind of classifier, the key step boils down to computing information gain and doing this kind of computation, and in this case
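The quantity in question is information gain: the drop in class-label entropy from splitting on an attribute. In a compartmentalized setting, the per-split label counts are what the parties would need to compute jointly without exposing their raw rows. A single-site sketch, with invented records:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of the class-label distribution, in bits.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(records, labels, attr):
        """Entropy of the labels minus the weighted entropy after splitting
        on one attribute: the key step of an ID3-style tree builder."""
        splits = {}
        for rec, lab in zip(records, labels):
            splits.setdefault(rec[attr], []).append(lab)
        weighted = sum(len(s) / len(labels) * entropy(s)
                       for s in splits.values())
        return entropy(labels) - weighted

    records = [{"employed": "yes"}, {"employed": "no"},
               {"employed": "yes"}, {"employed": "no"}]
    labels = ["low", "high", "low", "high"]
    print(information_gain(records, labels, "employed"))   # 1.0 bit here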
From page 82...
... The profile of suicide bombers has completely changed, judging from the examples of the current suicide bombers. So, anybody who was building a data mining model using the previous examples collected from the past was fighting an old war.
From page 83...
... Let me give you an example of what data mining algorithms do: clustering.
From page 84...
... There is a huge danger that we might be just finding noise, and all the things I know of, people say that it is about finding the real events, and generally most of the data mining algorithms at this stage today are extremely weak at finding real events. Patterns?
From page 85...
... A lot of them have a lot of wrinkles, and, you know, is there hope of getting this thing done without somebody behind them tuning these algorithms? Is there hope of building things which, in some sense, go and figure out in a data-dependent way what parameters to use for which particular algorithm?
From page 86...
... It seems to me that you have offered an idea in another domain. I gather you can do social network analysis to find cliques and you have a very long list of cliques based on Internet data.
From page 87...
... A couple of points I just wanted to make before we break here, on Rakesh's talk: the randomization and privacy issue I think is very important for us, too, and the challenge of working across what you are referring to as compartmentalized databases is a big one. It is hard to know how to get a grasp on those, but they are very big problems, important for us, too.
From page 89...
... Linkage analysis consists of forming a graph that illustrates a person's social network, storing this information in a database, and recording "association rules" and "transactions" for each relationship in order to permit the use of data-mining algorithms. Web site profiling uses a decision-tree-based technique called "classification." A classifier is built in order to classify a set of people and may be used to distinguish people who are high credit risks from those who are low credit risks.
From page 90...
... Donald McClure, "Remarks on Data Mining, Unsupervised Learning, and Pattern Recognition": Transcript of Presentation, Summary of Presentation, Video Presentation. Donald McClure is a professor of applied mathematics at Brown University.
From page 91...
... It is not an area in which I have been active. On the other hand, I do have a fair understanding of problems in the area of trying to extract information from massive amounts of data, generally image data.
From page 92...
... I, personally, feel that one of the challenges in new areas that are poised for advances is technology transfer, and this works in both directions. How do we stimulate cross-fertilization between scholarly research that is going on, on the one hand, and the expertise of people who are designing and integrating systems, on the other hand? And this really has to occur in both directions.
From page 93...
... I don't know if we completely agree or if we perhaps are saying the same things in different words or from different perspectives. Diane made the comment, to pick up from the comments from the talks first, that behavior should be modeled as a probability distribution, and I emphasize the word "modeled"; it is important to have a model when we are trying to develop decision procedures.
From page 94...
... What can mathematics and statistics contribute to problems of identifying and extracting information from massive amounts of data? I believe that there are many contributions that mathematical sciences can make.
From page 95...
... They commented that in London, where cameras are already in very widespread use, on average every person has their face on camera 300 times a day. Now, when we think about trying to identify faces in video imagery, this is I think a problem of trying to extract image information from massive amounts of data.
From page 96...
... License plate reading is a problem in computer vision that has been studied for decades. In the sixties people were developing systems to read license plates.
From page 97...
... These come from a web page, www.photocomp.com where there is actually a fairly interesting summary or review from a consultant system integrator about systems that will read plates. License plates as an optical character recognition problem are challenging because of the many different forms in which the problems can be presented.
From page 98...
... So, there is, according to the specs, supposed to be clear space around the identifying marks on a wafer, but the semiconductor fabs try to use every square millimeter of space on that wafer, so they will etch over the region where the ID occurs.
From page 99...
... So I can refine that hypothesis of a character into a composite hypothesis formed from the different characters that might be present. So, at any rate, we view this as a hypothesis-testing problem and try to model what it is we are looking for, to model the variation, and to model the probability distributions.
From page 100...
... distribution-free methods, using non-parametric methods in order to have robust methods for forming a basis for decision procedures, articulating the model in the form of probability distributions, so that when we need to decide among the hypotheses we can use methods that are based on time-honored principles of statistics, like likelihood ratio tests in particular in this problem. At any rate, I will conclude my comments.
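In that framing, each candidate character is scored by a likelihood ratio against a background (no-character) model, and the composite hypothesis is resolved by taking the best-scoring component. The feature model below is a made-up stand-in for whatever image features a real system would use.

    import math

    def best_hypothesis(observed_features, models, background):
        """Score each character hypothesis by a log-likelihood ratio against
        the background model; return the best one."""
        scores = {char: sum(math.log(p[f] / background[f])
                            for f in observed_features)
                  for char, p in models.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]

    # Toy per-character probabilities of observing each stroke feature.
    models = {"T": {"hbar": 0.9, "vbar": 0.9, "loop": 0.05},
              "O": {"hbar": 0.1, "vbar": 0.1, "loop": 0.95}}
    background = {"hbar": 0.3, "vbar": 0.3, "loop": 0.3}

    print(best_hypothesis(["hbar", "vbar"], models, background))  # favors "T"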
From page 101...
... in the form of probability distributions. In that way, when we need to make a decision among the hypotheses, we can use methods that are based on time-honored principles of statistics such as likelihood ratio tests.
From page 102...
... Dr. Stuetzle was an assistant professor in the Department of Statistics at Stanford University from 1978 to 1983, with a joint appointment in the Computation Research Group of the Stanford Linear Accelerator Center.
From page 103...
... So, one thing that struck me this morning, or when I actually looked at the transparencies for Jerry's talk that he sent me ahead of time, is that there are very high expectations for the usefulness of data mining for homeland security. So, for example, the Presidential Directive says, "Using existing government databases to detect, identify, locate and apprehend potential terrorists," so not people who did something but people who might do something in the future; that is certainly, I think, an extremely ambitious goal, the same way with locating financial transactions occurring as terrorists prepare for their attack.
From page 104...
... I mean, they had murdered leading bankers, politicians, judges and so on, and blown up things, and so even despite those two factors, a highly regulated society and the fact that it was already known who the perpetrators were, it took years to actually track these people down. So that is one thing.
From page 105...
... You apply data mining tools and you basically classify people into potential terrorists or not potential terrorists.
From page 106...
... So, that is what AT&T does to detect calling card fraud. They know the people who have AT&T long distance and then they collect call records of these people.
From page 107...
... So, this is a much easier problem, but even that is already quite hard; I mean, people have put a lot of work into that. So, now, getting back to the original problem of estimating whether someone is a terrorist given the properties, the problems are, first of all, that you have a very small training set.
From page 108...
... It might be illegal or just infeasible because of the sheer mass of the data. Okay, so, even in its highly idealized form where you have this universe of people and you want to make this rule and if you could put all these databases together and run these data mining algorithms, even in the idealized form this is very difficult.
From page 109...
... Now, that only makes sense if I am really who the system thinks I am. So, therefore, if you can't prevent identity theft that kind of system is not going to be very useful because you have to be sure that the person that you are making the prediction about really is the
From page 110...
... You could just say, "Well, I am not going to let them do X or let them go in." So, that is something you might realistically be able to do, or you could deny access to people who are inside the system, on whom you have data but who have bad indicators. So, you can run the classification procedure on your universe and you can try to make a prediction of the likelihood of being a terrorist, and if you think that it is high, you might deny access.
From page 111...
... methods are really crucial, really have to be a crucial part of any strategy, because that is the only way to totally reliably establish identity. All right, that is all I have to say.
From page 112...
... So, to a certain extent I go across all the three-letter agencies. The thing that Jim mentioned was that his customers are the analysts, and so after all the data comes in, the analyst looks at it. I have been in this business, this particular job, for a little over 2 years, and I have spent a lot of time with analysts, and I would like to tell you that a lot of the tools that are provided to them are not used. We have only four particular areas we work in, and one area that we are starting up is an area called novel intelligence from massive data, and that kind of clicks back in here, and it is hopefully of some interest to you, not necessarily for the funding, although if you would like to participate in the funding we would certainly welcome that, but we have spent about 6 months
From page 113...
... working with analysts and saying, "What do you want?" as opposed to what can we give you, and I thought that, you know, if you have a hammer, every problem is a nail.
From page 114...
... viewpoint of what is important to an analyst, and they are looking for that rare occurrence, and so the gentleman from IBM I think put it very succinctly. That is a very difficult problem, and all of the work is unclassified, and there will be conferences involved.
From page 115...
... Okay, I think that one of the problems for the analysts, and this is something that is a very big problem, is this issue of passive knowledge. If you do a data mining search on a database and you come out and say, "Gee, there is a strong correlation between the White House and George Bush," this is not very interesting to an analyst, and he is probably going to take your software and throw it in the trash.
From page 116...
... half dozen positive cases in a massive database, you have to use your domain knowledge. You have to figure out how to use the domain knowledge, but we are also going to have to figure out policy, and so I think that one of the things this community should be looking at is the question of how to find the terrorists, but we also have to bring to bear policy issues like screening programs.
From page 117...
... DR. CIMENT: I would like to hear a little bit more from people with a vision about the future that relates to possibilities of using mathematics in the context of new architectures, not just looking at the world the way it is today but the way the world might be in years to come, based on, for example, the proliferation of supply chains built on computer-integrated systems, sort of the WalMartization of the world, you might say, right?
From page 118...
... The challenge for us, in a society that, as Werner well pointed out, will not tolerate privacy invasion, is to develop ideas that will preserve privacy while allowing data sharing, and I think mathematicians are probably the most suitable people to think these abstractions through and not worry about what the lawyers tell you you can't do or what the policy people tell you you can't do; create these models and show that there are maybe possibilities here, if the policy would change and if the architecture would change, and we might have a society that preserves both security and privacy.
From page 119...
... Much of the action, I think, has to be at the level of domains. A particular example he mentioned, which is what I know about, is to say, "Oh, yes, this is a classification problem."
From page 120...
... I mean, at the National Security Agency the one thing we have that not everyone has is data, and because of that it really forces us to be concrete; we have many conferences where we try to think big picture, but we have many very specific problems, and there is nothing like very specific problems to get you down to business in designing algorithms and pushing forward, and you are right: there is only so much progress you can make at an abstract level, and data mining as a concept is vague and broad and so on.
From page 121...
... DR. PAPANICOLAOU: I am George Papanicolaou of Stanford University, and just a comment about what Diane Lambert mentioned, about the sonification of seismograms during the sixties to determine whether they come from natural earthquakes or from explosions; I know a little bit about that problem.
From page 122...
... There is a data mining issue there, but the real issue is, when you go and use data, how do you discriminate; what is the basic science that goes in there that would tell you what to do with the data that you have? Speaking also in the direction of some of the criticisms mentioned earlier, exposing data mining in such very broad terms hides some extremely important long-term issues in basic science, like computational wave propagation and various other issues, for example, that have to do with imaging in complex environments, and these are every bit as important, let us say, for the detection problems related to nuclear proliferation. Very small tests are going to be made and will have to be discriminated, and that is where the data mining problem starts, to extract some general information, and I think that is something that this community, this small group here, should attempt to put a fix on.
From page 123...
... DR. SCHATZ: Yes, you know, one of the problems about that that struck me at the SIAM data mining conference a couple of weeks ago is that our community, of which we have a cross section here (we have academia; we have industry; we have the government), would like to share information among ourselves about how we are doing this data mining, but in truth it is a little bit difficult because companies have proprietary information.
From page 124...
... So, it makes it difficult to exchange the science sometimes I think because of all the proprietary information and that is a difficult situation here, but at the same time if we don't keep our sources and methods quiet when we need to they won't be effective either.
From page 125...
... The project that we had with fraudulent access to a computer system: sometimes you can just ignore the data and look for some movement or command that is not typical. Very much like what Werner mentioned about the robust methods of the sixties and seventies, which tried hard to avoid extreme events, it is sort of reverse thinking: trying to find these extreme events in the bulk of the data, and then from there on you can sort of try to find the individual.
From page 126...
... The other thing is maybe we are focusing too much on trying to accomplish the final goals, whereas it might be useful just to give people a filtered set of information so that they have less to actually put together by hand; you know, it is not that we are trying to replace analysts. We are so far from that; we are not trying to do that at all.
From page 127...
... So, on a general level, how do you get extreme events? It sounds like we are very far away from that. Another one that Jim mentioned was, if you have a lot of data, how do you visualize the data? I know that there are people working on this.
From page 128...
... do we randomize data to try to ensure privacy along with security. However, being further along doesn't mean that we are very far.
From page 129...
... DR. LASKEY: With these issues that have been brought up I would like to add one more, which is combining (by the way, I am Kathy Laskey from George Mason University) human expertise with statistical data, and that does in fact have mathematical issues associated with it, because of methods where you represent the human knowledge as probability distributions and combine them with data, and there are lots of important innovations in that area.
From page 130...
... individually significant enough to set off anybody's warning system. It was the combination that was the issue.
From page 131...
... I get in trouble at our agency when I talk about replacing the analysts because they don't like that, but we do; a lot of our activities and algorithms have to do with, on the one hand, helping them prioritize data we think they are interested in, based on what they have been doing, and trying to predict things they should have looked at that they are not getting time to get to; but modeling analyst behavior is something that we do all the time, and it will be more and more important for us, absolutely. PARTICIPANT: The third time that a rep came to us and said, "Bush is linked to the White House," you know, the system should learn, because the analyst knows well that that is not interesting.
From page 132...
... There is a lot of very interesting work happening, and it is important for somebody on this committee to understand what has happened and to look at it, to understand the computational side. Something I very strongly believe is that we don't have hope of doing some of the massive common-warehouse kinds of things that somebody would pay for. I don't have the experience to look at the kind of data you have; they are critical for commercial testing
From page 133...
... in the field, and they can be done. So, how would you solve all the complications that you have? Those essentially assume that there is one data source, but think how you would do all the computations you wanted to do where you have these data sources which are only ready to share something through a mode of computation, and these are some of the kinds of data points here which would be useful.
From page 134...
... Using data-mining systems to combat counterterrorism is more difficult than applying data mining in the commercial arena. For example, to flag people who may be committing calling card fraud, a long-distance company has extensive records of usage.

