James Schatz
"Introduction by Session Chair"
Transcript of Presentation
Summary of Presentation
Video Presentation
James Schatz is the chief of the Mathematics Research Group at the National Security Agency.
DR. SCHATZ: Thank you, Peter. We had a session
in February down at the Rayburn Building in Washington to
talk about homeland security that the American Mathematical
Society sponsored and I thought I would like to just as an
introduction to our first session here give some of the
remarks we made down there which I think are relevant here.
It is a wonderful privilege to be here today. In
these brief remarks I would like to describe the critical
role that mathematics plays at the National Security Agency
and explain some of the immediate tangible connections
between the technical health of mathematics in the United
States and our national security.
As you may know already the National Security
Agency is the largest employer of mathematicians in the
world. Our internal mathematics community is a dynamic
professional group that encompasses full-time agency
employees, three world-class research centers at the
Institute for Defense Analyses that work exclusively for
NSA and a network of hundreds of fully cleared academic
consultants from our top universities.
As the Chief of the Mathematics Research Group at
NSA, and the executive of our mathematics hiring process I
have been the agency's primary connection to the greater US
mathematics community for the past 7 years.
The support we have received from the
mathematicians in our country has been phenomenal. Our
concern for the technical health of mathematics in our
country is paramount.
Perhaps the most obvious connection between
mathematics and intelligence is the science of cryptology.
This breaks down into two disciplines, cryptography, the
making of codes, and cryptanalysis, the breaking of codes.
All modern methods of encryption are based on
very sophisticated mathematical ideas. Coping with complex
encryption algorithms requires at the outset a working
knowledge of the most advanced mathematics being taught at
our leading universities and at the higher levels a command
of the latest ideas at the frontiers of research.
Beyond cryptology the information age that is
now upon us has opened up a wealth of new areas for pure
and applied mathematics research, areas of research that
are directly related to the mission of the National
Security Agency.
While advances in telecommunications science and
computer science have produced the engines of the
information age, that is the ability to move massive
amounts of digital data around the world in seconds, any
attempt to analyze this extraordinary volume of data to
extract patterns, to predict behavior of the system or
recognize anomalies quickly gives rise to profound new
mathematical problems.
If you could visit the National Security Agency
on a typical work day you would see many, many groups of
mathematicians engaged in lively discussions at
blackboards, teaching and attending classified seminars on
the latest advances in cryptologic mathematics, arguing,
exchanging and analyzing wild new ideas, mentoring young
talent and most importantly pooling their knowledge to
attack the most challenging technical problems ever seen in
the agency's history.
You would hear conversations on number theory,
abstract algebra, probability theory, statistics,
combinatorics, coding theory, graph theory, logic and
Fourier analysis.
It would probably be hard to imagine that out of
this chaotic flurry of activity and professional
camaraderie anything useful could emerge.
However, there is a serious sense of urgency
underlying every project, and you would soon realize that
the mathematicians of NSA are relentless in their pursuit
of tangible, practical solutions that deliver critical
intelligence to our nation's leadership.
The mathematicians of NSA, the Institute for
Defense Analyses and our academic partners are the fighter
pilots in a war that takes place in the information and
knowledge layer of cyberspace.
As Americans you would be very proud of their
achievements in the war on terrorism. The National Security
Agency's need for mathematicians is extreme right now.
Although we hire approximately 50 highly
qualified mathematicians per year, we actually require more
than that, but the talent pool will not support more.
Over 60 percent of our hires have a doctorate in
mathematics, about 20 percent a master's and 20 percent a
bachelor's degree.
We are very proud of the fact that 40 percent of
our mathematics hires are women and that 15 percent are
from underrepresented minority groups.
Of course, the agency depends solely on the
greater US mathematics community to educate each new
generation of students, but we also depend on the
professors at universities across the country to advance
the state of mathematics research.
If the US math community is not healthy the
National Security Agency is not healthy, and I always like
to use an occasion like this to thank everybody here for
all they have done for math in this country because our
agency benefits so greatly.
Okay, this first session here is on data mining,
unsupervised learning and pattern recognition. This is a
very exciting, very active area of research for my office
and for the agency at large.
We attended just recently the SIAM Conference on
Data Mining about, when was that, just about a week ago
here in Washington, and had a great presence there.
It is a wonderful topic. There is absolutely
nothing going on in this conference that isn't immediately
relevant to NSA and homeland security for us, and this
first topic is an area of research that I think we had a
bit of a head start on. We have been out there doing this
for a few years, but there is a whole lot to learn. It is a
young science.
So, let me without further ado bring up our first
speaker for this session, and that is Professor Jerry
Friedman from Stanford University.
Introduction by Session Chair
James Schatz
Perhaps the most obvious connection between mathematics and intelligence is the science
of cryptology. This breaks down into two disciplines: cryptography, the making of
codes, and cryptanalysis, the breaking of codes. All modern methods of encryption are
based on very sophisticated mathematical ideas. Coping with complex encryption
algorithms requires at the outset a working knowledge of the most advanced mathematics
being taught at our leading universities and at the higher levels a command of the latest
ideas at the frontiers of research.
Beyond cryptology the information age that is now upon us has opened up a wealth of
new areas for pure and applied mathematics research. While advances in
telecommunications science and computer science have produced the engines of the
information age (that is, the ability to move massive amounts of digital data around the
world in seconds), any attempt to analyze this extraordinary volume of data to extract
patterns, to predict behavior of the system, or to recognize anomalies quickly gives rise to
profound new mathematical problems.
Although the National Security Agency hires approximately 50 highly qualified
mathematicians per year, it actually requires more than that, but the talent pool will not
support more. Of course, the agency depends on the greater U.S. mathematics community
to educate each new generation of students, and it also depends on the professors at
universities across the country to advance the state of mathematics research. If the U.S.
math community is not healthy, the National Security Agency is not healthy.
Jerry Friedman
"Role of Data Mining in Homeland Defense"
Transcript of Presentation
Summary of Presentation
PDF Slides
Video Presentation
Jerry Friedman is a professor in the Statistics Department at Stanford University and at the
Stanford Linear Accelerator Center.
PROF. FRIEDMAN: Jim asked me to talk about the
role of data mining in homeland defense, and so in a weak
moment I agreed to do so, and in looking it over I
discovered that unlike any other areas of the mathematical
sciences there is a well-perceived need among decision
makers for data mining on homeland defense.
So, I have a few examples. Here is an excerpt
from a recent speech by Vice President Cheney, and he said,
"Another objective of homeland defense is to find
connections with huge volumes of seemingly disparate
information. This can only be done with computers and only
then with the very latest in data linkage analysis."
So, that was in a recent speech by Vice President
Cheney. Here is a slightly higher decision maker. This is
from the President's Office on Homeland Security
Presidential Directive 2, and it is a section on the use of
the best opportunities for data sharing and enforcement
efforts. It says, "Recommend ways in which existing
government databases can best be utilized to maximize the
ability of the government to identify and locate and
apprehend terrorists in the United States. The utility of
advanced data-mining software should be addressed."
Here is the trade journal, the Journal of
Homeland Security. Technologies such as data mining, as
well as regular statistical analysis can be used at the
back end of biodefense communication networks to help
analyze implied signatures for covert terrorist events.
In addition data mining applications might look
for financial transactions occurring as terrorists prepare
for their attacks.
Okay, here is the popular press. Computer
databases are designed to become a prime component of
homeland defense. Once the databases merge the really
interesting software kicks in, data mining programs that
can dig up needles in gargantuan haystacks.
Okay, as many of you know DARPA has set up an
Information Awareness Office and they were charged among
other things to look into biometric speech recognition and
machine translation tools, data sharing among the agencies
for quick decisions and knowledge discovery technology;
knowledge discovery is another code word for data mining,
that uncovers and displays links among people, content and
topics.
Here is my favorite. It is not quite germane but
this is a comment by Peter W. Huber, not Peter Huber the
statistician but the engineer from MIT, and he said that in
this new era of terrorism it will be their sons versus our
silicon, a rather startling point, but I think part of our
silicon will be data mining algorithms running on our
computers, and the data mining bureau, Interpol and the
DARCY(?) coined the phrase MacInt for machine intelligence.
It sounds like something that might come from either Apple
computer or a hamburger chain, but we need a MacInt or
machine intelligence capabilities to provide cuing or early
warning from data pertaining to national security.
So, there doesn't seem to be a need to convince
decision makers that data mining is relevant to national
security issues. Many think it is central for national
security issues.
So, I think the problem here is not in convincing
decision makers of the need for data mining technology but
to live up to the current expectations of its capabilities,
and that I think is going to be a big job.
Now, what is data mining? Well, data mining is
about data and about mining. Okay, let us talk about data.
What are the kinds of data we are going to see in homeland
security applications? Well, there would be real-time high-
volume data streams, massive amounts of archived data,
distributed data that may or may not be centrally
warehoused; hopefully it will be centrally warehoused, but
you can't centrally warehouse it all and of course many
different data types that will have to be merged and
[Pages 22 through 123 are not included in this excerpt.]
trying to put out products and academic folks need to be
tenured and government agencies keep what they are doing
secret, and it is all for very valid reasons.
If people know how AT&T is going to detect phone
fraud, they will get around that. So, it makes it difficult
to exchange the science sometimes I think because of all
the proprietary information and that is a difficult
situation here, but at the same time if we don't keep our
sources and methods quiet when we need to they won't be
effective either.
DR. STUETZLE: Getting around things is only one
problem and once you know how the system works then you can
also flood the system so you have both options that you can
flood it or get around it, you know. So, that is I guess
why the airlines don't want to tell you exactly what they
are looking for when they profile you at the gate.
PARTICIPANT: The reason I asked the question is
because even with many false negatives you feel the
positives are still useful. With many false positives it
is untenable and the terrorists may force us to remove the
system entirely and that is why I asked the question.
DR. SCHATZ: Good point.
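The participant's concern about false positives can be made concrete with a small Bayes' rule calculation. The following is only a sketch: the prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen for illustration and do not come from any speaker or real system.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: probability that a flagged case is a true target."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs (assumptions, not data from the workshop):
# one true target per million people, a screen that catches 99 percent
# of targets, and a screen that wrongly flags 0.1 percent of everyone else.
ppv = positive_predictive_value(
    prevalence=1e-6,
    sensitivity=0.99,
    false_positive_rate=0.001,
)
print(f"Share of flagged cases that are true targets: {ppv:.4%}")
```

Under these assumed numbers, only about one flagged case in a thousand is a true target, which is why a rare-event screen can be swamped by false positives even when it looks accurate.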
PARTICIPANT: I don't think it is a question. It
is more of a comment. It is not so hard to find a needle in
a haystack. The point is that you often find sometimes the
thing that you are looking for is a special feature that
occurs in a population not necessarily for a single
individual. The project that we had with fraudulent access
to a computer system, sometimes you can just ignore the
data and look for some movements or command that is not
typical. Very much to what Werner mentioned your
suspicions of the sixties and seventies looking for a
robust method, sort of trying hard to avoid, how to evolve
more for extreme events, sort of reverse thinking in trying
to find these things in the bulk of the data and then from
there on you can sort of try to find the individual. In
fact, even in a synthetic model if you look at Diane's data
if you take Diane's data and say, "Can I detect fraudulent
usage of phones?" having the phone bills of individuals for
say a year or two, taking just random streams will give you
this phone and my phone and somebody else's phone bill but
just for a week you can observe for a week there is ongoing
data. Could you find fraud in that type of data, and the
answer is probably you could because like Werner said if
you see calls to Nigeria that may be a good start.
DR. LAMBERT: Maybe I should say something. I
wasn't actually trying to say that you can use these for
terrorists. This is far beyond what we ever try to do.
The other thing is maybe we are focusing too much
on trying to accomplish the final goals whereas it might be
useful just to give people a filtered set of information so
that they have less than actually puts that by hand, which
is you know it is not that we are trying to accumulate
analysts. We are so far from that; we are not trying to do
that at all. Another thing is that even you know, actually
in detecting fraud you don't have long histories on people
because if people are going to commit fraud they don't go
into the system that they have long distance service with
for example. They make a call and access somebody else's
system where you have no history whatsoever. So, you are
right, being able to handle people is very important, and I
will have to defer to the comment about earthquakes. That
is just some math that I had which had little symbols on
it. I actually don't know. It could have been that they
developed the signals and used the signal extraction from
20 years after the original application. You do have to
take all the information you have. The trick is to figure
out how do you handle it.
DR. CHAYES: Just as a summary I think I am not
someone who knows about any of this but what I am hoping is
that what we are going to get out of all of these sessions
are some questions that mathematicians can approach, and so
I have just been writing down some more mathematical
questions that have been coming up. That is also one of the
things that we want the final report to do, to come out
with a list of questions that mathematicians can look at,
and I guess the one that has been coming up the most is how
do we focus on extreme events and what I heard from
everybody is that we really have to know how to model
extreme events properly. So, I am not sure how much of that
is the mathematical question.
DR. AGRAWAL: That assumes that you know
something.
DR. CHAYES: That assumes that you know
something. So, on a general level how do you get extreme
events and it sounds like we are very far away from that.
Another one that Jim mentioned was if you have a
lot of data how do you visualize the data and I know that
there are people working on this. I am certainly not one of
them. I am not sure if there is anyone here who can speak
to the question of how do you visualize data.
PARTICIPANT: Not just visualize.
DR. CHAYES: Yes, I mean in a metaphorical sense
how do you visualize data and then there is also the
question that seems to me the one that we are furthest
along on which a number of people talked about which is how
do we randomize data to try to ensure privacy along with
security. However being further along doesn't mean that we
are very far. So, it struck me that those are three areas
that could set a mathematical agenda and if anybody has any
comments on any of those?
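As one illustration of what randomizing data for privacy can mean, here is a minimal sketch of the classical randomized-response idea. It is an assumption that this is representative of the methods the speakers had in mind; the function names, the truth probability of 0.75, and the simulated population are all hypothetical.

```python
import random


def randomize_answer(true_answer, truth_probability, rng):
    """Report the true yes/no answer with the given probability, else flip it."""
    if rng.random() < truth_probability:
        return true_answer
    return not true_answer


def estimate_true_rate(observed_yes_rate, truth_probability):
    """Invert the randomization: observed = p * rate + (1 - p) * (1 - rate)."""
    return (observed_yes_rate - (1.0 - truth_probability)) / (
        2.0 * truth_probability - 1.0
    )


# Simulated population (hypothetical): 30 percent have the sensitive attribute.
rng = random.Random(0)
truth_probability = 0.75
population = [rng.random() < 0.3 for _ in range(100_000)]
reports = [randomize_answer(a, truth_probability, rng) for a in population]

observed = sum(reports) / len(reports)
estimate = estimate_true_rate(observed, truth_probability)
print(f"Observed yes rate: {observed:.3f}, estimated true rate: {estimate:.3f}")
```

No individual report reveals that person's true answer with certainty, yet the aggregate rate can still be recovered approximately. The price is extra variance in the estimate, one of the trade-offs of randomization.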
PARTICIPANT: I have one comment. If we are
talking about addressing terrorism are we talking about
preventing a small number of events and data mining to
prevent these events to make sure that you have got every
single individual and on the other hand there are a number
of organizations like Al Qaeda and maybe we could
concentrate more on the structures of these organizations
and then you are not talking about identifying every
individual, identifying every possible conspiracy but
identifying plans of the organization.
DR. KARR: I am Alan Karr from the National
Institute of Statistical Sciences and I would just like to
point out that there is a wealth of techniques associated
with preserving privacy in data other than randomization.
Randomization has some well-recognized shortcomings in
other cases, but I think this point is a lot broader and
there is a whole area of statistical disclosure that ought
to be brought into this.
129
DR. LASKEY: With these issues that have been
brought up I would like to add one more which is combining,
by the way, I am Kathy Laskey from George Mason University
and combining human expertise with statistical data and
that does in fact have mathematical issues associated with
it because of methods where you represent the human
knowledge as probability distributions to combine them with
data, and there are lots of important innovations in that
area.
I would also like to point out, on the rare
events, that the importance of outliers and rare events has been
mentioned a lot, but the importance of multivariate
outliers, data points that are not particularly unusual on
any one feature. It is in combination that they are unusual
and in fact in the events leading up to September 11, these
people blended in with the society, but if you look at the
configuration of their behaviors if somebody had actually
been able to home in on those individuals and say, "Okay,
you know, they paid cash for things, plus they were taking
flying lessons, plus, they were from the Middle East, plus,
this, plus," and then you discover that an Al Qaeda cell
was planning to use airplanes as bombs there were enough
pieces that could have been put together ahead of time, not
that I am saying that it would be easy, but pieces were not
individually significant enough to set off anybody's
warning system. It was the combination that was the issue.
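The notion of a multivariate outlier described here can be sketched with the Mahalanobis distance, which measures how unusual a point is relative to the joint spread of the data rather than one feature at a time. This is only an illustration under stated assumptions: the two features and all data values are made up, and nothing in the discussion says this particular measure was used.

```python
import math


def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the sample (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample variances and covariance.
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx = point[0] - mean_x
    dy = point[1] - mean_y
    # Quadratic form d^T S^{-1} d, using the closed-form 2x2 inverse.
    squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Made-up data: two features that normally rise together.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 8.9, 10.1]

typical = mahalanobis_2d((5, 5), xs, ys)  # follows the usual pattern
unusual = mahalanobis_2d((8, 3), xs, ys)  # each value in range, combination odd
print(f"typical point: {typical:.2f}, unusual combination: {unusual:.2f}")
```

The point (8, 3) lies within the observed range on each feature separately, so a one-feature-at-a-time check would pass it, but its distance is far larger than that of the typical point because the combination breaks the usual relationship.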
DR. KAFADAR: I am Karen Kafadar from University
of Denver. I think I heard someone from the FAA say that
actually the airlines did identify something unusual about
at least one of those. The response was to recheck the
check level, rescreen the check level. There was another
variable there. They didn't know that.
DR. SCHATZ: I don't want to cut off any
discussions although it looks like we are getting into the
lunch break. We will take a couple more quick ones and
then I am sure there will be lots of time later to talk.
PARTICIPANT: I am trying to put together a
couple of things that seem related. One is that we
classify this and we can see this and this and this, and
that ought to be intuitively meaningful and it is
information that ought to be the model used, but I also
have a sense that we are looking for the kinds of things
you can see looking back but not forward so easily. In
retrospect every newscaster would know what was coming.
So, what is the potential for these folks who are analysts
who are in the business of knowing how the targets are
changing? Are we talking about being experts in real time
participating in the development of a system that might have
to change in time as well and what are the odds that the
system can say, "Is this interesting?" and have them say,
"Yes," or "No," and then the system from what the analysts
thought of it with the perspective of the analyst looking
at it Tuesday of this week instead of a month ago and with all
the complexity that you are not going to be able to deal in
rules no matter how careful you are; so, is that, I know
analysts are probably overworked like everybody else, but
maybe you could participate in something like this.
DR. SCHATZ: We do a fair amount of analyzing the
analysts if that is what you are asking. I get in trouble
at our agency when I talk about rebuilding the analysts
because they don't like that, but we do; a lot of our
activities and algorithms have to do with on the one hand
helping them prioritize data for them that we think they
are interested in based on what they have been doing, try
to predict things they should have looked at that they are
not getting time to get to but modeling analyst behavior is
something that we do all the time and will be more and more
important for us, absolutely.
PARTICIPANT: The third time that a rep came to
us and said, "Bush is linked to the White House," you know
the system should be one because the analyst knows well
that that is not interesting.
DR. SCHATZ: Yes, there certainly is for us again
we enjoy a population of people to study in that regard
that other people don't have access to, but certainly when
we do have access to it, and we do, a lot of what we do is
studying analyst behavior and trying to correlate did they
pull the document; did they look at a document; did they
act on a document and try to maximize our advantage there
because at the end of the day no matter how many
individuals you have it is a minuscule epsilon number
compared to the data size. So, what they actually do and
act on is critically important.
One more, Rakesh?
DR. AGRAWAL: It is not a question. Many times I
like to go and look at things, but sometimes I think I wish
there was more computational aspect to it. So, in decisions
and so on the interesting thing is like the combination of
things. There is a lot of very interesting work happening
and it is interesting for somebody in this Committee to
understand what happened and to look at it, to understand
the computational people and something I very strongly
believe that we don't have hope for doing some of the
massive common warehouse kind of things that somebody would
pay for. I don't have the experience to look at the kind
of data you have they are critical for commercial testing
in the field and they can be done.
So, how would you solve all the complications that you have which essentially
assume that there is one data source but think how would
you do all the computations you wanted to do where you have
these data sources which are kind of ready to share
something through a mode of computation and these are some
of the kinds of data points here which would be useful.
DR. SCHATZ: Very relevant, absolutely.
Okay, Andrew?
DR. ODLYZKO: If you look at the broad technology
what we have to know in the next few years is storage
capacity.
DR. SCHATZ: Good wrap up. Thanks, everybody.
Thanks to the speakers for the morning.
(Applause.)
DR. SCHATZ: Twelve-thirty, here.
(Thereupon, at 11:50 a.m., a recess was taken
until 12:40 p.m., the same day.)
Remarks on Data Mining: Unsupervised Learning, and Pattern Recognition
Werner Stuetzle
There appear to be unrealistically high expectations for the usefulness of data mining for
homeland security. When a Presidential Directive refers to "using existing government
databases to detect, identify, locate, and apprehend potential terrorists," that is certainly
an extremely ambitious goal. For example, pinpointing the financial transactions that
occur as terrorists prepare for their attack is difficult given that it doesn't take a lot of
money to commit terrorist acts.
Using data-mining systems to combat terrorism is more difficult than applying
data mining in the commercial arena. For example, to flag people who may be
committing calling card fraud, a long-distance company has extensive records of usage.
As a result, there are profiles of all users. However, such convenient data are nonexistent
when detecting people who might be terrorists. In addition, errors and oversights in the
commercial arena are, in general, not terribly costly, whereas charging innocent people
with suspected terrorism is unacceptable.
Biometrics will have to be a crucial part of any strategy in order to combat attempted
identity theft.