| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
D
Workshop Presentations
QUANTUM INFORMATION
Charles H. Bennett
IBM Research
This paper discussed how some fundamental ideas from the dawn of the last
century have changed our understanding of the nature of information. The
foundations of information processing were really laid in the middle of the
twentieth century, and only recently have we become aware that they were
constructed a bit wrong: we should have gone back to the quantum theory of the
early twentieth century for a more complete picture.
The word calculation comes from the Latin word meaning a pebble; we no
longer think of pebbles when we think of computations. Information can be sepa-
rated from any particular physical object and treated as a mathematical entity. We
can then reduce all information to bits, and we can deal with processing these bits
to reveal implicit truths that are present in the information. This notion is best
thought of separately from any particular physical embodiment.
In microscopic systems, it is not always possible (and generally it is not
possible) to observe a system without disturbing it. The phenomenon of entangle-
ment also occurs, in which separated bodies could be correlated in a way that
cannot be explained by traditional classical communication. The notion of quan-
tum information can be abstracted, in much the same way as the notions of clas-
sical information. There are actually more things that can be done with informa-
tion, if it is regarded in this quantum fashion.
The analogy between quantum and classical information is actually
straightforward: classical information is like information in a book or on a
stone tablet, but quantum information is like information in a dream. We try to
recall the dream and describe it, and each description resembles the original
dream less than the previous description did.
The first, and so far the most practical, application of quantum information
theory is quantum cryptography. Here one encodes messages and passes them on.
If one thinks of photon polarization, one can distinguish vertical and horizontal
polarization through calcite crystals, but diagonal photons are in principle not
reliably distinguishable. They should be thought of as a superposition of vertical
and horizontal polarization, and they actually propagate containing aspects of
both of these parent states. This is an entangled structure, and the entangled struc-
ture contains more information than either of the pure states of which it is com-
posed.
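The polarization argument above can be made concrete with a few lines of arithmetic. The following is a minimal sketch, assuming ideal photons and detectors and representing each polarization as a real two-component vector in the horizontal/vertical basis; the names `H`, `V`, `D`, `A`, and `detection_probability` are illustrative and do not come from the talk.

```python
import math

# Polarization states as real 2-vectors in the (horizontal, vertical) basis.
H = (1.0, 0.0)
V = (0.0, 1.0)
D = (1 / math.sqrt(2), 1 / math.sqrt(2))    # +45 degree diagonal
A = (1 / math.sqrt(2), -1 / math.sqrt(2))   # -45 degree diagonal


def detection_probability(state, outcome):
    """Born rule: probability that `state` is detected as `outcome`."""
    overlap = state[0] * outcome[0] + state[1] * outcome[1]
    return overlap ** 2


for name, state in [("H", H), ("V", V), ("D", D), ("A", A)]:
    p_h = detection_probability(state, H)
    p_v = detection_probability(state, V)
    print(f"{name}: P(horizontal)={p_h:.2f}  P(vertical)={p_v:.2f}")
```

Horizontal and vertical photons give a certain outcome in this measurement, whereas both diagonal states split evenly between the two outcomes, so a single horizontal/vertical measurement cannot tell the two diagonals apart.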
The next step beyond quantum cryptography, the one that made quantum
information theory a byword, is the discovery of fast quantum algorithms for
solving certain problems. For quantum computing, unlike simple cryptography, it
is necessary to consider not only the preparation and measurement of quantum
states, but also the interaction of quantum data along a stream. This is technically
a more difficult issue, but it gives rise to quite exciting basic science involving
quantum computing.
The entanglement of different state structures leads to so-called Einstein-
Podolsky-Rosen states, which are largely responsible for the unusual properties
of quantum information.
The most remarkable advance in the field, the one that made the field
famous, is the fast factoring algorithm discovered by Shor at Bell Laboratories. It
demonstrates that exponential speedup can be obtained using a quantum com-
puter to factor large numbers into their prime components. Effectively, this quan-
tum factorization algorithm works because it is no more difficult, using a quan-
tum computer, to factor a large number into its prime factors than it is to multiply
the prime factors to produce the large number. It is this condition that renders a
quantum computer exponentially better than a classical computer in problems of
this type.
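The way factoring reduces to a problem a quantum computer handles well can be sketched classically. The sketch below assumes the standard reduction from factoring to finding the multiplicative order of a number modulo N; the order is found here by brute force, which is exactly the step that the quantum algorithm is expected to speed up. The function names are illustrative, not taken from the talk.

```python
from math import gcd


def multiplicative_order(a, n):
    """Smallest r > 0 with a**r % n == 1, found by brute force.

    This stands in for the quantum step; the classical loop below can take
    a number of iterations comparable to n itself.
    """
    if gcd(a, n) != 1:
        raise ValueError("a and n must be coprime")
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r


def factor_from_order(n, a):
    """Try to split n using the order of a modulo n; return None on failure."""
    shared = gcd(a, n)
    if shared != 1:
        return shared, n // shared   # a already shares a factor with n
    r = multiplicative_order(a, n)
    if r % 2 == 1:
        return None                  # odd order: another a must be tried
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None                  # trivial square root: try another a
    p = gcd(half - 1, n)
    return p, n // p


print(factor_from_order(15, 7))      # (3, 5)
```

For N = 15 and a = 7 the order is 4, and the greatest common divisor of 7^2 - 1 = 48 with 15 yields the factor 3. Everything except the order-finding loop is ordinary, fast arithmetic.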
The above considerations deal with information in a disembodied form. If
one actually wants to make a quantum computer, there are all sorts of fabrication,
interaction, decoherence, and interference considerations. This is a very rich area
of experimental science, and many different avenues have been attempted.
Nuclear magnetic resonance, ion traps, molecular vibrational states, and solid-
state implementations have all been used in attempts to produce actual quantum
computers.
Although no effective experimental quantum computer yet exists, many major
theoretical advances have suggested that some of the obvious difficulties can in
fact be overcome. Several discoveries indicate that the effects of decoherence
can be prevented, in principle, by including quantum error-correcting codes,
entanglement distillation, and quantum fault-tolerant circuits. Good quantum
computing hardware does not yet exist. The existence of these methodologies
means that the hardware does not have to be perfectly efficient or perfectly
reliable, because these programming techniques can make arbitrarily good
quantum computers possible, even with physical equipment that suffers from
decoherence.
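Why imperfect hardware can still give arbitrarily reliable results is easiest to see in a classical analogy. The sketch below assumes a three-copy repetition code with majority voting and independent bit flips; real quantum codes must also handle phase errors and cannot simply copy states, so this illustrates only the error-suppression arithmetic, not a quantum code. The function names are hypothetical.

```python
def logical_error_rate(p):
    """Chance that a majority vote over three copies fails when each copy
    flips independently with probability p (two or three flips occur)."""
    return 3 * p**2 * (1 - p) + p**3


def concatenated_error_rate(p, levels):
    """Apply the three-copy code recursively `levels` times."""
    for _ in range(levels):
        p = logical_error_rate(p)
    return p


for levels in range(4):
    print(levels, concatenated_error_rate(0.01, levels))
```

As long as the physical error probability is below one half, each level of encoding lowers the error rate, which is the sense in which encoding can compensate for unreliable components.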
Although most of the focus has been on quantum cryptography, quantum
processing provides an important example of the fact that quantum computers not
only can do certain tasks better than ordinary computers, but also can do different
tasks that would not be imagined in the context of ordinary information process-
ing. For example, entanglement can enhance the communication of classical mes-
sages, by augmenting the capacity of a quantum channel for sending messages.
To summarize, quantum information obeys laws that subtly extend those gov-
erning classical information. The way in which these laws are extended is reminis-
cent of the transition from real to complex numbers. Real numbers can be viewed as
an interesting subset of complex numbers, and some questions that might be asked
about real numbers can be most easily understood by utilization of the complex
plane. Similarly, some computations involving real input or real output (by "real" I
mean classical) are most rapidly developed using quantum intermediate space in
quantum computers. When I'm feeling especially healthy, I say that quantum
computers will probably be practical within my lifetime; strange phenomena
involving quantum information are continually being discovered.
HOW SCIENTIFIC COMPUTING, KNOWLEDGE MANAGEMENT,
AND DATABASES CAN ENABLE ADVANCES AND NEW INSIGHTS
IN CHEMICAL TECHNOLOGY
Anne M. Chaka
National Institute of Standards and Technology
The focus of this paper is on how scientific computing and information tech-
nology (IT) can enable technical decision making in the chemical industry. The
paper contains a current assessment of scientific computing and IT, a vision of
where we need to be, and a roadmap of how to get there. The information and
perspectives presented here come from a wide variety of sources. A general
perspective from the chemical industry is found in the Council for Chemical
Research's Vision 2020 Technology Partnership,1 several workshops sponsored
by NSF, NIST,2 and DOE, and the WTEC report on industrial applications of
molecular and materials modeling (which contains detailed reports on 91
institutions, including over 75 chemical companies, plus additional data from
55 U.S. chemical companies and 256 world-wide institutions in industry,
academia, and government).3 My own industrial perspective comes from my
position as co-leader for the Lubrizol R&D IT Vision team for two years, and ten
years as the head of computational chemistry and physics prior to coming to
NIST. It should be noted that although this paper focuses primarily on the
chemical industry, many of the same issues apply to the biotechnology and
materials industries.
1Chemical Industry of the Future: Technology Roadmap for Computational Chemistry, Thompson,
T. B., Ed., Council for Chemical Research, Washington, DC, 1999; http://www.ccrhq.org/vision/index/
There are many factors driving the need for scientific computing and knowl-
edge management in the chemical industry. Global competition is forcing U.S.
industry to reduce R&D costs and the time to develop new products in the chemi-
cal and materials sectors. Discovery and process optimization are currently lim-
ited by a lack of property data and insight into mechanisms that determine perfor-
mance. Thirty years ago there was a shortage of chemicals, and customers would
pay premium prices for any chemical that worked at all. Trial and error was used
with success to develop new chemistry. Today, however, the trend has shifted
due to increased competition from an abundance of chemicals on the market that
work, customer consolidation, and global competition that is driving commodity
pricing even for high-performance and fine chemicals. Trial and error has
become too costly, and the probability of success is too low. Hence it is becoming
widely recognized that companies need to develop and fine-tune chemicals and
formulations by design in order to remain competitive, and to screen chemicals
prior to a long and costly synthesis and testing process. In addition, the chemicals
produced today must be manufactured in a way that minimizes pollution and
energy costs. Higher throughput is being achieved by shifting from batch pro-
cessing to continuous feed stream, but this practice necessitates a greater under-
standing of the reaction kinetics, and hence the mechanism, to optimize feed
stream rates. Simulation models are needed that are accurate enough to
predict what upstream changes in process variables are required to maintain
the downstream products within specifications, as it may take several hours
for upstream changes to affect downstream quality. Lastly, corporate downsizing
is driving the need to capture technical knowledge in a form that can be queried
and augmented in the future.
2NIST Workshop on Predicting Thermophysical Properties of Industrial Fluids by Molecular
Simulations (June 2001), Gaithersburg, MD; 1st International Conference on Foundations of Molecular
Modeling and Simulation 2000: Applications for Industry (July 2000), Keystone, CO; Workshop on
Polarizability in Force Fields for Biological Simulations (December 13-15, 2001), Snowbird, UT.
3Applying Molecular and Materials Modeling, Westmoreland, P. R.; Kollman, P. A.; Chaka, A.
M.; Cummings, P. T.; Morokuma, K.; Neurock, M.; Stechel, E. B.; Vashishta, P., Eds. Kluwer
Academic Publishers, Dordrecht, 2002; http://www.wtec.org/.
Data and property information are most likely to be available on commodity
materials, but industrial competition requires fast and flexible means to obtain
data on novel materials, mixtures, and formulations under a wide range of condi-
tions. For the vast majority of applications, particularly those involving mixtures
and complex systems (such as drug-protein interactions or polymer
nanocomposites), evaluated property data simply do not exist and are difficult,
time consuming, or expensive to obtain. For example, measuring the density of a
pure liquid to 0.01% accuracy requires a dual sinker apparatus costing $500,000,
and approximately $10,000 per sample. Commercial laboratory rates for
measuring vapor-liquid equilibria for two state points of a binary mixture are
on the order of $30,000 to $40,000. Hence industry is looking for a way to
supply massive amounts of data with reliable uncertainty limits on demand.
Predictive modeling and simulation have the potential to help meet this demand.
Scientific computing and information technology, however, have the poten-
tial to offer so much more than simply calculating properties or storing data. They
are essential to the organization and transformation of data into wisdom that en-
ables better technical decision making. The information management pyramid
can be assigned four levels defined as follows:
1. Data: a disorganized, isolated set of facts
2. Information: organized data that leads to insights regarding
relationships (knowing what works)
3. Knowledge: knowing why something works
4. Wisdom: having sufficient understanding of factors governing
performance to reliably predict what will happen (knowing what will work)
To illustrate how scientific computing and knowledge management convert
data and information into knowledge and wisdom, a real example is taken from
lubricant chemistry. Polysulfides, R-S_n-R, are added to lubricants to prevent wear
of ferrous metal components under high pressure. The length of the polysulfide
chain, n, is typically between 2 and 6. A significant performance problem is that
some polysulfide formulations also cause corrosion of copper-containing compo-
nents such as bronze or brass. To address this problem, a researcher assembles
data from a wide variety of sources such as analytical results regarding composi-
tion, corrosion and antiwear performance tests, and field testing. IT makes it easy
for the right facts to be gathered, visualized, and interpreted. After analysis of the
data, the researcher comes to the realization that long-chain polysulfides (n = 4 or
greater) corrode copper, but shorter chains (n = 2 to 3) do not. This is knowing
what happens. Scientific computing and modeling can then be used to determine
why something happens. In this case, quantum mechanics enabled us to
understand that the difference in reactivity of these sulfur chains could be
explained by significant stabilization of the thiyl radical delocalized over
two adjacent sulfur atoms after homolytic cleavage of the S-S bond:
R-SS-SS-R → 2 R-SS·. The monosulfur thiyl radical R-S· was significantly higher
in energy and therefore is
much less likely to form. Hence copper corrosion is consistent with the formation
of stable thiyl radicals. This insight led to a generalization that almost any sulfur
radical with a low energy of formation will likely corrode copper, and we were
able to reliably predict copper corrosion performance from the chemical structure
of a sulfur-containing species prior to testing. This understanding also led to im-
provements in the manufacturing process and other applications of sulfur chemis-
try, and is an example of what is meant by wisdom (i.e., reliably predicting what
will happen in novel applications due to a fundamental understanding of the un-
derlying chemistry and physics).
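The step from "knowing what happens" to "knowing why" in the polysulfide example can be sketched as a small structural rule. The code below is a deliberately simplified illustration, assuming that corrosion requires an S-S cleavage in which both resulting radicals keep at least two sulfur atoms; that criterion is our reading of the explanation above rather than a stated result, and the function names are hypothetical.

```python
def cleavage_fragments(n):
    """Sulfur counts of the two radicals formed by breaking each S-S bond
    in a polysulfide R-S_n-R."""
    return [(k, n - k) for k in range(1, n)]


def predicted_to_corrode(n):
    """True if some S-S cleavage leaves two radicals with >= 2 sulfurs each,
    i.e. two delocalized R-SS-type radicals rather than a monosulfur R-S."""
    return any(left >= 2 and right >= 2
               for left, right in cleavage_fragments(n))


for n in range(2, 7):
    print(n, predicted_to_corrode(n))
```

Under this assumption the rule reproduces the observed pattern: chains with n = 2 or 3 are not flagged, while chains with n = 4 or more are.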
What is the current status of scientific computing and knowledge manage-
ment with respect to enabling better technical decisions? For the near term, data-
bases, knowledge management, and scientific computing are currently most ef-
fective when they enable human insight. We are a long way from hitting a carriage
return and obtaining answers to tough problems automatically, if ever. Wetware
(i.e., human insight) is currently the best link between the levels of data, informa-
tion, knowledge, and wisdom. There is no substitute for critical, scientific think-
ing. We can, however, currently expect an idea to be tested via experiment or
calculation. First principles calculations, if feasible on the system, improve the
robustness of the predictions and can provide a link between legacy data and
novel chemistry applications. Computational and IT methods can be used to gen-
erate a set of possibilities combinatorially, analyze the results for trends, and
visualize the data in a manner that enables scientific insight. Developing these
systems is resource intensive and very application specific. Companies will in-
vest in their development for only the highest priority applications, and the knowl-
edge gained will be proprietary. Access to data is critical for academics in the
QSPR-QSAR method development community, but is problematic due to intel-
lectual property issues in the commercial sector.4 Hence there is a need to ad-
vance the science and the IT systems in the public arena to develop the funda-
mental foundation and building blocks upon which public and proprietary
institutions can develop their own knowledge management and predictive model-
ing systems.
4Comment by Professor Curt Breneman, Rensselaer Polytechnic Institute.
What is the current status of chemical and physical property data? Published
evaluated chemical and physical property data double every 10 years, yet this is
woefully inadequate to keep up with demand. Obtaining these data requires
meticulous experimental measurements and/or thorough evaluations of related
data from multiple sources. In addition, data acquisition processes are time-
and resource-consuming and therefore must be initiated well in advance of an
anticipated need within an industrial or scientific application. Unfortunately,
a significant part of the existing data infrastructure is not directly used in
any meaningful application because data requirements often shift between the
initiation and completion of a data project. Analysis and fitting, such as for
equation-of-state models, must be reinitiated when significant new data become
available.
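To put the doubling time quoted above in more familiar terms, the short calculation below converts it to an annual growth rate. It assumes steady exponential growth, which the text does not state explicitly; the function names are illustrative.

```python
import math


def annual_growth_rate(doubling_time_years):
    """Fractional growth per year implied by a fixed doubling time."""
    return 2 ** (1 / doubling_time_years) - 1


def years_to_multiply(factor, doubling_time_years):
    """Years needed for the total to grow by `factor` at that doubling time."""
    return doubling_time_years * math.log(factor, 2)


print(f"{annual_growth_rate(10):.1%} per year")
print(f"{years_to_multiply(10, 10):.0f} years for a tenfold increase")
```

A 10-year doubling time corresponds to growth of roughly 7 percent per year, so a tenfold increase in evaluated data would take more than three decades at this pace.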
One vision that has been developed in consultation with the chemical and
materials industries can be described as a "Universal Data and Simulation En-
gine." This engine is a framework of computational tools, evaluated experimental
data, active databases, and knowledge-based software guides for generating
chemical and physical property data on demand with quantitative measures of
uncertainty. This engine provides validated, predictive simulation methods for
complex systems with seamless multiscale and multidisciplinary integration to
predict properties and model physical phenomena and processes. The results are
then visualized in a form useful for scientific interpretation, sometimes by a non-
expert. Examples of high-priority challenges cited by industry in the WTEC re-
port to be ultimately addressed by the Universal Data and Simulation Engine are
discussed below.5
How do we achieve this vision of a Universal Data and Simulation Engine?
Toward this end, NIST has been exploring the concepts of dynamic data evaluation
and virtual measurements of chemical and physical properties and predictive simula-
tions of physical phenomena and processes. In dynamic data evaluation, all available
experimental data within a technical area are collected routinely and continuously,
and evaluations are conducted dynamically using an automated system when in-
formation is required. The value of data is directly related to the uncertainty, so "rec-
ommended" data must include a robust uncertainty estimate. Metadata are also col-
lected (i.e., auxiliary information required to interpret the data such as experimental
method). Achieving this requires interoperability and data exchange standards. Ide-
ally the dynamic data evaluation is supplemented by calculated data based on vali-
dated predictive methods (virtual measurements), and coupled with a carefully con-
sidered experimental program to generate benchmark data.
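One simple way to see how a "recommended" value and its uncertainty might be re-evaluated whenever new data arrive is inverse-variance weighting. The sketch below assumes independent measurements with known standard uncertainties; it is a textbook illustration, not a description of the procedure NIST uses, and the numbers are made up.

```python
import math


def recommended_value(measurements):
    """Combine (value, standard_uncertainty) pairs by inverse-variance
    weighting; returns the weighted mean and its standard uncertainty."""
    weights = [1.0 / (u * u) for _, u in measurements]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    return value, math.sqrt(1.0 / total)


# Hypothetical density measurements (kg/m^3) with standard uncertainties.
data = [(997.10, 0.20), (996.95, 0.10)]
print(recommended_value(data))

# A new measurement arrives, and the evaluation is simply repeated.
data.append((997.02, 0.05))
print(recommended_value(data))
```

More precise measurements dominate the result, and the combined uncertainty shrinks as data accumulate, which is why each data point needs a credible uncertainty and the metadata required to judge it.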
Both virtual measurements and the simulation engine have the potential to
meet a growing fraction of this need by supplementing experiment and providing
data in a timely manner at lower cost. Here we define "virtual measurements"
specifically as predictive modeling tools that yield property data with
quantified uncertainties analogous to observable quantities measured by
experiment (e.g., rate constants, solubility, density, and vapor-liquid
equilibria). By "simulation" we mean validated modeling of processes or
phenomena that provides insight into mechanisms of action and performance with
atomic resolution that is not directly accessible by experiment but is essential
to guide technical decision making in product design and problem solving. This
is particularly crucial for condensed-phase processes where laboratory
measurements are often the average of myriad atomistic processes and local
conditions that cannot be individually resolved and analyzed by experimental
techniques. It is analogous to gas-phase kinetics in the 1920s prior to modern
spectroscopy when total pressure was the only measurement possible. The
foundation for virtual measurements and simulations is experimental data and
mathematical models that capture the underlying physics at the required accuracy
of a given application. Validation of theoretical methods is vitally important.
5These include liquid-liquid interfaces (micelles and emulsions), liquid-solid interfaces (corrosion,
bonding, surface wetting, transfer of electrons and atoms from one phase to another), chemical and
physical vapor deposition (semiconductor industry, coatings), and influence of chemistry on the
thermomechanical properties of materials, particularly defect dislocation in metal alloys; complex
reactions in multiple phases over multiple time scales; solution properties of complex solvents and
mixtures (suspending asphaltenes or soot in oil, polyelectrolytes, free energy of solvation, rheology);
composites (nonlinear mechanics, fracture mechanics); metal alloys; and ceramics.
The Council for Chemical Research's Vision 2020 states that the desired
target characteristics for a virtual measurement system for chemical and physical
properties are as follows: problem setup requires less than two hours, completion
time is less than two days, cost including labor is less than $1,000 per simulation,
and it is usable by a nonspecialist (i.e., someone who cannot make a full-time
career out of molecular simulation). Unfortunately, we are a long way from meet-
ing this target, particularly in the area of molecular simulations. Quantum chem-
istry methods have achieved the greatest level of robustness and coupled with
advances in computational speed have enabled widespread success in areas such
as predicting gas-phase, small-molecule thermochemistry and providing insight
into reaction mechanisms. Current challenges for quantum chemistry are accurate
predictions of rate constants and reaction barriers, condensed-phase thermochem-
istry and kinetics, van der Waals forces, searching a complex reaction space,
transition metal and inorganic systems, and performance of alloys and materials
dependent upon the chemical composition.
A measure of the current value of quantum mechanics to the scientific com-
munity is found in the usage of the NIST Computational Chemistry Comparison
and Benchmark Database (CCCBDB) (http://srdata.nist.gov/cccbdb). The
CCCBDB was established in 1997 as a result of an American Chemical Society
(ACS) workshop to answer the question, "How good is that ab initio calculation?"
The purpose is to expand the applicability of computational thermochemistry by
providing benchmark data for evaluating theoretical methods and assigning un-
certainties to computational predictions. The database contains over 65,000 cal-
culations on 615 chemical species for which there are evaluated thermochemical
data. In addition to thermochemistry, the database also contains results on struc-
ture, dipole moments, polarizability, transition states, barriers to internal rotation,
atomic charges, etc. Tutorials are provided to aid the user in interpreting data and
evaluating methodologies. Since the CCCBDB's inception, usage has doubled
every year up to the current sustained average of 18,000 web pages served per
month, with a peak of over 50,000 pages per month. Last year over 10,000 sepa-
rate sites accessed the CCCBDB. There are over 400 requests per month for new
chemical species not contained in the database.
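The kind of comparison a benchmark database makes possible can be sketched as follows. This is a minimal illustration of assigning an uncertainty to a computational method from its errors against evaluated reference data; the numbers are invented for the example and are not taken from the CCCBDB, and the function name is hypothetical.

```python
import math


def benchmark_statistics(pairs):
    """Mean signed error, mean absolute error, and root-mean-square error
    for (computed, reference) pairs given in the same units."""
    errors = [computed - reference for computed, reference in pairs]
    n = len(errors)
    mean_signed = sum(errors) / n
    mean_absolute = sum(abs(e) for e in errors) / n
    rms = math.sqrt(sum(e * e for e in errors) / n)
    return mean_signed, mean_absolute, rms


# Invented (computed, reference) enthalpies of formation in kJ/mol.
pairs = [(-74.0, -74.6), (-83.1, -84.0), (52.9, 52.4), (-110.9, -110.5)]
bias, mae, rms = benchmark_statistics(pairs)
print(f"bias={bias:.2f}  MAE={mae:.2f}  RMS={rms:.2f} kJ/mol")
```

The mean signed error exposes systematic bias in a method, while the absolute and root-mean-square errors give a first estimate of the uncertainty to attach to a new prediction for a similar molecule.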
The CCCBDB is currently the only computational chemistry or physics data-
base of its kind. This is due to the maturity of quantum mechanics to reliably
predict gas-phase thermochemistry for small (20 nonhydrogen atoms or less),
primarily organic, molecules, plus the availability of standard-reference-quality
experimental data. For gas-phase kinetics, however, only in the past two years
have high-quality (<2% precision) rate-constant data become available for H·
and ·OH transfer reactions to begin quantifying uncertainty for the quantum
mechanical calculation of reaction barriers and tunneling.6 There is a critical need
for comparable quality rate data and theoretical validation for a broader class of
gas-phase reactions, as well as solution phase for chemical processing and life
science, and surface chemistry.
One of the highest priority challenges for scientific computing for the chemi-
cal industry is the reliable prediction of fluid properties such as density, vapor-
liquid equilibria, critical points, viscosity, and solubility for process design.
Empirical models used in industry have been very useful for interpolating experi-
mental data within very narrow ranges of conditions, but they cannot be extended
to new systems or to conditions for which they were not developed. Models based
exclusively on first principles are flexible and extensible, but can be applied only
to very small systems and must be "coarse-grained" (approximated by averaging
over larger regions) for the time and length scales required in industrial applica-
tions. Making the connection between quantum calculations of binary interac-
tions or small clusters and properties of bulk systems (particularly systems that
exhibit high-entropy or long-range correlated behavior) requires significant break-
throughs and expertise from multiple disciplines. The outcome of the First Indus-
trial Fluid Properties Simulation Challenge7 (sponsored by AIChE's Computa-
tional Molecular Science and Engineering Forum and administered by NIST)
underscored these difficulties and illustrated how fragmented current approaches
are. In industry, there have been several successes in applying molecular simula-
tions, particularly in understanding polymer properties, and certain direct phase
equilibrium calculations. Predicting fluid properties via molecular simulation,
however, remains an art form rather than a tool. For example, there are currently
over a dozen popular models for water, but models that are parameterized to give
good structure for the liquid phase give poor results for ice. Others,
parameterized for solvation and biological applications, fail when applied to
vapor-liquid equilibrium properties for process engineering. The lack of
"transferability" of water models indicates that the underlying physics of the
intermolecular interactions is not adequately incorporated. The tools and
systematic protocols to customize and validate potentials for given properties
with specified accuracy and uncertainty do not currently exist and need to be
developed.
6Louis, F.; Gonzalez, C.; Huie, R. E.; Kurylo, M. J. J. Phys. Chem. A 2001, 105, 1599-1604.
7The goals of this challenge were to (a) obtain an in-depth and objective assessment of our current
abilities and inabilities to predict thermophysical properties of industrially challenging fluids using
computer simulations, and (b) drive development of molecular simulation methodology toward a
closer alignment with the needs of the chemical industry. The Challenge was administered by NIST
and sponsored by the Computational Molecular Science and Engineering Forum (AIChE). Industry
awarded cash prizes to the champions of each of the three problems (vapor-liquid equilibrium,
density, and viscosity). Results were announced at the AIChE annual meeting in Indianapolis, IN,
November 3, 2002.
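The "coarse-graining" mentioned in the discussion of first-principles models, approximating a system by averaging over larger regions, can be illustrated with one common mapping. The sketch below assumes that consecutive atoms are grouped into beads placed at each group's center of mass; the grouping scheme and the function name are hypothetical choices for illustration, and practical schemes also need effective interactions between beads, which is where transferability problems arise.

```python
def coarse_grain(positions, masses, group_size):
    """Replace each consecutive group of atoms by a single bead carrying
    the group's total mass at its center of mass."""
    beads = []
    for start in range(0, len(positions), group_size):
        group_pos = positions[start:start + group_size]
        group_mass = masses[start:start + group_size]
        total = sum(group_mass)
        center = tuple(
            sum(m * p[axis] for m, p in zip(group_mass, group_pos)) / total
            for axis in range(3)
        )
        beads.append((total, center))
    return beads


# Four hypothetical atoms mapped onto two beads.
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(coarse_grain(atoms, [1.0, 1.0, 1.0, 3.0], group_size=2))
```

The mapping reduces the number of particles to simulate, at the cost of discarding the atomic detail inside each bead.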
In conclusion, we are still at the early stages of taking advantage of the full
potential offered by scientific computing and information technology to benefit
both academic science and industry. A significant investment is required to ad-
vance the science and associated computational algorithms and technology. The
impact and value of improving chemical-based insight and decision making are
high, however, because chemistry is at the foundation of a broad spectrum of
technology and biological processes such as
· how a drug binds to an enzyme,
· manufacture of semiconductors,
· chemical reactions occurring inside a plastic that makes it burn faster than
others, and
· how defects migrate under stress in a steel I-beam.
A virtual measurement system can serve as a framework for coordinating
and catalyzing academic and government laboratory science in a form useful for
solving technical problems and obtaining properties. There are many barriers to
obtaining the required datasets that must be overcome, however. Corporate data
are largely proprietary, and in academia, generating data is "perceived as dull so
it doesn't get funded."8 According to Dr. Carol Handwerker (chief, Metallurgy
Division, Materials Science and Engineering Laboratory, NIST), even at NIST
just gathering datasets in general is not well supported at the moment, because it
is difficult for NIST to track the impact that the work will have on the people who
are using the datasets. One possible way to overcome this barrier may be to de-
velop a series of standard test problems in important application areas where the
value can be clearly seen. The experimental datasets would be collected and theo-
retical and scientific computing algorithms would be developed, integrated, and
focused in a sustained manner to move all the way through the test problems. The
data collection and scientific theory and algorithm development then clearly be-
come means to an end.
8Comment by Professor John Tully, Yale University.
ON THE STRUCTURE OF MOLECULAR MODELING: MONTE
CARLO METHODS AND MULTISCALE MODELING
Juan J. de Pablo
University of Wisconsin, Madison
The theoretical and computational modeling of fluids and materials can be
broadly classified into three categories, namely atomistic or molecular,
mesoscopic, and continuum or macroscopic.1 At the atomistic or molecular level,
detailed models of the system are employed in molecular simulations to predict
the structural, thermodynamic, and dynamic behavior of a system. The range of
application of these methods is on the order of angstroms to nanometers.
Examples of this type of work are the prediction of reaction pathways using
electronic structure methods, the study of protein structure using molecular
dynamics or Monte Carlo techniques, or the study of phase transitions in liquids
and solids from knowledge of intermolecular forces.2 At the mesoscopic level,
coarse-grained models and mean field treatments are used to predict structure
and properties at length scales ranging from tens of nanometers to microns.
Examples of this type of research are the calculation of morphology in
self-assembling systems (e.g., block copolymers and surfactants) and the study
of macromolecular configuration (e.g., DNA) in microfluidic devices.3,4,5 At the
continuum or macroscopic level, one is interested in predicting the behavior of
fluids and materials on laboratory scales (microns to centimeters), and this is
usually achieved by numerical solution of the relevant conservation equations
(e.g., Navier-Stokes, in the case of fluids).6
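The three levels of description and their length scales can be summarized in a small lookup. This is a rough sketch: the text gives overlapping ranges rather than sharp boundaries, so the cutoffs of 10 nanometers and 1 micron used below are assumptions made for illustration, as is the function name.

```python
def modeling_level(length_m):
    """Suggest a level of description for a characteristic length in meters.

    The cutoffs are illustrative assumptions, not values from the text.
    """
    if length_m <= 0:
        raise ValueError("length must be positive")
    if length_m < 1e-8:      # angstroms to nanometers
        return "atomistic"
    if length_m < 1e-6:      # tens of nanometers to microns
        return "mesoscopic"
    return "continuum"       # microns to centimeters and beyond


for length in (3e-10, 5e-8, 1e-3):
    print(f"{length:.0e} m -> {modeling_level(length)}")
```

In a multiscale calculation the output of one level becomes the input to the next, for example intermolecular forces from the atomistic level feeding a coarse-grained mesoscopic model.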
Over the last decade considerable progress has been achieved in the three
categories described above. It is now possible to think about "multiscale model-
ing" approaches, in which distinct methods appropriate for different length scales
are combined or applied simultaneously to achieve a comprehensive description
of a system. This progress has been partly due to the ever-increasing power of
computers but, to a large extent, it has been the result of important theoretical
and algorithmic developments in the area of computational materials and fluids
modeling.
Much of the interest in multiscale modeling methods is based on the premise
that, one day, the behavior of entirely new materials or complex fluids will be
1De Pablo, J. J.; Escobedo, F. A. AIChE Journal 2002, 48, 2716-2721.
2Greeley, J.; Norskov, J. K.; Mavrikakis, M. Ann. Rev. Phys. Chem. 2002, 53, 319-348.
3Fredrickson, G. H.; Ganesan, V.; Drolet, F. Macromol. 2002, 35, 16-39.
4Hur, J. S.; Shaqfeh, E. S. G.; Larson, R. A. J. Rheol. 2000, 44, 713-742.
5Jendrejack, R. M.; de Pablo, J. J.; Graham, M. D. J. Chem. Phys. 2002, 116, 7752-7759.
6Bird, R. B.; Stewart, W. E.; Lightfoot, E. N. Transport Phenomena, 2nd ed.; John Wiley: New
York, NY, 2002.
[Figure 4 graphic not recoverable from the scan; horizontal axis: Temperature (°C), 0 to 400; panel labels: Models, Model.]
FIGURE 4 First-principles calculations can provide input to models downstream.
The aluminum alloy material has roughly 11 components (as a recycled ma-
terial, the alloy actually has an unknown number of components that depend on
the recycling stream), and some of the concentrations are quite small. Some com-
ponents have more impact than others do. Impurities can have enormous effects
on the microstructure and thereby affect the properties. Precipitates form during
heat treatment, and manufacturers have the opportunity to optimize the aluminum
heat treatment to achieve an optimized structure. First principles calculations and
multiscale calculations are able to elucidate opportunities, including overturning
100 years of metallurgical conventional wisdom (Figure 4).8
Several key challenges must be overcome in this area. In the chemical sci-
ences, methodologies are needed to acquire quantitative kinetic information on
real industrial materials without resorting to experiment. It is also necessary to
determine kinetic pre-factors and barriers that have enough accuracy to be useful.
Another key challenge is the need for seamless multiscale modeling including
uncertainty quantification. Properties such as ductility and tensile strength are
still very difficult to calculate. Finally, working effectively across the disciplines
still remains a considerable barrier.
8Wolverton, C.; Ozolins, V. Phys. Rev. Lett. 2001, 86, 5518; Wolverton, C.; Yan, X.-Y.;
Vijayaraghavan, R.; Ozolins, V. Acta Mater. 2002, 50, 2187.
Catalysis of Exhaust Gas Aftertreatment
Exhaust aftertreatment catalysts provide a second, less-integrated example
of the use of chemical sciences modeling and simulation to approach an applica-
tion-related problem. Despite large investments in catalysis, both academia and
funding agencies seem to have little interest in catalysis for stoichiometric ex-
haust aftertreatment. Perhaps the perception of a majority of the scientific com-
munity is that stoichiometric exhaust aftertreatment is a solved problem. Although
there is a large empirical knowledge base and a cursory understanding of the
principles, from my vantage point the depth of understanding is far from adequate,
especially considering the stringency of Tier 2 tailpipe emissions regulations.
This is another challenging and intellectually stimulating research area with criti-
cal issues that span the range from the most fundamental to the most applied.
After the discovery in the 1960s that tailpipe emissions from vehicles were
adversely affecting air quality, there were some significant technological
breakthroughs, and, simply by moving along a normal evolutionary path, we have
now obtained three-way catalysts that reduce NOx and oxidize hydrocarbons and CO
at high levels of efficiency with a single, integrated, supported-catalyst
system. Unfortunately, the three-way catalyst works in a rather narrow range of
air-fuel ratio, approximately the stoichiometric ratio. This precludes some of
the opportunities to improve fuel economy; for example, a leaner air-fuel ratio
can yield better fuel economy but, with current technology, only at the expense
of greater NOx emissions. Also, most exhaust pollution comes out of the tailpipe
in the first 50 seconds of vehicle operation, because the catalyst has not yet
reached a temperature range in which it is fully active.
The three-way catalyst is composed of precious metals on supported alumina
with ceria-based oxygen storage, coated on a high-surface-area ceramic or a me-
tallic substrate with open channels for the exhaust to pass through and over the
catalyst. As regulations become increasingly stringent for tailpipe emissions, the
industry is transitioning to higher cell densities (smaller channels) and thinner
walls between channels. This simultaneously increases the surface area, decreases
the thermal mass, and reduces the hydraulic diameter of the channels; all three
effects are key enablers for higher-efficiency catalysts. The high-surface-area
nano-structured washcoat is another key enabler, but it is necessary to maintain
the high surface area of the coating and highly dispersed catalytically active pre-
cious metals for the life of the vehicle despite high-temperature operation and
exposure to exhaust gas and other chemical contaminants. In other words, the
material must have adequate thermal durability and resist sintering and chemical
poisoning from exhaust gas components. Neither thermal degradation nor chemi-
cal degradation is particularly well understood beyond some general principles;
i.e., modeling is difficult without a lot of empirical input.
The industry also must design for end of life, which is a very large design
space. Simulation currently uses empirically derived, aged-catalyst-state input.
What the industry really needs is a predictive capability. The industry also has to
minimize the cost and the amount of the active catalytic platinum, palladium, and
rhodium used, since precious metals are a rare and limited commodity.
Again a key chemical science challenge is kinetics. How does one obtain
quantitative kinetic information for real industrial, heterogeneous catalysts with-
out resorting to time-consuming experiments? Simulations currently use empiri-
cally derived simplified kinetics.
The science of accelerated aging has generally been ignored. The automotive
industry must be able to predict aging of catalysts and sensors without running
many vehicles to 150,000 miles. The industry does currently take several vehicles
and accumulate miles, but driving vehicles to that many miles is a particularly
slow way to get answers and does not provide good statistics. The industry does
utilize accelerated aging, but it is done somewhat empirically without the benefit
of a solid foundation of basic science.
Homogeneous Charge Compression Ignition
The third and final example arises from the Ford Motor Company-MIT Alli-
ance, a partnership that funds mutually beneficial collaborative research. The princi-
pal investigators, Bill Green and Bill Kaiser, are from MIT and Ford, respectively.
The project spans a range from the most fundamental chemistry and chemical engi-
neering to the most applied. In contrast to the first two examples, this project focuses
on a technology in development, as opposed to a technology already in practice.
Homogeneous charge compression ignition (HCCI) is similar in concept to a diesel
engine. It is highly efficient and runs lean. It is compression ignited, as
opposed to spark ignited, but it has no soot or NOx because it runs much cooler
than a diesel. It is similar to a gasoline engine in that it uses pre-mixed,
volatile fuels like gasoline, and it has similar hydrocarbon emissions. But an
HCCI engine has a much higher efficiency and much lower NOx emissions than a
gasoline engine, which could eliminate the need for the costly three-way
precious metal catalyst.
However, HCCI is difficult to control. There is no simple timing mechanism
that can control ignition, as exists for a spark or diesel fuel injection engine.
HCCI operates by chemistry and consequently is much more sensitive to the fuel
chemistry than either spark-ignition or diesel engines. The fuel chemistry in an
internal combustion engine largely ignores the details of the fuel. HCCI looks
very promising, but researchers do not yet know what the full operating range is
or how to reliably control the combustion with computation. Yelvington and
Green have already demonstrated that HCCI can work well over a broad range of
conditions, demonstrating the promise that computation and simulation can and
must play a continuing and large role in resolving the HCCI challenges.9
9Yelvington, P. E.; Green, W. H.; SAE Technical Paper 2003, 2003-01-1092.
Final Words
The complexity of industrial design problems requires that one must be able
to deal with the realistic, not overly idealized, system. The complexity demands a
multidisciplinary approach and working in teams. Access to and understanding of
the full range of methods are generally necessary if the researcher is going to
have impact by solving relevant problems, and individual industrial researchers
must be able to interact and communicate effectively beyond their disciplines. To
achieve integration from fundamental science to real-world applications is truly a
challenge of coordination, integration, and communication across disciplines,
approaches, organizations, and research and development sectors.
In general, approximate answers or solutions with a realistic quantification
of uncertainty, if arrived at quickly, have greater impact than highly accurate
answers or solutions arrived at too late to impact critical decisions. Often, there is
no need for the increased level of accuracy. To quote Einstein, "Things should be
made as simple as possible, but not any simpler." Sometimes researchers in in-
dustry, just out of necessity, oversimplify. That is when the automotive develop-
ment engineer might lose sleep, because it could mean that vehicles might have
unexpected reliability issues in the field, if the oversimplification resulted in a
wrong decision.
Simulations with varying degrees of empiricism and predictive capability
should be aligned closely with extensive experimental capabilities. It is also im-
portant to bring simulation in at the beginning of a new project. Too often, experi-
mentalists turn to a computational scientist only after repeated experimental fail-
ures. Many of these failures would have been avoided had the consultation
occurred at an early stage. The computational expert could help generate hypoth-
eses even without doing calculations or by doing some simple calculations, and
the perspective of the computational researcher can frequently eliminate dead
ends. In addition, framing questions correctly so that the experiments yield unam-
biguous results, or reasonably unambiguous results, is crucial. Obtaining reason-
ably definitive answers in a timely manner is equally crucial, but too often, there
does not seem to be a time scale driving needed urgency.
Data and knowledge management offer additional overlooked opportunities.
Hypotheses that have been proven wrong often continue to influence decisions.
Decision makers too often operate on the basis of disproved or speculative con-
jectures rather than on definitive data. The science and technology enterprise
needs a way to manage data such that it is relatively easy to know what the
community does and does not know as well as what the assumptions are that
underlie current knowledge.
Finally, it is important to have realistic expectations for success. This is a
difficult challenge because what one measures and rewards is what one gets.
There are many ways to impact applications and technology with any level of
sophistication in a simulation. Some of the important ways lend themselves only
to intangible measures, but oftentimes these may be the most important. Again
quoting Einstein, "Not everything that counts can be counted, and not everything
that can be counted counts."
DRUG DISCOVERY: A GAME OF TWENTY QUESTIONS
Dennis J. Underwood
Infinity Pharmaceuticals, Inc.
Introduction
There is a revolution taking place in the pharmaceutical industry. An era in
which nearly continuous growth and profitability is taken for granted is coming
to an end. Many of the major pharmaceutical companies have been in the news
lately with a plethora of concerns ranging from the future of the biotechnology
sector as a whole to concerns over the availability of drugs to economically dis-
enfranchised groups in the developed and the developing world. The issues for
the pharmaceutical sector are enormous and will likely result in changes in the
health-care system, including the drug discovery and development enterprises.
Central challenges include the impact that drugs coming off patents have had on
the recent financial security of the pharmaceutical sector and the need for im-
proved protection of intellectual property rights. Historically, generic competi-
tion has slowly eroded a company's market share. There is a constant battle be-
tween the pace of discovering new drugs and having old drugs going off patent,
providing generic competition opportunities to invade their market share. Recent
patent expirations have displayed a much sharper decline in market share, mak-
ing new drugs even more critical.
The 1990s were a decade in which the pharmaceutical giants believed they
could sustain growth indefinitely by dramatically increasing the rate of bringing
new medicines to market simply by increasing R&D spending and continuing to
utilize the same research philosophies that worked in the past. It is clear from the
rapid rise in R&D expenditure and the resultant cost of discovering new drugs
that the "old equation" is becoming less favorable. There is a clear need to be-
come more efficient in the face of withering pipelines and larger and more com-
plex clinical trials. For large pharmaceutical companies to survive, they must
maintain an income stream capable of supporting their current infrastructure as
well as funding R&D for the future. The cost of drug development and the low
probability of technical success call for improved efficiency of drug discovery
and development and further investment in innovative technologies and processes
that improve the chances of bringing a compound to market as a drug. Already
there has been quite a change in the way in which drugs are discovered. Large
pharmaceutical companies are diversifying their drug discovery and development
processes: They are relying more on the inventiveness of smaller biotechnology
companies and licensing technology, compounds, and biologicals at a faster, more
competitive rate. To meet critical time lines they are outsourcing components of
research and development to contract research organizations, enabling greater
efficiencies by providing added expertise or resources and decreasing develop-
ment time lines. The trend toward mergers and acquisitions, consolidating pipe-
lines, and attempting to achieve economies of scale is an attempt by large phar-
maceutical companies to build competitive organizations. Although this may help
short-term security, future ongoing success may not be ensured solely with this
strategy.
One of the most valuable assets of a pharmaceutical company is its experi-
ence in drug discovery and development. Of particular importance are the data,
information and knowledge generated in medicinal chemistry, pharmacology, and
in vivo studies accumulated over years of research in many therapeutic areas.
This knowledge is based on hundreds of person-years of research and develop-
ment; yet most companies are unable to effectively capture, store, and search this
experience. This intellectual property is enormously valuable. As with the other
aspects of drug discovery and development, the methods and approaches used in
data-, information- and knowledge-base generation and searching are undergoing
evolutionary improvements and, at times, revolutionary changes. It is imperative
for all data- and information-driven organizations to take full advantage of the
information they are generating. We assert that those companies that are able to
do this effectively will be able to gain and sustain an advantage in a highly com-
plex, highly technical, and highly competitive domain. The aim of this overview
is to highlight the important role informatics plays in pharmaceutical research,
the approaches that are currently being pursued and their limitations, and the
challenges that remain in reaping the benefit of advances. We are using the term
"informatics" in a general way to describe the processes whereby information is
generated from data and knowledge is derived as our understanding builds.
Informatics also refers to the management and transformation of data, informa-
tion, and assimilation of knowledge into the processes of discovery and develop-
ment.
There has been much time, money, and effort spent in attempting to reduce
the time it takes to find and optimize new chemical entities. It has proven difficult
to reduce the time it takes to develop a drug, but the application of new technolo-
gies holds hope for dramatic improvements. The promise of informatics is to
reduce development times by becoming more efficient in managing the large
amounts of data generated during a long drug discovery program. Further, with
managed access to all of the data, information, and experience, discoveries are
more likely and the expectation is that the probability of technical success will
increase.
Why is the probability of technical success of drug discovery and development
so low? What are the issues in moving compounds through the drug discovery
and development pipeline? It is clear from existing data that the primary
bottlenecks are pharmacokinetic problems and lack of efficacy in man. In addi-
tion there are problems of toxicity in animal models and the discovery of adverse
effects in man. Clearly there are significant opportunities to improve the
probability of technical success and, perhaps, to shorten the time line for
development. The current strategies bring in absorption, distribution,
metabolism, excretion, and toxicity (ADME-Tox) studies earlier in the process
(late discovery) and use programs based on disease clusters rather than a single
target.1
Drug discovery and development is a difficult business because biology and
the interaction with biology is complicated and, indeed, may be classifiable as a
complex system. Complexity is due partly to an incomplete knowledge of the
biological components of pathways; the manner in which the components inter-
act and are compartmentalized; and the way they are modulated, controlled, and
regulated in response to intrinsic and environmental factors over time. For the
most part, these biological interactions are important yet are not accounted for
in screening and assaying strategies. Often model systems are lacking in some
way and do not properly represent the target organism. The response of the cell
to drugs is very dependent on initial conditions, which is to say that the
response of the cell is very dependent on its state and the condition of many
subcellular components. Typically, the behavior of a complex system differs from
that of its components, which is to say that the processes that occur
simultaneously at different scales (protein, nucleic acid, macromolecular
assembly, membrane, organelle, tissue, organism) are important and the intricate
behavior of the entire system depends on these processes, but in a nontrivial
way.2,3 If this is true, application of the
tools of reductionism may not provide us with an understanding of the responses
of an organism to a drug. Are we approaching a "causality catastrophe" whenever
we expect the results of in vitro assays to translate to clinical data?
What is an appropriate way to deal with such complexity? The traditional
approach has been to generate large amounts of replicate data, to use statistics to
provide confidence, and to move cautiously, stepwise, toward higher complexity:
from in vitro to cell-based to tissue-based to in vivo in model animals and then to
man. In a practical sense, the drug discovery and development odds have been
improved by a number of simple strategies: Start with chemically diverse leads,
optimize them in parallel in both discovery and later development, back them up
with other compounds when they reach the clinic and follow on with new, struc-
turally different compounds after release. Approach diseases by focusing, in par-
allel, on clusters rather than on a single therapeutic target, unless the target has
proven to be clinically effective. Generate more and better-quality data focusing
on replicates, different conditions, different cell types in different states, and dif-
1Kennedy, T. Drug Discovery Today 1997, 2 (10), 436-444.
2Vicsek, T. Nature 2001, 411, 421.
3Glass, L. Nature 2001, 410, 277-84.
ferent model organisms. Develop good model systems, such as for ADME-Tox,
and use them earlier in the discovery process to help guide the optimization pro-
cess away from difficulties earlier (fail early). The increase in the amount and
diversity of the data leads to a "data catastrophe" in which our ability to fully
extract relationships and information from the data is diminished. The challenge
is to provide informatics methods to manage and transform data and information
and to assimilate knowledge into the processes of discovery and development;
the issue is to be able to integrate the data and information from a variety of
sources into consistent hypotheses rich in information.
The more information, the better. For example, structure-based design has
had many successes and has guided the optimization of leads through a detailed
understanding of the target and the way in which compounds interact. This has
become a commonly accepted approach, and many companies have large, active
structural biology groups participating in drug discovery teams. One of the rea-
sons that this is an accepted approach is that there is richness in data and informa-
tion and there is a wealth of methodology available, both experimental and com-
putational, that enables these approaches. The limitations in this area are
concerned primarily with the challenges facing structural biology such as appro-
priate expression systems, obtaining sufficient amounts of pure protein, and the
ability to crystallize the protein. These limitations can be severe and can prevent
a structural approach for many targets, especially membrane-bound protein com-
plexes. There is also a question of relevancy: Does a static, highly ordered crys-
talline form sufficiently represent dynamic biological events?
There are approaches that can be used in the absence of detailed structural
information. Further, it is often instructive to use these approaches in conjunction
with structural approaches, with the aim of providing a coherent picture of the
biological events using very different approaches. However, application of these
methods is extremely challenging primarily due to the lack of detailed data and
information on the system under study. Pharmacophore mapping is one such ap-
proach. A complexity that is always present is that there are multiple ways in
which compounds can interact with a target; there are always slight changes in
orientation due to differences in chemical functionality and there are always dis-
tinctly different binding modes that are possible. Current methods of pharmaco-
phore mapping find it difficult to detect these alternates. Further, the approach
often relies on high-throughput screening approaches that are inherently "noisy,"
making the establishment of consistent structure-activity relationships difficult.
In addition, the methodology developed in this area is often limited to only parts
of the entire dataset. There has been much effort directed to deriving two-,
three-, and higher-dimensional pharmacophore models, and the use of these models
in lead discovery and development is well known.4
4Agrafiotis, D. K.; Lobanov, V. S.; Salemme, F. R. Nat. Rev. Drug Discov. 2002, 1, 337-46.
The key issue in these methods is the manner in which compounds, their
structure, features, character, and conformational flexibility are represented.
There are many ways in which this can be achieved, but in all cases the
abstraction of the chemistry for computational convenience is an approximation.
Each compound, in a practical sense, is a question that is being asked of a
complex biological system. The answer to a single question provides very little
information; however, in a manner analogous to the game of "twenty questions,"
the combined result from a well-chosen collection of compounds (questions) can
provide an understanding of the biological system under study.5 The manner in
which chemistry is represented is essential to the success of such a process. It
is akin to posing a well-informed question that, together with other well-formed
questions, is powerful in answering or giving guidance to the issues arising
from drug discovery efforts. Our approach is to use the principles of molecular
recognition in simplifying representation of the chemistry: atoms are binned
into simple types such as cation, anion, aromatic, hydrogen-bond acceptor and
donor, and so forth. In addition, the conformational flexibility of each
molecule is represented. The result is a matrix in which the rows are compounds
and the columns are bits of a very large string (tens of millions of bits long)
that mark the presence of features. Each block of features can be the simple
presence of a functional group such as phenyl, chlorophenyl, piperidyl, or
aminoethyl, or it can be as complex as three- or four-point pharmacophoric
features that combine atom types and the distances between them. The richness of
this representation along with a measure of the biological response of these
compounds enables methods that use Shannon's maximum-entropy, information-based
approach to discover ensembles of patterns of features that describe activity
and inactivity. These methods are novel and have been shown to capture the
essence of the effects of the SAR in a manner that can be used in the design of
information-based libraries and the virtual screening of databases of known or
realizable compounds.6
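The compound-by-feature matrix described in this paragraph can be illustrated with a minimal sketch. The feature names, compound names, and the feature_matrix helper are hypothetical and chosen only for illustration; the representation described in the text runs to tens of millions of bits and also encodes conformational flexibility and pharmacophoric distances, which this toy example omits.

```python
# Hypothetical feature vocabulary; each feature is one column (bit) of the matrix.
FEATURES = ["cation", "anion", "aromatic", "hbond_acceptor", "hbond_donor", "phenyl"]

# Hypothetical compounds, each described by the set of features it presents.
COMPOUNDS = {
    "cmpd_A": {"aromatic", "phenyl", "hbond_donor"},
    "cmpd_B": {"cation", "hbond_acceptor"},
    "cmpd_C": {"aromatic", "phenyl", "anion", "hbond_acceptor"},
}


def feature_matrix(compounds, features):
    """Return (names, rows); rows[i][j] is 1 if compound i presents feature j."""
    names = sorted(compounds)
    rows = []
    for name in names:
        present = compounds[name]
        rows.append([1 if feat in present else 0 for feat in features])
    return names, rows


names, rows = feature_matrix(COMPOUNDS, FEATURES)
for name, row in zip(names, rows):
    # Print each compound with its bit string, one bit per feature.
    print(name, "".join(str(bit) for bit in row))
```

Each row is one compound (one "question"), and each column marks the presence or absence of one feature, so the same columns can later be compared across compounds.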
The power of these methods lies in their ability to discern relationships in
data that are inherently "noisy." The data are treated as categorical rather than
continuous: active (yes), inactive (no) and a variable category of unassigned data
(maybe). These methods are deterministic and, as such, derive all relationships
between all compounds at all levels of support. The relationships or patterns are
scored based on their information content. Unlike methods such as recursive par-
titioning, pattern discovery is not "greedy" and is complete. The discrimination
ability of pattern discovery depends very much on the quality of the data gener-
ated and on the type and condition of the compounds; if either is questionable, the
5Underwood, D. J. Biophys. J. 1995, 69, 2183-4.
6Beroza, P.; Bradley, E. K.; Eksterowicz, J. E.; Feinstein, R.; Greene, J.; Grootenhuis, P. D.; Henne,
R. M.; Mount, J.; Shirley, W. A.; Smellie, A.; Stanton, R. V.; Spellmeyer, D. C. J. Mol. Graph.
Model. 2002, 18, 335-42.
signal-to-noise ratio will be reduced and the quality of the results will be
jeopardized. These approaches have been used in the study of G-protein-coupled
receptors7,8 and in ADME-Tox studies.9,10
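The text says that patterns are scored by their information content but does not give the scoring formula. As an illustrative assumption only, the sketch below scores a single pattern by the Shannon information gain it provides about the active/inactive category, setting aside compounds in the unassigned ("maybe") category. It is not the pattern discovery algorithm itself, and the function names are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of categorical labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        h -= p * log2(p)
    return h


def information_gain(has_pattern, activity):
    """Information (bits) that a pattern's presence gives about activity.

    has_pattern: list of bool, one per compound.
    activity: list of "yes", "no", or "maybe", one per compound.
    Compounds labeled "maybe" are ignored (an assumption of this sketch).
    """
    pairs = [(h, a) for h, a in zip(has_pattern, activity) if a != "maybe"]
    n = len(pairs)
    if n == 0:
        return 0.0
    labels = [a for _, a in pairs]
    with_p = [a for h, a in pairs if h]
    without_p = [a for h, a in pairs if not h]
    conditional = (len(with_p) / n) * entropy(with_p) \
        + (len(without_p) / n) * entropy(without_p)
    return entropy(labels) - conditional


# A pattern that separates actives from inactives perfectly scores 1 bit;
# a pattern unrelated to activity scores 0 bits.
perfect = information_gain([True, True, False, False], ["yes", "yes", "no", "no"])
useless = information_gain([True, False, True, False], ["yes", "yes", "no", "no"])
```

Treating the data as categories rather than continuous values is what makes a score of this kind tolerant of noisy measurements: small numerical errors do not change the category unless they cross the activity threshold.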
These methods have also been used in the identification of compounds that
are able to present the right shape and character to a defined active site of a
protein.11,12 In cases where the protein structure is known and the potential
binding sites are recognized, the binding site can be translated into a bit
string that is in the same representational space as described above for the
compounds. This translation is done using methods that predict the character of
the space available for binding compounds. The ability to represent both the
binding site(s) and compounds in the same way provides the mechanism to
discriminate between compounds that are likely to bind to the protein. These
approaches have been used for serine proteases, kinases, and phosphatases.13
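One simple way to compare a binding-site bit string with compound bit strings in the same representational space is a bit-overlap score. The sketch below uses the Tanimoto coefficient, a common fingerprint similarity measure that is assumed here purely for illustration; the text does not say which comparison the authors used, and the bit patterns and names are invented.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as Python integers."""
    shared = bin(a & b).count("1")   # features present in both
    total = bin(a | b).count("1")    # features present in either
    return shared / total if total else 0.0


# Hypothetical 6-bit binding-site description and compound fingerprints.
site = 0b101101
compounds = {
    "cmpd_A": 0b101100,
    "cmpd_B": 0b010010,
    "cmpd_C": 0b101101,
}

# Rank compounds by how closely their features match the site description.
ranked = sorted(compounds, key=lambda name: tanimoto(site, compounds[name]), reverse=True)
```

Because the site and the compounds share one set of columns, the same comparison works regardless of whether a bit string came from a protein structure or from a molecule.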
The game of 20 questions is simple but, almost paradoxically, can give the
inquisitor the ability to read the subject's mind. The way in which this
occurs is well understood; in the absence of any information there are many pos-
sibilities, a very large and complex but finite space. The solution relies on a tenet
of a dialectic philosophy in which each question provides a thesis and an anti-
thesis that is resolved by a simple yes or no answer. In so doing, the number of
possibilities is dramatically reduced and, after 20 questions, the inquisitor is
usually able to guess the solution. The solution to a problem in drug discovery
and development is far more complex than a game of 20 questions and should not
be trivialized. Even so, the power of discrimination through categorization of
answers and integration of answers from diverse experiments provides an ex-
tremely powerful mechanism for optimizing to a satisfying outcome a potential
drug.
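The arithmetic behind the game can be made concrete under an idealized assumption: if every yes/no answer halves the remaining possibilities, 20 questions are enough to single out one item among 2^20 = 1,048,576. The sketch below plays that idealized game over a numbered list of candidates; the guess function and the numbering are hypothetical, and real questions (or real compounds) rarely split the space so evenly.

```python
def guess(secret: int, n_items: int):
    """Find secret in range(n_items) by asking "is it below mid?" each round.

    Returns the identified item and the number of yes/no questions asked.
    """
    low, high = 0, n_items          # remaining candidates are low .. high - 1
    questions = 0
    while high - low > 1:
        mid = (low + high) // 2
        questions += 1
        if secret < mid:            # answer "yes": keep the lower half
            high = mid
        else:                       # answer "no": keep the upper half
            low = mid
    return low, questions


found, asked = guess(123456, 2 ** 20)
print(found, asked)
```

With 2^20 candidates the search always finishes in exactly 20 questions, which is why well-chosen questions that each carry close to one bit of information are so effective.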
These approaches have been encoded into a family of algorithms known as
pattern discovery (PD).14 PD describes a family of novel methods in the category
of data mining. One of the distinguishing features of PD is that it discovers rela-
7Wilson, D. M.; Termin, A. P.; Mao, L.; Ramirez-Weinhouse, M. M.; Molteni, V.; Grootenhuis, P.
D.; Miller, K.; Keim, S.; Wise, G. J. Med. Chem. 2002, 45, 2123-6.
8Bradley, E. K.; Beroza, P.; Penzotti, J. E.; Grootenhuis, P. D.; Spellmeyer, D. C.; Miller, J. L. J.
Med. Chem. 2000, 43, 2770-4.
9Penzotti, J. E.; Lamb, M. L.; Evensen, E.; Grootenhuis, P. D. J. Med. Chem. 2002, 45, 1737-40.
10Clark, D. E.; Grootenhuis, P. D. Curr. Opin. Drug Discov. Devel. 2002, 5, 382-90.
11Srinivasan, J.; Castellino, A.; Bradley, E. K.; Eksterowicz, J. E.; Grootenhuis, P. D.; Putta, S.;
Stanton, R. V. J. Med. Chem. 2002, 45, 2494-500.
12Eksterowicz, J. E.; Evensen, E.; Lemmen, C.; Brady, G. P.; Lanctot, J. K.; Bradley, E. K.; Saiah,
E.; Robinson, L. A.; Grootenhuis, P. D.; Blaney, J. M. J. Mol. Graph. Model. 2002, 20, 469-77.
13Rogers, W. T.; Underwood, D. J.; Argentar, D. R.; Bloch, K. M.; Vaidyanathan, A. G. Proc. Natl.
Acad. Sci. U.S.A., submitted.
14Argentar, D. R.; Bloch, K. M.; Holyst, H. A.; Moser, A. R.; Rogers, W. T.; Underwood, D. J.;
Vaidyanathan, A. G.; van Stekelenborg, J. Proc. Natl. Acad. Sci. U.S.A., submitted.
tionships between data rather than relying on human interpretation to generate a
model as a starting point; this is a significant advantage. Another important ad-
vantage of PD is that it builds models based on ensembles of inputs to explain the
data and therefore has an advantage in the analysis of complex systems (such as
biology2,3). We have developed a novel approach to PD that has been applied to
bio-sequence, chemistry, and genomic data. Importantly, these methods can be
used to integrate different data types such as those found in chemistry and biol-
ogy. PD methods are quite general and can be applied to many different areas
such as proteomics, text, etc.
Validation of these methods in bio-sequence space has been completed using
well-defined and well-understood systems such as serine proteases13 and kinases.
PD in bio-sequence space provides a method for finding ensembles of patterns of
residues that form a powerful description of the sequences studied. The similarity
between patterns expresses the evolutionary family relationships. The differences
between patterns define their divergence. The patterns express key functional and
structural motifs that very much define the familial and biochemical character of
the proteins. Embedded in the patterns are also key residues that have particular
importance with respect to the function or the structure of the protein. Mapping
these patterns onto the x-ray structures of serine proteases and kineses indicates
that the patterns indeed are structurally and functionally important, and further,
that they define the substrate-binding domain of the proteins. This leads to the
compelling conclusion that since the patterns describe evolutionary changes (di-
vergence and convergence) and also describe the critical features of substrate
binding, the substrate is the driver of evolutionary change.~3
A particular application of PD is in the analysis of variations of genetic
information (single nucleotide polymorphisms, or SNPs). Analysis of SNPs can
lead to the identification of genetic causes of diseases, or of inherited traits
that determine differences in the way humans respond to drugs (either adversely
or positively). Until now, the main method of SNP analysis has been linkage
disequilibrium (LD), which seeks to determine correlations among specific SNPs.
A key limitation of LD, however, is that only a restricted set of SNPs can be
compared. Typically, SNPs within a local region of a chromosome or SNPs within
genes that are thought to act together are compared. PD, on the other hand,
through its unique computational approach, is capable of detecting all patterns
of SNPs, regardless of the genomic distances between them. Among these will be
patterns of SNPs that are responsible for the disease (or trait) of interest,
even though the individual SNPs comprising the pattern may have no detectable
disease (or trait) correlation. This capability will greatly accelerate the
exploitation of the genome for commercial purposes.
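The last point, that a pattern of SNPs can track a trait even when no single SNP does, can be illustrated with a small toy calculation. The sketch below is not the PD method described in this paper, whose algorithm is not given here. It is a minimal Python illustration on made-up data, assuming two hypothetical biallelic SNPs coded as 0/1 alleles and a trait that appears only when the two alleles differ. The function names pearson_r2 and trait_rate_by_pattern are invented for this example. For 0/1 allele indicators, the squared correlation between two SNPs is the usual LD measure r^2 = D^2 / (pA(1-pA) pB(1-pB)), where D = pAB - pA*pB.

```python
from itertools import product


def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length 0/1 sequences.

    For two SNPs coded as 0/1 alleles this equals the LD statistic r^2.
    Returns 0.0 if either sequence has no variation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def trait_rate_by_pattern(snps, trait):
    """Fraction of individuals showing the trait, per joint SNP pattern."""
    counts = {}
    for *alleles, t in zip(*snps, trait):
        key = tuple(alleles)
        n, k = counts.get(key, (0, 0))
        counts[key] = (n + 1, k + t)
    return {key: k / n for key, (n, k) in counts.items()}


# Hypothetical toy cohort: each of the four allele combinations of two
# SNPs occurs in 25 individuals (made-up data, for illustration only).
snp_a, snp_b = [], []
for a, b in product([0, 1], repeat=2):
    snp_a += [a] * 25
    snp_b += [b] * 25

# Assumed trait model: the trait appears only when the two alleles differ.
trait = [a ^ b for a, b in zip(snp_a, snp_b)]

r2_a_trait = pearson_r2(snp_a, trait)   # single-SNP association with trait
r2_b_trait = pearson_r2(snp_b, trait)
r2_a_b = pearson_r2(snp_a, snp_b)       # pairwise LD between the two SNPs
rates = trait_rate_by_pattern([snp_a, snp_b], trait)

print("r^2(SNP A, trait):", r2_a_trait)
print("r^2(SNP B, trait):", r2_b_trait)
print("r^2(SNP A, SNP B):", r2_a_b)
print("trait rate by (A, B) pattern:", rates)
```

In this constructed case every pairwise statistic is zero, so a screen based only on single-SNP or pairwise correlation would find nothing, while the joint pattern separates individuals with and without the trait completely. Real data are noisier, and the number of candidate patterns grows rapidly with the number of SNPs, which is why a dedicated search strategy is needed.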