Read "Information and Communications: Challenges for the Chemical Sciences in the 21st Century" at NAP.edu


Suggested Citation:"Appendix D: Workshop Presentations." National Research Council. 2003. Information and Communications: Challenges for the Chemical Sciences in the 21st Century. Washington, DC: The National Academies Press. doi: 10.17226/10831.


Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

D
Workshop Presentations

QUANTUM INFORMATION

Charles H. Bennett
IBM Research

This paper discussed how some fundamental ideas from the dawn of the last century have changed our understanding of the nature of information. The foundations of information processing were really laid in the middle of the twentieth century, and only recently have we become aware that they were constructed a bit wrong: we should have gone back to the quantum theory of the early twentieth century for a more complete picture.

The word calculation comes from the Latin word meaning a pebble; we no longer think of pebbles when we think of computations. Information can be separated from any particular physical object and treated as a mathematical entity. We can then reduce all information to bits, and we can deal with processing these bits to reveal implicit truths that are present in the information. This notion is best thought of separately from any particular physical embodiment.

In microscopic systems, it is generally not possible to observe a system without disturbing it. The phenomenon of entanglement also occurs, in which separated bodies can be correlated in a way that cannot be explained by traditional classical communication. The notion of quantum information can be abstracted in much the same way as the notion of classical information. There are actually more things that can be done with information if it is regarded in this quantum fashion.

The analogy between quantum and classical information is actually straightforward: classical information is like information in a book or on a stone tablet, but quantum information is like information in a dream. We try to recall the dream and describe it; each description resembles the original dream less than the previous description did.

The first, and so far the most practical, application of quantum information theory is quantum cryptography. Here one encodes messages and passes them on. If one thinks of photon polarization, one can distinguish vertical and horizontal polarization through calcite crystals, but diagonal photons are in principle not reliably distinguishable. They should be thought of as a superposition of vertical and horizontal polarization, and they actually propagate containing aspects of both of these parent states. This is an entangled structure, and the entangled structure contains more information than either of the pure states of which it is composed.

The next step beyond quantum cryptography, the one that made quantum information theory a byword, is the discovery of fast quantum algorithms for solving certain problems. For quantum computing, unlike simple cryptography, it is necessary to consider not only the preparation and measurement of quantum states, but also the interaction of quantum data along a stream. This is technically a more difficult issue, but it gives rise to quite exciting basic science involving quantum computing.

The entanglement of different state structures leads to so-called Einstein-Podolsky-Rosen states, which are largely responsible for the unusual properties of quantum information.

The most remarkable advance in the field, the one that made the field famous, is the fast factoring algorithm discovered by Shor at Bell Laboratories. It demonstrates that exponential speedup can be obtained using a quantum computer to factor large numbers into their prime components.
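The summary does not describe the internals of the factoring algorithm. As a rough illustration only, the sketch below shows the well-known textbook reduction on which it rests: factoring N reduces to finding the period r of a^x mod N. The function names are hypothetical, and the period is found here by classical brute force, which is exactly the step a quantum computer is expected to accelerate; this is not a quantum implementation.

```python
from math import gcd


def multiplicative_order(a: int, n: int) -> int:
    """Smallest r > 0 with a**r % n == 1, assuming gcd(a, n) == 1.

    Found by brute force; this is the step that the quantum
    algorithm would replace with fast period finding.
    """
    r, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        r += 1
    return r


def factor_from_period(n: int, a: int):
    """Try to split n using the period of a modulo n.

    Returns a pair (p, q) with p * q == n, or None when this
    choice of a does not yield a factor and another must be tried.
    """
    g = gcd(a, n)
    if g != 1:
        return g, n // g              # a already shares a factor with n
    r = multiplicative_order(a, n)
    if r % 2 == 1:
        return None                   # an odd period gives no square root
    y = pow(a, r // 2, n)             # y*y == 1 (mod n) and y != 1
    if y == n - 1:
        return None                   # trivial square root of 1
    p = gcd(y - 1, n)                 # nontrivial factor of n
    return p, n // p
```

For example, with n = 15 and a = 7 the period is 4, and the function returns the factors 3 and 5. Multiplying the factors back together is a single operation, which is the asymmetry the text refers to.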
Effectively, this quantum factorization algorithm works because it is no more difficult, using a quantum computer, to factor a large number into its prime factors than it is to multiply the prime factors to produce the large number. It is this condition that renders a quantum computer exponentially better than a classical computer in problems of this type.

The above considerations deal with information in a disembodied form. If one actually wants to make a quantum computer, there are all sorts of fabrication, interaction, decoherence, and interference considerations. This is a very rich area of experimental science, and many different avenues have been attempted. Nuclear magnetic resonance, ion traps, molecular vibrational states, and solid-state implementations have all been used in attempts to produce actual quantum computers.

Although an effective experimental example of a quantum computer is still incomplete, many major theoretical advances have suggested that some of the obvious difficulties can in fact be overcome. Several discoveries indicate that the effects of decoherence can be prevented, in principle, by including quantum error-correcting codes, entanglement distillation, and quantum fault-tolerant circuits. Good quantum computing hardware does not yet exist. The existence of these methodologies means that the hardware does not have to be perfectly efficient or perfectly reliable, because these programming techniques can make arbitrarily good quantum computers possible, even with physical equipment that suffers from decoherence issues.

Although most of the focus has been on quantum cryptography, quantum processing provides an important example of the fact that quantum computers not only can do certain tasks better than ordinary computers, but also can do different tasks that would not be imagined in the context of ordinary information processing. For example, entanglement can enhance the communication of classical messages by augmenting the capacity of a quantum channel for sending messages.

To summarize, quantum information obeys laws that subtly extend those governing classical information. The way in which these laws are extended is reminiscent of the transition from real to complex numbers. Real numbers can be viewed as an interesting subset of complex numbers, and some questions that might be asked about real numbers can be most easily understood by using the complex plane. Similarly, some computations involving real input or real output (by "real" I mean classical) are most rapidly performed using a quantum intermediate space in quantum computers. When I'm feeling especially healthy, I say that quantum computers will probably be practical within my lifetime. Strange phenomena involving quantum information are continually being discovered.

HOW SCIENTIFIC COMPUTING, KNOWLEDGE MANAGEMENT, AND DATABASES CAN ENABLE ADVANCES AND NEW INSIGHTS IN CHEMICAL TECHNOLOGY

Anne M. Chaka
National Institute of Standards and Technology

The focus of this paper is on how scientific computing and information technology (IT) can enable technical decision making in the chemical industry. The paper contains a current assessment of scientific computing and IT, a vision of where we need to be, and a roadmap of how to get there. The information and perspectives presented here come from a wide variety of sources. A general perspective from the chemical industry is found in the Council for Chemical Research's Vision 2020 Technology Partnership,[1] several workshops sponsored by NSF, NIST,[2] and DOE, and the WTEC report on industrial applications of molecular and materials modeling (which contains detailed reports on 91 institutions, including over 75 chemical companies, plus additional data from 55 U.S. chemical companies and 256 worldwide institutions in industry, academia, and government).[3] My own industrial perspective comes from two years as co-leader of the Lubrizol R&D IT Vision team and ten years as the head of computational chemistry and physics prior to coming to NIST. It should be noted that although this paper focuses primarily on the chemical industry, many of the same issues apply to the biotechnology and materials industries.

[1] Chemical Industry of the Future: Technology Roadmap for Computational Chemistry, Thompson, T. B., Ed., Council for Chemical Research, Washington, DC, 1999; http://www.ccrhq.org/vision/index/

There are many factors driving the need for scientific computing and knowledge management in the chemical industry. Global competition is forcing U.S. industry to reduce R&D costs and the time to develop new products in the chemical and materials sectors. Discovery and process optimization are currently limited by a lack of property data and insight into mechanisms that determine performance. Thirty years ago there was a shortage of chemicals, and customers would pay premium prices for any chemical that worked at all. Trial and error was used with success to develop new chemistry. Today, however, the trend has shifted due to increased competition from an abundance of chemicals on the market that work, customer consolidation, and global competition that is driving commodity pricing even for high-performance and fine chemicals. Trial and error has become too costly, and the probability of success is too low. Hence it is becoming widely recognized that companies need to develop and fine-tune chemicals and formulations by design in order to remain competitive, and to screen chemicals prior to a long and costly synthesis and testing process. In addition, the chemicals produced today must be manufactured in a way that minimizes pollution and energy costs.
Higher throughput is being achieved by shifting from batch processing to continuous feed stream, but this practice necessitates a greater understanding of the reaction kinetics, and hence the mechanism, to optimize feed stream rates. Simulation models are needed that have sufficient accuracy to predict what upstream changes in process variables are required to maintain the downstream products within specifications, as it may take several hours for upstream changes to affect downstream quality. Lastly, corporate downsizing is driving the need to capture technical knowledge in a form that can be queried and augmented in the future.

[2] NIST Workshop on Predicting Thermophysical Properties of Industrial Fluids by Molecular Simulations (June 2001), Gaithersburg, MD; 1st International Conference on Foundations of Molecular Modeling and Simulation 2000: Applications for Industry (July 2000), Keystone, CO; Workshop on Polarizability in Force Fields for Biological Simulations (December 13-15, 2001), Snowbird, UT.
[3] Applying Molecular and Materials Modeling, Westmoreland, P. R.; Kollman, P. A.; Chaka, A. M.; Cummings, P. T.; Morokuma, K.; Neurock, M.; Stechel, E. B.; Vashishta, P., Eds. Kluwer Academic Publishers, Dordrecht, 2002; http://www.wtec.org/.

Data and property information are most likely to be available on commodity materials, but industrial competition requires fast and flexible means to obtain data on novel materials, mixtures, and formulations under a wide range of conditions. For the vast majority of applications, particularly those involving mixtures and complex systems (such as drug-protein interactions or polymer nanocomposites), evaluated property data simply do not exist and are difficult, time consuming, or expensive to obtain. For example, measuring the density of a pure liquid to 0.01% accuracy requires a dual-sinker apparatus costing $500,000, and approximately $10,000 per sample. Commercial laboratory rates for measuring vapor-liquid equilibria for two state points of a binary mixture are on the order of $30,000 to $40,000. Hence industry is looking for a way to supply massive amounts of data with reliable uncertainty limits on demand. Predictive modeling and simulation have the potential to help meet this demand.

Scientific computing and information technology, however, have the potential to offer so much more than simply calculating properties or storing data. They are essential to the organization and transformation of data into wisdom that enables better technical decision making. The information management pyramid can be assigned four levels, defined as follows:

1. Data: a disorganized, isolated set of facts
2. Information: organized data that leads to insights regarding relationships (knowing what works)
3. Knowledge: knowing why something works
4. Wisdom: having sufficient understanding of factors governing performance to reliably predict what will happen (knowing what will work)

To illustrate how scientific computing and knowledge management convert data and information into knowledge and wisdom, a real example is taken from lubricant chemistry. Polysulfides, R-S_n-R, are added to lubricants to prevent wear of ferrous metal components under high pressure.
The length of the polysulfide chain, n, is typically between 2 and 6. A significant performance problem is that some polysulfide formulations also cause corrosion of copper-containing components such as bronze or brass. To address this problem, a researcher assembles data from a wide variety of sources such as analytical results regarding composition, corrosion and antiwear performance tests, and field testing. IT makes it easy for the right facts to be gathered, visualized, and interpreted. After analysis of the data, the researcher comes to the realization that long-chain polysulfides (n = 4 or greater) corrode copper, but shorter chains (n = 2 to 3) do not. This is knowing what happens. Scientific computing and modeling can then be used to determine why something happens. In this case, quantum mechanics enabled us to understand that the difference in reactivity of these sulfur chains could be explained by significant stabilization of the thiyl radical delocalized over two adjacent sulfur atoms after homolytic cleavage of the S-S bond: R-SS-SS-R → 2 R-SS·. The monosulfur thiyl radical R-S· was significantly higher in energy and is therefore much less likely to form. Hence copper corrosion is consistent with the formation of stable thiyl radicals. This insight led to a generalization that almost any sulfur radical with a low energy of formation will likely corrode copper, and we were able to reliably predict copper corrosion performance from the chemical structure of a sulfur-containing species prior to testing. This understanding also led to improvements in the manufacturing process and other applications of sulfur chemistry, and is an example of what is meant by wisdom (i.e., reliably predicting what will happen in novel applications due to a fundamental understanding of the underlying chemistry and physics).

What is the current status of scientific computing and knowledge management with respect to enabling better technical decisions? For the near term, databases, knowledge management, and scientific computing are most effective when they enable human insight. We are a long way from hitting a carriage return and obtaining answers to tough problems automatically, if that is ever possible. Wetware (i.e., human insight) is currently the best link between the levels of data, information, knowledge, and wisdom. There is no substitute for critical, scientific thinking. We can, however, currently expect an idea to be tested via experiment or calculation. First-principles calculations, if feasible on the system, improve the robustness of the predictions and can provide a link between legacy data and novel chemistry applications. Computational and IT methods can be used to generate a set of possibilities combinatorially, analyze the results for trends, and visualize the data in a manner that enables scientific insight. Developing these systems is resource intensive and very application specific. Companies will invest in their development for only the highest priority applications, and the knowledge gained will be proprietary. Access to data is critical for academics in the QSPR-QSAR method development community, but is problematic due to intellectual property issues in the commercial sector.[4] Hence there is a need to advance the science and the IT systems in the public arena to develop the fundamental foundation and building blocks upon which public and proprietary institutions can develop their own knowledge management and predictive modeling systems.

[4] Comment by Professor Curt Breneman, Rensselaer Polytechnic Institute.

What is the current status of chemical and physical property data? Published evaluated chemical and physical property data double every 10 years, yet this is woefully inadequate to keep up with demand. Obtaining these data requires meticulous experimental measurements and/or thorough evaluations of related data from multiple sources. In addition, data acquisition processes are time- and resource-consuming and therefore must be initiated well in advance of an anticipated need within an industrial or scientific application. Unfortunately, a significant part of the existing data infrastructure is not directly used in any meaningful application because data requirements often shift between the initiation and completion of a data project. Analysis and fitting, such as for equation-of-state models, must be reinitiated when significant new data become available.

One vision that has been developed in consultation with the chemical and materials industries can be described as a "Universal Data and Simulation Engine." This engine is a framework of computational tools, evaluated experimental data, active databases, and knowledge-based software guides for generating chemical and physical property data on demand with quantitative measures of uncertainty. This engine provides validated, predictive simulation methods for complex systems with seamless multiscale and multidisciplinary integration to predict properties and model physical phenomena and processes. The results are then visualized in a form useful for scientific interpretation, sometimes by a nonexpert. Examples of high-priority challenges cited by industry in the WTEC report to be ultimately addressed by the Universal Data and Simulation Engine are listed in the footnote below.[5]

[5] These include liquid-liquid interfaces (micelles and emulsions); liquid-solid interfaces (corrosion, bonding, surface wetting, transfer of electrons and atoms from one phase to another); chemical and physical vapor deposition (semiconductor industry, coatings); influence of chemistry on the thermomechanical properties of materials, particularly defect dislocation in metal alloys; complex reactions in multiple phases over multiple time scales; solution properties of complex solvents and mixtures (suspending asphaltenes or soot in oil, polyelectrolytes, free energy of solvation, rheology); composites (nonlinear mechanics, fracture mechanics); metal alloys; and ceramics.

How do we achieve this vision of a Universal Data and Simulation Engine? Toward this end, NIST has been exploring the concepts of dynamic data evaluation and virtual measurements of chemical and physical properties and predictive simulations of physical phenomena and processes. In dynamic data evaluation, all available experimental data within a technical area are collected routinely and continuously, and evaluations are conducted dynamically using an automated system when information is required. The value of data is directly related to the uncertainty, so "recommended" data must include a robust uncertainty estimate. Metadata are also collected (i.e., auxiliary information required to interpret the data, such as experimental method). Achieving this requires interoperability and data exchange standards. Ideally the dynamic data evaluation is supplemented by calculated data based on validated predictive methods (virtual measurements), and coupled with a carefully considered experimental program to generate benchmark data.

Both virtual measurements and the simulation engine have the potential to meet a growing fraction of this need by supplementing experiment and providing data in a timely manner at lower cost. Here we define "virtual measurements" specifically as predictive modeling tools that yield property data with quantified uncertainties analogous to observable quantities measured by experiment (e.g., rate constants, solubility, density, and vapor-liquid equilibria). By "simulation" we mean validated modeling of processes or phenomena that provides insight into mechanisms of action and performance with atomic resolution that is not directly accessible by experiment but is essential to guide technical decision making in product design and problem solving. This is particularly crucial for condensed-phase processes, where laboratory measurements are often the average of myriad atomistic processes and local conditions that cannot be individually resolved and analyzed by experimental techniques. The situation is analogous to gas-phase kinetics in the 1920s, prior to modern spectroscopy, when total pressure was the only measurement possible. The foundation for virtual measurements and simulations is experimental data and mathematical models that capture the underlying physics at the accuracy required by a given application. Validation of theoretical methods is vitally important.

The Council for Chemical Research's Vision 2020 states that the desired target characteristics for a virtual measurement system for chemical and physical properties are as follows: problem setup requires less than two hours, completion time is less than two days, cost including labor is less than $1,000 per simulation, and it is usable by a nonspecialist (i.e., someone who cannot make a full-time career out of molecular simulation). Unfortunately, we are a long way from meeting this target, particularly in the area of molecular simulations. Quantum chemistry methods have achieved the greatest level of robustness and, coupled with advances in computational speed, have enabled widespread success in areas such as predicting gas-phase, small-molecule thermochemistry and providing insight into reaction mechanisms. Current challenges for quantum chemistry are accurate predictions of rate constants and reaction barriers, condensed-phase thermochemistry and kinetics, van der Waals forces, searching a complex reaction space, transition metal and inorganic systems, and performance of alloys and materials dependent upon the chemical composition.
A measure of the current value of quantum mechanics to the scientific community is found in the usage of the NIST Computational Chemistry Comparison and Benchmark Database (CCCBDB) (http://srdata.nist.gov/cccbdb). The CCCBDB was established in 1997 as a result of an American Chemical Society (ACS) workshop to answer the question, "How good is that ab initio calculation?" The purpose is to expand the applicability of computational thermochemistry by providing benchmark data for evaluating theoretical methods and assigning uncertainties to computational predictions. The database contains over 65,000 calculations on 615 chemical species for which there are evaluated thermochemical data. In addition to thermochemistry, the database also contains results on structure, dipole moments, polarizability, transition states, barriers to internal rotation, atomic charges, etc. Tutorials are provided to aid the user in interpreting data and evaluating methodologies. Since the CCCBDB's inception, usage has doubled every year up to the current sustained average of 18,000 web pages served per month, with a peak of over 50,000 pages per month. Last year over 10,000 separate sites accessed the CCCBDB. There are over 400 requests per month for new chemical species not contained in the database.
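As a hedged illustration of what "assigning uncertainties to computational predictions" from benchmark data can look like in practice, the sketch below compares computed values against evaluated reference values and reports the mean signed error and the root-mean-square error. The function name and the numbers are hypothetical placeholders chosen for easy arithmetic; they are not taken from the CCCBDB, and the database's own evaluation procedures are not described in this text.

```python
from math import sqrt


def benchmark_statistics(computed, reference):
    """Return (mean signed error, RMS error) of computed vs. reference values.

    Both sequences must have the same nonzero length and the same units.
    """
    if len(computed) != len(reference) or len(computed) == 0:
        raise ValueError("need two equally long, nonempty sequences")
    errors = [c - r for c, r in zip(computed, reference)]
    mean_signed = sum(errors) / len(errors)                 # systematic bias
    rms = sqrt(sum(e * e for e in errors) / len(errors))    # typical deviation
    return mean_signed, rms


# Hypothetical placeholder values (arbitrary energy units), NOT real data.
computed_values = [10.0, 20.0, 30.0, 40.0]
reference_values = [9.0, 18.0, 32.0, 38.0]
bias, spread = benchmark_statistics(computed_values, reference_values)
```

A nonzero mean signed error suggests a systematic bias of the method for that class of molecules, while the RMS error gives a rough uncertainty to attach to a new prediction made with the same method.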

The CCCBDB is currently the only computational chemistry or physics database of its kind. This is due to the maturity of quantum mechanics in reliably predicting gas-phase thermochemistry for small (20 nonhydrogen atoms or fewer), primarily organic, molecules, plus the availability of standard-reference-quality experimental data. For gas-phase kinetics, however, only in the past two years have high-quality (<2% precision) rate-constant data become available for H· and ·OH transfer reactions to begin quantifying uncertainty for the quantum mechanical calculation of reaction barriers and tunneling.[6] There is a critical need for comparable-quality rate data and theoretical validation for a broader class of gas-phase reactions, as well as for solution-phase chemistry (for chemical processing and life science) and surface chemistry.

[6] Louis, F.; Gonzalez, C.; Huie, R. E.; Kurylo, M. J. J. Phys. Chem. A 2001, 105, 1599-1604.

One of the highest priority challenges for scientific computing for the chemical industry is the reliable prediction of fluid properties such as density, vapor-liquid equilibria, critical points, viscosity, and solubility for process design. Empirical models used in industry have been very useful for interpolating experimental data within very narrow ranges of conditions, but they cannot be extended to new systems or to conditions for which they were not developed. Models based exclusively on first principles are flexible and extensible, but can be applied only to very small systems and must be "coarse-grained" (approximated by averaging over larger regions) for the time and length scales required in industrial applications. Making the connection between quantum calculations of binary interactions or small clusters and properties of bulk systems (particularly systems that exhibit high-entropy or long-range correlated behavior) requires significant breakthroughs and expertise from multiple disciplines. The outcome of the First Industrial Fluid Properties Simulation Challenge[7] (sponsored by AIChE's Computational Molecular Science and Engineering Forum and administered by NIST) underscored these difficulties and illustrated how fragmented current approaches are. In industry, there have been several successes in applying molecular simulations, particularly in understanding polymer properties and in certain direct phase equilibrium calculations. Predicting fluid properties via molecular simulation, however, remains an art form rather than a tool. For example, there are currently over a dozen popular models for water, but models that are parameterized to give good structure for the liquid phase give poor results for ice. Others, parameterized for solvation and biological applications, fail when applied to vapor-liquid equilibrium properties for process engineering. The lack of "transferability" of water models indicates that the underlying physics of the intermolecular interactions is not adequately incorporated. The tools and systematic protocols to customize and validate potentials for given properties with specified accuracy and uncertainty do not currently exist and need to be developed.

[7] The goals of this challenge were to (a) obtain an in-depth and objective assessment of our current abilities and inabilities to predict thermophysical properties of industrially challenging fluids using computer simulations, and (b) drive development of molecular simulation methodology toward a closer alignment with the needs of the chemical industry. The Challenge was administered by NIST and sponsored by the Computational Molecular Science and Engineering Forum (AIChE). Industry awarded cash prizes to the champions of each of the three problems (vapor-liquid equilibrium, density, and viscosity). Results were announced at the AIChE annual meeting in Indianapolis, IN, November 3, 2002.

In conclusion, we are still at the early stages of taking advantage of the full potential offered by scientific computing and information technology to benefit both academic science and industry. A significant investment is required to advance the science and associated computational algorithms and technology. The impact and value of improving chemical-based insight and decision making are high, however, because chemistry is at the foundation of a broad spectrum of technology and biological processes such as

· how a drug binds to an enzyme,
· manufacture of semiconductors,
· chemical reactions occurring inside a plastic that make it burn faster than others, and
· how defects migrate under stress in a steel I-beam.

A virtual measurement system can serve as a framework for coordinating and catalyzing academic and government laboratory science in a form useful for solving technical problems and obtaining properties. There are many barriers to obtaining the required datasets that must be overcome, however. Corporate data are largely proprietary, and in academia, generating data is "perceived as dull so it doesn't get funded."[8] According to Dr. Carol Handwerker (chief, Metallurgy Division, Materials Science and Engineering Laboratory, NIST), even at NIST just gathering datasets in general is not well supported at the moment, because it is difficult for NIST to track the impact that the work will have on the people who are using the datasets. One possible way to overcome this barrier may be to develop a series of standard test problems in important application areas where the value can be clearly seen. The experimental datasets would be collected, and theoretical and scientific computing algorithms would be developed, integrated, and focused in a sustained manner to move all the way through the test problems. The data collection and scientific theory and algorithm development then clearly become means to an end.

[8] Comment by Professor John Tully, Yale University.

ON THE STRUCTURE OF MOLECULAR MODELING: MONTE CARLO METHODS AND MULTISCALE MODELING

Juan J. de Pablo
University of Wisconsin, Madison

The theoretical and computational modeling of fluids and materials can be broadly classified into three categories, namely atomistic or molecular, mesoscopic, and continuum or macroscopic.[1] At the atomistic or molecular level, detailed models of the system are employed in molecular simulations to predict the structural, thermodynamic, and dynamic behavior of a system. The range of application of these methods is on the order of angstroms to nanometers. Examples of this type of work are the prediction of reaction pathways using electronic structure methods, the study of protein structure using molecular dynamics or Monte Carlo techniques, or the study of phase transitions in liquids and solids from knowledge of intermolecular forces.[2] At the mesoscopic level, coarse-grained models and mean field treatments are used to predict structure and properties at length scales ranging from tens of nanometers to microns. Examples of this type of research are the calculation of morphology in self-assembling systems (e.g., block copolymers and surfactants) and the study of macromolecular configuration (e.g., DNA) in microfluidic devices.[3,4,5] At the continuum or macroscopic level, one is interested in predicting the behavior of fluids and materials on laboratory scales (microns to centimeters), and this is usually achieved by numerical solution of the relevant conservation equations (e.g., Navier-Stokes, in the case of fluids).[6]

Over the last decade considerable progress has been achieved in the three categories described above. It is now possible to think about "multiscale modeling" approaches, in which distinct methods appropriate for different length scales are combined or applied simultaneously to achieve a comprehensive description of a system.
This progress has been partly due to the ever-increasing power of computers but, to a large extent, it has been the result of important theoretical and algorithmic developments in the area of computational materials and fluids modeling.

Much of the interest in multiscale modeling methods is based on the premise that, one day, the behavior of entirely new materials or complex fluids will be conceived or understood from knowledge of their atomic or molecular constituents. The goal of this brief document is to summarize how molecular structure and thermodynamic properties can be simulated numerically, to establish the promise of modern molecular simulation methods, including the opportunities they offer and the challenges they face, and to discuss how the resulting molecular-level information can be used in conjunction with mesoscopic and continuum modeling techniques for study of macroscopic systems.

¹De Pablo, J. J.; Escobedo, F. A. AIChE Journal 2002, 48, 2716-2721.
²Greeley, J.; Nørskov, J. K.; Mavrikakis, M. Ann. Rev. Phys. Chem. 2002, 53, 319-348.
³Fredrickson, G. H.; Ganesan, V.; Drolet, F. Macromolecules 2002, 35, 16-39.
⁴Hur, J. S.; Shaqfeh, E. S. G.; Larson, R. G. J. Rheol. 2000, 44, 713-742.
⁵Jendrejack, R. M.; de Pablo, J. J.; Graham, M. D. J. Chem. Phys. 2002, 116, 7752-7759.
⁶Bird, R. B.; Stewart, W. E.; Lightfoot, E. N. Transport Phenomena, 2nd Ed.; John Wiley: New York, NY, 2002.

As a first concrete example, it is instructive to consider the simulation of a phase diagram (a simple liquid-vapor coexistence curve) for a simple fluid (e.g., argon) from knowledge of the interactions between molecules; in the early 1990s, the calculation of a few points on such a diagram required several weeks of supercomputer time.⁷ With more powerful techniques and faster computers, it is now possible to generate entire phase diagrams for relatively complex, industrially relevant models of fluids (such as mixtures of large hydrocarbons) in relatively short times (on the order of hours or days).⁸

A molecular simulation consists of the model or force field that is adopted to represent a system and the simulation technique that is employed to extract quantitative information (e.g., thermodynamic properties) about that model. For molecular simulations of the structure and thermodynamic properties of complex fluids and materials, particularly those consisting of large, articulated molecules (e.g., surfactants, polymers, proteins), Monte Carlo methods offer attractive features that make them particularly effective. In the context of molecular simulations, a Monte Carlo algorithm can be viewed as a process in which random steps or displacements (also known as "trial moves") away from an arbitrary initial state of the system of interest are carried out to generate large ensembles of realistic, representative configurations.
There are two essential ingredients to any Monte Carlo algorithm: the first consists of the types of "moves" or steps that are used, and the second consists of the formalism or criteria that are used to guide an algorithm toward thermodynamic equilibrium. The possibility of designing unphysical "trial moves," in which molecules can be temporarily destroyed and reassembled, often permits efficient exploration of the configuration space available to a system. In the case of long hydrocarbons, surfactants, or phospholipids, for example, configurational-bias techniques⁹,¹⁰ can accelerate the convergence of a simulation algorithm by several orders of magnitude. In the case of polymers, re-bridging techniques permit simulation of structural properties that would not be accessible by any other means.¹¹

⁷Panagiotopoulos, A. Z. Mol. Sim. 1992, 9, 1-23.
⁸Nath, S.; Escobedo, F. A.; Patramai, I.; de Pablo, J. J. Ind. Eng. Chem. Res. 1998, 37, 3195.
⁹De Pablo, J. J.; Yan, Q.; Escobedo, F. A. Ann. Rev. Phys. Chem. 1999, 50, 377-411.
¹⁰Frenkel, D.; Smit, B. Understanding Molecular Simulation, 2nd Ed.; Elsevier Science: London, UK, 2002.
¹¹Karayiannis, N. C.; Mavrantzas, V. G.; Theodorou, D. N. Phys. Rev. Lett. 2002, 88 (10), 105503.

The development of novel, clever Monte Carlo trial moves for specific systems is a fertile area of research; significant advances in our ability to model fluids and materials will result from such efforts.

A Monte Carlo simulation can be implemented in a wide variety of ways. In the most common implementation, trial moves are accepted according to probability criteria (the so-called Metropolis criteria) constructed in such a way as to result in an ensemble of configurations that satisfies the laws of physics. There is, however, considerable flexibility in the way in which such criteria are implemented. In expanded ensemble methods, for example, fictitious intermediate states of a system are created in order to facilitate transitions between an initial and a final state; transitions between states are accepted or rejected according to well-defined probability criteria. In parallel-tempering or replica-exchange simulation methods, calculations are conducted simultaneously on multiple replicas of a system. Each of these replicas can be studied at different conditions (e.g., different temperature); configuration exchanges between distinct replicas can be proposed and accepted according to probability criteria that guarantee that correct ensembles are generated for each replica.
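The acceptance rules mentioned above can be illustrated with a minimal Python sketch. It assumes the canonical (constant-temperature) ensemble, symmetric trial moves, and the textbook forms of the Metropolis and replica-exchange criteria; the function names and the reduced inverse temperature `beta` (1/kT) are illustrative choices, not taken from the presentation.

```python
import math
import random


def metropolis_probability(delta_energy, beta):
    """Probability of accepting a symmetric trial move that changes the
    energy by delta_energy at inverse temperature beta = 1/(kT)."""
    if delta_energy <= 0.0:
        return 1.0  # moves that lower the energy are always accepted
    return math.exp(-beta * delta_energy)


def swap_probability(beta_a, energy_a, beta_b, energy_b):
    """Probability of exchanging the configurations of two replicas.

    Replica a currently has energy_a at inverse temperature beta_a,
    replica b has energy_b at beta_b. The ratio of Boltzmann weights
    after and before the swap is exp((beta_a - beta_b) * (energy_a - energy_b)).
    """
    exponent = (beta_a - beta_b) * (energy_a - energy_b)
    if exponent >= 0.0:
        return 1.0
    return math.exp(exponent)


def accept(probability, rng=random):
    """Accept with the given probability using a uniform random number."""
    return rng.random() < probability


# A cold replica (beta = 2) stuck at high energy always swaps with a hot
# replica (beta = 1) that has found a lower-energy configuration.
print(swap_probability(2.0, 5.0, 1.0, 3.0))
```

Keeping the probabilities separate from the random draw makes the criteria easy to check in isolation; a production code would fold them into the simulation loop.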
Within this formalism, a system or replica that is sluggish and difficult to study at low temperatures (e.g., a highly viscous liquid or a glassy solid) can benefit from information generated in high-temperature simulations, where relaxation and convergence to equilibrium are much more effective.¹² More recently, density-of-states techniques have been proposed as a possible means to generate all thermodynamic information about a system over a broad range of conditions from a single simulation.¹³,¹⁴,¹⁵ These examples serve to illustrate that, if recent history is an indication of progress to come, new developments in the area of Monte Carlo methods will continue to increase our ability to apply first-principles information to thermodynamic property and structure prediction, thereby supplementing and sometimes even replacing more costly experimental work. Such new developments will also considerably facilitate our ability to design novel and advanced materials and fluids from atomistic- and molecular-level information.

As alluded to earlier, the results and predictions of a molecular simulation are only as good as the underlying model (or force field) that is adopted to represent a system. Transferable force fields are particularly attractive because, in the spirit of "group-contribution" approaches, they permit study of molecules and many-body systems of arbitrary complexity through assembly of atomic or chemical-group building blocks. Most force fields use electronic structure methods to generate an energy surface that is subsequently fitted using simple functional forms. The resulting functions and their corresponding parameters are referred to as a force field. It is important to note that "raw" parameters extracted from electronic structure calculations are generally unable to provide a quantitative description of the structure and properties of complex fluids and materials; their subsequent optimization by analysis of experimental data considerably improves their applicability and predictive capability. Improvements in the efficiency of simulation techniques have rendered this last aspect of force field development much more tractable than it was a decade ago. Reliable force fields are now being proposed for a wide variety of systems, including hydrocarbons, carbohydrates, alcohols, polymers, etc.¹⁶,¹⁷,¹⁸ Accurate force fields are the cornerstone of fluids and materials modeling; much more work in this area is required to reach a stage at which modeling tools can be used with confidence to interpret the results of experiments and to anticipate the behavior of novel materials.

¹²Yan, Q.; de Pablo, J. J. J. Chem. Phys. 1999, 111, 9509.
¹³Wang, F. G.; Landau, D. P. Phys. Rev. Lett. 2001, 86 (10), 2050-2053.
¹⁴Yan, Q. L.; Faller, R.; de Pablo, J. J. J. Chem. Phys. 2002, 116, 8745-8749.
¹⁵Yan, Q. L.; de Pablo, J. J. Phys. Rev. Lett. 2003, 90 (3), 035701.

The above discussion has been focused on Monte Carlo methods. Such methods can be highly effective for determining the equilibrium structure and properties of fluids and materials, but they do not provide information about time-dependent processes. Molecular dynamics (MD) methods, which are the technique of choice for study of dynamic processes, have also made considerable progress over the last decade. The development of multiple-time-step methods, for example, has significantly increased the computational efficiency of MD simulations. Unfortunately, however, the time scales amenable to molecular dynamics simulations are still on the order of tens or hundreds of nanoseconds.
Many of the processes of interest in chemistry and chemical engineering occur on much longer time scales (e.g., minutes or hours); it is unlikely that the several orders of magnitude that now separate our needs from what is possible with atomistic-level methods will be bridged by the availability of faster computers. It is therefore necessary to develop theoretical and computational methods to establish a systematic connection between atomistic and macroscopic time scales. These techniques are often referred to as multiscale methods or coarse-graining methods.

While multiscale modeling is still in its infancy, its promise is such that considerable efforts should be devoted to its development in the years to come. A few examples have started to appear in the literature. In the case of solid materials, the challenge of coupling atomistic phenomena (e.g., the tip of a crack) with mechanical behavior (e.g., crack propagation and failure) over macroscopic domains has been addressed by several authors.¹⁹,²⁰ In the case of fluids, molecular-level structure (e.g., the conformation of DNA molecules in solution) has been solved concurrently with macroscopic flow problems (fluid flow through macro- and microfluidic geometries).²¹,²²

¹⁶Jorgensen, W. L.; Maxwell, D. S.; Tirado-Rives, J. J. Am. Chem. Soc. 1996, 118, 11225-11236.
¹⁷Errington, J. R.; Panagiotopoulos, A. Z. J. Phys. Chem. B 1999, 103, 6314-6322.
¹⁸Wick, C. D.; Martin, M. G.; Siepmann, J. I. J. Phys. Chem. B 2000, 104, 8008-8016.
¹⁹Broughton, J. Q.; Abraham, F. F.; Bernstein, N.; Kaxiras, E. Phys. Rev. B 1999, 60, 2391-2403.
²⁰Smith, G. S.; Tadmor, E. B.; Bernstein, N.; Kaxiras, E. Acta Mater. 2001, 49, 4089-4101.

Multiscale methods for the study of dynamic processes currently rely on separation of time scales for various processes. One of the cornerstones of these methods is the averaging or coarse-graining of fast, local processes into a few, well-chosen variables carrying sufficient information content to provide a meaningful description of a system at longer time and length scales. The reverse is also true, and perhaps more challenging; information from a coarse-grained level must be brought back onto a microscopic level in a sensible manner, without introducing spurious behavior. This latter "reverse" problem is often underspecified and represents a considerable challenge. New and better numerical schemes to transfer information between different description levels should be developed; this must be done without adding systematic perturbations to the system and in a computationally robust and efficient way. A better understanding of coarse-graining techniques is sorely needed, as are better algorithms to merge different levels of description in a seamless and correct manner.

²¹Jendrejack, R. M.; de Pablo, J. J.; Graham, M. D. J. Non-Newtonian Fluid Mech. 2002, 108, 123-142.
²²Jendrejack, R. M.; Graham, M. D.; de Pablo, J. J. Multiscale simulation of DNA solutions in microfluidic devices, unpublished.

ADVANCES IN INFORMATION & COMMUNICATION TECHNOLOGIES: OPPORTUNITIES AND CHALLENGES IN CHEMICAL SCIENCE AND TECHNOLOGY

Thom H. Dunning, Jr.
University of Tennessee and Oak Ridge National Laboratory

Introduction

The topics to be covered in this paper are the opportunities in chemical science and technology that are arising from the advances being made in computing technologies: computers, data stores, and networks. That computing technology continues to advance at a dizzying pace is familiar to all of us. Almost as soon as one buys a PC, it becomes outdated because a faster version of the microprocessor that powers that PC becomes available. Likewise, tremendous advances are being made in memory and data storage capacities. The density of memory is increasing at the same rate as computing speed, and disk drive densities are increasing at an even faster pace. These advances have already led to a new era in computational chemistry: computational models of molecules and molecular processes are now so widely accepted, and PCs and workstations so reasonably priced, that molecular calculations are routinely used by many experimental chemists to better understand the results of their experiments. For someone like the author, who started graduate school in 1965, this transformation of the role of computing in chemistry is little short of miraculous.

Dramatic increases are also occurring in network bandwidth. Bandwidth that was available only yesterday to connect computers in a computer room is becoming available today between distant research and educational institutions; the National LambdaRail is intended to link the nation's most research-intensive universities at 10 gigabits per second. This connectivity has significant implications for experimental and computational chemistry.
Collaborations will grow as geographically remote collaborators are able to easily share data and whiteboards, simultaneously view the output of calculations or experiments, and converse with one another face to face through audio-video links. Data repositories provided by institutions such as the National Institute of Standards and Technology, although not nearly as large in chemistry as in molecular biology, are nonetheless important and will become as accessible as the data on a local hard drive. Finally, network advances promise to make the remote use of instruments, especially expensive one-of-a-kind or first-of-a-kind instruments, routine, enabling scientific advances across all of chemical science.

At the current pace of change, an order-of-magnitude increase in computing and communications capability will occur every five years. An order-of-magnitude increase in the performance of any technology is considered to be the threshold for revolutionary changes in usage patterns. It is important to keep this in mind. In fact, Professor Jack Dongarra of the University of Tennessee, one of the world's foremost experts in scientific computing, has recently stated this fact in stark terms:

The rising tide of change [resulting from advances in information technology] shows no respect for the established order. Those who are unwilling or unable to adapt in response to this profound movement not only lose access to the opportunities that the information technology revolution is creating, they risk being rendered obsolete by smarter, more agile, or more daring competitors.

Never before in the history of technology have revolutionary advances occurred at the pace being seen in computing. It is critical that chemical science and technology in the United States aggressively pursue the opportunities offered by the advances in information and communications technologies. Only by doing so will it be able to maintain its position as a world leader.

It is impossible to cover all of the opportunities (and challenges) in chemical science and technology that will result from the advances being made in information and communications technologies in a 30-minute presentation. Instead, I focus on three examples that cover the breadth of opportunities that are becoming available: (i) advances in computational modeling that are being driven by advances in computer and disk storage technology, (ii) management of large datasets that is being enabled by advances in storage and networking technologies, and (iii) remote use of one-of-a-kind or first-of-a-kind scientific instruments resulting from advances in networking and computer technologies. In this paper, I focus on applications of high-end computing and communications technologies.
However, it is important to understand that the "trickle-down effect" is very real in computing (if not in economics): the high-end applications discussed today will be applicable to the personal computers, workstations, and departmental servers of tomorrow.

Computational Modeling in Chemical Science and Technology

First, consider the advances in computational modeling wrought by the advances in computer technology. Almost everybody is aware of Moore's law, the fact that every 18-24 months there is a factor of 2 increase in the speed of the microprocessors that are used to drive personal computers as well as many of today's supercomputers. Within the U.S. Department of Energy, the increase in computing power available for scientific and engineering calculations is codified in the "ASCI Curve," which is a result of Moore's law compounded by the use of an increased number of processors (Figure 1). The result is computing power that is well beyond that predicted by Moore's law alone. If there is a factor of 2 increase every 18-24 months from computer technology, and the algorithms used in scientific and engineering calculations scale to twice as many processors over that same period of time, the net result is a factor of 4 increase in computing capability. At this pace, one can realize an order-of-magnitude increase in computing power in just five years.

FIGURE 1 Moore's law and beyond. [Plot of computing power (Tflops, logarithmic scale from 0.1 to 1,000) versus year, 1996-2004. Annotations: Microprocessors: 2x increase in performance every 18-24 months (Moore's law). Parallelism: more processors per SMP, more SMPs. Innovative designs: specialized computers, cellular architectures, processors-in-memory, HTMT.]

The increase in computing power in the past five years, in fact, follows this pattern. In 1997, "ASCI Red" at Sandia National Laboratories was the first computer system capable of performing more than 1 trillion arithmetic operations per second (1 teraflops). ASCI Red had a total of 9216 Pentium Pro processors, a peak performance of 1.8 teraflops, and 0.6 terabytes of memory. In 2002, the Japanese Earth Simulator is the world's most powerful computer, with 5120 processors, a peak performance of 40.96 teraflops, and more than 10 terabytes of memory (Figure 2). This is an increase of a factor of over 20 in peak performance in just five years! But the increase in delivered computing power is even greater. On ASCI Red a typical scientific calculation achieved 10-20% of peak performance, or 100-200 gigaflops. On the Japanese Earth Simulator, it is possible to obtain 40-50% of peak performance, or 16 to 20 teraflops. Thus, in terms of delivered performance, the increase from 1997 to 2002 is more like a factor of 100.

The Earth Simulator is the first supercomputer in a decade that was designed for science and engineering computations. Most of the supercomputers in use today were designed for commercial applications, and commercial applications place very different demands on a computer's processor, memory, I/O, and interconnect subsystems than science and engineering applications. The impact of this is clearly evident in the performance of the Earth Simulator.
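The growth arithmetic in the preceding paragraphs can be checked with a short Python sketch. It uses only the figures quoted in the text (a factor of 2 or 4 every 18-24 months, and the peak and delivered performance quoted for ASCI Red and the Earth Simulator); the function and variable names are illustrative assumptions.

```python
def growth_factor(years, months_per_step, factor_per_step):
    """Compound growth after `years` if capability grows by
    `factor_per_step` every `months_per_step` months."""
    steps = years * 12.0 / months_per_step
    return factor_per_step ** steps


# Five-year growth at the two ends of the quoted 18-24 month range.
moore_only = [growth_factor(5, months, 2.0) for months in (24, 18)]
with_parallelism = [growth_factor(5, months, 4.0) for months in (24, 18)]

# 1997 (ASCI Red) versus 2002 (Earth Simulator), in gigaflops, as quoted.
peak_ratio = 40960.0 / 1800.0
delivered_ratio_low = 16000.0 / 200.0    # 16 teraflops versus 200 gigaflops
delivered_ratio_high = 20000.0 / 100.0   # 20 teraflops versus 100 gigaflops

print("Moore's law alone, 5 years:", [round(x, 1) for x in moore_only])
print("With processor doubling, 5 years:", [round(x, 1) for x in with_parallelism])
print("Peak ratio:", round(peak_ratio, 1))
print("Delivered ratio range:", delivered_ratio_low, "to", delivered_ratio_high)
```

Under these assumptions a factor of 4 per period compounds to well over an order of magnitude in five years, and the quoted delivered-performance figures bracket the "factor of 100" cited in the text.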
For example, the performance of the Earth Simulator on the LinPack benchmark¹ is 35.86 teraflops, or 87.5% of peak performance (Figure 3). The Earth Simulator is a factor of 5 faster on the LinPack benchmark than its nearest competitor, ASCI White, even though the difference in peak performance is only a factor of three. The performance of the Earth Simulator on the LinPack benchmark exceeds the integrated total of the three largest machines in the United States (ASCI White, LeMieux at the Pittsburgh Supercomputing Center, and NERSC3 at Lawrence Berkeley National Laboratory) by a factor of nearly 2.5.

¹See: http://www.top500.org/list/2002/11/.

FIGURE 2 Japanese Earth Simulator. [Photograph with annotations. World's most powerful computer: 5,120 processors; 40.96 teraflops peak; 10.24 terabytes of memory; 640 x 640 single-stage crossbar switch. Designed for science and engineering; performance: LinPack benchmark, 35.86 teraflops (88%); atmospheric global circulation benchmark, 26.58 teraflops on 640 nodes (65%); plasma simulation code benchmark, 14.9 teraflops on 512 nodes (45%).]

FIGURE 3 LinPack benchmarks. [Bar chart of Rmax (Gflops), scale 0 to 40,000, for the Earth Simulator, ASCI White, PSC LeMieux, and NERSC3.]
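The fractions of peak performance quoted for the Earth Simulator can be reproduced with a few lines of Python. This is a minimal sketch that assumes peak performance scales linearly with the number of nodes used (640 nodes for the full 40.96-teraflops machine, as given in Figure 2); the names are illustrative.

```python
PEAK_TFLOPS = 40.96   # full-machine peak performance
TOTAL_NODES = 640     # nodes in the full machine


def fraction_of_peak(achieved_tflops, nodes_used=TOTAL_NODES):
    """Achieved performance divided by the peak of the nodes actually used,
    assuming every node contributes equally to the peak."""
    available_peak = PEAK_TFLOPS * nodes_used / TOTAL_NODES
    return achieved_tflops / available_peak


linpack = fraction_of_peak(35.86)
atmosphere = fraction_of_peak(26.58)
plasma = fraction_of_peak(14.9, nodes_used=512)

for name, value in [("LinPack", linpack),
                    ("Atmospheric circulation", atmosphere),
                    ("Plasma simulation", plasma)]:
    print(f"{name}: {100.0 * value:.1f}% of peak")
```

Normalizing by the nodes actually used matters for the plasma benchmark, which ran on 512 of the 640 nodes.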

The performance of the Earth Simulator is equally impressive when real scientific and engineering applications are considered. A general atmospheric global circulation benchmark ran at over 26 teraflops, or 65% of peak performance, on the Earth Simulator.² On the commercially oriented machines that are currently being used in the United States, it has proven difficult to achieve more than 10% of peak performance for such an application.³ So, not only is the raw speed of the Japanese Earth Simulator very high, it is also very effective for scientific and engineering applications. The Earth Simulator is expected to run a wide range of scientific and engineering applications 10-100 times faster than the fastest machines available in the United States. As Professor Dongarra noted, drawing a comparison with the launching of Sputnik by the Soviet Union in 1957, the Earth Simulator is the "computernik" of 2002, representing a wake-up call illustrating how computational science has been compromised by supercomputers built for commercial applications.

The above discussion has focused on general computer systems built using commodity components. Computer companies now have the capability to design and build specialized computers for certain types of calculations (e.g., molecular dynamics) that are far more cost-effective than traditional supercomputers. Examples of these computers include MD-GRAPE⁴ for molecular dynamics calculations and QCDOC⁵ for lattice gauge QCD calculations. Price-performance improvements of orders of magnitude can be realized with these specialized computers, although with some loss of flexibility and generality. However, with the exception of the lattice gauge QCD community, specialized computers have not gained widespread acceptance in science and engineering. This is due largely to the rapid increases in the computing capabilities of general-purpose microprocessors over the last couple of decades: with an increase of a factor of 2 every 18-24 months, the advantages offered by specialized computers can rapidly become outdated. In addition, the limited flexibility of specialized computers often prevented the use of new, more powerful algorithms or computational approaches. However, given the increasing design and fabrication capabilities of the computer industry, I believe that this is a topic well worth reexamining.

²S. Shingu, H. Takahara, H. Fuchigami, M. Yamada, Y. Tsuda, W. Ohfuchi, Y. Sasaki, K. Kobayashi, T. Hagiwara, S. Habata, M. Yokokawa, H. Itoh, and K. Otsuka, "A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," presented at SC2002 (Nov. 2002).
³"Capacity of U.S. Climate Modeling to Support Climate Change Assessment Activities," Climate Research Committee, National Research Council (National Academy Press, Washington, 1999).
⁴See: http://www.research.ibm.com/grape/.
⁵D. Chen, N. H. Christ, C. Cristian, Z. Dong, A. Gara, K. Garg, B. Joo, C. Kim, L. Levkova, X. Liao, R. D. Mawhinney, S. Ohta, and T. Wettig, Nucl. Phys. B (Proc. Suppl.) 2001, 94, 825-832; P. A. Boyle, D. Chen, N. H. Christ, C. Cristian, Z. Dong, A. Gara, B. Joo, C. Kim, L. Levkova, X. Liao, G. Liu, R. D. Mawhinney, S. Ohta, T. Wettig, and A. Yamaguchi, Nucl. Phys. B (Proc. Suppl.) 2002, 106, 177-183.

In summary, it is clear that over the next decade and likely longer, we will see continuing increases in computing power, both from commercially oriented computers and from more specialized computers, including follow-ons to the Earth Simulator. With the challenge offered by the Japanese machine, U.S. computer companies can be expected to refine their designs in order to make their computers more appropriate for scientific and engineering calculations, although a major change would be required in the business plans of companies such as IBM and HP to fully respond to this challenge (the scientific and engineering market is minuscule compared to the commercial market). These increases in computing power will have a major impact on computer-based approaches to science and engineering.

Opportunities in Computational Chemistry

What do the advances occurring in computing technology mean for computational chemistry, especially electronic structure theory, the author's area of expertise? It is still too early to answer this question in detail, but the advances in computing technology will clearly have a major impact on both the size of molecules that can be computationally modeled and the fidelity with which they can be modeled. Great strides have already been achieved for the latter, although as yet just for small molecules (the size of benzene or smaller). Consider, for example, the calculation of bond energies (Figure 4). When I started graduate school in 1965, we were lucky to be able to calculate bond energies accurate to 50 kcal/mol. Such predictions were not useful for much of anything and, in fact, looking back on the situation, I am amazed that experimental chemists showed such tolerance when we reported the results of such calculations. But the accuracy of the calculations has been improving steadily over the past 30 years. By 2000, bond energies could be predicted to better than 1 kcal/mol, as good as or better than the accuracy of most experimental measurements.

FIGURE 4 Opportunity: increased fidelity of molecular models. [Plot of the error in calculated bond energies (kcal/mol, logarithmic scale) versus year, 1970-2000. Annotations: bond energies are critical for describing many chemical phenomena; the accuracy of calculated bond energies increased dramatically from 1970 to 2000, due to advances in theoretical methodology, computational techniques, and computing technology.]

This increase in the accuracy of calculated bond energies is not just due to advances in computing technology. Computing advances were certainly important, but this level of accuracy could never have been achieved without significant advances in theoretical methodology (e.g., the development of coupled cluster theory) as well as in computational techniques, for example, the development of a family of basis sets that systematically converge to the complete basis set (CBS) limit. Coupled cluster theory, which was first introduced into chemistry in 1966 (from nuclear theory) by J. Cizek and subsequently developed and explored by J. Paldus, I. Shavitt, R. Bartlett, J. Pople, and their collaborators, provides a theoretically sound, rapidly convergent expansion of the wave function of atoms and molecules.⁶ In 1989, K. Raghavachari, J. Pople, and co-workers suggested a perturbative correction to the CCSD method to account for the effect of triple excitations.⁷ The accuracy of the CCSD(T) method is astounding. Dunning⁸ has shown that it provides a nearly quantitative description of molecular binding from such weakly bound molecules as He2, which is bound by only 0.02 kcal/mol, to such strongly bound molecules as CO, which is bound by nearly 260 kcal/mol, a range of four orders of magnitude.
One of the most interesting aspects of coupled cluster theory is that it is the mathematical incarnation of the electron pair description of molecular structure, a model used by chemists since the early twentieth century. True to the chemist's model, the higher-order corrections in coupled cluster theory (triple and higher excitations) are small, although not insignificant.

⁶A brief history of coupled cluster theory can be found in R. J. Bartlett, Theor. Chem. Acc. 2000, 103, 273-275.
⁷K. Raghavachari, G. W. Trucks, J. A. Pople, and M. Head-Gordon, Chem. Phys. Lett. 1989, 157, 479-483.
⁸T. H. Dunning, Jr., J. Phys. Chem. A 2000, 104, 9062-9080.
⁹K. L. Bak, P. Jorgensen, J. Olsen, T. Helgaker, and W. Klopper, J. Chem. Phys. 2000, 112, 9229-9242.

The accuracy of the CCSD(T) method for strongly bound molecules is illustrated in Figure 5. This figure provides a statistical analysis of the errors in the computed De values for a representative group of molecules.⁹ The curves represent the normal error distributions for three different methods commonly used to solve the electronic Schrödinger equation: second-order Møller-Plesset perturbation theory (MP2), coupled cluster theory with single and double excitations (CCSD), and CCSD with the perturbative correction for triple excitations, CCSD(T). The box at the top right lists the average error (Δave) and the standard deviation (Δstd). If the Schrödinger equation were being solved exactly for this group of molecules, the error distribution would simply be a δ-function at zero (0.0). It is clear that the error distributions for the MP2 and CCSD methods do not have this shape. Not only are the average errors far from zero (Δave = 6.0 and -8.3 kcal/mol, respectively), but the widths of the error distributions are quite large (Δstd = 7.5 and 4.5 kcal/mol, respectively). Clearly, neither the MP2 nor the CCSD method is able to provide a consistent description of the molecules included in the test set, although the CCSD method, with its smaller Δstd, can be considered more reliable than the MP2 method. On the other hand, if the perturbative triples correction is added to the CCSD energies, the error distribution shifts toward zero (Δave = -1.0 kcal/mol) and sharpens substantially (Δstd = 0.5 kcal/mol). The accuracy of the CCSD(T) method is not a fluke, as studies with the CCSDT and CCSDTQ methods, limited though they may be, show.

FIGURE 5 Theoretical advances: coupled cluster calculation of De values. [Normal error distributions ρ(ΔDe) versus ΔDe (kcal/mol, from -40.0 to 40.0) for MP2, CCSD, and CCSD(T). Inset box: Δave = 6.0, -8.3, and -1.0 kcal/mol and Δstd = 7.5, 4.5, and 0.5 kcal/mol, respectively.]

The above results illustrate the advances in our ability to quantitatively describe the electronic structure of molecules. This advance is in large part due to our ability to converge the solutions of the electronic Schrödinger equation. In the past, a combination of incomplete basis sets and lack of computing power prevented us from pushing calculations to the complete basis set (CBS) limit.
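To make the comparison of the three error distributions concrete, the following Python sketch treats each one as a normal distribution with the average error and standard deviation quoted above, and asks how much of each distribution falls within a chosen tolerance of zero. The 1 kcal/mol tolerance and all names are illustrative assumptions, not part of the original analysis.

```python
import math

# (average error, standard deviation) in kcal/mol, as quoted in the text.
ERROR_STATS = {
    "MP2": (6.0, 7.5),
    "CCSD": (-8.3, 4.5),
    "CCSD(T)": (-1.0, 0.5),
}


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def probability_within(tolerance, mean, std):
    """Probability that a normally distributed error lies in
    [-tolerance, +tolerance]."""
    return normal_cdf(tolerance, mean, std) - normal_cdf(-tolerance, mean, std)


within_1 = {method: probability_within(1.0, mean, std)
            for method, (mean, std) in ERROR_STATS.items()}

for method, probability in within_1.items():
    print(f"{method}: P(|error| < 1 kcal/mol) = {probability:.3f}")
```

Under the normal-distribution assumption, roughly half of the CCSD(T) errors fall within 1 kcal/mol of zero, compared with well under a tenth for MP2 and CCSD, which reflects both the smaller offset and the much narrower width.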
For any finite basis set calculation, the error is the sum of the error due to the use of an approximate electronic structure method plus the error due to the use of an in- complete basis set. These errors can be of opposite sign, which can lead to con- fusing "false positives" (i.e., agreement with experiment due to cancellation of

94 APPENDIX D -39.0 -41 .0 -43.0 I, -45.0 -47.0 o - . . . . aug-cc- pVDZ aug-cc- pVTZ aug-cc- pVQZ aug-cc- pV5Z 1 7252 ~50x ~1 ,OOOx ~1 O,OOOx FIGURE 6 Opportunity: converged molecular calculations (data from Xantheas S., Burnham, C.; Harrison, R. J. Chem. Phys. 2002,116, 1493~. errors).8 At the CBS limit on the other hand, the remaining error is that due to the method itself. Thus, the ability to push calculations to the CBS limit has allowed us to separate the methodological errors from the basis set truncation errors, greatly advancing our understanding of the accuracy of the various approximate methods used to solve the Schrodinger equation. The computing resources needed to push calculations to the CBS limit are illustrated in Figure 6. This figure displays the results of MP2 calculations on the water hexamer by Xantheas, Burnham, and Harrison. In addition to its in- trinsic importance, an accurate value of the binding energy for the water hexamer was needed to provide information for the parameterization of a new many-body potential for water. This information is not available from experiment and thus its only source was high-level quantum chemical calculations. The energy plotted in the figure is that required to dissociate the water hexamer at its equilibrium geometry into six water molecules at their equilibrium geometries: Ee[(H2O)61 6Ee(H2O). With the augmented double zeta basis set (aug-cc-pVDZ), the calcula- tions predict that the hexamer is bound by 39.6 kcal/mol. Calculations with the i°In contrast to chemically-bound molecules, the MP2 method predicts binding energies for hydro- gen-bonded molecules such as (H2O)n comparable to those obtained with the CCSD(T) method (see Ref. 8). iiS. S. Xantheas, C. I. Burnham, and R. I. Harrison, J. Chem. Phys. 2002, 116, 1493-1499. i2C. J. Burnham and S. S. Xantheas, J. Chem. Phys. 2002, 116, 1500-1510; C. J. Burnham and S. S. Xantheas, J. Chem. Phys. 2002, 116, 5115-5124.

quintuple zeta basis set (aug-cc-pV5Z), on the other hand, predict an equilibrium binding energy of 44.3 kcal/mol, nearly 5 kcal/mol higher. As can be seen, the variation of the binding energy with basis set is so smooth that the value can be extrapolated to the CBS limit. This yields a binding energy of 44.8 kcal/mol. From benchmark calculations on smaller water clusters, this number is expected to be accurate to better than 1 kcal/mol.

Pushing molecular calculations on molecules such as (H2O)6 to the CBS limit requires substantial computing resources. If the amount of computer time required for a calculation with the double zeta set is assigned a value of 1 unit, then a calculation with the triple zeta set requires 50 units. The amount of computing time required escalates to the order of 1,000 units for the quadruple zeta set and the order of 10,000 units for the quintuple zeta set. Clearly, advances in computing power are contributing significantly to our ability to provide quantitative predictions of molecular properties.

Another opportunity provided by greatly increased computing resources is the ability to extend calculations such as those described above to much larger molecules. This is important in the interpretation of experimental data (experimentalists always seem to focus on larger molecules than we can model), to characterize molecules for which experimental data are unavailable, and to obtain data for parameterizing semiempirical models of more complex systems. The latter was the driving force in the calculations on (H2O)n referred to above. This was also what drove the study of water interacting with cluster representations of graphite sheets by Feller and Jordan13 (Figure 7). The MP2 calculations reported by these authors considered basis sets up to aug-cc-pVQZ and graphitic clusters up to C96H24.
By exploiting the convergence properties of the correlation consistent basis sets as well as that of the graphitic clusters, they predicted the equilibrium binding energy for water to graphite to be 5.8 ± 0.4 kcal/mol. The largest MP2 calculation reported in this study was on H2O-C96H24 and required 29 hours on a 256-processor partition of the IBM SP2 at PNNL's Environmental Molecular Sciences Laboratory.

The ultimate goal of Feller and Jordan's work is to produce a potential that accurately represents the interaction of water with a carbon nanotube. Their next study will involve H2O interacting with a cluster model of a carbon nanotube. However, this will require access to the next generation massively parallel computing system from Hewlett-Packard that is currently being installed in EMSL's Molecular Science Computing Facility.14

13 D. Feller and K. Jordan, J. Phys. Chem. A 2000, 104, 9971-9975.
14 See: http://www.emsl.pnl.gov:2080/capabs/mscf capabs.html and http://www.emsl.pnl.gov:2080/capabs/mscf/hardware/config_hpcs2.html.
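The extrapolation to the CBS limit described above for the water hexamer can be illustrated with a minimal sketch. The text does not say which extrapolation form was used, so the inverse-power form E(X) = E_CBS + A / X^3 in the basis-set cardinal number X is an assumption here (it is one commonly used choice), and the function names and the synthetic energies are hypothetical. Only the relative cost units (1, 50, 1,000, 10,000) are taken from the text.

```python
"""Two-point basis-set extrapolation, assuming E(X) = E_CBS + A / X**power."""


def cbs_two_point(e_small, x_small, e_large, x_large, power=3.0):
    """Return the extrapolated limit from energies at two cardinal numbers.

    Solving E(X) = E_CBS + A / X**power at the two points gives
    E_CBS = (X_l**p * E_l - X_s**p * E_s) / (X_l**p - X_s**p).
    """
    if x_small == x_large:
        raise ValueError("the two cardinal numbers must differ")
    w_small = x_small ** power
    w_large = x_large ** power
    return (w_large * e_large - w_small * e_small) / (w_large - w_small)


def synthetic_energy(x, e_cbs=-10.0, a=8.0):
    """Made-up energy that follows the assumed model exactly (not real data)."""
    return e_cbs + a / x ** 3


# Relative computing cost per basis set, in the units quoted in the text
# (double zeta = 1 unit; the larger values are order-of-magnitude figures).
RELATIVE_COST = {
    "aug-cc-pVDZ": 1,
    "aug-cc-pVTZ": 50,
    "aug-cc-pVQZ": 1000,
    "aug-cc-pV5Z": 10000,
}

if __name__ == "__main__":
    limit = cbs_two_point(synthetic_energy(4), 4, synthetic_energy(5), 5)
    print(f"extrapolated limit from synthetic data: {limit:.6f}")
    print("cost of the full DZ-5Z series:", sum(RELATIVE_COST.values()), "units")
```

Because the synthetic energies obey the assumed model exactly, the function recovers the limit used to generate them; real energies would only approach such a form, which is why benchmark calculations are needed to judge the accuracy of the extrapolated value.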

FIGURE 7 Opportunity: Larger, More Complex Molecules. Courtesy of D. Feller, Pacific Northwest National Laboratory; see also Feller, D.; Jordan, K. D. J. Phys. Chem. A 2000, 104, 9971. [Panel text: H2O-C96H24, De -5 kcal/mol. Need interaction potentials to model nanoscale processes. Little data from experiment, need accurate calculations. First step: H2O-C96H24; next step: H2O with segment of nanotube.]

Challenges in Computational Chemistry

Although the opportunities offered by advances in computing technologies are great, many challenges must be overcome to realize the full benefits of these advances. Most often the focus is on the computational challenges because these are indeed formidable. However, we must also quantify the limitations of the existing theories used to describe molecular properties and processes as well as develop new theories for those molecular phenomena that we cannot adequately describe at present (theoretical challenges). In addition, we must seek out and carefully evaluate new mathematical approaches that show promise of reducing the cost of molecular calculations (mathematical challenges). The computational cost of current electronic structure calculations increases dramatically with the size of the molecule (e.g., the CCSD(T) method scales as N^7, where N is the number of atoms in the molecule). Of particular interest are mathematical techniques that promise to reduce the scaling of molecular calculations. Work is currently under way in all of these challenge areas.

As noted above, the theoretical challenges to be overcome include the validation of existing theories as well as the development of new theories. One of the surprises in electronic structure theory in the 1990s was the finding that Møller-Plesset perturbation theory, the most widely used means to include electron correlation effects, does not lead to a convergent perturbation expansion series. This

nature of the perturbation expansion is illustrated in Figure 8, which plots the difference between full CI and perturbation theory energies for the HF molecule in a double zeta and augmented double zeta basis set as a function of the order of perturbation theory. As can be seen, at first the perturbation series appears to be converging, but then, around fifth- or sixth-order perturbation theory, the series begins to oscillate (most evident for the aug-cc-pVDZ set) with the oscillations becoming larger and larger with increasing order of perturbation theory.15 The perturbation series is clearly diverging even for hydrogen fluoride, a molecule well described by a Hartree-Fock wave function. Dunning and Peterson16 showed that even at the complete basis set limit, the MP2 method often provides more accurate results than the far more computationally demanding MP4 method. Thus, the series is not well behaved even at low orders of perturbation theory. How many other such surprises await us in theoretical chemistry?

FIGURE 8 Theoretical challenges: convergence of perturbation expansion for HF (data from Olsen, J.; Christiansen, O.; Koch, H.; Jørgensen, P. J. Chem. Phys. 1996, 105, 5082-5090). [Plot: series for the cc-pVDZ and aug-cc-pVDZ basis sets; horizontal axis from 0 to 50, vertical axis from -30.0 to 25.0.]

15 J. Olsen, O. Christiansen, H. Koch, and P. Jørgensen, J. Chem. Phys. 1996, 105, 5082-5090.
16 T. H. Dunning, Jr. and K. A. Peterson, J. Chem. Phys. 1998, 108, 4761-4771.
17 This is not to say that the CCSD(T) method provides an accurate description of the ground states of all molecules. The coupled cluster method is based on a Hartree-Fock reference wave function and thus will fail when the HF wave function does not provide an adequate zero-order description of the molecule. The development of multireference coupled cluster methods is being actively pursued by several groups.

Although we now know how to properly describe the ground states of molecules,17 the same cannot be said of molecular excited states. We still await an

accurate, general, computationally tractable approach for solving for the higher roots of the electronic Schrödinger equation. Without such a theory, many molecular phenomena, such as visible-UV spectroscopy and photochemistry, will remain out of modeling reach.

FIGURE 9 Mathematical challenges: scaling laws for molecular calculations. [Log-log plot: curves labeled CCSD(T) ~N^7 and HF ~N^4 against Number of Atoms.]

18 L. Greengard and V. Rokhlin, Acta Numerica 6 (Cambridge University Press, Cambridge, 1997), pp 229-269 and references therein.

As noted above, current molecular electronic structure calculations scale as a high power of the number of atoms in the molecule. This is illustrated in Figure 9. For example, the Hartree-Fock method scales as N^4, while the far more accurate CCSD(T) method scales as N^7. Thus, when using the CCSD(T) method, doubling the size of a molecule increases the computing time needed for the calculation by over two orders of magnitude. Even if computing capability is doubling every 18 to 24 months, it would take another 10 years before computers would be able to model a molecule twice as big as those possible today. Fortunately, we know that, for sufficiently large molecules, these methods don't scale as N^4 or N^7. For example, it has long been known that simple, controllable mathematical approximations such as integral screening can reduce the scaling of the HF method to N^2 for sufficiently large molecules. The impact of this is illustrated by the dashed curve in the figure. Using more advanced mathematical techniques, such as the fast multipole method (FMM) to handle the long-range Coulomb interaction, and a separate

treatment of the exchange interaction, it is possible to develop a HF algorithm that exhibits linear scaling in the very large molecule limit.19 Combining these techniques with Fourier methods provides improved scaling even for small systems in high-quality basis sets. These techniques can be straightforwardly extended to DFT calculations and similar reductions in the scaling laws are possible for correlated calculations, including CCSD(T) calculations.

As in the HF method, it is possible to exploit screening in calculating the correlation energy. However, to take advantage of screening, the equations for the correlated methods must be rewritten in terms of the atomic orbital basis rather than the molecular orbital basis. This has recently been done for the coupled cluster method by Scuseria and Ayala,20 who showed that, with screening alone, the CCD equations could be solved more efficiently in the AO basis than in the MO basis for sufficiently large molecules. Since the effectiveness of screening and multipole summation techniques increases with molecule size, the question is not whether the use of AOs will reduce the scaling of coupled cluster calculations but, rather, how rapidly the scaling exponent will decrease with increasing molecular size.

The impact of reduced scaling algorithms is just beginning to be felt in chemistry, primarily in HF and DFT calculations.21 However, reduced scaling algorithms for correlated molecular calculations are likely to become available in the next few years. When these algorithmic advances are combined with advances in computing technology, the impact will be truly revolutionary. Problems that currently seem intractable will not only become doable, they will become routine.

Numerical techniques for solving the electronic Schrödinger equation are also being pursued. Another paper from this workshop has been written by R. Friesner, who developed the pseudospectral method, an ingenious half numerical-half basis set method. Another approach that is being actively pursued is the use of wavelet techniques to solve the Schrödinger equation. R. Harrison and coworkers recently reported DFT-LDA calculations on benzene22 that are the most accurate available to date (Figure 10). Unlike the traditional basis set expansion approach, the cost of wavelet-based calculations automatically scales linearly in the number of atoms. Although much remains to be done to optimize the codes for solving the Hartree-Fock and DFT equations, not to mention the development of wavelet approaches for including electron correlation effects via methods such as CCSD(T), this approach shows great promise for extending rigorous electronic structure calculations to large molecules.

FIGURE 10 Mathematical challenges: multiwavelet calculations on benzene. Courtesy of R. Harrison, G. Fann, and G. Beylkin, Pacific Northwest National Laboratory. [Table: DFT/LDA energy of -230.194 at 10^-3, -230.19838 at 10^-5, and -230.198345 at 10^-7. For comparison, Partridge-3 primitive set + aug-cc-pVTZ polarization set: -230.158382 hartrees.]

19 C. Ochsenfeld, C. A. White, and M. Head-Gordon, J. Chem. Phys. 1998, 109, 1663-1669.
20 G. E. Scuseria and P. Y. Ayala, J. Chem. Phys. 1999, 111, 8330-8343.
21 See, e.g., the Feature Article by G. E. Scuseria, J. Phys. Chem. A 1999, 103, 4782-4790.
22 R. J. Harrison, G. I. Fann, T. Yanai, and G. Beylkin, "Multiresolution Quantum Chemistry: Basic Theory and Initial Applications," to be published.

Finally, there are computational challenges that must be overcome. The supercomputers in use today achieve their power by using thousands of processors, and computers are already in the design stage that use tens of thousands to hundreds of thousands of processors. So, one challenge is to develop algorithms that scale well from a few processors to tens of processors, to hundreds of processors, to thousands of processors, and so on. This is a nontrivial challenge and is an area of active research in computational science, computer science, and applied mathematics. In some cases, this work has been very successful, with calculations using more than 1,000 processors being reported with NWChem23 and NAMD.24 For other scientific applications, we don't know when such algorithms will be discovered; this is, after all, an issue of creativity, and creativity respects no schedules.

23 See benchmark results at the NWChem web site http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html.
24 J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale, "NAMD: Biomolecular Simulation on Thousands of Processors," presented at SC2002 (Nov. 2002). See also http://www.ks.uiuc.edu/Research/namd/.

One of the surprises that I had as I began to delve into the problems associated with the use of the current generation of supercomputers was the relatively poor performance of many of the standard scientific algorithms on a single processor. Many algorithms achieved 10% or less of peak performance! This efficiency then further eroded as the calculations were scaled to more and more processors. Clearly, the cache-based microprocessors and their memory subsystems used in today's supercomputers are very different from those used in the supercomputers of the past. Figure 11 illustrates the problem. In the "Good Ole Days" (just a decade ago), Cray supercomputers provided very fast data paths between the central processing unit and main memory. On those machines, many vector

arithmetic operations, such as a DAXPY, which requires two words to be drawn from memory plus one written to memory every processor cycle, could be run at full speed directly from main memory. The supercomputers of today, on the other hand, are built from commercially oriented microprocessors and memory. Although this greatly reduces the cost of the supercomputers, the price is slow communications between the microprocessor and memory. Thus, it may take tens of cycles to transfer a word from memory to the processor. Computer designers attempt to minimize the impact of slow access to main memory by placing a fast cache memory between the processor and main memory. This works well if the algorithm can make effective use of cache, but many scientific and engineering algorithms do not. New cache-friendly algorithms must be developed to take full advantage of the new generation of supercomputers; again, a problem in creativity.

FIGURE 11 Computational challenges: keeping the processors busy. The numbers superimposed on the arrows refer to the number of processor cycles required to transfer a byte of data from the indicated memory location to the processor. [Panel titles: The Good Ole Days, Yesterday, Today. Panel labels: circa 1990; 100-1,000 processors; ~10,000 processors; CPU; 1,000s-10,000s.]

One of the reasons for the success of the Japanese Earth Simulator is that it was designed with a knowledge of the memory usage patterns of scientific and engineering applications. Although the bandwidth between the processor and memory in the Earth Simulator does not match that in the Cray computers of the 1990s (on a relative basis), it is much larger (four times or more) than that for LLNL's ASCI White or PSC's LeMieux. In late 2002, Cray, Inc. announced its new computer, the Cray X1. The Cray X1 has the same bandwidth (per flops) as the Earth Simulator, but each processor also has a 2 Mbyte cache with a bandwidth that is twice the bandwidth to main memory.
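The memory argument made above for DAXPY can be put into numbers with a rough sketch. The model below is deliberately crude and is an assumption, not a description of any real machine: it charges each word moved a fixed number of cycles, assumes transfers are not overlapped with each other or with arithmetic, and assumes the processor could otherwise finish one DAXPY element per cycle. The function names are hypothetical.

```python
"""Crude memory-bound estimate for DAXPY (y <- a*x + y)."""

# Each element needs x[i] and y[i] loaded and y[i] stored: three words moved.
WORDS_PER_ELEMENT = 3


def daxpy(a, x, y):
    """Pure-Python reference DAXPY: return a*x + y elementwise."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def fraction_of_peak(cycles_per_word, compute_cycles_per_element=1.0):
    """Estimate the fraction of peak speed reached by DAXPY.

    Assumption: the time per element is the larger of the arithmetic time
    and the (non-overlapped) time to move three words.
    """
    if cycles_per_word <= 0 or compute_cycles_per_element <= 0:
        raise ValueError("cycle counts must be positive")
    memory_cycles = WORDS_PER_ELEMENT * cycles_per_word
    bound = max(compute_cycles_per_element, memory_cycles)
    return compute_cycles_per_element / bound


if __name__ == "__main__":
    print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
    # Memory able to deliver three words per cycle versus tens of cycles per word.
    for cycles in (1.0 / 3.0, 10.0, 30.0):
        print(f"{cycles:6.2f} cycles/word -> {fraction_of_peak(cycles):.1%} of peak")
```

Under these assumptions, a memory system that moves three words per cycle keeps the processor fully busy, while a cost of ten cycles per word leaves it at a few percent of peak, which is in line with the "10% or less" figure quoted earlier. Caches help only when data are reused, and DAXPY touches each word once.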
The scientific and engineering community is eagerly awaiting the delivery and evaluation of this new computer

from Cray, Inc. Oak Ridge National Laboratory is leading this effort for DOE's Office of Science.

The figure also illustrates the speed with which data can be transferred from the memory associated with other processors. This can require thousands or tens of thousands of processor cycles. The trick to developing scalable algorithms is to keep the data close to the processors that need it. This is, of course, easier said than done.

Computational Modeling of Complex Chemical Systems

I don't know how many in the audience read John Horgan's book The End of Science: Facing the Limits of Knowledge in the Twilight of the Scientific Age (Little Brown & Company, 1997). Given the amazing advances in scientific knowledge that are being made each day, I began reading this book in a state of disbelief. It wasn't until I was well into the book that I realized that Horgan was not talking about the end of science, but rather the end of reductionism in science. These are not the same thing. A physicist might be satisfied that he understands chemistry once the Schrödinger equation had been discovered, but for chemists the job has only begun. Chemists are interested in uncovering how the basic laws of physics become the laws that govern molecular structure, energetics, and reactivity. Molecules are complex systems whose behavior, although a result of the fundamental laws of physics, cannot be directly connected to them.

Although I don't think we have yet reached the end of reductionism (much still remains to be discovered in physics, chemistry, and biology), I do think that we are in a period of transition. Scientists spent most of the twentieth century trying to understand the fundamental building blocks and processes that underlie our material world.
The result has been a revolution: chemists now understand much about the basic interactions between atoms and molecules that influence chemical reactivity and are using that knowledge to create molecules that once could be obtained only from nature (e.g., cancer-fighting taxol); biologists now understand that the basic unit of human heredity is a large, somewhat monotonous molecule and have nearly determined the sequence of A's, T's, G's and C's in human DNA and are using this knowledge to pinpoint the genetic basis of diseases. As enlightening as this has been, however, we are now faced with another, even greater, challenge: assembling all of the information available on building blocks and processes in a way that will allow us to predict the behavior of complex, real-world systems. This can only be done by employing computational models. This is the scientific frontier of the twenty-first century. We are not at the end of science; we are at a new beginning. Although Horgan may not recognize this activity as science, the accepted definition of science, "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through the scientific method," indicates that this is science nonetheless.

One example of a complex chemical system is an internal combustion engine. To predict the behavior of such a system, we must be able to model a wide variety of physical and chemical processes:

· simulate the mechanical device itself, including the dynamical processes involved in the operation of the device as well as the changes that will occur in the device as a result of combustion (changes in temperature, pressure, etc.);
· simulate the fluid dynamics of the fuel-air mixture, from the time it is injected into the combustion chamber, through the burning of the fuel and consequent alteration of its chemical composition, temperature, and pressure, to the time it is exhausted from the chamber; and
· simulate the chemical processes involved in the combustion of the fuel, which for fuels such as n-heptane (a model for gasoline) can involve hundreds of chemical species and thousands of reactions.

The problem with the latter is that many of the molecular species involved in combustion have not yet been observed and many of the reactions have not yet been characterized (reaction energetics and rates) in the laboratory. From this it is clear that supercomputers will play a critical role in understanding the behavior of complex systems. It is simply not possible for scientists to understand how all of these physical and chemical processes interact to determine the behavior of an internal combustion engine without computers to handle all of the bookkeeping. It will not be possible for experimental chemists to identify and characterize all of the important chemical species and reactions in flames; reliable computational predictions will be essential to realizing the goal.

An example of cutting-edge research on the fundamental chemical processes occurring in internal combustion engines is the work being carried out at the Combustion Research Facility of Sandia National Laboratories.
For example, Jackie Chen and her group are working on the development of computational models to describe autoignition. Autoignition is the process that ignites a fuel by the application of heat, not a flame or spark. The fuel in diesel engines is ignited by autoignition; a spark plug is not present. Autoignition is also the basis of a very efficient internal combustion engine with extremely low emissions the revolutionary homogeneous charge, compression ignition, or HCCI engine. The catch is that HCCI engines can not yet be used in cars and trucks because it is not yet possible to properly control the "blameless" combustion of fuels associated with autoignition. Figure 12 illustrates recent results from J. Chen and T. Echekki at Sandia National Laboratories (unpublished). This figure plots the concentration of peroxyl radical (HOO), which has been found to be an indicator of the onset of autoignition, as a function of time in a Hz-air mixture from two-dimensional, direct numerical simulations. These results led Chen and Echekki to identify new chemical pathways for autoignition and led them to propose a new approach to

describing autoignition in terms of relevant flow and thermochemical parameters. The simulations show that autoignition is initiated in discrete kernels or structures that evolve differently, depending strongly upon the exact local environment in which they occur. This is easily seen in the figure. But, this work is only a beginning. Turbulent fluctuations as well as the autoignition kernels are inherently three dimensional, and more complex fossil fuels will undoubtedly give rise to new behaviors that must be studied in a full range of environments to gain the scientific insight needed to develop predictive models of real-world autoignition systems. However, the computing requirements for such calculations require software and hardware capable of sustaining tens of teraflops.

FIGURE 12 Coupling chemistry and fluid dynamics: H2 autoignition in turbulent mixtures. Courtesy of J. H. Chen and T. Echekki, Sandia National Laboratories. [Panel text: Evolution of hydroperoxy, HO2, at different fractions of autoignition induction time (panels at 1/2, 3/2, and 2 times the induction time). Turbulent mixing strongly affects ignition delay by changing chemical branching/termination balance. New models may assist innovative engine design (HCCI). Limited by existing computer capabilities.]

An internal combustion engine is just one example of a complex system. There will be many opportunities to use our fundamental knowledge of chemistry to better understand complex, real-world systems. Other examples include a wide variety of processes involved in industrial chemical production, the molecular processes that determine cellular behavior, and the chemical processes affecting the formation of pollutants in urban environments.

At the workshop, Jim Heath made a presentation on nanochemistry, i.e., nanoscale molecular processes.
One of the most intriguing aspects of nanoscale phenomena is that nanoscale systems are the first systems to exhibit multiple scales in chemistry: the molecular scale and the nanoscale. Phenomena at the molecular scale are characterized by time scales of femto- to picoseconds and distance scales of tenths of a nanometer to a nanometer. Nanoscale phenomena, on the other hand, often operate on micro- to millisecond (or longer) time scales and over distances of 10-100 (or more) nm. So, in addition to whatever practical applications nanochemistry has, it also represents an opportunity to understand how to describe disjoint temporal and spatial

scales in computational modeling. Achieving this goal will require close collaboration between experimental chemists synthesizing and characterizing nanochemical systems and computational chemists modeling these systems. The concepts developed in such studies may be applicable to scaling to the meso- and macroscales.

Data Storage, Mining, and Analysis in Genomics and Related Sciences

Let me turn now to a very different type of problem: the data problem in science. The research universities in North Carolina, like most major research universities in the United States, want to become major players in the genomics revolution. This revolution not only portends a major advance in our understanding of the fundamentals of life, but also promises economic rewards to those states that make the right investments in this area. However, the problem with the "new biology" is that it is very different from the "old biology." It is not just more quantitative: it has rapidly become a very data-intensive activity. It is not data intensive now; most of the data currently available can be stored and analyzed on personal computers or small computer-data servers. But, with the quantity of data increasing exponentially, the amount of data will soon overwhelm all of those institutions that have not built the information infrastructure needed to manage it. Few universities are doing this. Most are simply assuming that their faculty will solve the problem on an individual basis. I don't believe this will work; the problem is of such a scale that it cannot be addressed by point solutions. Instead, to be a winner in the genomics sweepstakes, universities or even university consortia must invest in the integrated information infrastructure that will be needed to store, mine, and analyze the new biological data.

The next two figures illustrate the situation.
Figure 13 is a plot of the number of gigabases (i.e., linear sequences of A's, T's, C's, and G's) stored in GenBank from 1982 to the present.25 As can be seen, the number of gigabases in GenBank was negligible until about 1991, although I can assure you that scientists were hard at work from 1982 to 1991 sequencing DNA; the technology simply did not permit large-scale DNA sequencing and thus the numbers don't show on the plot. The inadequacy of the technology used to sequence DNA was widely recognized when the Office of Science in the U.S. Department of Energy initiated its human DNA sequencing program in 1986 and substantial investments were made in the development of high-throughput sequencing technologies. When the Human Genome Project, a joint effort of the National Institutes of Health and the Office of Science, began in 1990, a similar approach was followed. We are now reaping the benefits of those investments. At the present time, the doubling time for GenBank is less than one year and decreasing; in the first 10 months of 2002, more sequences were added to GenBank than in all previous years.

FIGURE 13 Background: exponential growth in GenBank. [Plot for 1982 to 2002, vertical axis from 0 to 20. Panel text: GenBank. Number of base pairs increasing rapidly (exponentially). So far, more than ~5 gigabases have been added in 2002 (as of Oct. 30th).]

25 There are three database repositories that store and distribute genome nucleotide sequence data. GenBank is the repository in the U.S. and is managed by the NIH (see: http://www.ncbi.nlm.nih.gov); DDBJ is the DNA Data Bank of Japan managed by the Japanese National Institute of Genetics (see: http://www.nig.ac.jp/home.html); and the EMBL Nucleotide Database is maintained by the European Bioinformatics Institute (see: http://www.ebi.ac.uk/). Sequence information is exchanged between these sites on a daily basis.

The other "problem" that must be dealt with in genomics research is the diversity of databases (Figure 14). There is not just one database of importance to molecular biologists involved in genomics, proteomics, and bioinformatics research; there are many, each storing a specific type of information. Each year the journal Nucleic Acids Research surveys the active databases in molecular biology and publishes information on these databases as the first article of the year. In January 2002, 335 databases were identified. Many of these databases are derived from GenBank; others are independent of GenBank. Whatever the source, however, all are growing rapidly. The challenge to the researcher in genomics is to link the data in these databases with the data that he/she is producing to advance the understanding of biological structure and function. Given the size and diversity of the databases, this is already intractable without the aid of a computer.

FIGURE 14 Background: number and diversity of databases. [Excerpt from "The Molecular Biology Database Collection: 2002 Update," Baxevanis, A. D. Nucleic Acids Research 2002, 30 (1), 1-12. Under Major Public Sequence Repositories: DNA Data Bank of Japan (DDBJ), http://www.ddbj.nig.ac.jp, all known nucleotide and protein sequences. ... 335 Databases! ... Under Varied Biomedical Content: VirOligo, http://viroligo.okstate.edu, virus-specific oligonucleotides for PCR and]

Putting all of the above together, we find ourselves in a most unusual situation in biology (Figure 15, from TimeLogic26). Moore's Law is represented at the bottom of the graph; this represents a doubling of computing capability every 18-24 months. The middle band illustrates the rate at which the quantity of genomic data is increasing. Note that, as seen above, it outstrips the growth in computing power from Moore's Law. The upper band represents the rate of growth in computing resources needed for genomic research (i.e., the amount of computing power that will be needed to mine and analyze genomic data). Clearly, the day of reckoning is coming. Although most biologists involved in genomic and related research are very comfortable with the situation they are currently in, when storage and computing requirements are increasing at an exponential rate, one can be quite comfortable one year and absolutely miserable the next.

FIGURE 15 Background: mining and analysis of genomic data. Courtesy of TimeLogic Corporation. [Plot labels: Computational Load; Genome Data, 8x growth / 18-24 months; Moore's Law, 2x growth / 18-24 months; horizontal axis: Time.]

26 See: http://www.timelogic.com/.

Of course, those biologists most heavily involved in genomics research understand that substantial computing power and data storage capabilities will be needed to support their research. Even so, many still underestimate the growing magnitude of the problem. Faculty at Duke University, North Carolina State University, and the University of North Carolina at Chapel Hill are attempting to solve the problem by building PC-based clusters (usually Linux clusters) to store and analyze their data. There are many such clusters all over the three universities with more to come. If there ever was an argument for collective action, this is it. Not only will it be difficult to grow the clusters at the pace needed to keep up with the proliferation of data, but it will be difficult to house and operate these machines as well as provide many essential services needed to ensure the integrity of the research (e.g., data backup and restore).

One difficulty with taking collective action is that each researcher wants to have his/her data close at hand, but these data need to be seamlessly integrated with all of the other data that are available, both public data being generated in the universities in North Carolina and data being stored in databases all over the world. In addition, the researchers want to be confident that their private data are secure. One way to achieve this goal is to use Grid technologies.
With Grid technologies, a distributed set of computing and data storage resources can be combined into a single computing and data storage system. In fact, my idea was to spread this capability across the whole state of North Carolina. All of the 16 campuses of the University of North Carolina plus Duke University and Wake Forest University would be tied together in a Grid and would have ready access to the computing and data storage resources available throughout the Grid whether they were physically located in the Research Triangle, on the coast, or in the mountains. Although the biggest users of such a Grid would likely be the research universities, all of the universities need to have access to this capability if they are to properly educate the next generation of biologists.

Some of the attributes of the North Carolina Bioinformatics Grid27 are summarized in Figure 16. At the top of the list are general capabilities such as single sign-on (a user need only log onto the Grid; thereafter all data and computing resources to which they are entitled become available) and a system-wide approach to security rather than a university-by-university approach. The latter is particularly important in a multicampus university. Other capabilities include fine-grain control of resource sharing based on policies set at the university and campus level (controlling who has access to which resources for what length of time). This not only protects the rights of the individual campuses, but allows the performance of the BioGrid to be optimized by siting computing and data storage resources at the optimal locations on the network.

27 For more information on the NC BioGrid, see: http://www.ncbiogrid.org.

FIGURE 16 The North Carolina Bioinformatics Grid. [Attributes listed in the figure: single sign-on and system-wide security; policy-based resource sharing; unified view of resources (computers and data); management of large data sets (efficient caching and replication).]

The Grid software underlying the BioGrid provides a unified view of all of the resources available to a user whether they are located on his/her campus or distributed across the state. In fact, one of the chief goals of Grids, such as the BioGrid, is to simplify the interaction of scientists with computers, data stores, and networks, allowing them to focus on their scientific research rather than the arcana of computing, data management, and networking.

Finally, Grids, unlike other attempts at distributed computing such as NFS and AFS, are designed to efficiently handle large datasets. At times, to optimize performance, the Grid management software may decide to cache a dataset locally (e.g., if a particular dataset is being heavily used by one site) or replicate a dataset, if it is being heavily used on an ongoing basis.

There are many economies and benefits that will be realized with the NC BioGrid. For example, at the present time all of the universities maintain their own copies of some of the public databases. Although this is not a significant expense now, it certainly will be when these databases hold petabytes of data rather than hundreds of gigabytes of data. In addition to the hardware required to store and access such massive amounts of data, a seasoned staff will be required to manage such a resource.
Important data management services such as backup and restore of research data can be assured more easily in a Grid environment than in a loosely coupled system of computers. Finally, economies will be realized by allowing each campus to size its computing and data storage system to meet its average workload and using resource sharing arrangements with the other campuses on the BioGrid to offload its higher-than-average demands.

FIGURE 17 Elements of the North Carolina BioGrid. [Layers shown, from top to bottom: BioGrid portal and interfaces; Grid-aware or Grid-enabled bioinformatics applications; Grid middleware (Globus, Legion/Avaki); NCREN3 network; NCSC plus member computing centers.]

So, how is a BioGrid constructed? Like many applications in computing, a Grid is best viewed as a layered system (Figure 17). At the lowest level are the distributed computing and data resources. It is assumed that these resources are geographically distributed, although one particular site may be dominant. This is the case in North Carolina, where the dominant site will be the North Carolina Supercomputing Center (terascale supercomputer and petascale data store); yet significant resources will also be needed at the three major campuses, and some level of resources will be needed at all campuses. The next level is the network that links the distributed computing and data storage resources. If the datasets are large, a high-performance network is needed. North Carolina is fortunate in having a statewide, high-performance network, the North Carolina Research and Education Network (NCREN). This network serves all 16 campuses of the University of North Carolina plus Duke University and Wake Forest University and is currently being upgraded to ensure that no university is connected to NCREN at less than OC-3 (155 Mbps) and many will be connected at OC-12 (622 Mbps). The next layer is the operating system for the Grid, something called Grid middleware. It is the middleware that adds the intelligence associated with the Grid to the network.
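To put the quoted line rates in perspective, the following sketch estimates how long a large dataset would take to cross an OC-3 or OC-12 link. It is a rough lower bound only: the 1-terabyte dataset size is a hypothetical choice, and protocol overhead and contention are ignored, so the nominal rate is treated as fully usable.

```python
# Nominal line rates quoted in the text, in megabits per second.
LINK_MBPS = {"OC-3": 155.0, "OC-12": 622.0}


def transfer_hours(size_bytes: float, link_mbps: float) -> float:
    """Idealized transfer time in hours, ignoring protocol overhead and contention."""
    bits = size_bytes * 8.0
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600.0


ONE_TERABYTE = 1e12  # hypothetical dataset size (decimal terabyte)

for name, rate in LINK_MBPS.items():
    print(f"{name}: about {transfer_hours(ONE_TERABYTE, rate):.1f} h per terabyte")
```

Even in this idealized case a terabyte takes on the order of 14 hours over OC-3 and between 3 and 4 hours over OC-12, which suggests why caching and replication of heavily used datasets matter for Grid performance.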
There are currently two choices for the Grid operating system, Globus28 and Legion-Avaki.29 Although Globus, which was first developed by researchers at Argonne National Laboratory and the University of Southern California, currently has the greatest mind-share, it is a toolkit for building Grids and requires a considerable amount of expertise to set up a Grid. Avaki, a commercial product that grew out of Legion, a Grid environment developed at the University of Virginia, provides an all-in-one system. The North Carolina BioGrid Testbed has both Grid operating systems running. Both Globus and Avaki are currently migrating to open standards (Open Grid Services Infrastructure and Open Grid Services Architecture30).

28 http://www.globus.org.

The top two layers of the BioGrid are what the user will see. They consist of applications such as BLAST, CHARMM, or Gaussian, modified to make them aware of the services that they can obtain from the Grid, and the portals and web interfaces that allow access to these applications and data resources in a user-friendly way. Portals can also provide access to workflow management services. Often a scientist doesn't just perform an isolated calculation. In genomics research, a typical "computer experiment" might involve comparing a recently discovered sequence with the sequences available in a number of databases (BLAST calculation). If related sequences are found in these databases, then the scientist may wish to know if three-dimensional molecular structures exist for any of the proteins coded by these related sequences (data mining) or may want to know what information is available on the function of these proteins in their associated biological systems (data mining). And the list goes on. The development of software that uses computers to manage this workflow, eliminating the time that the scientist has to spend massaging the output from one application to make it suitable for input to the next application, is a major opportunity for advancing research in molecular biology.
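The chained "computer experiment" described above (a similarity search, followed by look-ups of structure and function for the hits) can be sketched as a small pipeline in which each stage consumes the previous stage's output directly. Everything in this sketch is a hypothetical stand-in: the toy dictionaries replace real sequence, structure, and function databases, and the identity-based search is not BLAST.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

# Hypothetical toy data standing in for sequence, structure, and function databases.
TOY_SEQUENCES = {"P1": "MKTAYIAKQR", "P2": "MKTAYLAKQR", "P3": "GGGGGGGGGG"}
TOY_STRUCTURES = {"P1": "structure-A"}
TOY_FUNCTIONS = {"P1": "kinase", "P2": "kinase-like"}


def similarity_search(record: Dict) -> Dict:
    """Stand-in for a BLAST step: keep entries with at least 80% positional identity."""
    query = record["query"]

    def identity(a: str, b: str) -> float:
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

    hits = [name for name, seq in TOY_SEQUENCES.items() if identity(query, seq) >= 0.8]
    return {**record, "hits": hits}


def structure_lookup(record: Dict) -> Dict:
    """Data-mining step: which hits have a known three-dimensional structure?"""
    found = {h: TOY_STRUCTURES[h] for h in record["hits"] if h in TOY_STRUCTURES}
    return {**record, "structures": found}


def function_lookup(record: Dict) -> Dict:
    """Data-mining step: what is recorded about the function of each hit?"""
    found = {h: TOY_FUNCTIONS[h] for h in record["hits"] if h in TOY_FUNCTIONS}
    return {**record, "functions": found}


def run_workflow(stages: List[Stage], initial: Dict) -> Dict:
    """Pass one shared record through every stage in order, with no manual reformatting."""
    record = dict(initial)
    for stage in stages:
        record = stage(record)
    return record


result = run_workflow([similarity_search, structure_lookup, function_lookup],
                      {"query": "MKTAYIAKQR"})
print(result)
```

The design point is only that each stage accepts and returns the same record type, so adding a further step means appending one function to the list rather than massaging files by hand.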
When we announced that we were going to build a BioGrid in North Carolina, we were immediately approached by several biology-related software companies stating that we didn't have to build a BioGrid, they already had one. After parrying several such thrusts, Chuck Kesler at MCNC decided that we needed to more carefully define what we meant by a Grid. This is done in Figure 18. Although the companies were able to provide bits and pieces of the functionality represented in this figure, none were able to provide a significant fraction of these capabilities.

So, this is a Grid. I challenge the participants in this workshop to think about applications for Grid technologies in chemical science and technology. Clearly, chemists and chemical engineers are not yet confronted with the flood of data that is becoming available in the genomics area, but the amount of data that we have in chemistry is still substantial. Ann Chaka discusses the issue of chemical data storage and management in her paper.

29 http://www.avaki.com.
30 http://www.gridforum.org/Documents/Drafts/default.htm.

FIGURE 18 No, what you have is not a Grid!

Virtual Laboratories in Chemical Science and Technology

The last topic that I want to discuss is the concept of virtual laboratories. Advances in networking and computing have made remote access to instruments not only possible but (reasonably) convenient, largely removing the inconveniences associated with distance (e.g., travel to the remote instrument site, specified time slots for instrument availability, limited time to consult with site scientists on experimental details, etc.).

The concept of virtual laboratories is quite advanced in astronomy, where the instruments (large telescopes) are very expensive and often located in remote regions (on the tops of mountains scattered all over the world). In chemical science and technology, on the other hand, remote access to instruments is largely a foreign concept. However, as instruments for chemical research continue to increase in cost (e.g., the highest field NMRs and mass spectrometers currently available already cost several million dollars), it will soon become desirable, if not necessary, for universities and research laboratories to share these instruments rather than buy their own. Investments in software and hardware to enable remote access to these instruments can dramatically decrease the barriers associated with the use of instruments located at a remote site.

One of the major investments that I made as associate director for computing and information science (and later as Director) in the Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory was in the development of collaboratories, which were defined by Wm. A. Wulf as31

    a "center without walls," in which the nation's researchers can perform their research without regard to geographical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries.

(See also National Collaboratories: Applying Information Technologies for Scientific Research32). I did this because PNNL is located in the Pacific Northwest, a substantial distance from most of the major centers of research in the United States. Since EMSL is a National User Facility, much like the Advanced Photon Source at Argonne National Laboratory or the Advanced Light Source at Lawrence Berkeley National Laboratory, we needed to minimize the inconvenience for scientists wishing to use EMSL's facilities to support their research programs in environmental molecular science. Allowing scientists access to EMSL's resources via the Internet was key to realizing this goal.

31 Wulf, Wm. A. The National Collaboratory: A White Paper in Towards a National Collaboratory; unpublished report of a NSF workshop, Rockefeller University, NY. March 17-18, 1989.

In EMSL there are a number of first-of-a-kind and one-of-a-kind instruments that could be made available to the community via the Internet. We decided to focus on the instruments in the High-Field Nuclear Magnetic Resonance Facility as a test case. When we began, this facility included one of the first 750-MHz NMRs in the United States; it now includes an 800-MHz NMR, and a 900-MHz NMR is expected to come on-line shortly. The NMRs were very attractive candidates for this test since the software for field servicing the NMRs provided much of the basic capability needed to make them accessible over the Internet.

Development of the Virtual NMR Facility (VNMRF, Figure 19) was a true collaboration between NMR spectroscopists and computer scientists. One of the first discoveries was that simply implementing secure remote control of the NMRs was not sufficient to support the VNMRF.
Many other capabilities were needed to realize the promise of the VNMRF, including

· real-time videoconferencing,
· remotely controlled laboratory cameras, and
· real-time computer displays sharing a Web-based electronic laboratory notebook and other capabilities.

Of particular importance was the ability of remote researchers to discuss issues related to the experiment with collaborators, scientists, and technicians in the EMSL before, during, and after the experiment as well as to work together to analyze the results of the experiment. These discussions depended on the availability of electronic tools such as the Electronic Laboratory Notebook and the Televiewer (both are included in PNNL's CORE2000 software package33 for supporting collaborative research).

32 National Collaboratories: Applying Information Technologies for Scientific Research, National Research Council, National Academy Press, Washington, DC, 1993.
33 http://www.emsl.pnl.gov:2080/docs/collab/.

FIGURE 19 The Virtual NMR Facility at the Environmental Molecular Sciences Laboratory (Pacific Northwest National Laboratory).

Using the software developed or integrated into CORE2000 by EMSL staff, a remote scientist can schedule a session on an NMR, discuss the details of the experiment with EMSL staff, send the sample to EMSL, watch the technician insert the sample into the NMR, control the experiment using a virtual control panel, and visualize the data being generated in the experiment. From the virtual control panel displayed on his/her computer, the scientist can do everything that he/she could do sitting at the real control panel. Currently, over half of the scientists using EMSL's high-field NMRs use them remotely, a testimony to the effectiveness of this approach to research in chemical science and technology. A more detailed account of the VNMRF can be found in the paper by Keating et al.34 See also the VNMRF home page.35

There are a few other virtual facilities now in operation. A facility for electron tomography is operational at the University of California at San Diego36 and has proven to be very successful. There are electron microscopy virtual laboratories in operation at Argonne National Laboratory37 and Oak Ridge National Laboratory.38 Despite these successes, however, full-fledged support for virtual laboratories has not yet materialized. Those federal agencies in charge of building state-of-the-art user facilities for science and engineering rarely consider the computing, data storage, and networking infrastructure that will be needed to make the facility widely accessible to the scientific community. Even less do they consider the investments in software development and integration that will be needed to provide this capability. This is a lost opportunity that results in much wasted time on the part of the scientists who use the facility to further their research programs.

34 Keating, K. A.; Myers, J. D.; Pelton, J. G.; Bair, R. A.; Wemmer, D. E.; Ellis, P. D. J. Mag. Res. 2000, 143, 172-183.
35 http://www.emsl.pnl.gov:2080/docs/collab/virtual/EMSLVNMRF.html.
36 http://ncmir.ucsd.edu/Telescience.

Conclusion

I hope that this paper has given you an appreciation of the potential impact of advances in computing, data storage, and networking in chemical science and technology. These advances will profoundly change our field. They will greatly enhance our ability to model molecular structure, energetics, and dynamics, providing insights into molecular behavior that would be difficult, if not impossible, to obtain from experimental studies alone. They will also allow us to begin to model many complex systems in which chemical processes are an integral part (e.g., gasoline and diesel engines, industrial chemical production processes, and even the functioning of a living cell). The insights gained from these studies not only will deepen our understanding of the behavior of complex systems, but also will have enormous economic benefits.

The advances being made in Grid technologies and virtual laboratories will enhance our ability to access and use computers, chemical data, and first-of-a-kind or one-of-a-kind instruments to advance chemical science and technology.
Grid technologies will substantially reduce the barrier to using computational models to investigate chemical phenomena and to integrating data from various sources into the models or investigations. Virtual laboratories have already proven to be an effective means of dealing with the rising costs of forefront instruments for chemical research by providing capabilities needed by researchers not co-located with the instruments; all we need is a sponsor willing to push this technology forward on behalf of the user community. The twenty-first century will indeed be an exciting time for chemical science and technology.

37 http://www.amc.anl.gov/.
38 http://www.ms.ornl.gov/htmlhome/mauc/MAGrem.htm.

SYSTEMS APPROACHES IN BIOINFORMATICS AND COMPUTATIONAL GENOMICS

Christodoulos A. Floudas
Princeton University

The genomics revolution has generated a plethora of challenges and opportunities for systems approaches in bioinformatics and computational genomics. The essential completion of several genome projects, including that of the human genome, provided a detailed map from the gene sequences to the protein sequences. The gene sequences can be used to assist and/or infer the connectivity within or among the pathways. The overwhelmingly large number of generated protein sequences makes protein structure prediction from the amino acid sequence of paramount importance. The elucidation of the protein structures through novel computational frameworks that complement the existing experimental techniques provides key elements for the structure-based prediction of protein function. These include the identification of the type of fold, the type of packing, the residues that are exposed to solvent, the residues that are buried in the core, the highly conserved residues, the candidate residues for mutations, and the shape and electrostatic properties of the fold. Such elements provide the basis for the development of approaches for the location of active sites; the determination of structural and functional motifs; the study of protein-protein and protein-ligand complexes and protein-DNA interactions; the design of new inhibitors; and drug discovery through target selection, lead discovery, and optimization. Better understanding of the protein-ligand and protein-DNA interactions will provide important information for addressing key topology-related questions in both the cellular metabolic and signal transduction networks.

In this paper, we discuss two components of the genomics revolution roadmap: (1) sequence to structure, and (2) structure to function.
In the first, after a brief overview of the contributions, we present ASTRO-FOLD, which is a novel ab initio approach for protein structure prediction. In the second, we discuss the approaches for de novo protein design and present an integrated structural, computational, and experimental approach for the de novo design of inhibitors for the third component of complement, C3. We conclude with a summary of the advances and a number of challenges.

Sequence to Structure: Structure Prediction in Protein Folding

Structure prediction of polypeptides and proteins from their amino acid sequences is regarded as a holy grail in the computational chemistry and molecular and structural biology communities. According to the thermodynamic hypothesis,1 the native structure of a protein in a given environment corresponds to the global minimum free energy of the system. In spite of pioneering contributions and decades of effort, the ab initio prediction of the folded structure of a protein remains a very challenging problem. The existing approaches for protein structure prediction can be classified as: (1) homology or comparative modeling methods,2,3,4,5,6 (2) fold recognition or threading methods,7,8,9,10,11,12,13 (3) ab initio methods that utilize knowledge-based information from structural databases (e.g., secondary and/or tertiary structure restraints),14,15,16,17,18,19,20 and (4) ab initio methods without the aid of knowledge-based information.21,22,23,24,25,26,27

In the sequel, we introduce the novel ASTRO-FOLD approach for the ab initio prediction of the three-dimensional structures of proteins. The four stages of the approach are outlined in Figure 1. The first stage involves the identification of helical segments24 and is accomplished by partitioning the amino acid sequence into overlapping oligopeptides (e.g., pentapeptides, heptapeptides, nonapeptides).

1 Anfinsen, C. B. Science 1973, 181, 223.
2 Bates, P. A.; Kelley, L. A.; MacCallum, R. M.; Sternberg, M. J. E. Proteins 2001, S5, 39-46.
3 Shi, J. Y.; Blundell, T. L.; Mizuguchi, K. J. Mol. Biol. 2001, 310, 243-257.
4 Sali, A.; Sanchez, R. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 13597-13602.
5 Fischer, D. Proteins 1999, S3.
6 Alazikani, B.; Sheinerman, F. B.; Honig, B. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 14796-14801.
7 Fischer, D.; Eisenberg, D. Proc. Natl. Acad. Sci. U.S.A. 1997, 94, 11929-11934.
8 Skolnick, J.; Kolinski, A. Adv. Chem. Phys. 2002, 120, 131-192.
9 McGuffin, L. J.; Jones, D. T. Proteins 2002, 48, 44-52.
10 Panchenko, A. R.; Marchler-Bauer, A.; Bryant, S. H. J. Mol. Biol. 2000, 296, 1319-1331.
11 Jones, D. T. Proteins 2001, S5, 127-132.
12 Skolnick, J.; Kolinski, A.; Kihara, D.; Betancourt, M.; Rotkiewicz, P.; Boniecki, M. Proteins 2001, S5, 149-156.
13 Smith, T. F.; LoConte, L.; Bienkowska, J.; Gaitatzes, C.; Rogers, R. G.; Lathrop, R. J. Comp. Biol. 1997, 4, 217-225.
14 Ishikawa, K.; Yue, K.; Dill, K. A. Prot. Sci. 1999, 8, 716-721.
15 Pedersen, J. T.; Moult, J. Proteins 1997, S1, 179-184.
16 Eyrich, V. A.; Standley, D. M.; Friesner, R. A. Adv. Chem. Phys. 2002, 120, 223-264.
17 Xia, Y.; Huang, E. S.; Levitt, M.; Samudrala, R. J. Mol. Biol. 2000, 300, 171-185.
18 Standley, D. M.; Eyrich, V. A.; An, Y.; Pincus, D. L.; Gunn, J. R.; Friesner, R. A. Proteins 2001, S5, 133-139.
19 Standley, D. M.; Eyrich, V. A.; Felts, A. K.; Friesner, R. A.; McDermott, A. E. J. Mol. Biol. 1999, 285, 1691-1710.
20 Eyrich, V.; Standley, D. M.; Felts, A. K.; Friesner, R. A. Proteins 1999, 35, 41.
21 Pillardy, J.; Czaplewski, C.; Liwo, A.; Wedemeyer, W. J.; Lee, J.; Ripoll, D. R.; Arlukowicz, P.; Oldziej, S.; Arnautova, Y. A.; Scheraga, H. A. J. Phys. Chem. B 2001, 105, 7299-7311.
22 Pillardy, J.; Czaplewski, C.; Liwo, A.; Lee, J.; Ripoll, D. R.; Kazmierkiewicz, R.; Oldziej, S.; Wedemeyer, W. J.; Gibson, K. D.; Arnautova, Y. A.; Saunders, J.; Ye, Y. J.; Scheraga, H. A. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 2329-2333.
23 Srinivasan, R.; Rose, G. D. Proteins 2002, 47, 489-495.
24 Klepeis, J. L.; Floudas, C. A. J. Comp. Chem. 2002, 23, 245-266.
25 Klepeis, J. L.; Floudas, C. A. J. Comp. Chem. 2003, 24, 191-208.
26 Klepeis, J. L.; Floudas, C. A. J. Global Optim. unpublished.
27 Klepeis, J. L.; Schafroth, H. D.; Westerberg, K. M.; Floudas, C. A. Adv. Chem. Phys. 2002, 120, 254-457.

FIGURE 1 Overall flow chart for the ab initio structure prediction using ASTRO-FOLD. [Stages shown in the chart: helix prediction (detailed modeling; simulations of local interactions); beta sheet prediction (modeling of beta sheet formation; predict list of optimal arrangements); derivation of restraints; overall 3D structure prediction (structural data from previous stages; prediction via novel solution approach: global optimization and molecular dynamics).]

The concept of partitioning the amino acid sequence into overlapping oligopeptides is based on the idea that helix nucleation relies on local interactions and positioning within the overall sequence. This is consistent with the observation that local interactions extending beyond the boundaries of the helical segment retain information regarding conformational preferences.28 The partitioning pattern is generalizable and can be extended to heptapeptides, nonapeptides, or larger oligopeptides.29

28 Baldwin, R. L.; Rose, G. D. TIBS 1999, 24, 77-83.

The overall methodology for the ab initio prediction of helical segments encompasses the following steps:24 The overlapping oligopeptides are modeled as neutral peptides surrounded by a vacuum environment using the ECEPP/3 force field.30 An ensemble of low potential energy pentapeptide conformations, along with the global minimum potential energy conformation, is identified using a modification of the αBB global optimization approach31 and the conformational space annealing approach.32 For the set of unique conformers, Z, free energies (F_harvac) are calculated using the harmonic approximation for vibrational entropy.31 The energy for cavity formation in an aqueous environment is modeled using a solvent-accessible surface area expression, F_cavity = γA + b, where A is the surface area of the protein exposed to the solvent. For the set of unique conformers, Z, the total free energy F_total is calculated as the summation of F_harvac, F_cavity, F_solv, which represents the difference in polarization energies caused by the transition from a vacuum to a solvated environment, and F_ionize, which represents the ionization energy. The calculation of F_solv and F_ionize requires the use of a Poisson-Boltzmann equation solver.33 For each oligopeptide, total free energy values (F_total) are used to evaluate the equilibrium occupational probability for conformers having three central residues within the helical region of the φ-ψ space. Helix propensities for each residue are determined from the average probability of those systems in which the residue in question constitutes a core position.
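The free-energy bookkeeping in the helix-prediction stage can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assumes that the equilibrium occupational probability is a Boltzmann weight over the conformers' total free energies, and every numerical value (surface areas, the cavity parameters gamma and b, the energy terms) is made up for the example.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def cavity_energy(area: float, gamma: float, b: float) -> float:
    """Cavity formation term, linear in the solvent-accessible surface area."""
    return gamma * area + b


def total_free_energy(f_harvac: float, f_cavity: float,
                      f_solv: float, f_ionize: float) -> float:
    """Total free energy of one conformer as the sum of the four terms in the text."""
    return f_harvac + f_cavity + f_solv + f_ionize


def helical_probability(conformers, temperature: float = 298.15) -> float:
    """Boltzmann-weighted probability (an assumption) that a conformer is helical.

    `conformers` is a list of (total_free_energy, is_helical) pairs.
    """
    rt = R_KCAL * temperature
    f_min = min(f for f, _ in conformers)  # shift energies for numerical stability
    weights = [math.exp(-(f - f_min) / rt) for f, _ in conformers]
    helical = sum(w for w, (_, is_helical) in zip(weights, conformers) if is_helical)
    return helical / sum(weights)


# Hypothetical conformers: (F_harvac, area, F_solv, F_ionize, central residues helical?)
raw = [(-42.0, 610.0, -18.0, 0.4, True),
       (-41.2, 650.0, -18.5, 0.4, False),
       (-40.5, 600.0, -17.0, 0.3, True)]
GAMMA, B = 0.005, 0.86  # made-up cavity parameters for illustration only

ensemble = [(total_free_energy(fh, cavity_energy(a, GAMMA, B), fs, fi), helical)
            for fh, a, fs, fi, helical in raw]
print(f"helical occupational probability: {helical_probability(ensemble):.3f}")
```

A per-residue helix propensity would then be an average of such probabilities over the overlapping oligopeptides in which that residue occupies a core position.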
In the second stage, β strands, β sheets, and disulfide bridges are identified through a novel superstructure-based mathematical framework originally established for chemical process synthesis problems.25,34 Two types of superstructure are introduced, both of which emanate from the principle that hydrophobic interactions drive the formation of structure. The first one, denoted as hydrophobic residue-based superstructure, encompasses all potential contacts between pairs of hydrophobic residues (i.e., a contact between two hydrophobic residues may or may not exist) that are not contained in helices (except cystines, which are allowed to have cystine-cystine contacts even though they may be in helices). The second one, denoted as β-strand-based superstructure, includes all possible β-strand arrangements of interest (i.e., a β strand may or may not exist) in addition to the potential contacts between hydrophobic residues. The hydrophobic residue-based and β-strand-based superstructures are formulated as mathematical models that feature three types of binary variables: (1) representing the existence or nonexistence of contacts between pairs of hydrophobic residues; (2) denoting the existence or nonexistence of the postulated β strands; and (3) representing the potential connectivity of the postulated β strands. Several sets of constraints in the model enforce physically legitimate configurations for antiparallel or parallel β strands and disulfide bridges, while the objective function maximizes the total hydrophobic contact energy. The resulting mathematical models are integer linear programming (ILP) problems that not only can be solved to global optimality, but also can provide a rank-ordered list of alternate β-sheet configurations.25

29 Anfinsen, C.; Scheraga, H. Adv. Prot. Chem. 1975, 29, 205.
30 Nemethy, G.; Gibson, K. D.; Palmer, K. A.; Yoon, C. N.; Paterlini, G.; Zagari, A.; Rumsey, S.; Scheraga, H. A. J. Phys. Chem. 1992, 96, 6472.
31 Klepeis, J. L.; Floudas, C. A. J. Chem. Phys. 1999, 110, 7491-7512.
32 Lee, J.; Scheraga, H.; Rackovsky, S. Biopolymers 1998, 46, 103.
33 Honig, B.; Nicholls, A. Science 1995, 268, 1144-1149.
34 Floudas, C. A. Nonlinear and Mixed-Integer Optimization; Oxford University Press: New York, NY, 1995.

The third stage determines pertinent information from the results of the previous two stages. This involves the introduction of lower and upper bounds on dihedral angles of residues belonging to predicted helices or β strands, as well as restraints between the Cα atoms for residues of the selected β-sheet and disulfide-bridge configuration. Furthermore, for segments that are not classified as helices or β strands, free-energy runs of overlapping heptapeptides are conducted to identify tighter bounds on their dihedral angles.24,27,35

The fourth stage of the approach involves the prediction of the tertiary structure of the full protein sequence.26 Formulation of the problem relies on the minimization of the energy using a full atomistic force field, ECEPP/3,30 and on dihedral angle and atomic distance restraints acquired from the previous stage.
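The flavor of the second-stage formulation (binary contact variables, feasibility constraints, an objective that maximizes total hydrophobic contact energy, and a rank-ordered list of solutions) can be shown on a toy problem. This sketch is not the authors' model: the single constraint used here (each residue takes part in at most one contact) is a hypothetical simplification, the contact energies are invented, and exhaustive enumeration replaces an ILP solver, which is only feasible because the example is tiny.

```python
from itertools import product
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def rank_contact_sets(contact_energy: Dict[Pair, float],
                      top: int = 3) -> List[Tuple[float, List[Pair]]]:
    """Enumerate all 0/1 assignments of contact variables and rank the feasible ones.

    Toy feasibility rule (an assumption): no residue may appear in two contacts.
    Objective: maximize the summed contact energy of the selected pairs.
    """
    pairs = sorted(contact_energy)
    ranked = []
    for choice in product((0, 1), repeat=len(pairs)):  # one binary variable per pair
        chosen = [p for p, y in zip(pairs, choice) if y]
        residues = [r for p in chosen for r in p]
        if len(residues) != len(set(residues)):
            continue  # some residue is used twice: infeasible configuration
        ranked.append((sum(contact_energy[p] for p in chosen), chosen))
    ranked.sort(key=lambda item: -item[0])  # best objective value first
    return ranked[:top]


# Invented contact energies between hydrophobic residues, indexed by sequence position.
toy_energy = {(1, 4): 2.0, (1, 6): 1.5, (3, 6): 1.0, (4, 8): 0.5}

for score, contacts in rank_contact_sets(toy_energy):
    print(f"{score:.1f}  {contacts}")
```

For this toy input the top-ranked configuration pairs residues (1, 4) and (3, 6), with a total of 3.0; the remaining entries play the role of the rank-ordered alternates mentioned in the text.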
To overcome the multiple minima difficulty, the search is conducted using the αBB global optimization approach, which offers theoretical guarantee of convergence to an ε-global minimum for nonlinear optimization problems with twice-differentiable functions.27,36,37,38,39 This global optimization approach effectively brackets the global minimum by developing converging sequences of lower and upper bounds, which are refined by iteratively partitioning the initial domain. Upper bounds correspond to local minima of the original nonconvex problem, while lower bounds belong to the set of solutions of convex lower bounding problems, which are constructed by augmenting the objective and constraint functions by separable quadratic terms. To ensure nondecreasing lower bounds, the prospective region to be bisected is required to contain the infimum of the minima of lower bounds. A nonincreasing sequence for the upper bound is maintained by selecting the minimum over all the previously recorded upper bounds. The generation of low-energy starting points for constrained minimization is enhanced by introducing torsion angle dynamics40 within the context of the αBB global optimization framework.26

35 Klepeis, J. L.; Pieja, M. T.; Floudas, C. A. Comp. Phys. Comm. 2003, 151, 121-140.
36 Adjiman, C. S.; Androulakis, I. P.; Floudas, C. A. Computers Chem. Engng. 1998, 22, 1137-1158.
37 Adjiman, C. S.; Androulakis, I. P.; Floudas, C. A. Computers Chem. Engng. 1998, 22, 1159-1179.
38 Adjiman, C. S.; Androulakis, I. P.; Floudas, C. A. AIChE Journal 2000, 46, 1769-1797.
39 Floudas, C. A. Deterministic Global Optimization: Theory, Methods and Applications, Nonconvex Optimization and its Applications; Kluwer Academic Publishers: Dordrecht, 2000.

Two viewpoints provide competing explanations of the protein-folding question. The classical opinion regards folding as hierarchic, implying that the process is initiated by rapid formation of secondary structural elements, followed by the slower arrangement of the tertiary fold. The opposing perspective is based on the idea of a hydrophobic collapse and suggests that tertiary and secondary features form concurrently. ASTRO-FOLD bridges the gap between the two viewpoints by introducing a novel ab initio approach for tertiary structure prediction in which helix nucleation is controlled by local interactions, while non-local hydrophobic forces drive the formation of β structure. The agreement between the experimental and predicted structures (RMSD, root mean squared deviation: 4-6 Å for segments up to 100 amino acids) through extensive computational studies on proteins up to 150 amino acids reflects the promise of the ASTRO-FOLD method for generic tertiary structure prediction of polypeptides.

Structure to Function: De Novo Protein Design

The de novo protein design relies on understanding the relationship between the amino acid sequence of a protein and its three-dimensional structure.41,42,43,44,45,46 This problem begins with a known protein three-dimensional structure and requires the determination of an amino acid sequence compatible with this structure.
At the outset the problem was termed the "inverse folding problem"46,47 since protein design has intimate links to the well-known protein folding problem.48

Experimentalists have applied the techniques of mutagenesis, rational design, and directed evolution49,50 to the problem of protein design, and although these approaches have provided successes, the searchable sequence space is highly restricted.51,52 Computational protein design allows for the screening of overwhelmingly large sectors of sequence space, with this sequence diversity subsequently leading to the possibility of a much broader range of properties and degrees of functionality among the selected sequences. Allowing for all 20 possible amino acids at each position of a small 50-residue protein results in 20⁵⁰ combinations, or more than 10⁶⁵ possible sequences. From this astronomical number of sequences, the computational sequence selection process aims at selecting those sequences that will be compatible with a given structure, using efficient optimization of energy functions that model the molecular interactions.

40 Guntert, P.; Mumenthaler, C.; Wuthrich, K. J. Mol. Biol. 1997, 273, 283-298.
41 Ventura, S.; Vega, M.; Lacroix, E.; Angrand, I.; Spagnolo, L.; Serrano, L. Nature Struct. Biol. 2002, 9, 485-493.
42 Neidigh, J. W.; Fesinmeyer, R. M.; Andersen, N. H. Nature Struct. Biol. 2002, 9, 425-430.
43 Ottesen, J. J.; Imperiali, B. Nature Struct. Biol. 2001, 8, 535-539.
44 Hill, R. B.; DeGrado, W. F. J. Am. Chem. Soc. 1998, 120, 1138-1145.
45 Dahiyat, B. I.; Mayo, S. L. Science 1997, 278, 82-87.
46 Drexler, K. E. Proc. Natl. Acad. Sci. U.S.A. 1981, 78, 5275-5278.
47 Pabo, C. Nature 1983, 301, 200.
48 Hardin, C.; Pogorelov, T. V.; Luthey-Schulten, Z. Curr. Opin. Struc. Biol. 2002, 12, 176-181.
49 Bowie, J. U.; Reidhaar-Olson, J. F.; Lim, W. A.; Sauer, R. T. Science 1990, 247, 1306-1310.
50 Moore, J. C.; Arnold, F. H. Nat. Biotechnol. 1996, 14, 458-467.
51 DeGrado, W. F.; Wasserman, Z. R.; Lear, J. D. Science 1989, 243, 622-628.
52 Hecht, M. H.; Richardson, J. S.; Richardson, D. C.; Ogden, R. C. Science 1990, 249, 884-891.

The first attempts at computational protein design focused only on a subset of core residues and explored steric van der Waals-based energy functions, although over time they evolved to incorporate more detailed models and interaction potentials. Once an energy function has been defined, sequence selection is accomplished through an optimization-based search designed to minimize the energy objective. Both stochastic53,54 and deterministic55,56 methods have been applied to the computational protein design problem. Recent advances in the treatment of the protein design problem have led to the ability to select novel sequences given the structure of a protein backbone. The first computational design of a full sequence to be experimentally characterized was the achievement of a stable zinc-finger fold (ββα) using a combination of a backbone-dependent rotamer library with atomistic-level modeling and a dead-end elimination-based algorithm.45 Despite these breakthroughs, issues related to the stability and functionality of these designed proteins remain sources of frustration.
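The sequence-space estimate quoted above follows from simple counting. A minimal Python sketch, assuming 20 independent amino acid choices at each of 50 positions as stated in the text (the function name is illustrative only):

```python
import math


def sequence_space_size(n_positions: int, n_amino_acids: int = 20) -> int:
    """Number of distinct sequences when any amino acid may occupy any position."""
    return n_amino_acids ** n_positions


size = sequence_space_size(50)
exponent = math.log10(size)  # order of magnitude of the count
print(f"20^50 is about 10^{exponent:.2f}")
```

Each additional residue multiplies the count by 20, which is why exhaustive enumeration is ruled out even for small proteins.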
We have recently introduced a combined structural, computational, and experimental approach for the de novo design of novel inhibitors such as variants of the synthetic cyclic peptide Compstatin.57 A novel two-stage computational protein design method is used not only to select and rank sequences for a particular fold but also to validate the stability of the fold for these selected sequences. To correctly select a sequence compatible with a given backbone template that is flexible and represented by several NMR structures, an appropriate energy function must first be identified. The proposed sequence selection procedure is based on optimizing a pairwise distance-dependent interaction potential. A number of different parameterizations for pairwise residue interaction potentials exist; the one employed here is based on the discretization of alpha carbon distances into a set of 13 bins to create a finite number of interactions, the parameters of which were derived from a linear optimization formulated to favor native folds over optimized decoy structures.58,59 The resulting potential, which involves 2730 parameters, was shown to provide higher Z scores than other potentials and to place native folds lower in energy.58,59

53 Wernisch, L.; Hery, S.; Wodak, S. J. J. Mol. Biol. 2000, 301, 713-736.
54 Desjarlais, J. R.; Handel, T. M. J. Mol. Biol. 1999, 290, 305-318.
55 Desmet, J.; Maeyer, M. D.; Hazes, B.; Lasters, I. Nature 1992, 356, 539-542.
56 Koehl, P.; Levitt, M. Nature Struct. Biol. 1999, 6, 108.
57 Klepeis, J. L.; Floudas, C. A.; Morikis, D.; Tsokos, C. G.; Argyropoulos, E.; Spruce, L.; Lambris, J. D. 2002, submitted.

The formulation allows a set of mutations for each position i that in the general case comprises all 20 amino acids. Binary variables y_i^j and y_k^l can be introduced to indicate the possible mutations at a given position. That is, the variable y_i^j takes the value 1 when amino acid type j is active at position i in the sequence. The objective is to minimize the energy according to the amino acid pair and distance-dependent energy parameters that multiply the binary variables. The composition constraints require that there is at most one type of amino acid at each position. For the general case, the binary variables appear as bilinear combinations in the objective function. This objective can be reformulated as a strictly linear (integer linear programming, ILP) problem.57 The solution of the ILP problem can be accomplished rigorously using branch and bound techniques,34 making convergence to the global minimum energy sequence consistent and reliable. Finally, for such an ILP problem it is straightforward to identify a rank-ordered list of the low-lying energy sequences through the introduction of integer cuts34 and repetitive solution of the ILP problem.

Once a set of low-lying energy sequences has been identified via the sequence selection procedure, the fold validation stage is used to identify an optimal subset of these sequences according to a rigorous quantification of conformational probabilities. The approach is grounded in the development of conformational ensembles for the selected sequences under two sets of conditions. In the first, the structure is constrained to vary, with some imposed fluctuations, around the template structure.
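For reference, the sequence selection model described in words above can be written out explicitly. The following is a sketch only. The symbols n (number of positions), m_i (number of amino acid types allowed at position i), E_{ik}^{jl} (energy parameter for type j at position i and type l at position k, taken from the distance bin of the template), the auxiliary variables w_{ik}^{jl}, and the index set A^{(t)} of variables equal to 1 in the t-th solution are notation assumed here for illustration. The assignment constraint is written as an equality on the assumption that every position must be filled.

```latex
% Bilinear binary model
\begin{align*}
\min_{y}\quad & \sum_{i=1}^{n}\sum_{j=1}^{m_i}\sum_{k=i+1}^{n}\sum_{l=1}^{m_k}
                E_{ik}^{jl}\, y_i^j\, y_k^l \\
\text{s.t.}\quad & \sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i,
\qquad y_i^j \in \{0,1\}.
\end{align*}

% Standard linearization: w_{ik}^{jl} replaces the product y_i^j y_k^l
\begin{align*}
\min_{y,w}\quad & \sum_{i=1}^{n}\sum_{j=1}^{m_i}\sum_{k=i+1}^{n}\sum_{l=1}^{m_k}
                  E_{ik}^{jl}\, w_{ik}^{jl} \\
\text{s.t.}\quad & \sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i, \\
& y_i^j + y_k^l - 1 \;\le\; w_{ik}^{jl} \;\le\; y_i^j, \qquad
  0 \;\le\; w_{ik}^{jl} \;\le\; y_k^l, \\
& y_i^j \in \{0,1\}.
\end{align*}

% Integer cut excluding the t-th solution before the next solve
\begin{equation*}
\sum_{(i,j)\in A^{(t)}} y_i^j \;\le\; n - 1 .
\end{equation*}
```

With binary y, the four inequalities force w_{ik}^{jl} to equal the product y_i^j y_k^l, so the linear model has the same optimum as the bilinear one. Adding one integer cut per solved sequence and re-solving yields the rank-ordered list.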
In the second case, a free-folding calculation is performed for which only a limited number of restraints are likely to be incorporated, with the underlying template structure not being enforced. The distance constraints introduced for the template-constrained simulation can be based on the structural boundaries defined by the NMR ensemble, or simply on allowing some deviation from a subset of distances provided by the structural template; hence they allow for a flexible template on the backbone.

The formulations for the folding calculations are reminiscent of structure prediction problems in protein folding.26,27 In particular, a novel constrained global optimization problem first introduced for structure prediction of Compstatin using NMR data,60 and later employed in a generic framework for the structure prediction of proteins, is utilized.26 The folding formulation represents a general nonconvex constrained global optimization problem, a class of problems for which several methods have been developed. In this work, the formulations are solved via the αBB deterministic global optimization approach, a branch and bound method applicable to the identification of the global minimum of nonlinear optimization problems with twice-differentiable functions.27,36,37,38,39,60,61

58 Tobi, D.; Elber, R. Proteins 2000, 41, 40-46.
59 Tobi, D.; Shafran, G.; Linial, N.; Elber, R. Proteins 2000, 40, 71-85.
60 Klepeis, J. L.; Floudas, C. A.; Morikis, D.; Lambris, J. D. J. Comp. Chem. 1999, 20, 1354-1370.

In addition to identifying the global minimum energy conformation, the global optimization algorithm provides the means for identifying a consistent ensemble of low-energy conformations.35,61 Such ensembles are useful in deriving quantitative comparisons between the free-folding and template-constrained simulations. The relative probability for template stability, p_temp, is calculated by summing the statistical weights for those conformers from the free-folding simulation that resemble the template structure and dividing this sum by the summation of statistical weights over all conformers.

Compstatin is a 13-residue cyclic peptide and a novel synthetic complement inhibitor with the prospect of being a candidate for development as an important therapeutic agent. The binding and inhibition of complement component C3 by Compstatin is significant because C3 plays a fundamental role in the activation of the classical, alternative, and lectin pathways of complement activation. Although complement activation is part of the normal inflammatory response, inappropriate complement activation may cause host cell damage, which is the case in more than 25 pathological conditions, including autoimmune diseases, stroke, heart attack, and burn injuries.62 The application of the de novo design approach discussed here to Compstatin led to the identification of sequences with predicted sevenfold improvements in inhibition activity.
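As an aside on the αBB approach used in the folding calculations above, the bracketing of the global minimum by lower and upper bounds can be illustrated on a one-dimensional toy problem. The Python sketch below is not the implementation used in this work. The test function, the interval, the value of α (taken from an analytical bound on the second derivative), and all function names are assumptions made for illustration. For brevity the upper bound is the function value at the minimizer of the convex underestimator rather than a true local minimum.

```python
import heapq
import math


def f(x: float) -> float:
    """Nonconvex test function with several local minima (illustrative only)."""
    return math.sin(x) + math.sin(10.0 * x / 3.0)


# f''(x) = -sin(x) - (100/9) sin(10x/3) >= -(1 + 100/9), so with this ALPHA
# the function f(x) + ALPHA*(lo - x)*(hi - x) is convex on any [lo, hi].
ALPHA = 0.5 * (1.0 + 100.0 / 9.0)


def underestimator(x: float, lo: float, hi: float) -> float:
    """Convex lower bounding function: f plus a quadratic term that is <= 0 on [lo, hi]."""
    return f(x) + ALPHA * (lo - x) * (hi - x)


def minimize_convex(g, lo: float, hi: float, tol: float = 1e-10):
    """Golden-section search, valid because g is convex (unimodal) on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if g(c) < g(d):
            b = d
        else:
            a = c
    x = 0.5 * (a + b)
    return x, g(x)


def alpha_bb_1d(lo: float, hi: float, eps: float = 1e-4, max_iter: int = 100000):
    """Return (x, upper, lower) with upper - lower <= eps bracketing min f on [lo, hi]."""

    def bound(a: float, b: float):
        x, lower = minimize_convex(lambda t: underestimator(t, a, b), a, b)
        return lower, x

    lower, x = bound(lo, hi)
    best_x, best_upper = x, f(x)
    heap = [(lower, lo, hi)]  # regions ordered by their lower bound
    for _ in range(max_iter):
        lower, a, b = heapq.heappop(heap)  # region holding the lowest lower bound
        if best_upper - lower <= eps:
            return best_x, best_upper, lower
        mid = 0.5 * (a + b)
        for sub_a, sub_b in ((a, mid), (mid, b)):  # bisect and re-bound
            sub_lower, x = bound(sub_a, sub_b)
            if f(x) < best_upper:  # keep the best (nonincreasing) upper bound
                best_x, best_upper = x, f(x)
            heapq.heappush(heap, (sub_lower, sub_a, sub_b))
    raise RuntimeError("no convergence within max_iter")


x_star, upper, lower = alpha_bb_1d(2.7, 7.5)
print(f"x* ~ {x_star:.4f}, global minimum bracketed in [{lower:.5f}, {upper:.5f}]")
```

Always bisecting the region with the lowest lower bound keeps the lower-bound sequence nondecreasing, and recording the best function value seen keeps the upper-bound sequence nonincreasing, which mirrors the two sequences described for αBB.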
These sequences were subsequently validated experimentally for their inhibitory activity using complement inhibition assays.57

Summary and Challenge

For the two components of the genomics revolution, (1) sequence to structure and (2) structure to function, we discussed two significant advances. The first, the ab initio structure prediction approach ASTRO-FOLD, integrates the two competing points of view in protein folding by employing the thesis of local interactions for helix formation and the thesis of hydrophobic-hydrophobic residue interactions for the prediction of the topology of β sheets and the location of disulfide bridges. ASTRO-FOLD is based on novel deterministic global optimization and integer linear optimization approaches. The second advance, a novel approach for de novo protein design, introduces in the first stage an explicit mathematical model for the in silico sequence selection that is based on distance-dependent force fields and a rigorous treatment of the combinatorial optimization problem, while in the second stage, full atomistic-level folding calculations are introduced to rank the proposed sequences, which are subsequently validated experimentally.

61 Klepeis, J. L.; Floudas, C. A. J. Chem. Phys. 1999, 110, 7491-7512.
62 Sahu, A.; Lambris, J. D. Immunol. Rev. 2001, 180, 35-48.

Several important challenges exist. These include new methods for protein structure prediction that will consistently attain resolutions of about 4-6 Å for all-α, all-β, αβ, and α/β proteins of medium to large size; new approaches that will lead to resolution of protein structures comparable to the existing experimental techniques; novel global optimization methods for sampling in tertiary structure prediction and refinement; new approaches for the packing of helices in globular and membrane proteins; new computational methods for the structure prediction of membrane proteins; improved methods for protein-protein and protein-DNA interactions; new methods for the determination of active sites and structural and functional motifs; new methods for protein function prediction; new approaches for the design of inhibitors; and systems-based approaches for improved understanding of gene regulatory metabolic pathways and signal transduction networks.

Acknowledgments

The author gratefully acknowledges financial support from the National Science Foundation and the National Institutes of Health (R01 GM52032).

MODELING OF COMPLEX CHEMICAL SYSTEMS RELEVANT TO BIOLOGY AND MATERIALS SCIENCE: PROBLEMS AND PROSPECTS
Richard Friesner
Columbia University

Overview

In this paper, I discuss the future of computational modeling of complex, condensed-phase systems over the next decade, with a focus on biological modeling and specifically structure-based drug design. The discussion is organized as follows.
First, I review the key challenges that one faces in carrying out accurate condensed-phase modeling, with the analysis centered on the core technologies that form the essential ingredients of a simulation methodology. Next, I examine the use of molecular modeling in structure-based drug design applications. In my presentation, I also briefly discussed issues associated with software development, but space limitations do not allow elaboration on that area here.

Core Technologies

I focus here on a condensed-phase simulation problem involving the interactions of macromolecular structures, small molecules, and solvent. The problems of protein structure prediction, protein-ligand binding, and enzymatic catalysis, which are discussed in the next section, fall into this category. Periodic solid-state systems have their own special set of difficulties and simplifications, which I do not discuss.

Any condensed-phase simulation protocol that is going to address the above problems requires three fundamental components:

1. A function describing the energetics of the macromolecule(s) and small molecule(s) as a function of atomic positions. Such a function can involve direct quantum chemical computation (e.g., via density functional theory or MP2 second-order perturbation theory methods), a molecular mechanics force field, a mixed QM-MM model, or an empirical scoring function parameterized against experimental data.

2. A description of the solvent. If explicit solvent simulation is to be employed, this can simply be another term in the molecular mechanics force field, for example. However, explicit solvent models are computationally expensive and can introduce a substantial amount of noise into the evaluation of relative energies unless extremely long simulation times are used. Therefore, considerable effort has been invested in the development of continuum solvent models, which are relatively inexpensive to evaluate (typically between one and two times the cost of a gas-phase force field evaluation for one configuration) and can achieve reasonable accuracy.

3. A protocol for sampling configuration space. If structure prediction is desired, then the objective is to find the minimum free-energy configuration. If one is calculating binding affinities or rate constants, some kind of statistical averaging over accessible phase space configurations is typically required.
The total cost of any given calculation is roughly the number of configurations required to converge the phase space average (or to locate the global minimum) multiplied by the computational cost of evaluating the energy (including the contribution of the solvent model if continuum solvation is employed) per configuration. Thus, a key problem is to reduce the number of required configurations by devising more effective sampling methods.

The ability to treat real-world condensed-phase simulation problems, such as structure-based drug design, is dependent on making progress in each of these three areas, to the point where none of them represents an unavoidable bottleneck to achieving adequate accuracy in reasonable CPU times. We discuss each area briefly below, summarizing the current state of the art and prospects for the future.
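The cost estimate above is a simple product, which the following Python sketch makes explicit. The function name and the numbers in the usage example are hypothetical. The solvent term is modeled as a multiple of one gas-phase energy evaluation, which is one reading of the "one to two times" figure quoted earlier.

```python
def total_cost(n_configurations: int, gas_phase_cost: float,
               solvent_factor: float = 0.0) -> float:
    """Rough total cost: number of configurations times cost per configuration.

    gas_phase_cost is the cost of one gas-phase energy evaluation (any unit).
    solvent_factor is the continuum solvent cost expressed as a multiple of
    gas_phase_cost (assumed here to be roughly 1 to 2); 0 means no solvent model.
    """
    per_configuration = gas_phase_cost * (1.0 + solvent_factor)
    return n_configurations * per_configuration


# Hypothetical numbers: one million configurations at 1 ms per gas-phase energy.
baseline = total_cost(1_000_000, 0.001, solvent_factor=1.5)
better_sampling = total_cost(250_000, 0.001, solvent_factor=1.5)
print(f"baseline: {baseline:.0f} s, with 4x fewer configurations: {better_sampling:.0f} s")
```

Because the two factors multiply, a sampling method that needs four times fewer configurations buys as much as a fourfold faster energy function.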

Energy Models

Quantum Chemical Methods. Quantum chemistry represents the most fundamental approach to computation of the energy of any given atomic configuration. If the electronic Schrödinger equation is solved accurately, the correct answer will be obtained. Since an exact solution is not possible for systems containing more than one electron, the problem here is to solve the Schrödinger equation with approximations that are tractable and yet yield good quantitative accuracy. Following is a summary of the most useful methods that have been developed for doing this:

1. Density functional theory (DFT) provides respectable results for bond energies (2-3 kcal/mol), activation barriers (3-4 kcal/mol), conformational energies (0.5-1.5 kcal/mol), and hydrogen bonding interactions (0.5-1 kcal/mol), with a scaling with system size in the N-N² regime (N being the number of electrons in the system). A crucial strength of DFT is its ability to deliver reasonable results across the periodic table with no qualitative increase in computational effort. DFT is currently the method of choice for calculations involving reactive chemistry of large systems, particularly those containing transition metals, and is useful for a wide range of other calculations as well. Systems on the order of 300-500 atoms can be treated using current technology.

2. Second-order perturbation theory (MP2) can provide more accurate conformational energies, hydrogen bond energies, and dispersion energies than DFT (which fails completely for the last of these, at least with current functionals). This makes it the method of choice for computing conformational or intermolecular interactions and developing force fields. The computational scaling is in the N²-N³ range. Systems on the order of 100-200 atoms can be treated using current technology.
The use of large basis sets and extrapolation to the basis set limit is necessary if high accuracy is to be achieved for the properties discussed above.

3. Coupled cluster methods (CCSD(T) in particular) provide high-accuracy results (often within 0.1 kcal/mol) for many types of molecules (e.g., organic molecules) but have more difficulties with transition metal-containing species. The method scales as N⁷ and at present can conveniently be applied only to small molecules, where it is nevertheless quite valuable in producing benchmark results.

Overall, current quantum chemical methods are adequate for many purposes. Improvements in speed and accuracy are ongoing. My personal view is that this does not represent the bottleneck in accurate predictive condensed-phase simulations. If one accepts the idea that entirely quantum chemical simulations are not the optimal approach for most problems (and almost certainly not for biological problems), current methods perform well in the context of QM-MM modeling and in producing results for force field development.
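To give a feel for what the scaling exponents quoted above imply, the short Python sketch below computes how much more a calculation costs when the system size is doubled. It assumes a pure power law, cost proportional to N^p, with the exponent ranges taken from the text; prefactors, which differ greatly between methods, are ignored, and the names are illustrative.

```python
# Scaling exponents quoted in the text as (low, high); cost is modeled as N**p.
SCALING_EXPONENTS = {
    "DFT": (1, 2),
    "MP2": (2, 3),
    "CCSD(T)": (7, 7),
}


def cost_multiplier(size_ratio: float, exponent: float) -> float:
    """Factor by which the cost grows when the system size grows by size_ratio."""
    return size_ratio ** exponent


for method, (low, high) in SCALING_EXPONENTS.items():
    low_mult = cost_multiplier(2.0, low)
    high_mult = cost_multiplier(2.0, high)
    print(f"{method}: doubling N multiplies the cost by {low_mult:g} to {high_mult:g}")
```

Under this simple model, doubling the system size multiplies a CCSD(T) cost by 128 but a DFT cost by only 2 to 4, which is consistent with the very different system sizes each method can reach.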

Molecular Mechanics Force Fields. There are a number of different philosophies that have been the basis for various force field development efforts. Some groups have relied primarily on fitting parameters to experimental data, while others have introduced a larger component of ab initio quantum chemical data. My view is that extensive use of ab initio data is necessary if one is going to achieve high accuracy and broad coverage of chemical space. Experimental data can be used in conjunction with ab initio data and also used to test the resulting models.

There are two basic issues associated with force field development. The first is the functional form to be used. The simplest cases involve building models for organic molecules that do not consider reactive chemistry. In this case, the standard models (valence terms for stretches, bends, and torsions; electrostatic and van der Waals terms for intermolecular interactions) are sufficient, with the proviso that the electrostatics should be described by higher-order multipoles of some sort (as opposed to using only atomic point charges) and that polarization should be introduced explicitly into the model. Simplified models that do not include polarization or higher multipoles appear to yield reasonable structural predictions but may not be able to provide highly precise energetics (e.g., binding affinities, although the improvements attainable with a more accurate electrostatic description have yet to be demonstrated). The second problem is achieving a reliable fit to the quantum chemical and experimental data. This is a very challenging problem that is tedious, labor intensive, and surprisingly difficult technically. While I believe that better solutions will emerge in the next decade, this is an area that definitely could benefit from additional funding.

Current force fields are lacking in both accuracy and coverage of chemical space.
Improvements in the past decade have been incremental, although real. With current levels of funding, continued incremental improvement can be expected. Larger-scale efforts (which perhaps will be financed in private industry once people are convinced that bottom-line improvements to important problems will result) will be needed to produce improvement beyond an incremental level.

Empirical Models. The use of empirical models is widespread in both academia and industry. There are, for example, a large number of "statistical" potential functions that have been developed to address problems such as protein folding or protein-ligand docking. The appeal of such models is that they can be designed to be computationally inexpensive and can be fitted directly to experimental data, in the hope of yielding immediately meaningful results for complex systems. The challenge is that systematic improvement of empirical models beyond a certain point is extremely difficult and becomes more difficult as the basis for the model becomes less physical.

A recent trend has been to combine approaches having molecular mechanics and empirical elements; an example of this is Bill Jorgensen's work on the computation of binding free energies. This is a direction in which much more work needs to be done, much of it of an exploratory variety. In the end, it is unlikely that brute force simulations will advance the solution of practical problems in the next decade (in another 10-20 years, this may actually be a possibility). A combination of physically meaningful functional forms and models with empirical insight gained from directly relevant experiments is more likely to work. Computational chemistry needs to become a full partner with experiment, not try to replace it; the technology to do that simply is not there at this time.

Solvation Models

I concentrate here on continuum models for aqueous solution, although the basic ideas are not very different for other solvents. A great deal of progress has been made in the past decade in developing continuum solvation models based on solution of the Poisson-Boltzmann (PB) equation, as well as approximations to this equation such as the generalized Born (GB) model. These approaches properly treat long-range electrostatic interactions and as such are significantly more accurate than, for example, surface-area-based continuum solvation methods. Progress has also been made in modeling the nonpolar part of the solvation model (i.e., cavity formation, van der Waals interactions) for both small molecules and proteins.

The current situation can be summarized as follows:

· Computational performance of modern PB- and GB-based methods is quite respectable, no more than a factor of 2 more expensive than a gas-phase calculation.

· Accurate results for small molecules can be obtained routinely by fitting experimental data when available.

· The significant issues at this point revolve around the transferability of the parameterization to larger structures.
How well do continuum methods describe a small number of waters in a protein cavity, for example, or solvation around a salt bridge (provided that the individual functional groups are well described in isolation)? Our most recent results suggest that there are cases in which a small number of explicit waters is essential and that the transferability problem has not yet been completely solved.

The above problems, while highly nontrivial, are likely to be tractable in the next decade. Thus, my prediction would be that, given sufficient resources, robust continuum solvent models (using a small number of explicit waters when necessary) can and will be developed and that this aspect of the model will cease to limit the accuracy. However, a major investment in this area (public or private) is going to be necessary if this is to be accomplished.
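For readers unfamiliar with the generalized Born model mentioned in this section, a minimal Python sketch of its polar solvation term follows. It uses the widely cited pairwise functional form due to Still and co-workers, which is an assumption here because the text does not specify a functional form. The effective Born radii are simply taken as inputs, although computing them is the difficult part in practice. The Coulomb conversion constant, the dielectric constants, and all names are likewise illustrative choices.

```python
import math

# Conversion factor for charges in units of e and distances in Angstrom,
# giving energies in kcal/mol (a commonly used value; assumed here).
COULOMB = 332.06


def f_gb(r: float, radius_i: float, radius_j: float) -> float:
    """Still-type interpolation between the Born (r = 0) and Coulomb (large r) limits."""
    product = radius_i * radius_j
    return math.sqrt(r * r + product * math.exp(-r * r / (4.0 * product)))


def gb_polar_energy(charges, born_radii, coords,
                    eps_solvent: float = 78.5, eps_solute: float = 1.0) -> float:
    """Generalized Born polarization free energy in kcal/mol.

    charges: partial charges (e); born_radii: effective Born radii (Angstrom);
    coords: (x, y, z) tuples (Angstrom). The double sum includes i == j terms,
    which reproduce the Born self-energy of each atom.
    """
    prefactor = -0.5 * COULOMB * (1.0 / eps_solute - 1.0 / eps_solvent)
    total = 0.0
    for q_i, r_i, x_i in zip(charges, born_radii, coords):
        for q_j, r_j, x_j in zip(charges, born_radii, coords):
            total += q_i * q_j / f_gb(math.dist(x_i, x_j), r_i, r_j)
    return prefactor * total


# A single monovalent ion of radius 2 Angstrom reduces to the Born formula.
ion_energy = gb_polar_energy([1.0], [2.0], [(0.0, 0.0, 0.0)])
print(f"Born ion solvation energy: {ion_energy:.2f} kcal/mol")
```

The cost of this term is a double loop over atoms with one square root and one exponential per pair, which is comparable to a gas-phase nonbonded evaluation and fits the performance figures summarized above.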

Sampling Methods

A wide variety of sampling methods are being developed to address the problems of both global optimization and free-energy calculations. Traditional methods using molecular modeling programs include molecular dynamics, Monte Carlo, conformational searching, and free-energy perturbation techniques. More recently, there have been many new ideas including parallel tempering, replica exchange, quantum annealing, potential smoothing techniques, and so forth. Tests of the newer methods on realistic problems (as opposed to the model problems where they typically are validated initially) have not yet been extensive, so we do not really know which approaches will prove optimal. It does seem likely, however, that significant progress is going to be made beyond that provided by faster processors.

Sampling is a key bottleneck at present in obtaining accurate results in molecular modeling simulations. Obtaining convergence for a complex condensed-phase system is extremely challenging. This is, in my opinion, the area where prospects are most uncertain and where it is critical to support a lot of new ideas as opposed to just improved engineering of existing approaches. Some advances will come about from faster hardware, but algorithmic improvement should contribute even more if sufficient effort is applied. Until we can converge the sampling, it is going to be very difficult to improve the potential functions and solvation models reliably using experimental input, because there are always questions about whether the errors are due to incomplete sampling as opposed to the model itself.

Structure-based Drug Design

The key problems in structure-based drug design can be enumerated as follows:

1. Generation of accurate protein structures, whether in a homology modeling context or simply enumerating low-energy structures given a high-resolution crystal structure.
The use of a single protein conformation to assess ligand binding is highly problematic if one wants to evaluate a large library of diverse ligands so as to locate novel scaffolds capable of supporting binding. This is particularly the case for targets such as kinases where there is considerable mobility of loops and side chains in the active site of the protein. Progress on this problem will come from the ability to rapidly sample possible conformations using an accurate energy functional and continuum solvation model. There is good reason to believe that this can be accomplished in the next three to five years. While perfect accuracy will not be achieved, the generation of structures good enough to dock into and make predictions about binding is a realistic possibility. Experimental data can be used to refine the structures once good initial guesses can be made.

2. High-throughput docking of ligands into the receptor for lead discovery. The first objective is to correctly predict the binding mode of the ligand in the receptor. Considerable progress has been made in this direction over the past decade, and continued improvement of the existing technology, as well as use of flexible protein models when required, should yield successful protocols in the next three to five years. The second objective is scoring of the candidate ligands; at this stage, one simply wants to discriminate active from inactive compounds. To do this rapidly, an empirical component is required in the scoring. This in turn necessitates the availability of large, reliable datasets of binding affinities. At present the quality and quantity of the publicly available data is problematic; a much greater quantity of data resides in a proprietary form in pharmaceutical and biotechnology companies, but this is inaccessible to most of those working on the development of scoring functions. This problem must be addressed if more robust and accurate scoring functions are to be produced. Given the availability of sufficient data, it seems likely that excellent progress can be made over the next several years.

3. Accurate calculation of binding free energies for lead optimization. From a technical point of view, this is the most difficult of the problems that have been posed. At present we do not really know whether the dominant cause of errors in rigorous binding affinity computation should be attributed to the potential functions, solvation model, or sampling; perhaps all three are contributors at a significant level. All one can do is attempt to improve all three components and carry out tests to see whether better agreement with experiment is obtained.
My own intuition is that on a 5- to 10-year time scale, there will be some real successes in this type of calculation, but achieving reliability over a wide range of chemistries and receptors is going to be a great challenge.

Overall, there is a good probability that in the next 5-10 years, computational methods will make increasingly important contributions to structure-based drug design and achieve some demonstrable successes.

THE CURRENT STATE OF RESEARCH IN INFORMATION AND COMMUNICATIONS
James R. Heath
University of California, Los Angeles

This paper focuses on an area of nanoelectronics to which chemists have been making substantial contributions in the past five to ten years. It discusses what the chemical challenges are and what the state of the art is in the field of nano-IT.

We have recently prepared a series of aligned semiconductor (silicon) wires, 5 nm in diameter. These are at a 16 nm pitch and 5 mm long; the aspect ratio is 10⁷. This means that if a crossbar junction of the silicon were p-doped at a reasonable level, there would be only a 1% chance of the dopant atom being at the wire crossing. Consequently, if the electronics of doped silicon could be brought down to these dimensions, the statistical fluctuations in the density simply mean that classical electronics analysis would fail: one cannot use ohmic assumptions, because one does not know the doping level locally.

From a classical computing viewpoint, there are major problems in working at this scale. Each device no longer acts like every other device, leakage currents can dominate in the nanodimension area, and they can cause tremendous parasitic power losses due to very thin gate widths. One can use molecules to augment the capabilities of the cross wires. It is not difficult to make small structures down to 60 nm, but it is difficult to bring them close together, so density becomes the most challenging synthetic capability.

Work in molecular electronic construction of logic and memory devices has advanced substantially: a year ago the Hewlett Packard (HP) group prepared a 64-bit electronic-based random-access memory. Our group at the University of California, Los Angeles, and the California Institute of Technology has also made a random-access memory, but it is substantially smaller (and fits within a single bit of the work reported by HP). Crossbar structures can still be used for logic.
Three-terminal devices can be developed at a single-molecule level, and we have done so. I believe that these structures will be the way in which chemists can extrapolate structure-property relationships that go back to synthesis and simple reduced-dimensionality circuits.

There are major problems involved in working at this scale: the traditional concepts of Marcus theory may fail because there is no solvent to polarize. Traditional analytical methods such as NMR, mass spectrometry, and optical techniques are very difficult to use on this length scale. The direct observables are the conductance in two-terminal devices and the gated conductance in three-terminal devices. Using these, we have been able to analyze hysteresis loops and cycling data to indicate that the molecules are stable. Very recent work in our group has shown that one can prepare three-molecule gates and analyze the behavior of these molecular quantum dots that also have the internal structures.

Other concepts become complicated. Is it clear that the molecules behave ohmically in such junctions? How do the energies align? How does gating work within this molecular structure? There are problems with organics, which have mobilities that are many orders of magnitude lower than that of single-crystal silicon. Consequently, there will be a major challenge in making molecules behave like silicon.

The molecular junctions that we have prepared have several advantages. First, they are almost crystal-like, and therefore it seems that they could be chemically assembled. Second, they are quite tolerant of defective components and are therefore appropriate for the world of chemistry, where reactions never go 100 percent. Both the HP structure and the structures that we have prepared are extremely dense: ours are roughly 10^12 bits per square centimeter. The volume density of information storage in the brain is roughly 10^12 per cubic centimeter, so the density of these molecular structures is extremely high.

Finally, it is important to think about architectures. Molecular devices have now been demonstrated, and the fabrication of several molecular devices has been clarified over the past five years. However, architecture is more complicated, and it requires the fabrication, alignment, interaction, and behavior of many devices. Key questions are now being addressed by scientists in places such as Carnegie Mellon, Stanford, HP, and Caltech: How big should the memory be? How big should the wires be? How much should be devoted to routing? And how much gain needs to be put in? Some of the crossbar structures that I showed had 2200 junctions; all of them were made, and there were no broken components. So at the level of 10^4 bits, it seems possible to do high-fidelity fabrication. This is the start of a true molecular electronics.
MULTISCALE MODELING
Dimitrios Maroudas
University of Massachusetts

In the next decade, multiscale modeling will be a very important area of chemical science and technology in terms of both needs and opportunities. The field has emerged during the past 10 years in response to the need for an integrated computational approach toward predictive modeling of systems that are both complex and complicated in their theoretical description. The applications to date have involved complexity in physical, chemical, or biological phenomena or have addressed problems of material structure and function. The intrinsic characteristics of all such systems are the multiplicity of length scales due to, e.g., multiple structural features; the multiplicity of time scales due to multiple kinetic phenomena that govern processing or function; the strong nonlinear response as these systems operate away from equilibrium; and the large number or broad range of operating parameters that are involved in practical engineering applications.

From the viewpoint of theory and computation, the major challenge in this area will be to establish rigorous links between widely different theoretical formalisms (quantum mechanics, classical statistical mechanics, continuum mechanics, and so on) that span a very broad range of space and time scales and are used to explore broad regions of parameter space. The practical goal will be to derive, as rigorously as possible, relationships between processes, structure, function, and reliability, and to use them to develop optimal engineering strategies.

The core capabilities of multiscale modeling consist of computational methods that have been developed over many decades and are now used to compute properties and model phenomena. Figure 1 illustrates some of these methods in a schematic diagram of length versus time: Computational Quantum Mechanics for accurate calculation of properties and investigation of small and fast phenomena, Statistical Mechanics for semi-empirical modeling for mechanistic understanding and, at the much larger scale, Continuum Mechanics for macroscopic modeling. Between these last two is the Mesoscopic-Microstructural Scale, which has been an important motivator for the development of multiscale modeling techniques in the area of materials science. Ultimately, what one would like to do, from an engineering viewpoint at least, is use all these methodologies to explore vast regions of parameter space, identify critical phenomena, promote critical phenomena that improve the behavior of a system, and avoid critical phenomena that lead to failure.
[Figure 1 is a schematic plot of length scale versus time scale. From the largest scales to the smallest, it shows:
Supply Chain Modeling: planning and scheduling, mixed integer linear programming (global logistics)
Process Simulation: equation-based models, control and optimization (processing units/facilities)
Continuum Mechanics: finite-element and finite-difference methods, boundary-integral methods (macroscopic modeling)
Mesoscopic Scale: coarse-grained quantum and statistical mechanics, mixed/coupled atomistic-continuum methods (mesoscale modeling)
Statistical Mechanics: semi-empirical Hamiltonians; molecular statics, lattice dynamics, molecular dynamics, Monte Carlo (modeling for mechanistic understanding)
Quantum Mechanics: ab initio electronic structure, density functional theory, first-principles molecular dynamics (accurate calculations of materials properties)]
FIGURE 1 Modeling elements and core capabilities.

Strict requirements must be imposed if multiscale modeling is to become a powerful predictive tool. In particular, we need to deal with issues of accuracy and transferability through connection to first principles, because phenomenological models are not transferable between widely different environments. The question is unavoidable: Is system-level analysis, starting from first principles, feasible for complex systems? In certain cases, the realistic answer would be an immediate "no." In general, the optimistic answer is "not yet," but the stage has been set for tremendous progress in this direction over the next decade. In addition to developing novel, robust multiscale computational methods, fundamental mechanistic understanding will be invaluable in order to enable computationally efficient schemes and to steer parametric studies and design of experiments.

I suggest classifying the approaches for multiscale modeling into two categories:

1. Serial Strategies: Different-scale techniques are implemented sequentially in different computational domains at different levels of discretization.
2. Parallel Strategies: Different-scale techniques are implemented simultaneously in the same computational domain that is decomposed appropriately.

A number of significant trends are noteworthy:

• A variety of multi-space-scale techniques are emerging from efforts on different applications where phenomena at the mesoscopic scale are important.
• Efforts to push the limits of the core capabilities continue, as illustrated by molecular-dynamics (MD) simulations with hundreds of millions of atoms.
• Kinetic Monte Carlo (KMC) methods are used increasingly to extend the time scales accessible to atomistic simulations. These, in turn, are driving the creation of methodology for properly treating structural complexity in KMC schemes and for assessing the completeness and accuracy of the required database, i.e., determining whether all of the important kinetic phenomena are included and accurately calculating transition probabilities.
• Methods are being developed for accelerating the dynamics of rare events, either by taking advantage of physical insights about the nature of transition paths or by using numerical methods to perform atomistic simulations for short periods and then project forward over large time steps.
• Multiscale models are making possible both the integration of insight between scales to improve overall understanding and the integration of simulations with experimental data. For example, in the case of plastic deformation of metals, one can incorporate constitutive theory for plastic displacements into macroscopic evolution equations, where parameterization of the constitutive equations is derived from analysis of MD simulations.
• Methods are being explored for enabling microscopic simulators to perform system-level analysis (mainly numerical bifurcation and stability analysis) to predict and characterize system instabilities and effectively compute the stability domains of the system. The central question is, can we predict the onset of critical phenomena that lead to phase, structural, or flow transitions? Such transitions may lead to function improvement (e.g., disorder-to-order transitions) or to failure.

In conclusion, over the past decade, various multiscale methods and models have been developed to couple widely different length scales, accelerate rare-event dynamics, and explore the parametric behavior of complex systems. These multiscale methods have been applied successfully to various problems in physical chemistry, chemical engineering, and materials science. Over the next decade, these and other new methods will enable truly predictive analyses of complex chemical, material, and biological systems. These will provide powerful tools for fundamental understanding, as well as technological innovation and engineering applications. The development of multiscale methods will generate tremendous research opportunities for the chemical sciences, and the integration of multiscale methods with advances in software and high-performance computing will be strategically important. The new opportunities also will present an educational challenge: to enable students, researchers, and practitioners to understand deeply what is going on throughout the physical scales and parameter space so they can develop intuition and understanding of how best to carry out simulations of complex systems.

THE COMING OF AGE OF COMPUTATIONAL SCIENCES
Linda R. Petzold
University of California, Santa Barbara

The workshop program overstated the title of my contribution. Rather than "The Coming Age of Computational Sciences," I focus on this discipline's coming of age.
By this, I mean that computational science is like a teenager who has a large potential for rapid growth and a great future but is in the midst of very rapid change. It is like a love affair with the sciences and engineering that is really going to be great. I explain how I see computational science: what it is, where it is going, what is driving the fundamental changes that I believe are taking place, and finally, how all this relates to you and to the chemical sciences.

I start with some controversial views. Many of us see computation as the third mode of science, together with theory and experiment, that is just coming of age. What could be driving that? Other presentations at this workshop have noted the tremendous growth of the capabilities of both hardware and algorithms. Then I discuss what I see as the emerging discipline of computational science and engineering, or maybe it's the antidiscipline, depending on how one looks at it. I hope that discipline and antidiscipline don't interact in the same way as matter and antimatter! Next I turn to the revolution in engineering and the sciences at the micro scale, what this means for computational science, the implications for the kinds of tools we are going to need, and what is already happening in response to the rapidly increasing application of engineering to biology. Finally, I address some current trends and future directions, one of which is going beyond simulation. In many cases, simulation has been successful, but now people are getting more ambitious. As algorithms and software get better and the computer gets faster, we want to do more, and this will be a big theme in computational science. What will be the role of computer science in the future of computational science?

Figure 1 illustrates computational speedup over the last 30 years. The lower plot, which shows speedup (relative to computer power in 1970) derived from supercomputer hardware, is just another restatement of Moore's Law.1 The three last data points are ASCI Red and ASCI White (the big DOE machines) and the famous Earth Simulator System (ESS). The ESS is right on schedule for the computer power that we would expect based on the trends of the last 30 years. The architecture is vector parallel, so it's a combination of the kind of vectorization used 20 years ago on Cray supercomputers with the parallelization that has become common on distributed machines over the last decade. It's no surprise that this architecture leads to a fast computer. Probably the main reason that it's newsworthy is that it was built in Japan rather than the United States. Politicians may be concerned about that, but as scientists I think we should just stay our course and do what we feel is right for science and engineering.

[Figure 1 consists of two plots of speedup, on a logarithmic scale, versus year from 1970 to 2000: an upper plot for computational methods and a lower plot for supercomputer hardware.]
FIGURE 1 Speedup resulting from software and hardware developments. (Updated from charts in Grand Challenges: High Performance Computing and Communications, Executive Office of the President, Office of Science and Technology Policy Committee on Physical, Mathematical and Engineering Sciences, 1992; SIAM Working Group on CSE Education, SIAM Rev., 2001, 43:1, 163-177).

The upper graph in Figure 1 is my favorite because it is directly related to my own work on algorithms. It shows performance enhancement derived from computational methods, and it is based on algorithms for solving a special problem at the heart of scientific computation: the solution of linear systems. The dates may seem strange (for example, Gaussian elimination came earlier than 1970), but they illustrate when these algorithms were introduced into production-level codes at the DOE labs. Once again, the trend follows Moore's Law.

My view on what these numbers mean to us, and why computational science and engineering is coming of age, relates to the massive increases in both computer and algorithm power. In many areas of science and engineering, the boundary has been crossed where simulation, or simulation in combination with experiment, is more effective in some combination of time, cost, and accuracy than experiment alone for real needs. In addition, simulation is now a key technology in industry. At a recent conference, I was astonished to see the number of companies using computer applications to address needs in chemical processing. There is also a famous example in my field, the design of the Boeing 777, an incredibly complex piece of machinery in which simulations played a major role.

The emerging discipline of computational science and engineering is closely related to applied mathematics (Figure 2). There were early arguments that it shouldn't be called mathematics but applied mathematics (illustrating the disciplinary sensitivities of the fields), but computational science and engineering also overlaps strongly with science, engineering, and computer science. It lies in the twilight zone between disciplines. This is a great area of opportunity, because there is a lot of room there for growth!

1 Moore originally stated that "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year," Moore, G. E., Electronics 1965, 38 (8) 114-17. This has been restated as "Moore's Law, the doubling of transistors every couple of years" (http://www.intel.com/research/silicon/mooreslaw.htm).
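To see what a doubling rule implies over the 30-year span discussed in the talk, the arithmetic can be written out as below. This is only a back-of-the-envelope sketch: the two-year period follows the "every couple of years" restatement in the footnote, while the 18-month period is a commonly quoted variant that does not appear in the source and is included as an assumption for comparison. The function name is hypothetical.

```python
def speedup(years, doubling_time_years):
    """Growth factor after `years` if capability doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)


# 2.0 years: "every couple of years" (footnote); 1.5 years: assumed variant.
for label, period in (("18 months", 1.5), ("2 years", 2.0)):
    print(f"doubling every {label}: {speedup(30, period):.3g}x over 30 years")
```

Fifteen doublings give a factor of about 3 x 10^4 and twenty doublings give about 10^6, so modest differences in the assumed doubling time change the 30-year figure by more than an order of magnitude.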

[Figure 2 is a diagram relating three areas labeled Applied Mathematics, Computer Science, and Science and Engineering.]
FIGURE 2 Computational Science and Engineering (CSE) focuses on the integration of knowledge and methodologies from computer science, applied mathematics, and engineering and science.

In an academic setting, it's easy to become mesmerized by disciplinary boundaries. We identify with those in our own discipline, but here I'm just looking from the outside. Some have suggested that the structure is like a religion, so your department would be like a branch of the church of science. Consequently, it feels a little bit heretical to suggest that there is a lot of value in integrating the disciplines. Nevertheless, I believe there is a lot of science and technology to be done somewhere in that murky area between the disciplines.

The major development that is driving change, at least in my world, is the revolution at the micro scale. Many people are working in this area now, and many think it will be as big as the computer revolution was. Of particular importance, the behavior of fluid flow near the walls and the boundaries becomes critical in such small devices, many of which are built for biological applications. We have large molecules moving through small spaces, which amounts to moving discrete molecules through devices. The models will often be discrete or stochastic, rather than continuous and deterministic: a fundamental change in the kind of mathematics and the kind of software that must be developed to handle these problems.

For the engineering side, interaction with the macro scale world is always going to be important, and it will drive the multiscale issues. Important phenomena occurring at the micro scale determine the behavior of devices, but at the same time, we have to find a way to interact with those devices in a meaningful way. All this must happen in a simulation.

Many multiscale methods have been developed across different disciplines. Consequently, much needs to be done in the fundamental theory of multiscale numerical methods that applies across these disciplines. One method is famous in structural materials problems: the quasi-continuum method of Tadmor, Ortiz, and Phillips.2 It links the atomistic and continuum models through the finite element method by doing a separate atomistic structural relaxation calculation on each cell of the finite element mesh, rather than using empirical constitutive information. Thus, it directly and dynamically incorporates atomistic-scale information into the deterministic-scale finite element method. It has been used mainly to predict observed mechanical properties of materials on the basis of their constituent defects.

Another approach is the hybrid finite element-molecular dynamics-quantum mechanics method, attributed to Abraham, Broughton, Bernstein, and Kaxiras.3 This method is attractive because it is massively parallel, but it's designed for systems that involve a central defective region surrounded by a region that is only slightly perturbed from equilibrium. Therefore it has limitations in the systems that it can address. A related hybrid approach was developed by Nakano, Kalia, and Vashishta.4

In the totally different area of fluid flow, people have been thinking about these same things. There has been adaptive mesh and algorithm refinement, which in the continuum world has been very successful. It's a highly adaptive way of refining the mesh that has been useful in fluid dynamics.
These researchers have embedded a particle method within a continuum method at the finest level (using an adaptive method to define the parts of the flow that must be refined at the smaller scale) and applied this to compressible fluid flow.

Finally, one can even go beyond simulation. For example, Ioannis Kevrekidis5 has developed an approach to computing stability and bifurcation analysis using time steppers, in which the necessary functions are obtained directly from atomistic-scale simulations as the overall calculation proceeds. This is difficult because it must take into account the boundary conditions on the small-scale simulations. There are so many of these multiscale problems and algorithms that the new Multiscale Modeling and Simulation Journal is devoted entirely to multiscale issues.

2 Shenoy, V. B.; Miller, R.; Tadmor, E. B.; Phillips, R.; Ortiz, M. Physical Review Letters 1998, 80, 742-745; Miller, R.; Tadmor, E. B. Journal of Computer-Aided Materials Design 2003, in press.
3 Abraham, F. F.; Broughton, J. Q.; Bernstein, N.; Kaxiras, E. Computers in Physics 1998, 12, 538-546.
4 Nakano, A.; Kalia, R. K.; Vashishta, P. VLSI Design 1998, 8, 123; Nakano, A.; Kalia, R. K.; Vashishta, P.; Campbell, T. J.; Ogata, S.; Shimojo, F.; Saini, S. Proceedings of Supercomputing 2001 (http://www.sc2001.org).
5 Kevrekidis, Y. G.; Theodoropoulos, K.; Qian, Y.-H. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 9840.

Another big development is taking place where engineering meets biology (Box 1). These two different worlds will vastly augment each other, but what does it mean for computation? A huge number of multiscale problems exist, along with problems that involve understanding and controlling highly nonlinear network behavior. At a recent meeting in the area of systems biology, someone suggested that it would require 140 pages to draw a diagram for the network behavior of E. coli.

A major issue is uncertainty. The rate constants are unknown, and frequently the network structure itself is not well understood. We will need to learn how to deal with uncertainty in the network structure and make predictions about what it means. It would be useful to be able to tell experimentalists that we think something is missing in this structure.

We have large amounts of both uncertain and heterogeneous data. Biologists are obtaining data any way they can and accumulating the data in different forms: discrete data, continuous data, old data, and new data. We must find a way to integrate all these data and use them as input for the computations.

A key challenge in systems biology is identification of the control behavior. Biology is an engineered system; it just was engineered by nature and not by us. However, we would like to understand it, and one way to do that is to treat it as an engineered system. It is a system that has many feedback loops, or else it wouldn't be nearly as stable as it is. We need to identify the feedback behavior, which is difficult using only raw simulations and raw data. Again, there are many variables arising from different models, leading to something that control engineers call a hybrid system. Such a system may have continuous variables, Boolean variables, and Bayesian variables. Until now, computational scientists haven't dealt with such systems, but we must learn to do so quickly.

Experimental design offers a unique opportunity for computation to contribute in a huge way. Often it is not possible, or nobody has conceived of a way, to do experiments in biology that can isolate the variables, because it's difficult to do in vivo experiments, so the data come out of murkier experiments. It would be of great value to the experimental world if we could design computations that would allow us to say, "Here is the kind of experiment that could yield the most information about the system."

An important multiscale problem arises from the chemical kinetics within biological systems, specifically for intracellular biochemical networks. An example is the heat shock response in E. coli, for which an estimated 20 to 30 sigma-32 molecules per cell play a key role in sensing the folding state of the cell and in regulating the production of heat shock proteins. The method of choice is the stochastic simulation algorithm, in which molecules meet and react on a probabilistic basis, but the problem becomes too large because one also must consider the interactions of many other molecules in the system. It is a challenge to carry out such simulations in anything close to a reasonable time, even on the largest computers.

Two important challenges exist for multiscale systems. The first is multiple time scales, a problem that is familiar in chemical engineering, where it is called stiffness, and we have good solutions to it. In the stochastic world there doesn't seem to be much knowledge of this phenomenon, but I believe that we recently have found a solution to this problem. The second challenge, one that is even more difficult, arises when an exceedingly large number of molecules must be accounted for in stochastic simulation. I think the solution will be multiscale simulation.
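To make the stochastic simulation algorithm concrete, here is a minimal sketch of its direct-method form (commonly attributed to Gillespie) for a toy birth-death network. The network, the rate constants, and the function name are assumptions chosen for illustration and are not taken from the talk; the production rate is picked only so that the steady-state copy number falls near the 20 to 30 molecules mentioned for sigma-32. It is not a model of the heat shock response.

```python
import math
import random


def gillespie_birth_death(k_make, k_decay, x0, t_end, rng):
    """Direct-method SSA for 0 -> X (propensity k_make) and X -> 0 (propensity k_decay * x)."""
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_make = k_make
        a_decay = k_decay * x
        a_total = a_make + a_decay
        if a_total == 0.0:
            break  # nothing can react any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += -math.log(1.0 - rng.random()) / a_total
        if t > t_end:
            break
        # Choose which reaction fires, weighted by its propensity.
        if rng.random() * a_total < a_make:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


rng = random.Random(0)  # fixed seed so the run is repeatable
times, counts = gillespie_birth_death(k_make=25.0, k_decay=1.0, x0=0, t_end=50.0, rng=rng)
print(f"{len(counts) - 1} reaction events, final copy number {counts[-1]}")
```

Every reaction event costs one pass through the loop, so the work grows with the total propensity. That is the scaling problem described above: fast reactions or large populations make the event count explode, which motivates treating some reactions deterministically.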
We will need to treat some reactions at a deterministic scale, maybe even with differential equations, and treat other reactions by a discrete stochastic method. This is not an easy task in a simulation.

I believe that some trends are developing (Box 2), and in many fields, we are moving beyond simulation. As a developer of numerical software, in the past 10 years I've seen changes in the nature of computational problems in engineering. Investigators are asking questions about such things as sensitivity analysis, optimal control, and design optimization. Algorithms and computer power have reached a stage where they could provide success, but we are ambitious and want to do more. Why do people carry out simulations? Because they want to design something, or they want to understand it.

I think the first step in this is sensitivity analysis, for which there has been a lot of work in the past decade. The forward method of sensitivity analysis for differential equations has been thoroughly investigated, and software is available. The adjoint method is a much more powerful method of sensitivity analysis for some problems, but it is more difficult to implement, although good progress has been made. The work has since moved to PDEs, where progress is being made for boundary conditions and adaptive meshes.
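As a small illustration of the forward method of sensitivity analysis mentioned above, the sketch below augments a one-variable decay equation y' = -k y with the equation for its sensitivity s = dy/dk, which follows from differentiating the right-hand side: s' = -k s - y, with s(0) = 0. The test problem, the fixed-step Runge-Kutta integrator, and all names are assumptions made for illustration; production sensitivity software uses adaptive, stiff-capable integrators.

```python
import math


def rhs(state, k):
    """Right-hand side of the augmented system (y, s) with s = dy/dk."""
    y, s = state
    # y' = -k*y ;  s' = (df/dy)*s + df/dk = -k*s - y
    return (-k * y, -k * s - y)


def rk4(state, k, dt, steps):
    """Classical fourth-order Runge-Kutta with a fixed step size."""
    for _ in range(steps):
        k1 = rhs(state, k)
        k2 = rhs(tuple(v + 0.5 * dt * d for v, d in zip(state, k1)), k)
        k3 = rhs(tuple(v + 0.5 * dt * d for v, d in zip(state, k2)), k)
        k4 = rhs(tuple(v + dt * d for v, d in zip(state, k3)), k)
        state = tuple(
            v + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for v, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state


# Integrate to t = 1 with y(0) = 1, s(0) = 0, k = 2.
y_end, s_end = rk4((1.0, 0.0), k=2.0, dt=0.001, steps=1000)
print(f"y(1) = {y_end:.6f}, dy/dk at t=1 = {s_end:.6f}")
```

For this problem the exact answers are y(1) = exp(-2) and dy/dk = -exp(-2), so the result can be checked by hand. The forward method adds one extra equation per parameter, which is why the adjoint method becomes attractive when there are many parameters and few outputs.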

Uncertainty analysis was mentioned many times at the workshop, and still to come are multiscale systems, fully stochastic systems, and so on. And then we will move toward design optimization and optimal control. There is a need for more and better software tools in this area. Even further in the future will be computational design of experiments, with the challenge of learning the extent to which one can learn something from incomplete information. Where should the experiment be done? Where does the most predictive power exist in experiment space and physical space? Right now, these questions are commonly answered by intuition, and some experimentalists are tremendously good at it, but computations will be able to help. Some examples are shown in Box 3.

I think computer science will play a much larger role in computational science and engineering in the future. First, there are some pragmatic reasons. All of the sciences and engineering can obtain much more significant help from software tools and technology than they have been receiving. A really nice example is source code generation: codes that write the codes. A very successful application is that of automatic differentiation codes developed at Argonne National Laboratory. These codes take input functions written in FORTRAN or C, while you specify where you have put your parameters and instruct them to differentiate the function with respect to those parameters. In the past, this has been a big headache because we had to do finite differences of the function with respect to perturbations of the parameter. In principle this is a trivial thing (first-semester calculus), but it's actually very difficult to decide on the increment for the finite difference. There is a very tenuous trade-off between round-off and truncation error, for which there sometimes is no solution, particularly for chemistry problems, which tend to be badly scaled. The new approach writes the derivative function in FORTRAN or C, and the result is usually perfect, eliminating all of the finite differencing errors. Moreover, it doesn't suffer from the explosion of terms that was encountered with symbolic methods. The methodology uses compiler technology, and you might be surprised at what the compiler knows about your program. It's almost a Big Brother sort of thing. This has enabled us to write much more robust software, and it has greatly reduced the time needed for debugging.

Another thing we can do is use source-code generation to fix the foolish mistakes that we tend to make. For example, when I consult with companies, I often hear the problem of simulation software that has worked for a long time but suddenly failed when something new was introduced, and the failure appears to be random. I always ask the obvious question, "Did you put an 'if' statement in your function?" And they always say, "No, I know about that potential problem: I wouldn't be stupid enough to do that." But after many hours of looking at multiple levels of code (often written by several generations of employees who have all gone on to something else), we find that "if" statement.

In any case, what Paul Barton has done with source code generation is to go in and just automatically fix these things in the codes, so that we don't have to go in and do it. And sometimes it's a really hard thing to go into some of these codes, dusty decks, or whatever, written by people who have departed long ago, so this is really making things easier.

Another big issue is that of user interfaces for scientific software. User interfaces generate many complaints, and with good reason! I am embarrassed to admit that my own codes and the scientific engineering software with which I come in contact regularly are in the dinosaur age. If our interfaces were better (up to the standards of Microsoft Office, for example), we would have an easier time attracting students into scientific and engineering fields. It certainly can't excite them to deal with those ugly pieces of 30-year-old codes. On the other hand, we aren't going to rewrite those codes, because that would be tremendously tedious and expensive.

There have been some good developments in user interface technology for scientific computing, and some exceptions to the sad state of most of our software interfaces. In fact, I think that the first one is obvious: MATLAB.6 It is an excellent example of the value of a beautiful interface, but it has been directed primarily at relatively small problems.

Computer science technology can be used to enable the semiautomatic generation of graphical user interfaces. This is also being done at Sandia National Laboratories and at Lawrence Livermore National Laboratory with the MAUI code. There are ugly codes in chemistry. We all know it. And there are many dusty decks out there in industry and national laboratories. We'd like to just keep those codes but use compiler technology to bring up that kind of interface for the user. A collaboration with those users will be necessary, which is why the method is semiautomatic; but it can become a reality without every user needing to learn the latest user-interface technology.

Finally, my major point is that computer science will play a much larger role.
There is a deep reason for that. It isn't just machines, compilers, and source-code generation that will help you with your research. Those will be nice and undoubtedly will be useful, but they are not the main issue.

At the smaller scales (and this is the fundamental change) we are dealing with and manipulating large amounts of discrete, stochastic, Bayesian, and Boolean information. The key word is information. In the past, we manipulated continuum descriptions, which are averages. Those were nice, but now we must manipulate discrete, heterogeneous data. Who has been thinking about that for the last 20 or 30 years? These problems form the core of computer science.

Living simultaneously in both a computer science department and a mechanical engineering department, I know that the communications between engineering and computer science have not always been as good as they could be. But we are all researchers, and the needs of science will make the communication happen.

6 Originally developed as a "matrix laboratory" to provide access to matrix software, MATLAB integrates mathematical computing, visualization, and a programming language for scientific and technical computing: http://www.mathworks.com/
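The round-off versus truncation trade-off described earlier in this talk for finite differences can be seen in a few lines. The sketch below compares forward differences at several step sizes with forward-mode automatic differentiation implemented by operator overloading on dual numbers. Note the swap: the Argonne tools described in the talk transform FORTRAN or C source, whereas this is a Python toy that only illustrates the underlying idea. The test function, the `Dual` class, and all other names are hypothetical.

```python
import math


class Dual:
    """Number value + deriv*eps with eps**2 == 0; deriv carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the derivative part.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def dual_exp(x):
    """exp for dual numbers: d/dx exp(u) = exp(u) * u'."""
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def f(p, exp=math.exp):
    """Toy model function p * exp(3p); works on floats or Dual numbers."""
    return p * exp(3.0 * p)


def forward_difference(func, p, h):
    return (func(p + h) - func(p)) / h


exact = 4.0 * math.exp(3.0)                    # analytic f'(1) = (1 + 3p) exp(3p) at p = 1
ad = f(Dual(1.0, 1.0), exp=dual_exp).deriv     # seed dp/dp = 1
steps = (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
errors = {h: abs(forward_difference(f, 1.0, h) - exact) for h in steps}

for h in steps:
    print(f"h = {h:.0e}: finite-difference error = {errors[h]:.3e}")
print(f"automatic differentiation error = {abs(ad - exact):.3e}")
```

Large steps suffer from truncation error and very small steps from round-off, so the finite-difference error is smallest somewhere in between, and the best step depends on the scaling of the problem. The dual-number derivative has no step size to choose, which is the practical benefit the talk attributes to automatic differentiation.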

146 APPENDIX D

SIMULATION IN MATERIALS SCIENCE

George C. Schatz
Northwestern University

Introduction

The primary goal of this paper is to provide examples of problems in nanomaterials where computational methods are able to play a vital role in the discovery process. Three areas of research are considered, and in each of these areas, an important scientific problem is identified. Then we consider the current computational and theoretical tools that one can bring to bear on the problem and present representative results of applications to show how computation and theory have proven to be important. The three areas are (1) optical properties of gold nanoparticle aggregates, (2) melting of DNA in the gold nanoparticle aggregates, and (3) oxygen atom erosion of polymers in low earth orbit conditions. The three theories and computational methods being highlighted are (1) computational electrodynamics, (2) empirical potential molecular dynamics, and (3) direct dynamics/Car-Parrinello molecular dynamics.

Optical Properties of Gold Nanoparticle Aggregates

Although colloidal gold has been known for thousands of years as the red color of stained glass windows, only very recently has it found a use in the medical diagnostics industry as a sensor for DNA. In this work by Chad Mirkin and colleagues at Northwestern, single-stranded oligonucleotides are chemically attached to colloidal gold particles, typically 10 nm in diameter. If an oligonucleotide (one that comes from DNA that one wishes to detect and is complementary to the single-stranded portions attached to the gold particles) is added to the system, DNA hybridization results in aggregation of the particles, inducing a color change from red to blue that signifies the presence of the complementary oligonucleotide.
Small-angle x-ray scattering studies of these aggregates indicate that although the aggregates are amorphous, the nearest-neighbor spacing between particles is approximately equal to the length of the linker oligonucleotide.

The challenge to theory in this case is to develop an understanding of the optical properties of the aggregates so that one can determine the optimum particle size and optimum linker length, as well as the variation in optical properties that can be achieved using other kinds of nanoparticles. The basic theory needed for this problem is classical electrodynamic theory because this is known to describe the optical properties of metal particles accurately, provided only that we have the dielectric constants of the particles and surrounding medium and that we have at least a rough model for the aggregate structure.

Computational electrodynamics has made great progress during the past decade, with the result that codes are now available for describing the optical properties of assemblies of nanoparticles with upwards of tens of thousands of particles. However, there are many choices for methods and codes, including a variety of finite element (grid-based) methods (discrete dipole approximation, finite difference time domain, multiple multipoles, dyadic Green's function) and also a variety of coupled particle methods, of which the most powerful is the T-matrix method but which in the present case can be simplified to coupled dipoles for most applications. The codes for this application are largely public domain, but important functionality can be missing, and we have found it necessary to make revisions to all the codes that we have used.

Fortunately, the nanoparticle aggregate electrodynamics problem is relatively easy to solve, and many of the codes that we examined give useful results that not only explain the observed color change, but tell us how the results vary with particle size, DNA length, and optical constants. Figure 1 shows a typical comparison of theory and experiment. We learned enough from our studies of this problem that we were able to develop a simple, almost analytical theory based on a dynamic effective medium approximation. In addition, the methods that we considered have been useful for studying more challenging electrodynamics problems that involve ordered assemblies of nonspherical particles in complex dielectric environments.

FIGURE 1 Absorption spectra of DNA-linked gold nanoparticle aggregates for 24, 28 and 72 base-pair DNA linkers, comparing theory (light scattering calculations, right) and experiment (left).
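The coupled-dipole simplification mentioned above starts from the response of a single small sphere, which needs only the two ingredients the text names: the dielectric constants of the particle and of the surrounding medium. The following is a minimal sketch of that single-particle building block in the quasi-static limit, not the codes used in this work. The function names are hypothetical, and the dielectric values in the demonstration are placeholders chosen only to show the resonance condition, not measured optical constants for gold.

```python
import math


def sphere_polarizability(radius_nm, eps_particle, eps_medium):
    """Quasi-static dipole polarizability of a small sphere (Gaussian units, nm^3)."""
    return radius_nm ** 3 * (eps_particle - eps_medium) / (eps_particle + 2.0 * eps_medium)


def absorption_cross_section(radius_nm, wavelength_nm, eps_particle, eps_medium):
    """Dipole-limit absorption cross-section, C_abs = 4*pi*k*Im(alpha), in nm^2."""
    k = 2.0 * math.pi * math.sqrt(eps_medium) / wavelength_nm  # wavenumber in the medium
    alpha = sphere_polarizability(radius_nm, eps_particle, eps_medium)
    return 4.0 * math.pi * k * alpha.imag


if __name__ == "__main__":
    EPS_WATER = 1.77  # assumed medium (refractive index of about 1.33)
    # Placeholder dielectric values, chosen only to illustrate the resonance.
    near_resonance = complex(-2.0 * EPS_WATER, 2.4)  # Re(eps) = -2 * eps_medium
    off_resonance = complex(-10.0, 2.4)
    for label, eps in (("near resonance", near_resonance), ("off resonance", off_resonance)):
        c_abs = absorption_cross_section(5.0, 520.0, eps, EPS_WATER)
        print(f"{label}: C_abs = {c_abs:.1f} nm^2")
```

The absorption peaks where the real part of the particle dielectric constant approaches minus twice that of the medium. In a coupled-dipole treatment of an aggregate, each particle would carry such a polarizability and respond to the incident field plus the fields of its neighbors, which is what shifts the collective resonance.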

Melting of DNA That Links Nanoparticles

The DNA-linked gold nanoparticle aggregates just described have another use for DNA detection that is both very important and distinctly nanoscale in nature. When the aggregates are heated, it is found that they "melt" (i.e., the double helix unravels into single-stranded oligonucleotides) at about 50 °C over a narrow temperature range of about 3 °C. This melting range is to be contrasted with what is observed for the melting of DNA in solution, which typically has a 20 °C range for oligonucleotides of the same length. This is important to DNA testing, as single base pair mismatches, insertions, and deletions can still lead to aggregate formation, and the only way to identify these noncomplementary forms of DNA is to subject the aggregates to a thermal wash that melts out the noncomplementary linkers but not the complementary ones. The resolution of this process is extremely sensitive to the melting width. In view of this, it is important to establish the mechanism for the narrow melting transition.

Recently we have developed a cooperative model of DNA melting for DNA-linked gold nanoparticle aggregates. This model considers the equilibrium between different partially melted aggregates, in which each stage involves the melting of a linker oligonucleotide, but there are no other structural changes (at least initially). In this case, the equilibria can be written

$$
\begin{aligned}
D_N &\rightleftharpoons D_{N-1} + Q + nS \\
&\;\;\vdots \\
D_2 &\rightleftharpoons D_1 + Q + nS \\
D_1 &\rightleftharpoons D_0 + Q + nS
\end{aligned}
$$

where $D_N$ is the aggregate with $N$ linkers, $Q$ is the linker, and $S$ stands for free counterions that are released with each linker. The key to the cooperative mechanism is to realize that each successive step becomes easier due to the loss of counterions from the aggregate. Here, we are guessing that when the DNAs are sufficiently close that their ion clouds overlap, the ion clouds mutually stabilize the hybridized state (due to improved screening of the phosphate interactions). With this assumption, the equilibrium collapses to a single expression:

$$
D_N \rightleftharpoons D_0 + NQ + nNS
$$

in which $N$ targets and their complement of counterions come out cooperatively. Equilibrium concentrations based on this expression are consistent with a variety of measurements on the gold nanoparticle aggregates. Figure 2 shows the sharpening of the melting curve that occurs as the number of DNA linkers per nanoparticle is increased, while Figure 3 shows a fit to experimental data using this model.

FIGURE 2 Cooperative melting mechanism: dependence of melting curves on number of DNA linkers per nanoparticle. [Plot of melted fraction versus temperature, 280 to 380, for the cooperative melting model; legend: T1m = 320, T2m = 350, ΔH1 = ΔH2 = 50 kcal/mol.]

FIGURE 3 Cooperative melting mechanism: fit to experiment. [Plot versus temperature, 50.0 to 60.0, comparing the thermodynamic model (N = 1.6) with experiment. Target: CTCCCTMTMCMTTTATMCTATTCCTA]

While this works, the key assumption of the theory, that the ion clouds overlap and the DNA is stabilized, needs to be proven, and crucial information about the range of the cooperative mechanism needs to be determined. To do this, we must simulate the ion atmosphere around DNA dimers, trimers, and other structures, ideally with gold particles also nearby. This could probably be done with an electrostatic model, but there is concern about the validity of such models for singly charged counterions. To circumvent this, we have considered the use of atomistic molecular dynamics with empirical force fields. Such methods have already been used to determine charge cloud information concerning a single duplex, but the calculations had not been considered for aggregates of duplexes.

Fortunately, the methodology for simulating DNA and its surrounding charge cloud is well established based on the Amber suite of programs. This suite allows for the inclusion of explicit water and ions into the calculation, as well as particle-mesh Ewald sums to take care of long-range forces. Previous work with Amber, explicit ions, and Ewald summation has demonstrated that the ion distribution around duplex DNA is consistent with available observations in terms of the number of ions that are condensed onto the DNA, and the extent of the counterion cloud around the DNA. Given this, it is reasonable to use the same technology to examine the counterion distribution around pairs (and larger aggregates) of DNA duplexes and how this varies with bulk salt concentration.

My group has now performed these simulations, and from this we have established the relationship between bulk salt concentration and the counterion concentration close to the DNA.
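The way cooperativity narrows the melting transition can be illustrated with a deliberately simplified two-state (van't Hoff) picture in which N linkers melt as a single unit, so that the effective enthalpy is N times the single-linker value. This is an assumption made for illustration and is not the concentration-based equilibrium model described above. The function names are hypothetical, and the default enthalpy (50 kcal/mol) and melting temperature (320 K) simply echo the Figure 2 legend.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def melted_fraction(temp_k, n_linkers, delta_h=50.0, t_melt=320.0):
    """Two-state melted fraction when n_linkers melt together (assumed model)."""
    exponent = n_linkers * delta_h / R_KCAL * (1.0 / temp_k - 1.0 / t_melt)
    if exponent > 700.0:  # avoid overflow far below the melting temperature
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


def temperature_at_fraction(fraction, n_linkers, delta_h=50.0, t_melt=320.0):
    """Invert melted_fraction analytically for a given melted fraction."""
    inv_t = 1.0 / t_melt + R_KCAL * math.log((1.0 - fraction) / fraction) / (n_linkers * delta_h)
    return 1.0 / inv_t


def transition_width(n_linkers, delta_h=50.0, t_melt=320.0):
    """Temperature interval over which the melted fraction rises from 0.1 to 0.9."""
    return (temperature_at_fraction(0.9, n_linkers, delta_h, t_melt)
            - temperature_at_fraction(0.1, n_linkers, delta_h, t_melt))


if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(f"N = {n}: 10%-90% width = {transition_width(n):.1f} K")
```

In this sketch the width falls roughly as 1/N, which is the qualitative behavior shown in Figure 2: the more linkers that come out together, the sharper the curve.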
By combining these results with the measured variation of DNA melting temperature with bulk salt concentration, we have determined that the decrement in melting temperature associated with loss of a DNA target from an aggregate is several degrees, which is sufficiently large that cooperative melting can occur. In addition, we find that the range of interaction of the counterion atmospheres is about 4 nm, which establishes the DNA density needed to produce cooperative melting.

Oxygen Atom Erosion of Polymers in Low Earth Orbit Conditions

The exterior surfaces of spacecraft and satellites in low earth orbit (LEO) conditions are exposed to a very harsh environment. The most abundant species is atomic oxygen, with a concentration of roughly 10^9 atoms per cm^3. When one factors in the spacecraft velocity, one finds that there is one collision per second per exposed surface site on the spacecraft, and the oxygen atoms typically have 5 eV of energy, which is enough to break bonds. When this is combined with other LEO erosion mechanisms that involve ions, electrons, UV photons, and other high-energy neutrals, one finds that microns of exposed surface can be eroded per day unless protective countermeasures are imposed. This problem has been known since the beginning of space flight, but the actual mechanism of the erosion process has not been established. In an attempt to rectify this situation, and perhaps to stimulate the development of new classes of erosion-resistant materials, I have

been collaborating with a team of AFOSR-funded faculty (Steven Sibener, Luping Yu, Tim Minton, Dennis Jacobs, Bill Hase, Barbara Garrison, and John Tully) to develop theory to model the erosion of polymers and other materials, and to test this theory in conjunction with laboratory and satellite experiments. Of course, this research will also have spin-offs to other polymer erosion processes important in the transportation and electronics industries.

We are only at the beginning of this project at the moment, so all of our attention thus far has been directed at understanding the initial steps associated with the impact of high-velocity oxygen atoms with hydrocarbon polymer surfaces. Even for this relatively simple class of materials, there has been much confusion as to the initial steps, with the prevailing wisdom being that O(3P), the ground state of oxygen, can only abstract hydrogen atoms from a hydrocarbon, which means that in order to add oxygen to the surface (as a first step to the formation of CO and CO2), it is necessary for intersystem crossing to switch from triplet to singlet states, thereby allowing the oxygen atom to undergo insertion reactions. This wisdom is based on analogy with hydrocarbon combustion, where hydrogen abstraction reactions are the only mechanism observed when oxygen interacts with a saturated hydrocarbon.

In order to explore the issue of oxygen-atom reactions with hydrocarbons in an unbiased way, we decided to use "direct dynamics" (DD), i.e., molecular dynamics simulations in which the forces for the MD calculations are derived from electronic structure calculations that are done on the fly. Although this idea has been around for a long time, it has become practical only in the last few years as a result of advances in computational algorithms for electronic structure calculations and the advent of cheap distributed parallel computers.
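The structure of a direct dynamics calculation can be sketched as an ordinary molecular dynamics integrator whose force routine is called anew at every step. The following is a minimal illustration under stated assumptions, not the software used in this work: the integrator is standard velocity Verlet in one dimension, and `harmonic_forces` is a hypothetical stand-in for the electronic structure calculation that would supply forces in a real direct dynamics run.

```python
def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Propagate coordinates with velocity Verlet, calling force_fn at every step."""
    positions = list(positions)
    velocities = list(velocities)
    forces = force_fn(positions)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        velocities = [v + 0.5 * dt * f / m for v, f, m in zip(velocities, forces, masses)]
        # Full-step position update.
        positions = [x + dt * v for x, v in zip(positions, velocities)]
        # Forces are recomputed "on the fly" at the new geometry.
        forces = force_fn(positions)
        # Second half-step velocity update with the new forces.
        velocities = [v + 0.5 * dt * f / m for v, f, m in zip(velocities, forces, masses)]
    return positions, velocities


def harmonic_forces(positions, k=1.0):
    """Placeholder force routine; a real run would call an electronic structure code here."""
    return [-k * x for x in positions]


if __name__ == "__main__":
    x, v = velocity_verlet([1.0], [0.0], [1.0], harmonic_forces, dt=0.01, n_steps=1000)
    energy = 0.5 * v[0] ** 2 + 0.5 * x[0] ** 2
    print(f"x = {x[0]:.4f}, v = {v[0]:.4f}, total energy = {energy:.6f}")
```

Because the integrator only sees a function that maps positions to forces, the cost of each step is dominated by that call, which is why the approach became practical only once electronic structure algorithms and parallel hardware improved.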
We have studied two approaches to this problem: the Car-Parrinello molecular dynamics (CPMD) method using plane wave basis functions and ultrasoft pseudopotentials, and conventional Born-Oppenheimer molecular dynamics (BOMD) using Gaussian orbitals. Both calculations use density functional theory, which is only capable of 5 kcal/mol accuracy for stationary point energies, but should be adequate for the high energies of interest to this work. Both methods give similar results when applied to the same problem, but CPMD is more useful for atom-periodic surface dynamics, while BOMD is better for cluster models of surfaces or disordered surfaces.

Our calculations indicate that there are important reaction pathways that arise with 5-eV oxygen atoms that have not previously been considered. In particular, we find that oxy radical formation is as important as hydrogen atom abstraction and that C-C bond breakage also plays a role. Both of these pathways allow for triplet oxygen to add to the hydrocarbon directly, without the need for intersystem crossing. Thus, it appears that previously guessed mechanisms need to be revised. Further studies are under way in John Tully's group to incorporate these results into kinetic Monte Carlo simulations that will bridge the gap between atomic-scale reaction simulations and the macro-scale erosion process.
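The exposure figures quoted at the start of this section (about one collision per second per surface site, and about 5 eV per oxygen atom) can be checked with a back-of-the-envelope calculation. Only the number density of roughly 10^9 atoms per cm^3 comes from the text. The orbital speed of 7.8 km/s and the surface site density of 10^15 sites per cm^2 are assumed typical values, and the function names are hypothetical.

```python
ATOMIC_MASS_UNIT_KG = 1.66053907e-27
ELECTRON_VOLT_J = 1.602176634e-19


def flux_per_cm2_s(number_density_cm3, speed_cm_s):
    """Atoms striking a ram-facing surface per cm^2 per second."""
    return number_density_cm3 * speed_cm_s


def hits_per_site_per_s(flux, site_density_cm2):
    """Average collisions per surface site per second."""
    return flux / site_density_cm2


def kinetic_energy_ev(mass_amu, speed_m_s):
    """Kinetic energy of one atom in the spacecraft frame, in eV."""
    return 0.5 * mass_amu * ATOMIC_MASS_UNIT_KG * speed_m_s ** 2 / ELECTRON_VOLT_J


if __name__ == "__main__":
    density = 1.0e9          # atoms per cm^3, from the text
    orbital_speed = 7.8e3    # m/s, assumed typical LEO orbital speed
    site_density = 1.0e15    # sites per cm^2, assumed order of magnitude

    flux = flux_per_cm2_s(density, orbital_speed * 100.0)
    print(f"flux: {flux:.2e} atoms per cm^2 per s")
    print(f"collisions per site per s: {hits_per_site_per_s(flux, site_density):.2f}")
    print(f"O-atom energy: {kinetic_energy_ev(15.999, orbital_speed):.2f} eV")
```

Under these assumptions both quoted figures come out at the stated order of magnitude: the impact energy is set almost entirely by the spacecraft's own velocity rather than by the thermal motion of the gas.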

Conclusion

These examples indicate that an arsenal of simulation approaches is needed for nanoscale materials research, including electrodynamics, empirical potential MD, direct dynamics, and CPMD. Fortunately, many of the necessary tools are available, or can be developed by straightforward modifications of existing software.

These examples also show that simulation is an important tool for many nanoscale materials problems. Although brute force simulation is not usually effective because the time scales are too long and the number of particles is too large, a combination of simulation with analytical theory and simple models can be quite effective.

Acknowledgments

The research described here is supported by the AFOSR (MURI and DURINT programs) and the National Science Foundation (NSEC program). Important contributors to these projects have been my students and postdocs (Anne Lazarides, Guosheng Wu, Hal Long, Diego Troya, Ronald Pascual, Lance Kelly, and LinLin Zhao) as well as Tim Minton (Montana State) and Chad Mirkin (Northwestern).

DISTRIBUTED CYBERINFRASTRUCTURE SUPPORTING THE CHEMICAL SCIENCES AND ENGINEERING

Larry L. Smarr
University of California at San Diego

During the next 10 years, chemical science and engineering will be participating in a broad trend in the United States and across the world: we are moving toward a distributed cyberinfrastructure. The goal will be to provide a collaborative framework for individual investigators who want to work with each other or with industry on larger-scale projects that would be impossible for individual investigators working alone. I have been involved in examining the future of the Internet over the next decade, and in this paper I discuss this future in the context of the issues that were dealt with at the workshop.
Consider some examples of grand challenges that are inherently chemical in nature:

· simulating a micron-sized living cell that has organelles composed of millions of ribosomes, macromolecules with billions of proteins drawn from thousands of different species, and a meter of DNA with several billion bases, all involving vast numbers of chemical pathways that are tightly coupled with feedback loops by which the proteins turn genes on and off in regulatory networks; and

· simulating the star-forming regions in which radiation from newly forming stars causes a complex set of chemical reactions in diffuse media containing carbon grains as catalytic surfaces moving in turbulent flow. According to data from the Hubble Space Telescope, the size of the "reactor" would be on the order of one light year.

Both examples require solving a similar set of equations, although the regime in which they are applied is quite different from traditional chemical engineering applications.

Other areas of science have similar levels of complexity, and I would argue that those areas have progressed farther than the chemical sciences in thinking about the distributed information infrastructure that needs to be built to make the science possible.

· Medical Imaging with MRI: The NIH has funded the Biomedical Informatics Research Network (BIRN), a federated repository of three-dimensional brain images to support the quantitative science of statistical comparison of the brain subsystems. The task includes integration of computing, networks, data, and software as well as training people.

· Geophysics and the Earth Sciences: The NSF is beginning to roll out EarthScope, a shared facility which will ultimately have hundreds of high-resolution seismic devices located throughout the United States. The resulting data will provide three-dimensional images of the top several kilometers of the earth to the academic community.

· Particle Physics: The physics community is developing a global data Grid that will provide tens of thousands of users, in hundreds of universities, in dozens of countries throughout the world with a distributed collaborative environment for access to the huge amount of data generated at CERN in Geneva.
Many additional examples exist across scientific disciplines that have been involved as a community for 5 or 10 years in coming to consensus about the need for ambitious large-scale, data-intensive science systems. You will be hearing more and more about a common information infrastructure that underlies all of these (called distributed cyberinfrastructure) as it rolls out over the next few years. Here I want to impart a sense of what will be in that system. It seems to me that chemical engineering and chemistry share many aspects of data science with these other disciplines, so chemists and chemical engineers should be thinking about the problem.

For 20 years, the development of information infrastructure has been driven by Moore's Law, the doubling of computing speed per unit cost every 18 months.1 But now, storage and network bandwidth are rising even faster than computing speed. This means that everything we thought we knew about integrated infrastructure will be turned completely upside down. It is why the Grid movement described earlier by Thom Dunning2 is now taking off. We are going through a phase transition from a very loosely coupled world to a very tightly coupled world because the bandwidth between computers is growing much faster than the speed of the individual computers.

The rapid growth in data handling is being driven by the same kind of effect that influenced computing over the last 10 years: that is, not only did the individual processors become faster but the amount of parallelism within high-performance computers also increased. Each year or so, as each processor became twice as fast, you also used twice as many processors, thereby increasing the overall speed by a factor of 4; this leads to hyperexponential growth. We have entered a similar era for bandwidth that enables data to move through optical fibers. It is now possible to put wavelength bins inside the fiber, so that instead of just a single gigabit-per-second channel it is possible to have 16, 32, or many more channels (called lambdas), each operating at a slightly different wavelength. Every year both the number and the speed of these channels increase, thereby creating a bandwidth explosion.

As an aside, the capability of optical fiber to support these multiple channels, ultimately enhancing the chemical sciences using the fiber, is itself a victory for chemical engineering. It was the ability, particularly through the contributions of Corning, to remove almost every molecule of water from the glass (water absorbs in the infrared wavelengths used by the fiber to transmit information) that makes this parallel revolution possible with optical fiber.
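The compounding described above, where the number of units and the speed of each unit both double in the same period, can be written out in a few lines. The sketch below applies it to lambdas in a fiber. The starting point of 16 channels at 1 gigabit per second comes from the text, while the assumption that both quantities double every period is illustrative only, and the function names are hypothetical.

```python
def aggregate_rate(units, rate_per_unit):
    """Total throughput of identical parallel units (processors or lambdas)."""
    return units * rate_per_unit


def after_doublings(units, rate_per_unit, periods):
    """Units and per-unit rate after both have doubled once per period (assumed)."""
    return units * 2 ** periods, rate_per_unit * 2 ** periods


if __name__ == "__main__":
    channels, gbps = 16, 1.0  # 16 lambdas at 1 Gb/s each, as in the text
    for period in range(4):
        n, r = after_doublings(channels, gbps, period)
        print(f"period {period}: {n} channels x {r:g} Gb/s = {aggregate_rate(n, r):g} Gb/s")
```

Each period multiplies the aggregate by 4 rather than 2, which is the factor quoted for processors and the reason the same installed fiber can keep delivering more bandwidth.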
Without going into more detail, over the next few years each campus and state will be accumulating "dark fiber" to support such parallel dedicated networks for academic research or governmental functions. Each campus will have to start worrying about where the conduit is, so that you can pull the dark fiber from building to building. Once this owner-operated dark fiber is in place, you will get more bandwidth every year for no money other than for increasing the capability of the electronic boxes at the ends of the fiber. Consequently, you will use the increasing lambda parallelism to get an annual increase in bandwidth out of the same installed fiber. Today, this is occurring on a few campuses such as UCSD, as well as in some states, and there are discussions about building a national "light rail" of dark fiber across the country. When that happens, we will be able to link researchers together with data repositories on scales we haven't imagined.

1Moore, G. E., Electronics 1965, 38 (8) 114-17; http://www.intel.com/research/silicon/mooreslaw.htm.
2See T. Dunning, Appendix D.

Coming back to the science drivers, state-of-the-art chemical supercomputing engineering simulations of phenomena such as turbulence (or experiments with laser read-outs of fluid) are comparable in scale to the highest resolution MRIs or earth seismic images, namely gigazone data objects. Such giant three-dimensional individual data objects (10^3 × 10^3 × 10^3, or larger) often occur in timed series, so one has a whole sequence of them. Furthermore, the data are generated at a remote supercomputer or laboratory and stored there in federated data repositories. There may be some important consequences if you are the investigator:

· The data piece you want is somewhere you aren't, and you would like to interactively analyze and visualize these high-resolution data objects.

· Because nature puts so many different kinds of disciplines together in what it does, you actually need a group of different kinds of scientists to do the analysis, and they are probably spread out in different locations, requiring collaborative networked facilities.

· Individual PCs don't have enough computing power or memory to handle these gigazone objects, requiring scalable Linux PC clusters instead.

· You won't have enough bandwidth over the shared Internet to interact with these objects; instead, you need dedicated lambdas.

You might ask, "What is the twenty-first century Grid infrastructure that is emerging?" I would answer that it is this tightly optically coupled Data Grid of PC clusters for computing and visualization with a distributed storage fabric, tied together by a software middle layer that enables creation of virtual metacomputers and collaboration. In my institute we try to look out to see how this is all happening, by creating "Living Laboratories of the Future." We do this by deploying experimental networked test beds, containing bleeding-edge technological components that will become mass market in 3, 5, or 10 years, and using the system now with faculty and students.
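To make the bandwidth argument concrete, the size of one gigazone object and the time needed to move it can be estimated directly. The 10^3 × 10^3 × 10^3 grid comes from the text. The sketch assumes one 4-byte value per zone and ideal link utilization with no protocol overhead, and the link speeds and function names are illustrative only.

```python
def object_size_bytes(zones_per_side, bytes_per_zone=4):
    """Size of a cubic grid with one value per zone (4 bytes per zone assumed)."""
    return zones_per_side ** 3 * bytes_per_zone


def transfer_time_s(size_bytes, link_gbps):
    """Ideal transfer time over a link of the given speed, ignoring overhead."""
    return size_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    size = object_size_bytes(1000)
    print(f"one gigazone object: {size / 1e9:.0f} GB")
    # Illustrative link speeds: a shared connection, one lambda, sixteen lambdas.
    for label, gbps in (("0.1 Gb/s shared link", 0.1), ("1 Gb/s lambda", 1.0), ("16 x 1 Gb/s lambdas", 16.0)):
        print(f"{label}: {transfer_time_s(size, gbps):.0f} s per object")
```

A time series of such objects multiplies these numbers by the number of snapshots, which is why interactive analysis of remote data of this kind depends on dedicated optical channels rather than the shared Internet.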
That is the way supercomputers developed, from 15 years ago, when a few people ran computational chemistry on a gigahertz machine like the Cray-2 that cost $20 million, to today, when everybody works on a gigahertz PC that costs only $1000.

Let me summarize a few major trends in this emerging distributed cyberinfrastructure:

· Wireless Internet: It took 30 years of exponential growth for the number of fixed Internet users, sitting at their PCs, to grow to two hundred million. We are likely to double that number with mobile Internet users who use cell phones or portable computers in the next three years. Consider that today there are probably a few hundred million automobiles in the United States, each having a few dozen microprocessors and several dozen sensors and actuators. When the processors on these cars get on the net, they will vastly outnumber the people in this country on the net. So, if you thought you had seen an explosion on the Internet, you really haven't seen anything yet.

· Distributed Grid Computing: Wisconsin's pioneering Condor3 project, which creates a distributed computing fabric by tying hundreds of UNIX workstations together, was discussed during the workshop. Today, companies such as Entropia or United Devices are emulating that basic concept with PCs. In this sense, PCs are the dark matter of the Internet: there is an enormous level of computing power that is not being used but could be tied together by the Grid.

· SensorNets: One of the problems with doing large-scale chemistry in the environment is that you don't know the state of the system. I think this may be generally true if you look at large-scale chemical plants. If you have wireless everywhere, you can insert devices to measure the state of the system and obtain a finer-grained set of data. If you are simulating the future of the system, your simulation can be much more useful, and the approach will afford a tighter coupling between those who do simulation and those who gather data.

· Miniaturization of Sensors: One of the big trends that is taking place is the ability to take chemical sensing units (sensors that detect the chemical state of air, water, and so forth) and make them smaller and cheaper so you can put them in more places. I think chemistry and chemical engineering will be a frontier for SensorNets based on nanotechnology and wireless communication. Moreover, there is a revolution going on in the kinds of sensor platforms that can be used, such as autonomous vehicles the size of hummingbirds that fly around, gather, and transmit data.

· Nanoscale Information Objects: Consider the human rhinovirus, which has genetic information coded for the several surface proteins. These proteins form an interlocking coat that serves to protect the nucleic acid, which is a software program that, of course, is interpreted to generate the protein coat that protects the software.
So, in addition to being a nano-object, the virus structure is also an information object. So when you talk about nanotechnology, I want you to remember that it is not just the physics, chemistry, and biology that is small, but there is a new set of nano information objects that are also involved.

For science, all of these trends mean that we are going to be able to know the state of the world in much greater detail than any of us ever thought possible, and probably on a much shorter time scale than most of us would expect as we focus on this report that looks at "the next 10 years."

3See: http://www.cs.wisc.edu/~condor/.

MODELING AND SIMULATION AS A DESIGN TOOL1

Ellen Stechel
Ford Motor Company

The lens from which my comments arise is an industrial perspective, albeit from a limited experience base. The experience base reflected is one of four years in industrial research and development, during which about 60% of my time has been in research and 40% in development. Over those four years, my observations and experiences have caused my thinking to evolve dramatically.

The first section of this paper is focused on the chemical and materials simulation section in Ford Research Laboratory. I describe research that has significant academic character but simultaneously has a path of relevance to automotive applications. The second section focuses on some cross-cutting themes, and the third section describes three specific examples that illustrate those themes and reflect some challenges and opportunities for chemical sciences.

Chemical and Materials Simulation in an Automotive Industrial Research Laboratory

The unifying strategy of the chemical and materials simulation (CAMS) group at Ford Research Laboratory is the use of atomistic and higher length scale chemical and materials simulation methods to develop a knowledge base with the aim of improving materials, interfacial processes, and chemistry relevant to Ford Motor Company. Excellence and fundamental understanding in scientific research are highly valued; but relevance and potential impact are most important. There are three focus areas of research:

1. Surface chemistry and physics for such applications as environmental catalysis, exhaust gas sensors, mechanical wear and fatigue.
2. Materials physics, such as lightweight alloys and solid oxide fuel cells.
3. Molecular chemistry, which includes particulate matter dynamics, atmospheric chemistry, and combustion chemistry.
The tools used by the chemical and materials simulation group are commercial, academic, and homegrown simulation and analysis software, in addition to dedicated compute servers. The development of some of those "homegrown" tools occurred in non-industrial settings, since one of the proven ways to transfer technology is simply to hire a person, who often arrives with tools as well as expertise.

1Copyright reserved by Ford Motor Company. Reproduced by permission.
Acknowledgement: The author is indebted to Drs. John Allison, Peter Beardmore, Alex Bogicevic, William Green, Ken Hass, William Schneider, and Chris Wolverton for many enlightening conversations and for use of their work to illustrate the points in this manuscript.

FIGURE 1 NOx storage: Ab initio calculations of NOx binding to an MgO surface provide a knowledge base.

Catalytic Materials

Exhaust gas catalytic aftertreatment for pollution control depends heavily on chemistry and on materials. Hence, in one CAMS project, the goal is to guide the development of improved catalytic materials for lean-burn, exhaust-NOx (NO and NO2) aftertreatment, for gas sensors, and for fuel cells. One driver for improved materials is to enable cleaner lean-burn and diesel engine technologies. The first-principles, atomistic-scale calculations have been able to elucidate NOx binding to an MgO surface, providing a knowledge base relevant to the fundamentals of a NOx storage device (Figure 1).3 Using ab initio calculations, Ford researchers have been able to map out the energy landscape for NOx decomposition and N2O formation on a model Cu(111) surface, relevant to exhaust NOx catalysis.4

3Schneider, W. F.; Hass, K. C.; Miletic, M.; Gland, J. L. J. Phys. Chem. B 2002, 106, 7405.
4Bogicevic, A.; Hass, K. C. Surf. Sci. 2002, 506, L237.

FIGURE 2 Ab initio calculations on yttria-stabilized zirconia to improve activity and robustness of oxide conductors for fuel cells and sensors.

Ionic Conductors

A second example from the CAMS group focuses on ionic conductors, with the goal of improving the activity and robustness of oxide conductors for fuel cells and sensors.5 The scientific goal of the project is to understand (with the intent of tailoring) ion diffusion in disordered electrolytes. This work has developed a simple computational screen for predicting ionic conductivity of various dopants. The work has also established the only known ordered yttria-stabilized zirconia phase (Figure 2).

5Bogicevic, A.; Wolverton, C. Phys. Rev. B 2003, 67, 024106; Bogicevic, A.; Wolverton, C. Europhys. Lett. 2001, 56, 393; Bogicevic, A.; Wolverton, C.; Crosbie, G.; Stechel, E. Phys. Rev. B 2001, 64, 014106.

Cross-cutting Themes

While the two examples briefly described above have direct relevance to the automotive industry, the scientific work is indistinguishable from academic basic research. Nevertheless, successful industrial research generally reflects and builds on four recurring and broadly applicable themes. The first theme is the false dichotomy that arises when trying to distinguish applied from basic research. The second theme is a definition of multidisciplinary research that is broader than what is conventional. The application of this broader definition leads to the challenge of integration across research fields within the physical sciences as well as across science, engineering, technology, and business. The third theme concerns complex systems and requires establishing an appropriate balance between reductionism (solvable "bite-size" pieces) and holism (putting the pieces back together again). Finally, the fourth theme is one of a four-legged stool, a variation of the often-referenced three-legged stool. A successful application of industrial research necessarily adds a leg to the stool and puts something on the seat of the stool. The three legs are theory and modeling, laboratory experimentation, and computer simulation. The fourth leg is subsystem and full-system testing (verification and validation of laboratory concepts), and what sits on top of the stool is the real-world application in the uncontrolled customer's hands.

False Dichotomy

The structure of science and technology research sometimes is assumed to have a linear progression from basic research (which creates knowledge that is put in a reservoir) to application scientists and engineers (who try to extract knowledge from the reservoir and apply it for a specific need), and finally to developed technology. Few people believe that the science and technology enterprise actually works in this simplistic way. Nevertheless, with this mental model underpinning much of the thinking, unintentional adverse consequences result.
As an alternative mental model, Pasteur's Quadrant6 generalizes the linear model to a two-dimensional model, where one axis is consideration of use and the other axis is the quest for fundamental understanding. The quadrant in which there is both a strong consideration of use and a strong quest for fundamental understanding ("Pasteur's Quadrant") has basic and applied research rolled together. This is the quadrant with the potential for the greatest impact in industrial research. The false dichotomy between basic and applied research arises when the one-dimensional model ignores feedback, where the application drives a need for new fundamental understanding that is not yet in the reservoir. Indeed, in implementing new technology, one is likely to find that what keeps one awake at night is not the known but the unknown: the missing basic research that is applicable to the technology. Industrial researchers quickly find no shortage of challenging, intellectually stimulating, and poorly understood (albeit evolutionary) research problems that are relevant and have the potential for great impact on the industry.

6 Stokes, D. E. Pasteur's Quadrant; Brookings Institution Press: Washington, DC, 1997.

Another way of looking at the two-way relationship between science and application is to recognize that the difference is in the starting and ending points, but not necessarily in the road traveled. If the primary motivation is science (a quest for understanding), then the world in which the researcher lives is one that is reasonably deterministic, predictive, and well controlled. The goal is generally one of proof of concept: a consistent explanation or reproducibility of observations. New and novel knowledge and wisdom, new capabilities, and new verification and validation methodologies are outputs of the science.

However, if the motivation stems from an industrial or application perspective, the researcher begins by considering the use of what he or she is studying. As the researcher digs beneath the surface, as if peeling away layers of an onion, he or she finds that it is necessary to ask new questions; this drives new science, which drives new knowledge, which, in turn, drives new uses and new technology. Additionally, the two researchers (understanding-driven and application-driven) frequently meet in the middle of their journey, with both finding the need to develop new and novel knowledge and wisdom, new capabilities, and new verification and validation methodologies.

It does not matter where the researcher starts the journey: whether he or she comes from an industrial or academic perspective, there is common ground in the middle of the journey. In industry, science is a means to an end, but it is no less science. The world the industrial researcher strives for is a world defined by performance within specifications, robustness, reliability, and cost. However, in trying to achieve those goals, the researcher necessarily asks new questions, which drive new science in the process of deriving verifiable answers to those questions.
Integrated, Complex Systems Perspective

To strike the appropriate balance between reductionist approaches and integrated approaches is a grand challenge. It may be that, as social creatures, people are simply ill-equipped to go beyond reductionist approaches to approaches that integrate science, engineering, technology, economics, business, markets, and policy. Right now, there is a strong tendency to view these disparate disciplines in isolation. Nevertheless, they interrelate, and interrelate strongly, in the real world.

Even when starting with a problem-driven approach, a goal or problem can always be broken down into solvable pieces. The very basis for the success of twentieth-century science has been such reductionism. Nevertheless, to obtain optimal solutions, it is necessary to appropriately integrate the disparate pieces. Within any system, if one only optimizes subsystems, then one necessarily suboptimizes the entire system.

Typically, there will be many ways to assemble pieces. Industry needs a reassembled state that is practical, cost-effective, robust, adaptable, strategic, and so on. Although the constraints are many, it is necessary to consider all of them seriously and intellectually.

Complex Systems

The term "complex systems" means different things to different people. An operational definition is used here: complex systems have nonlinear relationships, large experimental domains, and multiple interactions between many components. For complex systems, the underlying model is generally unknown, and there is no analytical formula for the response surface. It may be that solutions come in the form of patterns rather than numerical values. Most importantly, the whole is not the sum of the parts.

A system can be very complicated and still be ordered, or at least deterministic. One might be able to write down a model and use a very powerful computer to solve the equations. Science can make very good progress in elucidating solutions of such a system. A system can also be random enough that many variables may not be explicitly important, but their effects can be adequately accounted for with statistical science. It is important not to ignore such effects, but often they are describable as noise or random variables.

Somewhere between the ordered state and the fully random state lie many complex systems of interest. Significant progress depends on both the understanding-driven and the application-driven researcher comprehending and becoming comfortable with such complexity.

The Four-Legged Stool and Statistics

Theory, experiment, and simulation form a three-legged stool. The addition of subsystem and full-system testing produces a four-legged stool. The ultimate goal is to have a firm grasp of how a product will perform in the customer's hands, and that is what sits on top of the stool. The four legs must support the development of a firm understanding of mean performance variables, natural variability, and the sources of variability, as well as the failure modes and the ability (or inability) to avoid those failure modes.
One does not need to be able to predict performance under all conditions (something that is often impossible), but one does need the ability to determine whether the anticipated stresses can or cannot lead to a hard failure (something breaks) or a soft failure (performance degrades to an unacceptable level).

Consequently, it is critically important that the four legs do not ignore statistical performance distributions. Statistical science is a powerful tool for comprehending complex systems; however, it does not appear to be a well-appreciated discipline, despite its importance to product development. Also of note are the tails of performance distributions, which become quite significant in a population of millions for high-volume products.
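The point about distribution tails can be made concrete with a small sketch. It assumes, purely for illustration, that a performance variable is normally distributed and that a soft failure occurs when a unit falls more than a given number of standard deviations below the mean. The function names, the normal model, and the population of one million are hypothetical choices, not values from the presentation.

```python
import math


def lower_tail_probability(k_sigma: float) -> float:
    """P(X < mean - k_sigma * std) for a normal distribution (assumed model)."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))


def expected_tail_units(population: int, k_sigma: float) -> float:
    """Expected number of units in the lower tail for a given production volume."""
    return population * lower_tail_probability(k_sigma)


if __name__ == "__main__":
    population = 1_000_000  # hypothetical high-volume product
    for k in (2.0, 3.0, 4.0):
        units = expected_tail_units(population, k)
        print(f"{k:.0f}-sigma tail: about {units:,.0f} units out of {population:,}")
```

Under these assumptions, an event that looks rare for a single unit (roughly 0.13 percent at three standard deviations) still corresponds to more than a thousand units in a population of one million, which is why the tails cannot be ignored.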

Examples

Three automotive examples that can and do benefit from simulation and from the chemical sciences are aluminum casting for engine blocks and heads, exhaust aftertreatment catalysts for air quality, and homogeneous charge compression ignition (HCCI) engines, a potential future technology for improving fuel economy and maintaining low emissions. In the following discussion, the focus for each example is on why the industry has an interest in the application, what the industry is trying to accomplish, and the nature of some of the key ongoing scientific challenges. The emphasis is on how the chemical sciences fit in and how simulation fits in, but the challenge in each example involves significantly more than chemical sciences and simulation.

Virtual Aluminum Casting

Aluminum casting research at Ford Motor Company is the most integrated of my three examples. I chose it because this research illustrates good working relations across disciplines and research sectors. The research in the aluminum-casting project spans the range from the most fundamental atomic-scale science to the most applied in the manufacturing plant, and the project exploits all legs of the four-legged stool. The research approach displays a heavy reliance on simulation that is unimpeachably having significant impact.

Modeling and simulation of new materials receive most of the attention these days, but breakthroughs on mature materials may have greater potential impact. Peter Beardmore, a now-retired director from Ford Research Laboratory, argued [Ref. P. Beardmore, NAE, 1997] that refinements made to mature materials and processes would have a much greater probability of being implemented, and therefore would have far greater impact, than all new materials concepts combined. While his focus was on automotive structural materials, I believe the statement is broadly applicable and is likely true of virtually all technology.

Automobiles are mostly steel, by weight.
However, the automotive industry accounts for a large percentage of the consumer market for aluminum, iron, and magnesium. The automotive industry does not dominate the plastics market. Materials usage (measured by weight) did not change much from 1977 to 2001. Vehicles are still mostly steel, which accounts for the high recyclability of the automobile. Aluminum and plastic content have increased somewhat. However, with emerging energy issues, further change will be necessary. Fuel economy and emission standards are very strong drivers for lightweight materials, since to a first approximation fuel economy is inversely proportional to the weight of a vehicle. Any other avenues to increase fuel economy pale in comparison with what arises from lightening the vehicle. However, from a consumer perspective, it is critical that the vehicle is reliable, cost-effective, and high quality: consumer demands thus present considerable challenges to lightweight materials.
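The first-approximation scaling between fuel economy and vehicle weight can be written as a one-line calculation. The sketch below simply applies the inverse proportionality stated in the text; the function name and the numbers (30 mpg, 1,500 kg, a 10 percent mass reduction) are hypothetical and chosen only to show the arithmetic.

```python
def scaled_fuel_economy(base_mpg: float, base_mass_kg: float, new_mass_kg: float) -> float:
    """Fuel economy after a mass change, assuming economy is inversely proportional to mass."""
    if base_mass_kg <= 0 or new_mass_kg <= 0:
        raise ValueError("masses must be positive")
    return base_mpg * base_mass_kg / new_mass_kg


if __name__ == "__main__":
    base_mpg = 30.0          # hypothetical baseline fuel economy
    base_mass_kg = 1500.0    # hypothetical baseline vehicle mass
    lighter_mass_kg = 0.9 * base_mass_kg  # 10 percent lighter
    new_mpg = scaled_fuel_economy(base_mpg, base_mass_kg, lighter_mass_kg)
    gain_percent = 100.0 * (new_mpg / base_mpg - 1.0)
    print(f"{new_mpg:.1f} mpg, a gain of {gain_percent:.1f} percent")
```

Because the relationship is inverse rather than linear, a 10 percent mass reduction gives slightly more than a 10 percent gain under this approximation (a factor of 1/0.9). Real vehicles deviate from this idealization, so the result is only a rough guide.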

To attack the problem from a technology perspective, the researcher must consider the material's usage, performance, reliability, and manufacturability. The automotive industry makes a large number of cylinder heads and engine blocks, so reducing weight creates a driver to replace cast iron with cast aluminum. To have impact, the researcher must set goals and meet specifications, such as reducing product and process development time by a large percentage through simulation, improving quality, reducing scrap, and ensuring high-mileage durability.

The Ford Research virtual aluminum-casting project has developed unique capabilities for simultaneous process and product optimization. Generally, process optimization and product optimization are not integrated. In the stereotypical process, a design engineer generates a product design and then throws it over the fence to the manufacturing division. The manufacturing engineer says, "Well, if you made this change in your design it would be easier to manufacture," and an argument begins. Both parties remain focused on their own subsystem, and neither is able to recognize that the design and manufacturability of the product form a system, and that system optimization (as opposed to subsystem optimization) is often where opportunities emerge.

The virtual aluminum-casting project has developed a comprehensive and integrated set of simulation tools that capture expertise in aluminum metallurgy, casting, and heat treatment: years of experience captured in models. In addition to enabling calculations, an often-overlooked characteristic of models is that they can organize knowledge and information and thereby assist in knowledge dissemination. The models in the toolset have unsurpassed capabilities. John Allison, the project leader for virtual aluminum casting, uses the phrase "computational materials science plus."
The project, which links physics, materials science, mechanics, theory, modeling, experiment, engineering, fundamental science, and applied science, is achieving notable successes.

The structure of this virtual aluminum casting project could serve as a readily generalized example. One of the fundamental premises is that computational models should be able to solve 80 percent of the problem quickly through simulation, but that the models should not be trusted completely. Experimental verification and validation on subsystem and full-system levels are critical. Another fundamental premise is that new knowledge should be transferred into the plant and implemented. In other words, it is important to go beyond proof of concept to practical application.

The virtual aluminum casting project is comprehensive and multiscale;7 it reaches from initial product design based on property requirements all the way to influencing actual manufactured components (Figure 3). The project involves experiments and models at all length scales and all levels of empiricism that predict properties, service life, damage, stress, and the like.

7 Vaithyanathan, V.; Wolverton, C.; Chen, L.-Q. Phys. Rev. Lett. 2002, 88, 125503.

FIGURE 3 (figure and caption not recoverable from the machine-read text)

FIGURE 4 First-principles calculations can provide input to models downstream. [Plot axis: Temperature (°C)]

The aluminum alloy material has roughly 11 components (as a recycled material, the alloy actually has an unknown number of components that depend on the recycling stream), and some of the concentrations are quite small. Some components have more impact than others do. Impurities can have enormous effects on the microstructure and thereby affect the properties. Precipitates form during heat treatment, and manufacturers have the opportunity to optimize the aluminum heat treatment to achieve an optimized structure. First-principles calculations and multiscale calculations are able to elucidate opportunities, including overturning 100 years of metallurgical conventional wisdom (Figure 4).8

Several key challenges must be overcome in this area. In the chemical sciences, methodologies are needed to acquire quantitative kinetic information on real industrial materials without resorting to experiment. It is also necessary to determine kinetic pre-factors and barriers with enough accuracy to be useful. Another key challenge is the need for seamless multiscale modeling, including uncertainty quantification. Properties such as ductility and tensile strength are still very difficult to calculate. Finally, working effectively across the disciplines remains a considerable barrier.

8 Wolverton, C.; Ozolins, V. Phys. Rev. Lett. 2001, 86, 5518; Wolverton, C.; Yan, X.-Y.; Vijayaraghavan, R.; Ozolins, V. Acta Mater. 2002, 50, 2187.
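To see why the accuracy of kinetic barriers and pre-factors matters, the following sketch assumes a simple Arrhenius rate law, k = A exp(-Ea / (R T)). The rate law, the function names, and the example numbers are assumptions made for illustration; the text does not specify a kinetic model.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol K)


def arrhenius_rate(prefactor: float, barrier_j_per_mol: float, temperature_k: float) -> float:
    """Rate constant from an assumed Arrhenius form k = A * exp(-Ea / (R * T))."""
    return prefactor * math.exp(-barrier_j_per_mol / (GAS_CONSTANT * temperature_k))


def rate_error_factor(barrier_error_j_per_mol: float, temperature_k: float) -> float:
    """Multiplicative error in k caused by an error in the barrier, at fixed prefactor."""
    return math.exp(abs(barrier_error_j_per_mol) / (GAS_CONSTANT * temperature_k))


if __name__ == "__main__":
    temperature_k = 500.0  # hypothetical process temperature
    for error_kj in (2.0, 5.0, 10.0):  # hypothetical barrier uncertainties
        factor = rate_error_factor(error_kj * 1000.0, temperature_k)
        print(f"barrier error of {error_kj:.0f} kJ/mol -> rate off by a factor of {factor:.1f}")
```

Because the barrier sits in an exponent, an uncertainty of 10 kJ/mol at 500 K changes the predicted rate by roughly an order of magnitude under this model, whereas an error in the pre-factor only scales the rate linearly.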

Catalysis of Exhaust Gas Aftertreatment

Exhaust aftertreatment catalysts provide a second, less-integrated example of the use of chemical sciences modeling and simulation to approach an application-related problem. Despite large investments in catalysis, both academia and funding agencies seem to have little interest in catalysis for stoichiometric exhaust aftertreatment. Perhaps the perception of a majority of the scientific community is that stoichiometric exhaust aftertreatment is a solved problem. Although there is a large empirical knowledge base and a cursory understanding of the principles, from my vantage point the depth of understanding is far from adequate, especially considering the stringency of Tier 2 tailpipe emissions regulations. This is another challenging and intellectually stimulating research area with critical issues that span the range from the most fundamental to the most applied.

After the discovery in the 1960s that tailpipe emissions from vehicles were adversely affecting air quality, there were some significant technological breakthroughs, and, simply by moving along a normal evolutionary path, we have now obtained three-way catalysts that reduce NOx and oxidize hydrocarbons and CO at high levels of efficiency with a single, integrated, supported-catalyst system. Unfortunately, the three-way catalyst works only in a rather narrow range of air-fuel ratio, approximately the stoichiometric ratio. This precludes some of the opportunities to improve fuel economy; for example, a leaner air-fuel ratio can yield better fuel economy but, with current technology, only at the expense of greater NOx emissions. Also, most exhaust pollution comes out of the tailpipe in the first 50 seconds of vehicle operation, because the catalyst has not yet reached a temperature range in which it is fully active.
The three-way catalyst is composed of precious metals on supported alumina with ceria-based oxygen storage, coated on a high-surface-area ceramic or metallic substrate with open channels for the exhaust to pass through and over the catalyst. As regulations become increasingly stringent for tailpipe emissions, the industry is transitioning to higher cell densities (smaller channels) and thinner walls between channels. This simultaneously increases the surface area, decreases the thermal mass, and reduces the hydraulic diameter of the channels; all three effects are key enablers for higher-efficiency catalysts. The high-surface-area nanostructured washcoat is another key enabler, but it is necessary to maintain the high surface area of the coating and the highly dispersed, catalytically active precious metals for the life of the vehicle despite high-temperature operation and exposure to exhaust gas and other chemical contaminants. In other words, the material must have adequate thermal durability and resist sintering and chemical poisoning from exhaust gas components. Neither thermal degradation nor chemical degradation is particularly well understood beyond some general principles; that is, modeling is difficult without a lot of empirical input. The industry also must design for end of life, which is a very large design space. Simulation currently uses empirically derived, aged-catalyst-state input.
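The geometric effect of higher cell density and thinner walls described above can be sketched with elementary geometry. The calculation assumes an idealized substrate with square channels, uniform walls, and no washcoat; the class and function names and the two example substrates (400 cells per square inch with 6.5 mil walls, 900 cells per square inch with 2.5 mil walls) are illustrative assumptions rather than data from the presentation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MonolithGeometry:
    hydraulic_diameter_in: float          # equals channel width for a square channel
    open_frontal_area: float              # open fraction of the substrate face
    geometric_surface_area_per_in: float  # channel wall area per unit volume (in^2/in^3)

    @property
    def solid_fraction(self) -> float:
        """Fraction of the volume occupied by wall material (a proxy for thermal mass)."""
        return 1.0 - self.open_frontal_area


def monolith_geometry(cells_per_sq_in: float, wall_thickness_mil: float) -> MonolithGeometry:
    """Idealized square-channel geometry from cell density and wall thickness."""
    pitch_in = 1.0 / math.sqrt(cells_per_sq_in)           # center-to-center cell spacing
    width_in = pitch_in - wall_thickness_mil / 1000.0     # open channel width
    if width_in <= 0:
        raise ValueError("wall thickness exceeds cell pitch")
    return MonolithGeometry(
        hydraulic_diameter_in=width_in,                   # 4 * area / perimeter = width
        open_frontal_area=width_in ** 2 * cells_per_sq_in,
        geometric_surface_area_per_in=4.0 * width_in * cells_per_sq_in,
    )


if __name__ == "__main__":
    for cpsi, wall in ((400, 6.5), (900, 2.5)):  # hypothetical substrates
        g = monolith_geometry(cpsi, wall)
        print(cpsi, wall, round(g.hydraulic_diameter_in, 4),
              round(g.geometric_surface_area_per_in, 1), round(g.solid_fraction, 3))
```

Under these idealizations, the denser, thinner-walled substrate has more wall surface per unit volume, a smaller solid fraction, and a smaller hydraulic diameter, which matches the three trends named in the text.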

What the industry really needs is a predictive capability. The industry also has to minimize the cost and the amount of the active catalytic platinum, palladium, and rhodium used, since precious metals are a rare and limited commodity.

Again, a key chemical science challenge is kinetics. How does one obtain quantitative kinetic information for real industrial, heterogeneous catalysts without resorting to time-consuming experiments? Simulations currently use empirically derived, simplified kinetics.

The science of accelerated aging has generally been ignored. The automotive industry must be able to predict aging of catalysts and sensors without running many vehicles to 150,000 miles. The industry does currently take several vehicles and accumulate miles, but driving vehicles to that many miles is a particularly slow way to get answers and does not provide good statistics. The industry does utilize accelerated aging, but it is done somewhat empirically, without the benefit of a solid foundation of basic science.

Homogeneous Charge Compression Ignition

The third and final example arises from the Ford Motor Company-MIT Alliance, a partnership that funds mutually beneficial collaborative research. The principal investigators, Bill Green and Bill Kaiser, are from MIT and Ford, respectively. The project spans a range from the most fundamental chemistry and chemical engineering to the most applied. In contrast to the first two examples, this project focuses on a technology in development, as opposed to a technology already in practice.

Homogeneous charge compression ignition is similar in concept to a diesel engine. It is highly efficient and runs lean. It is compression ignited, as opposed to spark ignited, but it has no soot or NOx because it runs much cooler than a diesel. It is similar to a gasoline engine in that it uses premixed, volatile fuels like gasoline, and it has similar hydrocarbon emissions.
But an HCCI engine has a much higher efficiency and much lower NOx emissions than a gasoline engine, which could eliminate the need for the costly three-way precious metal catalyst.

However, HCCI is difficult to control. There is no simple timing mechanism that can control ignition, as exists for a spark-ignition or diesel fuel-injection engine. HCCI operates by chemistry and consequently is much more sensitive to the fuel chemistry than either spark-ignition or diesel engines. By contrast, combustion in a conventional internal combustion engine is largely insensitive to the details of the fuel. HCCI looks very promising, but researchers do not yet know what the full operating range is or how to reliably control the combustion with computation. Yelvington and Green have already demonstrated that HCCI can work well over a broad range of conditions, demonstrating the promise that computation and simulation can and must play a continuing and large role in resolving the HCCI challenges.9

9 Yelvington, P. E.; Green, W. H. SAE Technical Paper 2003, 2003-01-1092.

Final Words

The complexity of industrial design problems requires that one be able to deal with the realistic, not overly idealized, system. The complexity demands a multidisciplinary approach and working in teams. Access to and understanding of the full range of methods are generally necessary if the researcher is going to have impact by solving relevant problems, and individual industrial researchers must be able to interact and communicate effectively beyond their disciplines. To achieve integration from fundamental science to real-world applications is truly a challenge of coordination, integration, and communication across disciplines, approaches, organizations, and research and development sectors.

In general, approximate answers or solutions with a realistic quantification of uncertainty, if arrived at quickly, have greater impact than highly accurate answers or solutions arrived at too late to affect critical decisions. Often, there is no need for the increased level of accuracy. To quote Einstein, "Things should be made as simple as possible, but not any simpler." Sometimes researchers in industry, just out of necessity, oversimplify. That is when the automotive development engineer might lose sleep, because it could mean that vehicles might have unexpected reliability issues in the field, if the oversimplification resulted in a wrong decision.

Simulations with varying degrees of empiricism and predictive capability should be aligned closely with extensive experimental capabilities. It is also important to bring simulation in at the beginning of a new project. Too often, experimentalists turn to a computational scientist only after repeated experimental failures. Many of these failures would have been avoided had the consultation occurred at an early stage.
The computational expert could help generate hypotheses even without doing calculations or by doing some simple calculations, and the perspective of the computational researcher can frequently eliminate dead ends. In addition, framing questions correctly so that the experiments yield unambiguous results, or reasonably unambiguous results, is crucial. Obtaining reasonably definitive answers in a timely manner is equally crucial, but too often there does not seem to be a time scale driving the needed urgency.

Data and knowledge management offer additional overlooked opportunities. Hypotheses that have been proven wrong often continue to influence decisions. Decision makers too often operate on the basis of disproved or speculative conjectures rather than on definitive data. The science and technology enterprise needs a way to manage data such that it is relatively easy to know what the community does and does not know, as well as what assumptions underlie current knowledge.

Finally, it is important to have realistic expectations for success. This is a difficult challenge because what one measures and rewards is what one gets. There are many ways to affect applications and technology with any level of sophistication in a simulation. Some of the important ways lend themselves only to intangible measures, but oftentimes these may be the most important. Again quoting Einstein, "Not everything that counts can be counted, and not everything that can be counted counts."

DRUG DISCOVERY: A GAME OF TWENTY QUESTIONS

Dennis J. Underwood
Infinity Pharmaceuticals, Inc.

Introduction

There is a revolution taking place in the pharmaceutical industry. An era in which nearly continuous growth and profitability was taken for granted is coming to an end. Many of the major pharmaceutical companies have been in the news lately with a plethora of concerns, ranging from the future of the biotechnology sector as a whole to the availability of drugs to economically disenfranchised groups in the developed and the developing world. The issues for the pharmaceutical sector are enormous and will likely result in changes in the health-care system, including the drug discovery and development enterprises. Central challenges include the impact that drugs coming off patent have had on the recent financial security of the pharmaceutical sector and the need for improved protection of intellectual property rights. Historically, generic competition has slowly eroded a company's market share. There is a constant battle between the pace of discovering new drugs and the pace at which old drugs go off patent, which gives generic competitors opportunities to invade their market share. Recent patent expirations have displayed a much sharper decline in market share, making new drugs even more critical.

The 1990s were a decade in which the pharmaceutical giants believed they could sustain growth indefinitely by dramatically increasing the rate of bringing new medicines to market, simply by increasing R&D spending and continuing to utilize the same research philosophies that had worked in the past. It is clear from the rapid rise in R&D expenditure and the resultant cost of discovering new drugs that the "old equation" is becoming less favorable. There is a clear need to become more efficient in the face of withering pipelines and larger and more complex clinical trials. For large pharmaceutical companies to survive, they must maintain an income stream capable of supporting their current infrastructure as well as funding R&D for the future. The cost of drug development and the low probability of technical success call for improved efficiency of drug discovery and development and further investment in innovative technologies and processes that improve the chances of bringing a compound to market as a drug. Already there has been quite a change in the way in which drugs are discovered. Large pharmaceutical companies are diversifying their drug discovery and development

DENNIS J. UNDERWOOD 171 processes: They are relying more on the inventiveness of smaller biotechnology companies and licensing technology, compounds, and biologicals at a faster, more competitive rate. To meet critical time lines they are outsourcing components of research and development to contract research organizations, enabling greater efficiencies by providing added expertise or resources and decreasing develop- ment time lines. The trend toward mergers and acquisitions, consolidating pipe- lines, and attempting to achieve economies of scale is an attempt by large phar- maceutical companies to build competitive organizations. Although this may help short-term security, future ongoing success may not be ensured solely with this strategy. One of the most valuable assets of a pharmaceutical company is its experi- ence in drug discovery and development. Of particular importance are the data, information and knowledge generated in medicinal chemistry, pharmacology, and in vivo studies accumulated over years of research in many therapeutic areas. This knowledge is based on hundreds of person-years of research and develop- ment; yet most companies are unable to effectively capture, store, and search this experience. This intellectual property is enormously valuable. As with the other aspects of drug discovery and development, the methods and approaches used in data-, information- and knowledge-base generation and searching are undergoing evolutionary improvements and, at times, revolutionary changes. It is imperative for all data- and information-driven organizations to take full advantage of the information they are generating. We assert that those companies that are able to do this effectively will be able to gain and sustain an advantage in a highly com- plex, highly technical, and highly competitive domain. 
The aim of this overview is to highlight the important role informatics plays in pharmaceutical research, the approaches that are currently being pursued and their limitations, and the challenges that remain in reaping the benefit of advances. We are using the term "informatics" in a general way to describe the processes whereby information is generated from data and knowledge is derived as our understanding builds. Informatics also refers to the management and transformation of data and information and the assimilation of knowledge into the processes of discovery and development.

There has been much time, money, and effort spent in attempting to reduce the time it takes to find and optimize new chemical entities. It has proven difficult to reduce the time it takes to develop a drug, but the application of new technologies holds hope for dramatic improvements. The promise of informatics is to reduce development times by becoming more efficient in managing the large amounts of data generated during a long drug discovery program. Further, with managed access to all of the data, information, and experience, discoveries are more likely and the expectation is that the probability of technical success will increase.

Why is the probability of technical success of drug discovery and development so low? What are the issues in moving compounds through the drug discovery and development pipeline? It is clear from existing data that the primary bottlenecks are pharmacokinetic problems and lack of efficacy in man. In addition, there are problems of toxicity in animal models and the discovery of adverse effects in man. Clearly there are significant opportunities to improve the probability of technical success and, perhaps, to shorten the time line for development. The current strategies bring in absorption, distribution, metabolism, excretion, and toxicity (ADME-Tox) studies earlier in the process (late discovery) and use programs based on disease clusters rather than a single target.1

Drug discovery and development is a difficult business because biology and the interaction with biology is complicated and, indeed, may be classifiable as a complex system. Complexity is due partly to an incomplete knowledge of the biological components of pathways; the manner in which the components interact and are compartmentalized; and the way they are modulated, controlled, and regulated in response to intrinsic and environmental factors over time. For the most part, biological interactions are important but are not accounted for in screening and assay strategies. Often model systems are lacking in some way and do not properly represent the target organism. The response of the cell to drugs is very dependent on initial conditions, which is to say that the response of the cell is very dependent on its state and the condition of many subcellular components.
Typically, the behavior of a complex system is different from that of its components, which is to say that the processes that occur simultaneously at different scales (protein, nucleic acid, macromolecular assembly, membrane, organelle, tissue, organism) are important and the intricate behavior of the entire system depends on these processes, but in a nontrivial way.2,3 If this is true, application of the tools of reductionism may not provide us with an understanding of the responses of an organism to a drug. Are we approaching a "causality catastrophe" whenever we expect the results of in vitro assays to translate to clinical data?

What is an appropriate way to deal with such complexity? The traditional approach has been to generate large amounts of replicate data, to use statistics to provide confidence, and to move cautiously, stepwise, toward higher complexity: from in vitro to cell-based to tissue-based to in vivo in model animals and then to man. In a practical sense, the drug discovery and development odds have been improved by a number of simple strategies: Start with chemically diverse leads, optimize them in parallel in both discovery and later development, back them up with other compounds when they reach the clinic, and follow on with new, structurally different compounds after release. Approach diseases by focusing, in parallel, on clusters rather than on a single therapeutic target, unless the target has proven to be clinically effective. Generate more and better-quality data focusing on replicates, different conditions, different cell types in different states, and different model organisms. Develop good model systems, such as for ADME-Tox, and use them earlier in the discovery process to help guide the optimization process away from difficulties earlier (fail early).

1 Kennedy, T. Drug Discovery Today 1997, 2 (10), 436-444.
2 Vicsek, T. Nature 2001, 411, 421.
3 Glass, L. Nature 2001, 410, 277-84.

The increase in the amount and diversity of the data leads to a "data catastrophe" in which our ability to fully extract relationships and information from the data is diminished. The challenge is to provide informatics methods to manage and transform data and information and to assimilate knowledge into the processes of discovery and development; the issue is to be able to integrate the data and information from a variety of sources into consistent hypotheses rich in information. The more information, the better. For example, structure-based design has had many successes and has guided the optimization of leads through a detailed understanding of the target and the way in which compounds interact. This has become a commonly accepted approach, and many companies have large, active structural biology groups participating in drug discovery teams. One of the reasons that this is an accepted approach is that there is richness in data and information and there is a wealth of methodology available, both experimental and computational, that enables these approaches. The limitations in this area are concerned primarily with the challenges facing structural biology, such as appropriate expression systems, obtaining sufficient amounts of pure protein, and the ability to crystallize the protein. These limitations can be severe and can prevent a structural approach for many targets, especially membrane-bound protein complexes. There is also a question of relevancy: Does a static, highly ordered crystalline form sufficiently represent dynamic biological events?

There are approaches that can be used in the absence of detailed structural information.
Further, it is often instructive to use these approaches in conjunction with structural approaches, with the aim of providing a coherent picture of the biological events using very different approaches. However, application of these methods is extremely challenging, primarily due to the lack of detailed data and information on the system under study. Pharmacophore mapping is one such approach. A complexity that is always present is that there are multiple ways in which compounds can interact with a target; there are always slight changes in orientation due to differences in chemical functionality, and distinctly different binding modes are always possible. Current methods of pharmacophore mapping find it difficult to detect these alternates. Further, the approach often relies on high-throughput screening approaches that are inherently "noisy," making the establishment of consistent structure-activity relationships difficult. In addition, the methodology developed in this area is often limited to only parts of the entire dataset. There has been much effort directed to deriving two-, three-, and higher-dimensional pharmacophore models, and the use of these models in lead discovery and development is well known.4

4 Agrafiotis, D. K.; Lobanov, V. S.; Salemme, F. R. Nat. Rev. Drug Discov. 2002, 1, 337-46.

The key issue in these methods is the manner in which compounds, their structure, features, character, and conformational flexibility are represented. There are many ways in which this can be achieved, but in all cases the abstraction of the chemistry for computational convenience is an approximation. Each compound, in a practical sense, is a question that is being asked of a complex biological system. The answer to a single question provides very little information; however, in a manner analogous to the game of "twenty questions," the combined result from a well-chosen collection of compounds (questions) can provide an understanding of the biological system under study.5 The manner in which chemistry is represented is essential to the success of such a process. It is akin to posing a well-informed question that, together with other well-formed questions, is powerful in answering or giving guidance to the issues arising from drug discovery efforts. Our approach is to use the principles of molecular recognition in simplifying representation of the chemistry: atoms are binned into simple types such as cation, anion, aromatic, hydrogen-bond acceptor and donor, and so forth. In addition, the conformational flexibility of each molecule is represented. The result is a matrix in which the rows are compounds and the columns are bits of a very large string (tens of millions of bits long) that mark the presence of features. Each block of features can be the simple presence of a functional group such as phenyl, chlorophenyl, piperidyl, or aminoethyl, or it can be as complex as three- or four-point pharmacophoric features that combine atom types and the distances between them. The richness of this representation, along with a measure of the biological response of these compounds, enables methods that use Shannon's maximum-entropy, information-based approach to discover ensembles of patterns of features that describe activity and inactivity.
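The compound-by-feature matrix described above can be sketched on a toy scale. The following is a minimal illustration only, not the authors' implementation: the data, the function names, the two-point (rather than three- or four-point) pharmacophore features, and the 2-angstrom distance bins are all assumptions made for the example, and plain information gain stands in for the maximum-entropy scoring, which the text does not specify.

```python
from itertools import combinations
from math import dist, log2

def features(points, bin_width=2.0):
    """Features of one conformer: atom types plus two-point pharmacophores.

    points is a list of (atom_type, (x, y, z)); a two-point feature is a
    sorted pair of atom types and the index of their binned distance.
    """
    feats = {atom_type for atom_type, _ in points}
    for (t1, p1), (t2, p2) in combinations(points, 2):
        a, b = sorted((t1, t2))
        feats.add(f"{a}-{b}@{int(dist(p1, p2) // bin_width)}")
    return feats

def bit_matrix(compounds):
    """Rows are compounds, columns are features.

    A bit is set when any conformer of the compound shows the feature,
    which is one simple way to represent conformational flexibility.
    """
    per_compound = [set().union(*(features(c) for c in conformers))
                    for conformers in compounds]
    columns = sorted(set().union(*per_compound))
    rows = [[int(f in feats) for f in columns] for feats in per_compound]
    return columns, rows

def entropy(labels):
    """Shannon entropy, in bits, of a list of categorical labels."""
    n = len(labels)
    return -sum(labels.count(v) / n * log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(bits, labels):
    """Reduction in label entropy (bits) from knowing one feature column."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for bit, lab in zip(bits, labels) if bit == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical toy data: four compounds, each a list of conformers.
compounds = [
    [[("cation", (0, 0, 0)), ("aromatic", (5, 0, 0))]],
    [[("cation", (0, 0, 0)), ("aromatic", (4.5, 0, 0))],
     [("cation", (0, 0, 0)), ("aromatic", (9, 0, 0))]],
    [[("cation", (0, 0, 0)), ("aromatic", (9, 0, 0))]],
    [[("anion", (0, 0, 0)), ("aromatic", (5, 0, 0))]],
]
labels = ["active", "active", "inactive", "inactive"]

columns, rows = bit_matrix(compounds)
gains = {name: information_gain([row[i] for row in rows], labels)
         for i, name in enumerate(columns)}
```

In this toy set the feature `aromatic-cation@2` (a cation and an aromatic group 4 to 6 angstroms apart) separates actives from inactives completely and scores 1 bit, whereas the bare presence of a cation scores about 0.31 bits and the aromatic group, present in every compound, scores 0.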
These methods are novel and have been shown to capture the essence of the effects of the structure-activity relationship (SAR) in a manner that can be used in the design of information-based libraries and the virtual screening of databases of known or realizable compounds.6 The power of these methods lies in their ability to discern relationships in data that are inherently "noisy." The data are treated as categorical rather than continuous: active (yes), inactive (no), and a variable category of unassigned data (maybe). These methods are deterministic and, as such, derive all relationships between all compounds at all levels of support. The relationships or patterns are scored based on their information content. Unlike methods such as recursive partitioning, pattern discovery is not "greedy" and is complete. The discrimination ability of pattern discovery depends very much on the quality of the data generated and on the type and condition of the compounds; if either is questionable, the signal-to-noise ratio will be reduced and the quality of the results will be jeopardized. These approaches have been used in the study of G-protein-coupled receptors7,8 and in ADME-Tox studies.9,10

5 Underwood, D. J. Biophys. J. 1995, 69, 2183-4.
6 Beroza, P.; Bradley, E. K.; Eksterowicz, J. E.; Feinstein, R.; Greene, J.; Grootenhuis, P. D.; Henne, R. M.; Mount, J.; Shirley, W. A.; Smellie, A.; Stanton, R. V.; Spellmeyer, D. C. J. Mol. Graph. Model. 2002, 18, 335-42.

These methods have also been used in the identification of compounds that are able to present the right shape and character to a defined active site of a protein.11,12 In cases where the protein structure is known and the potential binding sites are recognized, the binding site can be translated into a bit-string that is in the same representational space as described above for the compounds. This translation is done using methods that predict the character of the space available for binding compounds. The ability to represent both the binding site(s) and compounds in the same way provides the mechanism to discriminate between compounds that are likely to bind to the protein and those that are not. These approaches have been used for serine proteases, kinases, and phosphatases.13

The game of 20 questions is simple but, almost paradoxically, can give the inquisitor the ability to read the subject's mind. The way in which this occurs is well understood; in the absence of any information there are many possibilities, a very large and complex but finite space. The solution relies on a tenet of a dialectic philosophy in which each question provides a thesis and an antithesis that is resolved by a simple yes or no answer. In so doing, the number of possibilities is dramatically reduced and, after 20 questions, the inquisitor is usually able to guess the solution. The solution to a problem in drug discovery and development is far more complex than a game of 20 questions and should not be trivialized. Even so, the power of discrimination through categorization of answers and integration of answers from diverse experiments provides an extremely powerful mechanism for optimizing a potential drug to a satisfactory outcome.
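The reduction in possibilities that the game relies on can be made concrete with a short sketch. It assumes an idealized game that the text does not spell out: the candidates are equally likely, numbered from 0, and every answer is reliable, so each yes/no answer yields at most one bit and 20 answers can separate 2^20 = 1,048,576 possibilities. The function names are illustrative.

```python
from math import ceil, log2

def questions_needed(n_possibilities):
    """Fewest yes/no answers that always single out one of n candidates."""
    return ceil(log2(n_possibilities))

def play(secret, n_possibilities):
    """Find secret in range(n_possibilities) by halving the candidates.

    Each question asks whether the secret lies in the upper half of the
    remaining candidates; the answer discards the other half.
    """
    low, high = 0, n_possibilities  # remaining candidates are low..high-1
    asked = 0
    while high - low > 1:
        mid = (low + high) // 2
        asked += 1
        if secret >= mid:  # the answer is "yes"
            low = mid
        else:              # the answer is "no"
            high = mid
    return low, asked

guess, asked = play(777_777, 2 ** 20)
```

Here `guess` equals the secret and `asked` is exactly 20. Noisy or ambiguous answers, which the text notes are typical of screening data, break the clean halving, which is one reason the drug discovery problem is harder than the game.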
These approaches have been encoded into a family of algorithms known as pattern discovery (PD).14 PD describes a family of novel methods in the category of data mining. One of the distinguishing features of PD is that it discovers relationships between data rather than relying on human interpretation to generate a model as a starting point; this is a significant advantage. Another important advantage of PD is that it builds models based on ensembles of inputs to explain the data and therefore has an advantage in the analysis of complex systems (such as biology2,3). We have developed a novel approach to PD that has been applied to bio-sequence, chemistry, and genomic data. Importantly, these methods can be used to integrate different data types such as those found in chemistry and biology. PD methods are quite general and can be applied to many different areas such as proteomics, text, etc.

7 Wilson, D. M.; Termin, A. P.; Mao, L.; Ramirez-Weinhouse, M. M.; Molteni, V.; Grootenhuis, P. D.; Miller, K.; Keim, S.; Wise, G. J. Med. Chem. 2002, 45, 2123-6.
8 Bradley, E. K.; Beroza, P.; Penzotti, J. E.; Grootenhuis, P. D.; Spellmeyer, D. C.; Miller, J. L. J. Med. Chem. 2000, 43, 2770-4.
9 Penzotti, J. E.; Lamb, M. L.; Evensen, E.; Grootenhuis, P. D. J. Med. Chem. 2002, 45, 1737-40.
10 Clark, D. E.; Grootenhuis, P. D. Curr. Opin. Drug Discov. Devel. 2002, 5, 382-90.
11 Srinivasan, J.; Castellino, A.; Bradley, E. K.; Eksterowicz, J. E.; Grootenhuis, P. D.; Putta, S.; Stanton, R. V. J. Med. Chem. 2002, 45, 2494-500.
12 Eksterowicz, J. E.; Evensen, E.; Lemmen, C.; Brady, G. P.; Lanctot, J. K.; Bradley, E. K.; Saiah, E.; Robinson, L. A.; Grootenhuis, P. D.; Blaney, J. M. J. Mol. Graph. Model. 2002, 20, 469-77.
13 Rogers, W. T.; Underwood, D. J.; Argentar, D. R.; Bloch, K. M.; Vaidyanathan, A. G. Proc. Natl. Acad. Sci. U.S.A., submitted.
14 Argentar, D. R.; Bloch, K. M.; Holyst, H. A.; Moser, A. R.; Rogers, W. T.; Underwood, D. J.; Vaidyanathan, A. G.; van Stekelenborg, J. Proc. Natl. Acad. Sci. U.S.A., submitted.

Validation of these methods in bio-sequence space has been completed using well-defined and well-understood systems such as serine proteases13 and kinases. PD in bio-sequence space provides a method for finding ensembles of patterns of residues that form a powerful description of the sequences studied. The similarity between patterns expresses the evolutionary family relationships. The differences between patterns define their divergence. The patterns express key functional and structural motifs that very much define the familial and biochemical character of the proteins. Embedded in the patterns are also key residues that have particular importance with respect to the function or the structure of the protein. Mapping these patterns onto the x-ray structures of serine proteases and kinases indicates that the patterns indeed are structurally and functionally important and, further, that they define the substrate-binding domain of the proteins.
This leads to the compelling conclusion that since the patterns describe evolutionary changes (divergence and convergence) and also describe the critical features of substrate binding, the substrate is the driver of evolutionary change.13

A particular application of PD is in the analysis of variations of genetic information (single nucleotide polymorphisms, or SNPs). Analysis of SNPs can lead to the identification of genetic causes of diseases, or inherited traits that determine differences in the way humans respond to drugs (either adversely or positively). Until now, the main method of SNP analysis has been linkage disequilibrium (LD), which seeks to determine correlations among specific SNPs. A key limitation of LD, however, is that only a restricted set of SNPs can be compared. Typically, SNPs within a local region of a chromosome or SNPs within genes that are thought to act together are compared. PD, on the other hand, through its unique computational approach, is capable of detecting all patterns of SNPs, regardless of the genomic distances between them. Among these will be patterns of SNPs that are responsible for the disease (or trait) of interest, even though the individual SNPs comprising the pattern may have no detectable disease (or trait) correlation. This capability will greatly accelerate the exploitation of the genome for commercial purposes.
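The claim that a pattern of SNPs can carry a disease signal that none of its member SNPs shows alone can be illustrated with a small constructed example. This is a sketch under stated assumptions, not the PD algorithm and not an LD calculation: the cohort is synthetic, the disease model (affected when exactly one of two SNPs carries the variant) is invented for the purpose, and the brute-force scoring of every pair by mutual information would not scale to genome-wide data.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(count / n * log2(count * n / (px[x] * py[y]))
               for (x, y), count in pxy.items())

# Synthetic cohort: every combination of three biallelic SNPs (0 or 1),
# each combination repeated five times, giving 40 individuals.
genotypes = [g for g in product((0, 1), repeat=3) for _ in range(5)]

# Assumed disease model: affected (1) when exactly one of SNP 0 and
# SNP 2 carries the variant; SNP 1 is irrelevant.
phenotype = [g[0] ^ g[2] for g in genotypes]

def score(snps):
    """Information about the phenotype carried by a tuple of SNP indices."""
    pattern = [tuple(g[i] for i in snps) for g in genotypes]
    return mutual_information(pattern, phenotype)

single_scores = {i: score((i,)) for i in range(3)}
pair_scores = {pair: score(pair) for pair in combinations(range(3), 2)}
best_pair = max(pair_scores, key=pair_scores.get)
```

Every entry of `single_scores` is 0 bits, so no SNP taken alone correlates with the phenotype, while the pair `(0, 2)` scores a full 1 bit and is returned as `best_pair`. The two SNPs in the pair are unrelated positions in this toy cohort, which mirrors the point that the pattern need not respect genomic distance.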


