The NNSA laboratories have an extensive set of activities in modeling and simulation (M&S) that are essential to the maintenance of the nuclear weapons stockpile, given the complexity of the weapon systems and the absence of testing. The laboratories also use computational modeling to answer fundamental scientific questions and to aid in engineering analysis and design, sometimes in collaboration with university researchers, other laboratories, or U.S. industry. In aggregate, the scope of the M&S program is very broad, ranging from focused science codes that address a particular physical phenomenon to integrated modeling codes (IMCs) that couple models of multiple interacting physical processes. The IMCs used by LANL and LLNL are geared toward development and evaluation of the nuclear explosives packages of nuclear weapons. They are some of the most complex numerical simulations used anywhere, with millions of lines of software written over decades to capture the behaviors of and interactions between materials, plasmas, fluids, and radiation. SNL also uses IMCs, primarily in the engineering design/analysis space but also in support of complex experimental facilities such as the Z-pinch machine.
Compared to the IMCs, the science codes are typically smaller and involve fewer interacting physical models. That is not to say they are less sophisticated, because some of them represent the limits of our understanding of underlying physics and push the frontiers of mathematical algorithms, the methods for quantifying uncertainties, and the computer systems on which they run. None of these codes are entirely static, as they are regularly adapted to incorporate refinements in the models, new algorithmic techniques that improve accuracy or performance, and changes in the underlying computer architectures.
A successful program in this sort of advanced M&S requires deep expertise in applied mathematics, computer science, and a range of physical science disciplines relevant to the mission, plus access to experimental data to validate the simulation models and to quantify uncertainties. It also needs a sophisticated set of software engineering activities for the design, development, and maintenance of codes. The work must cover, in a balanced way, both support for production software used in weapons design and certification and innovations needed to develop new models, algorithms, and computer systems. Successful M&S programs capable of addressing these requirements and challenges are built around large, interdisciplinary team-based science and engineering (S&E) projects, which are a hallmark of the NNSA national security laboratories.
A strength of the M&S programs at the laboratories is the quality of the senior scientific staff, including those in management roles. Several of them have strong backgrounds in theoretical or experimental research, only later moving into M&S. They demonstrated to the committee an appreciation of the importance of a strong interface between modeling and basic science.
When DOE began the Accelerated Strategic Computing Initiative (which later transitioned into today’s Advanced Simulation and Computing program), the IMCs were seen primarily as tools for nuclear
weapons design, particularly for stockpile stewardship in the absence of underground nuclear tests. But they have also proved valuable in a range of other applications, including designing inertial confinement fusion capsules at LLNL, exploring high-energy-density physics and other basic science topics, enabling studies to underpin the LEPs, and providing simulations in support of counter-proliferation.
While the codes have had an important impact on laboratory science and technology decisions, the code developers acknowledge that they do not yet have a robust predictive capability for a number of key phenomena, such as energy flows and boost, except through the use of problem-dependent calibration from experiments. The predictive capability of a numerical simulation depends on several factors, including the following:
• The validity of the physical models encoded in the simulation;
• The fidelity of the numerical methods used, including whether the discretization accurately captures the essential features;
• The quality of the algorithms—for example, the choice of preconditioner used in solving linear systems and the numerical accuracy of the algorithms in the face of roundoff errors; and
• The quality of the software implementation of the algorithms.
The predictive capability that has been achieved is assessed by validating the simulation code against experimental data, assessing parametric uncertainties, estimating numerical errors, estimating the impacts of uncertainties in initial and boundary conditions, and so on, all for the range of applications of interest.
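One of the factors above, the fidelity of the numerical methods, can be checked without any experimental data at all by refining the discretization and measuring how quickly the computed solution converges. The sketch below is purely illustrative (a toy one-dimensional advection problem, not any laboratory code): it estimates the observed order of accuracy of a first-order upwind scheme from the errors on successively refined meshes.

```python
import numpy as np

def solve_advection(n_cells, c=1.0, t_final=0.5):
    """First-order upwind solution of u_t + c u_x = 0 on a periodic
    domain [0, 1); returns the L2 error against the exact solution."""
    dx = 1.0 / n_cells
    x = (np.arange(n_cells) + 0.5) * dx
    u = np.sin(2.0 * np.pi * x)              # smooth initial condition
    dt = 0.5 * dx / c                        # CFL number 0.5 (stable)
    t = 0.0
    while t < t_final - 1e-12:
        step = min(dt, t_final - t)
        u = u - c * step / dx * (u - np.roll(u, 1))   # upwind difference
        t += step
    exact = np.sin(2.0 * np.pi * (x - c * t_final))
    return float(np.sqrt(dx * np.sum((u - exact) ** 2)))

# Errors on successively refined meshes; the observed order of accuracy
# follows from the ratio of errors between consecutive mesh levels.
errors = {n: solve_advection(n) for n in (64, 128, 256)}
observed_order = np.log2(errors[128] / errors[256])
print(f"observed order of accuracy ~ {observed_order:.2f}")
```

A scheme whose observed order falls short of its design order signals a discretization or implementation problem; mesh-refinement studies of this kind are one inexpensive ingredient of code verification.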
Although all of these factors must be attended to, it is essential to ensure that the equations intended to describe the phenomena under study are properly captured and are sufficiently accurate in the parameter range being explored. If that condition is not clearly met, then ancillary experiments should be undertaken, where possible, to decrease the uncertainties. That step might have been incomplete in the case of the ignition experiments at NIF.
Hydrodynamics and mixing at material interfaces are critical processes related to stockpile stewardship, and M&S is a primary means of exploring these phenomena. While the work the committee examined is in general well executed, and the staff involved are working at the state of the art, the committee has concerns about one strategic issue in connection with this work. Those concerns are addressed in the classified Annex to this report.
Transport involves the modeling of the flow of neutrons and other particles and of x rays and other radiation, as determined from their couplings to electrons and ions, which is essential for describing the dynamics of nuclear weapons and related components. The modeling of radiation and neutron transport accounts for the bulk of the computational time in a typical weapon simulation. Accordingly, there is a premium placed on efficient and accurate modeling of radiation and neutron transport (denoted “particle transport”) phenomena at the NNSA laboratories. There are also unclassified applications of particle transport, including nuclear reactors, medical diagnostics and treatments, astrophysics, nuclear fusion, high-power lasers, and some industrial processes.
The NNSA laboratories were the pioneers in developing particle transport methods and continue to be leaders in the field. The computational methods for transport include discretization of the partial differential equation that describes the transport process (denoted “deterministic” transport methods) or stochastic modeling of the underlying radiation or particles (“Monte Carlo” methods). All three NNSA
laboratories are actively involved in the development of deterministic and Monte Carlo codes for transport simulation.
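To make the deterministic/Monte Carlo distinction concrete, the sketch below is a minimal, purely illustrative Monte Carlo model of particle transmission through a one-dimensional slab. It bears no relation to MCNP or any laboratory code; the cross sections and geometry are invented. Particle histories are sampled one at a time: a free-flight distance is drawn from an exponential distribution, and each collision either scatters or absorbs the particle.

```python
import math
import random

def mc_slab_transmission(sigma_t, sigma_s, thickness, n_histories=100_000, seed=1):
    """Estimate the fraction of particles transmitted through a slab.

    sigma_t: total macroscopic cross section (per unit length)
    sigma_s: scattering cross section (absorption is sigma_t - sigma_s)
    Source: mono-directional beam entering the left face at x = 0.
    """
    rng = random.Random(seed)
    transmitted = 0
    for _ in range(n_histories):
        x, mu = 0.0, 1.0                       # position, direction cosine
        while True:
            # Free-flight distance sampled from an exponential distribution.
            x += mu * (-math.log(1.0 - rng.random()) / sigma_t)
            if x >= thickness:
                transmitted += 1               # escaped the far face
                break
            if x <= 0.0:
                break                          # leaked out the near face
            if rng.random() < sigma_s / sigma_t:
                mu = 2.0 * rng.random() - 1.0  # isotropic re-scatter
            else:
                break                          # absorbed
    return transmitted / n_histories

# For a pure absorber (sigma_s = 0) the answer is analytic, exp(-sigma_t * L),
# which provides a simple verification problem for the sampling.
t_est = mc_slab_transmission(sigma_t=1.0, sigma_s=0.0, thickness=2.0)
print(t_est, math.exp(-2.0))
```

The statistical uncertainty of such an estimate shrinks only as 1/sqrt(N), which is one reason production Monte Carlo codes invest heavily in variance-reduction techniques, while deterministic codes instead discretize the transport equation in space, angle, and energy.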
Two notable examples of high-quality efforts in computational transport for general science at the NNSA laboratories are the groups that develop the deterministic transport capabilities in the code KULL at LLNL and the Monte Carlo radiation-transport code MCNP6 at LANL. The KULL code has an unprecedented capability for modeling radiation transport. For example, a recent simulation of the Searchlight experiment with KULL, which modeled the flow of x-ray energy through an evolving density gradient to validate modeling of stellar atmospheres, was one of the largest radiation transport problems ever run at the NNSA laboratories. This calculation included more than 275 million unknowns and ran for 30,000 time steps (to simulate 12 nanoseconds of experiment time!). Perhaps just as significant, the same radiation transport code was used to help plan an optimal signal-collection strategy for a related experiment. There are other examples of successful deterministic transport codes at the NNSA laboratories, but the development of KULL is particularly noteworthy.
The MCNP series of Monte Carlo codes at LANL is recognized as the international “gold standard” for Monte Carlo particle transport, with more than 10,000 users around the globe. There is a rigorous verification and validation process for MCNP revisions, and the MCNP team is internationally respected for its expertise and achievements in Monte Carlo development. The development of MCNP and the support provided to its large user community serve as an indirect peer review of the Monte Carlo methods development at the laboratories; the community’s acclaim for these codes, together with the MCNP team’s strong record of journal publications and conference proceedings, illustrates the high quality of MCNP development and applications.
SNL has long been a leader in developing electron transport and neutron transport methods for determining the radiation dose to electronic components and other devices. This input can then be used by materials scientists to predict the effect of the radiation on device performance. By necessity, this effort requires experimental data to validate the models. The Sandia Pulsed Reactor (SPR) was the source of much of this data, but that facility was decommissioned in November 2006. The QASPR (Qualification Alternatives to Sandia Pulsed Reactor) initiative is an outstanding example of the systematic approach taken by SNL in validating its engineering models for predicting radiation dose, materials damage, and impacts on device performance. Data from SPR experiments have been collected and archived, and they are treated as a resource that complements data from existing experimental facilities.
Within the nuclear weapons design mission, there is a division of labor between LANL/LLNL and SNL. LANL and LLNL are responsible for designing the “physics package,” i.e., the components of the weapon that are directly responsible for the nuclear explosion, and its response to the external environment. SNL is responsible for the design of the non-nuclear components and for integration of the weapon into the delivery system. In contrast to the IMCs used to design the physics package, SNL’s mission requires M&S tools that are more closely aligned with those in other more general engineering applications.
To carry out its mission, SNL has developed a broad range of simulation capabilities in areas such as fluid dynamics and heat transfer, solid mechanics, radiation effects, and electromagnetics. The groups developing these packages have strong software engineering practices, including a layered design that factors different capabilities into reusable components, and a documented software design process. The IMC development at SNL is closely coupled to basic research activities that are funded by the DOE Office of Science to build capabilities for numerical partial differential equations, linear and nonlinear solvers, and grid generation.
Nonetheless, SNL is subject to the same pressures as the other two laboratories in its weapons mission: expansion of the mission, particularly due to the demands of the LEPs, without corresponding increases in resources. In addition, over the past 15 years, SNL has deliberately expanded its sponsor base
to the point where 50 percent of its funding comes from sources other than NNSA and DOE. While laboratory management has attempted to cultivate long-term institutional relationships with these non-DOE stakeholders, in practice many of the projects have been short term, sometimes as little as a year. This places enormous pressure on the scientific personnel: staffing such short-term projects is difficult, delivering on such short timescales is demanding, and individual staff members’ time becomes fragmented. The phase I report from this study1 endorsed this movement of the laboratories into broader “national security” laboratories, but short-term and fragmented work imposes risks to S&E quality over the long term.
Verification, Validation, and Uncertainty Quantification
Laboratory staff presented an analysis of a particular code’s sensitivity to three choices that are made in all M&S, looking particularly at how these choices affected the code’s ability to represent an implosion. The sensitivities examined were (1) parameter variation for a given model, (2) choice of model, e.g., a change in the choice of equation of state (EOS) used, and (3) choices of numerical methods, such as the mesh resolution. The striking result was that all three of these factors were significant—i.e., the code’s output was clearly sensitive to all of them. The presentation noted, though, that parameter variation, while non-negligible for the particular case treated, was less important than either of the other two factors. It also explained that some loss of precision due to choices of numerical methods can be mitigated via higher resolution calculations, albeit at the cost of longer runs. The outcome of this analysis is to highlight the substantial importance of model choice.2
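The three sensitivities can be illustrated with a deliberately trivial example. Everything below is invented for illustration and has nothing to do with the laboratories' codes: a pressure-related output is computed under (1) a perturbed model parameter, (2) a different model, and (3) different numerical resolutions.

```python
import numpy as np

def pressure_ideal(rho, e, gamma):
    """Ideal-gas EOS: p = (gamma - 1) rho e."""
    return (gamma - 1.0) * rho * e

def pressure_stiffened(rho, e, gamma, p_inf):
    """Stiffened-gas EOS, a common alternative model for condensed media."""
    return (gamma - 1.0) * rho * e - gamma * p_inf

def trapezoid(y, x):
    """Trapezoidal quadrature (written out to avoid NumPy-version differences)."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

rho, e = 2.0, 1.5   # arbitrary illustrative state

# (1) Parameter variation within one model: perturb gamma by 5 percent.
p_base = pressure_ideal(rho, e, gamma=1.4)
p_param = pressure_ideal(rho, e, gamma=1.4 * 1.05)

# (2) Model choice: switch the EOS entirely, holding the state fixed.
p_model = pressure_stiffened(rho, e, gamma=1.4, p_inf=0.1)

# (3) Numerical resolution: integrate pressure along a smooth density
# profile with a coarse mesh and a fine mesh.
results = {}
for n in (8, 256):
    x = np.linspace(0.0, 1.0, n)
    results[n] = trapezoid(pressure_ideal(np.exp(x), e, gamma=1.4), x)

print("parameter sensitivity:", abs(p_param - p_base) / p_base)
print("model sensitivity:   ", abs(p_model - p_base) / p_base)
print("resolution effect:   ", abs(results[8] - results[256]))
```

Even in this toy setting all three effects are nonzero and of different sizes; the point of the laboratories' analysis is to rank such effects for the actual IMCs, where the model-choice term is far harder to bound.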
The committee was later informed that a switch from an older EOS to a more modern version had resulted in less accurate results. This outcome prompted staff to reconsider the EOS in question in light of very recent results, and a code improvement is now very likely. The committee found this discussion to be a very positive illustration of the scientific method at work, causing the weapons community to revise its views. More specifically, it shows that the original development of uncertainty quantification (UQ), which emphasized parameter variation, needs to be updated to accommodate the less tractable uncertainty of model choice (as part of epistemic uncertainty). This technical exchange between and within the laboratories is a compelling example of the importance of inter-laboratory peer review.
Effective Use of High-Performance Computing Technology
Overall, the laboratories’ work in M&S related to the core weapons mission is quite impressive. The interplay among basic physics, numerical techniques, and computer science is robust and energetic. Roughly 15 years ago, the laboratories led the transition of M&S codes from vector supercomputers to massively parallel machines. Today, the production codes run on large-scale parallel clusters, while laboratory developers have been able to modify the science codes and associated algorithms to make use of the Blue Gene systems at LLNL and, in some cases, the heterogeneous Roadrunner architecture that arrived at LANL a few years ago. For example, production codes routinely run on parallel clusters using 5,000-10,000 processors and some occasionally run with up to 100,000 processors. However, significant challenges loom ahead, starting with underlying computing technology.
1 National Research Council, Managing for High-Quality Science and Engineering at the NNSA National Security Laboratories, The National Academies Press, Washington, D.C., 2013.
2 M. Henrion and B. Fischhoff, Assessing uncertainty in physical constants, American Journal of Physics 54:791-798, 1986; reprinted in Judgment Under Uncertainty II: Extensions and Applications (T. Gilovich, D. Griffin, and D. Kahneman, eds.), Cambridge University Press, New York, N.Y.
Well-known technology challenges limit the growth of high-performance computing performance: power density is constraining the clock speed of individual processors; the cost of data movement in time and energy continues to grow relative to the costs of computation; failure rates may increase due both to device-level physics and to the multiplicity of components; and total system power places practical restrictions on the design and operation of large systems.3 These considerations have led to plans for new computer architectures that will require the development of new algorithms and programming techniques at enormous software costs, costs that will escalate if the software is not well organized and documented. The Blue Gene and Roadrunner architectures, which represent very energy-efficient processor designs, have provided early examples of the kinds of transitions, analogous to those required for migrating to parallel computers in the 1990s, that will be necessary if production workloads are to keep pace with hardware advances.
In the past, the NNSA laboratories have been leaders in co-developing hardware with vendors and in transitioning applications to these systems. Individuals at the laboratories are well aware of the changes ahead, have support to work on specific software development, and are active participants in the discussion of “exascale” challenges occurring across the DOE community. The goal of building an exascale system capable of performing 10^18 floating point operations per second reflects a broader interest in improving energy efficiency and making sure the systems are balanced, programmable, and generally useful for science. However, in contrast to the way the high-performance computing community within the national laboratories contributed to major architecture transitions in the past, the committee heard concerns about the present lack of a coordinated plan from DOE to support laboratory engagement with the computer industry in developing future generations of systems and in transitioning codes to these platforms. The committee also heard that budget trends are likely to set up a competition between hardware and software.
Software Engineering Practices
Software engineering is important for delivering reliable results and for adapting code to new mathematical or programming techniques. The committee has some doubts about the quality of the software engineering methodology used in the IMCs. When questioned about how software developers at the laboratories use documentation of the model (including its discretization and algorithms) as part of their design process, the response was that the codes under development are too changeable for such an approach to be practical. This answer was provided across the board with respect to code-design projects dating from the mid-1990s to today.
To give a sense of scale of the IMCs, a typical integrated code at LLNL has 750,000 to 1 million lines of code, in addition to shared libraries that are on the order of 3 million lines of code. There are some positives in the software practices at LLNL, including systematic regression testing, revision control, release processes, and some documentation, such as user manuals and, in at least one case, a developer’s manual. However, the codes at both LLNL and LANL lack systematic documentation of the physics, algorithms, and software design, relying instead on the developers themselves to be the repositories of such information. LLNL recognizes this as a serious defect, and it has begun to repair the problem, but to date progress has been limited. This is a potentially unstable situation, given that the expected disruptive change in computer architectures will almost certainly require a reconsideration of all design decisions for the IMCs, a process that will be extremely difficult without documentation of the current designs.
3 P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, et al., ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems, DARPA, Arlington, Va., 2008, available at http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf.
SNL has been notable for modularizing its software packages, with an emphasis on dual use and open-source publication of lower-level modules, e.g., Trilinos. In this respect the overall approach of SNL appears to be superior to that of LLNL or LANL. Some of the SNL codes have become de facto standards (e.g., LAMMPS for molecular dynamics simulations in materials science).
One contrasting point at LANL was an effort to develop protocols for producing new code in a 4-year period, substantially shorter than the typical 10-15 year timeframe. Although the concept is laudable, the committee could not establish that industry-standard software engineering methods were being applied (beyond basic steps such as version control). In contrast, SNL presented evidence that it is indeed applying such methods. While it is a challenge to find staff with the appropriate training, SNL’s M&S staff is committed to the discipline and believes that the broader acceptance of its software (e.g., LAMMPS) outside the nuclear weapons complex can be partly attributed to this adherence to good software engineering practices, for which SNL staff deserve credit.
The impressions gleaned by the committee suggest that all three laboratories have long been seriously committed to staff recruiting, retention, and succession planning. The committee met with energetic and promising groups of postdoctoral researchers, many of whom were drawn to the laboratories’ mission and the opportunity to work with the senior staff. With respect to diversity, the M&S staff at the three laboratories mirrors the general technical population, which means that the fraction of women, for example, is quite low.
Despite these efforts, the committee has concerns about some areas, one being “core” computer science activities, such as computer architecture, systems software, programming models, tools, and the algorithms used in these systems. While there are some outstanding individuals in these areas within the laboratories, there were also signs of difficulty in recruiting and retention. These researchers are mobile because they can easily find challenging and lucrative employment in industry; although their work is necessary to the NNSA mission, they have attractive options elsewhere. The committee was told that researchers and engineers in these areas are more likely to leave mid-career than are people in other disciplines. This does not seem to be an issue for other specialists who are key to the laboratories’ M&S, such as physical scientists, applied mathematicians, computational modeling experts, or even computer scientists in selected areas like scientific visualization—probably because the laboratories offer unusual intellectual challenges for these specialties.
The committee also has a general concern about the dilution of resources devoted to at least some aspects of M&S. Until the mid-2000s, the code teams at LLNL contained some 25 full-time equivalents (FTEs) for each IMC. Since that time, the staffing has decreased to approximately 17 FTEs for each IMC, while the mission demands have expanded. Given the range of activities that are required to meet the M&S challenges identified earlier in this chapter, this appears to be a woeful understaffing. Furthermore, the funding for collaboration with organizations such as the Center for Applied Scientific Computing at LLNL, which have better connections to the science communities, has been shrinking. While the laboratories have continued to invest in cutting-edge computing platforms, they must also invest appropriately in all aspects of software development for stockpile stewardship.
The laboratory scientists and facilities involved in M&S represent a unique national asset, with both depth of expertise in particular technical areas and the experience to integrate across areas to solve critical and challenging problems. LANL, LLNL, and SNL have developed a spectrum of capabilities in M&S that have solved critical problems in national security. The scope of the M&S activities has grown
substantially in recent years, as the laboratories have become more reliant on predictive modeling of an aging weapons stockpile in this era of no testing. At the same time, many of the computational groups are smaller than they were a decade ago.
The committee also observed that funding pressures—budgets that are lower than laboratory staff feels are necessary, fluctuations from year to year, and uncertainties associated with those fluctuations—appear to have had a noticeable impact on the morale of the laboratories’ M&S scientists. The contract changes at LANL and LLNL raised costs and, therefore, contributed to this pressure, along with overall decreases in funding, growing scope, and general increases in the cost of doing business. If planned LEPs divert funding away from M&S research, the situation could worsen.
Finding 5.1. The next decade is expected to bring disruptive advances in computer architectures, with profound consequences for laboratory M&S capabilities. While there is awareness of the issues, DOE and NNSA have not developed a comprehensive plan to respond to the challenges. The issues are not being addressed with the kind of coordinated effort that has characterized prior major DOE initiatives in scientific computing.
Finding 5.2. Changes in materials properties due to weapons aging and component replacement, or due to refurbishment of materials or the use of materials fabricated with processes that differ from those used for the weapons that produced test data, are an increasing source of uncertainty in weapons systems. The laboratories’ staff recognize that new physics-based models capable of addressing these uncertainties must be developed to replace key current models whose reliability is dependent on their calibration to old nuclear test data.
Finding 5.3. The development of predictive codes based on physics modeling requires data for validation and uncertainty quantification, plus close connection between modeling and experiment. The committee shares the concerns of laboratory M&S staff that the increasing difficulty in fielding experiments is undercutting this process. This difficulty is most evident for small-scale experiments, as discussed elsewhere in this report.
Finding 5.4. The quality of the NNSA laboratory scientific and technical workforce is the most important factor determining how well the laboratories respond to computer architecture changes, to the challenge of new physics-based models, and to the need for ancillary experiments for code and physics validation. Maintaining staff quality is a major challenge in the face of budget uncertainties, competition in computer science from other employers, and a perception among some that the scientific environment of the laboratories has eroded.
Finding 5.5. There are substantial needs for higher model fidelity and numerical accuracy in the IMCs. In particular, there are no robustly predictive simulation capabilities (i.e., ones that do not require calibration from UGT data) for multiple key physical phenomena. The staffing levels of the M&S effort are inadequate to meet the needs of retooling the IMCs to meet the simultaneous challenges of developing higher-fidelity simulation capabilities, meeting expanded mission requirements, and changing the algorithms and software architecture of the codes to respond to the disruptive changes in computer architecture expected to occur over the next decade.
Recommendation 5.1. The laboratories should ensure that they have an environment that nurtures broad scientific inquiry to aid in recruiting and retaining a cadre of first-rate, creative, energetic scientists with expertise in all aspects of M&S, ranging from deep understanding of the underlying physics and mathematics to the most advanced ideas in computer architectures, algorithms, and programming methods. They also should track staffing and prioritize activities so as to deal with the growing demands on M&S and related technical challenges.
Recommendation 5.2. Given the increasingly important role that the IMCs will play in certification of the stockpile in the absence of testing, the NNSA should undertake a detailed assessment of the needs for simulation and modeling over the next decade and implement an adequately funded execution plan to meet the challenges outlined in Finding 5.5.