Demand for capacity and capability computing has been growing both in terms of computing requirements and the number of scientists and researchers involved. It is becoming increasingly difficult to balance investments, given the large and growing aggregate demand, the high cost of high-end facilities, and the constant or shrinking National Science Foundation (NSF) resources. Compounding the challenge is the wide variety across scientific disciplines in terms of computing needs, the state of scientific data and software, and the ability of researchers to effectively use advanced computing.
These developments present new challenges for NSF as it seeks to understand the expanding requirements of the science and engineering community; explain the importance of a new broader range of advanced computing infrastructure to stakeholders, including those that set its budget; explore non-traditional approaches; and manage the advanced computing portfolio strategically.
Multiple fields are also transitioning from being primarily compute-intensive (e.g., ab initio simulations in materials science) to being much more data-intensive (e.g., due to the rapid growth of experimental data, the growing use of data analytics, and the automation of calculations searching for materials with desired properties) and may not be prepared for this transition. Some communities may lack sufficient national or communal hardware or software infrastructure to facilitate development of new workflows or to realize economies of scale, and they may not have leveraged best practices and investments established by other communities.
Workflow refers to the series of computational steps required to yield a research result from experimental and/or simulation outputs, and to the tools and processes used to manage those steps and record the provenance of results. The range of science and engineering research sponsored by NSF involves a diverse set of workflows, including those that are primarily compute-intensive, primarily data-intensive, or combinations of both. Additionally, the compute and data capacities and the scale of parallelism required by these workflows can vary by several orders of magnitude. The shift from general-purpose central processing units (CPUs) to more specialized architectures, such as hybrids of general-purpose processors and graphics processing units, which have a much more highly parallel structure, further exacerbates the challenge of aligning workflows with available computing capabilities.
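The notion of a workflow whose steps are managed and whose provenance is recorded can be made concrete with a minimal sketch. The step names, functions, and digest scheme below are illustrative assumptions, not drawn from any particular workflow system:

```python
import hashlib
import json

def run_step(name, func, data, provenance):
    """Run one workflow step and append a provenance record for it."""
    result = func(data)
    provenance.append({
        "step": name,
        # Digests let a later reader verify exactly which inputs
        # produced which outputs.
        "input_digest": hashlib.sha256(json.dumps(data).encode()).hexdigest()[:12],
        "output_digest": hashlib.sha256(json.dumps(result).encode()).hexdigest()[:12],
    })
    return result

# A toy two-step workflow: "simulate" produces data, "analyze" reduces it.
provenance = []
raw = run_step("simulate", lambda d: [x * 2 for x in d], [1, 2, 3], provenance)
mean = run_step("analyze", lambda d: sum(d) / len(d), raw, provenance)

print(mean)                              # 4.0
print([p["step"] for p in provenance])   # ['simulate', 'analyze']
```

Real workflow systems add scheduling, data movement, and restart logic, but the essential record — which step, with which inputs, produced which outputs — is the same.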
A number of technology challenges will affect the ability of NSF and others to deliver the desired advanced computing capabilities to the science and engineering communities. They will require adaptations, such as recoding existing software and writing new software in new ways, while also creating new opportunities for advanced computing users who make the necessary adaptations.
It is an accepted truth today that Moore's Law will end sometime in the next decade, with significant impact on high-end systems. Computing already went through a major technology-driven change in 2004, when a "power wall" in the ability to cool processor chips rewrote the architectural landscape. Graphics processing units (GPUs) are providing a significant increase in computing power per chip and per unit of energy, but often at the cost of needing new algorithms and new software. Data exchange capabilities between processors are also under increasing pressure: as the number of transistors and cores on a die continues to grow rapidly, the number of signaling paths and the rate at which data can be moved over them are growing, at best, slowly. These trends are forcing consideration of new architectures (possibly distinct from those used to build conventional mid-range systems) and of new software approaches to use them effectively. Indeed, future growth in capability may come from an explosion of specialized hardware architectures that exploit the continued growth in the number of transistors on a chip. The transition implied by the anticipated end of Moore's Law will be even more severe: absent the development of disruptive technologies, it could mean, for the first time in over three decades, the stagnation of computer performance and the end of sustained reductions in the price-performance ratio. Redundancy and fault-tolerant algorithms are also likely to become more important. Lastly, power consumption (and its associated cost) is now a significant factor in the design of any large data center. For example, simple extrapolation of existing climate models to resolve processes such as cloud formation quickly leads to a computer that requires costly and possibly impractical amounts of electrical power. These trends, and the uncertainty that surrounds them, pose significant challenges for future investment in extreme-performance computers.
Building data-intensive systems that provide the needed scale and performance will require attention to several technical challenges. These include the following:
• Managing variability and failure in storage components. Very-large-scale, data-intensive systems contain large numbers of storage devices (typically disks), which are often commodity components rather than the higher-quality storage devices generally used in high-performance computing. Although the probability of failure of any single device is low, the aggregate number of failures is high, as is the variability in the time devices take to complete an operation (devices that take longer than the performance of their peers would suggest are sometimes called stragglers). For example, a large part of the complexity of systems like Hadoop results from dealing with failures and stragglers. Research is needed to manage failure and variance more efficiently, especially for a broader range of programming models.
• Very-large-scale scientific data management and analysis. Although this is an active research area, it remains a challenge to manage data at the petabyte (PB) to exabyte scale. File systems, data management systems, data querying systems, provenance systems, data analysis systems, statistical modeling systems, workflow systems, visualization systems, collaboration systems, and data sharing must all scale together. Data analysis is typically an iterative process, and traditional scientific computing approaches often rely on software that was never designed to work at this scale. As a simple example, there is no open-source file or storage system that scales to 100 PB. The commercial sector, on the other hand, has developed data management infrastructure over distributed file systems, producing a variety of new data management systems, sometimes called NoSQL (not only SQL) systems. We are moving into an era of data access through a set of application programming interfaces (APIs) rather than through discrete files. Adapting scientific software to this new environment will be a challenge.
• At-scale interoperability of geographically distributed data centers. Very-large-scale, data-intensive computing relies more on external data resources than is usually the case with high-performance computing. Some of the most interesting discoveries in data science have been made by integrating third-party and external data. Analysis that uses data distributed across multiple locations requires costly, high-capacity network links, and its performance will in any event suffer compared to computation that uses data held in a single location. For this reason, data-center-scale computing platforms benefit from integrating at scale with other such facilities and the data repositories they contain.
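The straggler-mitigation strategy that systems like Hadoop employ — launching a speculative backup copy of a slow task and taking whichever copy finishes first — can be sketched as follows. The simulated task, delays, and deadline are illustrative assumptions:

```python
import concurrent.futures
import random
import time

def fetch_block(block_id):
    """Simulated storage read; roughly one device in ten is a straggler."""
    time.sleep(1.0 if random.random() < 0.1 else 0.02)
    return block_id

def read_with_backup(func, arg, deadline=0.2):
    """If the primary read misses the deadline, issue a speculative
    backup request and return whichever copy completes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(func, arg)
        try:
            return primary.result(timeout=deadline)
        except concurrent.futures.TimeoutError:
            backup = pool.submit(func, arg)
            done, _ = concurrent.futures.wait(
                [primary, backup],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()

blocks = [read_with_backup(fetch_block, i) for i in range(5)]
print(blocks)  # [0, 1, 2, 3, 4]
```

The speculative copy wastes some work when the primary would have finished anyway, which is part of why managing variance efficiently across programming models remains a research question.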
As discussed above, research increasingly involves both compute- and data-intensive computing. What technical and system architectural approaches are best suited to handling this mix is an open question. Federating distributed compute- and data-intensive resources has repeatedly been found to present multiple additional costs and challenges, including, but not limited to, network latency and bandwidth, resource scheduling, security, and software licensing and versioning. Overcoming these challenges could increase participation and diversify resources and might be essential to realizing new science and engineering frontiers by coupling capability computing with experiments producing large data. Avoiding unnecessary federation by consciously co-locating facilities might yield significant cost savings and enhancements to both performance and capability. An additional complication is that many important scientific data collections are not currently hosted in existing scientific computing centers.
Recent advances in cloud data center design (including commodity processors and networks and virtualization of these resources) may make
it cost-effective for data centers to serve a significant fraction of both data-intensive and compute-intensive workloads. Such an approach might also support different use models, such as access via cloud APIs, that complement traditional batch queues. This may prove essential to opening NSF resources to use by new communities and enabling greater utilization. Co-location of computing and data will be an important aspect of these new environments, and such approaches may work best when the bulk of the data exchange can be kept inside a data center.
As described above, most experts believe that the coming end of Moore’s Law and the long domination of complementary metal-oxide-semiconductor (CMOS) devices in computing will force significant changes in computer architecture. Successful exploitation of these new architectures will require the development of new software and algorithms that can use them effectively. New software and algorithms will also be needed for computation that uses cloud computing architectures.
New algorithms and software techniques can also improve the performance of codes and the productivity of researchers. Adoption will depend on establishing incentives for incorporating them into existing applications and on using appropriate metrics to evaluate the effectiveness of applications in context. For example, a code that needs to run for at most a few hundred hours may not need to be very efficient, but one that will run for a million hours should be demonstrably efficient in terms of total run time, not floating-point operations per second (FLOPS).
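The run-time-versus-FLOPS point can be illustrated with back-of-the-envelope arithmetic. All rates below are hypothetical:

```python
def run_cost(node_hours, node_hour_rate, tuning_hours, developer_rate):
    """Total cost of a computing campaign: machine time plus tuning effort."""
    return node_hours * node_hour_rate + tuning_hours * developer_rate

# Hypothetical rates: $1 per node-hour, $100 per developer-hour.
small_untuned = run_cost(200, 1.0, 0, 100.0)        # short job: tuning not worth it
large_untuned = run_cost(1_000_000, 1.0, 0, 100.0)
large_tuned = run_cost(500_000, 1.0, 400, 100.0)    # 2x faster after 400 h of tuning

print(small_untuned)  # 200.0
print(large_untuned)  # 1000000.0
print(large_tuned)    # 540000.0
```

Spending 400 developer-hours to halve the large job's run time saves $460,000 under these assumed rates; the same effort spent on the 200-hour job could never pay for itself. Total cost in context, not peak FLOPS, is the relevant metric.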
Relaxing the "near-perfect" accuracy of computing may usher in a new era of "approximate computing" as a way to cope with system failures, including data corruption, at the massive scale of these new systems. Any investment in cyberinfrastructure will need to account for the need to update, and in many cases redevelop, the software infrastructure for research that has been built up over the past few decades. Under these conditions, innovations in algorithms, numerical methods, and theoretical models may play a much greater role in future advances in computational capability.
New knowledge and skills will be needed to make effective use of new system architectures and software. “Hybrid” disciplines such as computational science and data science and interdisciplinary teams may come to play an increasingly important role. Keeping abreast of a rapidly evolving suite of relevant technologies is challenging for many computer
science programs, especially those with limited partnerships with the private sector. Most domain scientists rely on traditional software tools and languages and may not have ready access to knowledge or expertise about new approaches.
2. The committee will explore and seeks comments on the technical challenges to building future, more capable advanced computing systems and how NSF might best respond to them.
Comments from the science and engineering communities anecdotally suggest a pent-up demand for advanced computing resources, such as unsatisfied Extreme Science and Engineering Discovery Environment (XSEDE) allocation requests for research that has already been peer reviewed and funded by NSF. This need spans all types and capabilities of systems, from large numbers of single commodity nodes to jobs requiring thousands of cores, fast interconnects, and excellent data handling and management.
Since the beginnings of the NSF supercomputing centers, NSF has provided its researchers with state-of-the-art capability computing systems. Today, the Blue Waters system at Illinois and Stampede at Texas represent significant infrastructure for capability computing, augmented by other systems that are part of XSEDE. However, it is unclear whether NSF will be able to invest in future highest-tier capability systems. Mission-oriented agencies in the United States, such as the Department of Energy, as well as international research organizations, such as the Partnership for Advanced Computing in Europe and the Ministry of Science and Technology in China, are pursuing systems that are at least an order of magnitude more powerful, for both computation and data handling, than current NSF systems. Similarly, commercial cloud systems, while not an alternative for applications that require tightly coupled capability systems, have massive aggregate computing and data-handling power.
3. The committee will review data from NSF and the advanced computing programs it supports and seeks input, especially quantitative data, on the computing needs of individual research areas.
4. The committee seeks comments on the match between resources and demand for the full spectrum of systems, for both compute- and data-intensive applications, and the impacts on the research community if NSF can no longer provide state-of-the-art computing for its research community.
Historically, NSF has supported the acquisition of specialized research infrastructure through a variety of processes, including the Major Research Equipment and Facilities Construction and Major Research Instrumentation programs, support for major centers, and individual grants. In many cases, the private sector has provided equipment and expertise, but it has not provided NSF researchers with a significant source of computing cycles or resources. The growth of new models of computing, including cloud computing and publicly available but privately held data repositories, opens up new possibilities for NSF. For example, by supporting some footprint in commercial cloud environments, many more NSF researchers could gain access to compute and data capabilities at a scale currently available only to a few researchers and commercial users. For some fields, this could be transformative.
One of the benefits of cloud computing is the flexible way in which resources are provided to users on demand. Evidence from several studies suggests that this flexibility comes with a monetary cost (which may not be competitive with NSF-supported facilities) that must be balanced against the opportunity cost, in terms of scientific productivity, of the conventional model of allocations and job queues. On the other hand, virtualization, the implied ability to migrate work, and limited oversubscription can decrease overall costs, increase overall system throughput, and increase the ability of the system to meet fluctuating workloads, although perhaps at the expense of the performance of an individual job. The cost trade-offs are complicated and need to be examined carefully.
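One way to frame the trade-off is a simple amortization sketch comparing an owned node-hour to an on-demand cloud node-hour. All figures below are hypothetical, and a real comparison would also need to include staffing, networking, storage, and facility costs:

```python
def owned_cost_per_node_hour(capex, lifetime_years, utilization, power_per_hour):
    """Amortized cost of one node-hour on an owned system.
    Idle time inflates the effective cost, so both terms divide by utilization."""
    lifetime_hours = lifetime_years * 365 * 24
    return capex / (lifetime_hours * utilization) + power_per_hour / utilization

# Hypothetical figures: $10,000 node, 4-year life, 90% utilization,
# $0.05/hour for power and cooling.
owned = owned_cost_per_node_hour(capex=10_000, lifetime_years=4,
                                 utilization=0.9, power_per_hour=0.05)
cloud_on_demand = 0.80  # hypothetical on-demand $/node-hour

print(round(owned, 3))  # 0.373
print(owned < cloud_on_demand)  # True at high utilization
```

Under these assumed numbers the owned node wins, but the conclusion inverts as utilization drops: at 30% utilization the owned cost roughly triples, which is exactly why bursty or fluctuating workloads can favor the cloud model.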
Researchers funded by one agency sometimes make use of computing resources provided by other federal agencies. Today, allocations are made by the agency that operates the advanced computing system on the basis of scientific merit and alignment with agency mission. Other arrangements are possible. NSF could directly purchase advanced computing services from another federal agency. It could also join with other agencies
to contract from a commercial provider or coordinate with other agencies on specifying services and costs in developing requests for proposals for commercial services.
5. The committee seeks comments on the role that private industry and other federal agencies can play in providing advanced computing infrastructure—including the opportunities, costs, issues, and service models. It also seeks input on balancing the different costs and on making trade-offs in accessibility (e.g., guaranteeing on-demand access is more costly than providing best-effort access).
A particular issue that has surfaced in the committee's work so far is the "double jeopardy" that arises when researchers must clear two hurdles: getting their research proposals funded and getting their requests for computing resources allocated. Given the modest acceptance rates of both processes, this double hurdle necessarily diminishes the chances that a researcher with a good idea can in fact carry out the proposed work. Relatedly, researchers do not know in advance on which machine they will be granted an allocation, which may force them to incur the cost and delay of "porting" data and code to a new system (and possibly a new system architecture) in order to use the allocation.
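If the two hurdles are treated as roughly independent, the combined chance of success is the product of the two rates. The rates here are purely illustrative:

```python
# Hypothetical, independent success rates for the two hurdles.
p_funded = 0.25      # proposal acceptance rate
p_allocated = 0.60   # allocation success rate

# Combined probability that a funded project also receives its allocation.
p_both = p_funded * p_allocated
print(p_both)  # 0.15
```

Even with a generous 60% allocation rate, fewer than one in six good ideas would clear both hurdles under these assumptions, which is the arithmetic behind the "double jeopardy" concern.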
6. The committee seeks comments on the challenges facing researchers in obtaining allocations of computing resources and suggestions for improving the allocation and review processes for making advanced computing resources available to the research community.