The National Science Foundation’s (NSF’s) current model of cyberinfrastructure, including advanced computing, is based on a mix of centralized and distributed funding, anchored by the Division of Advanced Cyberinfrastructure (ACI) within the Directorate of Computer and Information Science and Engineering (CISE). Previously, ACI was the Office of Cyberinfrastructure (OCI), reporting to the director. This central structure currently supports the Blue Waters facility (a leading-edge facility) and a set of smaller computing and storage resources via the Extreme Science and Engineering Discovery Environment (XSEDE). In addition to these centrally funded resources, the Geosciences Directorate operates advanced computing facilities at the National Center for Atmospheric Research (NCAR), and it, along with other NSF directorates, funds cyberinfrastructure through a variety of programs.
Advanced computing shares many elements of other NSF infrastructure investments, but it also differs in some profound ways. First, unlike advanced telescopes or particle accelerators, where there is no competing commercial market, a vibrant computing industry develops new technologies and products and responds to market needs and opportunities that dwarf computing expenditures in academia and by federal research sponsors. Second, computing market shifts and the well-documented, rapid evolution of computing technology mean that researcher expectations and economically viable computing technologies change every few years. Consequently, advanced computing capital assets have a very short operational lifetime, in marked contrast to many other scientific instruments. These shifts, however, do not mean that long-term planning is unnecessary or impossible. Businesses and academia regularly develop strategic information technology (IT) plans that accommodate technology shifts.
Third, advanced computing is distinguished by its universality; it is applicable to all scientific and engineering domains, spanning data capture and analysis, simulation and modeling, and communication and collaboration. Fourth, and consequently, demand for advanced computing continues to grow rapidly, placing increasing stress on the financial models and social processes used to support research cyberinfrastructure. Although states, universities, and companies have long subsidized the capital and operating costs of NSF’s leading-edge advanced computing, those costs have now reached tens to hundreds of millions of dollars. Consequently, the willingness of these parties to engage in “pay to play” (i.e., accept losses in exchange for publicity or collateral institutional advantage) has declined accordingly.
The unique attributes of advanced computing create both opportunities and challenges for any NSF strategy, requiring both nimbleness in the face of changing technologies and economics and stability to ensure sustained capabilities and research continuity. The following basic principles will help ensure the sustainability of NSF’s advanced computing strategy:
- Realistic business assessment that exposes the true costs and subsidies of cyberinfrastructure deployment and operation at all scales;
- Identification and tracking of technology trends and economics, along with the research opportunities they create;
- Long-term planning and articulated strategy (a roadmap) that allows the broad research community and service providers to plan accordingly;
- Balanced support for computing hardware, storage systems, and networks, along with professional staff, software and tools, and operating budgets; and
- NSF-wide commitment to cyberinfrastructure investment, strategic directions, and operational processes.
Three crosscutting aspects of sustainability are particularly crucial: continuity, coverage, and skills.
6.1.1 Service Continuity and Adaptability
Service continuity encompasses long-term strategic planning and sustainability on a decadal or longer timescale. NSF’s Major Research Equipment and Facilities Construction (MREFC) projects for scientific infrastructure typically involve years of planning. Today, NSF’s cyberinfrastructure facilities are rarely used to support computational modeling and data analysis for MREFC projects. Because these cyberinfrastructure facilities have lifetimes of just a few years, it is impractical for MREFC project leaders to reduce overall costs of advanced computing by including NSF’s own cyberinfrastructure facilities in the MREFC operational plan. This must change if common cyberinfrastructure is to support MREFC projects and other long-term community research.
Historically, most research data has been produced by carefully planned experiments, and it has been both expensive to capture and highly guarded by the researchers who produced it. Ubiquitous, inexpensive sensors and a new generation of large-scale scientific instruments, including MREFC infrastructure, have changed the economics of data capture and are shifting scientific expectations about data retention and community sharing.
Although NSF’s recent requirement that all NSF-funded research projects have a data management and accessibility plan is an explicit policy recognition of data’s importance, there is no NSF-wide cyberinfrastructure strategy or program to support disciplinary or cross-disciplinary data sharing and preservation. Hence, much of the data preservation responsibility and financial burden rests on individual investigators and their home institutions. Today, when the cognizant investigators no longer perceive value in retaining the data, those data are often lost. This is increasingly problematic as the longer-term research value of data often accrues to those in other disciplines.
6.1.2 Service Coverage: Breadth and Depth
In its earliest form, cyberinfrastructure was synonymous with high-performance computing and computational science. Today it encompasses not only high-performance computing but also large-scale data archiving and analytics, software codes and tools, and human expertise and computing-mediated research and discovery. Orthogonally, cyberinfrastructure spans the capabilities and needs of individual investigator laboratories, campus sites, regional and national research facilities, and commercial cloud service providers.
Any comprehensive cyberinfrastructure strategy must include the entire spectrum of services and span the entire range of organizational
scales. It cannot be simply about leading-edge supercomputing platforms or just about big data analytics; it must integrate both at multiple scales. Nor can it focus on hardware infrastructure while neglecting both software development and maintenance and training and support of technical expertise. It must balance sustainability against adaptation, recognizing that community needs evolve and technology shifts drive new solutions.
The rise of “big data” as a cyberinfrastructure challenge that rivals the scale and complexity of advanced scientific computing is indicative of this need for community adaptation. To respond appropriately to this technology shift and opportunity, NSF must adapt its investments and infrastructure. Big data will require big infrastructure, just as leading-edge computational science does, and will likely involve a mix of both centralized facilities and decentralized repositories at universities. The Australian eResearch initiative, with its Australian National Data Service, provides a relevant example.
In this context, the NSF community would benefit from a coherent, big data retention and preservation strategy and capability, one that balances investigator and disciplinary differences against communal benefit and research collaborations. Unfunded mandates for retention and preservation will not be workable. A balanced model is likely to require greater total funding, a better balance of capital and operating budgets, more focus on business practices and return on research investment, and greater coordination across NSF directorates.
6.1.3 Skills and Workforce
Sustainable and effective cyberinfrastructure depends critically on the skills and expertise of domain scientists and of committed and well-trained advanced computing professionals. Even if they are not directly responsible for code development and workflow management, scientists using advanced computing need to be generally knowledgeable about these matters. For their part, technical staff members not only deploy and operate facilities, but also support community toolkits and codes, serve as keepers of institutional knowledge and expertise, and manage and ensure data security and provenance. Unlike hardware, with a lifetime of a few years, the human infrastructure (the accumulated experience of the people who operate these systems) has a lifetime of decades. Despite their importance, these staff often lack clear academic career paths and depend on an uncertain stream of funding for support.
Given the global competition for computing and computational science talent, any cyberinfrastructure plan must include mechanisms that recognize and reward professional staff and ensure they have career opportunities that retain their talent within the academic community. One important contribution to retaining and rewarding this skilled workforce is stability in funding for centers, recognizing that developing an expert staff is a long-term process that can be wasted by even a short-term gap in staff funding.
Programs are also needed to train future computational science and data analytics experts. The report of the NSF Task Force on Cyberlearning and Workforce Development,1 which addressed this issue in depth, considers more broadly the use of computer-based approaches in learning and recognizes the need to train both the workforce that supports advanced computing and the practicing scientists who make use of it. Effective use of advanced computing systems requires specialized and advanced training. NSF computing centers and other centers of advanced computing expertise (academic departments involved in advanced computing, national laboratories, and private industry) have leveraged their in-house expertise to offer such training. Examples include training programs for users offered by XSEDE and Blue Waters and the Argonne Training Program in Extreme Scale Computing. Such programs could benefit from a more formal approach and, in particular, long-term support for training materials and resources.
The pervasive NSF-wide and nationwide nature of advanced computing presents a perhaps unique opportunity, and responsibility, to pursue NSF’s diversity and inclusion goals.2 This includes ensuring the broadest possible benefit from and access to NSF’s cyberinfrastructure, as well as translating this participation into creating and sustaining a computationally skilled workforce that reflects our nation. XSEDE has made significant progress in increasing the number of underrepresented minority and women users and, more notably, principal investigators (PIs) with allocations. The successful XSEDE campus champions program is a human network, which, while pursuing its primary mission of “empowering campus researchers, educators, and students to advance scientific discovery,”3 also serves other missions including advancing diversity through increased awareness, training, and education. Increased access to statistics and metrics, concerning not just PIs and users but also those accessing online materials or participating in events or using other services, could better inform and guide actions by NSF, XSEDE, and the community, and XSEDE is already working toward increased public access to data.

1 National Science Foundation, Advisory Committee for Cyberinfrastructure, Task Force on Cyberlearning and Workforce Development Final Report, March 2011, https://www.nsf.gov/cise/aci/taskforces/FrontCyberLearning.pdf.
2 National Science Foundation, Diversity and Inclusion Strategic Plan 2012-2016, http://www.nsf.gov/od/odi/reports/StrategicPlan.pdf, accessed January 29, 2016.
Although NSF’s current mix of centralized and distributed cyberinfrastructure has had many notable successes, it is not without problems, both for infrastructure providers and for the research community. Some of these problems are rooted in history, some are embedded in the NSF culture, and some are consequences of NSF’s organizational structure.
6.2.1 Competitive Challenges
From its origins, NSF’s advanced computing programs—the original 1980s supercomputer centers program, the 1990s Partnership for Advanced Computational Infrastructure (PACI) program, the 2000s Distributed and Extensible Terascale Facilities, and now XSEDE—have all been based on a repeated cycle of competitions to host and operate large-scale cyberinfrastructure. This cycle continues to pit putative operators—universities and national laboratories—against one another in irregularly scheduled “winner take all” competitive battles. In each case, competitors build ad hoc hardware and software vendor alliances to mount proposals. To compete, they also leverage institutional funds to cover facility, hardware, and operations costs (which are capped in the competitions as a percentage of hardware costs). Much of this difficulty is rooted in the lack of distinction between research and infrastructure funding, which have widely differing timescales and success metrics.
Not only does repeated infrastructure competition on 2- to 5-year cycles create strong disincentives for national collaboration, it convolves performance review, recompetition, and strategic planning in ways that are challenging for all. In addition, it leads to proposals designed to win a competition rather than maximize community scientific returns. For example, it places a premium on sometimes unproven, next-generation technology that can serve as a vendor-marketing showpiece, rather than on proven, production-quality infrastructure, and researchers have little input into vendor selection, configuration options, or service models. (There is a role for facilities to test novel and risky computing technologies, but it is not in production systems.)
Researchers whose work depends on access to shared facilities also face a form of “double jeopardy.” The scientific merit of their proposed work is assessed via the standard peer review process. However, if funded, they are still not assured of access to the computing and storage resources
they need to conduct their research. A separate proposal for shared cyberinfrastructure access must be submitted to either the XSEDE Resource Allocation Committee (XRAC) or the Petascale Computing Resource Allocations Committee (PRAC), which assesses the competence of the researcher and his or her team to use the cyberinfrastructure resources efficiently. However, there is little operational follow-up to ensure the resources are in fact used wisely and efficiently. This is especially problematic because the monetary value of computing resource awards continues to increase.
Finally, as discussed earlier, the current model is structured largely in support of individual investigator and small team projects, with a nominal 3-year lifetime. Larger disciplinary projects and major scientific instruments (e.g., NSF MREFC projects or cross-agency partnerships) with longer production cycles have no mechanism to plan for and request cyberinfrastructure for a 10- or 20-year horizon, because there is no guarantee that any of the extant cyberinfrastructure facilities will still be operational. This adversely affects data preservation activities in particular, because, by definition, they target long-term access.
6.2.2 Structural Challenges
Since the beginning of the NSF supercomputing centers program in the 1980s, NSF ACI and its predecessor organizations have supported computational science research across NSF and provided services to a user base that spans all federal research agencies. Despite the clear recognition that computational science and data analytics are true peers with theory and experiment in the scientific process, NSF-wide coordination and support remain somewhat informal and ad hoc, with directorate participation often a secondary responsibility of the designees.
Although researchers in all NSF directorates are critically dependent on cyberinfrastructure, at present there are no formal mechanisms for coordinated strategic planning, nor are there ready ways to pool and disburse shared resources. Concretely, there are no shared negotiations for discounted infrastructure or services, nor an accepted strategy for prioritizing the balance of individual investigator, campus, and shared infrastructure. NSF would benefit from a formal roadmapping committee for cyberinfrastructure with representatives drawn from all directorates and shared responsibility for cross-directorate resource investment and strategy. In addition, it is crucial that advanced computing be treated as an NSF asset and funded accordingly, regardless of its organizational location. The need is too great and current resources are too limited for loosely coordinated action and reactive processes.
One corollary to the need for strategic coordination is scaling and scoping to match available resources. As a decentralized organization,
with frequent rotation of program officers, NSF regularly launches new programs and initiatives. For research, this is the distinguishing characteristic of NSF; it is community driven and adaptive. For infrastructure, this is often debilitating, because it leads to a proliferation of small efforts and projects that consume critical resources. When building and operating infrastructure, it is critical to do a small number of things extremely well. Successful infrastructure is derived from a sustained strategy and driven by relentless focus. The implication for NSF is clear. Given limited cyberinfrastructure resources, it must do a very small number of things extremely well, avoiding mission creep and resource dilution at all costs.
A second and equally important corollary is an integrated strategy for high-performance computing and big data analytics and a concomitant rebalancing of investments. Big data requires strongly coordinated big infrastructure, just as leading-edge computational science requires advanced computing systems. The lessons of commercial cloud computing are clear; centralization and scale create unprecedented opportunities for innovation and discovery. Clear and unambiguous requirements for data deposit and access are also needed. Only via such a mechanism, developed in broad community consultation, can the true benefits of data analytics be realized.
As the scale and scope of advanced computing demands and associated facilities and services have grown, the irregular, winner-take-all process described above has become more problematic. First, the scale and cost of high-end or leadership-class facilities needed to meet researcher demands is a large fraction of the total currently available in the NSF budget, whether within the ACI division budget or the budgets of other directorates. (Whether NSF needs a leadership-class or high-end system should be determined by the analysis of science requirements.) NSF could afford to purchase a significantly larger system than it is currently acquiring, but only by focusing on that investment rather than a larger number of much smaller investments.
Second, uncertainty regarding the timing and capability of infrastructure upgrades makes community planning difficult, and the timing is often not well matched to vendor hardware and software upgrade cycles. Third, the timescales are incompatible with the planning and life cycle of other scientific infrastructure, making use of centrally funded cyberinfrastructure difficult at best and often impossible.
Current models of funding for advanced computing (based on periodic recompetition) and service block allocations (via committee) create substantial uncertainty regarding service continuity and research access.
There are several ways to address these shortcomings while retaining the best elements of the current approach. These include approaches as varied as public-private partnerships for access to cloud services, federally funded research and development centers (FFRDCs) for organizational sustainability, and MREFC projects for facility construction. Many of these are not mutually exclusive and could be combined to address limitations of the current model.
6.3.1 A Regular Cadence of Infrastructure Investments
The cost of leading-edge advanced computing facilities and user support, whether for computational modeling or data analytics, is no longer measured in tens of millions of dollars; it is now denominated in hundreds of millions. Indeed, large-scale commercial data centers operated by cloud providers now cost over $1 billion each. The MREFC process may be a useful point of departure. Although some aspects of MREFC projects match the needs of advanced computing infrastructure, the current MREFC mechanisms may need to be modified and adapted to its unique characteristics, including the general-purpose nature of computing and the need for regular refresh of computing equipment.
To establish a regular cadence of infrastructure investments, NSF would plan and budget an upgrade every 3 to 5 years, with planning and construction of each generation overlapping the operation of the previous generation. This would clarify and systematize the technology upgrade and refresh process, provide a community mechanism to plan and shape infrastructure transitions, elevate budget planning and prioritization to NSF-wide discussion and approval, and provide the level of funding needed to maintain leading-edge capability.
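The arithmetic of such an overlapping cadence can be sketched with purely illustrative numbers (the 4-year cadence, 2-year planning and construction period, and 5-year service life below are assumptions for illustration, not recommendations):

```python
# Sketch of an overlapping upgrade cadence: each generation is planned and
# built while the previous one is still in production, so service never lapses.
# All durations are illustrative assumptions, not NSF policy.

CADENCE = 4        # years between production starts (within the 3-5 year range)
PLAN_BUILD = 2     # years of planning/construction before production
OPERATE = 5        # years of production service

def schedule(first_production_year, generations):
    """Return (generation, plan_start, production_start, retirement) tuples."""
    rows = []
    for g in range(generations):
        prod = first_production_year + g * CADENCE
        rows.append((g + 1, prod - PLAN_BUILD, prod, prod + OPERATE))
    return rows

for gen, plan, prod, retire in schedule(2018, 3):
    print(f"Gen {gen}: planning begins {plan}, production {prod}-{retire}")
```

Because each generation operates for a year beyond its successor's production start, the schedule yields continuous capability while concentrating each construction effort into a predictable budget window.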
As with MREFC projects, NSF would be able to request new funds as a line item in its annual budget request, explicitly acknowledging that current, internal funding is inadequate to meet burgeoning need and scientific priorities. Finally, it would provide an operational instantiation of an NSF-wide advanced computing roadmap.
6.3.2 Leased Infrastructure
Historically, NSF cyberinfrastructure facilities have been operated by academic institutions on NSF’s behalf, typically via cooperative agreements. In turn, the academic institutions have purchased computing, storage, and networking hardware from computing vendors at the start of the cooperative agreement to deliver the committed services. This hardware then depreciates over its nominal 3- to 5-year lifetime until its
residual economic value is minimal and its performance and capability are no longer competitive. At that point, only another infusion of capital will ensure service continuity.
Rather than purchasing hardware at the time of an award, NSF or its awardees might choose to lease the desired hardware from a vendor or a system integrator. In the simplest variation of this model, the hardware remains the property of the vendor but is located at the operator’s facility. From an operational perspective, a simple leasing model is indistinguishable from outright purchase. Alternatively, the hardware could be hosted and maintained at a vendor facility, with a division of hardware service and user support between the partners.
Annual lease payments would smooth the punctuated budget shock of capital acquisitions, allowing amortization across multiple budget years. Lease terms at a higher level might also include periodic hardware upgrades to maintain leading-edge capability (e.g., equipment could be upgraded during the life of a cooperative agreement without competition to meet a series of performance targets) as well as quality of service and/or performance guarantees. Leases could also include exit clauses for termination, either with or without cause.
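The budget-smoothing effect can be illustrated with purely hypothetical figures (a $100 million system, a 10 percent total lease premium, and a 5-year term are assumptions; real lease terms also bundle upgrades and service guarantees not modeled here):

```python
# Hypothetical comparison of annual budget profiles ($ millions):
# a one-time capital purchase versus level lease payments over the same term.
# All figures are illustrative assumptions.

purchase_cost = 100.0   # paid entirely in year 1
lease_premium = 1.10    # assume leasing costs ~10% more in total (financing)
years = 5

purchase_profile = [purchase_cost] + [0.0] * (years - 1)
annual_lease = purchase_cost * lease_premium / years
lease_profile = [annual_lease] * years

print("purchase:", purchase_profile)
print("lease:   ", [round(x, 1) for x in lease_profile])
```

The purchase profile concentrates the entire cost in one budget year; the lease profile spreads a modestly larger total evenly, which is easier to sustain within an annual appropriation.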
This is not a new idea. For example, the Department of Energy (DOE) has used this strategy successfully for its leading-edge computing deployments. University supercomputing centers in Japan also use leasing, which permits regular and stable annual funding for each center.
6.3.3 Commercial Cloud Service Purchases
The explosive growth of commercial cloud services and their widespread adoption by both large corporations and small start-ups offers another alternative for provisioning advanced computing, but it is not a panacea (Boxes 6.1 and 6.2). Cloud computing now allows large organizations to outsource the provisioning, maintenance, and operation of computing infrastructure and commodity services, allowing them to focus resources and expertise on their core competence and differential value proposition. For smaller companies, the ability to purchase services on a pay-as-you-go basis has reduced capital start-up requirements and lowered the barrier to market entry. The same could be true for individual laboratory users whose computing use is highly episodic, with periods of low and high utilization.
The ability to scale services rapidly and dynamically across a wide range of demand is a consequence of the massive scale of cloud service deployment. All of the major cloud service vendors are investing billions of dollars annually to offer advanced computing and data analytics services. In addition, market competition is driving rapid declines in
service costs and frequent service expansions (e.g., in software tools and packages).
NSF could make cloud services available to its researchers in one of several ways. All would likely involve NSF negotiating a bulk purchase agreement for data analytics and computing services.
- Individual investigators could request cloud services as part of a standard NSF proposal. The PIs of funded proposals could spend awarded funds with the cloud service provider of their choice. This is possible today, although cloud services incur indirect costs that may be more than 50 percent at many institutions, making them significantly less attractive than they otherwise would be compared to the purchase of computing hardware. This presently seems inequitable because the cost to an institution for purchasing cloud services is more akin to that of a recurring credit card charge or a subcontract. By bulk purchasing, NSF could eliminate this additional cost as well, potentially receiving more favorable rates than single investigators could obtain. Alternatively, mechanisms to reduce the indirect cost rate charged on cloud services could be explored.
- The current computing allocation review process could be expanded to include award of cloud services. Approved users would receive a budget to be spent with their chosen cloud provider. This would ensure centralized assessment of the appropriateness and likely efficiency of the request, albeit with the double jeopardy of separate research and computing reviews.
- NSF could negotiate an agreement with one or more commercial cloud service providers (e.g., Amazon, Google, or Microsoft) and then operate a virtual facility on behalf of its users. In this model, user and application support would still rest with a noncommercial entity (e.g., via a cooperative agreement with an academic institution), and the cloud vendor would provide computing and storage services. NSF could leverage the Internet2 organization’s NET+ initiative, which has selected commercial cloud services for its members and negotiated pricing and other terms.
All of these approaches would help take advantage of the rapid evolution of cloud services, the vibrant software ecosystem for cloud data analytics, the ability to use resources at massive scale, and the presence of large, shared data sets.
To address the structural disparity in the cost of cloud services compared to hardware acquisition, NSF would need to address the facilities and administrative (F&A) costs now charged for purchase of cloud services. Today, researchers can include cloud services as direct costs in research proposals, but these services are not excluded from the modified total direct cost (MTDC) on which F&A is computed. In contrast, capital equipment costs (e.g., computing equipment exceeding $5,000) are excluded from MTDC. The result is that $1 of cloud service costs $1.XX, where XX is the F&A rate at the researcher’s institution. In contrast, the equivalent service on computing equipment purchased by an investigator on a research award costs only $1. In addition, power, cooling, and space for equipment are included in F&A, further skewing the incentive toward equipment purchase rather than service purchase. Removing this inequity would allow a more direct comparison and researcher selection based on perceived research value.
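A small calculation makes the disparity concrete (the 55 percent F&A rate below is an illustrative assumption; actual rates vary by institution):

```python
# Effective cost to a research award of $1 of cloud service versus $1 of
# capital equipment under typical indirect-cost rules.
# The 55% F&A rate is an illustrative assumption, not any specific
# institution's negotiated rate.

F_AND_A_RATE = 0.55  # hypothetical institutional F&A rate

def effective_cost(direct_cost, excluded_from_mtdc):
    """Cost charged to the award: the direct cost plus F&A, unless the item
    is excluded from the modified total direct cost (MTDC) base."""
    if excluded_from_mtdc:
        return direct_cost
    return direct_cost * (1 + F_AND_A_RATE)

cloud = effective_cost(1.00, excluded_from_mtdc=False)
equipment = effective_cost(1.00, excluded_from_mtdc=True)
print(f"$1 of cloud services costs the award ${cloud:.2f}; "
      f"$1 of capital equipment costs ${equipment:.2f}")
```

Under this assumed rate, every dollar of cloud service consumes $1.55 of the award, while every dollar of excluded capital equipment consumes only $1.00, which is the structural incentive toward equipment purchase described above.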
6.3.4 Cooperative Agreement Extension
Any funding and organizational structure must balance organizational stability and sustainability against responsiveness to technological change and customer needs. As noted earlier, NSF has long supported leading-edge cyberinfrastructure via a series of solicitations and open competitions. Although this has stimulated intellectual competition and increased NSF’s financial leverage, it has also made deep and sustainable collaboration difficult among frequent competitors. Individual awardees quite rationally often focus more on maximizing their long-term probability of continued funding, rather than adapting and responding to community needs.
Frequent competitions have also made it more difficult for NSF-funded service providers to recruit and retain talented staff when the horizon for funding is only 2 to 5 years. This is especially true when the competition for IT and computational science expertise with industry is so great. Periodic review and rigorous performance assessment need not be coupled with “life or death” proposal competition and cooperative agreement funding.
Other federal agencies regularly review the performance of their service facilities, providing strategic and tactical guidance, without coupling those reviews to a facility termination decision. For example, DOE operates its National Energy Research Scientific Computing Center (NERSC) in this model. Hardware acquisition decisions, management reviews, and service priorities are subject to stringent reviews, but NERSC itself is not subject to termination review each time a new system is acquired. This also allows more honest and forthright discussion of problems, without existential fears.
NSF could consider designating one or more cyberinfrastructure centers as a core facility with a nominal lifetime of a decade—for example, as part of an extended cooperative agreement. Working with NSF and under regular review, the center would deploy and operate cyberinfrastructure on NSF’s behalf. This would ensure organizational lifetime and planning horizons more similar to those of other NSF MREFC projects, which often last 10 to 20 years. In addition, longer horizons would also let NSF and its service providers evolve services and staffing in response to changing community needs and business partnerships. As extant examples, NSF’s National Radio Astronomy Observatory and National Optical Astronomy Observatory play these roles in the astronomy community.
6.3.5 Federally Funded Research and Development Centers
As noted above, continuity is crucial to strategic planning, staff retention, and cross-domain partnerships. Cooperative agreements, whether for MREFC projects or other initiatives, provide one mechanism for collaborative planning and management. Implicit in all such approaches is a presumption that the project has a bounded lifetime. In turn, that presumption profoundly and adversely affects strategic planning and a commitment to sustainability within NSF and the community.
The centrality of advanced computing to research suggests that NSF treat it as a long-term, indefinite commitment that more clearly delineates the distinction between performance review and accountability and organizational continuity and service capabilities. Such separation
would allow service providers to work more collaboratively with NSF on responses to community needs and would encourage interorganizational collaboration.
An FFRDC is an excellent example of this balance. FFRDCs are independent nonprofit entities sponsored and funded by the U.S. government to meet specific long-term technical needs in areas of national interest. They operate as long-term strategic partners with their sponsoring government agencies. Many FFRDCs, such as DOE laboratories, include multiple programs spanning many areas of science and engineering research. NSF already uses an FFRDC, NCAR, as an integral part of NSF’s cyberinfrastructure service strategy for the geoscience community; it can budget and plan new equipment acquisitions, and it offers staff career paths and continuity.
NSF could consider establishing one or more FFRDCs to support national cyberinfrastructure for research. Working with NSF, industry, and academia, such cyberinfrastructure FFRDCs could develop a strategic plan for cyberinfrastructure that meets evolving community needs, tracks technology developments, and provides a roadmap for NSF’s directorates. The FFRDCs would also deploy and operate general or domain-specific cyberinfrastructure for the national community.
6.3.6 Partnerships with Other Agencies
NSF could explore partnerships with other federal agencies. For example, NSF could coordinate complementary leadership-class system configurations with DOE, especially with DOE systems used to support the DOE Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. The purpose of such a partnership is not to shift responsibility for providing cycles from NSF to DOE; rather, it recognizes that the configuration space for advanced cyberinfrastructure is not one-dimensional. Such a partnership would develop mechanisms to serve the special needs of each agency's user population fairly. For example, today NSF operates a system with more memory than any DOE system; conversely, DOE operates a system with more GPUs and peak floating-point operations per second (FLOP/s) than any NSF system. Currently, computational scientists request time on a variety of resources, taking advantage of DOE, NSF, and other providers of advanced computing infrastructure to the science community. But there is no formal interagency coordination of system acquisitions, and trade-offs are made independently. Partnerships with other agencies could help ensure that the full spectrum of advanced cyberinfrastructure is available to the science community.
6.3.7 Strategic Public-Private Partnerships
As the demand for cyberinfrastructure continues to rise, the costs of deployment and operation rise commensurately. This is true both for aggregate demand (laboratory and institutional capabilities) and for leading-edge computing and data storage systems. Superficially, this may seem paradoxical, given the dramatic increases in computing and storage capability regularly delivered by the computing industry. However, those same computing advances have birthed new sensors and scientific instruments and a torrent of new digital data, as well as new simulation models and expectations for ever-larger computing capability.4
Rising demands for computing and storage (end-to-end capabilities, not just hardware) now challenge the finances and social processes of both NSF and its academic grantees. Simply put, the rising cost of leading-edge facilities (NSF Track 1 and Track 2 systems) is not sustainable under the current partnership model and may not be sustainable under any government-funded model. Put another way, the perceived return on investment for a facility costing hundreds of millions of dollars must be substantial, particularly when the equipment has a useful lifetime of only 3 to 5 years.
NSF might consider alternative public-private partnership models that create financial incentives for private-sector partners to operate large-scale cyberinfrastructure facilities on the research community’s behalf. These necessarily require more flexible approaches than traditional fee-for-service models and might include such options as access to university intellectual property in exchange for cyberinfrastructure services. Precisely how such arrangements might work would depend on the willingness of the academic community to agree on, for example, vendor exclusivity and intellectual property sharing.
6.3.8 User-Driven Acquisition and Allocation
All of the operational strategies described above are based on some variant of central planning and resource management. Alternatively, NSF could decentralize cyberinfrastructure acquisition and support and rely on social and economic forces to define and optimize community cyberinfrastructure. A first step in this process would be denominating all services in dollars, rather than the abstract, normalized service units (SUs) or storage allocations used today. SUs play an important role by enabling comparison of allocations on computers that may differ widely in both architecture (e.g., conventional processors or graphical processing units) and time of deployment. For instance, the use of SUs makes more quantitative the assessment in Figure 2.5 of resources over the past decade. Despite this merit, however, SUs obscure from users the actual costs associated with requests and allocations, and they distance both NSF programs and the user community from the processes that prioritize how the underlying funding is allocated. Moreover, each site establishes the conversion factor between actual wall time on a computational resource and SUs based on High-Performance Linpack benchmark results, a single and dated metric that does not capture the many factors determining the capability (which is more than raw performance) of individual applications mapped to different architectures. Recently, XSEDE has begun notifying both users and the associated NSF program managers of the actual dollar value of an allocation, and there appear to be significant potential benefits in making users even more cognizant of, and ultimately responsible for, the actual costs and effective use of resources.

4 The end of Dennard scaling and limits on future microprocessor performance increases mean the "free lunch" of regular performance doubling is over, bringing new and sobering economic constraints. Larger capability will require larger capital infusions.
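The arithmetic of denominating an allocation in dollars rather than SUs can be made concrete with a brief sketch. All rates below are hypothetical placeholders, not actual XSEDE or site conversion factors:

```python
# Hypothetical sketch: translating an abstract service-unit (SU) award into
# an approximate dollar figure via a site's SU conversion factor.
# The numeric rates used in the example are illustrative only.

def allocation_cost_dollars(su_awarded, su_per_core_hour, cost_per_core_hour):
    """Convert an SU award to dollars.

    su_per_core_hour  : the site's (benchmark-derived) SU conversion factor
    cost_per_core_hour: the site's fully burdened operating cost per core-hour
    """
    core_hours = su_awarded / su_per_core_hour
    return core_hours * cost_per_core_hour

# Example: a 1,000,000 SU award at 1 SU per core-hour and $0.05 per core-hour
print(allocation_cost_dollars(1_000_000, 1.0, 0.05))  # → 50000.0
```

Even this trivial calculation illustrates the report's point: a "million SU" award sounds abstract, while "a $50,000 allocation" immediately invites users to weigh costs against benefits.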
Realizing these benefits can certainly start with increasing user awareness of costs and engaging users in resource planning and acquisition. In a more extensive realization of this model, however, individual researchers or research teams would be allowed to spend awarded cyberinfrastructure dollars at their discretion. This cyberinfrastructure marketplace might include the following options:
- Purchasing local computing infrastructure, services, or staff support for use within the individual researcher’s laboratory;
- Contributing dollars to a university pool that operates a campus facility under a “campus condominium” model;5
- Pooling research dollars to purchase and operate shared regional or national facilities; and
- Purchasing commercial cloud services, exploiting the properties of elasticity and on-demand access.
All of these variants allow individual researchers and research teams to make independent decisions on how best to advance their research. They also remove researchers from double jeopardy, in which they must compete separately for research funding and for computing resources. In addition, the options expose the costs of each choice in a common currency. However, the risk is that the sum of locally optimal research decisions may not be globally optimal for the national community.

5 Under a condominium model, a university purchases a baseline computing and storage infrastructure and allows individual researchers to purchase and contribute nodes and storage to the shared pool. Researchers receive access priority in proportion to their financial contribution.
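The condominium model's allocation rule (access priority in proportion to financial contribution) amounts to a simple proportional-share computation. A minimal sketch, with hypothetical contributor names and amounts:

```python
# Illustrative sketch of condominium-model scheduling shares: each
# contributor's priority weight is its fraction of the total financial
# contribution to the shared pool. All names and amounts are hypothetical.

def fair_shares(contributions):
    """Map each contributor to its fraction of the total contribution."""
    total = sum(contributions.values())
    return {who: amount / total for who, amount in contributions.items()}

# A lab contributing 60% of the pool receives a 0.6 scheduling share.
shares = fair_shares({"lab_a": 60_000, "lab_b": 30_000, "baseline": 10_000})
print(shares["lab_a"])  # → 0.6
```

In practice a fair-share scheduler would apply these weights over a decay window rather than instantaneously, but the proportionality principle is the same.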
Moreover, some form of such a model may provide an effective mechanism to encourage and formalize the investments and responsibilities of researchers, institutions, and regions in private and shared local or national infrastructure. NSF already recognizes that there are significant computing resources "at the edges" (meaning within campuses and states) and that there is a clear need to coordinate and leverage investments. Programs such as the Campus Cyberinfrastructure—Data, Networking, and Innovation Program (CC*DNI) and Major Research Instrumentation help develop this infrastructure, and elements of XSEDE, such as campus champions, are directed toward tying both communities and cyberinfrastructure together. However, the same economic and technological forces driving decisions on national computing infrastructure are eroding the ability of campuses to purchase and operate their own cyberinfrastructure; the cost and complexity of managing research data are especially challenging. Thus, smaller institutions are now choosing to invest in infrastructure operated by larger neighbors or at national centers, which can provide cost and other advantages compared to attempting to use the commercial cloud. In the absence of a scalable national model, however, such partnerships remain ad hoc. The NSF Big Data Regional Innovation Hubs (BD Hubs) program is potentially a powerful catalyst for regional synergy, but it still needs to be tied to a national narrative that encompasses all aspects of advanced cyberinfrastructure.
Variations of this economic model have been explored in the past. Known at the time as the "green stamps" model of resource allocation, it was analyzed in the 1995 Report of the Task Force on the Future of the NSF Supercomputer Centers Program.6 The report noted:
The key concept in a green stamp mechanism is the use of the stamps to represent both the total allocation of dollars to the Centers and the allocation of those resources to individual PI’s. NSF could decide a funding level for the Centers, which based on the ability of the Centers to provide resources, would lead to a certain number of stamps, representing those resources, being available. Individual directorates could disperse the stamps to their PI’s, which could then be used by the researchers to purchase cycles. Multiple stamp colors could be used to represent different sorts of resources that could be allocated.
The major advantages raised for this proposal are the ability of the directorates to have some control over the size of the program by expressing interest in a certain number of stamps, improvements in efficiency gained by having the Centers compete for stamps, and improvements in the allocation process, which program managers could achieve by making normal awards that included a stamp allocation.

6 National Science Foundation, Report of the Task Force on the Future of the NSF Supercomputer Centers Program, NSF9646, September 15, 1995, https://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf9646.
Other than the mechanics of overall management, most of the disadvantages of such a scheme have been raised in the previous sections. In particular, such a mechanism (especially when reduced to cash rather than stamps) makes it very difficult to have a centralized high-end computing infrastructure that aggregates resources and can make long-term investments in large-scale resources.
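The mechanics of the green-stamps proposal, with stamp colors distinguishing resource classes, can be sketched as a simple ledger. All entities here (PIs, colors, amounts) are hypothetical illustrations of the mechanism described above, not part of the 1995 proposal's text:

```python
# Toy sketch of the "green stamps" mechanism: directorates disperse stamps
# of various colors (resource classes) to PIs, who redeem them at centers
# for cycles. All names and quantities are hypothetical.

from collections import defaultdict

class StampLedger:
    def __init__(self):
        # pi -> color -> stamp count
        self.balances = defaultdict(lambda: defaultdict(int))

    def disperse(self, pi, color, count):
        """A directorate grants `count` stamps of `color` to a PI."""
        self.balances[pi][color] += count

    def redeem(self, pi, color, count):
        """A PI spends stamps at a center; fails if the balance is short."""
        if self.balances[pi][color] < count:
            raise ValueError("insufficient stamps")
        self.balances[pi][color] -= count

ledger = StampLedger()
ledger.disperse("pi_1", "compute", 100)   # directorate award
ledger.redeem("pi_1", "compute", 40)      # cycles purchased at a center
print(ledger.balances["pi_1"]["compute"])  # → 60
```

The sketch also makes the report's criticism visible: nothing in the ledger aggregates demand across PIs, so no single center can plan long-term, large-scale investments against such fragmented purchasing power.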
NSF could conduct a pilot project to evaluate the power of market forces in allocating limited cyberinfrastructure support. Among the issues to evaluate is whether such an approach would exacerbate the problem of buying resources by the hour (see Section 5.5) without recognizing the fixed costs, such as the cost of retaining staff and supporting the use of new architectures.
Independently of any pilot projects, NSF will benefit by expressing in dollars the true cost of large cyberinfrastructure resource allocations (i.e., those now made by the XSEDE Resource Allocation Committee [XRAC] and Petascale Computing Resource Allocation Committees [PRAC]). First, it would allow researchers to identify the value of cyberinfrastructure awards to their institutions. Second, and equally important, it would make clear that such large allocations have true costs, encouraging wise and efficient use.