A Community Conversation On Future Extreme Scale Computing

Published Dec. 8, 2020

Even with the unprecedented computing power currently at our disposal, a tenfold increase in supercomputing capacity is required if scientists hope to make significant progress in solving many of today's scientific grand challenge problems.

That's according to leaders of 24 research groups at the forefront of high performance computing (HPC). The researchers participated in a workshop hosted earlier this year by the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, together with its partners at UT Austin's Oden Institute for Computational Engineering and Sciences and the National Science Foundation (NSF).

Discussions and input from the workshop informed a report titled "Future Directions in Extreme Scale Computing for Scientific Grand Challenges."

The report, published earlier this year, identifies a number of scientific grand challenge problems that will drive HPC over the next decade. It addresses not just how HPC technology needs to evolve, but also what kinds of research and programs should be prioritized.

While the report emphasizes the HPC needs of a wide range of scientific problems, "a number of cross-cutting issues emerged, including the need for scalable algorithms for heterogeneous architectures, exploration of parameter space to support such 'outer loop' goals as data assimilation, optimal control, and uncertainty quantification, and integration with artificial intelligence and machine learning tasks," said Omar Ghattas, professor at the Oden Institute.
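As a rough illustration of what such an "outer loop" involves, here is a minimal sketch that wraps a placeholder forward model in a Monte Carlo uncertainty-quantification loop. The forward model, input distribution, and sample count are hypothetical stand-ins chosen for this example, not anything drawn from the report.

```python
import numpy as np

def forward_model(parameter: float) -> float:
    """Hypothetical stand-in for an expensive HPC simulation (e.g., one PDE
    solve); returns a scalar quantity of interest."""
    return np.sin(parameter) + 0.1 * parameter**2

def monte_carlo_uq(n_samples: int = 1000, seed: int = 0):
    """Outer loop: draw samples of an uncertain input, run the forward model
    once per sample, and summarize the resulting output distribution."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=1.0, scale=0.2, size=n_samples)  # assumed input uncertainty
    outputs = np.array([forward_model(p) for p in samples])   # each call = one inner-loop solve
    return outputs.mean(), outputs.std()

if __name__ == "__main__":
    mean, std = monte_carlo_uq()
    print(f"quantity of interest: mean ~ {mean:.3f}, std ~ {std:.3f}")
```

In practice each forward-model evaluation is itself a large parallel simulation, which is why outer-loop studies of this kind multiply computational demand by orders of magnitude.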

[The report is available for download now, and the team is welcoming comments and input on additional grand challenges from the scientific community at lccf-community@tacc.utexas.edu.]

The participants — comprising long-time users of HPC systems, directors of HPC centers, developers of community codes, leaders of large international research groups, as well as up-and-coming computational scientists — stressed the importance of HPC in efforts to achieve transformative research in key areas like global climate modeling, computational vehicle design, and the fundamental understanding of sub-atomic particles.

The report was authored by Omar Ghattas (Oden Institute, Geological Sciences, and Mechanical Engineering), George Biros (Oden Institute, Mechanical Engineering, and Computer Science), Dan Stanzione (TACC), Rick Stevens (Argonne National Laboratory), and John West (TACC) based on in-person discussions and position papers from the participants.

The report addresses both the science challenges researchers face and the advances in cyberinfrastructure and computational science needed to address them. Eighteen of the participants presented their ideas at the workshop, spurring conversations about the future of computational science.

The report provides summaries of cross-cutting topics that span the representative scientific grand challenges presented at the workshop.

Select grand challenges discussed in the report include:

Incorporation of Molecular-scale Processes: The need for multiscale modeling and integration of different modalities to provide sufficient accuracy for design and decision-making; e.g.: atmospheric aerosol simulations.

AI-Enhanced Science: Effectively supporting AI and related methods in an HPC environment and combining these methods with physics-based modeling; e.g.: predicting disruptions in tokamak fusion reactors.

Hypersonic Flight: Coupling codes from different scientific domains to predict and model the evolution of a vehicle and to estimate the uncertainty in the predictions; e.g.: predicting the aerodynamic response of a hypersonic vehicle.

Global Dynamics and Great Earthquakes: Developing a new generation of coupled codes that include both forward and inverse models; e.g.: time-dependent plate-mantle simulations.

Modeling Thermonuclear X-ray Bursts: The ability to solve multiscale and multiphysics simulations, with an extreme range of spatiotemporal scales; e.g.: 3D simulations of a neutron star surface.

Quantum Materials Engineering: The need to develop methods to accurately calculate the quantum behavior of real materials beyond density functional theory; e.g.: electrical conductivity of materials for photovoltaic and plasmonic devices.

Physics of Fundamental Particles: More precise calculations of particle characteristics to reduce errors and accurately compare with experiment; e.g.: Lattice QCD calculations to refine mass estimates of the bottom quark.

Turbulent Flows: Improving turbulence models to reliably predict the average behavior of turbulent flows in common systems; e.g.: direct numerical simulations of the Navier-Stokes equations on extreme resolution grids (a back-of-the-envelope resolution estimate follows below).

[Full position papers provided by attendees are included in Appendix C of the report.]
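As a back-of-the-envelope illustration of why direct numerical simulation demands extreme resolution, the sketch below applies the standard Kolmogorov estimate that the grid for homogeneous isotropic turbulence grows roughly as Re^(9/4). The Reynolds numbers and the bytes-per-point figure are illustrative assumptions, not values from the report.

```python
def dns_grid_points(reynolds_number: float) -> float:
    """Kolmogorov estimate for homogeneous isotropic turbulence: the number of
    grid points needed for direct numerical simulation grows roughly as Re**(9/4)."""
    return reynolds_number ** 2.25

BYTES_PER_POINT = 100  # illustrative: a few double-precision fields plus workspace

for reynolds_number in (1e4, 1e5, 1e6):  # illustrative Reynolds numbers
    points = dns_grid_points(reynolds_number)
    print(f"Re = {reynolds_number:.0e}: ~{points:.2e} grid points, "
          f"~{points * BYTES_PER_POINT / 1e12:.1f} TB of field data")
```

Each tenfold increase in Reynolds number inflates the grid by a factor of roughly 180, before accounting for the correspondingly finer time steps.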

Based on discussions, the report authors identified several common themes that crossed many, if not all, fields of research:

Numerous individual science tasks demand vastly greater computational scale and time; as such, a 10x improvement over baseline application performance should be considered a minimum for the future facility.

Single-node performance and portability present two conflicting challenges. Good single-node performance (i.e., getting as close as possible to the performance expected from a roofline analysis; a minimal roofline sketch follows these themes) is absolutely critical for the efficient utilization of the LCCF, the planned Leadership-Class Computing Facility described below. Portability is critical because few groups can afford frequent, architecture-specific, major code rewrites. The LCCF should provide performance portability tools to balance these two challenges.

Part of the work of extracting science and engineering results happens outside the main simulation or analysis run, in later analysis of the outputs produced. Support for this "expanded" workflow at the appropriate scale, including throughout the data lifecycle, is therefore critical for the facility.

The workload is evolving: methods from artificial intelligence and machine learning are increasingly incorporated into science workloads, and there is growing emphasis on throughput at scale. This work adds to, rather than replaces, simulation at scale.
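To make the roofline reference above concrete, here is a minimal sketch of the classic roofline bound, attainable performance = min(peak compute rate, memory bandwidth x arithmetic intensity). The peak, bandwidth, and kernel intensities are hypothetical numbers chosen only for illustration.

```python
def roofline_bound(peak_gflops: float, bandwidth_gbs: float,
                   intensity_flops_per_byte: float) -> float:
    """Classic roofline model: a kernel is limited either by the machine's peak
    compute rate or by memory bandwidth times its arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical node: 40 TFLOP/s peak compute, 1.5 TB/s memory bandwidth.
PEAK_GFLOPS = 40_000
BANDWIDTH_GBS = 1_500

# Hypothetical kernels and their arithmetic intensities (FLOPs per byte moved).
kernels = [("stencil update", 0.25), ("sparse mat-vec", 0.17), ("dense matmul", 60.0)]

for name, intensity in kernels:
    bound = roofline_bound(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    regime = "memory-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"{name:>14}: attainable ~ {bound:,.0f} GFLOP/s ({regime})")
```

Getting "close to the roofline" then means measuring a rate near this bound for a kernel's intensity; a code far below it on every architecture leaves performance on the table regardless of portability strategy.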

The discussions were not hypothetical. Starting in 2019, TACC was formally invited by NSF to develop a plan for a Leadership-Class Computing Facility (LCCF) — not a single system, but a center for cyberinfrastructure (hardware, software, storage, people, and programs) — that would launch around 2025 and support academic researchers in the U.S. on a decadal scale. The project is being planned for funding as part of the NSF's Major Research Equipment and Facilities Construction (MREFC) process.

"The ten-year initial operational period for MREFC projects will provide the nation's scientists and engineers with a long-term partner, enabling new opportunities for long-term planning and collaboration not possible with shorter awards," according to John West, one of the principals on the planning effort.

The workshop, and the LCCF design process, were held at a moment when uncertainty about the future directions of HPC is high. Processor and system architecture options are rapidly diversifying, and the workloads that centers are being asked to support — once reliably simulation- and modeling-driven — are expanding to include machine and deep learning, data assimilation, new forms of data and visual analysis, urgent computing, and a demand for greater accessibility to allow a far larger number of scholars to use advanced computing. Questions about the appropriateness of various types of future architectures were a common and contentious feature of the workshop discussions.

But, "high performance computing itself is only a part of the solution," the authors wrote. Grand Challenge problems "also require breakthroughs in algorithms, computational science, data management and visualization, software engineering, scientific workflows, and system architecture as well as a community of expertise built around the technological capabilities in these areas to ensure that the technologies (hardware and software) can be translated into practice."

Requirements gathering will continue over the course of the LCCF design period through additional workshops, community events, and other opportunities for input from the community.

"The workshop was incredibly useful in terms of hearing novel ideas from, and taking the temperature of the community," Dan Stanzione, TACC Executive Director, said. "Learning more about the specific research questions and workflows that leading researchers are using HPC for was instructive and will lead to a better vision for our future facility."

By Aaron Dubrow, TACC