DaCeMI: Harnessing future hardware using Data-Centric ML Integration
PI: Torsten Hoefler (ETH Zurich)
Co-PI: Tal Ben-Nun
July 1, 2021 – June 30, 2024
Project Summary
Modern supercomputing is characterized by a Cambrian explosion of hardware architectures, triggered by the end of Dennard scaling and Moore’s law. Accelerators (e.g., NVIDIA or AMD GPUs) are commonplace and contain built-in special-purpose ASICs that accelerate certain computations on specific data types, and domain-specific accelerators are on a sharp rise. A single compute node may host several such accelerators, and their number varies between systems. Harnessing the full power of an HPC system requires immense low-level knowledge, and today the entire burden, including choosing the right data types (e.g., floating-point bit precision), “massaging” code so that compilers can optimize it, and redesigning communication and memory-sharing strategies, falls on the domain scientist.
Data-centric programming is an emerging paradigm that promises to alleviate many of these challenges. By centering scientific workflows on the structure of data dependencies and on which processor is responsible for each data element, many programming concepts (overlapping computation with communication, vectorization, data-type replacement, fine- and coarse-grained parallelism, pipelining, and many more) can be abstracted away, paving the way towards performance portability. In practice, however, these models are still in their infancy: while they provide the capability to manually transform code so that it exploits all of the above features and attains near-optimal performance, determining which parts of a general program can, for example, be reduced in precision remains largely unexplored.
The goal of the DaCeMI (Data-Centric ML Integration) project is to improve a forward-looking, semi-automated (human-in-the-loop) workflow that scales to whole-application optimization for current and anticipated future HPC platforms. The key observation we make is that differentiable programming, i.e., applying automatic differentiation and the latest advances in ML to existing applications, lets us determine data requirements (such as bit-width) and automatically generate and train deep neural networks that imitate subsets of the code. These networks, in turn, yield code that is already optimized for different architectures without additional effort.
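To make the surrogate idea concrete, the following minimal PyTorch sketch trains a small network to imitate a stand-in code region; the kernel, network size, and training loop below are hypothetical placeholders rather than project code.

```python
# Hypothetical sketch: train a small neural surrogate to imitate a code region.
# reference_kernel, the network shape, and the sampling domain are placeholders.
import torch
import torch.nn as nn

def reference_kernel(x: torch.Tensor) -> torch.Tensor:
    # Stands in for the application subset we want the network to imitate.
    return torch.sin(x[:, :1]) * torch.exp(-x[:, 1:2] ** 2)

surrogate = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.rand(256, 2)                     # sample inputs from the kernel's domain
    loss = nn.functional.mse_loss(surrogate(x), reference_kernel(x))
    optimizer.zero_grad()
    loss.backward()                            # automatic differentiation drives the fit
    optimizer.step()
```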
For this purpose, we will use the DaCe framework [4]. DaCe lifts existing Python/NumPy code (and is currently being extended to support C and Fortran programs) into a dataflow-graph intermediate representation, which can be transformed to provide all of the aforementioned optimizations and more. The resulting program can be mapped to different hardware architectures; DaCe has been tested successfully on Intel and IBM CPUs, NVIDIA and AMD GPUs, and Xilinx and Intel FPGAs. The framework is led by the PI’s team at ETH Zurich and fuels several scientific applications in Switzerland and elsewhere in Europe. These include quantum transport simulation at ETH Zurich (winner of the 2019 ACM Gordon Bell Prize [32]), numerical weather prediction in collaboration with MeteoSwiss and CSCS, spectral analysis optimization at KTH, and deep neural network training for sequence transduction with performance comparable to industry-leading efforts. DaCe allows performance engineers to interactively or programmatically optimize applications, from the communication pattern down to microkernels, carefully tuning workloads for dimensions and properties that hardware vendor libraries do not optimize for.
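As an illustration of this lifting workflow, the sketch below follows DaCe’s documented Python interface (`@dace.program`, `to_sdfg()`); the SAXPY kernel and sizes are illustrative, and exact calls may differ between DaCe versions.

```python
import numpy as np
import dace

N = dace.symbol('N')  # symbolic size, resolved at call time

@dace.program
def saxpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = a * x + y

# Lift the Python/NumPy code into the dataflow (SDFG) intermediate representation;
# graph transformations (e.g., sdfg.apply_gpu_transformations()) can then retarget
# the same program to other architectures without touching the source.
sdfg = saxpy.to_sdfg()

x, y = np.random.rand(1024), np.random.rand(1024)
saxpy(a=2.0, x=x, y=y)  # JIT-compiles and runs the generated code
```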
This project is a software development effort to improve the usability of DaCe for direct simulations, differentiable programming, and interaction with deep learning frameworks such as TensorFlow and PyTorch, in order to integrate deep learning into the application workflow. We will focus on today’s and future GPU architectures and their new ML-targeting features, such as matrix-multiply-accumulate units, structured sparsity, BVH-querying ASICs, low- and mixed-precision computation, and whatever may arise during the course of the project. To facilitate optimization for new hardware, we will introduce integrated interfaces that run a guided search over different configurations of general-purpose programs, enabling these features to be harnessed.
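As a sketch of what such a guided search could look like for precision selection, the snippet below evaluates a representative kernel at several floating-point widths against a double-precision reference; the kernel, candidate types, and tolerance are hypothetical placeholders, not an existing DaCe interface.

```python
# Hypothetical precision-selection sweep: accept the narrowest data type whose
# result stays within a user-defined error tolerance of the float64 reference.
import numpy as np

def kernel(A, B):
    return A @ B                               # representative compute kernel

rng = np.random.default_rng(0)
A64, B64 = rng.random((512, 512)), rng.random((512, 512))
reference = kernel(A64, B64)

tolerance = 1e-3
for dtype in (np.float64, np.float32, np.float16):
    result = kernel(A64.astype(dtype), B64.astype(dtype)).astype(np.float64)
    rel_err = np.abs(result - reference).max() / np.abs(reference).max()
    verdict = "acceptable" if rel_err < tolerance else "rejected"
    print(f"{dtype.__name__}: max relative error {rel_err:.2e} ({verdict})")
```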
On the application side, we see an uptick in large-scale deep learning for scientific applications, with models that do not fit on a single GPU and datasets in the 1–100 terabyte regime. We will thus accelerate deep learning workloads with a focus on the Transformer architecture, one of the most influential DNNs for sequence data analysis. In addition, integrating ML into numerical simulation is an ongoing effort in meteorological simulations [25] and in COVID-19 research [17]. As such, we will give specific attention to optimizing applications that couple ML and modeling/simulation (ModSim) in a single workflow. Since the Python-based DaCe framework can perform whole-program optimization (rather than invoking an ML framework as a black box), we expect to exploit unique acceleration opportunities in data movement and fusion between the two workloads. As an indirect but beneficial result of generating a full library, the code will be directly deployable and interoperable with systems already in use in Switzerland.
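The toy sketch below illustrates such a coupled ML/ModSim loop and the host-to-framework copies that a whole-program view could fuse away; the stencil, the correction network, and the coupling interval are hypothetical placeholders.

```python
# Hypothetical ML/ModSim coupling: a stencil-based time step interleaved with a
# learned correction. The explicit NumPy<->PyTorch copies mark the data-movement
# boundary that whole-program optimization could eliminate or fuse.
import numpy as np
import torch
import torch.nn as nn

correction_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(8, 1, 3, padding=1))

def diffusion_step(u: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    # Simple 5-point stencil standing in for the numerical simulation.
    return u + alpha * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                        np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)

u = np.random.rand(64, 64)
for t in range(100):
    u = diffusion_step(u)
    if t % 10 == 0:                            # apply the ML correction periodically
        with torch.no_grad():
            u_t = torch.from_numpy(u).float()[None, None]   # copy into the framework
            u = (u_t + correction_net(u_t)).squeeze().numpy().astype(u.dtype)
```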
We see several points of intersection with teams at CSCS, including but not limited to those working on ML-powered software tuning for hardware architectures (DBCSR), numerical simulation libraries (CP2K, GridTools), and domain-specific languages (GT4Py).