From Code to Currents: Julia and GPU Magic Transform Ocean Modeling Efficiency
Authors:
(1) Simone Silvestri, Massachusetts Institute of Technology, Cambridge, MA, USA;
(2) Gregory Wagner, Massachusetts Institute of Technology, Cambridge, MA, USA;
(3) Christopher Hill, Massachusetts Institute of Technology, Cambridge, MA, USA;
(4) Matin Raayai Ardakani, Northeastern University, Boston, MA, USA;
(5) Johannes Blaschke, Lawrence Berkeley National Laboratory, Berkeley, CA, USA;
(6) Valentin Churavy, Massachusetts Institute of Technology, Cambridge, MA, USA;
(7) Jean-Michel Campin, Massachusetts Institute of Technology, Cambridge, MA, USA;
(8) Navid Constantinou, Australian National University, Canberra, ACT, Australia;
(9) Alan Edelman, Massachusetts Institute of Technology, Cambridge, MA, USA;
(10) John Marshall, Massachusetts Institute of Technology, Cambridge, MA, USA;
(11) Ali Ramadhan, Massachusetts Institute of Technology, Cambridge, MA, USA;
(12) Andre Souza, Massachusetts Institute of Technology, Cambridge, MA, USA;
(13) Raffaele Ferrari, Massachusetts Institute of Technology, Cambridge, MA, USA.
Table of Links
Abstract and 1 Justification
2 Performance Attributes
3 Overview of the Problem
4 Current State of the Art
5 Innovations
5.1 Starting from scratch with Julia
5.2 New numerical methods for finite volume fluid dynamics on the sphere
5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability
6 How performance was measured
7 Performance Results and 7.1 Scaling Results
7.2 Energy efficiency
8 Implications
9 Acknowledgments and References
5.1 Starting from scratch with Julia
Oceananigans.jl is an open-source library for ocean-flavored fluid dynamics written from scratch in Julia [7]. Julia is a dynamic, high-level programming language that leverages Just-In-Time (JIT) compilation and LLVM [24] to achieve performance competitive with traditional HPC languages like C or Fortran. Julia has gathered interest as a potential language for HPC [17, 10, 16, 21, 27] and integrates easily with MPI [47, 38]. Most of the Oceananigans.jl code is hardware-agnostic thanks to the Julia package KernelAbstractions.jl [10], which enables performance portability across CPUs and GPUs from different vendors via the JuliaGPU [6, 5] software stack, similar to the capabilities provided by Kokkos [9], OCCA [30], and HIP [1].
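For illustration, the snippet below sketches the hardware-agnostic kernel style that KernelAbstractions.jl enables; the saxpy! kernel and its launcher are hypothetical examples, not Oceananigans source code.

```julia
using KernelAbstractions

# Hypothetical example kernel: the same source compiles for CPUs and
# for the GPU backends of the JuliaGPU stack (CUDA, AMDGPU, ...).
@kernel function saxpy!(c, a, b, α)
    i = @index(Global)
    @inbounds c[i] = α * a[i] + b[i]
end

# `get_backend` inspects the arrays, so the launcher contains no
# vendor-specific code: pass Array for CPU or CuArray for an Nvidia GPU.
function launch_saxpy!(c, a, b, α)
    backend = get_backend(c)
    saxpy!(backend)(c, a, b, α; ndrange = length(c))
    KernelAbstractions.synchronize(backend)
    return c
end

launch_saxpy!(zeros(10^6), rand(10^6), rand(10^6), 2.0)  # CPU run
```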
To our knowledge, Oceananigans is the first ocean model written from scratch for GPUs, rather than ported from existing CPU code. Starting from scratch and using the Julia programming language allowed us to rethink the typical patterns used in ocean and atmosphere dynamical cores. In particular, we developed a system of composable atomic operators that leverages Julia’s functional programming paradigm and effective inlining capabilities to recursively construct large expression trees for calculus on staggered finite volume grids. Using this composable operator system, we fuse the entire tendency computation for each prognostic variable into a single compute-heavy kernel; each kernel depends on only two intermediate diagnostic variables, representing hydrostatic pressure and vertical diffusivity (the latter treated implicitly with a predictor-corrector method).
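The following is a minimal, self-contained sketch of this operator style (illustrative, not Oceananigans source; metric terms and grid spacings are omitted for brevity). Each operator accepts either an array or another operator, so expression trees build up recursively and inline into a single fused expression:

```julia
# Operators accept either an array or another operator (a function of
# indices), so large expression trees are constructed recursively and
# inlined, with no intermediate fields materialized in memory.
@inline getvalue(i, j, k, f::AbstractArray) = @inbounds f[i, j, k]
@inline getvalue(i, j, k, f::Function, args...) = f(i, j, k, args...)

# Atomic staggered-grid operators: difference and interpolation.
@inline δx(i, j, k, f, args...) = getvalue(i+1, j, k, f, args...) - getvalue(i, j, k, f, args...)
@inline ℑx(i, j, k, f, args...) = (getvalue(i+1, j, k, f, args...) + getvalue(i, j, k, f, args...)) / 2

# Example composition: the divergence of an advective flux u * c, where
# the flux itself is an operator expression evaluated on the fly.
@inline advective_flux(i, j, k, u, c) = ℑx(i, j, k, u) * ℑx(i, j, k, c)
@inline div_flux(i, j, k, u, c) = δx(i, j, k, advective_flux, u, c)

u = rand(8, 1, 1); c = rand(8, 1, 1)
div_flux(3, 1, 1, u, c)  # one fused expression, no temporary arrays
```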
Such a high degree of abstraction yields a number of innovations. First, kernel fusion maximizes efficiency on GPUs. Second, almost all intermediate quantities are computed on the fly, so Oceananigans is extremely memory-efficient and can perform global ocean simulations at resolutions up to 1/4 degree on a single Nvidia V100. Finally, because all compute-heavy kernels rely on a single “tendency kernel function” applied at each grid index i, j, k, we can easily optimize performance by rapidly prototyping techniques that overlap computation and communication. The small number of kernels per time step and of temporary variables means that Oceananigans’ algorithmic structure differs markedly from that of current ocean models, which typically allocate 10 to 100 times the minimum necessary memory [4] and distribute computations across many small kernels [51]. We argue these algorithmic differences are a major factor in Oceananigans’ energy efficiency and time-to-solution on GPU systems.
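Continuing the sketch above, such an expression tree is evaluated at every grid index inside one compute-heavy kernel; tracer_tendency! below is an illustrative stand-in for the model’s tendency kernel function, not the actual implementation:

```julia
using KernelAbstractions

# The fused expression tree from the previous sketch is evaluated at
# each grid index (i, j, k) inside a single kernel, so no temporary
# arrays are allocated for the intermediate fluxes. Halo points are
# assumed so that the i + 1 and i + 2 reads of the stencil stay in bounds.
@kernel function tracer_tendency!(Gc, u, c)
    i, j, k = @index(Global, NTuple)
    @inbounds Gc[i, j, k] = -div_flux(i, j, k, u, c)
end
```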
5.2 New numerical methods for finite volume fluid dynamics on the sphere
Our results use Oceananigans.HydrostaticFreeSurfaceModel, which solves the hydrostatic Boussinesq equations in a finite volume framework on staggered C-grids [3]. Oceananigans’ hydrostatic model employs an implicit-explicit second-order Adams-Bashforth time-stepping scheme, in which vertically implicit diffusion is implemented with a backward Euler time discretization and a tridiagonal solver.
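As an illustration of the vertically implicit step, the sketch below performs one backward Euler diffusion update for a single column by solving the resulting tridiagonal system with the Thomas algorithm. Constant κ and Δz and no-flux boundaries are simplifying assumptions; this is a sketch of the idea, not the model’s implementation.

```julia
# One backward Euler step of vertical diffusion, (1 + 2λ)cᵏ - λcᵏ⁻¹ - λcᵏ⁺¹ = cⁿ
# with λ = Δt κ / Δz², solved in place column-by-column (Thomas algorithm).
function implicit_vertical_diffusion!(c, κ, Δt, Δz)
    Nz = length(c)
    a = similar(c, Nz); b = similar(c, Nz); d = similar(c, Nz)
    λ = Δt * κ / Δz^2
    @inbounds for k in 1:Nz
        a[k] = -λ          # lower diagonal
        d[k] = -λ          # upper diagonal
        b[k] = 1 + 2λ      # main diagonal
    end
    b[1] -= λ; b[Nz] -= λ  # no-flux boundaries at column top and bottom
    @inbounds for k in 2:Nz          # forward elimination
        w = a[k] / b[k-1]
        b[k] -= w * d[k-1]
        c[k] -= w * c[k-1]
    end
    c[Nz] /= b[Nz]                   # back substitution
    @inbounds for k in Nz-1:-1:1
        c[k] = (c[k] - d[k] * c[k+1]) / b[k]
    end
    return c
end

implicit_vertical_diffusion!(rand(30), 1e-2, 60.0, 10.0)  # example column
```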
A major innovation is a new adaptive-order scheme based on weighted essentially non-oscillatory (WENO) reconstructions [42] for advecting momentum and tracers on curvilinear finite-volume grids [43]. The new scheme automatically adapts to changing spatial resolution and permits stable, high-fidelity simulations of ocean turbulence without explicit dissipation or hyper-dissipation. It thereby reduces setup time when changing or increasing resolution, while guaranteeing high-fidelity solutions with the minimum necessary dissipation of sharp, near-grid-scale features.
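A configuration along the following lines selects WENO advection in Oceananigans. This is a sketch based on the public API at the time of writing; constructor names, keywords, and defaults may differ across versions.

```julia
using Oceananigans

# A coarse global grid for illustration (1-degree, 30 vertical levels).
grid = LatitudeLongitudeGrid(CPU(); size = (360, 160, 30),
                             longitude = (-180, 180),
                             latitude = (-80, 80),
                             z = (-4000, 0))

# WENO reconstructions for momentum and tracers; no explicit dissipation
# or hyper-dissipation operators are configured. Time stepping defaults
# to the quasi-second-order Adams-Bashforth scheme described above.
model = HydrostaticFreeSurfaceModel(; grid,
                                    momentum_advection = WENOVectorInvariant(),
                                    tracer_advection = WENO(order = 7),
                                    buoyancy = SeawaterBuoyancy(),
                                    tracers = (:T, :S))
```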
5.3 Optimization of ocean free surface dynamics for unprecedented GPU scalability
In hydrostatic ocean models with a free surface, the vertically-averaged, two-dimensional “barotropic mode” has dynamics orders of magnitude faster than the three-dimensional “baroclinic” component and must be treated by a special “barotropic solver”. Due to communication overhead, barotropic solvers in current ocean models (whether implicit or explicit) are a major bottleneck, accounting for between 40% [22] and 60% [48, 36] of the cost of a typical IPCC-class ocean simulation.
Oceananigans’ excellent scalability is enabled by an innovative optimization of the parallel barotropic solver: extra computation is traded for reduced communication by exploiting the two-dimensionality of the barotropic problem. Our new barotropic solver is based on explicit subcycling of the barotropic mode. Widening the barotropic halo region to match the number of explicit subcycles (typically 10 to 30) means communication is required once per time step rather than once per subcycle, reducing the frequency of communication by a factor of 10 to 30. As a result, the barotropic solver accounts for less than 10% of the total cost of a time step. Moreover, the sparsity of communication enabled by the new solver allows all communication operations to be overlapped with computational workloads, as sketched in Figure 2.
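The arithmetic behind the wide halo can be illustrated with a one-dimensional toy (not model source) that substitutes a 3-point diffusion stencil for the barotropic shallow-water update:

```julia
# Toy sketch of wide-halo subcycling: with halo width H equal to the
# number of substeps S, each explicit substep consumes one ring of valid
# halo data, so the halos need refreshing only once per time step.
function subcycle!(u, S, λ)
    H = S                        # halo width equals the number of substeps
    N = length(u) - 2H           # number of owned (interior) points
    unew = copy(u)
    for s in 1:S
        # After s substeps only indices (1 + s):(N + 2H - s) are still
        # valid, so the updated region shrinks inward by one ring per
        # substep instead of triggering a halo exchange.
        for i in (1 + s):(N + 2H - s)
            unew[i] = u[i] + λ * (u[i-1] - 2u[i] + u[i+1])
        end
        u .= unew
    end
    return u                     # owned points H+1:H+N are fully advanced
end

# 16 owned points, 4 substeps: one halo fill (here, the initial state)
# replaces the 4 exchanges a naive implementation would perform.
u = zeros(16 + 2 * 4); u[12] = 1.0
subcycle!(u, 4, 0.1)
```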