Autotuning NekBone for OpenACC

Author: Luis Cebamanos
Posted: 3 Jul 2014 | 12:12

Nek5000 being used to simulate turbulent thermal transport in sodium-cooled reactor cores. Courtesy of Argonne National Laboratory.

Heterogeneous HPC architectures are becoming increasingly prevalent in the Top500 list, with CPU-based nodes being enhanced by accelerators or coprocessors optimized for floating-point calculations. This trend is likely to continue as we move towards Exascale (10^18 flops) capable systems, and it is vital that the relevant HPC applications are able to exploit this heterogeneity [1, 2].

Whilst accelerators offer a large boost in peak system speed, it is hard to translate this into sustained application performance. For GPU accelerators, applications are typically rewritten in a low-level language such as CUDA or OpenCL. This is a productivity drawback, with developers having to maintain multiple versions of their code without any guarantee of portability. In addition, the HPC community is nervous about investing substantial software development effort in converting applications to use a programming language that is not portable between different architectures. OpenACC [3, 4], by contrast, is a collection of compiler directives with which the programmer identifies the areas of code that should be accelerated, enabling existing HPC applications to run on accelerators with minimal source code changes.

In CRESTA we have developed an autotuning technology that can address the inherent complexity of programming the latest and future computer architectures. The autotuner provides a framework in which an application developer can try out various optimization strategies in an automated fashion to maximise their application performance. It explores a tuning parameter space by repeatedly building and running the application, and the best run is chosen using a metric obtained from the program execution. The search is currently exhaustive. For each tuning run, the source is preprocessed and compiled as that configuration requires, and the autotuner organises the overall optimization process.
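
The exhaustive search described above can be sketched in a few lines of Fortran. This is only an illustration of the idea, not the CRESTA autotuner's interface: the module name exhaustive_tuner, the routine exhaustive_search, the cost_function interface and the two tuning parameters (a kernel variant and a vector length) are all assumptions made for the example. The cost argument stands in for "build and run one configuration and return its metric", with lower values taken to be better.

```fortran
! Sketch of an exhaustive search over a small tuning space.
! All names are hypothetical; this is not the CRESTA autotuner's interface.
module exhaustive_tuner
  implicit none
  private
  public :: exhaustive_search, cost_function

  abstract interface
     ! Stand-in for "build and run one configuration, return its metric"
     ! (for example a wall-clock time; lower is better).
     function cost_function(variant, vlength) result(metric)
       integer, intent(in) :: variant, vlength
       double precision :: metric
     end function cost_function
  end interface

contains

  subroutine exhaustive_search(cost, n_variants, vlengths, &
                               best_variant, best_vlength, best_metric)
    procedure(cost_function) :: cost
    integer, intent(in) :: n_variants
    integer, intent(in) :: vlengths(:)
    integer, intent(out) :: best_variant, best_vlength
    double precision, intent(out) :: best_metric
    integer :: v, l
    double precision :: metric

    best_variant = 0
    best_vlength = 0
    best_metric = huge(1.0d0)

    ! Visit every point of the parameter space exactly once.
    do v = 1, n_variants
       do l = 1, size(vlengths)
          metric = cost(v, vlengths(l))
          if (metric < best_metric) then
             best_metric = metric
             best_variant = v
             best_vlength = vlengths(l)
          end if
       end do
    end do
  end subroutine exhaustive_search

end module exhaustive_tuner
```

The cost of such a search grows with the product of the sizes of all tuning parameters, which is why it is only practical for modest parameter spaces.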

We have used the CRESTA autotuner with NekBone, a skeleton application of Nek5000, an open-source code used for the simulation of incompressible fluid flow. Nek5000 is employed in a broad range of domains, including the study of thermal hydraulics in nuclear reactor cores, the modelling of ocean currents and the simulation of combustion in mechanical engines.

NekBone has been configured to capture the basic structure and user interface of the extensive Nek5000 software and exposes its main computational kernel to reveal the essential elements of the algorithm-architectural coupling that is relevant to Nek5000 [5].

Implementation

As already noted, NekBone is configured to resemble the basic structure of Nek5000 very closely. A matrix is initialized and then a linear system is solved twice for every computational cycle using a Conjugate Gradient (CG) solver. A large number of small rectangular matrix multiplications take place at each solver iteration. Previous work [6] has demonstrated that the computation of those matrix multiplications dominates the execution time of NekBone. We therefore focused on an OpenACC implementation of different algorithms used to calculate the matrix-matrix multiplications.
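
To show where that work sits inside the solver, here is a minimal sketch of a generic, textbook CG iteration. It is not NekBone's own routine: cg_sketch, cg_solve, operator_apply and apply_a are hypothetical names, and the operator is assumed to be symmetric positive definite. The call to apply_a marks the operator application, the step in which a code of this kind would carry out its small matrix multiplications.

```fortran
! Generic unpreconditioned CG sketch; names are hypothetical, not NekBone's.
module cg_sketch
  implicit none
  private
  public :: cg_solve, operator_apply

  abstract interface
     ! Matrix-free operator application: w = A * p
     subroutine operator_apply(p, w)
       double precision, intent(in) :: p(:)
       double precision, intent(out) :: w(:)
     end subroutine operator_apply
  end interface

contains

  subroutine cg_solve(apply_a, f, x, max_iter, tol, n_iter)
    procedure(operator_apply) :: apply_a
    double precision, intent(in) :: f(:)
    double precision, intent(out) :: x(:)
    integer, intent(in) :: max_iter
    double precision, intent(in) :: tol
    integer, intent(out) :: n_iter
    double precision :: r(size(f)), p(size(f)), w(size(f))
    double precision :: rtr, rtr_old, alpha, beta
    integer :: iter

    x = 0.0d0                 ! zero initial guess, so the residual is f
    r = f
    p = r
    rtr = dot_product(r, r)
    n_iter = 0

    do iter = 1, max_iter
       if (sqrt(rtr) <= tol) exit
       call apply_a(p, w)     ! dominant cost: the operator application
       alpha = rtr / dot_product(p, w)
       x = x + alpha * p
       r = r - alpha * w
       rtr_old = rtr
       rtr = dot_product(r, r)
       beta = rtr / rtr_old
       p = r + beta * p
       n_iter = iter
    end do
  end subroutine cg_solve

end module cg_sketch
```

Everything outside apply_a is vector updates and dot products, which is consistent with the matrix multiplications dominating the run time.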

The main subroutines targeted for optimisation implement independent matrix-matrix multiplication kernels. An example of such a kernel is shown below:

do j = 1, n3
      do i = 1, n1
            c( i, j ) = 0.0
            do k = 1, n2
                  c ( i, j ) = c ( i, j ) + a ( i, k ) * b ( k, j )
            end do
      end do
end do

To execute this kernel on a GPU using OpenACC we add compiler directives. Assuming that the data has already been copied to the GPU, for instance:

!$ACC PARALLEL LOOP PRESENT(a,b,c) PRIVATE(i,j,k)
do j = 1, n3
      do i = 1, n1
            c( i, j ) = 0.0
            do k = 1, n2
                  c ( i, j ) = c ( i, j ) + a ( i, k ) * b ( k, j )
            end do
      end do
end do
!$ACC END PARALLEL LOOP

The PARALLEL LOOP directive indicates that this loop nest will be executed on the GPU, in much the same way that other directive-based APIs such as OpenMP mark regions of code for parallel execution. This should be enough to get part of the code running on a GPU. However, in order to find the optimum kernel we created a number of different implementations of the above kernel using different parameters and OpenACC optimizations. These implementations were then enumerated so that the CRESTA autotuner could identify and compare them.
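
The PRESENT clause in the kernel above relies on a, b and c already being on the device. One way to arrange that is an enclosing OpenACC data region, sketched below. The module and subroutine names are hypothetical, and wrapping a single kernel like this is purely for illustration: in a real application the data region would normally enclose many kernel calls so that the transfers are amortised. When compiled without OpenACC support the directives are treated as comments and the routine runs on the CPU.

```fortran
! Sketch: an OpenACC data region that makes a, b and c present on the device.
! Module and routine names are hypothetical.
module mxm_acc_sketch
  implicit none
contains

  subroutine mxm_with_data_region(a, b, c, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    double precision, intent(in) :: a(n1, n2), b(n2, n3)
    double precision, intent(out) :: c(n1, n3)
    integer :: i, j, k

    ! copyin:  host -> device on entry to the region
    ! copyout: device -> host on exit from the region
    !$acc data copyin(a, b) copyout(c)

    !$acc parallel loop present(a, b, c) private(i, j, k)
    do j = 1, n3
       do i = 1, n1
          c(i, j) = 0.0d0
          do k = 1, n2
             c(i, j) = c(i, j) + a(i, k) * b(k, j)
          end do
       end do
    end do
    !$acc end parallel loop

    !$acc end data
  end subroutine mxm_with_data_region

end module mxm_acc_sketch
```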

Over 10 different implementations of each matrix-matrix multiplication kernel were included in the autotuning benchmark, providing many different computation paths for the NekBone kernel and exploring the following types of optimizations:

  • specific hard-coded versions for different values of n1, n2 and n3 so that these would be constant at compile time;
  • different loop orderings;
  • loop unrolling;
  • hand tiling the matrices into blocks for better cache reuse;
  • calls to DGEMM BLAS routines;
  • matrix values stored explicitly in temporary scalars;
  • loop collapsing.
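
As an illustration of three of these optimisations (a different loop ordering, a temporary scalar and hand tiling), the sketch below places three variants of the kernel behind a single integer identifier, so that a driver can loop over them. These are not the implementations used in the benchmark: the module name mxm_variants, the routine names, the numbering and the tile size bs are assumptions made for the example, and the OpenACC directives are left out to keep it short.

```fortran
! Sketch only: three hypothetical variants of the same kernel, selected by id.
module mxm_variants
  implicit none
  private
  public :: mxm_variant, n_variants

  integer, parameter :: n_variants = 3
  integer, parameter :: bs = 4   ! tile edge; an arbitrary illustrative value

contains

  ! Dispatch on the variant identifier so a driver can try each one in turn.
  subroutine mxm_variant(id, a, b, c, n1, n2, n3)
    integer, intent(in) :: id, n1, n2, n3
    double precision, intent(in) :: a(n1, n2), b(n2, n3)
    double precision, intent(out) :: c(n1, n3)

    select case (id)
    case (1)
       call mxm_jki(a, b, c, n1, n2, n3)
    case (2)
       call mxm_scalar(a, b, c, n1, n2, n3)
    case (3)
       call mxm_tiled(a, b, c, n1, n2, n3)
    case default
       error stop 'mxm_variant: unknown variant identifier'
    end select
  end subroutine mxm_variant

  ! Variant 1: j-k-i order, so the innermost loop has unit stride in a and c.
  subroutine mxm_jki(a, b, c, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    double precision, intent(in) :: a(n1, n2), b(n2, n3)
    double precision, intent(out) :: c(n1, n3)
    integer :: i, j, k

    c = 0.0d0
    do j = 1, n3
       do k = 1, n2
          do i = 1, n1
             c(i, j) = c(i, j) + a(i, k) * b(k, j)
          end do
       end do
    end do
  end subroutine mxm_jki

  ! Variant 2: accumulate each element of c in a temporary scalar.
  subroutine mxm_scalar(a, b, c, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    double precision, intent(in) :: a(n1, n2), b(n2, n3)
    double precision, intent(out) :: c(n1, n3)
    double precision :: tmp
    integer :: i, j, k

    do j = 1, n3
       do i = 1, n1
          tmp = 0.0d0
          do k = 1, n2
             tmp = tmp + a(i, k) * b(k, j)
          end do
          c(i, j) = tmp
       end do
    end do
  end subroutine mxm_scalar

  ! Variant 3: hand tiling into blocks of at most bs-by-bs elements.
  subroutine mxm_tiled(a, b, c, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    double precision, intent(in) :: a(n1, n2), b(n2, n3)
    double precision, intent(out) :: c(n1, n3)
    integer :: i, j, k, ii, jj, kk

    c = 0.0d0
    do jj = 1, n3, bs
       do kk = 1, n2, bs
          do ii = 1, n1, bs
             do j = jj, min(jj + bs - 1, n3)
                do k = kk, min(kk + bs - 1, n2)
                   do i = ii, min(ii + bs - 1, n1)
                      c(i, j) = c(i, j) + a(i, k) * b(k, j)
                   end do
                end do
             end do
          end do
       end do
    end do
  end subroutine mxm_tiled

end module mxm_variants
```

All three variants compute the same product, so a driver can check each one against a reference result before comparing timings.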

The OpenACC parameters used to optimise the kernels included VECTOR_LENGTH, GANG, WORKER and COLLAPSE. For instance:

!$ACC PARALLEL PRESENT(a,b,c) PRIVATE(i,j,k) VECTOR_LENGTH(VLENGTH)
!$ACC LOOP GANG WORKER VECTOR COLLAPSE(2)
do j = 1, n3
      do i = 1, n1
#ifdef SCALAR
            tmp = 0.0
#else
            c(i, j) = 0.0
#endif
            do k = 1, n2
#ifdef SCALAR
                  tmp = tmp + a(i,k) * b(k, j)
#else
                  c ( i, j ) = c ( i, j ) + a ( i, k ) * b ( k, j )
#endif
            end do
#ifdef SCALAR
            c(i,j) = tmp
#endif
      end do
end do
!$ACC END PARALLEL

The autotuning process then reports the best kernel implementations. For a given underlying architecture, this yields the best-performing choice of algorithms and of strategies for deploying them onto whatever accelerator hardware is available on the target machine.

Performance and Conclusions

The optimisation and acceleration of complex software like NekBone on heterogeneous systems can easily be carried out using OpenACC compiler directives. However, naive OpenACC implementations generally show little performance improvement, and more advanced techniques are needed to exploit the available GPU parallelism. In addition, autotuning technologies such as the CRESTA autotuner can help application developers to navigate the complexity of heterogeneous architectures, allowing applications to be optimised by selecting from different optimisation strategies in an automated manner. Through this autotuning process we, in CRESTA, have achieved a performance improvement of up to 20% over a hand-tuned OpenACC version of NekBone. Figure 1 illustrates this and shows that the speed-up achieved also depends on the number of elements, nel, for a matrix of size N.

Figure 1: Performance ratio of autotuned NekBone over hand-tuned NekBone.

References

  1. P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, et al., Exascale computing study: Technology challenges in achieving exascale systems, Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep.
  2. J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert, S. Matsuoka, P. Messina, T. Moore, R. Stevens, A. Trefethen, et al., The international exascale software project: a call to cooperative action by the global high-performance community, International Journal of High Performance Computing Applications 23 (4) (2009) 309–322.
  3. OpenACC standard (Jun. 2013). URL http://www.openacc-standard.org.
  4. R. Ansaloni, A. Hart, Cray’s approach to heterogeneous computing, in: PARCO, 2011.
  5. Nek5000 strong scaling tests to over one million processes (Jun. 2013). URL http://nek5000.mcs.anl.gov/index.php/Scaling.
  6. Markidis, S., Gong, J., Schliephake, M., Laure, E., Hart, A., Henty, D., Heisey, P. and Fischer, P.: OpenACC Acceleration of Nek5000, Spectral Element Code
  7. Fischer, P., Heisey, K.: NEKBONE: Thermal Hydraulics mini-application. Nekbone Release 2.1 (2013), last accessed 10-January-2014: https://cesar.mcs.anl.gov/content/software/thermal_hydraulics