Day 5 - Wrapping up the week
Posted: 21 Jun 2015 | 20:02
The final analysis and future plans
A week ago we finished our 5 days of intensive work optimising CP2K (and to a lesser extent GS2) for Xeon Phi processors. As discussed in previous blog posts (Day4, Day3, Day2, Day1), this was done in conjunction with research engineers from Colfax, and built on the previous year's work on these codes by EPCC staff through the Intel-funded IPCC project.
It was not realistic to expect dramatic performance improvements in just a week, given the complexity and scale of the codes we are looking at (both are modern Fortran codes, with CP2K consisting of over 800,000 lines of code and GS2 over 200,000), and the optimisation work already done on both through the IPCC project and other code-optimisation projects (such as the UK-funded dCSE and eCSE programmes).
However, the collaboration was extremely useful, enabling us both to get another perspective on our approaches to optimising these codes and to check our understanding of the Xeon Phi with other experts in the field.
Some performance improvements to CP2K were achieved, notably by replacing allocate/deallocate calls in some of the code's lower-level kernels with static arrays sized large enough to hold any data passed to those kernels throughout a run of the program. This replacement was not as straightforward as it first appeared: some of these routines are called from different places in the code, so the local and global sizes of the static arrays have to be passed in for correctness, necessitating code modifications throughout CP2K.
This optimisation has provided significant speed-ups in the OpenMP code with current versions of the Intel compiler, although it should be noted that this is because the change works around a performance issue in the OpenMP functionality of those compiler versions. However, it has also shown modest performance improvements in the MPI code, so it is being pursued throughout the rest of the code.
Another issue identified through MPI profiling was that only a restricted set of processes was being used for some of the PETSc library functionality, leaving processes idle during parts of the code. Discussion with the code developers revealed that this was intentional, due to poor scaling of the PETSc functionality when that part of the code was developed. We will therefore re-assess PETSc performance to see whether this restriction can be removed, allowing more processes to be active in these intensive sections of the code.
Finally, our previous investigation of vectorisation optimisation for CP2K was validated: the core computational kernels simply have too small an iteration space for effective vectorisation. Addressing this would require higher-level restructuring of CP2K, either making the data structures passed to these core routines contiguous so that vectorisation can be done at a higher level, or caching local calculations and reusing them in future calls on the same data elements.
This type of optimisation follows the "revolution" rather than "evolution" approach that is often required to exploit computing hardware such as the Xeon Phi processor. Unfortunately, discussion with the code developers and further analysis suggest the code does not easily lend itself to such an approach. Further investigation will be carried out to confirm this, but it does not currently look possible.
Once we have completed the allocate/deallocate and PETSc MPI optimisation work on CP2K, we will document the performance achieved and the work undertaken in a report, and release it on our IPCC web page.
As discussed in a previous blog post, we did some work examining the vectorisation performance of GS2 to see whether any areas could be optimised. We are already working on replacing arrays of complex numbers with arrays that separate the real and imaginary parts into two distinct arrays, to help with the vectorisation of the core computational kernels. The complex data type is one of Fortran's strengths for programming scientific algorithms, but because it stores the real and imaginary parts of a data point contiguously in memory, it can hurt the performance of operations that work on only one part of a complex number, or that perform separate operations on each part.
It was also nice to be able to share tips and procedures for the various performance tools that are important when optimising codes for the Xeon Phi: Intel's VTune, Vectorization Advisor, and ITAC tools, Allinea's MAP profiler, and even Cray's CrayPat profiler. I think both sides learnt a lot collaborating with these tools to analyse and understand code performance on Xeon and Xeon Phi processors.
I would strongly recommend this type of intensive working process, or "hackathon" style working, for anyone undertaking optimisation and porting work on scientific simulation codes and algorithms. Being able to really focus on a small number of codes with a group of people for a significant period enables quick progress. It also helps if you have the code owners or developers around or contactable at the same time as there will inevitably be questions about why a particular part of code is the way it is, or whether an algorithm can be changed for performance reasons.