Array index order matters, right?

Author: Adrian Jackson
Posted: 30 Mar 2016 | 17:46

Does array index order affect performance?

A couple of weeks ago I was teaching an ARCHER Modern Fortran course, and one of the things we discuss during the course is index ordering for multi-dimensional arrays.  The course is an introduction to modern Fortran (primarily F90/F95), so we don't go into lots of detail about parallel or performance programming, but as attendees are likely to be using Fortran for computational simulation it is important that they understand which array dimension is contiguous in memory, so that they don't accidentally write code that is much slower than it should be.
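
As a quick reminder of why this matters, Fortran stores arrays in column-major order: the first index varies fastest in memory.  A tiny, self-contained illustration (not part of the course code) is:

    program layout
        implicit none
        integer :: a(3,2), k

        ! reshape fills a in storage (column-major) order, so the element
        ! holding the value k is the k-th element in memory
        a = reshape([(k, k = 1, 6)], [3, 2])

        print *, a(:,1)   ! prints 1 2 3 -> the first column is contiguous
        print *, a(1,:)   ! prints 1 4   -> the first row is strided
    end program layout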

Figure 1: Performance using the GNU compiler

During one of the practical sessions on the course, one of the students wrote a little program to investigate the performance impact of iterating through array elements in a non-contiguous order.  They also included some code to investigate whether there is a performance impact when using allocatable arrays rather than static arrays (I'd mentioned that it shouldn't affect performance, but I obviously wasn't convincing enough...).

The code is very simple: it just iterates through a 2D array and performs a multiply and an add on each element (it also sums the array afterwards and prints out some statistics, just to make sure the compiler does not optimise away the array update loop):

    do Itime = 1, 100
        do j = 1, n2
            do i = 1, n1
                a(i,j) = i * j + i
            enddo
        enddo
    enddo

It has another kernel which performs the same loop in the reverse order, which conventional wisdom tells us should perform poorly because it indexes the array in a non-contiguous order (in Fortran the first array dimension, i in this example, is the one stored contiguously):

    do Itime = 1, 100
        do i = 1, n1
            do j = 1, n2
                D(i,j) = i * j + i
            enddo
        enddo
    enddo
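
For completeness, the third variant that appears in the results below uses an allocatable array rather than a static one; a minimal sketch of what that looks like is below (the array name c, the checksum print, and the double precision kind are my own illustration rather than the student's exact code):

    real(kind=8), allocatable :: c(:,:)

    allocate(c(n1, n2))
    do Itime = 1, 100
        do j = 1, n2
            do i = 1, n1
                c(i,j) = i * j + i
            enddo
        enddo
    enddo
    ! sum the array and print the result so the compiler cannot
    ! optimise away the update loop
    print *, 'checksum = ', sum(c)
    deallocate(c)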

On ARCHER we have 3 compilers: Cray, Intel, and GNU.  I thought it would be interesting to compare the performance of these compilers on this benchmark code at different levels of compiler optimisation (the flags -O0, -O1, -O2, and -O3).  Figures 1-3 show how each of the compilers performed.  Each test was run 3 times; the mean value is the height of the plotted bar, with the min and max values plotted as error bars on top of the bar, although these are too small to see.

Benchmarks were undertaken on ARCHER, using a single compute node with exclusive access to that node.  No other compiler flags were used apart from the optimisation flag discussed above, although we were using the Cray ftn compiler wrapper, so that may have added flags that were not visible to us.  We can see from the graphs that all 3 compilers benefit significantly from the optimisation flags (as we would expect).
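
Each kernel is timed separately; a minimal sketch of how that can be done with the standard system_clock intrinsic is shown below (the benchmark's actual timing code isn't reproduced in this post, so treat this as illustrative rather than the exact code used):

    integer(kind=8) :: count_start, count_end, count_rate
    real(kind=8)    :: elapsed

    call system_clock(count_start, count_rate)
    ! ... one of the kernel loop nests goes here ...
    call system_clock(count_end)
    elapsed = real(count_end - count_start, kind=8) / real(count_rate, kind=8)
    print *, 'kernel time (s): ', elapsed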

Figure 2: Performance using the Cray compiler

We can also see that the reverse-order loop (where we index the array in a non-contiguous order) is easily 5 times slower, across all the compilers, when no compiler optimisation is used.  Furthermore, with no optimisation all the compilers give roughly the same performance (again, as we'd expect: the optimisations are really what differentiate the compilers).

Figure 3: Performance using the Intel compiler

Comparing the compilers in more detail requires zooming in on the graphs to see the timings at the higher optimisation levels.  Figure 4 compares the -O3 performance of the compilers for each of the different code kernels.  From this we can see that there are variations in performance between the compilers.

Figure 4: Comparing the performance of the 3 compilers using the -O3 flag

Cray compiler

The Cray compiler gives the best performance, and at this level of optimisation we can see there is effectively no difference between the 3 different code snippets (contiguous access, reversed access, and contiguous access using allocatable arrays).  It is likely that the Cray compiler recognises that the data access pattern we have coded is sub-optimal and, because it can see there are no dependencies between loop iterations, reorders it.
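
In Fortran terms, the transformation I suspect the Cray compiler is making on the reversed kernel is a straightforward loop interchange, roughly as sketched below (this is my reading of the results, not anything reported by the compiler):

    ! interchange the i and j loops so the contiguous index i is innermost
    ! again; this is legal here because the loop iterations are independent
    do Itime = 1, 100
        do j = 1, n2
            do i = 1, n1
                D(i,j) = i * j + i
            enddo
        enddo
    enddo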

GNU Fortran compiler

The GNU Fortran compiler does not give as good performance as the Intel or Cray compilers, although it's not terrible (the runtime is still within the same order of magnitude as the other two).  It also still suffers a performance impact from the reversed array access; clearly it has not identified this as a performance issue and fixed it.  However, this is to be expected: the GNU compilers are general purpose rather than specialised for scientific computing, so development effort tends to go into supporting a very large range of hardware rather than into getting the fastest possible performance on specific hardware.  It does still produce good code, which is significantly faster than the runs without compiler optimisation.

Intel compiler

However, the real surprise in these results is the Intel compiler.  It gives much better performance than the GNU compiler, although it isn't quite as fast as the Cray compiler for this example code.  More strikingly, its reversed access order loop is more than twice as fast as its contiguous access loops.  This goes against all conventional wisdom about memory access on modern computers!

As this is a very small amount of code, we have some chance of working out why this is the case.  Looking at the assembler the Intel compiler produces for the two different loops, we can see some differences in what is going on.  The assembler for the body of the contiguous access loop is below:

vpmulld   %xmm0, %xmm1, %xmm2                
vpaddd    %xmm9, %xmm1, %xmm7                
vpaddd    %xmm1, %xmm2, %xmm3                
vpaddd    %xmm7, %xmm9, %xmm12               
vpmulld   %xmm0, %xmm7, %xmm1                
vpaddd    %xmm7, %xmm1, %xmm5                
vpaddd    %xmm12, %xmm9, %xmm1               
vpmulld   %xmm0, %xmm12, %xmm8               
vpmulld   %xmm0, %xmm1, %xmm13               
vpaddd    %xmm12, %xmm8, %xmm10              
vpaddd    %xmm1, %xmm13, %xmm14              
vpaddd    %xmm1, %xmm9, %xmm1                
vcvtdq2pd %xmm3, %ymm4                       
vcvtdq2pd %xmm5, %ymm6               
vcvtdq2pd %xmm10, %ymm11                     
vcvtdq2pd %xmm14, %ymm15                     
vmovupd   %ymm4, test_$A.0.1(%rax,%rbx,8)    
vmovupd   %ymm6, 32+test_$A.0.1(%rax,%rbx,8) 
vmovupd   %ymm11, 64+test_$A.0.1(%rax,%rbx,8)
vmovupd   %ymm15, 96+test_$A.0.1(%rax,%rbx,8)
addq      $16, %rbx                          
cmpq      $10000, %rbx                       
jb        ..B1.6        # Prob 99%           

We can see that it is using vector instructions, and it looks like the loop has been unrolled a little so that 4 iterations are done at a time (my assembler is a bit rusty, but I think that's what's going on here).  Now here's the assembler code for the reverse access loop:

..B1.23:               
xorb      %al, %al                           
lea       1(,%rdx,8), %ebx
lea       2(,%rdx,8), %esi
lea       3(,%rdx,8), %edi
lea       4(,%rdx,8), %r8d
vmovd     %ebx, %xmm8     
lea       5(,%rdx,8), %r9d
vmovd     %esi, %xmm7     
lea       6(,%rdx,8), %r10d                     
vmovd     %edi, %xmm6     
lea       7(,%rdx,8), %r11d                     
vmovd     %r8d, %xmm5     
lea       8(,%rdx,8), %ebx
vmovd     %r9d, %xmm4     
vmovd     %r10d, %xmm3    
vmovd     %r11d, %xmm2    
vmovd     %ebx, %xmm0     
vpshufd   $0, %xmm8, %xmm8
vpshufd   $0, %xmm7, %xmm7
vpshufd   $0, %xmm6, %xmm6
vpshufd   $0, %xmm5, %xmm5
vpshufd   $0, %xmm4, %xmm4
vpshufd   $0, %xmm3, %xmm3
vpshufd   $0, %xmm2, %xmm2
vpshufd   $0, %xmm0, %xmm1

..B1.24:                       
vmovdqu   .L_2il0floatpacket.0(%rip), %xmm0 
xorl      %ebx, %ebx                              
..B1.25:                      
vpmulld   %xmm8, %xmm0, %xmm10                  
vpmulld   %xmm7, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, test_$D.0.1(%rcx,%rbx,8)      
vmovupd   %ymm15, 80000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm6, %xmm0, %xmm10                  
vpmulld   %xmm5, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 160000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 240000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm4, %xmm0, %xmm10                  
vpmulld   %xmm3, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 320000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 400000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm2, %xmm0, %xmm10                  
vpmulld   %xmm1, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vpaddd    %xmm9, %xmm0, %xmm0                   
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 480000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 560000+test_$D.0.1(%rcx,%rbx,8)
addq      $4, %rbx                              
cmpq      $10000, %rbx                           
jb        ..B1.25       # Prob 99%                

..B1.26:                      
incb      %al                                
cmpb      $12, %al                           
jb        ..B1.24       # Prob 92%           

..B1.27:                       
xorb      %al, %al                           

..B1.28:                        
vmovdqu   .L_2il0floatpacket.0(%rip), %xmm0  
xorl      %ebx, %ebx                             

..B1.29:                       
vpmulld   %xmm8, %xmm0, %xmm10                  
vpmulld   %xmm7, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, test_$D.0.1(%rcx,%rbx,8)      
vmovupd   %ymm15, 80000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm6, %xmm0, %xmm10                  
vpmulld   %xmm5, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 160000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 240000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm4, %xmm0, %xmm10                  
vpmulld   %xmm3, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 320000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 400000+test_$D.0.1(%rcx,%rbx,8)
vpmulld   %xmm2, %xmm0, %xmm10                  
vpmulld   %xmm1, %xmm0, %xmm13                  
vpaddd    %xmm10, %xmm0, %xmm11                 
vpaddd    %xmm13, %xmm0, %xmm14                 
vpaddd    %xmm9, %xmm0, %xmm0                
vcvtdq2pd %xmm11, %ymm12  
vcvtdq2pd %xmm14, %ymm15  
vmovupd   %ymm12, 480000+test_$D.0.1(%rcx,%rbx,8)
vmovupd   %ymm15, 560000+test_$D.0.1(%rcx,%rbx,8)
addq      $4, %rbx                               
cmpq      $10000, %rbx                           
jb        ..B1.29       # Prob 99%               

The first thing we can see is that there is a lot more code: at least 4 times as much assembler for this loop, and the memory accesses are much more widely separated (as we'd expect, because we are looping over the non-contiguous array index).  So what I think is happening is that the compiler has transformed our inner loop by unrolling it four times into 4 separate loops, and then unrolled each of those 4 times; our single loop has been transformed into 4 unrolled loops (again, I could be reading the assembler wrong, but that's what it looks like to me).  The code then fetches non-contiguous memory blocks, but works on multiples of those blocks and so is able to re-use the data that has been loaded.  Why this is twice as fast as the standard loop isn't as clear to me; maybe it lets the processor use more of its functional units and fill more of its pipeline than the standard code does.  Anyway, it's interesting to see the difference.
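
A rough Fortran-level sketch of the sort of blocked, unrolled structure I think the assembler shows is below; the block size of 8 is read off the store offsets in the listing and, like the rest of this interpretation, is an assumption on my part rather than anything the compiler reported:

    ! handle a block of j values together while i runs in the inner,
    ! vectorised position, so each fetched block of D is re-used for
    ! several values of j (assumes n2 is a multiple of 8)
    do jblock = 1, n2, 8
        do i = 1, n1                  ! vectorised, 4 values of i at a time
            do j = jblock, jblock + 7
                D(i,j) = i * j + i
            enddo
        enddo
    enddo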

Of course, this result doesn't mean you should re-write your simulation codes to iterate through arrays in the non-contiguous order; it's likely this effect is limited to very simple kernels such as the one used in this benchmark.  More realistic code will contain many more instructions, and data elements, than our benchmark, and will therefore place significantly greater demands on processor resources (registers, memory buses, etc.) than our simplistic kernels do.  However, it does show that understanding, and designing code for, optimal processor performance really isn't easy, but at least we have compilers that can help us.

Really, the take-home message for this code would be to use the Cray compiler, as it gives optimal performance regardless of our memory access pattern.  So, for your simulation codes, always benchmark with the range of compilers you have available to see whether you can get improved performance, and therefore reduced time to solution, by selecting the compiler that does the best job on your code with your representative input files.  And once you've developed your code and are ready to use it for production simulations, always, always enable as much compiler optimisation as you can without adversely affecting the results (some compiler optimisations can change the results you get, as they re-order floating point operations).  Remember, though, that higher optimisation levels are not guaranteed to give better performance (for a given code -O2 may give better performance than -O3, and so on), so benchmark your code with a range of compilers and optimisation levels to understand which gives you the best performance.

Just for comparison, here's the assembler the Cray compiler produces at -O3:

.LBBmain_2:
    leaq    (%rcx,%rbx), %rdi       
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)(,%rdi,8) 
    prefetcht0    ($data_2$test_)+312(,%rdi,8)
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+8(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+16(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+24(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+32(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+40(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+48(,%rdi,8) 
    addq    %rdx, %rsi              
    vxorps    %xmm0, %xmm0, %xmm0     
    vcvtsi2sdq    %rsi, %xmm0, %xmm0 
    vmovsd    %xmm0, ($data_2$test_)+56(,%rdi,8) 
    addq    %rdx, %rsi              
    .loc    1  53  0
    addq    $8, %rcx            
    .loc    1  51  0
    cmpq    $10000, %rcx           

Why is this faster than what the Intel compiler does?  Well, I'm guessing, but the Cray compiler seems to be handling the multiplication differently (there is no multiply instruction in the loop at all, just XORs that zero the vector registers and repeated additions), and it is also prefetching the data.
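
For what it's worth, the arithmetic itself can be read as a strength reduction: since i*j + i = i*(j+1), the value stored for consecutive i grows by a constant step of j+1, so the multiply can be replaced by a running integer addition (the repeated addq instructions), leaving only the conversion to double and the store per element.  A hedged Fortran-level sketch of that reading is below (the variable names are mine, and the outer timing loop is omitted):

    do j = 1, n2
        step    = j + 1      ! i*j + i = i*(j+1), so a constant increment in i
        running = step       ! value of the expression at i = 1
        do i = 1, n1
            a(i,j) = real(running, kind=8)
            running = running + step
        enddo
    enddo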

Thanks to Panagiotis Simatos from Imperial College who provided the original code I modified for this benchmark.

This post was also published on our Medium channel.