Under pressure

Author: Adrian Jackson
Posted: 23 Mar 2020 | 10:45

Squeezed performance

Memory under pressure

I was recently working with a colleague to investigate performance issues on a login node for one of our HPC systems. I should say upfront that looking at performance on a login node is generally not advisable: login nodes are shared resources and are not optimised for performance.

We always tell our students not to run performance benchmarks on login nodes, because it's hard to ensure the results are reproducible. However, in this case we were just running a very small (serial) test program on the login node to make sure it worked before submitting it to the batch system, and my colleague noticed an unusual performance variation across login nodes.

The test program should finish in under a second, but on the primary login node it was taking over 40 seconds to complete. On one of the back-up login nodes, which has effectively the same software and hardware configuration, it ran in under a second as expected, so it was interesting that the performance differed so much between the two nodes.

Our first thought was that some aspect of I/O (reading or writing data to files) was causing the issue, because the program does some (although not a lot of) I/O. However, the timing data coming out of the program showed that the code associated with I/O was not the cause of the slowdown. Indeed, the test program could be reduced to the following simple code, which still exhibited the performance issue:

#include <stdio.h>
#include <stdlib.h>

#define K 1024
#define M (K*K)
#define N (100*M)

int main(void)
{
  int i;

  /* Allocate a large array on the heap. */
  int *x = (int *) malloc(N * sizeof(int));

  /* Touch every element so the operating system has to back the
     allocation with physical pages. */
  for (i = 0; i < N; i++) {
    x[i] = i;
  }

  printf("x[%d] = %d\n", N-1, x[N-1]);

  free(x);
  return 0;
}
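If you want to try this yourself, something along the lines of gcc memtest.c -o memtest followed by time ./memtest is enough to reproduce the measurement; the file name is just illustrative, and the absolute numbers will of course depend on your system.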

The next thought was that competition with other users on the system for CPU resources was causing the issue. This is definitely possible because the node often has upwards of 70 people logged into it, and users don't always behave as they should (ie they run big MPI jobs or build processes using all the cores in the node). However, the performance was reproducible (ie measured runtime was within ~10% of the average runtime) across variable loads on the node (the load was monitored with htop to give a rough approximation of how busy the node was).

Paging

Another cause we considered was memory paging/swapping, ie the memory of the node being so heavily used that allocated memory had to be swapped to disk before new memory could be allocated to the test program. However, whilst the node does have swap enabled, there is only 2GB of swap space (compared to 256GB of memory), which would not be enough to satisfy the memory required by the test code (approximately 3GB), and vmstat was not showing any data being swapped in or out.
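(If you want to check this on your own system, swap activity shows up in the si and so columns of vmstat 1, and free -h reports how much swap space is configured and how much is in use.)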

Therefore, as we teach our students in the Performance Programming and Programming Skills courses on our MSc, we broke out the profiler to understand where time was being spent when running the application. In this instance we used perf to profile the application, both on the primary login node where performance was poor and the back-up login node where performance was as expected.
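(For anyone who hasn't used perf before, perf record ./program followed by perf report on the resulting perf.data file is the basic workflow; we're not reproducing the exact options we used here, so treat that as a sketch of the approach rather than the precise commands.)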

Profiling

Figure 1: Performance profile on the primary login node (the poor performance case)

Comparing the profiles in Figure 1 and Figure 2, there are significant differences. Figure 2 shows that over 70% of the runtime is spent in three application routines, and the first non-application routine accounts for under 1% of the runtime. In comparison, Figure 1 shows that on the primary login node, where performance is 40x slower than expected, the top application routine accounts for only 20% of the runtime, and beyond that a very large number of non-application routines dominate the runtime.

Looking at the non-application (Linux kernel) routines being called, they are all associated with memory management and memory allocation. Indeed, the test program does little more than allocate and initialise memory, and that is exactly where the performance is poor. Therefore, whilst the application is not running into swapping issues, it is memory management that is causing the poor performance.

NUMA allocations

Figure 2: Performance profile on the back-up login node (the expected performance case)

The first thing for us to try was NUMA-aware allocation, ie ensuring the memory being used by the test application was allocated on the same CPU socket as the one the application was running on. This login node is a two-socket system, and we know that memory access costs are higher for remote accesses (ie accessing memory attached to the other CPU socket). Fortunately, it is reasonably straightforward to ensure an application uses local memory: we can use the numactl program to restrict our application to a single CPU socket and the memory associated with the cores on that CPU.

Running the program under numactl --membind 0 -C 0-17, to ensure it runs on a single socket and only allocates memory on that socket, does indeed make a performance difference, with the application now taking 10 seconds to complete compared to 40 seconds previously. This is still 10 times slower than it should be, but four times faster than before: an eye-opening demonstration of the costs of non-uniform memory access on hardware such as dual-socket compute nodes.
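As an aside, the same memory binding can be requested programmatically rather than by wrapping the run in numactl. The following is a minimal sketch using libnuma (this is not the code we ran, and it assumes libnuma and its development headers are installed; compile with -lnuma). Note it only controls where the memory comes from; pinning the process to the cores on that socket, the -C part of numactl above, would be done separately, for example with numa_run_on_node or taskset.

#include <stdio.h>
#include <numa.h>

#define K 1024
#define M (K*K)
#define N (100*M)

int main(void)
{
  int i;
  int *x;

  /* Check the kernel and hardware actually expose the NUMA API. */
  if (numa_available() == -1) {
    fprintf(stderr, "NUMA is not available on this system\n");
    return 1;
  }

  /* Allocate the array from memory attached to NUMA node 0,
     mirroring the --membind 0 option to numactl. */
  x = numa_alloc_onnode(N * sizeof(int), 0);
  if (x == NULL) {
    fprintf(stderr, "allocation failed\n");
    return 1;
  }

  for (i = 0; i < N; i++) {
    x[i] = i;
  }
  printf("x[%d] = %d\n", N-1, x[N-1]);

  numa_free(x, N * sizeof(int));
  return 0;
}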

As you should with any performance optimisation work, we re-profiled after making this change to check whether the remaining performance issues were still caused by kernel function calls associated with memory management. Figure 3 shows that moving to guaranteed local memory has indeed reduced the kernel function calls, although it has not removed them altogether.

Figure 3: Performance profile using numactl to ensure socket-local memory is used

Kernel calls

From Figure 3 we can see that kernel routines such as isolate_freepages_block (part of the kernel's memory compaction code) are still consuming time in the profile, with a wide range of kernel symbols ([kernel.kallsyms]) appearing in the profiling output. These all suggest the runtime is being affected by memory allocation costs in the operating system: even though the overall program runtime is not dominated by these routines, it is likely the program is being interrupted by the kernel and paused whilst the memory allocations occur.

All this profiling information suggests that the system we are seeing poor performance on has memory fragmentation issues, meaning there is a high overhead for the operating system to allocate sufficient memory to the application when it requests it. It looks like the kernel is undertaking memory compaction and migration tasks when the application allocates memory, which would account for the long runtime of the application.

Memory fragmentation

Indeed, that particular node had been up for over 60 days, has many users logged in, and has a number of filesystems attached, so memory fragmentation is likely to occur. We can get some information from the operating system about the current state of its memory by looking at the buddyinfo data provided by the /proc filesystem in Linux (/proc/buddyinfo).

Figure 4 shows the output of /proc/buddyinfo on the two different nodes, with login0 the slow node and login2 the normal node. The buddyinfo output isn't the easiest thing to understand, but essentially it lists the number of free blocks of memory of various sizes, from one page (the left-most column) up to 1024 contiguous pages (the right-most column). Column n counts the number of free blocks that are 2^n contiguous pages in size: the first column is the number of free single pages (2^0 = 1, ie free pages surrounded by used pages), the second column is the number of places where there are 2^1 = 2 contiguous free pages, the third column 2^2 = 4, and so on up to 2^10 = 1024 contiguous free pages. With the standard 4KB page size, that right-most column therefore counts free contiguous blocks of 4MB.

Figure 4: /proc/buddyinfo data from both of the nodes benchmarked

The rows represent different zones of memory: DMA is the first 16MB of memory, DMA32 is the first 4GB of memory (minus the DMA region), and the Normal row is all memory above 4GB. These systems have 256GB of memory, so the vast majority of it falls in the Normal zone. Therefore, if the system has low memory fragmentation we'd expect most of the free memory to be in the right-most columns of the table, and if there is high fragmentation we'd expect it to be concentrated in the left-most columns. Figure 4 shows exactly this. For the login2 node, where performance is as we expect, there are a large number of large contiguous areas of memory available, including a significant number of 1024-page blocks on both sockets (node 0 and node 1). For login0 the opposite is true: there are no very large contiguous areas available (ie zeros in the last two columns of the table) and very high numbers of single pages or short runs of contiguous pages.
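To make the format a little more concrete, the following small sketch reads /proc/buddyinfo and totals the free block counts for each block size across all zones (this is purely illustrative, and assumes the standard 4KB base page size):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ORDER 11            /* columns cover 2^0 .. 2^10 pages */
#define PAGE_KB   4UL           /* assume the standard 4KB base page */

int main(void)
{
  FILE *fp = fopen("/proc/buddyinfo", "r");
  char line[512];
  unsigned long total[MAX_ORDER] = {0};
  int order;

  if (fp == NULL) {
    perror("fopen /proc/buddyinfo");
    return 1;
  }

  /* Each line looks like:
     Node 0, zone   Normal   200   150   90 ... (one count per order) */
  while (fgets(line, sizeof(line), fp) != NULL) {
    char *p = strstr(line, "zone");
    if (p == NULL) {
      continue;
    }
    p += strlen("zone");
    while (*p == ' ') p++;                 /* skip spaces before the zone name */
    while (*p != ' ' && *p != '\0') p++;   /* skip the zone name itself */

    for (order = 0; order < MAX_ORDER; order++) {
      total[order] += strtoul(p, &p, 10);
    }
  }
  fclose(fp);

  for (order = 0; order < MAX_ORDER; order++) {
    printf("free blocks of %5lu kB: %lu\n",
           PAGE_KB << order, total[order]);
  }
  return 0;
}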

Page faults

This matters for performance because the operating system has to do much more work to allocate the 3GB of memory this application requires from fragmented pages than it does when it can hand the application large contiguous blocks of pages. The system with more contiguous pages will require far fewer page faults by the application to allocate and use memory than the fragmented system. Indeed, we can track page faults in Linux using the ps tool, with a command like ps -o min_flt,maj_flt,cmd, which displays the number of minor and major page faults for each running process. If we do this on the fast node, the test application causes around 450 minor page faults when running (the specific number will depend on the system load and other user applications), whereas on the slow node the same application generates around 100,000 minor page faults!
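The same counters can also be read from inside the program itself via getrusage; a minimal sketch (again, purely illustrative, not part of the original test code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

#define K 1024
#define M (K*K)
#define N (100*M)

int main(void)
{
  int i;
  struct rusage usage;
  int *x = malloc(N * sizeof(int));

  for (i = 0; i < N; i++) {
    x[i] = i;
  }
  printf("x[%d] = %d\n", N-1, x[N-1]);

  /* ru_minflt and ru_majflt are the minor and major fault counts for
     this process, the same numbers ps reports as min_flt/maj_flt. */
  if (getrusage(RUSAGE_SELF, &usage) == 0) {
    printf("minor faults: %ld, major faults: %ld\n",
           usage.ru_minflt, usage.ru_majflt);
  }

  free(x);
  return 0;
}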

Mitigation

So, memory fragmentation really does matter for application performance. Admittedly, this isn't a problem you should generally see on well-run compute nodes, and for many applications memory allocation isn't a frequent operation, but it clearly can impact performance. There are a number of ways to address memory fragmentation. Firstly, you can use large pages, meaning that memory is allocated in much larger pages than the default, up to GB page sizes. However, whilst this will reduce memory fragmentation, it can increase memory usage, especially on something like a login node, as memory is assigned in much larger blocks even if an application only uses a small amount of it.
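As a rough illustration of one way an application can opt in to larger pages for a particular allocation, the sketch below uses madvise with MADV_HUGEPAGE to request transparent huge pages for the test array. This is not something we tried in this investigation, it requires transparent huge pages to be enabled on the system (eg in madvise mode), and whether it actually helps will depend on how the node is configured:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define K 1024
#define M (K*K)
#define N (100*M)
#define HUGE_PAGE_SIZE (2UL*M)   /* the typical x86_64 huge page size */

int main(void)
{
  int i;
  void *mem;
  int *x;

  /* Allocate memory aligned to the huge page size, then advise the
     kernel that this region is a good candidate for transparent
     huge pages. */
  if (posix_memalign(&mem, HUGE_PAGE_SIZE, N * sizeof(int)) != 0) {
    fprintf(stderr, "allocation failed\n");
    return 1;
  }
  madvise(mem, N * sizeof(int), MADV_HUGEPAGE);

  x = mem;
  for (i = 0; i < N; i++) {
    x[i] = i;
  }
  printf("x[%d] = %d\n", N-1, x[N-1]);

  free(mem);
  return 0;
}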

Secondly, you can get the operating system to try to defragment memory (I'm old enough to still remember routinely doing this for Windows hard disk drives). You need administrator privileges to do the following, but they should trigger the operating system to try to clean up the memory: echo 1 > /proc/sys/vm/compact_memory or sysctl vm.compact_memory=1. Dropping the page caches (sync; echo 3 > /proc/sys/vm/drop_caches) can also help.

Finally, rebooting the node will definitely get rid of the memory fragmentation, albeit only for a limited period of time, and it's not a solution we can easily deploy as these are active systems used by a lot of people. As it turned out, the node did get rebooted shortly after we finished this performance investigation because it needed some software upgrades, and once the system had been rebooted the performance issues disappeared. However, the issues have progressively come back the longer the node has been up: it's now close to 30 days of uptime and the code is running about 10x slower than it could be.

Caveats

As previously mentioned, it's unlikely this would significantly impact most applications (where calculations and memory accesses dominate over memory allocation operations), and most production compute systems probably don't see this level of long-term memory fragmentation, so your application probably isn't suffering from this issue. It's also likely a different operating system/processor combination would show different performance characteristics here, although we've not done any benchmarking on other systems to verify this. We'd also expect alternative malloc implementations (jemalloc, tcmalloc, etc) to have some impact on performance. That said, this is a good example of why we closely curate our compute nodes, to ensure performance problems like this don't build up. If you're running systems for production applications, it's always worth considering whether memory is properly freed and page caches are dropped between jobs to mitigate any such performance issues.
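(If you do want to experiment with alternative allocators, they can usually be tried without recompiling by preloading the library, eg LD_PRELOAD=/path/to/libjemalloc.so ./your_program; the path and program name there are just placeholders.)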

Working through this issue also highlighted that I don't fully understand what the operating system is doing when you ask for memory (i.e. with malloc). I roughly know what's going on at a high level (I think), but there is clearly a lot of complexity under the hood that is obscured by my simple, application-level view of memory allocation. Time for a bit more in-depth research, I think!

Author

Adrian Jackson, EPCC
Adrian on Twitter: @adrianjhpc

Dr David Henty was instrumental in a lot of the performance work described above.

Image: Photo by Vlad Gurea on Unsplash