Debugging in 5D

Author: Adrian Jackson
Posted: 24 Feb 2016 | 16:41

Or why debugging is hard and parallel debugging doubly so

Computing bug: Grace Hopper's famous bug, found in 1947 in a relay in the Mark II computer and taped to the operations logbook with the annotation "First actual case of bug being found". Image courtesy of the Naval Surface Warfare Center, Dahlgren, VA., 1988. - U.S. Naval Historical Center Online Library Photograph

Debugging programs is hard. I give a lecture on debugging for the Programming Skills module of EPCC's MScs in HPC and HPC with Data Science where we try to point out common programming mistakes, programming strategies for making bugs less likely, and the skills and tools required for investigating, identifying, and fixing bugs.

For non-programmers it may seem straightforward to eliminate program bugs: simply write correct code! However, even for the best and most conscientious programmers this is not feasible. I estimate I average around 2 bugs per 20 lines of code I write (this figure does fluctuate based on the programming language being used, the amount of concentration applied when developing the program, etc.), and I'd like to think I am a good programmer (although I'm not the best person to make this claim).

For those who don't recognise it, the picture above shows what is often described as the first ever computing bug: Grace Hopper's famous bug, found in 1947 in a relay in the Mark II computer and taped to the operations logbook with the annotation "First actual case of bug being found".

In my debugging lecture we talk about programming being a science and debugging being an art. This is because when you are debugging, especially when you are debugging your own code, you need to try to look at the code from a different perspective, as an outsider, to pick up the mistakes you yourself made: mistakes that weren't obvious to you when you wrote the code. Luckily we have tools to help with debugging. The compiler is generally good at picking up typos and similar kinds of mistakes, and there are static analysis tools that check a program for likely errors (e.g. ftnchek and splint). Furthermore, there's testing (unit, regression, system, etc.) that should help developers pick up the fact that there is a bug or issue with a program. And there are debuggers that provide functionality to discover where crashes are happening or where variables are changing values.

However, debugging itself (finding the mistakes and fixing them) relies on experience, persistence, and an understanding of the way you program, the way the language works, and what the program should be doing. When debugging serial programs there are multiple areas, or dimensions, that have to be considered: the programming language, the initial data the program is using, and the hardware/software environment that the program is running in (OK, I'm being selective with the areas I'm picking, there are probably many more things to consider, but this list gives me the 3 dimensions I need for now).

But, teaching a recent ARCHER MPI course, I was reminded of the further difficulties of debugging parallel programs. We teach courses like this through a mixture of lectures and practicals. One of the nice things about teaching the practicals is that it lets you seem like a real expert to the attendees, because you can often look over their shoulder at an example that's not working for them and spot their mistake immediately. This actually isn't due to amazing skills; it's just that attendees often make similar mistakes (think leaving the ierror argument off the end of MPI calls in Fortran), so you get attuned to the things to look for and can spot them quickly in most cases.

In that last MPI course, though, a student told me their code wasn't working and I couldn't see what was wrong. The code (slightly tidied up) is included below; see if you can find the mistake. It took me a good half an hour before I spotted it, and I needed further information, specifically that the code worked for small numbers of iterations but started to give incorrect results when the iteration count was increased (if you really want to know what's wrong you can e-mail me or leave a comment asking).

      program CalcPi

      implicit none

      include 'mpif.h'

      integer ierr,N,I,J,rank,size
      integer Imin,Imax,status(MPI_STATUS_SIZE)
      integer*8 iter
      double precision PiEst,SumPi
      parameter (N=8400000)
      parameter (iter=1000000)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

      DO J=1,iter
        SumPi=0.
        PiEst=0.
        Imin=(rank*N/size) +1
        Imax=(rank+1)*N/size
        DO I=Imin,Imax
          PiEst=PiEst+1./(1.+((FLOAT(I)-0.5)
     +         /FLOAT(N))**2.)
        ENDDO
        PiEst=4.*PiEst/FLOAT(N)
        IF(rank.ne.0)THEN
          call MPI_Ssend(PiEst,1,
     +     MPI_DOUBLE_PRECISION,0,0,
     +     MPI_COMM_WORLD,ierr)
        ELSE
          SumPi=SumPi+PiEst
          DO I=1,size-1
            call MPI_Recv(PiEst,1,
     +       MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +       0,MPI_COMM_WORLD,status,ierr)
            SumPi=SumPi+PiEst
          ENDDO
        ENDIF
      ENDDO

      IF(rank.eq.0)THEN
        WRITE(*,*)'sum of all ranks is ',SumPi
      ENDIF

      call MPI_FINALIZE(ierr)

      end

Having spent a long time working with MPI in C and Fortran programs, and teaching MPI courses, I should have been able to spot the mistake more quickly, but the fact that I couldn't highlights why parallel debugging is difficult. Not only do you debug in the 3 dimensions I discussed previously, you also have to care about 2 more: time and scale (parallel scale).

Now, clearly serial programs are affected by time too. The program progresses over time, and different parts of the code are exercised at different points, so bugs may only be encountered after a long period of time, or their effects may only be felt after long run times (e.g. memory leaks may be present in your program but not actually cause a problem until the program has run for a long time).

However, in parallel computing time is additionally troublesome, as the programmer needs to consider divergence in time between the different running processes of a parallel program: two processes may take different paths through the program, or take the same path but be at different places at any single point in time. Effects that are due to this loose coupling of processes can be transient and sporadic. They may only occur when certain numbers of processes are used, or when certain hardware (and associated software libraries) are used (i.e. working on one machine is no guarantee of working on another machine).

This also means that things like debuggers may not be helpful for analysing and identifying these issues, as they tend to slow down and synchronise processes in a parallel program, and therefore can cause the condition that triggers the bug not to occur.

Parallel scale, also, is challenging. Bugs can occur only at large process counts, but debuggers or other traditional methods for debugging (testing, printing out values, etc.) are hard to use at very large scale (imagine trying to run your test suite on 100,000 cores every time you alter your program; it may be desirable but is probably not feasible). Furthermore, parallel computing often requires programmers to specify or formalise additional information about a program related to parallel functionality.

One example of a bug associated with this is the following OpenMP program (courtesy of EPCC's Mark Bull) that may run correctly, or may not, depending on the system it is running on and the compiler used.  Again, try and see if you can identify the bug (if you need more info please do get in touch):

#pragma omp parallel for private(temp)
for(i=0;i<N;i++){
  for(j=0;j<M;j++){
    temp = b[i]*c[j];
    a[i][j] = temp * temp + d[i];
  }
}

It would be nice to be able to round this post off with a solution to this debugging problem, but I can't. It's not an easy problem to solve. Testing can help, and pair programming and code review can also be very powerful tools in ensuring correct functionality. Proper software design, where the purpose of subroutines/functions is documented and parallel requirements and functionality are defined, can help you understand what the code is trying to do and analyse what it is actually doing.

We are also collecting common mistakes that people make when programming with OpenMP and MPI and will be putting them up on the EPCC website shortly. If you have any suggestions for contributions to the set of common mistakes, or for sensible techniques or tools for reducing, identifying, and fixing bugs in programs, particularly parallel programs, we'd love to hear them. Please feel free to leave a comment on this post, or e-mail me!