Code for failure

Author: Adrian Jackson
Posted: 14 Apr 2016 | 21:02

Writing programs assuming that they will be incorrect

I've been thinking about development methodologies and software design principles recently, and one of the most important things I've learned is that it is essential to write programs with the assumption that they are going to fail.

I don't think any of us like to think that the programs we write or maintain will go wrong, or have mistakes or problems in them. However, as I've discussed previously, it is very hard to develop code without making mistakes: coding errors, algorithmic errors, mistaken assumptions, and so on.

There are lots of techniques to minimise or identify such mistakes, but I've learned over 15 years of development that the most powerful is to write programs assuming that they will fail in some way.

Assuming failure means that from the outset I'm considering which parts of the program may break and how I could work out what is going wrong. Outright crashes are generally not the issue: we have tools to find out what's going wrong, or at least where it's going wrong, and from there it should be possible to discover the problem. Errors where the program runs but produces incorrect results are much more problematic, because without the right design and development processes it's very hard to work out the cause.

Without considering which parts of a program can give rise to such mistakes, and without building testing, modularity, review, and other processes into the design and development task to prevent them, you will end up having to take your program apart, piece by piece, to discover where the error is coming from and how to fix it. Considering these possibilities from the start can save a lot of time when errors or problems do occur.

Of course, this isn't a new approach: it's really a form of defensive programming, where programmers follow various precepts to try to ensure correct and robust code. Rules include the following (a short sketch illustrating a few of them follows the list):

  • Document functionality and assumptions
  • Make code as simple as possible
  • Don't trust inputs, check them
  • Test as much as possible, and automate your tests
  • If an error is possible, check for it
  • Fail fast (if there is a problem, stop the program then and there and be verbose about what the problem was).
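
As a minimal sketch of the "don't trust inputs" and "fail fast" rules in C (the routine, its checks, and the error messages are my own illustration, not from any particular code):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Compute the mean of an array of readings. Inputs are checked
     * rather than trusted, and the program fails fast, with a verbose
     * message, if anything is invalid. */
    double mean(const double *values, size_t count)
    {
        if (values == NULL) {
            fprintf(stderr, "mean: values pointer is NULL\n");
            exit(EXIT_FAILURE);
        }
        if (count == 0) {
            fprintf(stderr, "mean: cannot average zero values\n");
            exit(EXIT_FAILURE);
        }
        double sum = 0.0;
        for (size_t i = 0; i < count; i++) {
            /* A NaN would silently poison the result, so check for it. */
            if (isnan(values[i])) {
                fprintf(stderr, "mean: value %zu is NaN\n", i);
                exit(EXIT_FAILURE);
            }
            sum += values[i];
        }
        return sum / (double)count;
    }

    int main(void)
    {
        double readings[] = { 1.0, 2.0, 3.0 };
        printf("mean = %f\n", mean(readings, 3));
        return EXIT_SUCCESS;
    }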

Now, in the scientific simulation arena some of these rules may seem problematic, as they can affect performance (keeping code as simple as possible, always checking the arguments passed to routines, and so on). However, we should always be coding for correctness first and performance second. Once a program is robust and working well you can look at optimising it to get the best performance: you can turn off argument checking, you can add more optimised (and more complicated) versions of core computational kernels, and so on.
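
One common way to make such checks cheap to remove later is C's assert() macro, which compiles away entirely when NDEBUG is defined. A minimal sketch (the routine scale_vector is a hypothetical example, not from any particular code):

    #include <assert.h>
    #include <stddef.h>

    /* Scale a vector in place. The asserts document and enforce the
     * routine's assumptions during development; compiling with -DNDEBUG
     * removes them for production or performance runs. */
    void scale_vector(double *x, size_t n, double factor)
    {
        assert(x != NULL);
        assert(n > 0);
        for (size_t i = 0; i < n; i++) {
            x[i] *= factor;
        }
    }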

Having the working, robust code as a starting point for optimisation means you have a point of reference for checking errors and identifying problems. If you develop a more optimised computational kernel, keep the original version as well, so you can compare the results from both, confirming correctness and understanding any variability in the results. Techniques like these will pay dividends in easing the process of developing, extending, and maintaining codes.
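
As a sketch of what that comparison can look like (the kernels and the tolerance here are illustrative assumptions; a real optimised kernel would be more involved):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Simple reference version of a dot-product kernel. */
    double dot_reference(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /* "Optimised" version: a two-way unrolled loop standing in for a
     * genuinely optimised implementation. */
    double dot_optimised(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        for (; i < n; i++) {
            s0 += a[i] * b[i];
        }
        return s0 + s1;
    }

    int main(void)
    {
        double a[1000], b[1000];
        for (size_t i = 0; i < 1000; i++) {
            a[i] = (double)i * 0.001;
            b[i] = 1.0 - (double)i * 0.001;
        }
        double ref = dot_reference(a, b, 1000);
        double opt = dot_optimised(a, b, 1000);
        /* Reassociating floating-point sums can change the result
         * slightly, so compare against a tolerance, not for equality. */
        double tol = 1.0e-12 * fabs(ref);
        if (fabs(ref - opt) > tol) {
            fprintf(stderr, "kernel mismatch: ref=%.17g opt=%.17g\n", ref, opt);
            return EXIT_FAILURE;
        }
        printf("kernels agree within tolerance\n");
        return EXIT_SUCCESS;
    }

The tolerance is relative rather than exact because unrolling or vectorising a loop changes the order in which floating-point values are summed, which legitimately changes the last few bits of the answer.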

As Douglas Adams wrote in Mostly Harmless, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair."

Author

Adrian Jackson, EPCC
You can often find Adrian on Twitter.