Benevolent dictator vs democracy: which are you coding for?

Author: Adrian Jackson
Posted: 27 Sep 2016 | 16:47

Developing for the real world

As part of a recent ARCHER eCSE project I developed a new parallelisation strategy for a computational simulation application to enable it to scale efficiently to larger process counts. We managed to significantly reduce the parallel overheads, so the code was accepted into the main repository for users to exploit.

However, the key constraint of this new functionality is that it only works at larger process counts: at process counts where the standard parallelisation strategies perform poorly (the exact range depends on the simulation being run). Indeed, it will produce wrong results when used on a process count that is too small, because of some of the assumptions made in the parallelisation.

To integrate the new code back into the main repository and make it available to users of the application, we documented the code, wrote tests of the new functionality and added them to the test suite, validated its correctness using that test suite, and wrote user documentation for the new functionality. We also ensured that the new functionality isn't switched on by default but requires modification of a user's input script to enable it. And that's where we left it: new functionality, available for users if they needed it, to improve parallel performance.

However, I recently received an email from a user saying the new functionality was producing incorrect results for them. It's always a bit worrying to get such emails: however hard you try to test and validate new code, there's always a chance that something has gone wrong, and if so you potentially have to tell users that the simulation results they've obtained using the code may not be correct; not a pleasant prospect!

The user sent the test case, I ran some tests, and I found that the results were correct, so I got back in touch with the user and asked how they were running the code. It eventually transpired that they were violating the constraints of the new functionality, running it on smaller process counts than it was designed for. In short, they were using it in a way the documentation said it shouldn't be used.

Who's to blame?

Now, was this the user's fault? Well, in an ideal world users would fully read any documentation before using a feature. After all, this feature needs to be manually enabled; it's not on by default. However, I have to admit that most of the time I don't do that, and that's the point. I'd designed and deployed the code assuming a mythical ideal user would be running it.

So, this wasn't the user's fault, it was mine. I should have built in controls to ensure this couldn't happen. Indeed, it's easy for the code to check the process count it is running on against the minimum number of processes a particular simulation requires for the optimised functionality, and to switch off the functionality or stop the program if that requirement isn't met.
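The check itself is straightforward. Here's a minimal sketch in Python (the function names and the way the minimum is obtained are hypothetical; in the real application the check would compare the MPI communicator size against whatever the simulation's parallelisation assumptions require):

```python
def select_strategy(n_procs, min_procs_required):
    """Choose a parallelisation strategy based on the running process count.

    Rather than silently producing incorrect results, fall back to the
    standard strategy (and warn the user) when the optimised strategy's
    minimum process-count requirement isn't met.
    """
    if n_procs < min_procs_required:
        print(f"Warning: the optimised parallelisation requires at least "
              f"{min_procs_required} processes but only {n_procs} are in use; "
              f"falling back to the standard parallelisation.")
        return "standard"
    return "optimised"
```

Whether to fall back gracefully or stop the program outright is a judgement call; falling back keeps the simulation running, while stopping forces the user to notice the misconfiguration.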

I'd been acting like a benevolent (developer) dictator where people only use my code in the exact way I'd documented, rather than building a program that could work in a messy democracy where people can use it however they want.

So I'm now fixing the code to check, to tell the user if they've done something wrong, and to default to a more sensible set of options.

Author

Adrian Jackson, EPCC
Adrian on Twitter: @adrianjhpc

Image: James Cridland, Flickr