Originally posted: 2019-08-26. View source code for this page here.

Effective testing of analytical models using automated sense checks

I once worked for a slightly terrifying senior analyst.

When presenting my modelling results to her, she would often do a back-of-the-envelope calculation of her own, to estimate roughly what she expected the answer to be. If our answers differed, I knew she would challenge me to explain why.

This was an effective technique because if my explanation was unconvincing, it probably meant either my model was wrong, or I didn’t understand it well enough to trust it.

I now think of her approach as a manual form of integration testing which is well-suited to analytical models. These ‘sense check’ tests are some of the most important tests of a model. They typically take the form of an expectation that a value within the model will fall between a plausible upper and lower bound. Re-imagined as automated test of analytical code, these checks can be applied all the way from low level calculations thorough to high level results to ensure that modelling results make sense and can be explained.

Because they test the answer rather than the details of the implementation, they have analytical meaning independently from the code. This means they can be written early on in the modelling process. They’re therefore capable of spotting errors earlier, increasing their value to the analyst.

They’re so valuable that I think that automating these sorts of tests should be a standard part of analytical quality assurance.

In contrast to technical-sounding ‘unit’ and ‘integration’ tests, analysts can readily see the value of automated ‘sense checks’, so they are easier to explain and more likely to be adopted. Presenting your results to your boss suddenly becomes much less scary if you know for sure your model isn’t doing anything crazy.

And more importantly, these ‘sense checks’ are likely to be more effective than many unit tests. It’s all too easy to descend into writing perfunctory unit tests, which are often the result of an analyst being told unit tests are necessary without understanding when and why they are valuable.

None of this is to say unit tests are not valuable when used effectively. They can be an important complement of sense checks, and at lower-layers of a model sense checks sometimes are unit tests.

By starting out with sense checks, analysts are introduced to a type of testing of their code which obviously improves the robustness of the modelling result. By considering why these tests have value, it should help them start to design other unit tests which are more valuable — favouring tests which have well-defined analytical meaning and are easy to maintain as the model evolves, and avoiding tests that offer little protection and get in the way of code re-factoring or model generalisation.

An example

The rest of this post is more technical. It makes the above ideas more concrete with some real-life examples of the kind of ‘sense check’ tests I describe.

The context for these examples is a simple personal project: an energy usage calculator I have building to better understand how different types of energy usage compare. For instance, how does owning a dog compare to driving 10,000 miles? Note: This project is still work in progress, but you can see its current state here.

This project involves a lot of unit conversions — e.g. miles driven ➡litres of petrol➡ joules of energy ➡kilowatt hours, each one with the potential for error. It also involves a lot of rough-order-of-magnitude calculations such as ‘how many joules of energy does it take to produce one joule of edible food’.

Here are some ‘sense checks’ I’ve written:

Bottom up

  • Do my estimates of car energy usage for a moderately efficient car fall between estimates on Wikipedia of a highly inefficient and highly efficient car. See here.

  • Is my estimate of the embodied energy in ‘stuff’ like consumer electronics within reasonable plausible bounds. Specifically, is it less than the energy from buying and burning petrol (minus fuel duty) and more than one-tenth that amount? See here.

  • Does the energy required to produce the food to feed an average man lie within plausible bounds — between a lower-bound food intake with a low food-production multiplier, and a similar higher bound. See here.

  • Does the energy required to feed a dog fall within a sensible range of the energy required to feed a human omnivore of similar weight? See here.

  • Does turning down the thermostat from 20 to 19 degrees give a plausible energy saving. I guessed at between 10 and 15 per cent. I think it’s fine to use these kind of somewhat-arbitrary ranges. If your tests fail, you can think harder about why, and adjust the range, making an appropriate note of the explanation. See here.

Tests of internal consistency of calculations.

  • Is the energy required to feed a dog roughly consistent with estimates of the energy embodied in ‘stuff’. See here.

Top down

  • Totalling up energy consumption for a fairly average person, is it within some sensible bounds of average per-capita usage. See here.

And to give a flavour of the unit tests which I think are valuable:

  • My unit conversion code automatically detects all possible ‘chained’ unit conversions and reciprocal unit conversions, minimising the list of hard coded values. For instance, by defining hours➡ minutes and minutes➡ seconds, it should successfully be able to compute convert_units("hours", "seconds")Obviously, there is scope for bugs in this code. As such I test a couple of these derived conversions, and reciprocal conversions — see here.

  • Simple test cases of basic functions. For instance If I drive 30 miles in a 30 mpg car, do I use one gallon of petrol? See here.

These tests run automatically every time my code is changed and uploaded to its Github repository. I use Github Actions to run these tests, which is a type of continuous integration that’s very easy to set up.

Overall, do my tests provide a guarantee the calculations are correct? Definitely not. But they do give me a great deal of confidence they’re not too far out, that I haven’t made any all-too-common mistakes like sign errors, and that I have a well-defined defence of my estimates within my code. And for rough estimates, that’s probably good enough.