Originally posted: 2020-02-22.

Why you should open source your analytical work

Working in the open improves the quality of analytical work and multiplies its impact.

The payoff extends far beyond transparency and reuse. Greater benefits come from the way of working that is needed to publish work openly, because it promotes analytical best practice.

Analysts should therefore aim to open source the non-sensitive parts of their projects, which is often the majority of the code. Some of the key benefits which I've experienced are as follows:

Re-use, collaboration and documentation

An obvious benefit of open source is that other teams and organisations can reuse the work. Less obviously, the benefits of reuse are quickly realised within a team itself, for several reasons:

  • Collaboration and feedback. It is much easier to tap into the expertise of the wider community of analysts when work is shared in the open. In my experience, analysts outside the team are much more likely to be helpful if they know they're contributing to open source work. This is motivated by a mutual understanding of the benefits of open source, and by the fact that they use free software themselves and want to 'pay it forward'.

    It's also far easier for analysts in different teams (or even different countries) to contribute to and use the same project if there is a single open source codebase. This often leads to iterative improvements as different teams find ways of improving the code, to the benefit of everyone.

  • Bug finding. Many users make light work of QA, because they are quick to tell you when things aren't right. It's not uncommon to get bug reports from all over the world.

  • Documentation. Open sourcing quickly exposes areas of weak documentation, which are likely to cause problems for corporate knowledge retention: if the documentation is lacking, you'll soon receive questions. If you're lucky, your users will even contribute to the documentation themselves.

Managing complexity and corporate knowledge retention

A defining characteristic of high quality analytical work is that complexity is managed effectively.

At a basic level, this involves separating out data, assumptions, and modelling work, and ensuring results are reproducible.

In more complex projects, this usually means breaking down the problem into simpler parts, each of which has a clear responsibility with limited scope (a good 'separation of concerns'). These parts can then be run and quality assured separately.

If a project is to be open sourced, this approach to model building is essential from the outset because sensitive data and assumptions must be kept separate from the code.
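
As a minimal sketch of this separation, the example below keeps the assumptions out of the modelling code entirely. The names (Assumptions, load_assumptions, project) and the figures are hypothetical and chosen only for illustration; in a real project the JSON would typically be read from a private file that is never published.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumptions:
    """Hypothetical model assumptions, kept outside the published code."""
    growth_rate: float
    base_year: int


def load_assumptions(text: str) -> Assumptions:
    """Parse assumptions from JSON text (in practice, a private file)."""
    raw = json.loads(text)
    return Assumptions(growth_rate=raw["growth_rate"], base_year=raw["base_year"])


def project(baseline: float, year: int, assumptions: Assumptions) -> float:
    """Compound the baseline forward from the base year to the target year."""
    periods = year - assumptions.base_year
    return baseline * (1 + assumptions.growth_rate) ** periods


# Illustrative values only; real figures would live in a private file.
example = load_assumptions('{"growth_rate": 0.5, "base_year": 2020}')
print(project(100.0, 2022, example))  # 225.0
```

Because project() receives its assumptions as an argument, the function itself can be published and tested with dummy values, while the real inputs stay private.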

It also encourages the analyst to consider the general version of the problem, rather than writing highly specific code that is capable only of solving the problem at hand. This usually results in code that is better abstracted and easier to understand, because it requires clearer thinking about the problem.

In turn, generalised code increases agility, because it requires small modifications rather than a complete rewrite when requirements change. Finally, generalised code is usually less sensitive because it reveals little about its specific application.
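
To illustrate what 'generalised' might look like in practice, here is a small sketch. Instead of a function hard-coded to sum costs by region, the hypothetical total_by function takes the grouping and value columns as parameters, so a change in requirements needs a different argument rather than a rewrite. The function name and the sample rows are invented for this example.

```python
from collections import defaultdict


def total_by(rows, key, value):
    """Sum the `value` field for each distinct `key` across a list of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Made-up sample data, purely for illustration.
rows = [
    {"region": "North", "year": 2019, "cost": 10.0},
    {"region": "South", "year": 2019, "cost": 5.0},
    {"region": "North", "year": 2020, "cost": 2.5},
]

print(total_by(rows, "region", "cost"))  # {'North': 12.5, 'South': 5.0}
print(total_by(rows, "year", "cost"))    # {2019: 15.0, 2020: 2.5}
```

Note that nothing in total_by reveals what the data is about, which is the sense in which generalised code tends to be less sensitive.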

The overall result is projects that are easier to understand and maintain. New members of staff can see how the problem has been split out into smaller, simpler parts, and do not have to understand the whole before they can usefully contribute.

Quality assurance (QA)

Over time, authors of open source have established a widely-adopted convention for quality assurance that allows external users to trust the work, despite the fact they usually have no contact with the authors. This convention involves writing self-documenting tests of the correctness of the code, which run automatically whenever changes are made.

Trust in the project is established because this convention results in a fully-documented, robust quality assurance process which is open for anybody to review and improve.
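
A minimal sketch of what such tests can look like is shown below. The percentage_change function is a hypothetical example, not taken from any particular project. The test functions follow the test_ naming convention that a runner such as pytest discovers automatically, and each test name documents the behaviour it checks. Running them on every change is usually delegated to a continuous integration service, which is assumed here rather than shown.

```python
def percentage_change(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    if old == 0:
        raise ValueError("old must be non-zero")
    return (new - old) / old * 100


def test_increase_is_positive():
    assert percentage_change(50, 75) == 50.0


def test_decrease_is_negative():
    assert percentage_change(200, 150) == -25.0


def test_zero_baseline_is_rejected():
    # A zero baseline has no meaningful percentage change, so it should fail loudly.
    try:
        percentage_change(0, 10)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero baseline")
```

Anyone reviewing the project can read these tests to see exactly which behaviour has been checked, and can add a new test when they find a case that is missing.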

Open sourcing code encourages analysts to develop an understanding of this process - which should be the gold standard for QA - and to begin to follow this convention themselves.

An additional benefit of this kind of QA is that it makes it much easier for a newcomer to the project to make improvements, because any changes they make are automatically tested before they are accepted.

Job satisfaction and learning and development

Speaking for myself, I find working on open source projects incredibly motivating. I use other people's open source work every day, and so it feels good to be able to contribute back to the community. It also means the value of my work extends beyond the immediate project - and it's very satisfying to hear about other people building on your work to achieve interesting things.

Familiarity with open source ways of working also uncovers huge opportunities for learning and development. Rather than solving new problems from scratch, there's always a wealth of existing open source work to learn from and copy. By referring to the code of some of the most widely-used packages, you can learn from some of the best analysts in the world for free.