Research as a build system with Shake

In the last year, I had the opportunity to try out a new way to improve my productivity in research and other university-related projects. I call it the research build system, and it has allowed me to iterate on my work more quickly.

Motivation

What I am about to present has solved one of the main problems I used to have with research and other just-try-different-things-until-something-works work. The problem has its origins in the following workflow:

1. Start research project.
2. Try something new, manually.
3. Take on technical debt by not automating it, because it may turn out to be completely useless in hindsight.
4. Find that your project is now in a less maintainable state.
5. Realise that you haven't found what you were looking for.
6. Repeat from step 2.

As a result, you end up with a system that is bloated, inconsistent and slow to iterate on any further. Most problematically, when you now change any of the subject matter, you will have to remember which parts of your research relied on it, in order to keep your results correct and consistent.

The general idea

In its most basic form, the idea is to optimise for the most common case where almost all work can be thrown away. Ideally:

In each iteration we can try out something new quickly without being bogged down by previous work.
All of the work is kept consistent across iterations.
Refactoring to remove technical debt can be done easily and quickly.
Most importantly: whenever any dependencies are changed, all the dependants are updated accordingly, but no other parts are updated unnecessarily.

When you think about your research as a build system, a lot of things start to fall into place. For example, your paper is the final product of your research, and it contains some plots and some text. This means that paper.pdf depends on paper.tex and plot1.png, plot2.png... The plots depend on the script you use to make them, and on the results of your experiments. If your experiments are also automatable, these results could depend on the inputs to those experiments. Because you probably want to work together with other people, you could also have your experiments depend on the tools that you use to run them, etc...
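To make the dependency idea concrete, here is a minimal sketch in plain Haskell, without any build system yet. It models the graph described above and computes which targets would need updating when one file changes. The file name experiment-input.txt and the function dependantsOf are made up for illustration.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Target = String

-- | Direct dependencies of each target, following the paper example.
-- "experiment-input.txt" is a hypothetical input to the experiments.
deps :: Map.Map Target [Target]
deps = Map.fromList
    [ ("paper.pdf",   ["paper.tex", "plot1.png", "plot2.png"])
    , ("plot1.png",   ["script.r", "results.csv"])
    , ("plot2.png",   ["script.r", "results.csv"])
    , ("results.csv", ["experiment-input.txt"])
    ]

-- | Every target that depends, directly or transitively, on the changed file.
dependantsOf :: Map.Map Target [Target] -> Target -> Set.Set Target
dependantsOf graph changed = go Set.empty [changed]
  where
    go seen [] = seen
    go seen (t : ts) =
        let direct =
                [ k
                | (k, ds) <- Map.toList graph
                , t `elem` ds
                , k `Set.notMember` seen
                ]
        in go (foldr Set.insert seen direct) (direct ++ ts)

main :: IO ()
main = print (Set.toList (dependantsOf deps "results.csv"))
-- prints ["paper.pdf","plot1.png","plot2.png"]
```

Changing paper.tex, by contrast, only affects paper.pdf, so the plots would be left alone. A real build system does this bookkeeping for you.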

Implementation

I implemented this concept using a Shake build system in Haskell. Haskell already takes care of the first three parts of the idea:

Local reasoning due to purity.
Consistency using compile-time type-safety.
Easy refactoring due to type-safety and higher-order functions.

The last aspect, the matter of dependencies, is handled by Shake. Here is a small example of a research build system for a simple paper:

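import Development.Shake

main :: IO ()
main = shakeArgs shakeOptions $ do

    -- The plot depends on the plotting script and the experiment results.
    -- (This assumes that script.r writes plot.png.)
    "plot.png" %> \_ -> do
        need ["script.r", "results.csv"]
        cmd_ "Rscript" "script.r" "results.csv"

    -- The paper depends on its source and on the plot.
    "paper.pdf" %> \_ -> do
        need ["paper.tex", "plot.png"]
        cmd_ "latexmk" "-pdf" "paper.tex"

    -- The paper is the final product that we want to build.
    want ["paper.pdf"]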

In reality you would use a lot more variables, and define some things much more generally, but I hope this example can still give you an idea of what I mean.
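As a rough sketch of what "more generally" could look like, the version below uses one pattern rule for all plots and a few variables for the layout. The directory names, the list of plot names and the helper resultsFor are assumptions for illustration, and so is the convention that script.r takes the results file and the output file as arguments.

```haskell
import Development.Shake
import Development.Shake.FilePath

-- Hypothetical project layout.
plotScript :: FilePath
plotScript = "script.r"

plotNames :: [String]
plotNames = ["plot1", "plot2"]

-- | The results file that a given plot is made from,
-- for example plots/plot1.png comes from results/plot1.csv.
resultsFor :: FilePath -> FilePath
resultsFor plot = "results" </> takeBaseName plot <.> "csv"

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["paper.pdf"]

    -- One rule covers every plot.
    "plots/*.png" %> \out -> do
        let results = resultsFor out
        need [plotScript, results]
        -- Assumes script.r accepts an input path and an output path.
        cmd_ "Rscript" plotScript results out

    "paper.pdf" %> \_ -> do
        need ("paper.tex" : ["plots" </> name <.> "png" | name <- plotNames])
        cmd_ "latexmk" "-pdf" "paper.tex"
```

Adding a third plot would then only mean adding a name to plotNames, rather than writing another rule.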

Where this approach really shines is when your experiments can be run from within Haskell. Then you can just hook up your experiments into your build system directly, and take full advantage of the power of Shake.
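A minimal sketch of such a hook-up could look as follows. The experiment itself (runExperiment, which just squares its inputs), the CSV layout and the file inputs.txt with one integer per line are placeholders, not a real experiment.

```haskell
import Development.Shake

-- | Hypothetical stand-in for an experiment: a pure function
-- from inputs to measurements.
runExperiment :: [Int] -> [(Int, Int)]
runExperiment inputs = [(n, n * n) | n <- inputs]

-- | Render the measurements as CSV with a header row.
toCsv :: [(Int, Int)] -> String
toCsv rows =
    unlines ("input,output" : [show x ++ "," ++ show y | (x, y) <- rows])

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["results.csv"]

    "results.csv" %> \out -> do
        -- readFileLines also records inputs.txt as a dependency.
        inputLines <- readFileLines "inputs.txt"
        -- Assumes every non-empty line of inputs.txt holds one integer.
        let inputs = [read l | l <- inputLines, not (null l)]
        writeFile' out (toCsv (runExperiment inputs))
```

One caveat with this sketch: Shake tracks files, so editing inputs.txt triggers a rerun, but editing runExperiment itself does not on its own invalidate results.csv.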

Pain points of Shake

Shake is good enough, but there are some ugly parts that I would like to see fixed.

Shake uses FilePath for its paths, which is both slow and rather unsafe. That makes sense if you use a lot of patterns rather than paths that point to individual files, but I found that Shake works best for me with Path Abs File (from the path library) instead of FilePath.
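One way to get there, sketched under the assumption that thin wrappers are enough, is to convert back to FilePath only at the boundary with Shake. The helpers needP, wantP and ruleFor are hypothetical names, not part of Shake or of the path library.

```haskell
import Development.Shake
import Path (Abs, File, Path, parseAbsDir, parseRelFile, toFilePath, (</>))
import System.Directory (getCurrentDirectory)

-- Hypothetical wrappers that accept typed paths.
needP :: [Path Abs File] -> Action ()
needP = need . map toFilePath

wantP :: [Path Abs File] -> Rules ()
wantP = want . map toFilePath

-- | A rule for exactly one file. The path is still handed to Shake as a
-- pattern, so this assumes it contains no pattern characters such as '*'.
ruleFor :: Path Abs File -> Action () -> Rules ()
ruleFor file act = toFilePath file %> \_ -> act

main :: IO ()
main = do
    root <- parseAbsDir =<< getCurrentDirectory
    paperTex <- (root </>) <$> parseRelFile "paper.tex"
    paperPdf <- (root </>) <$> parseRelFile "paper.pdf"
    shakeArgs shakeOptions $ do
        wantP [paperPdf]
        ruleFor paperPdf $ do
            needP [paperTex]
            cmd_ "latexmk" "-pdf" (toFilePath paperTex)
```

With this shape, mixing up a directory with a file, or a relative path with an absolute one, becomes a type error instead of a failed build.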

Without a ReaderT, the Rules concept is not very composable. There are things that you may want to define once for your whole build system, but without a ReaderT you have to pass them around to all relevant parts of it. The problem, then, is that with a ReaderT you have to use lift everywhere or locally define wrapper functions.
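The following sketch shows what this can look like with ReaderT from the transformers package stacked on top of Rules. The Config record, its fields and the name ConfRules are invented for this example.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Reader (ReaderT, asks, runReaderT)
import Development.Shake

-- | Hypothetical build-wide settings.
data Config = Config
    { rscriptBin :: FilePath
    , latexmkBin :: FilePath
    }

defaultConfig :: Config
defaultConfig = Config { rscriptBin = "Rscript", latexmkBin = "latexmk" }

-- | Rules that can read the shared configuration.
type ConfRules = ReaderT Config Rules

plotRules :: ConfRules ()
plotRules = do
    rscript <- asks rscriptBin
    -- Every Shake rule has to be lifted into the ReaderT layer.
    lift $ "plot.png" %> \_ -> do
        need ["script.r", "results.csv"]
        cmd_ rscript "script.r" "results.csv"

paperRules :: ConfRules ()
paperRules = do
    latexmk <- asks latexmkBin
    lift $ "paper.pdf" %> \_ -> do
        need ["paper.tex", "plot.png"]
        cmd_ latexmk "-pdf" "paper.tex"

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["paper.pdf"]
    runReaderT (plotRules >> paperRules) defaultConfig
```

The configuration no longer has to be threaded through by hand, but each rule now needs a lift, which is exactly the noise described above.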

Resources allow you to limit how Shake runs some actions, but they have the limitation that you cannot declare a dependency while holding a Resource. This breaks composability when it comes to defining helper functions, such as an rscript function that depends on a custom R installation.
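A possible workaround, shown here only as a sketch, is to declare every dependency before acquiring the resource. The path tools/R/bin/Rscript and the shape of the rscript helper are assumptions. The custom R installation is assumed either to exist already or to have its own rule elsewhere.

```haskell
import Development.Shake

-- | Hypothetical location of a custom R installation.
customRscript :: FilePath
customRscript = "tools/R/bin/Rscript"

-- | Run an R script while holding the given resource.
rscript :: Resource -> FilePath -> [String] -> Action ()
rscript rLock script args = do
    -- All dependencies are declared first, outside of withResource.
    need [customRscript, script]
    -- No need calls are allowed inside this block.
    withResource rLock 1 $
        cmd_ customRscript script args

main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["plot.png"]
    -- Allow at most one R process at a time.
    rLock <- newResource "R" 1
    "plot.png" %> \_ -> do
        need ["results.csv"]
        rscript rLock "script.r" ["results.csv"]
```

This keeps the helper usable, but only because it knows all of its dependencies up front. A helper that discovers dependencies while it runs cannot be wrapped this way.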