Reproducible Research - Version control

Better collaboration and transparency

Image credit: Flickr

As promised in the prior article on reproducible data analysis, here I’d like to outline the concept of version control, and how it pertains to reproducible research.

Version control is a method of recording changes made in a file or set of files over time, allowing you and/or your collaborators to track their history and easily review any changes. Many academics attempt a poor facsimile of this by using some flavour of “track changes” via software like Microsoft Word or Google docs, and emailing these documents back an forth with varying file names. Particularly with the use of Microsoft Word, this can lead to having lots of slightly different versions of the same file, with no easy way of knowing what version contains which changes, and who made them.

More advanced version control systems (VCS) such as Git, SVN and Mercurial provide more powerful and elegant tools. Git was originally designed for large groups of developers to work on big software projects, but is now used for many other kinds of projects.

As in the previous article, here is an overview of the benefits of using version control:

  • The process of creating work becomes more seamlessly integrated with organising, recording and disseminating said work, as opposed to additional burdens that may be neglected
  • More structured collaboration, with robust tools for asynchronous work and managing versions
  • Easy creation of a web presence for a given project (this website is on GitHub!)
  • It can also serve as an excellent tool to support teaching

Motivation

A toy of the GitHub mascot. Image credit: [**Unslash**](https://unsplash.com/photos/wX2L8L-fGeA)

Figure 1: A toy of the GitHub mascot. Image credit: Unslash

Version control allows us to understand the history of a piece of work (typically the writing of a paper for academics). This lets us and others understand what was done, when, and with the aid of well written comments/commit messages, why it was done. This greatly increases the transparency of the work, which is crucial for reproducibility. A version control system allows us to do this cleanly, with no debris from previous versions of files strewn about various folders which could confuse us or others trying to understand how we arrived at out final results. This can be especially useful with code, as we no longer need to keep large chunks of unused code commented out just in case we want it later.

But perhaps the greatest benefit is the ease of collaboration good version control allows for. Particularly when combined with version control hosting services, such as GitHub, GitLab or Bitbucket, version control allows many people, potential all over the world and in different time zones, to work on the same files. The changes from different people can be tracked separately, and there are powerful tools communicate in a more structured way, via pull requests, code reviews and issues for example.

Basic workflow

A typical workflow using version control is as follows:

  1. Create files - these should be plain text, usually code and/or markup or some kind
  2. Work on these files - adding, deleting or changing content
  3. Create a snapshot - explicitly record the state of a file/s at that particular time, with a message summarising what was done

In Git, which I will focus on for the remainder of the article as it the most popular, these snapshots are called “commits”. This can feel a little strange at first, but you quickly get used to it. Personally, I’ve found a nice added benefit to using Git as a tool for delimitating where my day ends. The last thing I do each day is commit all the work I’ve done, giving me a nice sense of accomplishment (or terrible shame if I’ve been unproductive…), and giving my brain a signal that work is over it’s time to switch gears.

Learning resources

Image credit: [**Unsplash**](https://unsplash.com/photos/WE_Kv_ZB1l0)

Figure 2: Image credit: Unsplash

If I’ve managed to convince you to give version control a go then do be warned if you using Windows - getting things set up will be a bit ugly. Windows is a pretty terrible operating system for doing anything, but it’s especially bad for these kinds of things.

In an ideal world more people would realise that there are operating systems that you don’t have to pay for and don’t include spyware. I’m talking about Linux for the uninitiated. If you’ve got an old clunker computer that’s unusably slow, try installing a lightweight Linux distribution and marvel at how much difference a good OS can make. Without getting into some boring details, MacOS effectively has the same backbone as Linux, so that works well enough (but Linux is still way better!).

That’s not to say it can’t be done of course, especially with the help of these lovely resources:

  • A great resource for many facets of reproducible research is The Turing Way
  • Another great resource, particularly for integrating Git with R is Happy Git with R
  • The Pro Git book is another great resource for a more in depth understanding of how Git works, and some of it’s more advanced features
Gabriel Mateus Bernardo Harrington
Gabriel Mateus Bernardo Harrington

My research interests include spinal cord injury, bioinformatics, statistical modelling and reproducible research.

Next
Previous

Related