Most of us in the biotech space are fully onboard with the concept of reproducible research. Who thinks it’s a bad idea to be able to trace back from our results to the data and methods that produced them?
The trick, of course, is in how to do it. In the lab, there’s a strong culture of recording our experiments, tracking samples in LIMS systems and lab notebooks. However, in computational research the processes are often more personalized, or simply aren’t done.
And yet we’ve all experienced the confusion of looking back at a collection of analyses we’ve done, and wondering which was the “real” result, or why the results of one analysis differed from the other. What about that slide deck we showed to management – which of the many versions of the RNA-Seq pipeline did it come from? Why does the code result in a different result than it did the last time I ran it?
I had this problem with my dissertation. My graduate work was done entirely in R, and included nearly 100 figures and tables. Some of these figures were very similar – the same analysis was done on several different datasets. Keeping track of the origin of each figure was extremely difficult; I couldn’t convince myself I had it right.
Two things helped with this: version control and self-documenting code. I’ve written about version control before
. It solved the problem of figuring out why the results might change from one day to the next. Today I’ll tell you about why self-documenting code is a critical component of a data analyst’s workflow.
My entire thesis was self-documented. It was written in Sweave, which is a documentation integration package for R and LaTeX. I embedded all of the code from my work into the text document that described it. When I made a change to the text, I would rebuild the whole thing and see a pdf emerge with all of my prose, figures, and tables, all beautifully formatted. Since the code that produced any given figure was located right next to the text that described it, I was fully confident I knew exactly which analysis produced which plot. Since the whole document was checked into version control (I used Subversion at the time), I knew I could track back any changes to the plot to the changes in the analysis that produced it. It may sound like doing this sort of tracking and documentation is a lot of work, and I won’t pretend that there isn’t overhead. But this extra work paid off profoundly one day.
A good four months into the thesis-writing process one of the public datasets I had analyzed was retracted; there were irregularities in the authors’ data curation process. I was mortified – what did this mean for my work in integrating these datasets, and how was I going to extract all of the plots and tabular results that included the disgraced dataset?
Since my thesis was self-documenting, I didn’t need to worry about how the extraction of one dataset would mangle the organization of the document. It was structured like a programmer had written it – with for loops and lists of datasets. I deleted the offending dataset and rebuilt the thesis. My figures were recreated, my tables recalculated. I did need to edit a few things by hand; for example, any mention of that disgraced dataset in the text needed to be changed. But overall, reproducible research had saved me. I could have spent a month rewriting and instead I spent a few days.
I can’t recommend self-documenting code highly enough. It’s helped me on many occasions since, in smaller analyses and in less-spectacular ways, but I no longer even think about doing work for myself or for a client without self-documentation.
When I wrote my thesis Sweave was pretty state-of-the art and Subversion was not a complete dinosaur. Now, there are better options available, like Markdown and Knitr, or the open lab notebooks put together by the Jupyter folks. Any one of them will make your analysis infinitely more reliable. For version control, Git is the most common choice, but any of them will give you the confidence you need that your code is what you think it is.