Reproducibility in Code and Science

This week SciPy 2016 took place in Austin. I didn’t attend, but the conference produced a lot of really interesting conversation, and links to talks on YouTube were shared on two different Slack channels I follow.

One of the themes that stood out was reproducibility. A good introduction to the problem of reproducibility is point 3 in the article “The 7 biggest problems facing science, according to 270 scientists”. For various reasons, documentation of procedures isn’t what it perhaps should be. Even in fields where published papers are based entirely on code, reproducibility is a problem. Datasets get edited. Mistakes get corrected but not published. Code evolves, and what runs one day might not run five or ten years later if dependencies are not exactly spelled out and managed.

One obvious way to address this issue is to pressure journal editors to pressure reviewers to pressure authors to better document code. In the Slack channel discussions, there were also two interesting technology-based ways to lessen the problem.

MyBinder.org

A way to get reproducible code notebooks online and running live quickly

MyBinder.org is a site that runs Jupyter notebooks in the cloud, “making your code immediately reproducible by anyone, anywhere”, to quote its webpage. Usually, Jupyter notebooks are either static on GitHub pages (you can see the input and results but can’t run the code live) or run on your local computer. The latter can be problematic if your local Python modules are very different versions from the ones the original author used. Known version issues can be fixed, but typically with effort. MyBinder.org gets around this by running the code live in the cloud with all the author-specified module versions, without the author having to spin up a virtual machine in the cloud or run a server themselves.
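To make that concrete, here is a minimal sketch of what a Binder-ready repository might look like. The layout and version pins below are illustrative assumptions, not the files from any particular project; the idea is that Binder reads a pinned requirements.txt (it accepts other dependency files too) and builds a matching environment before serving the notebook:

    my-repo/
        index.ipynb        (the notebook Binder will serve)
        requirements.txt   (dependencies Binder installs at build time)

    contents of requirements.txt, with exact versions pinned (illustrative):
        numpy==1.11.1
        pandas==0.18.1
        bokeh==0.12.0

Pinning exact versions is the point: anyone who launches the repo on MyBinder.org gets the same module versions the author ran, which is precisely the dependency-drift problem described above.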

[Screenshot: the to-do list running on Google Cloud]

Putting code on a server is a lot easier these days with all the cloud options. For example, today I spun up a back-end and front-end for an online to-do list on Google Cloud Platform in 25 minutes, to compare their service against Amazon’s. But although a service like Google Cloud Platform is relatively easy and cheap, it isn’t as quick as throwing a link to your GitHub repo onto MyBinder.org, nor is it free.

I put a Jupyter notebook of Python data-visualization practice up on MyBinder.org today and found the process very quick and easy. A screenshot is below. The link above takes you to the full notebook on MyBinder.org; the code is also on GitHub, where the file named index.ipynb is the version served by MyBinder.org. The dataset is morphometric measurements from three species of iris, courtesy of Kaggle.
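For flavor, here is a short Python sketch of the kind of Bokeh scatter plot the notebook produces. This is a hedged reconstruction, not the notebook’s exact code: the file name Iris.csv and the column names (PetalLengthCm, PetalWidthCm, Species) are assumptions based on Kaggle’s iris dataset, and legend_label is the keyword spelling used by newer Bokeh releases:

    import pandas as pd
    from bokeh.plotting import figure, show, output_notebook

    output_notebook()  # render plots inline in the Jupyter notebook

    # Kaggle's iris CSV; file and column names here are assumptions
    iris = pd.read_csv("Iris.csv")
    colors = {"Iris-setosa": "navy",
              "Iris-versicolor": "green",
              "Iris-virginica": "red"}

    p = figure(title="Iris petal measurements by species",
               x_axis_label="petal length (cm)",
               y_axis_label="petal width (cm)")

    # one scatter layer per species so each gets its own color and legend entry
    for species, group in iris.groupby("Species"):
        p.scatter(group["PetalLengthCm"], group["PetalWidthCm"],
                  color=colors[species], legend_label=species, size=6)

    show(p)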

[Bokeh chart of the iris data]


ReScience

ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research is reproducible. There was some discussion that this might be a way to both:

A.) increase the percentage of studies that people actually attempt to replicate, and

B.) bring undergraduates and graduate students into publishing early. 

ReScience is very young, but I think it is a very interesting concept, similar to the “International Journal of Negative & Null Results”. Titles of the first published articles include:

[Re] Chaos in a long-term experiment with a plankton community – Owen Petchey, Marco Plebani, Frank Pennekamp, ReScience, volume 2, issue 1, 2016.

[Re] Least-cost modelling on irregular landscape graphs – Joseph Stachelek, ReScience, volume 2, issue 1, 2016.

[Re] Interaction between cognitive and motor cortico-basal ganglia loops during decision making: a computational study – Meropi Topalidou & Nicolas P. Rougier, ReScience, volume 1, issue 1, 2015.
