Beyond Data: Reproducibility in Scientific Software and the Role of Digital Preservation
By Fernando Rios
Looking at the titles of presentations and workshops at recent digital library- and curation-related conferences such as the DLF Forum, iPRES, IASSIST, and IDCC, it's hard to miss the popularity of topics related to research data management. Although describing, preserving, and sharing data has become increasingly common, the software tools, parameters, and workflows used to extract knowledge from that data aren't usually so well curated. This hinders the ability of others to extend or replicate the work.
In February, I attended a workshop in Washington, DC, on software as it relates to research transparency and reproducibility. It was hosted by the American Association for the Advancement of Science (AAAS), the publisher of the journal Science. The aim was to identify needs and challenges related to the development and use of software as part of the research process and to form recommendations for both authors and publishers that promote computational reproducibility. In attendance were representatives from funding agencies in the physical and life sciences as well as representatives from various scientific publishers. Also present were research scientists from several national laboratories and supercomputing centers.
So why is reproducibility such a big deal when it comes to software?
Research Translucency—Data is Only Half the Battle
Being able to reproduce research is a cornerstone of the scientific process, so it's interesting that in adopting digital workflows and tools, the knowledge discovery process tends to become more opaque as the complexity of the tools and their interrelations increases. Documenting and sharing the input/output data without the accompanying software process only makes the process translucent, not transparent. Therefore, to enable (or at least move closer to) true transparency, capturing the software and associated workflow is critical. Reproducibility is not the only goal, of course. Improving algorithms, tools, and analysis methods for the purpose of furthering science is a higher objective, and that objective depends on software remaining accessible.
Why is Reproducibility so Hard?
During the workshop, the view that publication is the ultimate research product was plainly evident. Nevertheless, many also underscored the need to begin to consider other research outputs, such as software, as part of the academic reward system.
Most in the research community would agree that the biggest hurdle to reproducibility is not technical but cultural. Without appropriate reward systems in place, there is little incentive to produce well-documented, well-engineered software. Of course, changing the status quo is not easy. Funders and publishers recognize that, and it was frequently noted that they currently are the ones most able to influence change (from the top down at least).
Intellectual property law is another big hurdle. The lack of knowledge surrounding issues in this area (e.g., licensing restrictions that prevent the distribution and archiving of software) makes it a difficult problem to address head-on. Even at this workshop devoted to reproducibility in research software, there was a dearth of contributions to the discussion in the panel dedicated to legal issues. This speaks volumes about the difficulty of the problem.
Although the largest hurdles are cultural, technical challenges remain. Some in the computational community consider the technical aspects a solved problem, but I would disagree. Certainly, there are components, such as virtualization technologies, that are more mature than others. However, something that was lacking in the discussions was a digital preservation perspective.
The Role of Digital Preservation
Although there was much discussion about reproducibility in relation to the publication process, there was little agreement on how reproducibility can best be implemented, both in terms of infrastructure and researcher education.
I asked a few attendees what they did with their code to preserve it and make it available to others. Almost invariably, the answer was something along the lines of "Oh, I archive it on GitHub." A digital archivist might gasp at such a statement. GitHub isn't usually considered to be a good long-term archive since it is set up as a social, collaborative code/document editing environment, not as a repository. Interestingly, one journal dedicated to publishing papers on impactful research software uses GitHub as an archive. The resulting system seemed to me quite cumbersome.
An interesting discussion took place about how long software should be preserved. Although a specific time frame was not agreed upon, there was consensus that research software has a much shorter useful lifetime than data, and preserving it indefinitely is neither required nor desired. This may be one reason that services like GitHub are seen as adequate archives of code. Knowing how long software is likely to be useful is important for the development of archiving services and infrastructure given limited resources.
Since ensuring continued access to digital resources is one aspect of reproducibility for research software, I think the digital preservation community has a lot of expertise to offer. Many of the approaches already used for the preservation of digital objects (such as data) can serve as a foundation to enable reproducibility, although more work is required to ensure the unique needs and characteristics of software are appropriately addressed (see Hong 2014 for a discussion around minimal software metadata).
The digital preservation community can also help promote more structured preservation as a way to foster improved reproducibility by enabling software to be more easily found, used, and linked to related information. In addition, it can encourage engineering and documentation best practices to ensure software is "curation-ready" in a bottom-up approach.
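As one illustration of what "curation-ready" might look like in practice, the sketch below pairs a few descriptive fields with per-file checksums, a fixity technique commonly used when preserving digital objects. It is only a minimal sketch: the field names (`name`, `version`, `license`, `files`) and the helper functions are hypothetical choices made for this example and are not taken from Hong 2014, CodeMeta, or any tool discussed at the workshop.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(root: Path, name: str, version: str, license_id: str) -> dict:
    """Describe a software snapshot: descriptive fields plus per-file checksums.

    The record layout is a hypothetical example, not a published standard.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "name": name,
        "version": version,
        "license": license_id,
        "files": [
            {"path": p.relative_to(root).as_posix(), "sha256": sha256_of(p)}
            for p in files
        ],
    }


def verify_record(root: Path, record: dict) -> list:
    """Return the recorded paths that are missing or whose checksum changed."""
    changed = []
    for entry in record["files"]:
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            changed.append(entry["path"])
    return changed


if __name__ == "__main__":
    # Build a throwaway "project" so the example runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "analysis.py").write_text("print('result')\n")
        record = build_record(root, "demo-analysis", "0.1.0", "MIT")
        print(json.dumps(record, indent=2))

        # Simulate an undocumented change after the snapshot was described.
        (root / "analysis.py").write_text("print('different result')\n")
        print("Changed since snapshot:", verify_record(root, record))
```

A record like this does not make an analysis reproducible by itself, but it lets someone confirm later that the code they retrieved is the code that was described, which is a precondition for any attempt to rerun it.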
There's Still a Long Way to Go
We're still pretty far away from software reproducibility. At the workshop, for example, it was mentioned that even when researchers made deliberate attempts to make their software reproducible, others still had trouble reproducing the experiments. Although there are much larger cultural and legal issues that must be addressed, the digital preservation community is well positioned to tackle many of the problems around preserving research software. [1]
[1] Of course, I'm not the first to come to this realization, as evidenced by work in areas such as the use of virtualization and emulation (e.g., Rosenthal 2015), metadata standards for research software (e.g., Hong 2014, CodeMeta), and citation and attribution (e.g., Niemeyer, Smith, and Katz 2016; Piwowar and Priem 2016; and the Software Citation Working Group).
Hong, Neil Chue. 2014. Minimal information for reusable scientific software. Figshare: http://dx.doi.org/10.6084/m9.figshare.1112528.
Niemeyer, Kyle E., Arfon M. Smith, and Daniel S. Katz. 2016. The challenge and promise of software citation for credit, identification, discovery, and reuse. arXiv preprint. arXiv:1601.04734 http://arxiv.org/abs/1601.04734.
Piwowar, Heather, and Jason Priem. 2016. Depsy: Valuing the software that powers science.
Fernando Rios is a research data management fellow at Johns Hopkins University.