Responsible research: configuration version-controlled
The term Responsible Research is often used to describe an ethical research process that takes its potential environmental or social impact into consideration. However, in this post, I will use this term with a completely different meaning, which is somewhat related to (but different from) Reproducible Research in computational science. “Reproducible Research” has created such a buzz that the online course platform Coursera even has a dedicated course on it.
The buzzword “reproducible research” leads people to believe that the work thus produced must be correct; I should, nonetheless, point out that this is not the case. The word “reproducible”, separated from its context, can have two meanings: either the research conclusion in question can be reproduced under precisely the same configuration, or the research conclusion can also be reproduced under different configurations. A failure of the former implies academic dishonesty, while a failure of the latter implies the uselessness of the research.
When scientists use the term “reproducible research”, they mean the first definition, which, in my opinion, is nothing more than a euphemism for fighting academic dishonesty. Academic honesty is a very rudimentary standard, and by itself it is insufficient to make the research correct or useful. Even if the researchers can provide the data and the code, there is nothing to guarantee that the code is bug-free and truly reflects the computation or analysis protocol. Even if it does, that cannot guarantee reproducibility under slightly different configurations. (Scientific conclusions should generalize. For instance, gravity does not exist only at a single point; gravity is everywhere.) In short, the so-called “reproducible research” sets a very low standard.
It is worth noting that it is neither possible nor necessary for a researcher to guarantee the reproducibility of his work under any configuration other than the ones he used. The best he can do is to keep a complete, detailed record of his research and to use it for comparison whenever the research conclusion cannot be reproduced, whether under different configurations or under the same configuration by other researchers.
This practice leads to my idea of “responsible research”, which I attempt to discuss in this post. By this term, I want to describe complete, detailed, chronological bookkeeping, which allows the research design and the computation outcomes at any point in time to be revisited, in contrast to “reproducible research”, which showcases only the final design and outcomes. If reproducible research is the snapshot of the final cross section of the research history, “responsible research” would then be the snapshots of every cross section of the history. Indeed, a half-hearted reviewer will only look at the final cross section, but if you want to be responsible, you will need to be your own reviewer and examine every cross section. This practice will allow you to compare the results of every configuration and spot potential abnormalities.
[Figure: Reproducible Research vs. Responsible Research]
To achieve this kind of bookkeeping, I propose to use version control. In the remainder of this post, I will first describe how computational scientists conduct their research and point out why their current methodology is not appropriate for responsible research. Then, I will draw inspiration from experimental scientists and propose a research diary solution. Finally, I will show how the research diary can be version-controlled, which not only makes the research more traceable but also improves collaboration and hence accelerates the research.
The hammer of computational scientists
There are typically three modes of science. The first one, experimental science, studies the physical system directly. The second one analyzes a mathematical model of the physical system. The third one, computational science, makes simulations with the mathematical model. Each mode has its methodology, and this section describes the methodology of computational science.
In any computational science (e.g., computer science, machine learning, biostatistics, quantitative trading, real-time bidding), there are four types of data: input, configuration, output, and evaluation; and two types of procedures: algorithms and criteria. The algorithm, along with its configuration, takes the input and generates the output. Then, the criterion, along with its own configuration, takes the output as well as the input and generates the evaluation.
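To make the four data types and two procedures concrete, here is a minimal Python sketch. The toy classifier, the accuracy criterion, and all names (`run_experiment`, `threshold_classifier`, `accuracy`) are hypothetical illustrations of mine, not part of any particular library.

```python
def run_experiment(algorithm, criterion, data, algo_config, crit_config):
    """Algorithm + configuration: input -> output.
    Criterion + configuration: (output, input) -> evaluation."""
    output = algorithm(data, algo_config)
    evaluation = criterion(output, data, crit_config)
    return output, evaluation


def threshold_classifier(data, config):
    # Hypothetical algorithm: predict 1 when the feature reaches the threshold.
    return [1 if x >= config["threshold"] else 0 for x, _ in data]


def accuracy(output, data, config):
    # Hypothetical criterion: share of predictions matching the labels.
    correct = sum(1 for pred, (_, label) in zip(output, data) if pred == label)
    return round(correct / len(data), config["digits"])


# Input: (feature, label) pairs.
toy_input = [(0.2, 0), (0.6, 1), (0.4, 1), (0.9, 1)]
output, evaluation = run_experiment(
    threshold_classifier, accuracy, toy_input,
    algo_config={"threshold": 0.5}, crit_config={"digits": 2},
)
```

Note that the criterion receives the input as well as the output, because the labels it compares against live in the input.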
Even when we have a single input, we can still have a significant amount of output and evaluation by varying the configuration of the algorithms and the criteria. To store these data, we can use the key-value pair data structure (aka dictionary). The key is usually a string encoding the source of the input and the configuration – we can also occasionally encode the algorithm’s name or the criterion if we intend in advance to vary them. The value is composed of the output and the evaluation. This practice is widely adopted either formally (e.g., HDF5) or informally (e.g., array).
By using the key related to the configuration, we can easily access the output and the evaluation. Here, we store both the output and the evaluation. By storing the output, we can directly calculate any new criteria without running the algorithm once again; by storing the evaluation, we do not have to re-evaluate when switching among various criteria (some criteria can be more time-consuming than the algorithms themselves).
This strategy seems smart, but it has one flaw: it requires you to predefine which parts of the configuration you want to vary, which contradicts the philosophy of science. In science, it is difficult, if not impossible, to predict the outcome; we explore it, we play with it, and we dynamically adjust our approach until we reach the truth. As a consequence, we often find the keys allocated a while ago incomplete, ambiguous, and outdated. The longer we conduct the research, the more frustrated we become. We are lost in an ocean of data. The experiment data becomes more of a liability than an asset.
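The storage scheme just described can be sketched with a plain Python dictionary. The key format (source name, then sorted `name=value` pairs) and the names `make_key`, `store`, `lookup`, and `toy_input` are assumptions made for illustration only.

```python
def make_key(source, config):
    """Encode the input source and the configuration into one string key."""
    settings = ",".join(f"{name}={config[name]}" for name in sorted(config))
    return f"{source}|{settings}"


results = {}  # key -> {"output": ..., "evaluation": ...}


def store(source, config, output, evaluation):
    results[make_key(source, config)] = {"output": output, "evaluation": evaluation}


def lookup(source, config):
    return results[make_key(source, config)]


store("toy_input", {"threshold": 0.5}, output=[0, 1, 0, 1],
      evaluation={"accuracy": 0.75})

cached = lookup("toy_input", {"threshold": 0.5})
# A new criterion is computed from the stored output,
# without running the algorithm again.
positive_rate = sum(cached["output"]) / len(cached["output"])
```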
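The flaw can be reproduced in a few lines. In this hypothetical sketch, the key fields are fixed at the start of the project (`KEY_FIELDS` is an assumed name); when a new setting such as `normalize` is introduced later, the old key format cannot tell the two designs apart.

```python
# Fields chosen at the start of the project, before we knew what we would vary.
KEY_FIELDS = ("threshold",)


def make_fixed_key(source, config):
    """Build a key from the predefined fields only."""
    settings = ",".join(f"{field}={config[field]}" for field in KEY_FIELDS)
    return f"{source}|{settings}"


old_design = {"threshold": 0.5}
new_design = {"threshold": 0.5, "normalize": True}  # setting added months later

# Both designs map to the same key, so one result silently overwrites the other.
collision = make_fixed_key("toy_input", old_design) == make_fixed_key("toy_input", new_design)
```

Extending the key format fixes new entries but leaves the old keys ambiguous, which is exactly the frustration described above.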
The gift from experimental scientists
Before giving a solution to the problem mentioned at the end of the last section, I will first discuss how experimental scientists conduct their research. We often hear biology students complaining about the dozens of pages of experiment reports that they need to write before and after an experiment, and we also see on television how some archaeologist’s diary helps young adventurers escape from danger. This kind of research diary is valuable and is considered the norm in experimental science.
This practice, common in experimental science, raises the question of why the same approach did not make its way into computational science. A reasonable answer is the convenience of rerunning the computation given the computing power of today’s hardware. If we can rerun the computation, which costs only a couple of seconds, whenever we are confused about the keys we previously allocated, why bother taking pains to craft a detailed research diary, which will never win us a Nobel Prize in Literature?
Indeed, when we revisit biologists’ experiments, we do discover that their experiments are time-consuming. They may wait for several weeks or even several months to obtain some bacteria, and by then they may have forgotten how they conducted the experiment. If they do not write a detailed research diary, they will not be able to understand why sometimes they succeed and other times they fail.
On top of the long time span of the experiments, the enormous amount of experimental detail (equivalent to configuration) and the unpredictability of the outcome also contribute to the adoption of the research diary methodology. A seemingly insignificant detail may lead to a failure or an unexpected discovery, which, if not properly documented, can cost you a Nobel Prize in Medicine. Together, these three factors explain the research diary dogma of experimental science.
Given the three reasons above, it is not too difficult to understand why computational science did not need a research diary. I used the word “did” because this is changing. Today, we can observe all three characteristics of experimental science in computational science as well. First, with the rise of big data, we can no longer finish our experiments within several seconds. Indeed, most experiments take several hours, several days, or even several weeks. The experiment has become time-consuming for us too. Second, the input can be high-dimensional with many features, and the algorithms are growing more and more complex and contain more and more hyperparameters. We want to test the inclusion or exclusion of specific features and various values of the hyperparameters. To conduct the experiments in a more organized way, many researchers use a dedicated external JSON file for the configuration. Third, algorithms increasingly rely on stochasticity, for example random initialization and re-sampling. The outcome of the algorithms is thus becoming unpredictable. All of these changes in computational science make it urgent to adopt the research diary approach, just as experimental scientists have done.
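As an illustration of the external JSON configuration mentioned above, the following sketch keeps the feature selection, a hyperparameter, and a random seed in one JSON document. The field names (`features`, `learning_rate`, `seed`) and the helper functions are hypothetical, and recording the seed is my own assumption about how one might tame the stochasticity; the JSON is embedded as a string only to keep the example self-contained.

```python
import json
import random

# In practice this text would live in a separate file such as config.json.
CONFIG_TEXT = """
{
    "features": ["age", "income"],
    "learning_rate": 0.1,
    "seed": 42
}
"""


def load_config(text):
    return json.loads(text)


def select_features(row, config):
    """Keep only the features that the configuration includes."""
    return {name: row[name] for name in config["features"]}


def random_init(config, size):
    """Random initialization made repeatable by the recorded seed."""
    rng = random.Random(config["seed"])
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


config = load_config(CONFIG_TEXT)
row = select_features({"age": 31, "income": 52000, "zip": "00000"}, config)
weights_a = random_init(config, size=3)
weights_b = random_init(config, size=3)  # same seed, same initialization
```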
Version controlled research diary
In the last two sections, I talked about the limited expressiveness of strings as keys for storing data and about the need to adopt the research diary approach. In this last section, I will show you how to combine the best of both worlds by using version control.
The solution is embarrassingly simple, though it took me a while to reach this straightforward idea. We can write the input, configuration, algorithm, and criterion into the research diary as a plain text file (not Microsoft Word, since the Word format is binary) and commit it with the version control tool Git. We then use the hash of that commit as the key, which points to the output and the evaluation generated by the computation. In other words, the research diary is version-controlled and can be revisited and branched, whereas the computation outcomes are not version-controlled and are all stored in a dictionary indexed by the commit hash.
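A minimal sketch of this idea in Python follows. It assumes the outcomes are kept as one JSON file per commit in a directory outside the repository; the directory layout and the names `current_commit_hash`, `save_outcome`, and `load_outcome` are hypothetical. `git rev-parse HEAD` is the standard way to print the hash of the current commit.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def current_commit_hash(repo_dir="."):
    """Return the hash of the commit that holds the current research diary."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def save_outcome(store_dir, commit_hash, output, evaluation):
    """Store output and evaluation under the commit hash (not in Git)."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{commit_hash}.json"
    path.write_text(json.dumps({"output": output, "evaluation": evaluation}))
    return path


def load_outcome(store_dir, commit_hash):
    return json.loads((Path(store_dir) / f"{commit_hash}.json").read_text())


# Demonstration with a placeholder hash; in real use, pass current_commit_hash().
with tempfile.TemporaryDirectory() as tmp:
    fake_hash = "0" * 40
    save_outcome(tmp, fake_hash, [0, 1, 0, 1], {"accuracy": 0.75})
    restored = load_outcome(tmp, fake_hash)
```

Committing the diary before launching the computation keeps the key and the design in sync: the hash identifies exactly the text that describes the run.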
The commit messages can be used to give brief information about the configuration as well as the major changes in the experiment design, whose details can be viewed in the corresponding version of the research diary by showing that commit.
You can compare any two versions of the research diary by diffing their commits.
You can also create a new branch of your research design and explore several directions simultaneously.
Last but not least, you can collaborate with other researchers by pushing and pulling commits, and the most successful research design will be merged into the master branch.
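The workflow above maps onto a handful of standard Git subcommands. The sketch below wraps them in Python; the pairing of each step with a subcommand (`show`, `diff`, `checkout -b`, `push`, `pull`, `merge`) is one plausible mapping that I am assuming, and the file name `diary.md` and branch name are hypothetical.

```python
import subprocess


def git(args, repo_dir="."):
    """Run `git <args>` in repo_dir and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def view_diary_args(commit, diary_path):
    # Show the research diary as it was at a given commit.
    return ["show", f"{commit}:{diary_path}"]


def compare_args(commit_a, commit_b, diary_path):
    # Compare two versions of the research diary.
    return ["diff", commit_a, commit_b, "--", diary_path]


def new_branch_args(name):
    # Start a new direction of the research design.
    return ["checkout", "-b", name]


def merge_args(branch):
    # Merge a successful design; run this while on the master branch.
    return ["merge", branch]


# Collaboration: publish your commits and fetch those of your colleagues.
SHARE_ARGS = [["push"], ["pull"]]

# Example: git(view_diary_args("HEAD", "diary.md")) prints the current diary.
```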
In this post, I proposed using a version-controlled research diary to track your research as well as to collaborate with other researchers. The commit hash of each version points to the outcomes of the computation, which allows you to access the data generated by any research design. With this technology, your research becomes more responsible and reproducible, for both the community and yourself.