12/03

I've recently been working on installing a bioinformatics package called Whole-Genome Shotgun Assembler (wgs-assembler) on the cluster I work with. One error took us a while to fix, so I am writing up the details here in the hope of helping other system administrators who install this software.

Like most research software, wgs-assembler ships a number of wrapper scripts that call various other bioinformatics programs (its dependencies) to carry out the different processing stages. Research software is often poorly documented, so it can be very hard to figure out what is going on when an error occurs. One of these scripts, PacBio Corrected Reads (PBcR), produced the following error message when a user ran it:

ERROR: Overlap prep job /users/**/tempsciara/1-overlapper/correct_reads_part 1
FAILED.
ERROR: Overlap prep job /users/**/tempsciara/1-overlapper/correct_reads_part 2
FAILED.
...
8 overlap partitioning jobs failed.

This issue was hard to deal with because I couldn't reproduce it: the exact same command worked for me. Apparently something differed between my run and the user's run. Reproduction matters in a case like this, since there is little documentation to consult and the error message itself is not very informative. The best way to deal with this kind of ad-hoc script is to debug the script that causes the trouble. By using a debugger, or simply by temporarily modifying the script to print more information, we can work out how the script behaves and then find the direct cause of the failure.
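As a minimal sketch of the "get more information" step, the helper below runs a single command and reports its exit code and captured output instead of letting a wrapper swallow them. The name `run_and_report` is hypothetical and is not part of wgs-assembler; the exit code 127 for a missing executable simply mimics the usual shell convention.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr).

    Hypothetical helper for illustration only.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable does not exist at the given path.
        # 127 mirrors the shell's "command not found" convention.
        return 127, "", str(exc)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Run a harmless command and show everything the wrapper would normally hide.
    code, out, err = run_and_report([sys.executable, "-c", "print('ok')"])
    print(code, out.strip(), err.strip())
```

Running the failing intermediate command through something like this on the same node as the user makes a missing executable show up immediately, rather than as a generic "FAILED" several steps later.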

Finally, we managed to reproduce the error on our end, and it turned out that the script hard-codes an absolute path, `/usr/bin/time`, for an external command. This is understandable, because `time` is a basic utility (GNU `time`, which most distributions ship in its own `time` package rather than in coreutils), and most Linux users have it installed at `/usr/bin/time`. However, a cluster normally needs to host as many compute nodes as possible. Putting too much into the core installation (i.e. storing too many things locally on each node) would obstruct this goal and slow down the system. Therefore, in our case, we installed some non-essential utilities elsewhere, accessible via NFS. The user was running the job on a compute node that does not have `time` installed in `/usr/bin`, while we were attempting to reproduce the error on a fully functional master node that has the command installed locally. The script did print an error message as the result of the failure at an intermediate step, but the message was not documented anywhere, so we could only figure out its cause by going through the source code.
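The difference between the two nodes can be checked with a small sketch like the following, which compares a hard-coded path against a `PATH` lookup on whatever node it runs on. The function name `locate` and its `search_path` parameter are assumptions made for this illustration; `shutil.which` and `os.access` are standard-library calls.

```python
import os
import shutil


def locate(command, hardcoded_path, search_path=None):
    """Compare a hard-coded path with what a PATH lookup finds on this node.

    search_path overrides the PATH used for the lookup (None means the
    current environment's PATH). Hypothetical helper for illustration.
    """
    hardcoded_ok = os.path.isfile(hardcoded_path) and os.access(hardcoded_path, os.X_OK)
    return {
        "hardcoded_exists": hardcoded_ok,
        # shutil.which returns None when the command is not on the search path.
        "found_on_path": shutil.which(command, path=search_path),
    }


if __name__ == "__main__":
    print(locate("time", "/usr/bin/time"))
```

On a node like our compute nodes, one would expect `hardcoded_exists` to be false while `found_on_path` points at the NFS location, assuming that location is on the job's `PATH`.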

The fix for this issue is to replace every occurrence of `/usr/bin/time` in PBcR (or PBcR.pl in the installation source code) with `/usr/bin/env time`, so that `time` is looked up on the `PATH`. A bug report has been filed with the author.
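The replacement can be sketched as a small text transformation. This is only an illustration of the idea, not the patch that was sent upstream; the function name `use_env_time` and the sample command line in the usage example are made up.

```python
import re

# Match "/usr/bin/time" only as a complete path: the lookbehind rejects a
# longer path such as "/opt/usr/bin/time", and \b rejects "/usr/bin/timeout".
HARDCODED_TIME = re.compile(r"(?<![\w/.-])/usr/bin/time\b")


def use_env_time(script_text):
    """Return script_text with hard-coded /usr/bin/time replaced by a PATH lookup."""
    return HARDCODED_TIME.sub("/usr/bin/env time", script_text)


if __name__ == "__main__":
    # Hypothetical line; the real script's invocation may look different.
    print(use_env_time('system("/usr/bin/time some_command");'))
```

The substitution is idempotent, since the patched text no longer contains `/usr/bin/time`, so applying it twice to the same file is harmless.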

In summary, this issue suggests that the efficient steps for dealing with errors in research software are:

  1. reproduce the errors;
  2. go through the source code and debug the code to figure out what those error messages mean and what causes the errors.

Another takeaway is that script authors should not make assumptions about the path of any command. `/usr/bin/env` is a great tool for locating a desired command through the `PATH`.
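For script authors who want to audit their own code, a rough sketch along these lines can flag hard-coded command paths. The pattern only covers the common `bin` and `sbin` directories under `/`, `/usr` and `/usr/local`, which is an assumption of this example; `find_hardcoded_commands` is a hypothetical name.

```python
import re

# Absolute paths into the usual bin/sbin directories. The lookbehind keeps
# the pattern from matching in the middle of a longer path.
ABSOLUTE_COMMAND = re.compile(r"(?<![\w/.-])/(?:usr/)?(?:local/)?s?bin/[\w.+-]+")


def find_hardcoded_commands(script_text):
    """Return (line_number, path) pairs for hard-coded command paths."""
    hits = []
    for line_no, line in enumerate(script_text.splitlines(), start=1):
        if line_no == 1 and line.startswith("#!"):
            continue  # the interpreter line is expected to be absolute
        for match in ABSOLUTE_COMMAND.finditer(line):
            if match.group() == "/usr/bin/env":
                continue  # env is the PATH lookup we want to keep
            hits.append((line_no, match.group()))
    return hits


if __name__ == "__main__":
    # Made-up sample script for demonstration.
    sample = '#!/usr/bin/perl\nsystem("/usr/bin/time sort x");\n'
    print(find_hardcoded_commands(sample))
```

Each reported path is a candidate for `/usr/bin/env`, or for a configurable variable if the script needs a specific build of the tool.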