Aim: Bioinformatics is experiencing a crisis of reproducibility, which inhibits research progress and undermines scientific findings. Contributing factors include incomplete documentation, poor version control, inaccessible code, and incompatible software dependencies. Containerization is a promising solution to these issues. The field of immunogenetics is especially in need of containerized workflows, because the high genomic complexity characteristic of immune loci requires the development of specialized tools. For example, in 2021 we published a pipeline, Pushing Immunogenetics to the Next Generation (PING), designed to genotype the killer immunoglobulin-like receptor (KIR) genes from short-read data. Because of its many dependencies, however, some investigators found PING challenging to install and run. This prompted us to containerize both PING and MHConstructor, a recently developed tool from our lab that performs de novo short-read sequence assembly of the human major histocompatibility complex (MHC) region.
Method: We chose Singularity as our container platform for its simplicity: it encapsulates all of a workflow’s dependencies in a single image file that can be executed directly. A particular challenge for MHConstructor is its reliance on multiple Python versions, which we addressed by implementing multiple conda environments within one container. Lastly, we sought to improve the accessibility of our pipelines by offering users the alternative of downloading the container image directly from Sylabs.
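The multi-environment approach can be illustrated with a minimal sketch, which is not the pipelines' actual implementation. It assumes that conda is available on the PATH inside the container and uses hypothetical names for the image file (`mhconstructor.sif`) and the environments (`py2_env`, `py3_env`). Each step is wrapped in `singularity exec` followed by `conda run -n`, so that steps needing different Python versions run from the same image:

```python
import shlex
from typing import List, Sequence


def containerized_step(image: str, conda_env: str, command: Sequence[str]) -> List[str]:
    """Build the argument list that runs `command` inside the Singularity
    image `image`, within the conda environment `conda_env`."""
    if not command:
        raise ValueError("command must not be empty")
    # `singularity exec IMAGE ...` runs a command inside the container;
    # `conda run -n ENV ...` runs it inside the named conda environment.
    return ["singularity", "exec", image, "conda", "run", "-n", conda_env, *command]


# Hypothetical steps: the environment names are illustrative assumptions.
STEPS = [
    ("py2_env", ["python", "--version"]),
    ("py3_env", ["python", "--version"]),
]

if __name__ == "__main__":
    for env, cmd in STEPS:
        argv = containerized_step("mhconstructor.sif", env, cmd)
        # Print the command line; pass `argv` to subprocess.run to execute it.
        print(shlex.join(argv))
```

Building the argument list separately from executing it keeps the sketch usable on machines where Singularity is not installed.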
Results: We tested both containerized pipelines on a different high-performance computing (HPC) system to confirm that they run in a different environment. Both runs were successful and produced output consistent with that of the local version of each pipeline. We also reduced the time users need to obtain the Singularity image by hosting it on Sylabs, which roughly halved the time needed to obtain the MHConstructor image.
Conclusion: We believe that reproducible pipelines should be the standard for bioinformatics tools. Here, we advance this initiative two pipelines at a time. The containerization of PING and MHConstructor ensures the reproducibility of these two immunogenetic methods, enabling reliable high-throughput analysis of large datasets that is not otherwise possible with currently available tools.