DRAM: Contribution

Hi there! Many thanks for taking an interest in improving DRAM.

DRAM is written as a Nextflow pipeline that wraps the largely Python code that does the DRAM calculations. Nextflow handles much of the cross-platform deployment, dependency management, scheduling, and parallelization. To get those benefits, any DRAM Python code we need must be exposed through a CLI, so that Nextflow can call it directly by passing arguments to it. The script should then write its outputs to files so that Nextflow can pass those files to the next step.
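
As a rough illustration of that pattern, here is a minimal sketch of a script that could live in the bin folder. The script's task (counting rows per value of a column) and its options (--input, --output, --column) are hypothetical and not taken from DRAM; the point is only the shape: arguments come in through the CLI, and results go out to a file in the working directory.

```python
#!/usr/bin/env python
"""Hypothetical bin/ script (not part of DRAM): count rows per value of a column."""
import argparse
import csv
import sys
from collections import Counter


def count_by_column(rows, column):
    """Count how many rows share each value of `column`."""
    return Counter(row[column] for row in rows)


def main(argv=None):
    """Parse CLI arguments, read the input TSV, and write the counts to a file."""
    argv = sys.argv[1:] if argv is None else argv
    parser = argparse.ArgumentParser(description="Count rows per value of a column.")
    parser.add_argument("--input", required=True, help="Input TSV with a header row.")
    parser.add_argument("--output", required=True, help="Output TSV to write.")
    parser.add_argument("--column", default="sample", help="Column to count by.")
    if not argv:
        # Called with no arguments: show the help text instead of failing.
        parser.print_help()
        return
    args = parser.parse_args(argv)

    with open(args.input, newline="") as handle:
        counts = count_by_column(csv.DictReader(handle, delimiter="\t"), args.column)

    # Write results to a file so the calling process can declare it as an output.
    with open(args.output, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([args.column, "count"])
        for value, count in sorted(counts.items()):
            writer.writerow([value, count])


if __name__ == "__main__":
    main()
```

A process could then invoke such a script with something like `--input annotations.tsv --output counts.tsv` (file names made up here) and declare the output file as a process output.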

Nextflow resources

You can read more about Nextflow here.

Nf-core

DRAM also follows the conventions of the nf-core community, a large community of bioinformatics Nextflow users who have developed tools and standards that make Nextflow easier to use. DRAM is currently still transitioning from the original Nextflow implementation of DRAM to the nf-core style DRAM, which will allow more flexibility and easier development.

nf-core docs, tools, and examples can be found here.

Main Structure of DRAM with Nextflow

The pipeline kicks off with main.nf, which runs boilerplate initialization steps such as printing the help text if requested and printing the parameters you are using. It then calls the main DRAM pipeline file, workflows/dram.nf, which controls the entire flow of DRAM's steps. Every pipeline step we think of in DRAM (Call Prodigal, Annotate, Distill, Product, etc.) runs through this file to some extent.

Because Annotate is a large step with many substeps, workflows/dram.nf hands it off to subworkflows/local/annotate.nf, which contains all of the logic for the Annotate step. That subworkflow in turn calls individual processes such as modules/local/annotate/mmseqs_index.nf, which are dispatched by whichever scheduler you are using (SLURM, local, etc.). These processes, dispatched as individual or parallel jobs, are what run the DRAM Python scripts and scripts in other languages.

Pipeline contribution conventions

To make the DRAM code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.

Adding a new step

  1. Write up your main script or script code in whatever language and place it in the bin folder.

    • The script must be executable for Nextflow, so add a shebang to the top of the file (e.g. #!/usr/bin/env python for a Python file), then run chmod a+x bin/new_script.py, replacing new_script.py with the actual script name.

  2. If this is a new process, add a new Nextflow script (.nf) under modules/local/whatever_subdir_it_goes.

  3. Define the corresponding input channels and output channels.

  4. If your scripts in the bin folder need external dependencies (such as from conda), add an environment.yml file in the same directory as your process (under modules/local/..). Then, in your process script, add conda "${moduleDir}/environment.yml" above your input channels. (See other process scripts for examples.)

  5. Install the wave-cli tool, which creates containers on demand from conda environments. This is how we specify our containers without having to manage Dockerfiles or anything similar: Wave tells Nextflow how to pull a container on demand from a conda file. You can install wave-cli here. Ideally this will be integrated into our CI/CD system in the future.

  6. Once you have wave installed, run wave --freeze --conda-file modules/local/subdir/environment.yml, which prints a string starting with something like community.wave.seqera.io. Below your conda "${moduleDir}/environment.yml" line, add container "YOUR STRING FROM WAVE"

    • Users can now run DRAM with -profile conda or with -profile singularity/docker/apptainer/etc., and it will work without them installing the dependencies themselves.

  7. Add a computation label to your process, such as label 'process_small', that tells Nextflow how many resources to use. We have defined defaults for what process_small, process_medium, etc. mean, but users can override these in their own configs for more control. All the options can be found in conf/base.config.

  8. Add your process by name to conf/modules.config if you want to change where your output files get stored in the user’s outdir.

  9. Add a script section that calls your DRAM script, passing in whatever CLI arguments it needs. The outputs of your DRAM script should be the process outputs. Write them directly to the working directory; Nextflow will manage moving them to the user's output directory. You can also name outputs for later Nextflow steps with the emit: keyword, and mark some outputs as optional (see other processes).

  10. Add this process to the correct part of the pipeline so that it is called and run. If it belongs to an established step such as Annotate, go to subworkflows/local/annotate.nf, find the right spot in that code, and add it there. Otherwise, we might need to add a new subworkflow and add that to workflows/dram.nf.

  11. Add any new parameters to nextflow.config with a default, then add the equivalent parameter to nextflow_schema.json with help text. The nextflow_schema.json file populates our --help CLI option and is what the parameters page of our docs is built from. It also allows us to add sanity checks and validation for all relevant parameters.
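
For the last step, one quick way to catch a parameter that was added to nextflow.config but never documented is to compare names against the schema. The sketch below assumes the usual nf-core schema layout, where parameters sit under properties inside groups listed in $defs (or definitions in older schemas). The helper names and the example parameter names are hypothetical, and the list of config parameters is supplied by hand rather than parsed from nextflow.config.

```python
import json


def load_schema(path):
    """Read a JSON schema file such as nextflow_schema.json into a dict."""
    with open(path) as handle:
        return json.load(handle)


def schema_param_names(schema):
    """Collect parameter names from top-level and grouped `properties` blocks.

    Assumption: groups live under "$defs" or "definitions", as in nf-core schemas.
    """
    names = set(schema.get("properties", {}))
    for key in ("$defs", "definitions"):
        for group in schema.get(key, {}).values():
            names.update(group.get("properties", {}))
    return names


def missing_from_schema(config_params, schema):
    """Return the config parameter names that have no entry in the schema."""
    return sorted(set(config_params) - schema_param_names(schema))


# Tiny in-memory schema standing in for nextflow_schema.json (made-up content).
example_schema = {
    "$defs": {
        "input_output_options": {
            "properties": {
                "input": {"type": "string"},
                "outdir": {"type": "string"},
            }
        }
    }
}

# "new_flag" is a hypothetical parameter that was not added to the schema.
print(missing_from_schema(["input", "outdir", "new_flag"], example_schema))
# prints ['new_flag']
```

Against the real file, the same check would start from load_schema("nextflow_schema.json"); this is only a convenience check and does not replace the schema validation that runs with the pipeline.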