DRAM: Contribution¶
Hi there! Many thanks for taking an interest in improving DRAM.
DRAM is written as a Nextflow pipeline that wraps the largely Python code that does the DRAM calculations. Nextflow handles much of the cross-platform deployment, dependency management, scheduling, and parallelization. To get those benefits, any DRAM Python code we write must be callable through a CLI, so that Nextflow can invoke it directly and pass arguments to it. The script should then write its outputs to files so that Nextflow can pass those outputs to the next step.
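As a minimal sketch of that pattern, a CLI-callable script could look like the following. The script's purpose (counting FASTA records), the function names, and the `--input`/`--output` flags are hypothetical placeholders chosen for illustration; they are not part of DRAM.

```python
#!/usr/bin/env python
"""Illustrative only: a hypothetical bin/ script, not part of DRAM."""
import argparse
import sys
from pathlib import Path


def count_records(fasta_path):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = argparse.ArgumentParser(description="Count records in a FASTA file.")
    parser.add_argument("--input", required=True, type=Path,
                        help="FASTA file to read")
    parser.add_argument("--output", type=Path, default=Path("record_count.tsv"),
                        help="TSV file to write in the current working directory")
    if not argv:
        # Called with no arguments: show usage instead of failing.
        parser.print_help()
        return
    args = parser.parse_args(argv)

    n_records = count_records(args.input)
    # Write the result to a file so the next pipeline step can consume it.
    args.output.write_text(f"file\trecords\n{args.input.name}\t{n_records}\n")


if __name__ == "__main__":
    main()
```

A Nextflow process could then call such a script with its arguments on the command line and declare the written file as its output.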
Nextflow resources¶
You can read more about Nextflow here.
Nf-core¶
DRAM also follows the conventions of the nf-core community, a large community of bioinformatics Nextflow users that has developed tools and standards that make Nextflow easier to use. DRAM is currently still transitioning from the original Nextflow implementation of DRAM to the nf-core style DRAM, which will allow more flexibility and easier development.
nf-core docs, tools, and examples can be found here
Main Structure of DRAM with Nextflow¶
The pipeline kicks off with main.nf, which runs some boilerplate initialization steps, such as printing the help text if needed and printing the parameters you are using. It then calls the main DRAM pipeline file, workflow/dram.nf, which controls the entire flow of the steps of DRAM. Every pipeline step we think of in DRAM (Call Prodigal, Annotate, Distill, Product, etc.) runs through this file to some extent. Because Annotate is a large step with many substeps, when DRAM runs Annotate it calls subworkflows/local/annotate.nf, which handles all of the logic for the Annotate step. Annotate then calls individual processes such as modules/local/annotate/mmseqs_index.nf, which are dispatched by the scheduler (SLURM, local, or whatever other scheduling system you are using). Those individual processes, dispatched as individual or parallel jobs, are what run the DRAM Python scripts and scripts in other languages.
Pipeline contribution conventions¶
To make the DRAM code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.
Adding a new step¶
Write up your main script or script code in whatever language and place it in the bin folder.
We need to ensure it is executable for Nextflow, so add a shebang to the top of the file (e.g. #!/usr/bin/env python for a Python file), then run chmod a+x bin/new_script.py, replacing new_script.py with the actual script name.
If this is a new process, add a new Nextflow script (.nf) under modules/local/whatever_subdir_it_goes
Define the corresponding input channels and output channels.
If your scripts in the bin folder need external dependencies (such as from conda), then add an environment.yml file in the same directory as your process (in the modules/local/..), and in your process script add conda "${moduleDir}/environment.yml" above your input channels (see other process scripts as examples).
You will now need to install the wave-cli tool to create containers on demand from conda environments. This is how we specify our containers without having to manage Dockerfiles or anything else like that. Wave just tells Nextflow how to pull a container on demand from a conda file. You can install wave-cli here. Ideally this will be integrated into our CI/CD system in the future.
Once you have wave installed, run wave --freeze --conda-file modules/local/subdir/environment.yml, which will give you a string that starts with something like community.wave.seqera.io. Below your conda "${moduleDir}/environment.yml" line, add container "YOUR STRING FROM WAVE"
Now users can run DRAM with -profile conda and -profile singularity/docker/apptainer/etc. and it will just work without them installing the dependencies.
Add a computation label to your process, such as label 'process_small', that tells Nextflow how many resources to use. We have defined defaults for what process_small, process_medium, etc. mean, but users can override these in their own configs, allowing more control. All the options can be found in conf/base.config
Add your process by name to conf/modules.config if you want to change where your output files get stored in the user’s outdir.
Add a script section to call your DRAM script, passing in whatever CLI arguments it needs. The outputs from your DRAM script should be the process outputs. You want them to be written directly in the working directory; Nextflow will manage moving them to the user's output directory. You can also name outputs for future Nextflow steps with the emit: keyword, and mark some outputs as optional (see other processes).
We now need to add this process to the correct part of our pipeline so that it gets called and run. If it belongs to an already established step such as Annotate, we might go to subworkflows/local/annotate.nf, find the right spot in that code, and add it there. Otherwise, we might need to add a new subworkflow and add that to our workflows/dram.nf
We also need to add any new parameters to our nextflow.config with a default, and then add the equivalent parameter to our nextflow_schema.json with help text. The nextflow_schema.json is how our --help CLI option is populated, as well as what the parameters page in our docs is built from. It also allows us to add sanity checks and validation for all relevant parameters.
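For the shebang and chmod a+x step above, a quick local check can catch a script that Nextflow would fail to run directly. The helper below is a hypothetical sketch, not a DRAM utility; the function name and the wording of the messages are assumptions, and it assumes a POSIX filesystem.

```python
import stat
from pathlib import Path

# The permission bits that `chmod a+x` sets (user, group, and other).
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def check_bin_script(path):
    """Return a list of problems found with a script; empty means none found."""
    path = Path(path)
    problems = []
    # Read only the first line, as bytes, so any file encoding is tolerated.
    with open(path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        problems.append("missing shebang")
    if (path.stat().st_mode & EXEC_BITS) != EXEC_BITS:
        problems.append("not executable by all users")
    return problems
```

Calling check_bin_script("bin/new_script.py") before committing returns an empty list when both conditions are met.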