MelArray
This is the repository for the MelArray analysis pipeline testing within EnhanceR. This pipeline uses the Snakemake workflow manager and can be run using a Singularity container.
Getting started
To run this pipeline, you need to follow the steps below:
- Clone this repository
```
git clone https://c4science.ch/diffusion/2915/esct-pipeline-interoperability.git
```
<img src="http://bit.ly/2n4EjjM" alt="Warning: " style="width: 15px" width="15px"/> If you have problems cloning with HTTPS, try with SSH instead:
```
git clone ssh://git@c4science.ch/diffusion/2915/esct-pipeline-interoperability.git
```
- Install Snakemake and Singularity using the install instructions on their respective pages.
- Set up your run script and run configuration, depending on your scheduler (LSF, SGE, SLURM, ...). There are two files you need to modify: the run script (e.g. run_lsf.sh) and the deploy configuration file (e.g. /pipeline/configs/melarray_deploy_enhancer_lsf.yaml). You can modify the existing example files; there is one for LSF and one for SGE.
The line you need to change is the one that tells Snakemake how to use your scheduler. For example, in the melarray_deploy_enhancer_lsf.yaml file, you will find the following:
```
- 'snakemake -p -k -c "bsub -n 4 -W 23:59 -R \"rusage[mem=4096]\" -R singularity" -j 16 --restart-times 1 ...
```
The part after `-c` is the cluster command used to call your scheduler; every step of the pipeline will be submitted by prepending this command. It can, for example, be changed into the following for SGE (as seen in melarray_deploy_enhancer_SGE.yaml):
```
- 'snakemake -p -k -c "qsub" -j 16 --restart-times 1 ...
```
Or to anything else that tells your scheduler how to launch jobs on your cluster.
Alternatively, you can also remove the whole `-c` argument and its value, so that Snakemake does not submit jobs to a scheduler but runs everything on your local machine/node.
Finally, you need to make sure that your runner script (e.g. run_lsf.sh) points to the right deploy configuration file (e.g. /pipeline/configs/melarray_deploy_enhancer_lsf.yaml). For example, in run_lsf.sh:
```
singularity-pipeline -p pipeline/configs/melarray_deploy_enhancer_lsf.yaml build
```
So if you create a new deploy configuration file, make sure your runner script passes that file as the parameter to the singularity-pipeline runner module.
- Run the pipeline using one of the runner scripts: run_lsf.sh, run_sge.sh, or any other runner script you created in the step above. This script will use the singularity-pipeline runner module to fetch the test data, create the container and validate the output of your pipeline after the execution.
For more instructions on how to run the Snakemake pipeline without these runner scripts, please refer to the *[Run the pipeline]* section below. Note however that if you do not use the runner scripts, you will need to "manually" fetch some test data and pull the container. For this, the test data can be found here and the container can be pulled from [here](docker://sissource.ethz.ch:5005/balazsl/usz/melarray:5.2).
What is executed for the EnhanceR test
Here is what will be run:
*This is an overview of the entire MelArray pipeline. The part in black is the one configured to be executed for the Enhancer test:* ![Rule graph](other/enhancer_png/melarray_v5p2_rulegraph_all_hori_marked.png)
*This is the simplified rule graph configured to be executed for the Enhancer test:* ![Rule graph](other/enhancer_png/melarray_v5p2_rulegraph_sub_hori.png)
*It will be executed on 4 test samples:* ![Rule graph](other/enhancer_png/melarray_v5p2_dag_sub_hori.png)
Run the pipeline
There are several ways to run the pipeline, either with or without a container, and either by running the pipeline *inside* or *outside* the container. The following describes the default mode, which is to run the pipeline using Snakemake *outside* the container, with a single container providing all dependencies.
Calling Snakemake
First make sure that you correctly put all your data in the `data/fastq` folder, the extra files (bed, interval list, tumor normal match) in the `data/extra_files` folder, and your reference files in the `ref/hg19` (or similar) folder. You can then run the full pipeline as follows:
```
snakemake
```
In case you use a custom configuration file, for example config_myCustomConfig.yaml, you can run the pipeline with that configuration file as follows:
```
snakemake --config config=myCustomConfig
```
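As a sketch of the naming convention above: the value passed to `config=` is the middle part of the configuration file name, `config_<name>.yaml`. The `pipeline/configs/` path below is taken from the repository layout; the exact lookup logic inside the Snakefile may differ.

```shell
# Hypothetical illustration of the config_<name>.yaml naming convention:
# the value given on the command line maps to a file in pipeline/configs/.
name="myCustomConfig"
config_file="pipeline/configs/config_${name}.yaml"
echo "$config_file"
```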
You can of course use all the input arguments available to the snakemake command (please see the full list here). Here are a few important ones:
- dry run: `snakemake -n`
- dry run and show the commands that would be executed: `snakemake -n -p`
- do not erase the temporary files: `snakemake --nt`
- go on with independent jobs if a job fails: `snakemake -k`
<img src="http://bit.ly/2F85JwS" alt="Info: " style="width: 15px" width="15px"/> In case you have some steps in your pipeline that are known to "randomly" fail, but can succeed if re-run multiple times, Snakemake provides an option --restart-times to try to re-run rules that failed:
```
snakemake --restart-times 3 --config config=...
```
Running the pipeline asynchronously
Running the pipeline can take a (very) long time, so here is how you can launch it *asynchronously* with `nohup`, storing the output in the run.log file:
```
nohup snakemake -k --config config=myCustomConfig > run.log 2>&1 &
```
The `> run.log 2>&1` part first redirects the standard output to the log file and then sends the error output to the same place; the trailing `&` detaches the process so it keeps running in the background.
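Redirection order matters here: redirections are applied left to right, so stderr must be redirected *after* stdout already points at the log file, or errors will not reach it. A minimal, self-contained demonstration (using only a temporary file, no pipeline commands):

```shell
# A compound command that writes one line to stdout and one to stderr;
# with "> file 2>&1", both lines end up in the file.
log=$(mktemp)
{ echo "to stdout"; echo "to stderr" >&2; } > "$log" 2>&1
captured=$(cat "$log")
rm -f "$log"
echo "$captured"
# prints:
# to stdout
# to stderr
```

With the reversed order, `2>&1 > "$log"`, stderr would be duplicated onto the terminal before stdout is redirected, and only "to stdout" would land in the file.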
<img src="http://bit.ly/2F85JwS" alt="Info: " style="width: 15px" width="15px"/> Note that all the following commands (and many others of your own creation) can be wrapped with the `nohup` + `> XXXX.log 2>&1 &` wrapper to launch them asynchronously. However, for clarity, all the following commands will be written *without* the wrapper.
Running subworkflows
By default, the pipeline will run all subworkflows specified by the config file. With the default config file, this is all of them, which, at the time of writing (2018-01-24), corresponds to the following list:
```
# switches for activating the different subworkflows
subworkflows:
    - "TrimAlign"
    - "GATK"
    - "VariantCaller"
    - "Sequenza"
    - "VariantAnnotation"
    - "HaploCaller"
    - "AnaCovar"
    - "OptiType"
    - "Fusion"
    - "Mixcr"
    - "QC"
```
However, a different set of subworkflows can be run by adding elements to, or removing them from, this list in your custom configuration file.
Alternatively, it is possible to specify the list of subworkflows on the command line. For example, the following will only run the TrimAlign subworkflow:
```
snakemake --config subworkflows=TrimAlign
```
It is also possible to select several subworkflows by passing a comma-separated list:
```
snakemake --config subworkflows=TrimAlign,GATK
```
Or to run all subworkflows by specifying 'all' or '*':
```
snakemake --config subworkflows=all
snakemake --config subworkflows=*
```
<img src="http://bit.ly/2n4EjjM" alt="Warning: " style="width: 15px" width="15px"/> Note that any subworkflow option specified at the command line will *override* the list specified in the config file.
<img src="http://bit.ly/2n4EjjM" alt="Warning: " style="width: 15px" width="15px"/> Keep in mind that the subworkflows are still depending on each other, even if you can call them individually. For example, you need to run the "TrimAlign" subworkflow before you run the "GATK" subworkflow. If you specify a subworkflow that does not have its input files ready, Snakemake will throw an error:
```
# files are absent
[wid@derm USZ]$ ll /home/wid/USZ/data/3_bwamem
total 0

# snakemake will *not* execute the subworkflow
[wid@derm USZ]$ snakemake --config subworkflows=GATK 2>/dev/null
MissingInputException in line 25 of /home/wid/USZ/pipeline/subworkflows/GATK.snake:
Missing input files for rule mark_duplicates:
/home/wid/USZ/data/3_bwamem/Sandra-4T.bam
```
However, if you run the first part of the pipeline, you can afterwards call the second part individually, as the output files will already be present:
```
# files are now present
[wid@derm USZ]$ ll /home/wid/USZ/data/3_bwamem
total 0
-rw-r--r--. 1 wid wheel 0 Jan 24 17:14 Sandra-3T.bam
-rw-r--r--. 1 wid wheel 0 Jan 24 17:13 Sandra-4T.bam

# re-run only the GATK subworkflow
[wid@derm USZ]$ snakemake --config subworkflows=GATK 2>/dev/null | grep -A1000 "Job counts:"
Job counts:
        count   jobs
        1       all
        2       base_recal
        2       fix_mate
        2       mark_duplicates
        2       print_reads
        9
```
Running the pipeline with a scheduler (qsub, bsub, etc.)
Snakemake allows you to specify a command so that your jobs (rules) are not run directly on your machine but are instead submitted to a scheduler (you can find more information here). Here is how to submit jobs to your SGE scheduler, using 32 jobs in parallel:
```
snakemake --cluster 'qsub' -j 32
```
You can then combine this --cluster option with all your other parameters and run instructions, as described above. This is the final form of the command that you might want to run:
```
nohup snakemake -k --cluster 'qsub' -j 32 --config config=derm_server subworkflows=* 2>&1 > my_log_file.log &
```
Debugging options
A few other *minor* options are available to help you in debugging the pipeline. These options are specified through the --config parameter.
- Verbosity: you can increase the level of debugging log printed out by the pipeline using:
```
snakemake --config verbosity=DEBUG
```
- Indexing: you can select to only run a subset of the samples by specifying which ones to run:
```
# only run samples number 1, 3, 5 and 6
snakemake --config indexes=1,3,5,6
# only run samples number 2, 3, 4 and 5
snakemake --config indexes=2-5
# only run sample number 3
snakemake --config indexes=3
```
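The `indexes` value thus accepts both comma-separated lists and dash-delimited ranges. As a hypothetical illustration of how such a specification could expand to individual sample numbers (the pipeline's actual parsing logic inside the Snakefile may differ), a minimal shell sketch:

```shell
# Expand an index specification such as "1,3,5,6" or "2-5" into one
# sample number per line: split on commas, then expand any a-b range.
expand_indexes() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
    seq "$lo" "${hi:-$lo}"
  done
}

expand_indexes "1,3,5,6"   # prints 1 3 5 6, one per line
expand_indexes "2-5"       # prints 2 3 4 5, one per line
```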
Other running modes
<img src="http://bit.ly/2n4EjjM" alt="Warning: " style="width: 15px" width="15px"/> Note that the following options are experimental and should only be used by users who understand what they are doing.
It is possible to run the pipeline with 2 other orchestration strategies than the default one (*run outside single*). These are documented in the main configuration file config.yaml:
```
# defines how the pipeline should be run:
#  - 'host': the pipeline runs on the host, without using any containers. This requires that the
#    software dependencies are installed on the host.
#  - 'run_outside_single': the pipeline runs on the host, but using a single container that
#    contains all the software dependencies. A container path has to be specified.
#  - 'run_inside_single': the pipeline runs inside a single container, using the software dependencies
#    that are installed inside that single container. A container path has to be specified.
run_mode: "run_outside_single"
```
It is therefore possible to change the run_mode field in the configuration file to change how the pipeline is executed. However, each run mode has certain prerequisites:
- `host`: Python3 and Snakemake must be installed on the host, as well as all the software dependencies of the pipeline. Software versions might be somewhat flexible, but reproducibility cannot be fully guaranteed.
- `run_outside_single`: Python3, Snakemake and Singularity must be installed on the host, and a single container containing all the software dependencies of the pipeline must be accessible by Snakemake.
- `run_inside_single`: only Singularity must be installed on the host, and the single container containing Python3, Snakemake and all the software dependencies of the pipeline must be accessible by Singularity.
Repository structure
Here is the structure of this pipeline's repository. The files and folders you DO NOT need to run the pipeline are marked by a star (\*):
```
├── container
│   ├── melarray.img [Singularity container file]
│   ├── [scripts] (*)
│   ├── .dockerignore (*)
│   └── Dockerfile (*)
├── data
│   ├── fastq
│   │   ├── sample1.fastq.gz
│   │   ├── sample2.fastq.gz
│   │   └── ...
│   └── extra_files
│       ├── bed file
│       ├── interval list file
│       └── tumor normal match file
├── pipeline
│   ├── scripts
│   ├── configs
│   │   ├── config.yaml
│   │   ├── config_custom.yaml (*)
│   │   └── ...
│   └── subworkflows
│       ├── HaploCaller.snake
│       └── ...
├── ref
│   ├── hg19
│   │   ├── [all reference files]
│   │   └── ...
│   ├── hg38
│   ├── b37
│   └── ...
├── Snakefile
├── other (*)
│   ├── [DAG and rule graphs] (*)
│   ├── helper_scripts (*)
│   └── TODO (*)
├── README.md (*)
└── .gitignore (*)
```
Authors
- Original pipeline from Phil Cheng, USZ [phil.cheng at usz.ch]
- Conversion to Snakemake by Balazs Laurenczy, ETHZ [balazs at ethz.ch]