Continue Decorating¶
The remaining rules need inputs and outputs so we can link up all the rules through to the final variant calling step. Parts 2 and 3 of the video tutorials cover the remaining content by adding more detail to the Snakefile and wrapping up with the final workflow rule.
Tip
We recommend watching the videos first and referring to the text for additional help!
Part 2: Continue decorating the Snakefile
Part 3: More decorating and running through the entire Snakemake workflow
Step 6: Decorating the remaining rules¶
Run the rules one at a time to figure out what the output files are. Check the file time stamps (ls -lht
) to track more recent output files. The inputs are specified in the shell command section.
Tip
You can do a dry run of the rule with snakemake -n <rule name>
to check how Snakemake is interpreting the rule input(s), output(s), and shell command(s), without actually running the command(s) or creating any output(s).
Sometimes a command may require multiple input files but only explicitly state one in the command (the software assumes that if a certain file exists the other required files must exist). To avoid potential errors in rule dependencies, we will define all inputs in the rule's input:
section.
In this workflow, the rule map_reads
is a good example of such a behavior. bwa
is used to generate the mapped reads (.sam
) file and requires the reference genome and the index files without explicitly referring to the index files in the command. We can add an additional input variable for index files in map_reads
to define all input files to Snakemake:
rule map_reads:
input:
genome = "ecoli-rel606.fa",
reads = "SRR2584857_1.fastq.gz",
idxfile = expand("ecoli-rel606.fa.{ext}", ext=['sa', 'amb', 'ann', 'pac', 'bwt'])
output: "SRR2584857.sam"
shell:
"bwa mem -t 4 {input.genome} {input.reads} > {output}"
Follow along with the video tutorials to fill in the input
and output
sections for the remaining rules. For help, refer to the complete Snakefile for this tutorial, but note that there are many ways to concisely enter the input and output files and this is just one version!
Step 7: Running lots of rules all at once¶
Once you've fixed the rules index_genome_bwa
and map_reads
, you should be able to run everything up to the rule index_genome_samtools
by running:
snakemake -p index_genome_samtools -j 2
This also serves as a good way to check that you have all the correct input/output information. You'll have files left over if you forgot to put them in output.
Step 8: Re-running rules¶
Snakemake also has the option to delete all the inputs/outputs for a particular rule (including preceding rules), shown here by running the index_genome_samtools
rule:
snakemake --delete-all-output index_genome_samtools
Step 9: Using filenames instead of rule names¶
You don't actually need to use the rule names (this will be important later on!). Instead of rule names, you can specify the required output file in Snakemake which will trigger execution of all the upstream linked rules necessary to produce the file. So, the command below will also work to run the rule map_reads
, and you don't have to remember the rule name (which can be arbitrary).
snakemake -p SRR2584857.sam -j 2
Step 10: Exploring rule decoration shortcuts¶
There are several ways to more efficiently and cleanly specify inputs/outputs in the Snakefile. Here are some shortcuts:
Take the uncompress_genome
rule we decorated above:
rule uncompress_genome:
input: "ecoli-rel606.fa.gz"
shell:
"gunzip ecoli-rel606.fa.gz"
It can also be written as follows with template variables {input}
and {output}
. Variables operate entirely within a single rule, not across rules. This means that we can use different definitions for {input}
and {output}
for each rule and they won't conflict.
rule uncompress_genome:
input: "ecoli-rel606.fa.gz"
output: "ecoli-rel606.fa"
shell:
"gunzip -c {input} > {output}"
Multiple inputs files can be separated by commas and written on their own lines. The input files can be assigned variable names that are accessed in the shell:
block with input.<input file variable>
. The \
tells the shell that this is one command written over two lines in the file. Also, similar to map_reads
, the samtools_mpileup
rule also includes an input file (.bai
) that is not explicitly stated but required to complete the command.
rule samtools_mpileup:
input:
index="ecoli-rel606.fa",
sorted="SRR2584857.sorted.bam",
sorted_bai="SRR2584857.sorted.bam.bai"
output:
"variants.raw.bcf"
shell:
"""
samtools mpileup -u -t DP -f {input.index} {input.sorted} | \
bcftools call -mv -Ob -o - > {output}
"""
Since the Snakefile is written in Python, we can also use Python functions! As an example, we will consolidate the list of reference genome index files into a single line of code. The expansion (expand
) tells Python that there is a common file name pattern (ecoli-rel606.fa.
) with different endings ({ext}
) that are specified using a list (ext=['sa', 'amb', 'ann', 'pac', 'bwt']
). This results in expand()
creating a new list, ecoli-rel606.fa.sa
, ecoli-rel606.fa.amb
, and so on.
rule map_reads:
input:
genome = "ecoli-rel606.fa",
reads = "SRR2584857_1.fastq.gz",
idxfile = expand("ecoli-rel606.fa.{ext}", ext=['sa', 'amb', 'ann', 'pac', 'bwt'])
output: "SRR2584857.sam"
shell:
"bwa mem -t 4 {input.genome} {input.reads} > {output}"