The Snakefile¶
Snakemake uses a file called "Snakefile" to configure the steps, or rules, of your pipeline. The basic Snakefile consists of several rules defining the inputs, outputs, and rule commands.
Tip
The written contents of the tutorial are demonstrated in three short videos. We recommend watching the videos first and referring to the tutorial text for additional help.
Part 1: Introducing the Snakefile and Snakemake
Step 1: Editing the Snakefile¶
Make sure you're in the (snaketest)
conda environment. The remainder of this tutorial will be in the (snaketest)
environment.
Let's take a look at the Snakefile using the nano text editor:
nano -ET4 Snakefile
This is the skeleton of our Snakefile for calling variants. The command above tells nano to open the Snakefile and to create 4 spaces when you hit the Tab key. The Snakefile is written in the Python programming language, which uses specific indentation formatting to interpret the code. Incorrect indentation will result in syntax errors.
There are several rules defined with commands to run, but we'll need to add a few more details by editing with nano.
Add a comment to the Snakefile:
-
open Snakefile in
nano
-
on a new line, add a comment with
#
, e.g.,# This is my first edit!
-
save the change by hitting
control
key ando
key.nano
will ask if you want to save the file as "Snakefile". We want to keep the same file name, so hitreturn
key. -
exit
nano
by hittingcontrol
key andx
key. -
view the Snakefile in Terminal with:
less Snakefile
. -
if you can see your comment, it worked! Exit the
less
view by hittingq
key.
Tip
Please refer to the bash command cheat sheet for commonly used nano
commands and other shortcuts for navigating your Terminal!
Ok, let's move on and take a look at the structure of the Snakefile rules.
Step 2: Looking at Snakefile rules¶
Each step in a pipeline is defined by a rule in the Snakefile. The components of each rule are indented 4 spaces. The most basic structure of a rule is:
rule rule_name:
shell:
# for single line commands
# command must be enclosed in single quotes - "
"command"
There are several rules in the Snakefile. Let's do a search for all the rules in the file:
grep rule Snakefile
The output is a list of the lines in the Snakefile with the word 'rule' in them. There are 11 rules in this pipeline:
rule download_data:
rule download_genome:
rule uncompress_genome:
rule index_genome_bwa:
rule map_reads:
rule index_genome_samtools:
rule samtools_import:
rule samtools_sort:
rule samtools_index_sorted:
rule samtools_mpileup:
rule make_vcf:
Step 3: Running Snakemake & the Snakefile¶
Let's try running a Snakemake rule:
snakemake -p map_reads -j 2
The -p
means show the shell command that snakemake is running for this rule.
The -j
(or --cores
) is used to specify the number of cores available to run the job; in newer versions of snakemake, this flag must be specified. If you are running a job locally on your computer or for this tutorial, setting the flag to 1 or 2 is sufficient.
Info
In earlier versions of snakemake-minimal, the -j
flag was not required. The vidlets in this lesson were created prior to the update.
Oops, this will fail! Why?
As the error message in red states, the rule failed because we don't have any of the input files required to run this rule yet! For the mapping rule to work, we need the raw read (.fastq.gz
) file and reference genome (.fa.gz
) that should be indexed.
Tip
The placement of snakemake flags must follow the snakemake
command, but otherwise the location does not matter. Thus, snakemake -p map_reads
will run the same as snakemake map_reads -p
.
Let's try again, starting with the first rule in the Snakefile:
Snakemake runs the shell command listed under the download_data
rule. In this case, the shell command downloads the raw read file from a public repository on osf.io.
snakemake -p download_data -j 2
It worked!
Check the working directory. There should now be a .fastq.gz
file:
ls -lht
This command shows you the file permissions, number of links, owner name, owner group, file size in bytes, time of last modification, and file/directory name.
-rw-r--r-- 1 jovyan root 8.4M Aug 26 00:35 SRR2584857_1.fastq.gz
-rw-r--r-- 1 jovyan jovyan 1.2K Jul 23 00:01 Snakefile
Next run some more rules sequentially – one at a time:
snakemake -p download_data -j 2
snakemake -p download_genome -j 2
snakemake -p uncompress_genome -j 2
snakemake -p index_genome_bwa -j 2
snakemake -p map_reads -j 2
Check the working directory again. The directory is populated by many output files including reference genome (.fa
), genome index (.fa.sa
, .fa.amb
etc) and mapped reads (.sam
) files. The map_reads
rule ran without any errors, once we ran all of the necessary rules before it!.
Tip
Please refer to the Snakemake command cheat sheet for commonly used Snakemake commands!
Key Points
In this section we explored the basic template of a Snakefile which contains rules with all the necessary commands for variant calling.
- each rule can be run individually using
snakemake -p <rule_name> -j 2
. - each rule encompasses shell commands, with a bit of “decoration”. You could run them yourself directly in the terminal if you wanted!
- while the written order of the rules in Snakefile doesn’t matter, the order in which the rules are run on the terminal does matter! More on this in the next section!
- by default, if you don't specify a rule, Snakemake executes the first rule in the Snakefile (this is actually the only case where the order of rules in the Snakefile matters!)
- output Snakemake message is in red if it fails
- the code is case-sensitive
- tabs and spacing matters
In the next section, we'll cover how to connect the rules so Snakemake can recognize rules that depend on each other and run them in the correct order.