Skip to content

The Snakefile

Snakemake uses a file called "Snakefile" to configure the steps, or rules, of your pipeline. The basic Snakefile consists of several rules defining the inputs, outputs, and rule commands.

Tip

The written contents of the tutorial are demonstrated in three short videos. We recommend watching the videos first and referring to the tutorial text for additional help.

Part 1: Introducing the Snakefile and Snakemake

Step 1: Editing the Snakefile

Make sure you're in the (snaketest) conda environment. The remainder of this tutorial will be in the (snaketest) environment.

Let's take a look at the Snakefile using the nano text editor:

nano -ET4 Snakefile

This is the skeleton of our Snakefile for calling variants. The command above tells nano to open the Snakefile and to create 4 spaces when you hit the Tab key. The Snakefile is written in the Python programming language, which uses specific indentation formatting to interpret the code. Incorrect indentation will result in syntax errors.

There are several rules defined with commands to run, but we'll need to add a few more details by editing with nano.

Add a comment to the Snakefile:

  1. open Snakefile in nano

  2. on a new line, add a comment with #, e.g., # This is my first edit!

  3. save the change by hitting control key and o key. nano will ask if you want to save the file as "Snakefile". We want to keep the same file name, so hit return key.

  4. exit nano by hitting control key and x key.

  5. view the Snakefile in Terminal with: less Snakefile.

  6. if you can see your comment, it worked! Exit the less view by hitting q key.

Tip

Please refer to the bash command cheat sheet for commonly used nano commands and other shortcuts for navigating your Terminal!

Ok, let's move on and take a look at the structure of the Snakefile rules.

Step 2: Looking at Snakefile rules

Each step in a pipeline is defined by a rule in the Snakefile. The components of each rule are indented 4 spaces. The most basic structure of a rule is:

rule rule_name:

    shell:

        # for single line commands
        # command must be enclosed in single quotes - "
        "command"

There are several rules in the Snakefile. Let's do a search for all the rules in the file:

grep rule Snakefile

The output is a list of the lines in the Snakefile with the word 'rule' in them. There are 11 rules in this pipeline:

rule download_data:

rule download_genome:

rule uncompress_genome:

rule index_genome_bwa:

rule map_reads:

rule index_genome_samtools:

rule samtools_import:

rule samtools_sort:

rule samtools_index_sorted:

rule samtools_mpileup:

rule make_vcf:

Step 3: Running Snakemake & the Snakefile

Let's try running a Snakemake rule:

snakemake -p map_reads -j 2

The -p means show the shell command that snakemake is running for this rule. The -j (or --cores) is used to specify the number of cores available to run the job; in newer versions of snakemake, this flag must be specified. If you are running a job locally on your computer or for this tutorial, setting the flag to 1 or 2 is sufficient.

Info

In earlier versions of snakemake-minimal, the -j flag was not required. The vidlets in this lesson were created prior to the update.

Oops, this will fail! Why?

snakemake map reads rule error

As the error message in red states, the rule failed because we don't have any of the input files required to run this rule yet! For the mapping rule to work, we need the raw read (.fastq.gz) file and reference genome (.fa.gz) that should be indexed.

Tip

The placement of snakemake flags must follow the snakemake command, but otherwise the location does not matter. Thus, snakemake -p map_reads will run the same as snakemake map_reads -p.

Let's try again, starting with the first rule in the Snakefile:

Snakemake runs the shell command listed under the download_data rule. In this case, the shell command downloads the raw read file from a public repository on osf.io.

snakemake -p download_data -j 2

It worked!

snakemake downloaded data

Check the working directory. There should now be a .fastq.gz file:

ls -lht

This command shows you the file permissions, number of links, owner name, owner group, file size in bytes, time of last modification, and file/directory name.

-rw-r--r-- 1 jovyan root   8.4M Aug 26 00:35 SRR2584857_1.fastq.gz
-rw-r--r-- 1 jovyan jovyan 1.2K Jul 23 00:01 Snakefile

Next run some more rules sequentially – one at a time:

snakemake -p download_data -j 2

snakemake -p download_genome -j 2
snakemake -p uncompress_genome -j 2
snakemake -p index_genome_bwa -j 2
snakemake -p map_reads -j 2

Check the working directory again. The directory is populated by many output files including reference genome (.fa), genome index (.fa.sa, .fa.amb etc) and mapped reads (.sam) files. The map_reads rule ran without any errors, once we ran all of the necessary rules before it!.

Tip

Please refer to the Snakemake command cheat sheet for commonly used Snakemake commands!

Key Points

In this section we explored the basic template of a Snakefile which contains rules with all the necessary commands for variant calling.

  • each rule can be run individually using snakemake -p <rule_name> -j 2.
  • each rule encompasses shell commands, with a bit of “decoration”. You could run them yourself directly in the terminal if you wanted!
  • while the written order of the rules in Snakefile doesn’t matter, the order in which the rules are run on the terminal does matter! More on this in the next section!
  • by default, if you don't specify a rule, Snakemake executes the first rule in the Snakefile (this is actually the only case where the order of rules in the Snakefile matters!)
  • output Snakemake message is in red if it fails
  • the code is case-sensitive
  • tabs and spacing matters

In the next section, we'll cover how to connect the rules so Snakemake can recognize rules that depend on each other and run them in the correct order.


Last update: May 14, 2021