3. Snakemake

Learning Goals

After this lesson, you should be able to:

  • Explain what Snakefiles are

  • Explain what Snakemake rules are

  • Create a Snakefile

This chapter provides a brief introduction to using Snakemake for workflow management.

3.1. Snakefiles

With Snakemake, you use a text file called a Snakefile to configure the workflows and steps in a project. The file can be named Snakefile or snakefile.

In the Snakefile, you’ll add a rule for each step that specifies what command to run and other details. The syntax of the Snakefile is based on Python, so indentation and spacing are important.

As a first example, let’s create a Snakefile with a hello step that uses the echo shell command to print the message Hello, world! (similar to what we did with Pixi in Creating Tasks). Create a new empty project directory:

mkdir snake_project
cd snake_project

Then open Snakefile in your favorite text editor. Edit the file to look like this:

rule hello:
  shell:
    "echo 'Hello, world!'"

If you’ve used Python before, this syntax should look familiar (but don’t worry if you haven’t). A Snakemake rule always begins with the keyword rule, followed by a name for the rule and a colon :. The name of the rule can contain letters, numbers, and underscores, but cannot begin with a number. All of the details of the rule go on subsequent lines and must be indented consistently; 2 or 4 spaces is conventional.

Within a rule, the shell keyword specifies a shell command to run when Snakemake runs the step the rule describes. When written on its own line, the command must be indented further than the shell keyword, and it must be quoted with single or double quotes.

Once you’ve set up a Snakefile, you can run specific steps with the snakemake command. Snakemake is designed from the ground up to take advantage of all of your computer’s CPU cores by running tasks in parallel whenever possible, so you must also provide a --cores argument that specifies how many cores to use. You can set --cores all to use all cores. So to run the hello step:

snakemake --cores all hello
Assuming unrestricted shared filesystem usage.
host: sei
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
hello        1
total        1

Select jobs to execute...
Execute 1 jobs...

[Wed May 20 15:18:51 2026]
localrule hello:
    jobid: 0
    reason: Rules with neither input nor output files are always executed.
    resources: tmpdir=/tmp
Hello, world!
[Wed May 20 15:18:51 2026]
Finished jobid: 0 (Rule: hello)
1 of 1 steps (100%) done
Complete log(s): /home/nick/snake_project/.snakemake/log/2026-05-20T151851.774908.snakemake.log

The output from Snakemake includes a variety of diagnostic information, including which shell was used, how many cores were provided, which steps were selected to run, and the path to a log file with all of the output from the steps. It’s a bit hard to see amidst all of the other output, but notice that the step did in fact print Hello, world!.

Tip

If you want to run an R, Python, or Julia script, use the script keyword instead of the shell keyword. The script keyword requires a quoted path to the script. For instance:

rule run_script:
  script:
    "path/to/script.R"

Snakemake will automatically infer how to run the script from the extension (.R, .py, or .jl). The script keyword also supports other languages; see the documentation for details.

You can use the argument -n (or --dry-run) to do a dry run, which will make Snakemake print out the diagnostic information without actually running any steps. In this case, you don’t have to set the --cores argument:

snakemake -n hello
host: sei
Building DAG of jobs...
Job stats:
job      count
-----  -------
hello        1
total        1


[Wed May 20 15:27:44 2026]
rule hello:
    jobid: 0
    reason: Rules with neither input nor output files are always executed.
    resources: tmpdir=/tmp
Job stats:
job      count
-----  -------
hello        1
total        1

Reasons:
    (check individual jobs above for details)
    neither input nor output:
        hello
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

You can use the -l (or --list or --list-rules) argument to make Snakemake print a list of all of the rules in a project:

snakemake -l
hello

The list only shows descriptions for rules that include one. You can set a description for a rule by putting a triple-quoted string on the lines immediately after the line with the rule keyword (in Python, this is called a docstring). Edit the Snakefile to look like this:

rule hello:
  """Print a hello message.
  """
  shell:
    "echo 'Hello, world!'"

Now try listing the rules again:

snakemake -l
hello (Print a hello message.)

The description is printed in parentheses after the name of each rule.

Tip

Snakemake’s standout feature is that Snakefiles are just an extension of Python, so they can contain arbitrary Python code. In other words, you can define Python variables and functions in a Snakefile and use them to help you set up rules. See the Snakemake documentation for examples of this.

See also

The official Snakemake documentation includes a tutorial and details about many more features than we can cover here.

3.2. Input & Output Files

Snakemake is designed to make tracking the relationships between steps in a workflow as painless as possible. The best way to do this is by specifying the input files and output files for each step. Once you do this, Snakemake will automatically infer which steps depend on others.

Within a rule, you can use the input keyword to provide a list of input files to the step and the output keyword to provide a list of output files from the step. The paths in the list must be quoted and separated by commas.

Let’s create two new rules to demonstrate how to set inputs and outputs. The first rule, save_message, will save a message to a file called message.txt. The second rule, show_message, will print the message in the file. Edit the Snakefile to look like this:

rule save_message:
  output:
    "message.txt"
  shell:
    "echo 'IMPORTANT MESSAGE' > message.txt"

rule show_message:
  input:
    "message.txt"
  shell:
    "cat message.txt"

Now try running the show_message step:

snakemake --cores all show_message
Assuming unrestricted shared filesystem usage.
host: sei
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job             count
------------  -------
save_message        1
show_message        1
total               2

Select jobs to execute...
Execute 1 jobs...

[Wed May 20 16:04:56 2026]
localrule save_message:
    output: message.txt
    jobid: 1
    reason: Missing output files: message.txt
    resources: tmpdir=/tmp
[Wed May 20 16:04:56 2026]
Finished jobid: 1 (Rule: save_message)
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed May 20 16:04:56 2026]
localrule show_message:
    input: message.txt
    jobid: 0
    reason: Rules with a run or shell declaration but no output are always executed.
    resources: tmpdir=/tmp
IMPORTANT MESSAGE
[Wed May 20 16:04:56 2026]
Finished jobid: 0 (Rule: show_message)
2 of 2 steps (100%) done
Complete log(s): /home/nick/snake_project/.snakemake/log/2026-05-20T160455.967729.snakemake.log

Snakemake correctly infers that in order to run show_message, it must first run save_message. If message.txt exists, Snakemake will automatically skip running save_message, unless the rule or its inputs have changed.
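
To make the inference concrete, here is a minimal Python sketch of the idea. It is a toy model and not Snakemake's actual implementation: the RULES dictionary and the plan function are hypothetical names invented for this illustration, and the sketch only checks whether a file exists (it ignores changes to rules or inputs).

```python
# Toy model of dependency inference; not Snakemake's actual implementation.
# RULES and plan() are hypothetical names used only for this illustration.
RULES = {
    "save_message": {"input": [], "output": ["message.txt"]},
    "show_message": {"input": ["message.txt"], "output": []},
}


def plan(target, rules, existing=frozenset()):
    """Return the rule names to run, in order, to execute `target`."""
    # Map each output file to the name of the rule that produces it.
    producers = {
        path: name for name, rule in rules.items() for path in rule["output"]
    }
    order = []

    def visit(name):
        for path in rules[name]["input"]:
            # Only schedule the producing rule when the input file is missing.
            if path not in existing and path in producers:
                visit(producers[path])
        if name not in order:
            order.append(name)

    visit(target)
    return order


print(plan("show_message", RULES))
# ['save_message', 'show_message']
print(plan("show_message", RULES, existing={"message.txt"}))
# ['show_message']
```

The second call mirrors what happens when message.txt already exists: only show_message is scheduled.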

At this point, perhaps you can already see that Snakemake’s input and output keywords are more powerful than their counterparts in Pixi, but there’s more.

The save_message and show_message rules we wrote are slightly redundant, because the file path appears twice in each rule: first after the input or output keyword, and then again in the shell command. In a shell command string, you can use curly braces { } to substitute a value from the rule’s metadata. For instance, you can write {input} in the command and Snakemake will replace this with the name(s) of the input file(s). We can use this feature to make the Snakefile less redundant. Edit the Snakefile to look like this:

rule save_message:
  output:
    "message.txt"
  shell:
    "echo 'IMPORTANT MESSAGE' > {output}"

rule show_message:
  input:
    "message.txt"
  shell:
    "cat {input}"

Try running the show_message step again. You should see that the step still runs and that Snakemake still infers that it depends on save_message.

When input or output is a list of multiple files, Snakemake automatically concatenates the file names, with spaces in between, before replacing {input} or {output}. You can use square brackets to get a single element from the list by position (starting from 0). So {input[0]} is replaced with the first file name in the input list.
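
The substitution behaves much like Python's own string formatting. The sketch below imitates it in plain Python, under the assumption that a list which prints as a space-separated string is a fair stand-in for how Snakemake treats input and output. The Files class and the render function are hypothetical names for this illustration, not part of Snakemake.

```python
class Files(list):
    """A list of file paths that renders as a space-separated string."""

    def __str__(self):
        return " ".join(self)


def render(command, **metadata):
    """Fill the curly-brace placeholders in a command string."""
    return command.format(**metadata)


inputs = Files(["a.txt", "b.txt"])  # hypothetical file names

print(render("cat {input}", input=inputs))      # cat a.txt b.txt
print(render("head {input[0]}", input=inputs))  # head a.txt
```

The whole list is joined with spaces for {input}, while {input[0]} picks out the first element by position.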

Tip

In an input or output list, curly braces indicate a named wildcard rather than a substitution. You can use named wildcards to create steps that generalize to many files.

For example, suppose you sometimes use the magick shell command to convert images from JPEG to PNG format. You can use named wildcards to create a rule that works with any JPEG file:

rule convert:
  input:
    "{image}.jpeg"
  output:
    "{image}.png"
  shell:
    "magick {input} {output}"

Then you can run this step for a file, say flower.jpeg, with this command:

snakemake --cores all flower.png

Rather than specifying the desired rule’s name, here we specify the desired output. Snakemake automatically figures out which rule to use.

You can learn more about named wildcards in the official documentation.
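
As a rough picture of how a requested output like flower.png gets matched against the pattern {image}.png, the Python sketch below turns the pattern into a regular expression with one named group per wildcard. It is a simplified illustration, not Snakemake's actual code: match_wildcards is a hypothetical helper, and it assumes each wildcard name appears only once in the pattern.

```python
import re


def match_wildcards(pattern, target):
    """Return the wildcard values if `target` matches `pattern`, else None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)    # literal text, matched exactly
        else:
            regex += f"(?P<{piece}>.+)"  # wildcard: one or more characters
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


wildcards = match_wildcards("{image}.png", "flower.png")
print(wildcards)                                     # {'image': 'flower'}
print("{image}.jpeg".format(**wildcards))            # flower.jpeg
print(match_wildcards("{image}.png", "flower.gif"))  # None
```

Once the wildcard value is known from the output pattern, the same value is filled into the input pattern, which is how the convert rule knows to read flower.jpeg.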

3.3. Case Study: Davis Bike Counts, Part III

Let’s try using Snakemake to manage the workflow in the project from Case Study: Davis Bike Counts, Part I. We’ll create three rules (just as we created three tasks with Pixi):

  • clean_data

  • fit_model

  • visualize

In the project directory, create a Snakefile and edit it to look like this:

rule clean_data:
  """Clean the dataset.
  """
  input:
    "data/2020_davis_bikes.rds"
  output:
    "data/interim/2020_davis_bikes_clean.rds"
  shell:
    "pixi run R/01_clean.R"


rule fit_model:
  """Fit a model to the cleaned data.
  """
  input:
    "data/interim/2020_davis_bikes_clean.rds"
  output:
    "models/bikes_model.rds"
  shell:
    "pixi run R/02_model.R"


rule visualize:
  """Make a plot of the data and model predictions.
  """
  input:
    "data/interim/2020_davis_bikes_clean.rds",
    "models/bikes_model.rds"
  output:
    "figures/bikes_plot.png"
  shell:
    "pixi run R/03_plot.R"

Note

The Snakefile in this example uses the shell keyword rather than the script keyword to ensure that the scripts run in the environment managed by Pixi (hence pixi run).

Snakemake has built-in support for automatically activating Conda environments through its conda keyword, but support for Pixi environments is still being developed. Know that it is a priority for the developers, many of whom are enthusiastic Pixi users, and will likely be ready soon.

Then, to run the entire analysis:

snakemake --cores all visualize
Assuming unrestricted shared filesystem usage.
host: sei
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job           count
----------  -------
clean_data        1
fit_model         1
visualize         1
total             3

Select jobs to execute...
Execute 1 jobs...

[Wed May 20 18:51:03 2026]
localrule clean_data:
    input: data/2020_davis_bikes.rds
    output: data/interim/2020_davis_bikes_clean.rds
    jobid: 1
    reason: Missing output files: data/interim/2020_davis_bikes_clean.rds
    resources: tmpdir=/tmp
Read 'data/2020_davis_bikes.rds'
Wrote 'data/interim/2020_davis_bikes_clean.rds'
[Wed May 20 18:51:03 2026]
Finished jobid: 1 (Rule: clean_data)
1 of 3 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Wed May 20 18:51:03 2026]
localrule fit_model:
    input: data/interim/2020_davis_bikes_clean.rds
    output: models/bikes_model.rds
    jobid: 2
    reason: Missing output files: models/bikes_model.rds; Input files updated by another job: data/interim/2020_davis_bikes_clean.rds
    resources: tmpdir=/tmp
Read 'data/interim/2020_davis_bikes_clean.rds'
Wrote 'models/bikes_model.rds'
[Wed May 20 18:51:03 2026]
Finished jobid: 2 (Rule: fit_model)
2 of 3 steps (67%) done
Select jobs to execute...
Execute 1 jobs...

[Wed May 20 18:51:03 2026]
localrule visualize:
    input: data/interim/2020_davis_bikes_clean.rds, models/bikes_model.rds
    output: figures/bikes_plot.png
    jobid: 0
    reason: Missing output files: figures/bikes_plot.png; Input files updated by another job: data/interim/2020_davis_bikes_clean.rds, models/bikes_model.rds
    resources: tmpdir=/tmp
Read 'data/interim/2020_davis_bikes_clean.rds'
Read 'models/bikes_model.rds'
Saving 7 x 7 in image
Wrote 'figures/bikes_plot.png'
[Wed May 20 18:51:04 2026]
Finished jobid: 0 (Rule: visualize)
3 of 3 steps (100%) done
Complete log(s):
/home/nick/foundry/datalab/teaching/research_computing/example-project/.snakemake/log/2026-05-20T185103.057255.snakemake.log
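
The order of the steps in the log (clean_data, then fit_model, then visualize) follows directly from the input and output declarations. The Python sketch below reconstructs that order with the standard library's graphlib module (Python 3.9 or later). It is only an illustration of the idea of a dependency graph: the RULES dictionary and build_graph function are hypothetical names, and Snakemake's own DAG construction is more involved.

```python
from graphlib import TopologicalSorter

# Inputs and outputs copied from the three rules in the Snakefile above.
RULES = {
    "clean_data": {
        "input": ["data/2020_davis_bikes.rds"],
        "output": ["data/interim/2020_davis_bikes_clean.rds"],
    },
    "fit_model": {
        "input": ["data/interim/2020_davis_bikes_clean.rds"],
        "output": ["models/bikes_model.rds"],
    },
    "visualize": {
        "input": [
            "data/interim/2020_davis_bikes_clean.rds",
            "models/bikes_model.rds",
        ],
        "output": ["figures/bikes_plot.png"],
    },
}


def build_graph(rules):
    """Map each rule to the set of rules that produce its input files."""
    producers = {
        path: name for name, rule in rules.items() for path in rule["output"]
    }
    return {
        name: {producers[path] for path in rule["input"] if path in producers}
        for name, rule in rules.items()
    }


graph = build_graph(RULES)
print(list(TopologicalSorter(graph).static_order()))
# ['clean_data', 'fit_model', 'visualize']
```

The raw dataset data/2020_davis_bikes.rds has no producing rule, so it adds no edge to the graph; it simply has to exist before clean_data can run.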

If you run the command again, Snakemake will automatically skip all of the steps, since the outputs already exist.
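
A simplified version of that skip decision can be sketched in Python using only file existence and modification times. This is an assumption-laden approximation: needs_rerun is a hypothetical function, the file names are made up, and Snakemake also considers other triggers (such as changes to the rule itself) that are left out here.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """Decide whether a step should run, using existence and mtimes only."""
    if not outputs:
        return True  # steps without outputs always run
    if any(not os.path.exists(path) for path in outputs):
        return True  # an output file is missing
    if not inputs:
        return False  # outputs exist and there is nothing to compare against
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output  # an input changed after the output


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.rds")      # hypothetical input file
    clean = os.path.join(tmp, "clean.rds")  # hypothetical output file

    open(raw, "w").close()
    print(needs_rerun([raw], [clean]))  # True: the output is missing

    open(clean, "w").close()
    os.utime(raw, (1000, 1000))    # set explicit timestamps for the demo
    os.utime(clean, (2000, 2000))
    print(needs_rerun([raw], [clean]))  # False: the output is newer

    os.utime(raw, (3000, 3000))
    print(needs_rerun([raw], [clean]))  # True: the input changed afterwards
```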

You can also use snakemake -l to print the steps in the workflow:

snakemake -l
clean_data (Clean the dataset.)
fit_model (Fit a model to the cleaned data.)
visualize (Make a plot of the data and model predictions.)