4. Making Python Projects & Environments Reproducible#
Learning Objectives
Create and organize project directories
Create file names that are human and machine readable
Identify when to use Python scripts versus Jupyter notebooks
Explain different strategies for storing parameters and other configuration details
Explain what version control is and why it’s beneficial
Explain what computing environments and virtual environments are
Describe the different environment managers available for Python
Use conda to create, manage, save, and restore virtual environments
This chapter focuses on strategies and advice to make your Python projects well-organized and reproducible—minimizing the obstacles for people interested in understanding, reproducing results, contributing, or collaborating. Doing so encourages community engagement, makes it easier for you to expand upon or reuse parts of the project in the future, and is a cornerstone of responsible scientific research.
Most of this chapter is broadly applicable to any computing project, including projects which don’t use Python. The second half of the chapter describes how to use conda to manage Python and Python packages, but conda can also be used to manage other scientific computing software.
4.1. Prerequisites#
This chapter assumes you already have basic familiarity with Python. DataLab’s Python Basics Reader and its accompanying workshop provide a suitable introduction.
This chapter also assumes you already have basic familiarity with the UNIX command-line and shell commands. DataLab’s Introduction to the UNIX Command Line Reader and its accompanying workshop provide a suitable introduction.
To follow along, you’ll need the following software versions (or newer) installed on your computer:
Python 3.10
One way to install this is to install the Anaconda Python distribution. Anaconda is explained in Section 4.5.
4.2. What’s a Project Directory?#
Whenever you start a new computing project, no matter how small, I recommend that you create a new project directory as a centralized place to store all of the project’s files. As you produce or download new files, make sure that they’re also stored in the project directory.
Some examples of files you should store in a project directory are:
Documentation, such as a file manifest and instructions for use
Code, such as notebooks and scripts
Inputs, such as data sets and configuration files
Outputs, such as reports, figures, and intermediate data sets
License information (if the project will be shared with anyone)
By using a project directory to centralize all of the files in a project, it’s easier to:
Find files, because you know where to look or to run search software
Move or copy the project to other computers
Share the project with collaborators, colleagues, or the public
Create backup copies of the project to protect your work
Access and run files with Python and other command-line tools
Use version control software to manage different versions of project files
The gold standard is for a project directory to be completely portable, meaning you can copy the directory to another computer, follow included instructions to set up necessary software (such as Python), and then run the code without any modifications to get the expected result.
The following subsections describe best practices for project directories, in approximate order from highest to lowest priority.
4.2.1. Use Relative Paths!#
Note
This section assumes you’re familiar with file paths. You can find a detailed introduction to file paths in this section of DataLab’s Python Basics Reader.
Perhaps the most common way projects fail to be portable is by including file paths that make assumptions about where the project directory is located. As an example, suppose Taylor is working on a project to analyze data about the population explosion of sea urchins along the California coast. Suppose the project directory is at:
/Users/taylor/ca_sea_urchins/
And the contents of the project directory are:
```
ca_sea_urchins/
├── analysis.ipynb
├── data/
│   ├── 2021-q1_ca_urchins.csv
│   ├── 2021-q2_ca_urchins.csv
│   ├── 2021-q3_ca_urchins.csv
│   ├── 2021-q4_ca_urchins.csv
│   └── 2022-q1_ca_urchins.csv
├── LICENSE
├── README.md
└── report.docx
```
If Taylor wants to load the 2022 data set in the `analysis.ipynb` notebook, they could use the path:
/Users/taylor/ca_sea_urchins/data/2022-q1_ca_urchins.csv
This path is an absolute path, meaning it begins from the root directory (the top level `/`) of Taylor’s file system. The path assumes that the project directory `ca_sea_urchins/` is in the directory `/Users/taylor/`, but that might not always be true. If Taylor shares the project with their colleague Sam, then Sam will probably have to edit the path to make the code work correctly.
You can avoid this problem by making sure every path in your project is a relative path, written relative to files within the project directory. For instance, the path from `analysis.ipynb` to the 2022 data set is:
data/2022-q1_ca_urchins.csv
This path avoids any assumptions about the location of the project directory (and more generally, about anything outside of the project directory).
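For example, in `analysis.ipynb` Taylor could load the data set with the relative path. Here's a minimal sketch, assuming the pandas package is installed and the notebook is run from the project directory:

```python
import pandas as pd

# The path is relative to the project directory, so it works on any computer
urchins = pd.read_csv("data/2022-q1_ca_urchins.csv")
urchins.head()
```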
In summary, you can make your projects more portable by using only relative paths.
4.2.2. Naming Files#
When naming files in a project directory, choose names that are both human readable and machine readable.
Choosing human readable file names ensures that you and others will be able to determine the purpose of files without needing to open or otherwise inspect them. The key is to put descriptive, unambiguous information in the name. Consider the audience for the project, and avoid any acronyms, abbreviations, or jargon that won’t be familiar to the entire audience.
Choosing machine readable file names ensures that paths to your files will work in most software (such as Python) and that the file names can be parsed to extract metadata. Some basic rules for machine readable names are:
Don’t use whitespace characters. Whitespace is not always supported by older software and requires special treatment (such as escape characters) in modern software. Instead:
- Use dashes `-` to separate parts of a single phrase. For example, `sea-urchin-report` or `2022-05-02`.
- Use underscores `_` to separate distinct phrases. For example, `2022-05-02_sea-urchin-report` contains two distinct phrases: a date and a description.
- If you want, you can substitute a different pair of characters for dashes and underscores. The key is to be consistent and to make sure the characters are distinct and valid on most operating systems.
Write numbers in time sequences, especially dates, from slowest to fastest. For example, a good format for dates is `yyyy-mm-dd`. More generally, see ISO 8601, the international standard for formatting dates and times. This ensures files will be sorted correctly by tools that use dictionary ordering.
Pad numbers in sequences with leading zeroes. For instance, in a collection of 1000 images of trees, the name `tree0003.jpg` is preferable to `tree3.jpg`. Try to plan ahead, padding enough that you won’t need to increase the number of digits later. As with the previous point, this ensures files will be sorted correctly by tools that use dictionary ordering.
For Python scripts (`.py` files), make sure the name begins with a letter. This is especially important if you want to import code from one script into another, because Python’s `import` keyword requires files/packages to start with a letter.
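Machine readable names also make it possible to extract metadata directly from file names. Here's a minimal sketch that parses the year and quarter out of the urchin data files from the earlier example, assuming the code is run from the project directory:

```python
from pathlib import Path

# File names like "2021-q1_ca_urchins.csv" split cleanly on "_" and "-"
for path in sorted(Path("data").glob("*_ca_urchins.csv")):
    date, description = path.stem.split("_", 1)
    year, quarter = date.split("-")
    print(year, quarter, description)
```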
Note
Some guides recommend using camel case, where parts of a phrase are delineated by capital letters, `likeThis`. I recommend against camel case in file names, because how acronyms should be written is ambiguous. For example, if your project is “UC Davis Sheepmowers”, should you write `UCDavisSheepmowers` or `UcDavisSheepmowers`?
For more suggestions and examples about how to name files, see:
This section of DataLab’s README, Write Me! Reader
This chapter of The Turing Way
These slides from Jenny Bryan
4.2.3. Organizing Files#
Use subdirectories to organize files within a project directory. Here’s the basic directory structure I recommend and use for data science projects:
```
env.yml       Conda environment
LICENSE       License for the project
README.md     Markdown file that describes the project
data/         Data sets
docs/         Supporting documents
notebooks/    Jupyter and RMarkdown notebook files
src/          Source code (scripts)
```
This is loosely based on Cookiecutter Data Science, a standardized directory structure for data science projects.
Every project is different and has different needs; you should tailor the directory structure to meet them. For example, some directories I occasionally include in projects are:
`R/` to store R code (because the R community prefers this over `src/`)
`output/` to store output files
`config/` or `models/` to store configuration files or model descriptions
`log/` to store logs from runs of the code
What’s most important is that you choose descriptive names for subdirectories and use them in a consistent way. It’s also a good idea to include a file manifest in your project’s documentation (typically in the `README` file) that describes the purpose of each file and subdirectory, similar to the listing at the beginning of this subsection.
For more suggestions and examples about how to organize files, see:
This section of DataLab’s README, Write Me! Reader
This chapter of The Turing Way
This chapter of The Hitchhiker’s Guide to Python
4.2.4. Documentation#
A project directory should always include a README, a plain text file that serves as an introduction to and overview of your project. At a minimum, the README should include:
The title of the project
A description of the project
The name of the project maintainer
A way to contact the project maintainer
The names of all other project contributors
It’s also common for READMEs to include:
A file manifest
Hardware and software requirements
Instructions for running the project
Instructions for contributing to the project
The Markdown formatting language is a popular way to format README files.
In addition to a README, include documentation in your code. I recommend that you:
Put a comment at the beginning of each script to describe the purpose of the script. Where appropriate, you can also include inputs, outputs, and usage information. In notebooks, make a note at the beginning rather than a comment.
Use comments to frame and explain code, particularly any code that feels complicated. Writing the comments before you write the code can also be a good way to guide development.
Write Python docstrings for functions defined in your code. A docstring is a string at the beginning of a function (the first line after `def`) that describes how the function works. At a minimum, you should provide a brief description of what the function does. It’s also common to describe the parameters (inputs), describe the return value (output), and provide examples of use. Docstrings are accessible through Python’s `help` function. Here’s an example of a function with a docstring:

```python
import numpy as np

def central_indexes(n, k = 5):
    """Compute the k central indexes for an array with n elements.

    Parameters
    ----------
    n : int
        The length of the array.
    k : int
        The number of indexes to compute.

    Returns
    -------
    A list of indexes.
    """
    if n <= k:
        return list(range(n))

    span = (k - 1) / 2
    below = int(np.ceil(span))
    above = int(span)
    midpoint = int(np.ceil((n - 1) / 2))
    return list(range(midpoint - below, midpoint + above + 1))
```
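Once the function is defined, its docstring is what Python’s `help` function displays:

```python
help(central_indexes)   # prints the docstring above
```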
For more suggestions and examples of how to write documentation, see:
DataLab’s README, Write Me! Reader
This chapter of The Hitchhiker’s Guide to Python
Write the Docs, a global community dedicated to writing good documentation
4.2.5. Scripts vs. Notebooks#
There are two common ways to store Python code:
Python scripts (`.py` files)
Jupyter notebooks (`.ipynb` files)
Notebooks provide a way to develop code interactively, meaning you can run code in small chunks, quickly switching between writing code and running code. Notebooks also support formatted text, images, and a selection of other programming languages. As a result, notebooks are well-suited for:
Data exploration and open-ended analyses
Testing potential solutions to problems
Learning how to use unfamiliar packages or code
Creating reports or presentations
Teaching
The primary way to view and edit notebooks is through a web browser, which makes them inconvenient if you want to work at the command-line (on a server, for example) or with command-line tools (such as git). The Jupytext project provides a partial remedy by making it easy to convert between notebooks and plain text formats.
Scripts tend to be more useful than notebooks for code that doesn’t need to run interactively. Scripts are well-suited for:
Creating programs designed to run with minimal intervention (for example, on a server)
Creating programs to use at the command-line
Creating modules that contain reusable code. A module is a `.py` file designed to be imported into other Python scripts with the `import` keyword. Modules usually contain function and class definitions, but can also contain other code. Creating modules is a good way to organize your code, especially if you have multiple scripts that all use a few common functions or classes (a minimal sketch follows this list).
Creating Python packages
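For instance, here's a minimal sketch of a module and a script that imports it. The file names and function are hypothetical, and both files are assumed to live in the same directory:

```python
# utils.py (hypothetical module)
"""Helper functions shared by several scripts."""

def drop_missing_counts(records):
    """Remove records that have no urchin count."""
    return [r for r in records if r.get("count") is not None]
```

```python
# run_analysis.py (hypothetical script in the same directory)
from utils import drop_missing_counts

records = [{"site": "A", "count": 12}, {"site": "B", "count": None}]
print(drop_missing_counts(records))   # keeps only the record for site A
```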
Data science projects often include a mix of notebooks and scripts. Whenever you start working on a new task, consider which format is more appropriate for what you’re trying to do.
Tip
If you use the directory structure described in Section 4.2.3 and create Python modules, it can be difficult to figure out how to import the modules for code that exists outside of the `src/` directory, such as notebooks (in `notebooks/`). Adding this code snippet to the beginning of a notebook or script makes it possible to import modules from the `src/` directory:
```python
import os
import sys

# Assemble a path to `../src/`
module_path = os.path.abspath(os.path.join("..", "src"))

# Append the path to the module search list
if module_path not in sys.path:
    sys.path.append(module_path)
```
This assumes `../src/` is the relative path to the `src/` directory from your notebook or script. If that’s not correct, you’ll have to edit the path in the `module_path` variable.
You can learn more about how to create Python modules and packages in this chapter of The Hitchhiker’s Guide to Python.
4.2.6. Parameters & Configuration Files#
Most code, especially in scientific computing, includes a collection of adjustable parameters. For instance, code to run a simulation might allow for different initial conditions and code to fit a statistical model might allow for different formulations of the model.
Here are a few ways to create a parameterized script:
Make the parameters global variables within the script. Then users can set the parameters by editing the variables. In this case, it’s good practice to put the parameter definitions at the beginning of the script, write the parameter names in `ALL CAPS`, and use comments to document each parameter.
Create a `main` function in the script that runs the rest of the code. Call the `main` function in a guard condition that checks whether the script was deliberately executed (rather than `import`ed):

```python
if __name__ == "__main__":
    main()
```

The parameters of the `main` function are the parameters of the script. Then users can set the parameters by editing the call to `main` (or more generally, the code in the guard condition).
Set the parameters with command-line arguments to the script. The `argv` object in Python’s built-in `sys` module is the list of command-line arguments. Python’s built-in `argparse` module provides a richer set of functions for parsing command-line arguments (a brief sketch follows this list).
Set the parameters with a configuration file read by the script. Two options are:
- Python’s built-in `json` module provides a way to read settings from a JavaScript Object Notation (JSON) file. JSON is a plain text format for storing lists and dictionaries of numbers, strings, and other data.
- The `tomli` package provides a way to read settings from a Tom’s Obvious Minimal Language (TOML) file. The TOML format has more features and is easier to read and write than the JSON format. Moreover, a version of `tomli` is built into Python 3.11 and later as the `tomllib` module.
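As a brief sketch of the command-line approach, a script might read its parameters with `argparse` like this (the parameter names `--n-sims` and `--data-path` are hypothetical):

```python
import argparse

# Define the script's parameters as command-line arguments
parser = argparse.ArgumentParser(description="Run the urchin analysis.")
parser.add_argument("--n-sims", type=int, default=100,
                    help="number of simulations to run")
parser.add_argument("--data-path", default="data/2022-q1_ca_urchins.csv",
                    help="path to the input data set")
args = parser.parse_args()

print(args.n_sims, args.data_path)
```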
The options above are listed from lowest to highest complexity, but also from lowest to highest reproducibility. Using separate configuration files (the last option) is the best way to ensure that your project is reproducible, because you can create a new configuration file to record the parameters for each run of the code.
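For instance, a script could read its parameters from a JSON file. Here's a minimal sketch, assuming a hypothetical `config.json` that contains `{"n_sims": 100, "data_path": "data/2022-q1_ca_urchins.csv"}`:

```python
import json

# Load the parameters recorded for this run of the code
with open("config.json") as f:
    params = json.load(f)

print(params["n_sims"], params["data_path"])
```

Saving a copy of the configuration file alongside the outputs of each run makes it easy to reproduce that run later.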
4.3. Version Control#
A version control system (VCS) is software that tracks the changes you make to files, so that you can go back to a previous version at any time. You might already be familiar with the version control systems built-in to Microsoft Word and Google Docs. Many different version control systems exist for code, but the most popular one is git.
I recommend that you use version control for every project. By using version control, particularly git, you can:
Access older versions of files at any time
Back up your project to the cloud
Share your project with collaborators or the public
Merge changes when you and a collaborator both edit the same file
Learning to use a version control system can be difficult, but it’s well-worth the time and effort. DataLab’s Introduction to Version Control Reader and workshop are a good way to learn more.
Note
Most version control systems are designed for versioning text or code rather than data. However, versioning data is important for many projects, and the growth of data science as a discipline has led to multiple efforts to develop version control systems for data.
One VCS for data that looks promising is `dvc`, which is designed to coexist with git and provides a similar interface.
4.4. What’s an Environment?#
A computing environment is a collection of hardware and software used to run code. Whether code runs correctly or at all depends on the computing environment, so an important part of making a project reproducible is documenting the environment where the code was originally developed and tested.
In a high-level programming language like Python, details of the hardware are mostly hidden away. That is, hardware has little to no impact on how you write Python code, with the exception of a few specific applications such as GPU computing. Hardware may affect how quickly your Python code runs, but usually not the final result. As a consequence, for Python projects (and many scientific computing projects in general) the hardware environment is generally less of a concern than the software environment.
One of the major advantages of Python over other programming languages is the massive number of packages developed and published by members of the community. As Python and Python packages are updated, code designed for older versions may need to be edited to continue to work correctly with the newer versions. So for most Python projects, keeping track of the software environment means keeping track of the specific versions of Python and Python packages for which the code was designed.
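One way to check these versions from within Python is with the standard library. Here's a minimal sketch that prints the Python version and the versions of a couple of packages, assuming `numpy` and `pandas` are installed:

```python
import sys
from importlib.metadata import version

# Record the interpreter and package versions used for a run
print("Python", sys.version)
for pkg in ["numpy", "pandas"]:
    print(pkg, version(pkg))
```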
A package manager is a program that can install, update, and remove packages. Python’s built-in package manager is `pip`. Even if you use an alternative package manager, it’s good to know the basics of pip in order to troubleshoot problems with package installations.
4.4.1. Virtual Environments#
A virtual environment is a computing environment with specific software versions that can coexist alongside other virtual environments with different software versions. Virtual environments make it possible to work on several different projects at once, even if they require different computing environments.
There are several options for managing Python virtual environments:
`venv` is Python 3’s built-in module for managing virtual environments, based on virtualenv.
`virtualenv` is the most popular tool for managing virtual environments for Python 2, and also supports Python 3. It provides more features than venv.
`pipenv` is an integrated environment and package manager.
`poetry` is a relatively new integrated environment and package manager. It also provides tools to create and publish packages.
`conda` is an integrated environment and package manager originally designed with data science projects in mind.
Unlike the other tools listed here, conda can manage environments for a variety of programming languages, not just Python. This is its standout feature, and the reason why the rest of this chapter will focus on conda rather than any of the others.
4.5. Managing Environments with conda#
There are two major ways to get conda:
Install Anaconda. Anaconda is a Python distribution designed to be ready-to-use for data science. It includes over 1,500 packages and as a consequence, it requires at least 3 GB of hard disk space to install.
Install Miniconda. Miniconda is a Python distribution that only includes Python, conda, and a few packages conda depends on. Miniconda is much smaller than Anaconda, but leaves it up to you to install the packages you want to use.
I recommend Miniconda because it only uses hard disk space for the packages you actually want. Moreover, because you have to install the packages yourself, you’ll quickly become familiar with conda’s basic commands.
Both Anaconda and Miniconda provide detailed installation instructions on their respective websites.
4.5.1. Creating Environments#
Once you have conda installed, get a list of your conda environments with the shell command:
conda env list
The default environment for conda is called `base`. It’s generally recommended to avoid installing packages in `base` because it’s the environment where conda itself is installed. Any problems you create in the `base` environment can potentially break your entire conda installation.
You can create a new environment with the command `conda create`. For example, the command to create a new environment called `datasci` is:
conda create --name datasci
You can get help with most conda commands by appending `--help`. For instance, you can see all of the options for `conda create` with the command:
conda create --help
You can switch to, or activate, a specific conda environment with the command `conda activate`. Go ahead and activate the `datasci` environment:
conda activate datasci
Once an environment is activated, you can install software with the command `conda install`. For example, this command installs Python:
conda install python
You can request a specific version of software by appending `=`, `>=`, `>`, `<=`, or `<` and a version number after the name, and then enclosing the name and version in single quotes. For example, this command specifically requests the latest version of Python 2:
conda install 'python<3'
Say no at the prompt so that your environment still has Python 3.
By default, conda installs packages from Anaconda’s package repository. The packages in Anaconda’s package repository tend to work well together, but can be slightly out of date. The community-maintained `conda-forge` package repository often has newer versions and also has many packages that are not available through Anaconda. You can specify that you want to install a package from conda-forge by including `-c conda-forge` in the `conda install` command. For instance, to install `numpy` from conda-forge:
conda install -c conda-forge numpy
Tip
A common complaint about conda is that it takes a long time for conda to install new packages. Mamba is a drop-in replacement for conda that’s designed to be substantially faster. You can install Mamba in a conda environment with the command:
conda install -c conda-forge mamba
Then you can replace `conda` with `mamba` in any conda command.
You can see a list of installed software in a conda environment with the command:
conda list
You can also use the commands `conda update` and `conda uninstall` to update and uninstall packages within an environment.
When you’re finished using a conda environment, you can deactivate it with the command:
conda deactivate
Finally, you can delete an environment with the command `conda env remove`. Use the `--name` option to provide the name of the environment.
4.5.2. Saving & Restoring Environments#
In order to make a project reproducible, you should include information about the conda environment used to develop and test the code. You can export the details of a conda environment to a file with the command `conda env export`. For example, to export the details of the `datasci` environment:
conda env export --name datasci > env.yml
Then anyone with conda and a copy of your project directory can install the environment on their machine with the command:
conda env create --file env.yml
Note that you can include `--name NAME` in this command to explicitly set the name of the new environment to `NAME`.
4.5.3. Additional References#
The conda cheatsheet provides a list of the most common conda commands.
For more suggestions and examples of how to use conda, see:
This chapter of DataLab’s Introduction to Remote Computing Reader