6 Concrete development workflow and tools

In order of importance, roughly.

6.1 Your development process should be repeatable

This means a collaborator (or future you) should be able to:

Spin up a new development environment with all the dependencies (this is a continuum, with "How To" docs at one end and Docker build files at the other)
Understand what your code does
Recreate your files
Recreate your analyses
Distinguish between raw and processed data
Prove your code does what it claims to do

6.2 Testing and Validation

How do you know your code does what you say it does? A taxonomy of testing strategies, from simple to complex:

Defensive coding
1. Assume your inputs are bad, and include tests of input correctness in your code.
2. Use assert statements (sparingly) for things that should never break.
Unit tests: Can be overkill (not enough return for time invested). Many languages have unit test libraries as part of their core offering (e.g., Java, Python). Use selectively for:
1. Input validation
2. Calculation validation
3. Places where the code tends to change a lot
Integration testing: The sweet spot for small-to-medium projects. For example:
1. Start with a vetted sample input file
2. Generate intermediate data and compare to known intermediate data
3. Run analyses and compare results to known results
4. Write results to output and compare with a known output file (this is different from step 3!)
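The four integration-testing steps above, plus the defensive input checks, can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended layout: the function names (load_measurements, summarize, write_report, run_integration_check), the CSV with a single "value" column, and the report format are all hypothetical. In a real project the vetted sample and the known outputs would be files checked into the repository; here the sample is written on the fly so the example is self-contained.

```python
import csv
import tempfile
from pathlib import Path


def load_measurements(path):
    """Read a CSV with a 'value' column, rejecting bad input early."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"input file not found: {path}")
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"no data rows in {path}")
    values = []
    for line_number, row in enumerate(rows, start=2):
        try:
            values.append(float(row["value"]))
        except (KeyError, TypeError, ValueError) as err:
            raise ValueError(f"bad 'value' on line {line_number} of {path}") from err
    return values


def summarize(values):
    """The 'analysis' step (hypothetical): count and mean."""
    # An assert for something that should never happen if the loader is correct.
    assert values, "load_measurements should never return an empty list"
    return {"n": len(values), "mean": sum(values) / len(values)}


def write_report(summary, path):
    """The 'output' step (hypothetical): write results in a fixed text format."""
    Path(path).write_text(f"n={summary['n']}\nmean={summary['mean']:.3f}\n")


def run_integration_check(workdir):
    """Run the whole pipeline and compare every stage to known values."""
    workdir = Path(workdir)
    sample = workdir / "sample.csv"
    sample.write_text("value\n1.0\n2.0\n6.0\n")       # step 1: vetted sample input
    values = load_measurements(sample)
    assert values == [1.0, 2.0, 6.0]                  # step 2: intermediate data
    summary = summarize(values)
    assert summary == {"n": 3, "mean": 3.0}           # step 3: analysis results
    report = workdir / "report.txt"
    write_report(summary, report)
    assert report.read_text() == "n=3\nmean=3.000\n"  # step 4: output file
    return True


with tempfile.TemporaryDirectory() as tmp:
    print(run_integration_check(tmp))
```

Steps 3 and 4 are checked separately because a correct in-memory result can still be written out wrongly (rounding, column order, encoding).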

6.2.1 The metaphysics of integration/system testing

What are the theoretically possible workflow paths?
Which ones are implemented? If you pull on this thread, you will discover that your code implements many partial workflows. This is a huge source of confusion for future users and maintainers. When you discover a partial workflow, you can clean up and/or reorganize in one of three ways:
1. Finish implementing the complete workflow
2. Strip out the workflow entirely. This usually requires more work than the alternatives.
3. Explicitly stub out the un-implemented parts. The simplest way to do this is to leave comments: "X, Y, Z cases aren't handled yet. When you try them, we attempt to return an informative error."
Which ones are tested?
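Option 3 above (explicitly stubbing out the un-implemented parts) can go one step beyond a comment: make the code itself refuse the unfinished path with an informative error. The sketch below assumes a hypothetical export_results function where only CSV output is finished; the format names are made up for illustration.

```python
SUPPORTED_FORMATS = ("csv",)
PLANNED_FORMATS = ("json", "parquet")  # hypothetical: not handled yet


def export_results(rows, fmt="csv"):
    """Serialize a list of dicts. Only 'csv' is implemented so far."""
    if fmt in PLANNED_FORMATS:
        # Explicit stub: fail loudly instead of half-working.
        raise NotImplementedError(
            f"format {fmt!r} is planned but not implemented; "
            f"supported formats: {', '.join(SUPPORTED_FORMATS)}"
        )
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unknown format {fmt!r}")
    if not rows:
        return ""
    columns = list(rows[0])
    lines = [",".join(columns)]
    lines += [",".join(str(row[c]) for c in columns) for row in rows]
    return "\n".join(lines) + "\n"


print(export_results([{"id": 1, "score": 0.5}]), end="")
try:
    export_results([], fmt="json")
except NotImplementedError as err:
    print(err)
```

A stub like this also answers "which ones are tested?" cheaply: one test per supported path, and one test confirming that each planned path raises.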

6.3 Version Control

Version control preserves a record of your changes over time.

Version control allows you to fearlessly collaborate.

6.3.1 Version control in practice

One branch should always be deliverable, working code. Typically this is "main".
New work happens on development branches.
Merge new work using a "general and lieutenants" workflow:
- Developer ("lieutenant") pushes development branch to shared repository.
- Project lead ("general") merges development branch into main branch, or talks to developer if there's a conflict.
Everyone writes a descriptive message for each commit.
There are many possible workflows; the more your team knows, the more options you have.

6.4 Issue Tracking

6.4.1 Key features

Issue title
Issue description
Issue creator
Current assignee
Status
Dates (created, resolved, closed, re-opened)
Comments
Topic tags, version tags, etc.
Version control integration ("fixed by commit X"; this is a nice-to-have but not necessary feature)
Support for searching, filtering, and sorting

6.4.2 Many options

GitHub, Trello, Microsoft Planner, Airtable, Jira, Fossil, Trac…

6.4.3 Demo

GitHub, because you're probably already using it.

6.5 Dependency management and environment management by language

Broadly speaking, you want to be able to set up a self-contained environment that contains all of your weird dependencies, such that you can tear it down and rebuild it if something goes wrong.

xkcd comic

6.5.1 Python

Conda package manager and environments (see example here)
Pip and virtualenv
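To make "tear it down and rebuild it" concrete, here is a minimal sketch using the standard library's venv module, a close relative of virtualenv that ships with Python. The function name rebuild_environment and the directory layout are assumptions for illustration. It creates the environment without pip so that it runs quickly and offline; a real setup would normally enable pip and then install from a pinned requirements file.

```python
import tempfile
import venv
from pathlib import Path


def rebuild_environment(env_dir):
    """Delete whatever is in env_dir and create a fresh virtual environment."""
    # clear=True empties the target directory first: teardown and rebuild in one call.
    # with_pip=False is a shortcut for this sketch only.
    builder = venv.EnvBuilder(clear=True, with_pip=False)
    builder.create(env_dir)
    # pyvenv.cfg is the marker file that every venv contains.
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "env"
    print(rebuild_environment(env_dir).exists())
    # Running it again simulates "something went wrong, start over".
    print(rebuild_environment(env_dir).exists())
```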

6.5.2 R

renv
Read about additional options here

6.5.3 SQL

Integrates with almost every language; check your language docs for usage information.

6.5.4 Parallel concerns for other languages

6.5.5 When does it make sense to use containers?

Containers and VMs add their own maintenance and testing burden. It may still make sense to use them if:

Your code needs to run on a remote environment (e.g. UCSD Supercomputing). In this case, using a container for setup and teardown may ultimately save time.
You need to repeatedly recreate a computing environment.

6.6 Deployment

Where is the lever I pull to make this go? If you have an answer for dependency management, deployment (i.e., automatically recreating your code in its environment) is trivial.

Packaged environment and dependencies
1. .condarc
2. environment.yml file
Description of environment and dependencies (otherwise how will we debug?)
1. git version
2. python version
3. shell type (bash, zsh, sh, dash, powershell)
4. Have you tested this on Windows? I see by your face that you haven't.
You can solve this problem with Docker!
1. Now you have two problems.
2. Containerizing is more likely to pay off in circumstances where you have to deploy to the cloud anyway (e.g., you are building and tearing down instances at UCSD Supercomputing).
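The "description of environment and dependencies" items above can be captured by a small script and saved alongside logs or bug reports. This is a sketch under assumptions: the function names are hypothetical, and the shell is guessed from the SHELL or COMSPEC environment variables, which reflect the user's default shell and not necessarily the one currently running.

```python
import os
import platform
import shutil
import subprocess
import sys


def tool_version(command):
    """Return the first line of `<command> --version`, or None if unavailable."""
    if shutil.which(command) is None:
        return None
    try:
        result = subprocess.run(
            [command, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def describe_environment():
    """Collect the facts someone would need to debug a failed deployment."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "git": tool_version("git"),
        # SHELL is set on most Unix-like systems; COMSPEC on Windows.
        "shell": os.environ.get("SHELL") or os.environ.get("COMSPEC"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```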

6.7 How do we know when we're done?

General enough
Robust enough
Extensible enough
Tested enough