6 Concrete development workflow and tools
In order of importance, roughly.
6.1 Your development process should be repeatable
This means a collaborator (or future you) should be able to:
- Spin up a new development environment with all the dependencies (this is a continuum, with "How To" docs at one end and Docker build files at the other)
- Understand what your code does
- Recreate your files
- Recreate your analyses
- Distinguish between raw and processed data
- Prove your code does what it claims to do
6.2 Testing and Validation
How do you know your code does what you say it does? A taxonomy of testing strategies, from simple to complex:
- Defensive coding
- Assume your inputs are bad, and include tests of input correctness in your code.
- Use `assert` statements (sparingly) for things that should never break.
- Unit tests: Can be overkill (not enough return for time invested). Many languages have unit test libraries as part of their core offering (e.g., Java, Python). Use selectively for:
- Input validation
- Calculation validation
- Places where the code tends to change a lot
- Integration testing: The sweet spot for small-to-medium projects. For example:
- Start with a vetted sample input file
- Generate intermediate data and compare to known intermediate data
- Run analyses and compare results to known results
- Write results to output and compare with a known output file (this is not the same as the previous step: it also checks the code that writes the file)
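The strategies above can be combined in one small sketch. Everything named here is hypothetical and chosen only for illustration: the `summarize` and `run_pipeline` functions, the `value` column, and the sample data stand in for your own analysis and your own vetted files.

```python
import csv
import io
import math


def summarize(rows):
    """Return the mean of the 'value' column (a stand-in for a real analysis)."""
    # Defensive coding: assume the input is bad and check it explicitly.
    if not rows:
        raise ValueError("no input rows")
    values = []
    for i, row in enumerate(rows):
        if "value" not in row:
            raise ValueError(f"row {i} is missing the 'value' column")
        try:
            number = float(row["value"])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"row {i} has a non-numeric value: {row['value']!r}") from exc
        if not math.isfinite(number):
            raise ValueError(f"row {i} has a non-finite value: {row['value']!r}")
        values.append(number)
    # Assertion: every row was either converted or rejected above.
    assert len(values) == len(rows)
    return sum(values) / len(values)


def run_pipeline(input_text):
    """Read input, run the analysis, and render the output file contents."""
    rows = list(csv.DictReader(io.StringIO(input_text)))  # intermediate data
    mean = summarize(rows)                                # analysis result
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["mean"])
    writer.writerow([f"{mean:.2f}"])
    return rows, mean, out.getvalue()                     # written output


# Vetted sample input and the known-good values at each stage (made up here).
SAMPLE_INPUT = "id,value\n1,2.0\n2,4.0\n3,6.0\n"
KNOWN_ROWS = [
    {"id": "1", "value": "2.0"},
    {"id": "2", "value": "4.0"},
    {"id": "3", "value": "6.0"},
]
KNOWN_RESULT = 4.0
KNOWN_OUTPUT = "mean\n4.00\n"


def test_intermediate_data():
    rows, _, _ = run_pipeline(SAMPLE_INPUT)
    assert rows == KNOWN_ROWS


def test_results():
    _, mean, _ = run_pipeline(SAMPLE_INPUT)
    assert math.isclose(mean, KNOWN_RESULT)


def test_output_file():
    _, _, text = run_pipeline(SAMPLE_INPUT)
    assert text == KNOWN_OUTPUT
```

The `test_` functions follow the naming convention that common Python test runners discover automatically, but they are plain functions and can also be called directly. In a real project the sample input and the known outputs would live in files under version control rather than in string constants.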
6.2.1 The metaphysics of integration/system testing
- What are the theoretically possible workflow paths?
- Which ones are implemented? If you pull on this thread, you will discover that your code implements many partial workflows. This is a huge source of confusion for future users and maintainers. When you discover a partial workflow, you can clean up and/or reorganize in one of three ways:
- Finish implementing the complete workflow
- Strip out the workflow entirely. This usually requires more work than the alternatives.
- Explicitly stub out the un-implemented parts. The simplest way to do this is to leave comments: "X, Y, Z cases aren't handled yet. When you try them, we attempt to return an informative error."
- Which ones are tested?
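One way to stub out the un-implemented parts explicitly is to make each missing branch fail loudly with a message that says what is and is not supported. The sketch below is an assumption-laden illustration: `load_table` and the list of file types are hypothetical, not taken from any particular project.

```python
import csv
from pathlib import Path


def load_table(path):
    """Load a table from disk. Only the CSV branch of the workflow is implemented."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        with open(path, newline="") as handle:
            return list(csv.reader(handle))
    # JSON and Excel inputs aren't handled yet. Rather than failing somewhere
    # downstream, return an informative error at the point of the gap.
    if suffix in {".json", ".xlsx"}:
        raise NotImplementedError(
            f"{suffix} input is not handled yet; only .csv is implemented. "
            "Convert the file to CSV or implement this branch in load_table()."
        )
    raise ValueError(f"unrecognized input type {suffix!r} for {path}")
```

A stub like this also answers the "which ones are tested?" question cheaply: a test that expects `NotImplementedError` documents the gap and will fail, usefully, on the day someone fills it in.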
6.3 Version Control
Version control preserves a record of your changes over time.
Version control allows you to fearlessly collaborate.
6.3.1 Version control in practice
- One branch should always be deliverable, working code. Typically this is "main".
- New work happens on development branches.
- Merge new work using a "general and lieutenants" workflow:
- Developer ("lieutenant") pushes development branch to shared repository.
- Project lead ("general") merges development branch into main branch, or talks to developer if there's a conflict.
- Everyone writes a descriptive message for each commit.
- There are many possible workflows; the more your team knows, the more options you have.
6.4 Issue Tracking
6.4.1 Key features
- Issue title
- Issue description
- Issue creator
- Current assignee
- Status
- Dates (created, resolved, closed, re-opened)
- Comments
- Topic tags, version tags, etc.
- Version control integration ("fixed by commit X"; this is a nice-to-have but not necessary feature)
- Support for searching, filtering, and sorting
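To make the feature list concrete, here is a rough sketch of an issue as a data record, with one filter-and-sort query on top. It does not describe how any particular tracker stores its data; the `Issue` class, its field names, and `open_issues_for` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Issue:
    title: str
    description: str
    creator: str
    assignee: Optional[str] = None
    status: str = "open"
    created: date = field(default_factory=date.today)
    resolved: Optional[date] = None
    comments: list = field(default_factory=list)
    tags: set = field(default_factory=set)
    fixed_by_commit: Optional[str] = None  # version control integration (nice-to-have)


def open_issues_for(issues, assignee):
    """Filter to one assignee's open issues, sorted oldest first."""
    matches = [i for i in issues if i.status == "open" and i.assignee == assignee]
    return sorted(matches, key=lambda i: i.created)
```

If a candidate tool cannot answer a query like `open_issues_for` without manual scrolling, it is probably missing the searching, filtering, and sorting support listed above.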
6.4.2 Many options
GitHub, Trello, Microsoft Planner, Airtable, Jira, Fossil, Trac…
6.4.3 Demo
GitHub, because you're probably already using it.
6.5 Dependency management and environment management by language
Broadly speaking, you want to be able to set up a self-contained environment that contains all of your weird dependencies, such that you can tear it down and rebuild it if something goes wrong.
6.5.1 Python
- Conda package manager and environments (see example here)
- Pip and virtualenv
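The "tear it down and rebuild it" idea can be sketched with nothing but Python's standard-library `venv` module, which covers the same ground as virtualenv for simple cases. The function name `rebuild_environment` is hypothetical, and `with_pip=False` is used here only to keep the sketch fast and offline.

```python
import venv
from pathlib import Path


def rebuild_environment(env_dir):
    """Delete any existing environment at env_dir and create a fresh one."""
    # clear=True empties the directory first if it already exists, so a
    # broken environment is replaced rather than patched.
    builder = venv.EnvBuilder(clear=True, with_pip=False)
    builder.create(env_dir)
    # pyvenv.cfg is the marker file that venv writes into every environment.
    return Path(env_dir) / "pyvenv.cfg"
```

In a real project you would most likely pass `with_pip=True` and then install from a pinned requirements file, so that the rebuilt environment contains the same dependencies every time.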
6.5.3 SQL
Integrates with almost every language; check your language docs for usage information.
6.5.4 Parallel concerns for other languages
6.5.5 When does it make sense to use containers?
Containers and VMs add an additional maintenance and testing burden. It may still make sense to use them if:
- Your code needs to run in a remote environment (e.g., UCSD Supercomputing). In this case, using a container for setup and teardown may ultimately save time.
- You need to repeatedly recreate a computing environment.
6.6 Deployment
Where is the lever I pull to make this go? If you have an answer for dependency management, deployment (i.e., automatic recreation of your code in its environment) is trivial.
- Packaged environment and dependencies
- .condarc
- environment.yml file
- Description of environment and dependencies (otherwise how will we debug?)
- Git version
- Python version
- Shell type (bash, zsh, sh, dash, PowerShell)
- Have you tested this on Windows? I see by your face that you haven't.
- You can solve this problem with Docker!
- Now you have two problems.
- Containerizing is more likely to pay off in circumstances where you have to deploy to the cloud anyway (e.g., you are building and tearing down instances at UCSD Supercomputing)
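The "description of environment and dependencies" items can be captured automatically instead of by hand. The following is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper, and the shell lookup is a best guess based on environment variables rather than a reliable detection method.

```python
import os
import platform
import shutil
import subprocess


def describe_environment():
    """Collect the facts a maintainer needs to reproduce or debug a run."""
    git_path = shutil.which("git")
    if git_path:
        completed = subprocess.run(
            [git_path, "--version"], capture_output=True, text=True, check=False
        )
        git_version = completed.stdout.strip()
    else:
        git_version = "git not found"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "git": git_version,
        # SHELL usually names the user's login shell on Unix-like systems;
        # COMSPEC is the closest Windows analogue. Neither is guaranteed to
        # be the shell that launched this process.
        "shell": os.environ.get("SHELL") or os.environ.get("COMSPEC") or "unknown",
    }
```

Writing this dictionary to a log file at the start of every run gives you something to compare when the code works on one machine and not on another.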
6.7 How do we know when we're done?
- General enough
- Robust enough
- Extensible enough
- Tested enough