UC Davis DataLab High Performance Computing Toolkit (WIP)

Overview

This guide is a quick overview of High Performance Computing (HPC) at UC Davis. It draws heavily from several other resources, listed below. If you would like a more in-depth tutorial, please see the Introduction to Remote Computing workshop series.

If you need to talk with someone now, ask in the #hpc channel on the UC Davis Slack, or email the HPC helpdesk.

Consulted Resources

High Performance Computing at UC Davis

There are several HPC environments at UC Davis, but all of them require accounts to use. If you already have a sponsor, you can request an account here after you have generated an SSH key (see the next section). If you do not, you will need to check whether you qualify for any of the following cluster environments.

Cluster Environments:

Name      Sponsor                                               Status
Atomate   UCD Health
Cardio    UCD Health                                            Status
Crick     College of Biological Sciences                        Status
Demon     College of Letters and Science                        Status
FARM      College of Agricultural and Environmental Sciences    Status
Gauss     College of Letters and Science                        Status
HPC1      College of Engineering                                Status
HPC2      College of Engineering
Impact    College of Letters and Science                        Status
LSSC0     Bioinformatics Core
Peloton   College of Letters and Science                        Status

Connecting to a Server

To connect to a server, you will need a terminal on your local machine, which you will use to connect to the remote one. Linux and Mac machines already have a terminal built in; Windows users can follow our install guides (INSERT LINK) to get a terminal running on their machines.

(Maybe) Generating an SSH Key

It is expected that you will use an SSH keypair to connect to any of the previously mentioned servers. A keypair is a pair of files that allows a server to identify you when you try to connect; it's like an added layer of security beyond a password. If you use a service like GitHub, there is a good chance you already have a keypair. DO NOT GENERATE A NEW ONE. If you overwrite your current keypair, every service that relies on it will stop working, and there is no way to undo this.

To check whether you already have a keypair in the default location, enter cat ~/.ssh/id_rsa.pub in your terminal. If it displays text starting with ssh-rsa, you already have a keypair and should proceed to the next step.
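
For example, on a machine that already has a keypair, the command and its output might look roughly like this (the key itself and the trailing comment will differ on your machine):

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2E... you@your-laptop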

If text starting with ssh-rsa does not appear, you will need to make a keypair. To do so, open your terminal and enter ssh-keygen. When it prompts for a location, accept the default by pressing enter. It will then ask if you want to add a password; it is always more secure to have one. As you type the password, nothing will appear on screen; this is normal. You will be asked to enter it twice to confirm. Make sure you do not forget it, because there is no way to recover it.
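
The exchange looks roughly like the following on most systems (newer versions of OpenSSH may default to a different key type, such as ed25519, which is also fine):

ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/<YOU>/.ssh/id_rsa):    <- press enter to accept the default
Enter passphrase (empty for no passphrase):                       <- type a password (nothing will appear)
Enter same passphrase again:                                      <- type it again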

Requesting Access

Once you have a keypair, you will need to upload the public part of it to request an account on a cluster server. You can find the request form here. Make sure you upload the PUBLIC half of your keypair; it should be located at ~/.ssh/id_rsa.pub. You will also need to select which server you are requesting access to and who your sponsor is.

Logging On

Once you have received confirmation that your account has been created, you can try to log on. It may be helpful to create an SSH profile first. To do this, enter nano ~/.ssh/config. This will start the nano text editor. In this new text file, copy the following, filling in the information inside the angle brackets.

Host <SHORT SERVER NAME>
   HostName <IP ADDRESS OF SERVER>
   User <YOUR USER NAME>

Once you are done, hit Ctrl-X, then Y and enter to save and exit. Now, you should be able to enter ssh <SHORT SERVER NAME> and connect to the server!
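
For example, a filled-in profile might look like the following (the short name, hostname, and user name here are placeholders, not real values; use the details from your account confirmation):

Host farm
   HostName farm.example.ucdavis.edu
   User jdoe

With that profile in place, connecting is just ssh farm.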

Using a Remote Server

Once you can log in to the remote server, you can navigate around your home directory as normal. However, the process of running your code will be slightly different. These servers are cluster servers: they have access to several workers on which to run jobs. The server you actually log in to is just the coordinator, or head node, for these workers, and shouldn't be used for any jobs. We will cover how to send jobs to the workers in a moment; for now, let's cover getting files onto the server.

Moving Files

You will need to copy your files from your machine to the remote server before you can use them there. There are two common ways to do this: using a dedicated file transfer program like FileZilla, or using the command line.

FileZilla is a free, multi-platform tool for copying files. If you would like to use it, download it from their site and follow their tutorial here.

To copy files using the command line, you can use either scp or rsync. scp works like a normal copy/paste, while rsync keeps track of copied files and will only copy a file again if it has been updated. rsync has the added benefit of being able to resume where it left off if there is an interruption. To move files, navigate to their location on your local machine in your terminal. Once there, either of the following will recursively copy a directory to your home directory on the server.

scp -r <LOCAL DIRECTORY LOCATION> <SHORT SERVER NAME>:~/
rsync -avzP -e 'ssh -p 22' <LOCAL DIRECTORY LOCATION> <SHORT SERVER NAME>:~/
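
As a concrete (hypothetical) example, copying a local directory named my_project to a server with the SSH profile name farm would look like either of the following:

scp -r my_project farm:~/
rsync -avzP -e 'ssh -p 22' my_project farm:~/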

Running Jobs

The server you connect to should not be used for running any jobs. Rather, this coordinator head node is used to send tasks to the worker compute nodes within the cluster. A common piece of software for this coordination is called slurm, and we will go over the basics of using it here. It is possible that the server you are using does not use slurm, in which case you will need to follow the documentation for whichever scheduler it does use.

First, take a look at the current state of the cluster using sinfo. This will show you the nodes on the server and what they are doing. If you would like to see the jobs that are in the work queue, use squeue instead. If you need to kill one of your jobs, you can do so with scancel <JOBID>.
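
For reference, the basic monitoring commands are ($USER expands to your own user name):

sinfo              # show the nodes and partitions and their current state
squeue             # show all jobs in the queue
squeue -u $USER    # show only your jobs
scancel <JOBID>    # cancel one of your jobs by its ID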

Before you start working on jobs, it is recommended that you make a directory for each project. A template is provided below, and the commands to create it are sketched after the template. It is important for you to make these directories yourself, as slurm cannot make them for you; if they are missing, your job will fail silently.

~/
├─<PROJECT NAME>/
| ├─data/
| ├─slurm_scripts/
| ├─scripts/
| ├─logs/
| ├─.gitignore
| ├─README.md
├─Foo/
├─Bar/
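
A minimal sketch of the commands to create this structure, assuming a hypothetical project named my_project:

mkdir -p ~/my_project/{data,slurm_scripts,scripts,logs}   # create the project directory and its subdirectories
touch ~/my_project/.gitignore ~/my_project/README.md      # create empty placeholder files to fill in later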

After you have your project directory set up, you can start making scripts for your jobs. Keep these in the slurm_scripts directory. You will need to create bash scripts which you will enter into the job queue. These scripts call the other pieces of code to be run, and must have a specific format with metadata for slurm to read. This metadata goes in the header of the script on lines starting with #SBATCH. The flags that follow are the same ones you can pass when entering the command in the terminal; more options are listed under Additional Arguments below. Here is an annotated example header:

#!/bin/bash -l                          # Run this as a bash script
#SBATCH -D ../                          # Set the project root directory
#SBATCH -o ../logs/log1-stdout-%j.txt   # Set the output location for your standard output
#SBATCH -e ../logs/log1-stderr-%j.txt   # Set the output location for your standard errors
#SBATCH -J MY_JOB                       # Name your job
#SBATCH -t 24:00:00                     # Set the maximum runtime (here, 24 hours)
set -u                                  # Error if unset variables are called
set -e                                  # Error if a single command fails
set -x                                  # Print commands as they run

scontrol show job ${SLURM_JOB_ID}       # Print out stats for your job

<PUT YOUR CODE HERE>                    # Whatever your code is, most likely calling R or Python
You can send this script to be run by the compute nodes by calling sbatch -p <PARTITION NAME> <YOUR SCRIPT>. You should already know which partitions you have access to, but you can see the list again using sinfo. Once you hit enter, you should see your job in the queue when you run squeue. If it does not appear, it's time to check your logs in the logs directory (a good idea to do that anyway!).
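
As a hypothetical example, if your script were saved as slurm_scripts/my_job.sh and you had access to a partition named high (both placeholders), submission and a quick check would look like:

sbatch -p high slurm_scripts/my_job.sh   # send the script to the job queue on the "high" partition
squeue -u $USER                          # confirm the job appears in the queue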

Running Interactively

Sometimes you just need to try things yourself. You can open a bash shell on a compute node for 2 hours by entering srun -t 2:00:00 --pty /bin/bash. To open an interactive R session on a compute node instead, enter srun -p <PARTITION NAME> -t 2:00:00 --pty R. These commands open an interactive session for you, limited to 2 hours. You can change the time, but you will be kicked off when it runs out!

Additional Arguments

There are a few other options you might want to pass along when you run a batch or interactive job. Just add these to the command when you run it; an example is sketched after the list below.

  • --mem=<number>Gb = request a certain amount of memory

  • -c <number> = request a certain number of CPUs

  • --mail-user=<YOUR EMAIL> --mail-type=<ARGUMENT> = Mail you according to the argument you provide, valid options include:

    • ALL

    • BEGIN

    • END

    • FAIL

    • REQUEUE

    • TIME_LIMIT

    • TIME_LIMIT_90

    • TIME_LIMIT_80

    • TIME_LIMIT_50

    • ARRAY_TASKS
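
For example, a hypothetical submission requesting 16 GB of memory and 4 CPUs, with an email sent when the job ends or fails, might look like this:

sbatch -p <PARTITION NAME> --mem=16Gb -c 4 --mail-user=<YOUR EMAIL> --mail-type=END,FAIL slurm_scripts/my_job.sh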

Contributions

This research toolkit is maintained by the UC Davis DataLab and is open for contribution. See how you can contribute on the GitHub repo.

This toolkit has been made possible thanks to contributions by: