How to submit and run jobs on AI.Panther and Blueshark

NOTE: For information on using installed packages on AI.Panther and Blueshark, please see this KB article on environment modules.

Introduction

To submit jobs on AI.Panther and Blueshark, you must use the Slurm Workload Manager scheduler to run them. This differs from how you would normally run a command: you need to prepare a submission script and, optionally, make your code MPI-capable.

All software must be run through the Slurm scheduler. These clusters are shared resources; if a user runs code on the login node, it causes login and performance issues for everyone else. Users who run on the login node will have their code terminated, and continued disregard will result in the user being blocked from using these clusters.

Connecting to AI.Panther and Blueshark

To access the AI.Panther and Blueshark HPC clusters, you need an SSH client installed on your device.

The AI.Panther hostname for SSH is ai-panther.fit.edu

The Blueshark hostname for SSH is blueshark.fit.edu

The clusters authenticate using your TRACKS username and password. Please be aware that your TRACKS username is not your full email address; it is only the part before the '@' symbol.
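For example, from a terminal you could connect with (replace username with your TRACKS username):

ssh username@ai-panther.fit.edu

or, for Blueshark:

ssh username@blueshark.fit.edu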

Resource Management

The clusters use the Slurm Workload Manager to manage available resources and to distribute jobs to free compute nodes. Slurm also provides a queuing system; if not enough resources are available, it will hold your job until it can run it. 

Slurm Example Submission Script

In order to submit a job to Slurm, a job submission script must be created. An example submission script is provided below:

#!/bin/bash

#SBATCH --job-name=TestJob
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --mem=50MB
#SBATCH --time=00:15:00
#SBATCH --partition=short
#SBATCH --error=testjob.%J.err
#SBATCH --output=testjob.%J.out

module load mpich

echo "Starting at $(date)"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is $(pwd)"

The only options you absolutely need are the following (a minimal example follows this list):

  • --job-name   — a unique name for your job. This can be set to anything.
  • --nodes      — the number of nodes to request.
  • --ntasks     — the total number of tasks across all nodes.
  • --mem        — the amount of memory to request on each node. This is a hard limit, and you will run into out-of-memory errors if you request too little.
  • --partition  — the partition for your job. Valid partitions can be listed with sinfo.
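A minimal submission script using only these options might look like the one below. The job name, resource amounts, partition, and program name are illustrative placeholders; adjust them to fit your job.

#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --partition=short

# Replace with the command you actually want to run
./my_program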

If you need MPICH to run your jobs, set it to load at login using:

module initadd mpich

or by placing

module load mpich

in your submission script, as shown above. Many more options are available for Slurm submission scripts; more information is available at: https://slurm.schedmd.com/sbatch.html.

Submitting a Job

Slurm has its own set of commands for job management. To submit your submission script, use 

sbatch script.sh

Some other commands you may want to use are listed below; a short example follows the list.

  • squeue      — lists jobs that are currently queued or running for all users.
  • sinfo       — shows the status of nodes and partitions.
  • scancel     — cancels a queued or running job.
  • sstat       — shows status information for a running job.
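For example, to check only your own jobs, inspect a running job, or cancel one (the job ID 12345 is illustrative):

squeue -u $USER    # list only your jobs
sstat -j 12345     # show statistics for running job 12345
scancel 12345      # cancel job 12345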

For a more in-depth look into Slurm and its respective commands, check out the Slurm quick start guide.

Optimizing your Submission Script

Slurm will attempt to run your job wherever it can place it; however, where and when it can be placed depends heavily on the resources your submission script requests. If you can reduce your requirements, your job has a much better chance of being scheduled quickly.

Memory Requirements

--mem is most often used to specify the amount of memory your job needs per node. However, the right value depends on how many tasks fit on a node and how many nodes you require. If you don't specify the number of nodes you need, Slurm won't balance out the tasks, often leading to out-of-memory errors on nodes where more tasks were placed than expected. Another issue arises when the cluster is under heavy use: small pockets of resources are scattered across the cluster and are hard to use when your job needs a fixed amount of memory per node.

To avoid this, you can use --mem-per-cpu instead. If each task only requires a certain amount of memory, specify that per-CPU amount. This lets the scheduler allocate resources more flexibly: if the tasks together need more memory than a single node has free, they will be spread across nodes, and if there is a small pocket of free resources that a single task fits into, the scheduler can place a task there.
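As a sketch, if each task in the earlier example only needed about 25 MB, the request could be expressed per CPU instead of per node (the values are illustrative):

#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=25MB    # memory per allocated CPU, instead of a fixed --mem per node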

Partitions

Slurm's partitions can be thought of as job queues, each with its own set of constraints such as job size limit, job time limit, which users may use it, and so on. Jobs that only need a short amount of time but a large number of processors are categorized differently from jobs that may need to run for days but require fewer processors. In addition, partitions can be used to group together nodes with hardware that other nodes don't have (for example, a GPU partition whose nodes have GPUs).

To set a partition, use:

#SBATCH --partition=[partition]

in your submission script, or specify it on the command line using --partition.
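For example, to submit the earlier script to the short partition from the command line:

sbatch --partition=short script.sh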

AI.Panther Partitions

These partitions exist on AI.Panther:

Partition Name | Max Compute Time | Max Nodes | Allowed Groups
short          | 45 minutes       | 16        | AI.Panther users
med            | 4 hours          | 16        | AI.Panther users
long           | 7 days           | 16        | AI.Panther users
eternity       | infinite         | 16        | AI.Panther users
gpu1           | infinite         | 4         | gpu1 users
gpu2           | infinite         | 4         | gpu2 users

NOTE: gpu1 consists of 4 nodes, each with 4 A100 GPUs using SXM4 (NVLink) interconnects. This partition is set up for jobs that require a high amount of bandwidth between GPUs compared to the A100 PCIe nodes. Additional information comparing the two: https://infohub.delltechnologies.com/p/accelerating-hpc-workloads-with-nvidia-a100-nvlink-on-dell-poweredge-xe8545/

NOTE: gpu2 consists of 4 nodes, each with 4 A100 GPUs using PCIe interconnects.

Blueshark Partitions

These partitions exist on Blueshark:

Partition Name | Max Compute Time | Max Nodes | Allowed Groups
short          | 45 minutes       | N/A       | blueshark users
med            | 4 hours          | N/A       | blueshark users
long           | 7 days           | N/A       | blueshark users
eternity       | infinite         | 20        | blueshark users
class          | 10 minutes       | 6         | Parallel Programming class
gpu            | infinite         | 10        | blueshark GPU users

Running GPU Jobs

Running GPU jobs is very similar to running regular jobs: an extra parameter (--gres) has to be passed, and the partition must be set to a GPU partition (gpu on Blueshark; gpu1 or gpu2 on AI.Panther).

#SBATCH --partition=gpu

will set your partition.

#SBATCH --gres=gpu:[#]

will set the number of GPUs you want per node. Note that this differs from --ntasks, described earlier: --gres requests that many GPUs from each node. Thus, if you request 4 nodes with --gres=gpu:2, you will have [4 nodes] * [2 GPUs/node] = 8 GPUs in total. This option cannot exceed 4, as there are only 4 GPUs per node.

GPUs can also be selected based on whether or not they support GPUDirect technology. Each GPU node has 2 standard GPUs and 2 GPUDirect-enabled GPUs. To select between the two, use:

#SBATCH --gres=gpu:[type]:[#]

where [type] is either gpudirect or standard.
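Putting this together, a minimal GPU submission script might look like the one below. The job name, GPU count, time limit, output file names, and program name are illustrative placeholders.

#!/bin/bash
#SBATCH --job-name=GpuTestJob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2              # request 2 GPUs on the node
#SBATCH --time=01:00:00
#SBATCH --output=gputestjob.%J.out
#SBATCH --error=gputestjob.%J.err

echo "Running on hosts: $SLURM_NODELIST"

# Replace with the GPU program you actually want to run
./my_gpu_program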
