Slurm operations

Testing basic functionality

We assume that you have carried out the above deployment along the lines of Slurm installation and upgrading, Slurm configuration, Slurm database, Slurm accounting and Slurm job scheduler.

From the Head node try to submit an interactive job:

srun -N1 /bin/hostname

If srun hangs, check the firewall settings described in Slurm configuration. Please note that interactive batch jobs from Login nodes seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.

To display the job queue:

scontrol show jobs

To submit a batch job script using sbatch:

sbatch -N1 <script-file>

System information

Useful sysadmin commands:

  • sinfo - view information about Slurm nodes and partitions.

  • showpartitions - Print a Slurm cluster partition status overview with 1 line per partition.

  • squeue - view information about jobs located in the Slurm scheduling queue

  • scancel - signal or cancel jobs and job steps

  • smap - graphically view information about Slurm jobs, partitions, and set configuration parameters

  • sview - graphical user interface to view and modify Slurm state (requires gtk2)

  • scontrol - view and modify Slurm configuration and state

Slurm test suite

There is a large test suite, see the Testing section of the Slurm_Quick_Start Administrator Guide. The test suite is in the source .../testsuite/expect/ directory, see the file README.

The testsuite should be copied to the shared filesystem, for example, /home/$USER/testsuite/ and run by a non-root user:

cd testsuite/expect
./regression

MPI setup

MPI use under Slurm depends upon the type of MPI being used, see MPI_and_UPC_Users_Guide. The current versions of Slurm and OpenMPI support task launch using the srun command, see the MPI_Guide_OpenMPI.

For PMIx please see the PMIx_Slurm_support page.

You must add these flags when building OpenMPI:

--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr

The Slurm RPM installs header files in /usr/include/slurm and libraries in /usr/lib64. Using the OpenMPI tools, verify the installation of slurm as well as pmi modules, for example:

# ompi_info | egrep -i 'slurm|pmi'
                MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
               MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
               MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
           MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
            MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)

Since Slurm provides both the PMI and PMI-2 interfaces, this advice in MPI_Guide_OpenMPI is important:

If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line.

Hence you must invoke srun like:

srun --mpi=pmi2

It may alternatively be convenient to add this line to slurm.conf:

MpiDefault=pmi2

See the FAQ: Running jobs under Slurm and the Process Management Interface (PMI) page.

MPI locked memory

MPI stacks running over an InfiniBand or Omni-Path (by Cornelis Networks) network fabric require the ability to allocate more locked memory than the default limit. Unfortunately, user processes on login nodes may have a small locked-memory limit (check it with ulimit -a), which by default is propagated into Slurm jobs and hence causes fabric errors for MPI. See the memlock FAQ.

This is fixed by adding to slurm.conf:

PropagateResourceLimitsExcept=MEMLOCK

You can view the running slurmd process limits by:

cat "/proc/$(pgrep -u 0 -o -x slurmd)/limits"   # -x matches slurmd exactly (not slurmstepd), -o selects the oldest process
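To compare the limit seen by slurmd with the limit in a login shell, the relevant line can be picked out of a limits file. The helper below is only a minimal sketch and not part of Slurm: the function name memlock_limit is made up for illustration, and it assumes the Linux /proc/<pid>/limits layout, in which the soft limit is the fourth whitespace-separated field of the "Max locked memory" line.

```shell
# Hypothetical helper (not a Slurm tool): print the soft "Max locked memory"
# limit from a /proc/<pid>/limits style file given as the first argument.
memlock_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Example usage:
#   memlock_limit /proc/$$/limits                                  # current shell
#   memlock_limit "/proc/$(pgrep -u 0 -o -x slurmd)/limits"        # slurmd daemon
```

With PropagateResourceLimitsExcept=MEMLOCK in place, the value reported for slurmd, rather than the login shell's value, is the one that should matter for jobs.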

CPU management

It is important to understand how Slurm manages nodes, CPUs, tasks etc. This is documented in the cpu_management page.

GPU accelerators

Configure Slurm for GPU accelerators as described in the Slurm configuration page under the GRES section.

The AutoDetect configuration in gres.conf can be used to detect GPU hardware (currently Nvidia and AMD).

You should set the default number of CPUs allocated per allocated GPU (DefCpuPerGPU) for each partition containing GPUs in the slurm.conf file, for example:

PartitionName=xxx DefCpuPerGPU=4 ...

For accounting of GPU usage you must add the GPU GRES types to the AccountingStorageTRES parameter in slurm.conf, for example:

AccountingStorageTRES=gres/gpu,gres/gpu:tesla

and restart slurmctld so that these new fields are added to the database.

Nvidia GPUs

It is possible to build Slurm packages which include the Nvidia NVML library for easy handling of GPU hardware. NVML automatically detects GPUs, their type, cores, and NVLinks. Quoting the GRES page:

If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any system-detected NVIDIA GPU.
This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect.

However, it is not necessary to include the NVML in your Slurm packages, since you can configure gres.conf manually for the GPU hardware in your nodes. See the mailing list thread Building Slurm RPMs with NVIDIA GPU support?.

Nvidia drivers

Download Nvidia drivers from https://www.nvidia.com/Download/index.aspx and select the appropriate GPU version and host operating system. You can also download and install Nvidia UNIX drivers, and the CUDA toolkit from https://developer.nvidia.com/cuda-downloads.

To verify the availability of GPU accelerators in a node run the nvidia-smi command:

nvidia-smi -L

which is installed with the xorg-x11-drv-nvidia RPM package.

GPU monitoring tools

There is a useful page Top 3 Linux GPU Monitoring Command Line Tools recommending the tools gpustat, nvtop, and nvitop. The NVIDIA tool nvidia-smi can of course also be used.

We recommend the gpustat tool which gives a 1-line status of each GPU in the system. The installation on EL8 systems is a bit tricky, so use these commands:

dnf install gcc python3-devel
python3 -m pip install setuptools-scm
python3 -m pip install gpustat

Our Slurm monitoring tools psjob and psnode use gpustat on nodes with GPU GRES to print a GPU usage summary.

RPC rate limiting

It is common to experience users who bombard the slurmctld server by executing commands such as squeue, sinfo, sbatch or the like with many requests per second. This can potentially make the slurmctld unresponsive and therefore affect the entire cluster.

The ability to do RPC rate limiting on a per-user basis is a new feature with Slurm 23.02. It acts as a virtual bucket of tokens that users consume with Remote Procedure Calls (RPC). The RPC logging frequency (rl_log_freq) is a new feature with Slurm 23.11.

Enable RPC rate limiting in slurm.conf by adding rl_enable and other parameters, for example:

SlurmctldParameters=rl_enable,rl_refill_rate=10,rl_bucket_size=50,rl_log_freq=10

NOTE: After changing SlurmctldParameters, run scontrol reconfig to restart slurmctld. See also bug_18067.

This allows users to submit a large number of requests in a short period of time, but not a sustained high rate of requests that would add stress to the slurmctld. You can define:

  • The maximum number of tokens with rl_bucket_size,

  • the rate at which new tokens are added with rl_refill_rate,

  • the frequency with which tokens are refilled with rl_refill_period

  • and the number of entities to track with rl_table_size.

  • New in 23.11: rl_log_freq option to limit the number of RPC limit exceeded… messages that are logged.
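How the bucket size and the refill rate interact can be illustrated with a toy model. The sketch below is a simplified simulation and not Slurm's implementation: it assumes that tokens are refilled once per period, that every RPC costs one token, and that an RPC arriving at an empty bucket is rejected. The bucket size of 50 and refill rate of 10 are taken from the example above, while the load of 30 requests per period over 5 periods is made up for illustration.

```shell
# Simplified token-bucket model (an illustration, not Slurm's actual code).
# Arguments: bucket size, tokens refilled per period,
#            requests sent per period, number of periods.
simulate_bucket() {
    size=$1 refill=$2 rate=$3 periods=$4
    tokens=$size accepted=0 rejected=0 p=1
    while [ "$p" -le "$periods" ]; do
        # Refill at the start of every period after the first, capped at size.
        if [ "$p" -gt 1 ]; then
            tokens=$((tokens + refill))
            if [ "$tokens" -gt "$size" ]; then tokens=$size; fi
        fi
        r=1
        while [ "$r" -le "$rate" ]; do
            if [ "$tokens" -gt 0 ]; then
                tokens=$((tokens - 1))
                accepted=$((accepted + 1))
            else
                rejected=$((rejected + 1))
            fi
            r=$((r + 1))
        done
        p=$((p + 1))
    done
    echo "accepted=$accepted rejected=$rejected"
}

simulate_bucket 50 10 30 5    # prints: accepted=90 rejected=60
```

In this model the initial burst is absorbed by the full bucket during the first two periods, after which only 10 of the 30 requests per period get through.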

When this is enabled you may find lines in slurmctld.log such as:

[2023-10-06T10:22:32.893] RPC rate limit exceeded by uid 2851 with REQUEST_SUBMIT_BATCH_JOB, telling to back off

We have written a small script sratelimit for summarizing such log entries.
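For a quick look without extra tooling, such entries can also be counted per uid with awk. This is only a minimal sketch and not the sratelimit script: the function name summarize_rate_limits is made up, and it assumes the log line format shown above, in which the uid is the word following "uid".

```shell
# Hypothetical helper: count "RPC rate limit exceeded" log entries per uid.
# Reads the files given as arguments, or standard input if none are given.
# Output: one "uid count" line per user, busiest user first.
summarize_rate_limits() {
    awk '/RPC rate limit exceeded by uid/ {
             for (i = 1; i < NF; i++)
                 if ($i == "uid") count[$(i + 1)]++
         }
         END { for (u in count) print u, count[u] }' "$@" |
        sort -k2,2nr -k1,1n
}

# Example usage (the log file location depends on SlurmctldLogFile):
#   summarize_rate_limits /var/log/slurm/slurmctld.log
```

The uid values can be translated to user names with getent passwd <uid>.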

Utilities for Slurm

Here we list some third-party utilities that Slurm administrators or users may find useful:

Graphical monitoring tools

There exist a number of Open Source tools for graphical monitoring of Slurm:

Working with Compute nodes

Slurm power saving scripts

Slurm provides an integrated power saving mechanism for powering down idle nodes, and starting them again when jobs need to be scheduled, see the Slurm_Power_Saving_Guide.

We provide some Slurm_power_saving_scripts which may be useful for power management using IPMI or with cloud services.

Expanding and collapsing host lists

Slurm displays node/host lists in a compact format, for example node[001-123]. Sometimes you want to expand the host list, for example in scripts, to list all nodes individually.

You can use this command to output hostnames one line at a time:

scontrol show hostnames node[001-123]

or rewrite the list into a single line with paste:

scontrol show hostnames node[001-123] | paste -s -d ,

To contract expanded hostlists:

# scontrol show hostlistsorted h003,h002,h001
h[001-003]
# scontrol show hostlist h003,h002,h001
h[003,002,001]

When the server does not have the slurm RPM installed, or for more sophisticated host list processing, some non-Slurm tools may be used as shown below.
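If none of these tools is available either, a simple host list with a single bracket group can be expanded with awk alone. The function below is a minimal sketch under stated assumptions: the name expand_simple_hostlist is made up, and it handles only one trailing [...] group containing comma-separated numbers or ranges, not the full Slurm hostlist syntax.

```shell
# Hypothetical helper: expand e.g. node[001-003,007] to one hostname per line.
# Only a single trailing bracket group is supported; zero padding is taken
# from the width of the first number in each range.
expand_simple_hostlist() {
    printf '%s\n' "$1" | awk '
    {
        lb = index($0, "[")
        if (lb == 0 || substr($0, length($0)) != "]") { print; next }
        prefix = substr($0, 1, lb - 1)
        n = split(substr($0, lb + 1, length($0) - lb - 1), parts, ",")
        for (p = 1; p <= n; p++) {
            m = split(parts[p], r, "-")
            lo = r[1]
            hi = (m > 1) ? r[2] : r[1]
            fmt = "%s%0" length(lo) "d\n"
            for (i = lo + 0; i <= hi + 0; i++) printf fmt, prefix, i
        }
    }'
}

# Example usage:
#   expand_simple_hostlist 'node[001-003]'
```

Quote the argument so that the shell does not treat the brackets as a glob pattern.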

The nodeset command

The ClusterShell_tool's nodeset command (see below) enables easy manipulation of node sets, as well as node groups, at the command line level. For example:

$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19

The hostlist command

The python-hostlist tool is very convenient for expanding or compressing node lists.

To install this tool (make sure to download the latest release):

dnf install python3-devel
wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-2.2.1.tar.gz
rpmbuild -ta python-hostlist-2.2.1.tar.gz
dnf install ~/rpmbuild/RPMS/noarch/python-hostlist-2.2.1-*.rpm

For usage see the python-hostlist page, but a useful example is:

# hostlist --expand --sep " "  n[001-012]
n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012

The snodelist command

The snodelist command is a tool for working with Slurm hostlists. It is an alternative to relying on scontrol show hostnames to expand a Slurm compact host list to a newline-delimited list. Installation instructions are in the snodelist page.

SSH keys for password-less access to cluster nodes

Users may need SSH access to Slurm compute nodes, for example, if they have to use an MPI library which uses SSH instead of Slurm to start MPI tasks.

However, it is a good idea to configure the slurm-pam-adopt module on the nodes to control and restrict SSH access, see Slurm_configuration#pam-module-restrictions.

The SSH (Secure Shell) configuration files including server private/public keys are in the /etc/ssh/ folder.

The file /etc/ssh/ssh_known_hosts containing the SSH public keys of all nodes should be created on the central server and distributed to all Slurm nodes. The ssh-keyscan tool is very convenient for gathering SSH public keys of the cluster nodes, some examples are:

ssh-keyscan -t ssh-ed25519 node001 node002                   # Scan nodes node001+node002 for key type ssh-ed25519
scontrol show hostnames node[001-022] | ssh-keyscan -f - 2>/dev/null | sort # Scan nodes node[001-022], pipe comments to /dev/null, and sort the output
sinfo -Nho %N | uniq | ssh-keyscan -f - 2>/dev/null | sort          # Scan all Slurm nodes (uniq suppresses duplicates)
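After distributing the file it can be useful to check that no node was missed, for example because it was down during the scan. The function below is a minimal sketch: the name missing_known_hosts is made up, and it assumes unhashed known_hosts entries as written by ssh-keyscan, where the first field holds the hostname (or a comma-separated list of names).

```shell
# Hypothetical helper: print the hostnames read from standard input that have
# no entry in the known_hosts file given as the first argument.
missing_known_hosts() {
    awk -v kh="$1" '
        FILENAME == kh {
            # First field may be "name1,name2,..." (unhashed entries only).
            n = split($1, names, ",")
            for (k = 1; k <= n; k++) seen[names[k]] = 1
            next
        }
        !($1 in seen)
    ' "$1" -
}

# Example usage:
#   scontrol show hostnames 'node[001-022]' | missing_known_hosts /etc/ssh/ssh_known_hosts
```

An empty output means that every expected node has at least one key in the file.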

Remember to set the SELinux context correctly for the files in /etc/ssh:

chcon system_u:object_r:etc_t:s0 /etc/ssh/ssh_known_hosts

When all SSH public keys of the Slurm nodes are available in /etc/ssh/ssh_known_hosts, each individual user can configure a password-less SSH login. First the user must generate personal SSH keys (placed in the $HOME/.ssh/ folder) using the ssh-keygen tool.

Each user may use the convenient tool authorized_keys for generating SSH keys and adding them to the $HOME/.ssh/authorized_keys file.

For external computers the personal SSH_authorized_keys (preferably protected with a passphrase or Multi-Factor Authentication) should be used.

For the servers running the slurmctld and slurmdbd services it is strongly recommended not to permit login by normal users because they have no business on those servers! To restrict which users can log in to the management hosts, append this line to the SSH server /etc/ssh/sshd_config file:

AllowUsers root

You can add more trusted system managers to this line if needed. Then restart the SSH service:

systemctl restart sshd

Host-based authentication

Another way to enable password-less SSH login is to configure login nodes and compute nodes in the cluster to allow Host-based_Authentication. Please beware that:

Here are the steps for configuring Host-based_Authentication:

  1. First populate all SSH keys in the file /etc/ssh/ssh_known_hosts as shown above.

  2. Configure only these lines in the SSH client configuration /etc/ssh/ssh_config on all nodes:

    HostbasedAuthentication yes
    EnableSSHKeysign yes
    

    These lines do not work inside Host or Match statements, but must be defined at the global level.

    You may also configure PreferredAuthentications (order of authentication methods) so that the hostbased method is preferred for the nodes in the cluster’s domainname (replace by your DNS domain). Furthermore GSSAPI and ForwardX11Trusted may be configured:

    Host *.<domainname>
      PreferredAuthentications gssapi-keyex,gssapi-with-mic,hostbased,publickey,keyboard-interactive,password
      GSSAPIAuthentication yes
      ForwardX11Trusted yes
    

    The ssh_config manual page explains the configuration keywords.

    GSSAPI authentication and key exchange for SSH (Generic Security Service Application Program Interface (GSS-API) Authentication and Key Exchange for the Secure Shell (SSH) Protocol) is defined in rfc4462.

  3. Add these lines to the SSH server /etc/ssh/sshd_config file on all nodes:

    HostbasedAuthentication yes
    UseDNS yes
    

    and restart the SSH service:

    systemctl restart sshd
    
  4. Populate the file /etc/ssh/shosts.equiv for every node in the cluster listed in /etc/ssh/ssh_known_hosts with 1 line per node including the full DNS domainname, for example:

    node001.<domainname>
    node002.<domainname>
    ...
    

    Wildcard hostnames are not possible, so you must list all hosts one per line. To list all cluster nodes:

    sinfo -Nho %N | uniq | awk '{print $1 ".domainname"}' > /etc/ssh/shosts.equiv
    

    where you must substitute your own domainname.

Remember to set the SELinux context correctly for the files in /etc/ssh:

chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts

A normal (non-root) user should now be able to log in from a node to itself, for example:

testnode$ ssh -v testnode

and the verbose output should inform you:

debug1: Authentication succeeded (hostbased).

ClusterShell

ClusterShell provides a light and unified command execution Python framework to help administer GNU/Linux or BSD clusters. There is a ClusterShell_manual and a ClusterShell_configuration guide.

Install the ClusterShell_tool from the EPEL repository:

dnf install epel-release
dnf install clustershell

Copy the example group source file for Slurm:

cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf

You should define slurm as the default group in /etc/clustershell/groups.conf:

[Main]
# Default group source
default: slurm

It is convenient to add a Slurm binding for all running jobs belonging to a specific user. Append to /etc/clustershell/groups.conf.d/slurm.conf the lines:

#
# SLURM user job bindings
#
[slurmuser,su]
map: squeue -h -u $GROUP -o "%N" -t running
list: squeue -h -o "%u" -t R
reverse: squeue -h -w $NODE -o "%u"
cache_time: 60

This feature was included in ClusterShell version 1.8.1.

You may encounter some surprising zero-padding behavior in node names, see also issue_293.

ClusterShell usage

You can list all node groups including hostnames and node counts using this ClusterShell_tool command:

cluset -LLL

Simple usage of clush:

clush -w node[001-003] date

For a Slurm partition:

clush -g <partition-name> date

If option -b or --dshbak is specified, clush waits for command completion while displaying a progress indicator and then displays gathered output results:

clush -b -g <partition-name> date

To execute a command only on nodes with a specified Slurm state (here: drained):

clush -w@slurmstate:drained date
clush -bw@slurmstate:down 'uname -r; dmidecode -s bios-version'

To execute a command only on nodes running a particular Slurm JobID (here: 123456):

clush -w@sj:123456 <command>

To execute a command only on nodes running jobs for a particular username (requires the above mentioned slurmuser configuration):

clush -w@su:username <command>

If you want to run commands on hosts not under Slurm, select a group source defined in /etc/clustershell/groups (see man clush):

clush -s GROUPSOURCE <other arguments>
clush --groupsource=GROUPSOURCE <other arguments>

For example:

clush -s local -g testcluster <command>

The nodeset command enables easy manipulation of node sets, as well as node groups, at the command line level. For example:

$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19

Copying files with ClusterShell

When ClusterShell_tool has been set up, it is very simple to copy files and folders to nodes, see the clush manual page. Example:

clush -bw node[001-099] --copy /etc/slurm/slurm.conf --dest /etc/slurm/

Listing nodes

Use sinfo to list nodes that are responding (for example, to be used in clush scripts):

sinfo -r -h -o '%n'
sinfo --responding --noheader --format='%n'

List reasons nodes are in the down, drained, fail or failing state:

sinfo -R
sinfo --list-reasons
sinfo -lRN

List of nodes with features and status:

sinfo --format="%25N %.40f %.6a %.10A"

Use scontrol to list node properties:

scontrol -o show nodes <Nodename>

Listing node resources used

Use sinfo to see what resources are used/remaining on a per node basis:

sinfo -Nle -o '%n %C %t'

The flag -p <partition> may be added. A node state listed with a * suffix means that the node is not responding.

Note the STATE column:

  • State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk respectively.

    Note that the suffix “*” identifies nodes that are presently not responding.
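The %C field prints the CPU counts of a node in the form allocated/idle/other/total, which makes it easy to sum up the free CPUs. The function below is a minimal sketch: the name idle_cpus is made up, and it assumes input lines in the "%n %C %t" format used above, without a header line.

```shell
# Hypothetical helper: sum the idle CPU counts (second part of the
# allocated/idle/other/total field) over all lines read from standard input.
idle_cpus() {
    awk '{ split($2, cpus, "/"); idle += cpus[2] } END { print idle + 0 }'
}

# Example usage (sort -u removes the duplicate lines that -N prints for
# nodes belonging to several partitions):
#   sinfo -h -N -o '%n %C %t' | sort -u | idle_cpus
```

Note that idle CPUs on nodes in the down or drained state are counted as well unless such lines are filtered out first.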

Resume an offline node

A node may get stuck in an offline mode for several reasons. For example, you may see this:

# scontrol show node q007

NodeName=q007 Arch=x86_64 CoresPerSocket=2
...
 State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
...
 Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]

A node state listed with a * suffix means that the node is not responding.

It is very difficult to find documentation on how to clear such an offline state. The solution is to use the scontrol command (section SPECIFICATIONS FOR UPDATE COMMAND, NODES):

scontrol update nodename=a001 state=down reason="undraining"
scontrol update nodename=a001 state=resume

See also How to “undrain” slurm nodes in drain state where it is recommended to avoid the down state (1st command above).

Slurm trigger information

Triggers include events such as:

  • a node failing

  • daemon stops or restarts

  • a job reaching its time limit

  • a job terminating.

These events can cause actions such as the execution of an arbitrary script. Typical uses include notifying system administrators of node failures and gracefully terminating a job when its time limit is approaching. A hostlist expression for the nodelist or job ID is passed as an argument to the program.

  • strigger - Used to set, get or clear Slurm trigger information

An example script using this is notify_nodes_down. To set up the trigger as the slurm user:

slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down

To display enabled triggers:

strigger --get
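Since the trigger program receives the hostlist expression as its argument, its core can be very small. The following is only a sketch of what such a program might contain, not the notify_nodes_down script itself; the function name down_notice and the message wording are made up for illustration.

```shell
# Hypothetical core of a trigger program: $1 is the hostlist expression
# that is passed to the program when the trigger fires.
down_notice() {
    printf 'Slurm reports nodes down: %s\n' "$1"
}

# In a real trigger program the notice might be mailed to an administrator,
# for example:
#   down_notice "$1" | mail -s "Slurm nodes down" root
```

By default a trigger is purged after the event has occurred, so a long-lived notification needs either the strigger option --flags=PERM or a program that sets the trigger again.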

Add and remove nodes

Nodes can be added or removed by modifying the slurm.conf file and distributing it to all nodes. If you use the topology.conf configuration, that file must also be updated and distributed to all nodes. If you run a Configless Slurm setup then the configuration files are served automatically to nodes by the slurmctld.

Starting in Slurm 22.05, nodes can be dynamically added and removed from Slurm, see dynamic_nodes.

If nodes must initially be unavailable for starting jobs, define them in slurm.conf with a State and optionally a Reason parameter:

NodeName=xxx ... State=DRAIN Reason="Not yet ready"
NodeName=xxx ... State=FUTURE

For convenience the command:

slurmd -C

can be used on each compute node to print its physical configuration (sockets, cores, real memory size, etc.) for inclusion into slurm.conf.

An entire new partition may also be made unavailable using a State not equal to UP:

PartitionName=xxx ... State=INACTIVE
PartitionName=xxx ... State=DRAIN

However, the slurmctld daemon must then be restarted:

systemctl restart slurmctld

As stated in the scontrol page (under the reconfigure option):

  • The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.

Furthermore, the slurmd service on all compute nodes must also be restarted in order to pick up the changes in slurm.conf, for example:

clush -ba systemctl restart slurmd

See advice from the Slurm_publications talk Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018) on the Safe procedure:

  1. Stop slurmctld

  2. Change configs

  3. Restart all slurmd processes

  4. Start slurmctld
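The safe procedure can be scripted along the following lines. This is a sketch under several assumptions, not a tested production script: the function names run and safe_config_rollout are made up, slurm.conf is assumed to have been edited already on the host running slurmctld, the nodes are given as an explicit host list (with slurmctld stopped, a ClusterShell group source that queries Slurm cannot resolve the node groups), and commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper for the "safe" procedure; $1 is an explicit node list
# such as 'node[001-099]'. Each step runs only if the previous one succeeded.
safe_config_rollout() {
    run systemctl stop slurmctld &&
    run clush -bw "$1" --copy /etc/slurm/slurm.conf --dest /etc/slurm/ &&
    run clush -bw "$1" systemctl restart slurmd &&
    run systemctl start slurmctld
}

# Example usage (dry run, prints the four commands):
#   safe_config_rollout 'node[001-099]'
```

Running it first in the default dry-run mode shows exactly which commands would be executed on which nodes.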

Less-Safe, but usually okay, procedure:

  1. Change configs

  2. Restart slurmctld

  3. Restart all slurmd processes really quickly

See also https://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by Moe Jette).

Rebooting nodes

Slurm can reboot nodes by:

scontrol reboot [ASAP] [NodeList]
  Reboot  all nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file.
  The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
  Accepts an option list of nodes to reboot.
  By default all nodes are rebooted.

NOTE: The reboot request will be ignored for hosts in the following states: FUTURE, POWER_DOWN, POWERED_DOWN, POWERING_DOWN, REBOOT_ISSUED, REBOOT_REQUESTED, see bug_18505. Currently, no warning is issued in such cases. From Slurm 24.08 an error message will be printed by scontrol reboot when a node reboot request is ignored due to the current node state.

Compute node OS and firmware updates

Regarding the question of methods for Slurm compute node OS and firmware updates, we have for a long time used rolling updates while the cluster is in full production, so that we do not waste any resources.

When entire partitions are upgraded in this way, there is no risk of starting new jobs on nodes with differing states of OS and firmware, while running jobs continue on the not-yet-updated nodes.

The basic idea (which was provided by Niels Carl Hansen, ncwh -at- cscaa.dk) is to run a crontab script update.sh whenever a node is rebooted. Use scontrol to reboot the nodes as they become idle, thereby performing the updates that you want. Remove the crontab job as part of the update.sh script.

The update.sh script and instructions for usage are in: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

Shared Memory cleanup

Certain jobs allocate Shared Memory resources but do not release them before job completion. As a result, the number of shared memory segments on a node may hit the system limit (typically 4096). Display the limit with:

$ sysctl kernel.shmmni
kernel.shmmni = 4096

Error messages such as this one may occur:

getshmem_C in getshmem.c: cannot create shared segment 8
No space left on device

See also Bug_7232.

Information on the inter-process communication facilities:

ipcs -a

Users (for their own resources) and root can remove all shared memory segments, message queues and semaphores by:

ipcrm -a
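To judge how close a node is to the kernel.shmmni limit, the segments listed by ipcs can be counted. The function below is a minimal sketch: the name count_shm_segments is made up, and it assumes the util-linux ipcs -m output format, in which every segment line starts with a hexadecimal key such as 0x00000000.

```shell
# Hypothetical helper: count shared memory segments in "ipcs -m" output
# read from standard input (segment lines start with a 0x... key).
count_shm_segments() {
    awk '/^0x/ { n++ } END { print n + 0 }'
}

# Example usage:
#   used=$(ipcs -m | count_shm_segments)
#   limit=$(sysctl -n kernel.shmmni)
#   echo "$used of $limit shared memory segments in use"
```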

Resource Reservation

Compute nodes can be reserved for a number of purposes. Read the reservations guide.

For example, to reserve a set of nodes for a testing purpose with a duration of 720 hours:

scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 Flags=MAGNETIC nodes=x[049-096] user=user1,user2

Ignore currently running jobs when creating the reservation by adding this flag:

flags=ignore_jobs

Magnetic reservations were introduced in Slurm 20.02, see the scontrol man-page:

Flags=MAGNETIC  # This flag allows jobs to be considered for this reservation even if they didn't request it.

Jobs will be eligible to run in such reservations even if they did not specify --reservation.

To reserve nodes for maintenance for 72 hours:

scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root

A specification of nodes=ALL will reserve all nodes.

If you want to reserve an entire partition, it is recommended to specify a partition instead of nodes:

scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs partitionname=xeon16 user=root

To list all reservations:

scontrol show reservations

and also previous reservations some weeks back in time:

scontrol show reservations start=now-5weeks

Unless the reservation has the MAGNETIC flag, batch jobs submitted for the reservation must explicitly refer to it, for example:

sbatch --reservation=Test1 -N4 my.script

One may also specify explicitly some nodes:

sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script

Working with jobs

Tutorial pages about Slurm job management:

Interactive jobs

Using srun users can launch interactive jobs on compute nodes through Slurm. See the FAQ How can I get shell prompts in interactive mode?:

srun --pty bash -i [additional options]

If you need to run MPI tasks, see MPI_Guide_OpenMPI. It is required to invoke srun with pmi2 or pmix support as shown above in the MPI section, for example:

srun --pty --mpi=pmi2 bash -i [additional options]

Job arrays

Slurm job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.

It is important to understand that the tasks of a job array become independent jobs (similar to non-array jobs), and are assigned their own unique JobIDs, only at the moment when each individual task starts running.

To see the relationship between job arrays and JobIDs, this is a useful command for a specified ArrayJobID:

$ squeue -j 3394902 -O ArrayJobID,JobArrayID,JobID,State
ARRAY_JOB_ID        JOBID               JOBID               STATE
3394902             3394902_[34-91]     3394902             PENDING
3394902             3394902_30          3394932             RUNNING
3394902             3394902_28          3394930             RUNNING
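In scripts, the mapping from array task to JobID can be extracted from this output with awk. The function below is a minimal sketch: the name running_array_jobids is made up, and it assumes the four-column output shown above (ArrayJobID, JobArrayID, JobID, State).

```shell
# Hypothetical helper: from "squeue -O ArrayJobID,JobArrayID,JobID,State"
# output on standard input, print "<array task> <JobID>" for running tasks.
running_array_jobids() {
    awk '$4 == "RUNNING" { print $2, $3 }'
}

# Example usage:
#   squeue -j 3394902 -O ArrayJobID,JobArrayID,JobID,State | running_array_jobids
```

For the example output above this would print the pairs 3394902_30 3394932 and 3394902_28 3394930; the header line and the pending tasks are skipped.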

Useful commands

See the overview of Slurm_man_pages as well as the individual command man-pages.

Command                                                        Function
=============================================================  ========
squeue                                                         List jobs
squeue --start                                                 List starting times of jobs
sbatch <options> --wrap="some-command"                         Submit a job running just some-command (without script file)
scontrol show job xxx                                          Get job details
scontrol --details show job xxx                                Get more job details
scontrol suspend xxx                                           Suspend a job (root only)
scontrol resume xxx                                            Resume a job (root only)
scontrol hold xxx                                              Hold a job
scontrol uhold xxx                                             User-Hold a job
scontrol release xxx                                           Release a held job
scontrol update jobid=10208 nice=-10000                        Increase a job’s priority (Slurm managers only)
scontrol update jobid=10208 nice=5000                          Decrease a job’s priority (users and managers)
scontrol top 10208                                             Move the job to the top of the user’s queue
scontrol update jobid=10208 priority=50000                     Set a job’s priority value
scontrol hold jobid=10208; scontrol release jobid=10208        Reset a job’s explicit priority=xxx value
scontrol update jobid=1163 EndTime=2022-04-27T08:30:00         Modify a job’s End time
scontrol update jobid=1163 timelimit=12:00:00                  Modify a job’s time limit
scontrol update jobid=1163 qos=high                            Set the job QOS to high (QOS list: sacctmgr show qos)
scontrol listpids <jobid> (on node running a job)              Print a listing of the process IDs in a job step
scontrol write batch_script job_id optional_filename           Write the batch script for a given job_id to a file or to stdout
scontrol show config                                           Prints the Slurm configuration and running parameters
scontrol write config optional_filename                        Write the current Slurm configuration to a file
scancel xxx                                                    Kill a job
sjobexitmod -l jobid                                           Display job exit codes
sstat                                                          Display various status information of a running job/step
scontrol show assoc_mgr                                        Displays the slurmctld’s internal cache for users, associations and/or qos such as GrpTRESRunMins, GrpTRESMins etc.
scontrol -o show assoc_mgr users=xxx accounts=yyy flags=assoc  Display the association limits and current values for user xxx in account yyy as a one-liner.
sacctmgr show user -s xxx                                      Display information about user xxx from the Slurm database
sacctmgr add user xxx Account=zzzz                             Add user xxx to the non-default account zzzz, see the accounting page.
sacctmgr modify qos normal set priority=50                     Modify the QOS named normal to set a new priority value.
sacctmgr modify user where name=xxx set MaxSubmitJobs=NN       Update user’s maximum number of submitted jobs to NN. NN=0 blocks submissions, NN=-1 removes the limit.
sacctmgr -nP list associations user=xxx format=fairshare       Print the fairshare number of user xxx.
sacctmgr show event                                            Display information about events like downed or draining nodes on clusters.
sshare -lU -u xxx                                              Print the various fairshare values of user xxx.

squeue usage

The squeue command has a huge number of parameters for listing jobs. Here are some suggestions for usage of squeue:

  • The long display gives more details:

    squeue -l  # is equivalent to:
    squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"

  • Add columns for job priority (%Q) and CPU count (%C) and make some columns wider:

    squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
    
  • Set the output format by an environment variable:

    export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
    

    or using the new output format:

    export SQUEUE_FORMAT2="JobID:8,Partition:11,QOS:7,Name:10 ,UserName:9,Account:9,State:8,PriorityLong:9,ReasonList:16 ,TimeUsed:12 ,SubmitTime:19 ,TimeLimit:10 ,tres-alloc: "
    
  • List of pending jobs in the same order considered for scheduling by Slurm (see squeue man-page under --priority):

    squeue --priority  --sort=-p,i --states=PD
    

Slurm debugging

Change the debug level of the slurmctld daemon:

scontrol setdebug LEVEL

where LEVEL may be: “quiet”, “fatal”, “error”, “info”, “verbose”, “debug”, “debug2”, “debug3”, “debug4”, or “debug5”. See the scontrol OPTIONS section. For example:

scontrol setdebug debug2

This value is temporary and will be overwritten whenever the slurmctld daemon reads the slurm.conf configuration file (e.g. when the daemon is restarted or scontrol reconfigure is executed).

Add or remove DebugFlags of the slurmctld daemon:

scontrol setdebugflags [+|-]FLAG

For example:

scontrol setdebugflags +backfill

See slurm.conf PARAMETERS section for the full list of supported DebugFlags. NOTE: Changing the value of some DebugFlags will have no effect without restarting the slurmctld daemon, which would set DebugFlags based upon the contents of the slurm.conf configuration file.