Slurm operations 

Jump to our top-level Slurm page: Slurm batch queueing system

Testing basic functionality 

We assume that you have carried out the above deployment along the lines of Slurm installation and upgrading, Slurm configuration, Slurm database, Slurm accounting and Slurm job scheduler.

From the Head node try to submit an interactive job:

srun -N1 /bin/hostname

If srun hangs, check the firewall settings described in Slurm configuration. Please note that interactive batch jobs from Login nodes seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.

To display the job queue:

scontrol show jobs

To submit a batch job script using sbatch:

sbatch -N1 <script-file>

For troubleshotting there exists a testsuite for Slurm.

System information 

Useful sysadmin commands:

sinfo - view information about Slurm nodes and partitions.
showpartitions - Print a Slurm cluster partition status overview with 1 line per partition.
squeue - view information about jobs located in the Slurm scheduling queue
scancel Used to signal jobs or job steps
smap - graphically view information about Slurm jobs, partitions, and set configurations parameters
sview - graphical user interface to view and modify Slurm state (requires gtk2)
scontrol - view and modify Slurm configuration and state

MPI setup 

MPI use under Slurm depends upon the type of MPI being used, see MPI_and_UPC_Users_Guide. The current versions of Slurm and OpenMPI support task launch using the srun command, see the MPI_Guide_OpenMPI.

For PMIx please see the PMIx_Slurm_support page.

You must add these flags when building OpenMPI:

--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr

The Slurm RPM installs header files in /usr/include/slurm and libraries in /usr/lib64. Using the OpenMPI tools, verify the installation of slurm as well as pmi modules, for example:

# ompi_info | egrep -i 'slurm|pmi'
                MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
               MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
               MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
           MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
            MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)

Since Slurm provides both the PMI and PMI-2 interfaces, this advice in MPI_Guide_OpenMPI is important:

If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line.

Hence you must invoke srun like:

srun --mpi=pmi2

It may alternatively be convenient to add this line to slurm.conf:

MpiDefault=pmi2

See the FAQ: Running jobs under Slurm and the Process Management Interface (PMI) page.

MPI locked memory 

MPI stacks running over Infiniband or Omni-Path network fabric by Cornelis Networks require the ability to allocate more locked memory than the default limit. Unfortunately, user processes on login nodes may have a small memory limit (check it by ulimit -a) which by default are propagated into Slurm jobs and hence cause fabric errors for MPI. See the memlock FAQ.

This is fixed by adding to slurm.conf:

PropagateResourceLimitsExcept=MEMLOCK

You can view the running slurmd process limits by:

cat "/proc/$(pgrep -u 0 slurmd)/limits"

CPU management 

It is important to understand how Slurm manages nodes, CPUs, tasks etc. This is documented in the cpu_management page.

GPU accelerators 

Configure Slurm for GPU accelerators as described in the Slurm configuration page under the GRES section.

The AutoDetect configuration in gres.conf can be used to detect GPU hardware (currently Nvidia and AMD).

You should set the job Default count of CPUs allocated per allocated GPU (DefCpuPerGPU) for each partition containing GPUs in the slurm.conf file, for example:

PartitionName=xxx DefCpuPerGPU=4 ...

For accounting of GPU usage you must add to the AccountingStorageTRES in slurm.conf, for example:

AccountingStorageTRES=gres/gpu,gres/gpu:tesla

and restart slurmctld so that these new fields are added to the database.

Nvidia GPUs 

It is possible to build Slurm packages which include the Nvidia NVML library for easy handling of GPU hardware. NVML automatically detects GPUs, their type, cores, and NVLinks. Quoting the GRES page:

If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any system-detected NVIDIA GPU.
This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect.

However, it is not necessary to include the NVML in your Slurm packages, since you can configure gres.conf manually for the GPU hardware in your nodes. See the mailing list thread Building Slurm RPMs with NVIDIA GPU support?.

Nvidia drivers 

Download Nvidia drivers from https://www.nvidia.com/Download/index.aspx and select the appropriate GPU version and host operating system. You can also download and install Nvidia UNIX drivers, and the CUDA toolkit from https://developer.nvidia.com/cuda-downloads.

To verify the availability of GPU accelerators in a node run the nvidia-smi command:

nvidia-smi -L

which is installed with the xorg-x11-drv-nvidia RPM package.

GPU monitoring tools 

There is a useful page Top 3 Linux GPU Monitoring Command Line Tools recommending the tools gpustat, nvtop, and nvitop. The NVIDIA tool nvidia-smi can of course also be used.

We recommend the gpustat tool which gives a 1-line status of each GPU in the system. The installation on EL8 systems is a bit tricky, so use these commands:

dnf install gcc python3-devel
python3 -m pip install setuptools-scm
python3 -m pip install gpustat

Our Slurm monitoring tools psjob and psnode use gpustat on nodes with GPU GRES to print a GPU usage summary.

Utilities for Slurm 

Here we list some useful third-party utilities that Slurm administrators or users may find useful:

A comprehensive list of tools on the Slurm_download page.
Slurm tools by Ole Holm Nielsen: https://github.com/OleHolmNielsen/Slurm_tools including:
- pestat prints a node status list (1 host per line) with information about jobids, users and CPU loads.
SlurmCommander is a simple, lightweight, no-dependencies text-based user interface (TUI) to your cluster. It ties together multiple slurm commands to provide you with a simple and efficient interaction point with slurm.
STUBL - Slurm Tools and UBiLities.
birc-aeh/slurm-utils: gnodes gives a visual representation of your cluster. jobinfo tries to collect information for a full job.
slurm_showq A showq style job summary utility for Slurm.

Graphical monitoring tools 

There exist a number of Open Source tools for graphical monitoring of Slurm:

Slurm-web provides a web interface on top of Slurm with intuitive graphical views, clear insights and advanced visualizations to track your jobs and monitor status of HPC supercomputers in your organization.
Open XDMoD is an open source tool to facilitate the management of high performance computing resources.
Graphing sdiag with Graphite using Graphite. See also slurm-diamond-collector.
Prometheus Slurm Exporter with a Grafana Slurm_dashboard.
Slurmbrowser A really thin web layer above Slurm. This tool requires Ganglia. Install first the RPMs python-virtualenv python2-bottle.

Working with Compute nodes 

Slurm power saving scripts 

Slurm provides an integrated power saving mechanism for powering down idle nodes, and starting them again when jobs need to be scheduled, see the Slurm_Power_Saving_Guide.

We provide some Slurm_power_saving_scripts which may be useful for power management using IPMI or with cloud services.

Expanding and collapsing host lists 

Slurm lists node/host lists in the compact format, for example node[001-123]. Sometimes you want to expand the host list, for example in scripts, to list all nodes individually.

You can use this command to output hostnames one line at a time:

scontrol show hostnames node[001-123]

or rewrite the list into a single line with paste:

scontrol show hostnames node[001-123] | paste -s -d ,

To contract expanded hostlists:

# scontrol show hostlistsorted h003,h002,h001
h[001-003]
# scontrol show hostlist h003,h002,h001
h[003,002,001]

When the server does not have the slurm RPM installed, or for more sophisticated host list processing, some non-Slurm tools may be used as shown below.

The nodeset command 

The ClusterShell_tool ‘s nodeset command (see below) enables easy manipulation of node sets, as well as node groups, at the command line level. For example:

$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19

The hostlist command 

The python-hostlist tool is very convenient for expanding or compressing node lists.

To install this tool (make sure to download the latest release):

dnf install python3-devel
wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-2.2.1.tar.gz
rpmbuild -ta python-hostlist-2.2.1.tar.gz
dnf install ~/rpmbuild/RPMS/noarch/python-hostlist-2.2.1-1.el8.x86_64

For usage see the python-hostlist, but a useful example is:

# hostlist --expand --sep " "  n[001-012]
n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012

The snodelist command 

The snodelist command is a tool for working with Slurm hostlists. Rather than relying on scontrol show hostnames to expand a Slurm compact host list to a newline-delimited list. Installation instructions are in the snodelist page.

Hostlist expressions for left-hand and right-hand nodes or other node increments 

When managing separately the left-hand and right-hand nodes in a Lenovo compute tray, or any other subset of compute nodes, the ClusterShell_tool comes in handily for selecting subsets of nodes. Let us assume that nodes are named numerically so that left-hand nodes have odd numbers, whereas right-hand nodes have even numbers, for example the left,right,left,right,… nodes:

e001,e002,...,e023,e024

The clush command can now perform commands separately:

clush -bw e[001-023/2] echo I am a left-hand node
clush -bw e[002-024/2] echo I am a right-hand node

Unfortunately, Slurm doesn’t recognize this syntax of node number increments. Here you can use the ClusterShell_tool’s command nodeset to print Slurm compatible nodelists to be used as Slurm command arguments:

$ nodeset -f e[001-024/2]
e[001,003,005,007,009,011,013,015,017,019,021,023]
$ nodeset -f e[002-024/2]
e[002,004,006,008,010,012,014,016,018,020,022,024]

An example where we assign nodelists to variables:

$ export nodelist=e[001-024]
$ export left=`nodeset -f e[001-024/2]`
$ export right=`nodeset -f e[002-024/2]`
$ sinfo -n $left

SSH keys for password-less access to cluster nodes 

Users may have a need for SSH access to Slurm compute nodes, for example, if they have to use an MPI library which is using SSH in stead of Slurm to start MPI tasks.

However, it is a good idea to configure the slurm-pam-adopt module on the nodes to control and restrict SSH access, see Slurm_configuration#pam-module-restrictions.

The SSH (Secure Shell) configuration files including server private/public keys are in the /etc/ssh/ folder.

The file /etc/ssh/ssh_known_hosts containing the SSH public keys of all nodes should be created on the central server and distributed to all Slurm nodes. The ssh-keyscan tool is very convenient for gathering SSH public keys of the cluster nodes, some examples are:

ssh-keyscan -t ssh-ed25519 node001 node002                   # Scan nodes node001+node002 for key type ssh-ed25519
scontrol show hostnames node[001-022] | ssh-keyscan -f - 2>/dev/null | sort # Scan nodes node[001-022], pipe comments to /dev/null, and sort the output
sinfo -Nho %N | uniq | ssh-keyscan -f - 2>/dev/null | sort          # Scan all Slurm nodes (uniq suppresses duplicates)

Remember to set the SELinux context correctly for the files in /etc/ssh:

chcon system_u:object_r:etc_t:s0 /etc/ssh/ssh_known_hosts

When all SSH public keys of the Slurm nodes are available in /etc/ssh/ssh_known_hosts, each individual user can configure a password-less SSH login. First the user must generate personal SSH keys (placed in the $HOME/.ssh/ folder) using the ssh-keygen tool.

Each user may use the convenient tool authorized_keys for generating SSH keys and adding them to the $HOME/.ssh/authorized_keys file.

For external computers the personal SSH_authorized_keys (preferably protected with a passphrase or Multi-Factor Authentication) should be used.

For the servers running the slurmctld and slurmdbd services it is strongly recommended not to permit login by normal users because they have no business on those servers! To restrict which users can login to the management hosts, append this line to the SSH server /etc/ssh/sshd_config file:

AllowUsers root

You can add more trusted system managers to this line if needed. Then restart the SSH service:

systemctl restart sshd

Host-based authentication 

Another way to enable password-less SSH login is to configure login nodes and compute nodes in the cluster to allow Host-based_Authentication. Please beware that:

For security reasons it is strongly recommended not to include the Slurm slurmctld and slurmdbd servers in the Host-based_Authentication because normal users have no business on those servers!
For security reasons the root user is not allowed to use Host-based_Authentication. You can add root’s public key to the /root/.ssh/authorized_keys file on all compute nodes for easy SSH access.
Furthermore, personal computers and other computers outside the cluster MUST NOT be trusted by the cluster nodes! For external computers the personal SSH_authorized_keys (preferably protected with a passphrase or Multi_Factor_Authentication) should be used.
You need to understand that Host-based_Authentication is a bad idea in general, but that it is a good and secure solution within a single Linux cluster’s security perimeter, see for example:
- Implementing ssh hostbased authentication.
- The mailing list thread at https://lists.schedmd.com/pipermail/slurm-users/2020-June/005578.html
It is recommended to configure the slurm-pam-adopt module on the nodes to control and restrict SSH access, see PAM module restrictions.

Here are the steps for configuring Host-based_Authentication:

First populate all SSH keys in the file /etc/ssh/ssh_known_hosts as shown above.
Configure only these lines in the SSH client configuration /etc/ssh/ssh_config on all nodes:
```
HostbasedAuthentication yes
EnableSSHKeysign yes
```
These lines do not work inside Host or Match statements, but must be defined at the global level.

You may also configure PreferredAuthentications (order of authentication methods) so that the hostbased method is preferred for the nodes in the cluster’s domainname (replace by your DNS domain). Furthermore GSSAPI and ForwardX11Trusted may be configured:
```
Host *.<domainname>
  PreferredAuthentications gssapi-keyex,gssapi-with-mic,hostbased,publickey,keyboard-interactive,password
  GSSAPIAuthentication yes
  ForwardX11Trusted yes
```
The ssh_config manual page explains the configuration keywords.

The GSSAPI (Generic Security Service Application Program Interface (GSS-API) Authentication and Key Exchange for the Secure Shell (SSH) Protocol) is defined in rfc4462.
Add these lines to the SSH server /etc/ssh/sshd_config file on all nodes:
```
HostbasedAuthentication yes
UseDNS yes
```
and restart the SSH service:
```
systemctl restart sshd
```
Populate the file /etc/ssh/shosts.equiv for every node in the cluster listed in /etc/ssh/ssh_known_hosts with 1 line per node including the full DNS domainname, for example:
```
node001.<domainname>
node002.<domainname>
...
```
Wildcard hostnames are not possible, so you must list all hosts one per line. To list all cluster nodes:
```
sinfo -Nho %N | uniq | awk '{print $1 ".domainname"}' > /etc/ssh/shosts.equiv
```
where you must substitute your own domainname.

Remember to set the SELinux context correctly for the files in /etc/ssh:

chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts

A normal (non-root) user should now be able to login from a node to itself, for example:

testnode$ ssh -v testnode

and the verbose output should inform you:

debug1: Authentication succeeded (hostbased).

ClusterShell 

ClusterShell provides a light and unified command execution Python framework to help administer GNU/Linux or BSD clusters. There is a ClusterShell_manual and a ClusterShell_configuration guide.

Install the ClusterShell_tool from the EPEL repository:

dnf install epel-release
dnf install clustershell

Copy the example file for Slurm.conf:

cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf

You should define slurm as the default group in /etc/clustershell/groups.conf:

[Main]
# Default group source
default: slurm

It is convenient to add a Slurm binding for all running jobs belonging to a specific user. Append to /etc/clustershell/groups.conf.d/slurm.conf the lines:

#
# Slurm user job bindings
#
[slurmuser,su]
map: squeue -h -u $GROUP -o "%N" -t running
list: squeue -h -o "%i" -t R
reverse: squeue -h -w $NODE -o "%i"
cache_time: 60

This feature was included in the version 1.8.1.

You may encounter some surprising zero-padding behavior in node names, see also issue_293.

ClusterShell usage 

You can list all node groups including hostnames and node counts using this ClusterShell_tool command:

cluset -LLL

Simple usage of clush:

clush -w node[001-003] date

For a Slurm partition:

clush -g <partition-name> date

If option -b or –dshbak is specified, clush waits for command completion while displaying a progress indicator and then displays gathered output results:

clush -b -g <partition-name> date

To execute a command only on nodes with a specified Slurm state (here: drained):

clush -w@slurmstate:drained date
clush -bw@slurmstate:down 'uname -r; dmidecode -s bios-version'

To execute a command only on nodes running a particular Slurm JobID (here: 123456):

clush -w@sj:123456 <command>

To execute a command only on nodes running jobs for a particular username (requires the above mentioned slurmuser configuration):

clush -w@su:username <command>

If you want to run commands on hosts not under Slurm, select a group source defined in /etc/clustershell/groups (see man clush):

clush -s GROUPSOURCE or --groupsource=GROUPSOURCE <other arguments>

For example:

clush -s local -g testcluster <command>

The nodeset command enables easy manipulation of node sets, as well as node groups, at the command line level. For example:

$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19

Copying files with ClusterShell 

When ClusterShell_tool has been set up, it’s very simply to copy files and folders to nodes, see the clush manual page. Example:

clush -bw node[001-099] --copy /etc/slurm/slurm.conf --dest /etc/slurm/

Listing nodes 

Use sinfo to list nodes that are responding (for example, to be used in clush scripts):

sinfo -r -h -o '%n'
sinfo --responding --noheader --format='%n'

List reasons nodes are in the down, drained, fail or failing state:

sinfo -R
sinfo --list-reasons
sinfo -lRN

List of nodes with features and status:

sinfo --format="%25N %.40f %.6a %.10A"

Use scontrol to list node properties:

scontrol -o show nodes <Nodename>

Listing node resources used 

Use sinfo to see what resources are used/remaining on a per node basis:

sinfo -Nle -o '%n %C %t'

The flag -p <partition> may be added. Nodes states listed with * means that the node is not responding.

Note the STATE column:

State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown plus Their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk respectively.

Note that the suffix “*” identifies nodes that are presently not responding.

Resume an offline node 

A node may get stuck in an offline mode for several reasons. For example, you may see this:

# scontrol show node q007

NodeName=q007 Arch=x86_64 CoresPerSocket=2
...
 State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
...
 Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]

Nodes states listed with * means that the node is not responding.

It is very difficult to find documentation on how to clear such an offline state. The solution is to use the scontrol command (section SPECIFICATIONS FOR UPDATE COMMAND, NODES):

scontrol update nodename=a001 state=down reason="undraining"
scontrol update nodename=a001 state=resume

See also How to “undrain” slurm nodes in drain state where it is recommended to avoid the down state (1st command above).

Slurm trigger information 

Triggers include events such as:

a node failing
daemon stops or restarts
a job reaching its time limit
a job terminating.

These events can cause actions such as the execution of an arbitrary script. Typical uses include notifying system administrators of node failures and gracefully terminating a job when it’s time limit is approaching. A hostlist expression for the nodelist or job ID is passed as an argument to the program.

strigger - Used set, get or clear Slurm trigger information

An example script using this is notify_nodes_down. To set up the trigger as the slurm user:

slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down

To display enabled triggers:

strigger --get

Add and remove nodes 

Nodes can be added or removed by modifying the slurm.conf file and distributing it to all nodes. If you use the topology.conf configuration, that file must also be updated and distributed to all nodes. If you run a Configless Slurm setup setup then the configuration files are served automatically to nodes by the slurmctld.

Starting in Slurm 22.05, nodes can be dynamically added and removed from Slurm, see dynamic_nodes.

If nodes must initially be unavailable for starting jobs, define them in slurm.conf with a State and optionally a Reason parameter:

NodeName=xxx ... State=DRAIN Reason="Not yet ready"
NodeName=xxx ... State=FUTURE

For convenience the command:

slurmd -C

can be used on each compute node to print its physical configuration (sockets, cores, real memory size, etc.) for inclusion into slurm.conf.

An entire new partition may also be made unavailable using a State not equal to UP:

PartitionName=xxx ... State=INACTIVE
PartitionName=xxx ... State=DRAIN

However, the slurmctld daemon must then be restarted:

systemctl restart slurmctld

As stated in the scontrol page under the reconfigure option):

The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.

Furthermore, the slurmd service on all compute nodes must also be restarted in order to pick up the changes in slurm.conf, for example:

clush -ba systemctl restart slurmd

See advice from the Slurm_publications talk Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018) on the Safe procedure:

Stop slurmctld
Change configs
Restart all slurmd processes
Start slurmctld

Less-Safe, but usually okay, procedure:

Change configs
Restart slurmctld
Restart all slurmd processes really quickly

See also https://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by Moe Jette).

Rebooting nodes 

Slurm can reboot nodes by:

scontrol reboot [ASAP] [NodeList]
  Reboot  all nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file.
  The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
  Accepts an option list of nodes to reboot.
  By default all nodes are rebooted.

NOTE: The reboot request will be ignored for hosts in the following states: FUTURE, POWER_DOWN, POWERED_DOWN, POWERING_DOWN, REBOOT_ISSUED, REBOOT_REQUESTED, see bug_18505. Currently, no warning is issued in such cases. From Slurm 24.08 an error message will be printed by scontrol reboot when a node reboot request is ignored due to the current node state.

Compute node OS and firmware updates 

Regarding the question of methods for Slurm compute node OS and firmware updates, we have for a long time used rolling updates while the cluster is in full production, so that we do not waste any resources.

When entire partitions are upgraded in this way, there is no risk of starting new jobs on nodes with differing states of OS and firmware, while running jobs continue on the not-yet-updated nodes.

The basic idea (which was provided by Niels Carl Hansen, ncwh -at- cscaa.dk) is to run a crontab script update.sh whenever a node is rebooted. Use scontrol to reboot the nodes as they become idle, thereby performing the updates that you want. Remove the crontab job as part of the update.sh script.

The update.sh script and instructions for usage are in: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

Shared Memory cleanup 

Certain jobs allocate Shared Memory resources but do not release them before job completion. For example, the shared memory segments may hit the system limit (typically 4096), see the system limit by:

$ sysctl kernel.shmmni
kernel.shmmni = 4096

Error messages such as this one may occur:

getshmem_C in getshmem.c: cannot create shared segment 8
No space left on device

Resource Reservation 

Compute nodes can be reserved for a number of purposes. Read the reservations guide.

For example, to reserve a set of nodes for a testing purpose with a duration of 720 hours:

scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 Flags=MAGNETIC nodes=x[049-096] user=user1,user2

Ignore currently running jobs when creating the reservation by adding this flag:

flags=ignore_jobs

Magnetic reservations were introduced in Slurm 20.02, see the scontrol man-page:

Flags=MAGNETIC  # This flag allows jobs to be considered for this reservation even if they didn't request it.

Jobs will be eligible to run in such reservations even if they did not specify --reservation.

To reserve nodes for maintenance for 72 hours:

scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root

A specification of nodes=ALL will reserve all nodes.

If you want to reserve an entire partition, it is recommended to not specify nodes, but a partition in stead:

scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs partitionname=xeon16 user=root

To list all reservations:

scontrol show reservations

and also previous reservations some weeks back in time:

scontrol show reservations start=now-5weeks

Batch jobs submitted for the reservation must explicitly refer to it, for example:

sbatch --reservation=Test1 -N4 my.script

One may also specify explicitly some nodes:

sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script

Working with jobs 

Tutorial pages about Slurm job management:

Convenient Slurm Commands

Interactive jobs 

Using srun users can launch interactive jobs on compute nodes through Slurm. See the FAQ How can I get shell prompts in interactive mode?:

srun --pty bash -i [additional options]

If you need to run MPI tasks, see MPI_Guide_OpenMPI. It is required to invoke srun with pmi2 or pmix support as shown above in the MPI section, for example:

srun --pty --mpi=pmi2 bash -i [additional options]

Job arrays 

Slurm job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.

It is important to understand that job arrays, only at the moment when an individual job starts running, become independent jobs (similar to non-array jobs) and are assigned their own unique JobIDs.

To see the relationship between job arrays and JobIDs, this is a useful command for a specified ArrayJobID:

$squeue  -j 3394902 -O ArrayJobID,JobArrayID,JobID,State
ARRAY_JOB_ID        JOBID               JOBID               STATE
3394902             3394902_[34-91]     3394902             PENDING
3394902             3394902_30          3394932             RUNNING
3394902             3394902_28          3394930             RUNNING

Useful commands 

See the overview of Slurm_man_pages as well as the individual command man-pages.

Command	Function
squeue	List jobs
squeue –start	List starting times of jobs
sbatch <options> –wrap=”some-command”	Submit a job running just `some-command` (without script file)
scontrol show job xxx	Get job details
scontrol –details show job xxx	Get more job details
scontrol suspend xxx	Suspend a job (root only)
scontrol resume xxx	Resume a job (root only)
scontrol hold xxx	Hold a job
scontrol uhold xxx	User-Hold a job
scontrol release xxx	Release a held job
scontrol update jobid=10208 nice=-10000	Increase a job’s priority (Slurm managers only)
scontrol update jobid=10208 nice=5000	Decrease a job’s priority (users and managers)
scontrol top 10208	Move the job to the top of the user’s queue
scontrol update jobid=10208 priority=50000	Set a job’s priority value
scontrol hold jobid=10208; scontrol release jobid=10208	Reset a job’s explicit priority=xxx value
scontrol update jobid=1163 EndTime=2022-04-27T08:30:00	Modify a job’s End time
scontrol update jobid=1163 timelimit=12:00:00	Modify a job’s time limit
scontrol update jobid=1163 qos=high	Set the job QOS to high (QOS list: `sacctmgr show qos`)
scontrol listpids <jobid> (on node running a job)	Print a listing of the process IDs in a job step
scontrol write batch_script job_id optional_filename	Write the batch script for a given job_id to a file or to stdout
scontrol show config	Prints the Slurm configuration and running parameters
scontrol write config optional_filename	Write the current Slurm configuration to a file
scancel job xxx	Kill a job
sjobexitmod -l jobid	Display job exit codes
sstat	Display various status information of a running job/step
scontrol show assoc_mgr	Displays the slurmctld’s internal cache for users, associations and/or qos such as GrpTRESRunMins, GrpTRESMins etc.
scontrol -o show assoc_mgr users=xxx accounts=yyy flags=assoc	Display the association limits and current values for user xxx in account yyy as a one-liner.
sacctmgr show user -s xxx	Display information about user xxx from the Slurm database
sacctmgr add user xxx Account=zzzz	Add user xxx to the non-default account zzzz, see the accounting page.
sacctmgr modify qos normal set priority=50	Modify the the QOS named normal to set a new priority value.
sacctmgr modify user where name=xxx set MaxSubmitJobs=NN	Update user’s maximum number of submitted jobs to NN. NN=0 blocks submissions, NN=-1 removes the limit.
sacctmgr -nP list associations user=xxx format=fairshare	Print the fairshare number of user xxx.
sacctmgr show event	Display information about events like downed or draining nodes on clusters.
sshare -lU -u xxx	Print the various fairshare values of user xxx.

squeue usage 

The squeue command has a huge number of parameters for listing jobs. Here are some suggestions for usage of squeue: sbatch <options> –wrap=”some-command” * The long display gives more details:

squeue -l  # is equivalent to:
squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"

Add columns for job priority (%Q) and CPU count (%C) and make some columns wider:
```
squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
```

Set the output format by an environment variable:

export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"

or using the new output format:

export SQUEUE_FORMAT2="JobID:8,Partition:11,QOS:7,Name:10 ,UserName:9,Account:9,State:8,PriorityLong:9,ReasonList:16 ,TimeUsed:12 ,SubmitTime:19 ,TimeLimit:10 ,tres-alloc: "

List of pending jobs in the same order considered for scheduling by Slurm (see squeue man-page under –priority):
```
squeue --priority  --sort=-p,i --states=PD
```

Slurm debugging 

Change the debug level of the slurmctld daemon.:

scontrol setdebug LEVEL

where LEVEL may be: “quiet”, “fatal”, “error”, “info”, “verbose”, “debug”, “debug2”, “debug3”, “debug4”, or “debug5”. See the scontrol OPTIONS section. For example:

scontrol setdebug debug2

This value is temporary and will be overwritten whenever the slurmctld daemon reads the slurm.conf configuration file (e.g. when the daemon is restarted or scontrol reconfigure is executed).

Add or remove DebugFlags of the slurmctld daemon:

scontrol setdebugflags [+|-]FLAG

For example:

scontrol setdebugflags +backfill

See slurm.conf PARAMETERS section for the full list of supported DebugFlags. NOTE: Changing the value of some DebugFlags will have no effect without restarting the slurmctld daemon, which would set DebugFlags based upon the contents of the slurm.conf configuration file.