Slurm operations
Testing basic functionality
We assume that you have deployed Slurm along the lines of Slurm installation and upgrading, Slurm configuration, Slurm database, Slurm accounting and Slurm job scheduler.
From the Head node try to submit an interactive job:
srun -N1 /bin/hostname
If srun hangs, check the firewall settings described in Slurm configuration. Please note that interactive batch jobs from Login nodes seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.
To display the job queue:
scontrol show jobs
To submit a batch job script using sbatch:
sbatch -N1 <script-file>
System information
Useful sysadmin commands:
sinfo - view information about Slurm nodes and partitions.
showpartitions - Print a Slurm cluster partition status overview with 1 line per partition.
squeue - view information about jobs located in the Slurm scheduling queue.
scancel - used to signal jobs or job steps.
smap - graphically view information about Slurm jobs, partitions, and set configuration parameters.
sview - graphical user interface to view and modify Slurm state (requires gtk2)
scontrol - view and modify Slurm configuration and state
Slurm test suite
There is a large test suite, see the Testing section of the Slurm_Quick_Start Administrator Guide.
The test suite is in the source .../testsuite/expect/ directory, see the file README.
The test suite should be copied to the shared filesystem, for example /home/$USER/testsuite/, and run by a non-root user:
cd testsuite/expect
./regression
MPI setup
MPI use under Slurm depends upon the type of MPI being used, see MPI_and_UPC_Users_Guide. The current versions of Slurm and OpenMPI support task launch using the srun command, see the MPI_Guide_OpenMPI.
For PMIx please see the PMIx_Slurm_support page.
You must add these flags when building OpenMPI:
--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr
The Slurm RPM installs header files in /usr/include/slurm and libraries in /usr/lib64.
Using the OpenMPI tools, verify the installation of slurm as well as pmi modules, for example:
# ompi_info | egrep -i 'slurm|pmi'
MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
Since Slurm provides both the PMI and PMI-2 interfaces, this advice in MPI_Guide_OpenMPI is important:
If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be specified on the srun command line.
Hence you must invoke srun like:
srun --mpi=pmi2
It may alternatively be convenient to add this line to slurm.conf:
MpiDefault=pmi2
See the FAQ: Running jobs under Slurm and the Process Management Interface (PMI) page.
MPI locked memory
MPI stacks running over Infiniband or Omni-Path network fabric by Cornelis Networks require the ability to allocate more locked memory than the default limit.
Unfortunately, user processes on login nodes may have a small locked memory limit (check it with ulimit -a), which by default is propagated into Slurm jobs and hence causes fabric errors for MPI.
See the memlock FAQ.
This is fixed by adding to slurm.conf:
PropagateResourceLimitsExcept=MEMLOCK
You can view the running slurmd process limits by:
cat "/proc/$(pgrep -u 0 slurmd)/limits"
CPU management
It is important to understand how Slurm manages nodes, CPUs, tasks etc. This is documented in the cpu_management page.
GPU accelerators
Configure Slurm for GPU accelerators as described in the Slurm configuration page under the GRES section.
The AutoDetect configuration in gres.conf can be used to detect GPU hardware (currently Nvidia and AMD).
You should set the default count of CPUs allocated per allocated GPU (DefCpuPerGPU) for each partition containing GPUs in the slurm.conf file, for example:
PartitionName=xxx DefCpuPerGPU=4 ...
For accounting of GPU usage you must add to the AccountingStorageTRES in slurm.conf, for example:
AccountingStorageTRES=gres/gpu,gres/gpu:tesla
and restart slurmctld so that these new fields are added to the database.
Nvidia GPUs
It is possible to build Slurm packages which include the Nvidia NVML library for easy handling of GPU hardware. NVML automatically detects GPUs, their type, cores, and NVLinks. Quoting the GRES page:
If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration,
configuration details will automatically be filled in for any system-detected NVIDIA GPU.
This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect.
However, it is not necessary to include the NVML in your Slurm packages, since you can configure gres.conf manually for the GPU hardware in your nodes. See the mailing list thread Building Slurm RPMs with NVIDIA GPU support?.
Nvidia drivers
Download Nvidia drivers from https://www.nvidia.com/Download/index.aspx and select the appropriate GPU version and host operating system. You can also download and install Nvidia UNIX drivers, and the CUDA toolkit from https://developer.nvidia.com/cuda-downloads.
To verify the availability of GPU accelerators in a node run the nvidia-smi command:
nvidia-smi -L
which is installed with the xorg-x11-drv-nvidia RPM package.
GPU monitoring tools
There is a useful page Top 3 Linux GPU Monitoring Command Line Tools recommending the tools gpustat, nvtop, and nvitop. The NVIDIA tool nvidia-smi can of course also be used.
We recommend the gpustat tool which gives a 1-line status of each GPU in the system. The installation on EL8 systems is a bit tricky, so use these commands:
dnf install gcc python3-devel
python3 -m pip install setuptools-scm
python3 -m pip install gpustat
Our Slurm monitoring tools psjob and psnode use gpustat on nodes with GPU GRES to print a GPU usage summary.
RPC rate limiting
It is common to experience users who bombard the slurmctld server by executing commands such as squeue, sinfo, sbatch or the like with many requests per second. This can potentially make the slurmctld unresponsive and therefore affect the entire cluster.
The ability to do RPC rate limiting on a per-user basis is a new feature with Slurm 23.02.
It acts as a virtual bucket of tokens that users consume with Remote Procedure Calls (RPC).
The RPC logging frequency (rl_log_freq) is a new feature with Slurm 23.11.
Enable RPC rate limiting in slurm.conf by adding rl_enable and other parameters, for example:
SlurmctldParameters=rl_enable,rl_refill_rate=10,rl_bucket_size=50,rl_log_freq=10
NOTE: After changing SlurmctldParameters, run scontrol reconfig to restart slurmctld.
See also bug_18067.
This allows users to submit a large number of requests in a short period of time, but not a sustained high rate of requests that would add stress to the slurmctld. You can define:
The maximum number of tokens with rl_bucket_size,
the rate at which new tokens are added with rl_refill_rate,
the frequency with which tokens are refilled with rl_refill_period,
and the number of entities to track with rl_table_size.
New in 23.11: the rl_log_freq option to limit the number of "RPC limit exceeded…" messages that are logged.
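The token bucket idea can be illustrated with a small simulation. This is only a sketch and not the slurmctld implementation: it assumes that every RPC costs one token, that the bucket starts full, and that rl_refill_rate tokens are added once per refill period, capped at rl_bucket_size. The function name simulate_bucket and its argument layout are hypothetical.

```shell
# simulate_bucket BUCKET_SIZE REFILL_RATE RPCS_PER_PERIOD...
# Hypothetical helper simulating the token bucket of a single user.
# Each argument after the first two is the number of RPCs the user sends
# during one refill period.
# Prints one line per period: "period accepted rejected tokens_left".
simulate_bucket() {
    bucket_size=$1
    refill_rate=$2
    shift 2
    tokens=$bucket_size            # assumption: the bucket starts full
    period=0
    for rpcs in "$@"; do
        period=$((period + 1))
        # Accept as many RPCs as there are tokens, reject the rest
        if [ "$rpcs" -le "$tokens" ]; then
            accepted=$rpcs
        else
            accepted=$tokens
        fi
        rejected=$((rpcs - accepted))
        tokens=$((tokens - accepted))
        echo "$period $accepted $rejected $tokens"
        # Refill at the end of the period, never exceeding the bucket size
        tokens=$((tokens + refill_rate))
        if [ "$tokens" -gt "$bucket_size" ]; then
            tokens=$bucket_size
        fi
    done
}

# Bucket size 50 and refill rate 10 as in the slurm.conf example,
# with a user sending 40, 40, 40, 0 and 5 RPCs in consecutive periods.
simulate_bucket 50 10 40 40 40 0 5
```

In this simulation the first burst of 40 RPCs is accepted in full, whereas the sustained load in periods 2 and 3 is throttled down to the number of tokens available.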
When this is enabled you may find lines in slurmctld.log such as:
[2023-10-06T10:22:32.893] RPC rate limit exceeded by uid 2851 with REQUEST_SUBMIT_BATCH_JOB, telling to back off
We have written a small script sratelimit for summarizing such log entries.
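Independently of the sratelimit script, such log entries can be counted with a few lines of awk. The following is a minimal sketch and not the sratelimit script itself: the function name count_rate_limited is hypothetical, and the second and third sample lines are invented for illustration, modeled on the log line shown above.

```shell
# count_rate_limited: hypothetical helper.
# Reads slurmctld.log lines on stdin and prints "uid rpc_type count",
# one line per uid/RPC combination, sorted.
count_rate_limited() {
    awk '/RPC rate limit exceeded by uid/ {
        uid = ""; rpc = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "uid")  uid = $(i + 1)   # word following "uid"
            if ($i == "with") rpc = $(i + 1)   # word following "with"
        }
        sub(/,$/, "", rpc)                     # strip the comma after the RPC name
        count[uid " " rpc]++
    }
    END {
        for (key in count) print key, count[key]
    }' | sort
}

# Sample input (invented lines in the format shown above)
count_rate_limited <<'EOF'
[2023-10-06T10:22:32.893] RPC rate limit exceeded by uid 2851 with REQUEST_SUBMIT_BATCH_JOB, telling to back off
[2023-10-06T10:22:33.101] RPC rate limit exceeded by uid 2851 with REQUEST_SUBMIT_BATCH_JOB, telling to back off
[2023-10-06T10:22:34.007] RPC rate limit exceeded by uid 1234 with REQUEST_JOB_INFO, telling to back off
EOF
```

On a real system the input would be the slurmctld log file, whose location depends on the SlurmctldLogFile setting in slurm.conf.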
Utilities for Slurm
Here we list some third-party utilities that Slurm administrators or users may find useful:
A comprehensive list of tools on the Slurm_download page.
Slurm tools by Ole Holm Nielsen: https://github.com/OleHolmNielsen/Slurm_tools including:
pestat prints a node status list (1 host per line) with information about jobids, users and CPU loads.
SlurmCommander is a simple, lightweight, no-dependencies text-based user interface (TUI) to your cluster. It ties together multiple slurm commands to provide you with a simple and efficient interaction point with slurm.
birc-aeh/slurm-utils: gnodes gives a visual representation of your cluster. jobinfo tries to collect information for a full job.
slurm_showq A showq style job summary utility for SLURM.
Graphical monitoring tools
There exist a number of Open Source tools for graphical monitoring of Slurm:
Slurm-web provides a web interface on top of Slurm with intuitive graphical views, clear insights and advanced visualizations to track your jobs and monitor status of HPC supercomputers in your organization.
Open XDMoD is an open source tool to facilitate the management of high performance computing resources.
Graphing sdiag with Graphite. See also slurm-diamond-collector.
Slurmbrowser A really thin web layer above Slurm. This tool requires Ganglia. First install the RPMs python-virtualenv python2-bottle.
Working with Compute nodes
Slurm power saving scripts
Slurm provides an integrated power saving mechanism for powering down idle nodes, and starting them again when jobs need to be scheduled, see the Slurm_Power_Saving_Guide.
We provide some Slurm_power_saving_scripts which may be useful for power management using IPMI or with cloud services.
Expanding and collapsing host lists
Slurm lists node/host lists in the compact format, for example node[001-123].
Sometimes you want to expand the host list, for example in scripts, to list all nodes individually.
You can use this command to output hostnames one line at a time:
scontrol show hostnames node[001-123]
or rewrite the list into a single line with paste:
scontrol show hostnames node[001-123] | paste -s -d ,
To contract expanded hostlists:
# scontrol show hostlistsorted h003,h002,h001
h[001-003]
# scontrol show hostlist h003,h002,h001
h[003,002,001]
When the server does not have the slurm RPM installed, or for more sophisticated host list processing, some non-Slurm tools may be used as shown below.
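For very simple cases a plain shell function may also be sufficient. The following is a minimal sketch under stated assumptions: the function name expand_hostlist is hypothetical, and it only handles a single bracket group with comma-separated numbers and ranges (for example node[001-003,007]), not the full Slurm hostlist syntax.

```shell
# expand_hostlist: hypothetical fallback for hosts without scontrol.
# Expands ONE bracket group, e.g. node[001-003,007], printing one
# hostname per line. Names without brackets are printed unchanged.
expand_hostlist() {
    printf '%s\n' "$1" | awk '{
        lb = index($0, "[")
        rb = index($0, "]")
        if (lb == 0 || rb < lb) { print; next }   # no bracket group
        prefix = substr($0, 1, lb - 1)
        suffix = substr($0, rb + 1)
        n = split(substr($0, lb + 1, rb - lb - 1), parts, ",")
        for (i = 1; i <= n; i++) {
            if (split(parts[i], range, "-") == 2) {
                # Keep the zero padding of the first number, e.g. 001
                fmt = "%s%0" length(range[1]) "d%s\n"
                for (j = range[1] + 0; j <= range[2] + 0; j++)
                    printf fmt, prefix, j, suffix
            } else {
                print prefix parts[i] suffix
            }
        }
    }'
}

expand_hostlist 'node[001-003,007]'
```

The dedicated tools described in the following sections handle nested and multiple bracket groups, which this sketch does not.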
The nodeset command
The ClusterShell_tool's nodeset command (see below) enables easy manipulation of node sets, as well as node groups, at the command line level. For example:
$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19
The hostlist command
The python-hostlist tool is very convenient for expanding or compressing node lists.
To install this tool (make sure to download the latest release):
dnf install python3-devel
wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-2.2.1.tar.gz
rpmbuild -ta python-hostlist-2.2.1.tar.gz
dnf install ~/rpmbuild/RPMS/noarch/python-hostlist-2.2.1-1.el8.noarch.rpm
For usage see the python-hostlist page. A useful example is:
# hostlist --expand --sep " " n[001-012]
n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012
The snodelist command
The snodelist command is a tool for working with Slurm hostlists.
It can be used instead of relying on scontrol show hostnames to expand a Slurm compact host list to a newline-delimited list.
Installation instructions are in the snodelist page.
SSH keys for password-less access to cluster nodes
Users may need SSH access to Slurm compute nodes, for example, if they have to use an MPI library which uses SSH instead of Slurm to start MPI tasks.
However, it is a good idea to configure the slurm-pam-adopt module on the nodes to control and restrict SSH access, see Slurm_configuration#pam-module-restrictions.
The SSH (Secure Shell) configuration files including server private/public keys are in the /etc/ssh/ folder.
The file /etc/ssh/ssh_known_hosts containing the SSH public keys of all nodes should be created on the central server and distributed to all Slurm nodes.
The ssh-keyscan tool is very convenient for gathering SSH public keys of the cluster nodes. Some examples are:
ssh-keyscan -t ssh-ed25519 node001 node002 # Scan nodes node001+node002 for key type ssh-ed25519
scontrol show hostnames node[001-022] | ssh-keyscan -f - 2>/dev/null | sort # Scan nodes node[001-022], pipe comments to /dev/null, and sort the output
sinfo -Nho %N | uniq | ssh-keyscan -f - 2>/dev/null | sort # Scan all Slurm nodes (uniq suppresses duplicates)
Remember to set the SELinux context correctly for the files in /etc/ssh:
chcon system_u:object_r:etc_t:s0 /etc/ssh/ssh_known_hosts
When all SSH public keys of the Slurm nodes are available in /etc/ssh/ssh_known_hosts, each individual user can configure a password-less SSH login.
First the user must generate personal SSH keys (placed in the $HOME/.ssh/ folder) using the ssh-keygen tool.
Each user may use the convenient tool authorized_keys for generating SSH keys and adding them to the $HOME/.ssh/authorized_keys file.
For external computers the personal SSH_authorized_keys (preferably protected with a passphrase or Multi-Factor Authentication) should be used.
For the servers running the slurmctld and slurmdbd services it is strongly recommended not to permit login by normal users because they have no business on those servers!
To restrict which users can log in to the management hosts, append this line to the SSH server /etc/ssh/sshd_config file:
AllowUsers root
You can add more trusted system managers to this line if needed. Then restart the SSH service:
systemctl restart sshd
Host-based authentication
Another way to enable password-less SSH login is to configure login nodes and compute nodes in the cluster to allow Host-based_Authentication. Please beware that:
For security reasons it is strongly recommended not to include the Slurm slurmctld and slurmdbd servers in the Host-based_Authentication because normal users have no business on those servers!
For security reasons the root user is not allowed to use Host-based_Authentication. You can add root’s public key to the /root/.ssh/authorized_keys file on all compute nodes for easy SSH access.
Furthermore, personal computers and other computers outside the cluster MUST NOT be trusted by the cluster nodes! For external computers the personal SSH_authorized_keys (preferably protected with a passphrase or Multi_Factor_Authentication) should be used.
You need to understand that Host-based_Authentication is a bad idea in general, but that it is a good and secure solution within a single Linux cluster’s security perimeter, see for example:
The mailing list thread at https://lists.schedmd.com/pipermail/slurm-users/2020-June/005578.html
It is recommended to configure the slurm-pam-adopt module on the nodes to control and restrict SSH access, see PAM module restrictions.
Here are the steps for configuring Host-based_Authentication:
First populate all SSH keys in the file /etc/ssh/ssh_known_hosts as shown above.
Configure only these lines in the SSH client configuration /etc/ssh/ssh_config on all nodes:
HostbasedAuthentication yes
EnableSSHKeysign yes
These lines do not work inside Host or Match statements, but must be defined at the global level.
You may also configure PreferredAuthentications (order of authentication methods) so that the hostbased method is preferred for the nodes in the cluster’s domainname (replace by your DNS domain). Furthermore GSSAPI and ForwardX11Trusted may be configured:
Host *.<domainname>
  PreferredAuthentications gssapi-keyex,gssapi-with-mic,hostbased,publickey,keyboard-interactive,password
  GSSAPIAuthentication yes
  ForwardX11Trusted yes
The ssh_config manual page explains the configuration keywords.
The GSSAPI (Generic Security Service Application Program Interface (GSS-API) Authentication and Key Exchange for the Secure Shell (SSH) Protocol) is defined in rfc4462.
Add these lines to the SSH server /etc/ssh/sshd_config file on all nodes:
HostbasedAuthentication yes
UseDNS yes
and restart the SSH service:
systemctl restart sshd
Populate the file /etc/ssh/shosts.equiv for every node in the cluster listed in /etc/ssh/ssh_known_hosts with 1 line per node including the full DNS domainname, for example:
node001.<domainname>
node002.<domainname>
...
Wildcard hostnames are not possible, so you must list all hosts one per line. To list all cluster nodes:
sinfo -Nho %N | uniq | awk '{print $1 ".domainname"}' > /etc/ssh/shosts.equiv
where you must substitute your own domainname.
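The same list can be generated with the domain name passed as a parameter instead of being edited into the awk program. This is a minimal sketch: the function name make_shosts_equiv is hypothetical, and example.com is only a placeholder domain.

```shell
# make_shosts_equiv DOMAIN: hypothetical helper.
# Reads short hostnames on stdin and prints one fully qualified
# hostname per line, sorted and without duplicates.
make_shosts_equiv() {
    domain=$1
    sort -u | awk -v d="$domain" 'NF { print $1 "." d }'
}

# Example with a fixed host list and the placeholder domain example.com.
# On a real cluster the input would come from: sinfo -Nho %N
printf '%s\n' node002 node001 node001 | make_shosts_equiv example.com
```

Redirect the output to /etc/ssh/shosts.equiv once the list looks correct.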
Remember to set the SELinux context correctly for the files in /etc/ssh:
chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts
A normal (non-root) user should now be able to log in from a node to itself, for example:
testnode$ ssh -v testnode
and the verbose output should inform you:
debug1: Authentication succeeded (hostbased).
ClusterShell
ClusterShell provides a light and unified command execution Python framework to help administer GNU/Linux or BSD clusters. There is a ClusterShell_manual and a ClusterShell_configuration guide.
Install the ClusterShell_tool from the EPEL repository:
dnf install epel-release
dnf install clustershell
Copy the example group source file for Slurm:
cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf
You should define slurm as the default group in /etc/clustershell/groups.conf:
[Main]
# Default group source
default: slurm
It is convenient to add a Slurm binding for all running jobs belonging to a specific user.
Append to /etc/clustershell/groups.conf.d/slurm.conf the lines:
#
# SLURM user job bindings
#
[slurmuser,su]
map: squeue -h -u $GROUP -o "%N" -t running
list: squeue -h -o "%i" -t R
reverse: squeue -h -w $NODE -o "%i"
cache_time: 60
This feature was included in the version 1.8.1.
You may encounter some surprising zero-padding behavior in node names, see also issue_293.
ClusterShell usage
You can list all node groups including hostnames and node counts using this ClusterShell_tool command:
cluset -LLL
Simple usage of clush:
clush -w node[001-003] date
For a Slurm partition:
clush -g <partition-name> date
If the option -b or --dshbak is specified, clush waits for command completion while displaying a progress indicator and then displays gathered output results:
clush -b -g <partition-name> date
To execute a command only on nodes with a specified Slurm state (here: drained):
clush -w@slurmstate:drained date
clush -bw@slurmstate:down 'uname -r; dmidecode -s bios-version'
To execute a command only on nodes running a particular Slurm JobID (here: 123456):
clush -w@sj:123456 <command>
To execute a command only on nodes running jobs for a particular username (requires the above mentioned slurmuser configuration):
clush -w@su:username <command>
If you want to run commands on hosts not under Slurm, select a group source defined in /etc/clustershell/groups (see man clush):
clush -s GROUPSOURCE or --groupsource=GROUPSOURCE <other arguments>
For example:
clush -s local -g testcluster <command>
The nodeset command enables easy manipulation of node sets, as well as node groups, at the command line level. For example:
$ nodeset --expand node[13-15,17-19]
node13 node14 node15 node17 node18 node19
Copying files with ClusterShell
When ClusterShell_tool has been set up, it is very simple to copy files and folders to nodes, see the clush manual page. Example:
clush -bw node[001-099] --copy /etc/slurm/slurm.conf --dest /etc/slurm/
Listing nodes
Use sinfo to list nodes that are responding (for example, to be used in clush scripts):
sinfo -r -h -o '%n'
sinfo --responding --noheader --format='%n'
List reasons nodes are in the down, drained, fail or failing state:
sinfo -R
sinfo --list-reasons
sinfo -lRN
List of nodes with features and status:
sinfo --format="%25N %.40f %.6a %.10A"
Use scontrol to list node properties:
scontrol -o show nodes <Nodename>
Listing node resources used
Use sinfo to see what resources are used/remaining on a per node basis:
sinfo -Nle -o '%n %C %t'
The flag -p <partition> may be added.
Node states listed with * mean that the node is not responding.
Note the STATE column:
State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk respectively.
Note that the suffix “*” identifies nodes that are presently not responding.
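The %C field of sinfo prints the CPU counts in the form allocated/idle/other/total, which is easy to post-process. The following is a minimal sketch: the function name idle_cpus is hypothetical, and the three sample lines are invented for illustration.

```shell
# idle_cpus: hypothetical helper.
# Reads lines of the form "hostname A/I/O/T state", as printed by
#   sinfo -Nh -o '%n %C %t'
# and prints hostname, idle CPU count and state for nodes with idle CPUs.
idle_cpus() {
    awk '{
        split($2, cpus, "/")    # cpus[1..4] = allocated, idle, other, total
        if (cpus[2] + 0 > 0) print $1, cpus[2], $3
    }'
}

# Invented sample input; on a real cluster pipe the sinfo output instead
printf '%s\n' \
    'a001 40/0/0/40 alloc' \
    'a002 16/24/0/40 mix' \
    'a003 0/0/40/40 down*' | idle_cpus
```

In the sample only a002 has idle CPUs, so it is the only node printed.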
Resume an offline node
A node may get stuck in an offline mode for several reasons. For example, you may see this:
# scontrol show node q007
NodeName=q007 Arch=x86_64 CoresPerSocket=2
...
State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
...
Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]
Node states listed with * mean that the node is not responding.
It is very difficult to find documentation on how to clear such an offline state. The solution is to use the scontrol command (section SPECIFICATIONS FOR UPDATE COMMAND, NODES):
scontrol update nodename=a001 state=down reason="undraining"
scontrol update nodename=a001 state=resume
See also How to “undrain” slurm nodes in drain state where it is recommended to avoid the down state (1st command above).
Slurm trigger information
Triggers include events such as:
a node failing
a daemon stopping or restarting
a job reaching its time limit
a job terminating.
These events can cause actions such as the execution of an arbitrary script. Typical uses include notifying system administrators of node failures and gracefully terminating a job when its time limit is approaching. A hostlist expression for the nodelist or job ID is passed as an argument to the program.
strigger - Used to set, get or clear Slurm trigger information
An example script using this is notify_nodes_down. To set up the trigger as the slurm user:
slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down
To display enabled triggers:
strigger --get
Add and remove nodes
Nodes can be added or removed by modifying the slurm.conf file and distributing it to all nodes. If you use the topology.conf configuration, that file must also be updated and distributed to all nodes. If you run a Configless Slurm setup, the configuration files are served automatically to the nodes by slurmctld.
Starting in Slurm 22.05, nodes can be dynamically added and removed from Slurm, see dynamic_nodes.
If nodes must initially be unavailable for starting jobs, define them in slurm.conf with a State and optionally a Reason parameter:
NodeName=xxx ... State=DRAIN Reason="Not yet ready"
NodeName=xxx ... State=FUTURE
For convenience the command:
slurmd -C
can be used on each compute node to print its physical configuration (sockets, cores, real memory size, etc.) for inclusion into slurm.conf.
An entire new partition may also be made unavailable using a State not equal to UP:
PartitionName=xxx ... State=INACTIVE
PartitionName=xxx ... State=DRAIN
However, the slurmctld daemon must then be restarted:
systemctl restart slurmctld
As stated in the scontrol page (under the reconfigure option):
The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.
Furthermore, the slurmd service on all compute nodes must also be restarted in order to pick up the changes in slurm.conf, for example:
clush -ba systemctl restart slurmd
See advice from the Slurm_publications talk Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018) on the Safe procedure:
Stop slurmctld
Change configs
Restart all slurmd processes
Start slurmctld
Less-Safe, but usually okay, procedure:
Change configs
Restart slurmctld
Restart all slurmd processes really quickly
See also https://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by Moe Jette).
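The safe procedure can be sketched as a small script. This is only an illustration under stated assumptions: the function names run and safe_reconfig are hypothetical, the edited slurm.conf is assumed to be already in place in /etc/slurm/ on the slurmctld host, ClusterShell is assumed to be configured as described elsewhere on this page, and by default the commands are only printed (DRY_RUN=1).

```shell
# Print the commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# run: hypothetical helper that prints or executes a command
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# safe_reconfig: hypothetical wrapper following the safe procedure
safe_reconfig() {
    # 1. Stop slurmctld
    run systemctl stop slurmctld
    # 2. Change configs: distribute the already edited slurm.conf to all nodes
    run clush -ba --copy /etc/slurm/slurm.conf --dest /etc/slurm/
    # 3. Restart all slurmd processes
    run clush -ba systemctl restart slurmd
    # 4. Start slurmctld
    run systemctl start slurmctld
}

safe_reconfig
```

Review the printed commands before running with DRY_RUN=0, since the script must be executed as root on the slurmctld host.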
Rebooting nodes
Slurm can reboot nodes by:
scontrol reboot [ASAP] [NodeList]
Reboot all nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file.
The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
Accepts an optional list of nodes to reboot.
By default all nodes are rebooted.
NOTE: The reboot request will be ignored for hosts in the following states: FUTURE, POWER_DOWN, POWERED_DOWN, POWERING_DOWN, REBOOT_ISSUED, REBOOT_REQUESTED, see bug_18505.
Currently, no warning is issued in such cases.
From Slurm 24.08 an error message will be printed by scontrol reboot when a node reboot request is ignored due to the current node state.
Compute node OS and firmware updates
Regarding the question of methods for Slurm compute node OS and firmware updates, we have for a long time used rolling updates while the cluster is in full production, so that we do not waste any resources.
When entire partitions are upgraded in this way, there is no risk of starting new jobs on nodes with differing states of OS and firmware, while running jobs continue on the not-yet-updated nodes.
The basic idea (which was provided by Niels Carl Hansen, ncwh -at- cscaa.dk) is to run a crontab script update.sh whenever a node is rebooted.
Use scontrol to reboot the nodes as they become idle, thereby performing the updates that you want.
Remove the crontab job as part of the update.sh script.
The update.sh script and instructions for usage are in:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes
Resource Reservation
Compute nodes can be reserved for a number of purposes. Read the reservations guide.
For example, to reserve a set of nodes for a testing purpose with a duration of 720 hours:
scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 Flags=MAGNETIC nodes=x[049-096] user=user1,user2
Ignore currently running jobs when creating the reservation by adding this flag:
flags=ignore_jobs
Magnetic reservations were introduced in Slurm 20.02, see the scontrol man-page:
Flags=MAGNETIC # This flag allows jobs to be considered for this reservation even if they didn't request it.
Jobs will be eligible to run in such reservations even if they did not specify --reservation.
To reserve nodes for maintenance for 72 hours:
scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root
A specification of nodes=ALL will reserve all nodes.
If you want to reserve an entire partition, it is recommended not to specify nodes, but a partition instead:
scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs partitionname=xeon16 user=root
To list all reservations:
scontrol show reservations
and also previous reservations some weeks back in time:
scontrol show reservations start=now-5weeks
Batch jobs submitted for the reservation must explicitly refer to it, for example:
sbatch --reservation=Test1 -N4 my.script
One may also specify explicitly some nodes:
sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script
Working with jobs
Tutorial pages about Slurm job management:
Interactive jobs
Using srun users can launch interactive jobs on compute nodes through Slurm. See the FAQ How can I get shell prompts in interactive mode?:
srun --pty bash -i [additional options]
If you need to run MPI tasks, see MPI_Guide_OpenMPI. It is required to invoke srun with pmi2 or pmix support as shown above in the MPI section, for example:
srun --pty --mpi=pmi2 bash -i [additional options]
Job arrays
Slurm job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.
It is important to understand that the individual tasks of a job array become independent jobs (similar to non-array jobs), with their own unique JobIDs, only at the moment when they start running.
To see the relationship between job arrays and JobIDs, this is a useful command for a specified ArrayJobID:
$ squeue -j 3394902 -O ArrayJobID,JobArrayID,JobID,State
ARRAY_JOB_ID JOBID JOBID STATE
3394902 3394902_[34-91] 3394902 PENDING
3394902 3394902_30 3394932 RUNNING
3394902 3394902_28 3394930 RUNNING
Useful commands
See the overview of Slurm_man_pages as well as the individual command man-pages.
| Command | Function |
|  | List jobs |
| squeue --start | List starting times of jobs |
| sbatch <options> --wrap="some-command" | Submit a job running just some-command |
| scontrol show job xxx | Get job details |
| scontrol --details show job xxx | Get more job details |
| scontrol suspend xxx | Suspend a job (root only) |
| scontrol resume xxx | Resume a job (root only) |
| scontrol hold xxx | Hold a job |
| scontrol uhold xxx | User-Hold a job |
| scontrol release xxx | Release a held job |
| scontrol update jobid=10208 nice=-10000 | Increase a job’s priority (Slurm managers only) |
| scontrol update jobid=10208 nice=5000 | Decrease a job’s priority (users and managers) |
| scontrol top 10208 | Move the job to the top of the user’s queue |
| scontrol update jobid=10208 priority=50000 | Set a job’s priority value |
|  | Reset a job’s explicit priority=xxx value |
| scontrol update jobid=1163 EndTime=2022-04-27T08:30:00 | Modify a job’s End time |
| scontrol update jobid=1163 timelimit=12:00:00 | Modify a job’s time limit |
| scontrol update jobid=1163 qos=high | Set the job QOS to high (QOS list: |
| scontrol listpids <jobid> (on node running a job) | Print a listing of the process IDs in a job step |
| scontrol write batch_script job_id optional_filename | Write the batch script for a given job_id to a file or to stdout |
| scontrol show config | Prints the Slurm configuration and running parameters |
| scontrol write config optional_filename | Write the current Slurm configuration to a file |
| scancel xxx | Kill a job |
| sjobexitmod -l jobid | Display job exit codes |
|  | Display various status information of a running job/step |
| scontrol show assoc_mgr | Displays the slurmctld’s internal cache for users, associations and/or qos such as GrpTRESRunMins, GrpTRESMins etc. |
| scontrol -o show assoc_mgr users=xxx accounts=yyy flags=assoc | Display the association limits and current values for user xxx in account yyy as a one-liner. |
| sacctmgr show user -s xxx | Display information about user xxx from the Slurm database |
| sacctmgr add user xxx Account=zzzz | Add user xxx to the non-default account zzzz, see the accounting page. |
| sacctmgr modify qos normal set priority=50 | Modify the QOS named normal to set a new priority value. |
| sacctmgr modify user where name=xxx set MaxSubmitJobs=NN | Update user’s maximum number of submitted jobs to NN. NN=0 blocks submissions, NN=-1 removes the limit. |
| sacctmgr -nP list associations user=xxx format=fairshare | Print the fairshare number of user xxx. |
| sacctmgr show event | Display information about events like downed or draining nodes on clusters. |
| sshare -lU -u xxx | Print the various fairshare values of user xxx. |
squeue usage
The squeue command has a huge number of parameters for listing jobs. Here are some suggestions for usage of squeue:
The long display gives more details:
squeue -l # is equivalent to:
squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"
Add columns for job priority (%Q) and CPU count (%C) and make some columns wider:
squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
Set the output format by an environment variable:
export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
or using the new output format:
export SQUEUE_FORMAT2="JobID:8,Partition:11,QOS:7,Name:10 ,UserName:9,Account:9,State:8,PriorityLong:9,ReasonList:16 ,TimeUsed:12 ,SubmitTime:19 ,TimeLimit:10 ,tres-alloc: "
List of pending jobs in the same order considered for scheduling by Slurm (see squeue man-page under --priority):
squeue --priority --sort=-p,i --states=PD
Slurm debugging
Change the debug level of the slurmctld daemon:
scontrol setdebug LEVEL
where LEVEL may be: “quiet”, “fatal”, “error”, “info”, “verbose”, “debug”, “debug2”, “debug3”, “debug4”, or “debug5”. See the scontrol OPTIONS section. For example:
scontrol setdebug debug2
This value is temporary and will be overwritten whenever the slurmctld daemon reads the slurm.conf configuration file (e.g. when the daemon is restarted or scontrol reconfigure is executed).
Add or remove DebugFlags of the slurmctld daemon:
scontrol setdebugflags [+|-]FLAG
For example:
scontrol setdebugflags +backfill
See slurm.conf PARAMETERS section for the full list of supported DebugFlags. NOTE: Changing the value of some DebugFlags will have no effect without restarting the slurmctld daemon, which would set DebugFlags based upon the contents of the slurm.conf configuration file.