Slurm batch queueing system
These pages constitute a HOWTO guide for setting up a Slurm workload manager installation on CentOS/RHEL 7 Linux, but much of the information should be relevant on other Linux versions as well.
The information has been subdivided into sub-pages for separate topics:
- Slurm_installation and upgrading.
- Slurm_configuration setting up the services.
- Slurm_database for storing user and job data.
- Slurm_accounting defining user accounts.
- Slurm_scheduler for prioritizing jobs.
- The present page describes a number of additional Slurm topics:
- Slurm documentation
- Testing basic functionality
- MPI setup
- CPU management
- GPU accelerators
- Utilities for Slurm
- Working with Compute nodes
- Resource Reservation
- Working with jobs
Documentation about Slurm:
- Slurm_Quick_Start admin guide.
- Command_Summary (2-page sheet).
- Configuration file slurm.conf
- Slurm NEWS and RELEASE_NOTES on changes in recent versions of Slurm.
- Slurm_download with links to external tools under Download Related Software.
- Subscribe to Slurm_mailing_lists.
- The slurm_devel_archive.
- Slurm_publications and presentations.
- Slurm_man_pages overview of man-pages, configuration files, and daemons.
- Slurm_bugs tracking system.
- Large Cluster Administration Guide (clusters containing 1024 nodes or more).
- Slurm Troubleshooting Guide.
- Slurm Elastic Computing (Cloud Bursting) (Google Cloud, Amazon EC2 etc.)
From the Head/Master node try to submit an interactive job:
srun -N1 /bin/hostname
If srun hangs, check the firewall settings described in Slurm_configuration. Please note that interactive batch jobs from Login nodes seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.
To display the job queue:
scontrol show jobs
To submit a batch job script using sbatch:
sbatch -N1 <script-file>
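As an illustration of the sbatch step, here is a minimal sketch of a test job script. The file name hello.sh, the job name and the 5-minute time limit are assumptions chosen for this example; adjust them to your site.

```shell
# Write a minimal test job script to the given path (default: ./hello.sh).
# File name, job name and time limit are example values only.
write_test_job() {
    path=${1:-hello.sh}
    cat > "$path" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --output=hello-%j.out
# Print the name of the node the job ran on.
srun /bin/hostname
EOF
}

write_test_job hello.sh

# Submit only where the Slurm client commands are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello.sh
fi
```

The job output should then appear in a file named hello-<jobid>.out in the submission directory.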
Useful sysadmin commands:
- sinfo - view information about Slurm nodes and partitions.
- squeue - view information about jobs located in the Slurm scheduling queue
- scancel - used to signal jobs or job steps
- smap - graphically view information about Slurm jobs, partitions, and set configuration parameters
- sview - graphical user interface to view and modify Slurm state (requires gtk2)
- scontrol - view and modify Slurm configuration and state
There is a large test suite, see the Testing section of the Slurm_Quick_Start Administrator Guide. The test suite is in the source .../testsuite/expect/ directory, see the file README.
The testsuite should be copied to the shared filesystem, for example, /home/$USER/testsuite/ and run by a non-root user:
cd testsuite/expect
./regression
MPI use under Slurm depends upon the type of MPI being used, see MPI_and_UPC_Users_Guide. The current versions of Slurm and OpenMPI support task launch using the srun command, see the MPI_Guide_OpenMPI.
You must add these flags when building OpenMPI:
--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr
The Slurm RPM installs header files in /usr/include/slurm and libraries in /usr/lib64. Using the OpenMPI tools, verify the installation of slurm as well as pmi modules, for example:
# ompi_info | egrep -i 'slurm|pmi'
MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
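The check can also be scripted. The following is a minimal sketch that reads ompi_info output and reports whether both slurm and pmi components are listed; the function name check_ompi_slurm_pmi is hypothetical, and the pattern assumes component lines of the form "MCA <framework>: <component>" as in the ompi_info listing.

```shell
# Read "ompi_info" output on stdin; succeed only if at least one slurm
# component and at least one pmi component are listed.
check_ompi_slurm_pmi() {
    out=$(cat)
    printf '%s\n' "$out" | grep -Eq 'MCA +[a-z0-9_]+: +slurm' || return 1
    printf '%s\n' "$out" | grep -Eq 'MCA +[a-z0-9_]+: +pmi' || return 1
}

# Run the check only where Open MPI is installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | check_ompi_slurm_pmi; then
        echo "Open MPI lists both Slurm and PMI components"
    else
        echo "Open MPI lacks Slurm or PMI components" >&2
    fi
fi
```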
If pmi2 support is enabled, the option '--mpi=pmi2' has to be specified on the srun command line.
Hence you must invoke srun like:
srun --mpi=pmi2 <program>
It may alternatively be convenient to add this line to slurm.conf:
MpiDefault=pmi2
MPI stacks running over Infiniband or OmniPath require the ability to allocate more locked memory than the default limit. Unfortunately, user processes on login nodes may have a small locked-memory limit (check it with ulimit -a), which by default is propagated into Slurm jobs and hence causes fabric errors for MPI. See the memlock FAQ.
This is fixed by adding to slurm.conf:
PropagateResourceLimitsExcept=MEMLOCK
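To confirm which locked-memory limit a process actually sees, a small check along the following lines may help. This is a sketch only: the function names are hypothetical, and it assumes that an "unlimited" value is what your MPI stack needs.

```shell
# Succeed if the given "ulimit -l" value is "unlimited".
memlock_is_unlimited() {
    [ "$1" = "unlimited" ]
}

# Report on a limit passed as argument, or on the current shell's limit.
report_memlock() {
    limit=${1:-$(ulimit -l 2>/dev/null || echo unknown)}
    if memlock_is_unlimited "$limit"; then
        echo "memlock OK: unlimited"
    else
        echo "memlock limited to $limit: MPI over Infiniband/OmniPath may fail"
    fi
}

report_memlock
```

Running the same check inside a job, for example via srun -N1 bash -c 'ulimit -l', shows the limit that was propagated to the compute node.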
To install the NVIDIA drivers on a GPU node (example for RHEL 7 with driver version 375.66):
rpm -i nvidia-diag-driver-local-repo-rhel7-375.66-1.x86_64.rpm
yum clean all
yum install cuda-drivers
reboot
To verify the availability of GPU accelerators in a node run the command:
nvidia-smi
which is installed with the xorg-x11-drv-nvidia RPM package.
Here we list some third-party utilities that Slurm administrators or users may find useful:
A comprehensive list of tools is available on the Slurm_download page.
Slurm tools by Ole Holm Nielsen: https://github.com/OleHolmNielsen/Slurm_tools including:
- pestat prints a node status list (1 host per line) with information about jobids, users and CPU loads.
birc-aeh/slurm-utils: gnodes gives a visual representation of your cluster. jobinfo tries to collect information for a full job.
slurm_showq A showq style job summary utility for SLURM.
Build a new RPM by:
rpmbuild --rebuild --with slurm schedtop-5.02-1.sdl6.src.rpm
yum install ~/rpmbuild/RPMS/x86_64/slurmtop-5.02-1.el7.centos.x86_64.rpm
There exist a few Open Source tools for graphical monitoring of Slurm:
Slurm lists node/host lists in the compact format, for example node[001-123]. Sometimes you want to expand the host list, for example in scripts, to list all nodes individually.
You can use this command to output hostnames one line at a time:
scontrol show hostnames node[001-123]
or rewrite the list into a single line with paste:
scontrol show hostnames node[001-123] | paste -s -d ,
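In scripts the expanded list is typically consumed in a loop. The sketch below wraps scontrol show hostnames and, on hosts without Slurm, falls back to a deliberately simplified expander that only understands a single prefix[start-end] range. Both function names are hypothetical, and the fallback is not a substitute for Slurm's full hostlist syntax.

```shell
# Expand a single "prefix[start-end]" expression, keeping zero padding.
# Simplified stand-in: no comma lists, no multiple bracket groups.
expand_simple_range() {
    spec=$1
    case $spec in
        *\[*-*\]) ;;                               # prefix[start-end]
        *) printf '%s\n' "$spec"; return 0 ;;      # anything else: as-is
    esac
    prefix=${spec%%\[*}
    range=${spec#*\[}
    range=${range%\]}
    start=${range%%-*}
    end=${range##*-}
    width=${#start}                                # digits incl. leading zeros
    seq -f "${prefix}%0${width}.0f" "$start" "$end"
}

# Prefer Slurm's own expansion when scontrol is available.
expand_hosts() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$1"
    else
        expand_simple_range "$1"
    fi
}

for host in $(expand_hosts 'node[001-003]'); do
    echo "would contact $host"
done
```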
To contract expanded hostlists:
# scontrol show hostlistsorted h003,h002,h001
h[001-003]
# scontrol show hostlist h003,h002,h001
h[003,002,001]
For more sophisticated host list processing the python-hostlist tool is very convenient. To install this tool (make sure to download the latest release):
wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-1.17.tar.gz
rpmbuild -ta python-hostlist-1.17.tar.gz
yum install ~/rpmbuild/RPMS/noarch/python-hostlist-1.17-1.noarch.rpm
For usage see the python-hostlist documentation, but a useful example is:
# hostlist --expand --sep " " n[001-012]
n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012
yum install epel-release
yum install clustershell
Copy the example file for Slurm:
cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf
You should define slurm as the default group in /etc/clustershell/groups.conf:
[Main]
# Default group source
default: slurm
It is convenient to add a Slurm binding for all running jobs belonging to a specific user. Append to /etc/clustershell/groups.conf.d/slurm.conf the lines:
#
# SLURM user job bindings
#
[slurmuser,su]
map: squeue -h -u $GROUP -o "%N" -t running
list: squeue -h -o "%i" -t R
reverse: squeue -h -w $NODE -o "%i"
cache_time: 60
This feature will be included in the future ClusterShell version 1.8.1.
You can list all node groups including hostnames and node counts using this ClusterShell command:
cluset -LLL
Simple usage of clush:
clush -w node[001-003] date
For a Slurm partition:
clush -g <partition-name> date
clush -b -g <partition-name> date
To execute a command only on nodes with a specified Slurm state (here: drained):
clush -w@slurmstate:drained date
To execute a command only on nodes running a particular Slurm JobID (here: 123456):
clush -w@sj:123456 <command>
To execute a command only on nodes running jobs for a particular username (requires the above mentioned slurmuser configuration):
clush -w@su:username <command>
Install prerequisite packages:
yum install libnodeupdown-devel libgenders-devel whatsup
Rebuild the pdsh RPMs:
rpmbuild --rebuild --with=slurm --without=torque pdsh-2.31-1.el7.src.rpm
Notice: On CentOS 5 and 6 you must apparently remove the "=" signs due to a bug in rpmbuild.
Install the relevant (according to your needs) RPMs:
cd $HOME/rpmbuild/RPMS/x86_64/
yum install pdsh-2.31-1* pdsh-mod-slurm* pdsh-rcmd-ssh* pdsh-mod-dshgroup* pdsh-mod-nodeupdown*
pdsh -P <partition-name> date
pdsh -j <jobid> date
See man pdsh for further details.
The whatsup command may also be useful, see man whatsup for further details.
sinfo -r -h -o '%n'
sinfo --responding --noheader --format='%n'
List reasons nodes are in the down, drained, fail or failing state:
sinfo -R
sinfo --list-reasons
sinfo -lRN
List of nodes with features and status:
sinfo --format="%25N %.40f %.6a %.10A"
Use scontrol to list node properties:
scontrol -o show nodes <Nodename>
Use sinfo to see what resources are used/remaining on a per node basis:
sinfo -Nle -o '%n %C %t'
The flag -p <partition> may be added. Node states marked with * mean that the node is not responding.
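The %C field reports CPUs per node as allocated/idle/other/total, so it can be summed in a script. Below is a minimal sketch that totals the idle CPUs; the function name sum_idle_cpus is hypothetical. Because sinfo -N lists a node once for every partition it belongs to, the sketch counts each node name only once.

```shell
# Sum idle CPUs from lines of the form "<node> <alloc>/<idle>/<other>/<total>".
# Each node name is counted once, even if it is listed for several partitions.
sum_idle_cpus() {
    awk '!seen[$1]++ { split($2, c, "/"); idle += c[2] }
         END { print idle + 0 }'
}

# Query Slurm only where sinfo is installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o '%n %C' | sum_idle_cpus
fi
```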
Note the STATE column:
State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk respectively.
Note that the suffix "*" identifies nodes that are presently not responding.
A node may get stuck in an offline mode for several reasons. For example, you may see this:
# scontrol show node q007
NodeName=q007 Arch=x86_64 CoresPerSocket=2
...
State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
...
Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]
It is very difficult to find documentation on how to clear such an offline state. The solution is to use the scontrol command (section SPECIFICATIONS FOR UPDATE COMMAND, NODES):
scontrol update nodename=a001 state=down reason="undraining"
scontrol update nodename=a001 state=resume
See also How to "undrain" slurm nodes in drain state where it is recommended to avoid the down state (1st command above).
Triggers include events such as:
- a node failing
- daemon stops or restarts
- a job reaching its time limit
- a job terminating.
These events can cause actions such as the execution of an arbitrary script. Typical uses include notifying system administrators of node failures and gracefully terminating a job when its time limit is approaching. A hostlist expression for the nodelist or job ID is passed as an argument to the program.
- strigger - used to set, get or clear Slurm trigger information
An example script using this is notify_nodes_down. To set up the trigger as the slurm user:
slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down
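For orientation, a trigger program of this kind might look like the following sketch. It is not the actual notify_nodes_down script: the recipient address, the mail command and the DRY_RUN switch are assumptions made for this example. A trigger is cleared once it has fired, so the sketch registers a new one before sending the notification.

```shell
#!/bin/sh
# Hypothetical sketch of a trigger program such as notify_nodes_down.
ADMIN_MAIL=${ADMIN_MAIL:-root}                          # assumed recipient
PROGRAM=${PROGRAM:-/usr/local/bin/notify_nodes_down}    # path used with strigger
DRY_RUN=${DRY_RUN:-1}                                   # 1 = only print actions

# Build the mail subject from the hostlist expression passed by Slurm.
build_subject() {
    printf 'Slurm nodes DOWN: %s\n' "$1"
}

handle_nodes_down() {
    nodes=$1
    if [ "$DRY_RUN" = "1" ]; then
        echo "would re-arm: strigger --set --node --down --program=$PROGRAM"
        echo "would mail $ADMIN_MAIL: $(build_subject "$nodes")"
        return 0
    fi
    # The trigger is cleared after firing, so register it again first.
    strigger --set --node --down --program="$PROGRAM"
    echo "Nodes reported down: $nodes" |
        mail -s "$(build_subject "$nodes")" "$ADMIN_MAIL"
}

# Slurm passes the hostlist expression as the first argument.
if [ "$#" -gt 0 ]; then
    handle_nodes_down "$1"
fi
```

Set DRY_RUN=0 in the environment of the program to make it actually re-arm the trigger and send mail.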
To display enabled triggers:
strigger --get
Nodes can be added or removed by modifying the slurm.conf file and distributing it to all nodes. If you use the topology.conf configuration, that file must also be updated and distributed to all nodes.
If nodes must initially be unavailable for starting jobs, define them in slurm.conf with a State and optionally a Reason parameter:
NodeName=xxx ... State=DRAIN Reason="Not yet ready"
NodeName=xxx ... State=FUTURE
For convenience the command:
slurmd -C
can be used on each compute node to print its physical configuration (sockets, cores, real memory size, etc.) for inclusion into slurm.conf.
An entire new partition may also be made unavailable using a State not equal to UP:
PartitionName=xxx ... State=INACTIVE
PartitionName=xxx ... State=DRAIN
However, the slurmctld daemon must then be restarted:
systemctl restart slurmctld
As stated in the scontrol man-page (under the reconfigure option):
- The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.
clush -ba systemctl reload slurmd
See advice from the Slurm_publications talk Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018) on the Safe procedure:
- Stop slurmctld
- Change configs
- Restart all slurmd processes
- Start slurmctld
Less-Safe, but usually okay, procedure:
- Change configs
- Restart slurmctld
- Restart all slurmd processes really quickly
See also http://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by Moe Jette).
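The Safe procedure can be sketched as a small script. This is an illustration only: the configuration path /etc/slurm/slurm.conf and the use of clush --copy to distribute the file are assumptions about the site setup, and the script defaults to a dry run that merely prints the commands.

```shell
DRY_RUN=${DRY_RUN:-1}    # 1 = print commands instead of running them

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Safe procedure: stop slurmctld, change configs, restart all slurmd, start slurmctld.
safe_reconfigure() {
    conf=${1:-/etc/slurm/slurm.conf}          # assumed config location
    run systemctl stop slurmctld
    # "Change configs": push the already edited file to all nodes (assumed method).
    run clush -a --copy "$conf" --dest "$(dirname "$conf")/"
    run clush -ba systemctl restart slurmd
    run systemctl start slurmctld
}

safe_reconfigure
```

Review the printed commands first, then rerun with DRY_RUN=0 on the slurmctld host to execute them.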
Slurm can reboot nodes by:
scontrol reboot [ASAP] [NodeList]
Reboot all nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file. The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP). Accepts an optional list of nodes to reboot. By default all nodes are rebooted.
Compute nodes can be reserved for a number of purposes. Read the reservations guide.
For example, to reserve a set of nodes for a testing purpose with a duration of 720 hours:
scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 nodes=x[049-096] user=user1,user2
To reserve nodes for maintenance for 72 hours:
scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root
To list all reservations:
scontrol show reservations
Batch jobs submitted for the reservation must explicitly refer to it, for example:
sbatch --reservation=Test1 -N4 my.script
One may also specify explicitly some nodes:
sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script
Tutorial pages about Slurm job management:
Slurm job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.
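As a brief illustration of a job array, the following sketch shows a batch script in which each array task picks its own input file. The array range 1-10, the job name and the input_NNN.dat naming scheme are hypothetical; SLURM_ARRAY_TASK_ID is set by Slurm for each task.

```shell
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-10
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --output=array-%A_%a.out

# Map an array task ID to a (hypothetical) input file name.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Default to task 1 so the script can also be tried outside Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "Task $task_id would process $(input_for_task "$task_id")"
```

Submitting this script once with sbatch creates all ten array tasks, which are scheduled as separate jobs.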
See the overview of Slurm_man_pages as well as the individual command man-pages.
|squeue --start||List starting times of jobs|
|sbatch <options> --wrap="some-command"||Submit a job running just some-command (without script file)|
|scontrol show job xxx||Get job details|
|scontrol --details show job xxx||Get more job details|
|scontrol update job xxx TimeLimit=yyy||Update job xxx TimeLimit parameter|
|scontrol suspend xxx||Suspend a job (root only)|
|scontrol resume xxx||Resume a job (root only)|
|scontrol hold xxx||Hold a job|
|scontrol uhold xxx||User-Hold a job|
|scontrol release xxx||Release a held job|
|scontrol update jobid=10208 nice=-10000||Increase a job's priority|
|scontrol top 10208||Move the job to the top of the user's queue|
|scontrol update jobid=10208 priority=50000||Set a job's priority value|
|scontrol hold jobid=10208; scontrol release jobid=10208||Reset a job's explicit priority=xxx value|
|scontrol update jobid=1163 timelimit=12:00:00||Modify a job's time limit|
|scontrol update jobid=1163 qos=high||Set the job QOS to high (QOS list: sacctmgr show qos)|
|scancel xxx||Kill job xxx|
|sjobexitmod -l jobid||Display job exit codes|
|sstat||Display various status information of a running job/step|
|scontrol show assoc_mgr||Displays the slurmctld's internal cache for users, associations and/or qos such as GrpTRESRunMins, GrpTRESMins etc.|
|scontrol -o show assoc_mgr users=xxx accounts=yyy flags=assoc||Display the association limits and current values for user xxx in account yyy as a one-liner.|
|sacctmgr add user xxx Account=zzzz||Add user xxx to the non-default account zzzz, see the accounting page.|
|sacctmgr modify qos normal set priority=50||Modify the QOS named normal to set a new priority value.|
squeue -l
# is equivalent to:
squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"
Add columns for job priority (%Q) and CPU count (%C) and make some columns wider:
squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
Set the output format by an environment variable:
export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
List of pending jobs in the same order considered for scheduling by Slurm (see squeue man-page under --priority):
squeue --priority --sort=-p,i --states=PD
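Custom squeue formats also make the output easy to post-process. As a sketch, the following counts pending jobs per user from a one-column listing of user names; the function name count_jobs_by_user is hypothetical.

```shell
# Read one user name per line on stdin and print "<count> <user>",
# sorted by descending count, then by user name.
count_jobs_by_user() {
    awk '{ n[$1]++ } END { for (u in n) print n[u], u }' |
        sort -k1,1nr -k2,2
}

# Query Slurm only where squeue is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue --noheader --states=PD --format='%u' | count_jobs_by_user
fi
```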