Slurm batch queueing system

These pages constitute a HOWTO guide for setting up a Slurm workload manager installation on CentOS/RHEL 7 Linux, but much of the information should be relevant to other Linux versions as well.

The information has been subdivided into sub-pages for separate topics:

Slurm documentation

Documentation about Slurm:

Other documentation:

Testing basic functionality

We assume that you have carried out the above deployment along the lines of Slurm_installation, Slurm_configuration, Slurm_database, Slurm_accounting and Slurm_scheduler.

From the Head/Master node try to submit an interactive job:

srun -N1 /bin/hostname

If srun hangs, check the firewall settings described above. Please note that interactive batch jobs from Login nodes seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.

To display the job queue:

scontrol show jobs

To submit a batch job script using sbatch:

sbatch -N1 <script-file>
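
A minimal batch script could look like the following sketch (the job name, resource requests and walltime are merely examples):

#!/bin/bash
#SBATCH --job-name=test       # example job name
#SBATCH --nodes=1             # request a single node
#SBATCH --ntasks=1            # run one task
#SBATCH --time=00:10:00       # 10 minutes walltime
# Print the name of the allocated compute node
/bin/hostname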

System information

Useful sysadmin commands:

  • sinfo - view information about Slurm nodes and partitions.
  • squeue - view information about jobs located in the Slurm scheduling queue.
  • scancel - signal or cancel jobs or job steps.
  • smap - graphically view information about Slurm jobs, partitions, and configuration parameters.
  • sview - graphical user interface to view and modify Slurm state (requires gtk2).
  • scontrol - view and modify Slurm configuration and state.

Slurm test suite

There is a large test suite; see the Testing section of the Slurm_Quick_Start Administrator Guide. The test suite is in the source directory .../testsuite/expect/; see the file README there.

The test suite should be copied to a shared filesystem, for example /home/$USER/testsuite/, and run as a non-root user:

cd testsuite/expect
./regression

MPI setup

MPI usage under Slurm depends upon the type of MPI being used; see MPI_and_UPC_Users_Guide. Current versions of Slurm and OpenMPI support task launch using the srun command; see MPI_Guide_OpenMPI.

For PMIx please see the PMIx_Slurm_support page.

You must add these flags when building OpenMPI:

--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr
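
For example, a complete configure invocation could look like this sketch (the installation prefix is an assumption; adjust it to your site):

./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr
make -j 8
make install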

The Slurm RPM installs header files in /usr/include/slurm and libraries in /usr/lib64. Using the OpenMPI tools, verify that the slurm and pmi modules are present, for example:

# ompi_info | egrep -i 'slurm|pmi'
                MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
               MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
               MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
           MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
               MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
            MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)

Since Slurm provides both the PMI and PMI-2 interfaces, this advice in MPI_Guide_OpenMPI is important:

If pmi2 support is enabled, then the command line option '--mpi=pmi2' has to be specified on the srun command line.

Hence you must invoke srun like:

srun --mpi=pmi2

It may alternatively be convenient to add this line to slurm.conf:

MpiDefault=pmi2

See the FAQ: Running jobs under Slurm and the Process Management Interface (PMI) page.
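
Putting the pieces together, a batch script for an MPI job might look like this sketch (node and task counts as well as the program name are assumptions):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
# Launch the MPI tasks with srun using the PMI-2 interface
srun --mpi=pmi2 ./my_mpi_program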

MPI locked memory

MPI stacks running over InfiniBand or Omni-Path require the ability to lock more memory than the default limit allows. Unfortunately, user processes on login nodes may have a small locked-memory limit (check it with ulimit -a), which by default is propagated into Slurm jobs and hence causes fabric errors for MPI. See the memlock FAQ.

This is fixed by adding to slurm.conf:

PropagateResourceLimitsExcept=MEMLOCK
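
To check that the limit is no longer restricted inside jobs, one can compare the locked-memory limit on the login node with the limit seen inside a job, for example:

ulimit -l                        # locked-memory limit on the login node
srun -N1 bash -c 'ulimit -l'     # limit inside a Slurm job (should show the compute node's own limit)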

CPU management

It is important to understand how Slurm manages nodes, CPUs, tasks etc. This is documented in the cpu_management page.

GPU accelerators

Configure Slurm for GPU accelerators as described in the Slurm_configuration page under the GRES section.
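
As a rough sketch only (node names, GPU counts and device files are assumptions; consult the GRES documentation for your site), the configuration involves lines like:

# In slurm.conf (append Gres= to the node definition):
GresTypes=gpu
NodeName=g001 Gres=gpu:4 ...

# In gres.conf on the GPU node:
Name=gpu File=/dev/nvidia[0-3]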

Nvidia GPUs

Download Nvidia drivers from http://www.nvidia.com/Download/index.aspx and select the appropriate GPU version and host operating system. Installation instructions are provided on the download page:

rpm -i nvidia-diag-driver-local-repo-rhel7-375.66-1.x86_64.rpm
yum clean all
yum install cuda-drivers
reboot

To verify the availability of GPU accelerators in a node run the command:

nvidia-smi -L

The nvidia-smi command is installed with the xorg-x11-drv-nvidia RPM package.

Utilities for Slurm

Here we list some third-party utilities that Slurm administrators or users may find useful:

Graphical monitoring tools

There exist a few Open Source tools for graphical monitoring of Slurm:

Working with Compute nodes

Expanding and collapsing host lists

Slurm displays node/host lists in a compact format, for example node[001-123]. Sometimes you want to expand the host list, for example in scripts, so that all nodes are listed individually.

You can use this command to output hostnames one line at a time:

scontrol show hostnames node[001-123]

or rewrite the list into a single line with paste:

scontrol show hostnames node[001-123] | paste -s -d ,

To contract expanded hostlists:

# scontrol show hostlistsorted h003,h002,h001
h[001-003]
# scontrol show hostlist h003,h002,h001
h[003,002,001]

For more sophisticated host list processing, the python-hostlist tool is very convenient. To install this tool (make sure to download the latest release):

wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-1.17.tar.gz
rpmbuild -ta python-hostlist-1.17.tar.gz
yum install ~/rpmbuild/RPMS/noarch/python-hostlist-1.17-1.noarch.rpm

For usage see the python-hostlist page; a useful example is:

# hostlist --expand --sep " "  n[001-012]
n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012

ClusterShell

The ClusterShell tool is an alternative to pdsh (see below) which is more actively maintained and has some better features. There is a ClusterShell_manual and a ClusterShell_configuration guide.

Install ClusterShell from the EPEL repository:

yum install epel-release
yum install clustershell

Copy the example file for Slurm:

cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf

You should define slurm as the default group in /etc/clustershell/groups.conf:

[Main]
# Default group source
default: slurm

It is convenient to add a Slurm binding for all running jobs belonging to a specific user. Append to /etc/clustershell/groups.conf.d/slurm.conf the lines:

#
# SLURM user job bindings
#
[slurmuser,su]
map: squeue -h -u $GROUP -o "%N" -t running
list: squeue -h -o "%i" -t R
reverse: squeue -h -w $NODE -o "%i"
cache_time: 60

This feature is expected to be included in the future ClusterShell version 1.8.1.

ClusterShell usage

You can list all node groups including hostnames and node counts using this ClusterShell command:

cluset -LLL

Simple usage of clush:

clush -w node[001-003] date

For a Slurm partition:

clush -g <partition-name> date

If option -b or --dshbak (like with PDSH) is specified, clush waits for command completion while displaying a progress indicator and then displays gathered output results:

clush -b -g <partition-name> date

To execute a command only on nodes with a specified Slurm state (here: drained):

clush -w@slurmstate:drained date

To execute a command only on nodes running a particular Slurm JobID (here: 123456):

clush -w@sj:123456 <command>

To execute a command only on nodes running jobs for a particular username (requires the above mentioned slurmuser configuration):

clush -w@su:username <command>

PDSH - Parallel Distributed Shell

An essential task for the sysadmin is to execute commands in parallel on the compute nodes. The widely used pdsh tool may be used for this (see also ClusterShell above).

The pdsh RPM package may be installed from the EPEL repository, but unfortunately the slurm module is not built in. Therefore you must rebuild the pdsh RPM manually:

  • Download the pdsh version 2.31 source RPM from https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/:

    wget https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/pdsh-2.31-1.el7.src.rpm
  • Install prerequisite packages:

    yum install libnodeupdown-devel libgenders-devel whatsup
  • Rebuild the pdsh RPMs:

    rpmbuild --rebuild --with=slurm --without=torque pdsh-2.31-1.el7.src.rpm

    Notice: On CentOS 5 and 6 you must apparently remove the "=" signs due to a bug in rpmbuild.

  • Install the relevant (according to your needs) RPMs:

    cd $HOME/rpmbuild/RPMS/x86_64/
    yum install pdsh-2.31-1* pdsh-mod-slurm* pdsh-rcmd-ssh* pdsh-mod-dshgroup* pdsh-mod-nodeupdown*

The pdsh command now knows about Slurm partitions and jobs:

pdsh -P <partition-name> date
pdsh -j <jobid> date

See man pdsh for further details.

The whatsup command may also be useful, see man whatsup for further details.

Listing nodes

Use sinfo to list nodes that are responding (for example, to be used in pdsh scripts):

sinfo -r -h -o '%n'
sinfo --responding --noheader --format='%n'
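
Combining this with pdsh (described above), one could run a command on all responding nodes, for example:

pdsh -w $(sinfo -r -h -o '%n' | paste -s -d ',') uptime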

List reasons nodes are in the down, drained, fail or failing state:

sinfo -R
sinfo --list-reasons
sinfo -lRN

List of nodes with features and status:

sinfo --format="%25N %.40f %.6a %.10A"

Use scontrol to list node properties:

scontrol -o show nodes <Nodename>

Listing node resources used

Use sinfo to see what resources are used/remaining on a per node basis:

sinfo -Nle -o '%n %C %t'

The flag -p <partition> may be added. Node states suffixed with * mean that the node is not responding.

Note the STATE column:

  • State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg, futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk, respectively.

    Note that the suffix "*" identifies nodes that are presently not responding.

Resume an offline node

A node may get stuck in an offline mode for several reasons. For example, you may see this:

# scontrol show node q007

NodeName=q007 Arch=x86_64 CoresPerSocket=2
...
 State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
...
 Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]

Node states suffixed with * mean that the node is not responding.

It is very difficult to find documentation on how to clear such an offline state. The solution is to use the scontrol command (see the man page section SPECIFICATIONS FOR UPDATE COMMAND, NODES):

scontrol update nodename=a001 state=down reason="undraining"
scontrol update nodename=a001 state=resume

See also How to "undrain" slurm nodes in drain state, where it is recommended to avoid the down state (the first command above).

Slurm trigger information

Triggers include events such as:

  • a node failing
  • daemon stops or restarts
  • a job reaching its time limit
  • a job terminating.

These events can cause actions such as the execution of an arbitrary script. Typical uses include notifying system administrators of node failures and gracefully terminating a job when its time limit is approaching. A hostlist expression for the nodelist or job ID is passed as an argument to the program.

  • strigger - used to set, get or clear Slurm trigger information

An example script using this is notify_nodes_down. To set up the trigger as the slurm user:

slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down
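
As an illustration only (the recipient address is an assumption, and the actual notify_nodes_down script may differ), such a trigger program could be as simple as:

#!/bin/sh
# Slurm passes a hostlist of the affected nodes as the first argument
NODES="$1"
echo "Slurm has detected DOWN nodes: $NODES" | mail -s "Slurm nodes DOWN: $NODES" root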

To display enabled triggers:

strigger --get

Add and remove nodes

Nodes can be added or removed by modifying the slurm.conf file and distributing it to all nodes. If you use the topology.conf configuration, that file must also be updated and distributed to all nodes.

However, the slurmctld daemon must then be restarted:

systemctl restart slurmctld

As stated on the scontrol man page (under the reconfigure option):

  • The slurmctld daemon must be restarted if nodes are added to or removed from the cluster.

Furthermore, the slurmd service on all compute nodes must also be reloaded in order to pick up the changes in slurm.conf, for example:

clush -ba systemctl reload slurmd

See also http://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by Moe Jette).

Rebooting nodes

Slurm can reboot nodes by:

scontrol reboot [ASAP] [NodeList]
  Reboot all nodes in the system when they become idle, using the RebootProgram as configured in Slurm's slurm.conf file.
  The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).
  Accepts an optional list of nodes to reboot. By default all nodes are rebooted.
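
For example, to reboot a specific set of nodes as soon as they become idle (the node names are just an example):

scontrol reboot ASAP x[001-004]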

Resource Reservation

Compute nodes can be reserved for a number of purposes. Read the reservations guide.

For example, to reserve a set of nodes for testing purposes with a duration of 720 hours:

scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 nodes=x[049-096] user=user1,user2

To reserve nodes for maintenance for 72 hours:

scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root

To list all reservations:

scontrol show reservations

Batch jobs submitted for the reservation must explicitly refer to it, for example:

sbatch --reservation=Test1 -N4 my.script

One may also specify explicitly some nodes:

sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script

Working with jobs

Tutorial pages about Slurm job management:

Slurm job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.
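
As a brief sketch of a job array submission (the array range and the input file naming scheme are assumptions):

#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10
#SBATCH --ntasks=1
# Each array task gets its own index in the SLURM_ARRAY_TASK_ID variable
./my_program input.$SLURM_ARRAY_TASK_ID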

Useful commands

See the overview of Slurm_man_pages as well as the individual command man-pages.

  • squeue - List jobs
  • squeue --start - List starting times of jobs
  • sbatch <options> --wrap="some-command" - Submit a job running just some-command (without a script file)
  • scontrol show job xxx - Get job details
  • scontrol --details show job xxx - Get more job details
  • scontrol update job xxx TimeLimit=yyy - Update the TimeLimit parameter of job xxx
  • scontrol suspend xxx - Suspend a job (root only)
  • scontrol resume xxx - Resume a job (root only)
  • scontrol hold xxx - Hold a job
  • scontrol uhold xxx - User-hold a job
  • scontrol release xxx - Release a held job
  • scontrol update jobid=10208 nice=-10000 - Increase a job's priority
  • scontrol top 10208 - Move the job to the top of the user's queue
  • scontrol update jobid=10208 priority=50000 - Set a job's priority value
  • scontrol hold jobid=10208; scontrol release jobid=10208 - Reset a job's explicit priority=xxx value
  • scontrol update jobid=1163 timelimit=12:00:00 - Modify a job's time limit
  • scontrol update jobid=1163 qos=high - Set the job QOS to high (list QOS values with sacctmgr show qos)
  • scancel xxx - Kill a job
  • sjobexitmod -l jobid - Display job exit codes
  • sstat - Display various status information of a running job/step

squeue usage

The squeue command has a huge number of parameters for listing jobs. Here are some suggestions for the usage of squeue:

  • The long display gives more details:

    squeue -l  # is equivalent to:
    squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"
  • Add columns for job priority (%Q) and CPU count (%C) and make some columns wider:

    squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
  • Set the output format by an environment variable:

    export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"
  • List of pending jobs in the same order considered for scheduling by Slurm (see squeue man-page under --priority):

    squeue --priority  --sort=-p,i --states=PD
