Slurm configuration

Jump to our Slurm top-level page.

Slurm configuration and slurm.conf

With Slurm 17.11 you probably want the example configuration files in this RPM:

rpm -q slurm-example-configs

On the Head/Master node you should build a slurm.conf configuration file. Once it has been fully tested, slurm.conf must be copied to all other nodes.

It is mandatory that slurm.conf is identical on all nodes in the system!

Consult the Slurm_Quick_Start Administrator Guide. See also man slurm.conf or the on-line slurm.conf documentation.

Copy the HTML files to your $HOME directory, for example:

mkdir $HOME/slurm/
cp -rp /usr/share/doc/slurm-*/html $HOME/slurm/

Configurator for slurm.conf

You can generate an initial slurm.conf file using several tools:

  • The Slurm Version 17.02 Configuration Tool configurator.
  • The Slurm Version 17.02 Configuration Tool - Easy Version configurator.easy.
  • Build a configuration file using your favorite web browser and open file://$HOME/slurm/html/configurator.html or the simpler file configurator.easy.html.
  • Copy the more extensive sample configuration file .../etc/slurm.conf.example from the source tar-ball and use it as a starting point.

Save the resulting output to /etc/slurm/slurm.conf.

The parameters are documented in man slurm.conf as well as the on-line slurm.conf page, and it is recommended to read through the long list of parameters.

In slurm.conf it's essential that the important spool directories and the slurm user are defined correctly:

SlurmUser=slurm
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

NOTE: These spool directories must be created manually (see below), as they are not part of the RPM installation.
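
For orientation, here is a heavily trimmed slurm.conf sketch; the cluster, host and node names are hypothetical, and a real configuration needs many more parameters than shown here:

ClusterName=mycluster                    # Hypothetical cluster name
ControlMachine=master1                   # Hypothetical Head/Master node
SlurmUser=slurm
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
NodeName=a[001-004] CPUs=8 RealMemory=8000 State=UNKNOWN
PartitionName=batch Nodes=a[001-004] Default=YES MaxTime=168:00:00 State=UP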

Starting slurm daemons at boot time

Enable startup of services as appropriate for the given node:

systemctl enable slurmd      # Compute node
systemctl enable slurmctld   # Master/head server
systemctl enable slurmdbd    # Database server

The systemd service files are /usr/lib/systemd/system/slurm*.service.

Slurm 16.05 init script bug

The Slurm 16.05 RPM packages install and configure (this is bug 3371) the init boot script /etc/init.d/slurm - even for systems like RHEL/CentOS 7 which use systemd! The bug has been fixed in Slurm 17.02.

If you have Slurm 16.05 (or older) on RHEL/CentOS 7, check if you have enabled the init script:

chkconfig --list slurm

We should modify this setup to use systemd exclusively. First disable the init script on all nodes, including login-nodes:

chkconfig --del slurm

In order to avoid accidentally starting services with /etc/init.d/slurm, it is best to also remove the offending script:

rm -f /etc/init.d/slurm

Then enable the services properly as shown above.

Beware that any update of the Slurm 16.05 RPMs will recreate the missing /etc/init.d/slurm file, so you must remember to remove it after every update.

Manual startup of services

If there is any question about:

  • The availability and sanity of the daemons' spool directories (perhaps on remote storage)
  • The MySQL database
  • If Slurm has been upgraded to a new version

it may be a good idea to start each service manually instead of automatically as shown above. For example:

slurmctld -Dvvvv

Watch the output for any signs of problems. If the daemon looks sane, type Control-C and start the service in the normal way:

systemctl start slurmctld

Reconfiguration of slurm.conf

When changing the configuration files slurm.conf and cgroup.conf, they must first be distributed to all compute and login nodes. On the master node make the daemons reread the configuration files:

scontrol reconfigure

From the scontrol man-page about the reconfigure option:

  • Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. This mechanism would be used to modify configuration parameters (Epilog, Prolog, SlurmctldLogFile, SlurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the request to all other daemons (the slurmd daemon on each compute node). Running jobs continue execution.
  • Most configuration parameters can be changed by just running this command; however, Slurm daemons should be shut down and restarted if any of these parameters are to be changed:
    • AuthType, BackupAddr, BackupController, ControlAddr, ControlMach, PluginDir, StateSaveLocation, SlurmctldPort or SlurmdPort.
  • The slurmctld daemon and all slurmd daemons must be restarted if nodes are added to or removed from the cluster.

Adding nodes

According to the scontrol man-page, when adding nodes to or removing nodes from slurm.conf, it is necessary to restart slurmctld. However, it is also necessary to restart the slurmd daemon on all nodes, see bug_3973.

It is also possible to add nodes to slurm.conf with a state of FUTURE:

FUTURE
  Indicates the node is defined for future use and need not exist when the Slurm daemons are started.
  These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon.
  After these nodes are made available, change their State in the slurm.conf file.
  Until these nodes are made available, they will not be seen using any Slurm commands, nor will any attempt be made to contact them.

However, such future nodes must not be members of any partition.
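
As a sketch, a future node could be defined in slurm.conf like this (the node name and hardware values are hypothetical):

NodeName=x001 CPUs=24 RealMemory=128000 State=FUTURE

When the hardware is in place, the node can be made available with a command such as scontrol update NodeName=x001 State=RESUME, and the State in slurm.conf should then be updated accordingly.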

Cgroup configuration

Control Groups (Cgroups v1) provide a Linux kernel mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.

To list current Cgroups use the command:

lscgroup
lscgroup -g cpu:/

Usage of Cgroups within Slurm is described in the Cgroups_Guide. Slurm provides Cgroups versions of a number of plugins:

  • proctrack (process tracking)
  • task (task management)
  • jobacct_gather (job accounting statistics)

See also the cgroup.conf configuration file for the Cgroups support.

If you use the jobacct_gather/linux plugin, change the default ProctrackType in slurm.conf:

ProctrackType=proctrack/linuxproc

otherwise you'll get this warning in the slurmctld log:

WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux

Notice: Linux kernel 2.6.38 or newer is strongly recommended, see the Cgroups_Guide General Usage Notes. CentOS6/RHEL6 uses kernel 2.6.32.

Getting started with Cgroups

In this example we want to constrain jobs to the number of CPU cores as well as RAM memory requested by the job.

Configure slurm.conf to use Cgroups as well as the affinity plugin:

TaskPlugin=affinity,cgroup

For a discussion see bug 3853.

You should probably also configure this (unless you have lots of short running jobs):

ProctrackType=proctrack/cgroup

see the section ProctrackType of slurm.conf.

Create cgroup.conf file:

cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf

Edit the file to change these lines:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

The cgroup.conf page defines:

  • ConstrainCores=<yes|no>

    If configured to "yes" then constrain allowed cores to the subset of allocated resources. It uses the cpuset subsystem.

  • ConstrainRAMSpace=<yes|no>

    If configured to "yes" then constrain the job's RAM usage. The default value is "no", in which case the job's RAM limit will be set to its swap space limit. Also see AllowedSwapSpace, AllowedRAMSpace and ConstrainSwapSpace.

  • ConstrainSwapSpace=<yes|no>

    If configured to "yes" then constrain the job's swap space usage. The default value is "no". Note that when set to "yes" and ConstrainRAMSpace is set to "no", AllowedRAMSpace is automatically set to 100% in order to limit the RAM+Swap amount to 100% of the job's requirement plus the percent of allowed swap space. This amount is thus set as both the RAM and RAM+Swap limits. This means that in that particular case, ConstrainRAMSpace is automatically enabled with the same limit as the one used to constrain swap space. Also see AllowedSwapSpace.

You may also consider defining MemSpecLimit in slurm.conf:

  • MemSpecLimit Amount of memory, in megabytes, reserved for system use and not available for user allocations. If the task/cgroup plugin is configured and that plugin constrains memory allocations (i.e. TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes in cgroup.conf), then Slurm compute node daemons (slurmd plus slurmstepd) will be allocated the specified memory limit. The daemons will not be killed if they exhaust the memory allocation (ie. the Out-Of-Memory Killer is disabled for the daemon's memory cgroup). If the task/cgroup plugin is not configured, the specified memory will only be unavailable for user allocations.

See an interesting discussion in bug 2713.
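
For example, to reserve a hypothetical 2 GB of RAM on a node for the operating system and the Slurm daemons, the node definition in slurm.conf might include:

NodeName=test001 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 MemSpecLimit=2048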

If compute nodes mount Lustre or NFS file systems, it may be a good idea to configure cgroup.conf with:

ConstrainKmemSpace=no

See the cgroup.conf man-page, bug_3874 and [slurm-dev] Interaction between cgroups and NFS. This requires Slurm 17.02.5 or later, see NEWS. After distributing the cgroup.conf file to all nodes, make a scontrol reconfigure.

Activating Cgroups

Now propagate the updated files slurm.conf and cgroup.conf to all compute nodes and restart their slurmd service.
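
A minimal sketch of how this could be scripted with pdsh/pdcp (the node list a[001-004] is only an example, adjust it to your cluster):

# Copy the configuration files to the compute nodes
pdcp -w a[001-004] /etc/slurm/slurm.conf /etc/slurm/cgroup.conf /etc/slurm/
# Restart slurmd on the compute nodes
pdsh -w a[001-004] systemctl restart slurmd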

Cgroup bugs

There may be some problems with Cgroups.

Jobs may crash with an error like:

slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_207887' : No space left on device

Bug_3890 explains this; it may be a kernel bug (CentOS 7 uses kernel 3.10).

Workaround: Reboot the node.

Node Health Check

To monitor the health status of the Head/Master node and the compute nodes, install the LBNL Node Health Check (NHC) package. The NHC releases are available at https://github.com/mej/nhc/releases/.

It's simple to configure NHC Slurm integration, see the NHC page. Add the following to slurm.conf on your Head/Master node and your compute nodes:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY

This will execute NHC every 60 minutes on nodes in ANY state, see the slurm.conf documentation about the Health* parameters.

We add the following lines in the NHC configuration file /etc/nhc/nhc.conf for nodes in the domain nifl.fysik.dtu.dk:

* || NHC_RM=slurm
# Flag df to list only local filesystems (omit NFS mounts)
* || DF_FLAGS="-Tkl"
* || DFI_FLAGS="-Til"
# Setting short hostname for compute nodes (default in our Slurm setup)
*.nifl.fysik.dtu.dk || HOSTNAME=$HOSTNAME_S
# Busy batch nodes may take a long time to run nhc
*.nifl.fysik.dtu.dk  || TIMEOUT=120
# Check OmniPath/Infiniband link
x*.nifl.fysik.dtu.dk  || check_hw_ib 100

If you want an E-mail alert from NHC you must add a crontab entry to execute the nhc-wrapper script, see the NHC page section Periodic Execution.

For example, to execute the NHC check once per hour with a specified E-mail interval of 1 day, add this to the system's crontab:

# Node Health Check
3 * * * * /usr/sbin/nhc-wrapper -X 1d

NHC and GPU nodes

The NHC has a check for Nvidia GPU health, namely check_nv_healthmon. Unfortunately, it seems that Nvidia no longer offers the nvidia-healthmon tool for this purpose. Instead, it may be useful to check for the presence of the GPU devices with a check similar to this (for 4 GPU devices):

gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3

It seems that these device files do not get created automatically at reboot, but only if you run this (for example, in /etc/rc.local):

/usr/bin/nvidia-smi

The physical presence of Nvidia devices can be tested by this command:

# lspci | grep NVIDIA

NHC bugs

It may be necessary to force the NHC configuration file /etc/nhc/nhc.conf to use the Slurm scheduler by adding this line near the top:

* || NHC_RM=slurm

because NHC (version 1.4.2) may autodetect NHC_RM=pbs if the file /usr/bin/pbsnodes is present (see issue 20).

Also, NHC 1.4.2 has a bug for Slurm multi-node jobs (see issue 15), so you have to comment out any lines in nhc.conf calling:

# check_ps_unauth_users

Both bugs should be fixed in NHC 1.4.3 (when it becomes available).

Reboot option

Nodes may occasionally have to be rebooted after firmware or kernel upgrades.

Reboot the nodes automatically as they become idle using the RebootProgram configured in slurm.conf; see the scontrol reboot option and its explanation in the man-page:

scontrol reboot [ASAP] [NodeList]

The ASAP flag is available from Slurm 17.02, see man scontrol for earlier versions.
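
For example, to reboot a (hypothetical) set of nodes as soon as they become idle:

scontrol reboot ASAP a[001-010]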

Add this line to slurm.conf:

RebootProgram="/usr/sbin/reboot"

The path to reboot may be different on other OSes.

Notice: Command arguments to RebootProgram like:

RebootProgram="/sbin/shutdown -r now"

seem to be ignored in Slurm 16.05 up to 17.02.3, see bug_3612.

Timeout options

A number of Timeout options may be configured in slurm.conf.

Bug_3941 discusses the problem of nodes being drained because the killing of jobs takes too long to complete. To extend this timeout, configure in slurm.conf:

UnkillableStepTimeout=120

(or an even larger value). This may also be accompanied by a custom command UnkillableStepProgram. If this timeout is reached, the node will also be drained with reason batch job complete failure.

ReturnToService option

The ReturnToService option in slurm.conf controls when a DOWN node will be returned to service, see slurm.conf and the FAQ Why is a node shown in state DOWN when the node has registered for service?.
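
For example, to let a DOWN node return to service automatically as soon as it registers with a valid configuration, one might set in slurm.conf:

ReturnToService=2

The default value 0 keeps a DOWN node down until an administrator resumes it manually with scontrol.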

MaxJobCount limit

In slurm.conf is defined:

MaxJobCount
  The maximum number of jobs Slurm can have in its active database at one time.
  Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources.
  Once this limit is reached, requests to submit additional jobs will fail.
  The default value is 10000 jobs.

If you exceed 10000 jobs in the queue, users will get an error when submitting jobs:

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: error: Batch job submission failed: Resource temporarily unavailable

Set a higher value in slurm.conf, for example:

MaxJobCount=20000

Another parameter in slurm.conf may need modification along with a higher MaxJobCount:

MinJobAge
  The minimum age of a completed job before its record is purged from Slurm's active database.
  Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources.
  The default value is 300 seconds.

In addition, it may be a good idea to implement MaxSubmitJobs and MaxJobs resource_limits for user associations or QOSes, for example:

sacctmgr modify user where name=<username> set MaxJobs=100 MaxSubmitJobs=500
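
A corresponding sketch at the QOS level (the QOS name normal is only an example) might be:

sacctmgr modify qos normal set MaxJobsPerUser=100 MaxSubmitJobsPerUser=500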

Requeueing of jobs

Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. The following parameter in slurm.conf controls whether batch jobs may be requeued by default:

JobRequeue=0

This parameter works as follows:

  • If JobRequeue is set to a value of 1, then batch jobs may be requeued unless explicitly disabled by the user.
  • If JobRequeue is set to a value of 0, then batch jobs will not be requeued unless explicitly enabled by the user.
  • The default value is 1.

Use the sbatch options --requeue or --no-requeue to change the default behavior for individual jobs.

High throughput configuration or large clusters

The High Throughput Computing Administration Guide contains Slurm administrator information specifically for high throughput computing, namely the execution of many short jobs. Getting optimal performance for high throughput computing does require some tuning, and that guide should help you off to a good start.

The Large Cluster Administration Guide contains Slurm administrator information specifically for clusters containing 1,024 nodes or more.

Head/Master server configuration

The following must be done on the Head/Master node. Create the spool and log directories and make them owned by the slurm user:

mkdir /var/spool/slurmctld /var/log/slurm
chown slurm: /var/spool/slurmctld /var/log/slurm
chmod 755 /var/spool/slurmctld /var/log/slurm

Create log files:

touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log

Create the (Linux default) accounting file:

touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log

NOTICE: If you plan to enable job accounting, it is mandatory to configure the database and accounting as explained in the Slurm_accounting page.

slurmctld daemon

Start and enable the slurmctld daemon:

systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service

Warning: With Slurm 14.x and a compute node running RHEL 7 there is a bug where systemctl start/stop does not work on RHEL 7. This problem has apparently been resolved in Slurm 15.08.

Copy slurm.conf to all nodes

Finally copy /etc/slurm/slurm.conf to all compute nodes:

scp -p /etc/slurm/slurm.conf nodeXXX:/etc/slurm/slurm.conf

It's convenient to use the pdsh command, see PDSH.

It is important to keep this file identical on both the Head/Master server and all Compute nodes. Remember to include all of the NodeName= lines for all compute nodes.

Compute node configuration

The following must be done on each compute node. Create the slurmd spool and log directories and make the correct ownership:

mkdir /var/spool/slurmd /var/log/slurm
chown slurm: /var/spool/slurmd  /var/log/slurm
chmod 755 /var/spool/slurmd  /var/log/slurm

Create log files:

touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log

Executing the command:

slurmd -C

on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which must be added to the global slurm.conf file. For example a node may be defined as:

NodeName=test001 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8010 TmpDisk=32752 Feature=xeon

Here TmpDisk is defined in slurm.conf as the size of the TmpFS file system (default: /tmp). You may define another temporary file system in slurm.conf, for example:

TmpFS=/scratch

Start and enable the slurmd daemon:

systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service

Partition limits

If EnforcePartLimits is set to "ALL" then jobs which exceed a partition's size and/or time limits will be rejected at submission time:

EnforcePartLimits=ALL

Job limits

By default, Slurm will propagate all user limits from the submitting node (see ulimit -a) to be effective also within batch jobs.

It is important to configure slurm.conf so that the locked memory limit isn't propagated to the batch jobs:

PropagateResourceLimitsExcept=MEMLOCK

as explained in https://slurm.schedmd.com/faq.html#memlock. A possible memory limit error with OmniPath was discussed in Slurm bug 3363.

In fact, if you have imposed any non-default limits in /etc/security/limits.conf or /etc/security/limits.d/*.conf on the login nodes, you probably want to prevent these from propagating to batch jobs by configuring:

PropagateResourceLimitsExcept=ALL

See the slurm.conf page for the list of all PropagateResourceLimitsExcept limits.

PAM module restrictions

On Compute nodes you may additionally install the slurm-pam_slurm RPM package to prevent rogue users from logging in. A more important function is the containment of SSH tasks launched, for example, by MPI libraries that do not use Slurm for spawning tasks. The pam_slurm_adopt module makes sure that such SSH child tasks are controlled by Slurm on the job's master node.

For further details, the pam_slurm_adopt module is described by its author in Caller ID: Handling ssh-launched processes in Slurm. Features include:

  • This module restricts access to compute nodes in a cluster where Slurm is in use. Access is granted to root, any user with a Slurm-launched job currently running on the node, or any user who has allocated resources on the node according to Slurm.

Usage of pam_slurm_adopt is described in the source files pam_slurm_adopt. There is also a nice description in bug_4098. Documentation of pam_slurm_adopt is discussed in bug_3567.

The PAM usage of, for example, /etc/pam.d/system-auth on CentOS/RHEL is configured through the authconfig command.

Configure PrologFlags

Warning: Do NOT configure UsePAM=1 in slurm.conf (this advice can be found on the net). Please see bug_4098 (comment 3).

You need to configure slurm.conf with:

PrologFlags=contain

Then distribute the slurm.conf file to all nodes. Reconfigure the slurmctld service:

scontrol reconfigure

This can be done while the cluster is in production, see bug_4098 (comment 3).

PAM configuration

Warnings:

  • First make the PrologFlags=contain configuration described above.
  • Do NOT configure UsePAM=1 in slurm.conf.
  • Reconfiguration of the PAM setup should only be done on compute nodes that can't run jobs (for example, drained nodes).
  • You should only configure this on Slurm 17.02.2 or later.

First make sure that you have installed this Slurm package:

rpm -q slurm-pam_slurm

Create a new file in /etc/pam.d/ where the line with pam_systemd.so has been removed:

cd /etc/pam.d/
grep -v pam_systemd.so < password-auth > password-auth-no-systemd

The reason is (quoting pam_slurm_adopt) that:

  • pam_systemd.so is known to not play nice with Slurm's usage of cgroups. It is recommended that you disable it or possibly add pam_slurm_adopt.so after pam_systemd.so.

Insert some new lines in the file /etc/pam.d/sshd at this place:

...
account    required     pam_nologin.so
# - PAM config for Slurm - BEGIN
account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so
# - PAM config for Slurm - END
account    include      password-auth
...

and also replace the line "session include password-auth" by:

# - PAM config for Slurm - BEGIN
session    include      password-auth-no-systemd
# - PAM config for Slurm - END

Options to the pam_slurm_adopt.so module are documented in the pam_slurm_adopt page.

Now append these lines to /etc/security/access.conf (see man access.conf or access.conf for further possibilities):

+ : root   : ALL
- : ALL    : ALL

so that pam_access.so will:

  • Allow access to the root user.
  • Deny access to ALL other users.

This can be tested immediately by trying to make SSH logins to the node. Normal user logins should be rejected with the message:

Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by <IP address>

slurmd systemd limits

MPI jobs and other tasks using the Infiniband or OmniPath fabrics must have unlimited locked memory, see above. Limits defined in /etc/security/limits.conf or /etc/security/limits.d/*.conf are not effective for systemd services, see https://access.redhat.com/solutions/1257953, so any limits must be defined in the service file, see man systemd.exec.

For slurmd running under systemd the default limits are configured in /usr/lib/systemd/system/slurmd.service as:

LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

If you want to modify or override these limits, create a systemd drop-in file rather than editing the slurmd.service file directly. For example, create a file /etc/systemd/system/slurmd.service.d/core_limit.conf with the contents:

[Service]
LimitCORE=0

and do:

systemctl daemon-reload
systemctl restart slurmd

This file could be distributed to all compute nodes from a central location.

The possible process limit parameters are documented in the systemd.exec page section on Process Properties. The list is:

LimitCPU=, LimitFSIZE=, LimitDATA=, LimitSTACK=, LimitCORE=, LimitRSS=, LimitNOFILE=, LimitAS=, LimitNPROC=, LimitMEMLOCK=, LimitLOCKS=, LimitSIGPENDING=, LimitMSGQUEUE=, LimitNICE=, LimitRTPRIO=, LimitRTTIME=

To ensure that job tasks running under Slurm have the desired configuration, verify the slurmd daemon's limits by:

cat /proc/$(pgrep -u 0 slurmd)/limits

If slurmd has a memory lock limit lower than expected, it may be because slurmd was started at boot time by the old init-script /etc/init.d/slurm rather than by systemctl. To remedy this problem see the section Starting slurm daemons at boot time above.

Setting job limits with PAM

By default jobs started by slurmd do not use PAM and therefore do not honor the /etc/security/limits.conf file. This behavior may be changed by adding to slurm.conf (see the man-page):

UsePAM=1

Then you can create a file /etc/pam.d/slurm containing:

auth            required        pam_localuser.so
account         required        pam_unix.so
session         required        pam_limits.so

Configure Prolog and Epilog scripts

It may be necessary to execute Prolog and/or Epilog scripts on the compute nodes when slurmd executes a task step (by default none are executed). See also the Prolog and Epilog Guide.

In the slurm.conf page this is described:

  • Prolog

    Fully qualified pathname of a program for the slurmd to execute whenever it is asked to run a job step from a new job allocation (e.g. /usr/local/slurm/prolog). A glob pattern (See glob(7)) may also be used to specify more than one program to run (e.g. /etc/slurm/prolog.d/*). The slurmd executes the prolog before starting the first job step. The prolog script or scripts may be used to purge files, enable user login, etc.

    By default there is no prolog. Any configured script is expected to complete execution quickly (in less time than MessageTimeout).

    If the prolog fails (returns a non-zero exit code), this will result in the node being set to a DRAIN state and the job being requeued in a held state, unless nohold_on_prolog_fail is configured in SchedulerParameters. See Prolog and Epilog Scripts for more information.

  • TaskProlog

    Fully qualified pathname of a program to be executed as the slurm job's owner prior to initiation of each task. Besides the normal environment variables, this has SLURM_TASK_PID available to identify the process ID of the task being started. Standard output from this program can be used to control the environment variables and output for the user program (further details in the slurm.conf page).

  • TaskEpilog

    Fully qualified pathname of a program to be executed as the slurm job's owner after termination of each task. See TaskProlog for execution order details.

See also the items:

  • PrologEpilogTimeout
  • PrologFlags
  • SrunEpilog

Prolog and epilog examples

An example script is shown in the FAQ https://slurm.schedmd.com/faq.html#task_prolog:

#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#

if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
fi

The script is supposed to output commands to be read by slurmd:

  • The task prolog is executed with the same environment as the user tasks to be initiated. The standard output of that program is read and processed as follows:
    • export name=value - sets an environment variable for the user task
    • unset name - clears an environment variable from the user task
    • print ... - writes to the task's standard output.
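
The export and unset directives above can be illustrated with a small sketch of a TaskProlog script (the variable names are purely hypothetical):

#!/bin/sh
# Give each task a per-job scratch directory variable
echo "export MY_SCRATCH=/scratch/$SLURM_JOB_ID"
# Remove a variable inherited from the submission environment
echo "unset MY_UNWANTED_VAR"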

Configure partitions

System partitions are configured in slurm.conf, for example:

PartitionName=xeon8 Nodes=a[070-080] Default=YES DefaultTime=50:00:00 MaxTime=168:00:00 State=UP

Partitions may overlap so that some nodes belong to several partitions.

Access to partitions is configured in slurm.conf using AllowAccounts, AllowGroups, or AllowQos.

If some partition (like big memory nodes) should have a higher priority, this is controlled in slurm.conf using the multifactor plugin, for example:

PartitionName ... PriorityJobFactor=10
PriorityWeightPartition=1000
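
A fuller hypothetical example (the partition and node names are made up) could look like:

PartitionName=bigmem Nodes=b[001-004] PriorityJobFactor=10 MaxTime=168:00:00 State=UP
PriorityWeightPartition=1000

Note that PriorityWeightPartition is a global slurm.conf parameter used by the multifactor priority plugin, whereas PriorityJobFactor is set per partition.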

Sharing nodes

By default nodes are allocated exclusively to jobs, but it is possible to permit multiple jobs and/or multiple users per node. This is configured using the Consumable Resource Allocation plugin (cons_res) in slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_MEMORY

In this configuration CPU and Memory are consumable resources. It is mandatory to use OverSubscribe=NO for the partitions as stated in the cons_res page:

  • All CR_s assume OverSubscribe=No or OverSubscribe=Force EXCEPT for CR_MEMORY which assumes OverSubscribe=Yes

Strange behaviour will result if you use the wrong OverSubscribe parameter. The OverSubscribe parameter (default= NO) is defined in the section OverSubscribe in slurm.conf. See also the cons_res_share page.

Configure multiple nodes and their features

Some defaults may be configured in slurm.conf for similar compute nodes, for example:

NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Weight=1
NodeName=q001
NodeName=q002
...

Node features, similar to node properties used in the Torque resource manager, are defined for each NodeName in slurm.conf by:

  • Feature:

    A comma delimited list of arbitrary strings indicative of some characteristic associated with the node. There is no value associated with a feature at this time, a node either has a feature or it does not. If desired a feature may contain a numeric component indicating, for example, processor speed. By default a node has no features.

Some examples are:

NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Feature=xeon8,ethernet Weight=1
NodeName=q001
NodeName=q002

Node weight

For clusters with heterogeneous node hardware it is useful to assign different Weight values to each type of node, see this slurm.conf parameter:

Weight
  The priority of the node for scheduling purposes. All things being equal, jobs will be allocated the nodes with the lowest weight which satisfies their requirements.

This enables prioritization based upon a number of hardware parameters such as GPUs, RAM memory size, CPU clock speed, CPU core number, CPU generation. For example, GPU nodes should be avoided for non-GPU jobs.

A nice method was provided by Kilian Cavalotti of SRCC where a weight mask is used in slurm.conf. Each digit in the weight mask represents a hardware parameter of the node (a weight prefix of 1 is prepended in order to avoid octal conversion). For example, the following weight mask example puts a higher weight on GPUs, then RAM memory, then number of cores, and finally the CPU generation:

# (A weight prefix of "1" is prepended)
#       #GRES           Memory          #Cores          CPU_generation
#        none: 0         24 GB: 0        8: 0           Nehalem:      1
#       1 GPU: 1         48 GB: 1        16: 1          Sandy Bridge: 2
#       2 GPU: 2         64 GB: 2        24: 2          Ivy Bridge:   3
#       3 GPU: 3        128 GB: 3        32: 3          Broadwell:    4
#       4 GPU: 4        256 GB: 4        36: 4          Skylake:      5
# Example: Broadwell (=4) with 24 cores (=2), 128 GB memory (=3), and 0 GPUs (=0): Weight=10324

This example would be used to assign a Weight value in slurm.conf for the relevant nodes:

NodeName=xxx Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 Weight=10324

A different prioritization of hardware can be selected with different columns and numbers in the mask; the mask calculation yields a fixed Weight number for each type of node.

Generic resources (GRES) and GPUs

Generic resources (GRES) are specified for a node as a comma-delimited list of resources that may be consumed by jobs, for example GPU accelerators. In this case you must also configure the gres.conf file.

An example gres.conf file with a gpu GRES might be:

Nodename=h[001-002] Name=gpu Type=K20Xm File=/dev/nvidia[0-3]

If GRES is used, you must also configure slurm.conf. Define the named GRES types:

GresTypes=gpu

and append a list of GRES resources in the slurm.conf NodeName specifications:

NodeName=h[001-002] Gres=gpu:K20Xm:4

See also the examples in the gres.conf page.

Configure network topology

Slurm can be configured to support topology-aware resource allocation to optimize job performance, see the Topology_Guide and the topology.conf manual page.
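
As a minimal sketch, a topology.conf describing two leaf switches connected by a spine switch might look like this (the switch and node names are hypothetical):

SwitchName=leaf1 Nodes=a[001-016]
SwitchName=leaf2 Nodes=a[017-032]
SwitchName=spine Switches=leaf[1-2]

Topology-aware scheduling is then enabled by setting TopologyPlugin=topology/tree in slurm.conf.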

Configure firewall for Slurm daemons

The Slurm compute nodes must be allowed to connect to the Head/Master node's slurmctld daemon. In the configuration file these ports are by default (see slurm.conf):

SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321

CentOS7/RHEL7 firewall

The CentOS7/RHEL7 default firewall service is firewalld and not the well-known iptables service. The dynamic firewall daemon firewalld provides a dynamically managed firewall with support for network “zones” to assign a level of trust to a network and its associated connections and interfaces. See Introduction to firewalld.

A nice introduction is RHEL7: How to get started with Firewalld.

Install firewalld by:

yum install firewalld firewall-config

Head/Master node

Open port 6817 (slurmctld):

firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --reload

Alternatively, completely whitelist the compute nodes' private subnet (here: 10.2.x.x):

firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT_direct 0 -s 10.2.0.0/16 -j ACCEPT
firewall-cmd --reload

The configuration is stored in the file /etc/firewalld/direct.xml.

Database (slurmdbd) node

The slurmdbd service by default listens to port 6819, see slurmdbd.conf.

Open port 6819 (slurmdbd):

firewall-cmd --permanent --zone=public --add-port=6819/tcp
firewall-cmd --reload

Compute node firewall must be off

Quoting Moe Jette from [slurm-dev] No route to host: Which ports are used?:

Other communications (say between srun and the spawned tasks) are intended to operate within a cluster and have no port restrictions.

The simplest solution is to ensure that the compute nodes have no firewall enabled:

systemctl stop firewalld
systemctl disable firewalld

However, you may run a firewall service, as long as you ensure that all ports are open between the compute nodes.

Login node firewall

A login node doesn't need any special firewall rules for Slurm because no such daemons should be running on login nodes.

Warning: The srun command only works if the login node can:

  • Connect to the Head node port 6817.
  • Resolve the DNS name of the compute nodes.
  • Connect to the Compute nodes port 6818.

Therefore interactive batch jobs with srun seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.

Checking the Slurm daemons

Check the configured daemons using the scontrol command:

scontrol show daemons

To verify the basic cluster partition setup:

scontrol show partition

To display the Slurm configuration:

scontrol show config

To display the compute nodes:

scontrol show nodes
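
In addition to scontrol, the sinfo command gives a quick overview of partition and node states, for example:

sinfo
sinfo -Nel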

One may also run the daemons interactively as described in Slurm_Quick_Start (Starting the Daemons). You can use one window to execute slurmctld -D -vvvvvv and a second window to execute slurmd -D -vvvvv.

Configure ARP cache for large networks

If the number of network devices (cluster nodes plus switches etc.) approaches or exceeds 512, you must consider the Linux kernel's limited dynamic ARP-cache size. Please read the man-page man 7 arp about the kernel's ARP-cache.

The best solution to this ARP-cache trashing problem is to increase the kernel's ARP-cache garbage collection (gc) parameters by adding these lines to /etc/sysctl.conf:

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Adjust to arp table gc to clean-up more often
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

You may also consider increasing the SOMAXCONN limit:

# Limit of socket listen() backlog, known in userspace as SOMAXCONN
net.core.somaxconn = 1024

see Large Cluster Administration Guide.

Then reread this configuration file:

/sbin/sysctl -p
