Slurm configuration
Jump to our top-level Slurm page: Slurm batch queueing system
Network configuration for Slurm
A Slurm cluster contains many components that must be able to communicate with each other. Sites with security requirements that prevent them from opening all communications between machines will need to selectively open just the ports that are necessary.
Read more in the Slurm Network_Configuration_Guide.
Slurm configuration and slurm.conf
You probably want to look at the example configuration files found in this RPM:
rpm -ql slurm-example-configs
On the Slurm Head node you should build a slurm.conf configuration file. When it has been fully tested, then slurm.conf must be copied to all other nodes.
It is mandatory that the slurm.conf file is identical on all nodes in the system!
Consult the Slurm_Quick_Start Administrator Guide.
See also man slurm.conf
or the on-line slurm.conf documentation.
Copy the HTML files to your $HOME directory, for example:
mkdir $HOME/slurm/
cp -rp /usr/share/doc/slurm-*/html $HOME/slurm/
Configless Slurm setup
The configless feature allows the compute nodes — specifically the slurmd process — and user commands running on login nodes to pull configuration information directly from the slurmctld controller instead of from a pre-distributed local file. The order of precedence for determining which configuration source to use is listed in the configless page.
On startup the compute node slurmd will query the slurmctld server that you specify, and the configuration files will be pulled to the node's local disk. The pulled slurmd configuration files are stored in this folder:
$ ls -ld /run/slurm/conf
lrwxrwxrwx. 1 root root 28 Mar 18 08:24 /run/slurm/conf -> /var/spool/slurmd/conf-cache
$ ls -la /var/spool/slurmd/conf-cache
total 24
drwxr-xr-x. 2 root root 81 Mar 18 08:24 .
drwxr-xr-x. 3 slurm slurm 92 Mar 18 08:24 ..
-rw-r--r--. 1 root root 506 Mar 18 08:24 cgroup.conf
-rw-r--r--. 1 root root 165 Mar 18 08:24 gres.conf
-rw-r--r--. 1 root root 11711 Mar 18 08:24 slurm.conf
-rw-r--r--. 1 root root 2538 Mar 18 08:24 topology.conf
Testing configless setup
The slurmctld server information is preferably provided in a DNS SRV_record for your DNS_zone, pointing to port 6817 on your slurmctld server(s) and with a suggested Time_to_live (TTL) of 3600 seconds:
_slurmctld._tcp 3600 IN SRV 10 0 6817 slurm-backup
_slurmctld._tcp 3600 IN SRV 0 0 6817 slurm-master
Note: The value TTL=3600 could be any value at all, because slurmd will only read the DNS SRV_record at initial startup and never thereafter, see bug_20462.
To verify the DNS setup, install these packages containing the tools used below:
dnf install bind-utils hostname
Look up the SRV_record with either of these commands:
dig +short -t SRV -n _slurmctld._tcp.`dnsdomainname`
host -t SRV _slurmctld._tcp.`dnsdomainname`
Add login and submit nodes to slurm.conf
The SLUG 2020 talk (see Slurm_Publications) Field Notes 4: From The Frontlines of Slurm Support by Jason Booth recommends on slide 31 to run slurmd on all login nodes in configless Slurm mode:
We generally suggest that you run a slurmd to manage the configs on those nodes that run client commands, including submit or login nodes
The simplest way to achieve this is described in bug_9832:
Add the login and submit nodes to slurm.conf as default-configured nodes, for example:
NodeName=login1,login2
and do not add these nodes to any partitions!
Remember to add these nodes to the topology.conf file as well, for example:
SwitchName=public_switch Nodes=login1,login2
and open the firewall on the login nodes (see the firewall section below).
Install the slurm-slurmd RPM on the login nodes and make sure to create the logging directory:
mkdir /var/log/slurm
chown slurm.slurm /var/log/slurm
Then start the slurmd service:
systemctl enable slurmd
systemctl start slurmd
Verify that the Slurm config files have been downloaded:
ls -l /run/slurm/conf
Delay start of slurmd until InfiniBand/Omni-Path network is up
Unfortunately, slurmd may start up before the InfiniBand or Omni-Path (Cornelis Networks) fabric ports are up. The reason is that InfiniBand ports may take a number of seconds to become activated at system boot time, and NetworkManager unfortunately cannot be configured to wait for InfiniBand, but will claim that the network is online as soon as one of the NIC interfaces is ready (typically Ethernet). This issue seems to be serious on EL8 (RHEL 8 and clones), with 10-15 seconds of delay.
If you have configured Node Health Check (NHC) to check the Infiniband ports, the NHC check is going to fail until the Infiniband ports are up. Please note that slurmd will call NHC at startup, if HealthCheckProgram has been configured in slurm.conf. Jobs started by slurmd may fail if the Infiniband port is not yet up.
We have written some InfiniBand_tools to delay the NetworkManager network-online.target for InfiniBand and Omni-Path (Cornelis Networks) fabrics, so that slurmd gets started only after all networks, including InfiniBand, are actually up.
Configuring a custom slurmd service
The SLURMD_OPTIONS can be defined in the file /etc/sysconfig/slurmd:
SLURMD_OPTIONS=-M --conf-server <name of slurmctld server>
which is read by the Systemd service file /usr/lib/systemd/system/slurmd.service.
Another way is to use systemctl edit slurmd to create an override file, see the systemctl manual page. The override files will be placed in the /etc/systemd/system/slurmd.service.d/ folder. An example file /etc/systemd/system/slurmd.service.d/override.conf could be:
[Service]
Environment="SLURMD_OPTIONS=-M --conf-server <name of slurmctld server>"
In this example the slurmd option -M locks slurmd in memory, and the slurmctld server name is given.
See configless and the slurmd manual page.
Configurator for slurm.conf
You can generate an initial slurm.conf file using several tools:
The Slurm Configuration Tool configurator.
The Slurm Configuration Tool - Easy Version configurator.easy.
Build a configuration file using your favorite web browser and open file://$HOME/slurm/html/configurator.html or the simpler file configurator.easy.html.
Copy the more extensive sample configuration file .../etc/slurm.conf.example from the source tar-ball and use it as a starting point.
Save the resulting output to /etc/slurm/slurm.conf.
The parameters are documented in man slurm.conf
and slurm.conf, and it’s recommended to read through the long list of parameters.
In slurm.conf it’s essential that the important spool directories and the slurm user are defined correctly:
SlurmUser=slurm
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
NOTE: These spool directories must be created manually and owned by user slurm (see below), as they are not part of the RPM installation.
Configure AccountingStorageType in slurm.conf
As shown in the slurm.conf manual page, the AccountingStorageType option (if defined) only has a single acceptable value:
AccountingStorageType=accounting_storage/slurmdbd
This basically means that the use of a Slurm database with a slurmdbd service is strongly encouraged!
If AccountingStorageType is omitted, or set to the obsolete value accounting_storage/none (removed from Slurm 23.11), then account records are not maintained, meaning that anything related to user accounts will not work! See also a discussion in bug_21398.
Starting slurm daemons at boot time
Enable startup of services as appropriate for the given node:
systemctl enable slurmd # Compute node
systemctl enable slurmctld # Head server
systemctl enable slurmdbd # Database server
The systemd service files are /usr/lib/systemd/system/slurm*.service.
Manual startup of services
If there is any question about:
The availability and sanity of the daemons' spool directories (perhaps on remote storage)
The MySQL database
Whether Slurm has been upgraded to a new version
it may be a good idea to start each service manually instead of automatically as shown above. For example:
slurmctld -Dvvvv
Watch the output for any signs of problems. If the daemon looks sane, type Control-C and start the service in the normal way:
systemctl start slurmctld
E-mail notification setup
The slurm.conf variables MailProg and MailDomain determine the delivery of E-mail messages from Slurm. You may want to use smail from the slurm-contribs RPM package by setting:
MailProg=/usr/bin/smail
This will include some job statistics in the message.
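The MailDomain parameter qualifies usernames with a domain when sending mail. A minimal, illustrative example (the domain name is just a placeholder):
MailDomain=example.com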
Another possibility is Goslmailer (GoSlurmMailer).
Reconfiguration of slurm.conf
When changing configuration files such as slurm.conf and cgroup.conf, they must first be distributed to all compute and login nodes (not needed in configless Slurm clusters).
On the master node make the daemons reread the configuration files:
scontrol reconfigure
From the scontrol man-page about the reconfigure option:
Instruct all slurmctld and slurmd daemons to re-read the configuration file. This mechanism can be used to modify configuration parameters set in slurm.conf without interrupting running jobs.
New: Starting in 23.11, this command operates by creating new processes for the daemons, then passing control to the new processes when or if they start up successfully. This allows it to gracefully catch configuration problems and keep running with the previous configuration if there is a problem. This will not be able to change the daemons’ listening TCP port settings or authentication mechanism.
The slurmctld daemon and all slurmd daemons must be restarted if nodes are added to or removed from the cluster.
Adding nodes
According to the scontrol man-page, when adding or removing nodes to slurm.conf, it is necessary to restart slurmctld. However, it is also necessary to restart the slurmd daemon on all nodes, see bug_3973:
Stop slurmctld
Add/remove nodes in slurm.conf
Restart slurmd on all nodes
Start slurmctld
For a configless setup the slurmctld must be restarted first; in this case the order is:
Stop slurmctld
Add/remove nodes in slurm.conf
Start slurmctld
Quickly restart slurmd on all nodes using ClusterShell.
It is also possible to add nodes to slurm.conf with a state of future:
FUTURE
Indicates the node is defined for future use and need not exist when the Slurm daemons are started.
These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon.
After these nodes are made available, change their State in the slurm.conf file.
Until these nodes are made available, they will not be seen by any Slurm commands, nor will any attempt be made to contact them.
However, such future nodes must not be members of any Slurm partition.
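A minimal sketch of a FUTURE node definition and its later activation (the node names and hardware values are purely illustrative; check the slurm.conf and scontrol man pages for your Slurm version):
NodeName=c[101-110] CPUs=32 RealMemory=128000 State=FUTURE
# Later, when the hardware is actually installed:
scontrol update NodeName=c101 State=RESUME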
Cgroup configuration
Control Groups (cgroups v1) provide a Linux kernel mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
Documentation about the usage of cgroups can be found in the Linux kernel documentation and in the Slurm Cgroups_Guide (see below).
Install cgroups tools:
dnf install libcgroup-tools
To list current cgroups use the command:
lscgroup
lscgroup -g cpu:/
To list processes that are not properly constrained by Slurm cgroups:
ps --no-headers -eo pid,user,comm,cgroup | egrep -vw 'root|freezer:/slurm.*devices:/slurm.*cpuacct,cpu:/slurm.*memory:/slurm|cpuset:/slurm.*|dbus-daemon|munged|ntpd|gmond|polkitd|chrony|smmsp|rpcuser|rpc'
Usage of cgroups within Slurm is described in the Cgroups_Guide. Slurm provides cgroups versions of a number of plugins:
proctrack (process tracking)
task (task management)
jobacct_gather (job accounting statistics)
See also the cgroup.conf configuration file for the cgroups support.
If you use jobacct_gather, change the default ProctrackType in slurm.conf:
ProctrackType=proctrack/linuxproc
otherwise you’ll get this warning in the slurmctld log:
WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
Notice: Linux kernel 2.6.38 or greater is strongly recommended, see the Cgroups_Guide General Usage Notes.
Getting started with cgroups
In this example we want to constrain jobs to the number of CPU cores as well as RAM memory requested by the job.
Configure slurm.conf to use cgroups as well as the affinity plugin:
TaskPlugin=affinity,cgroup
For a discussion see bug 3853.
You should probably also configure this (unless you have lots of short running jobs):
ProctrackType=proctrack/cgroup
see the section ProctrackType of slurm.conf.
Create cgroup.conf file:
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
Edit the file to change these lines:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
The cgroup.conf page defines:
- ConstrainCores=<yes|no>
If configured to “yes” then constrain allowed cores to the subset of allocated resources. It uses the cpuset subsystem.
- ConstrainRAMSpace=<yes|no>
If configured to “yes” then constrain the job’s RAM usage. The default value is “no”, in which case the job’s RAM limit will be set to its swap space limit. Also see AllowedSwapSpace, AllowedRAMSpace and ConstrainSwapSpace.
- ConstrainSwapSpace=<yes|no>
If configured to “yes” then constrain the job’s swap space usage. The default value is “no”. Note that when set to “yes” and ConstrainRAMSpace is set to “no”, AllowedRAMSpace is automatically set to 100% in order to limit the RAM+Swap amount to 100% of job’s requirement plus the percent of allowed swap space. This amount is thus set to both RAM and RAM+Swap limits. This means that in that particular case, ConstrainRAMSpace is automatically enabled with the same limit than the one used to constrain swap space. Also see AllowedSwapSpace.
You may also consider defining MemSpecLimit in slurm.conf:
MemSpecLimit Amount of memory, in megabytes, reserved for system use and not available for user allocations. If the task/cgroup plugin is configured and that plugin constrains memory allocations (i.e. TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes in cgroup.conf), then Slurm compute node daemons (slurmd plus slurmstepd) will be allocated the specified memory limit. The daemons will not be killed if they exhaust the memory allocation (ie. the Out-Of-Memory Killer is disabled for the daemon’s memory cgroup). If the task/cgroup plugin is not configured, the specified memory will only be unavailable for user allocations.
See an interesting discussion in bug 2713.
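A hedged example of reserving memory for system daemons on a node (the node name, RealMemory, and the 2 GB value are illustrative only):
NodeName=test001 RealMemory=192000 MemSpecLimit=2048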
After distributing the cgroup.conf file to all nodes, make a scontrol reconfigure.
Node Health Check
To ensure the health status of the Head node and compute nodes, install the LBNL Node Health Check (NHC) package. The NHC releases are in https://github.com/mej/nhc/releases/.
It’s simple to configure NHC Slurm integration, see the NHC page. Add the following to slurm.conf on your Head node and your compute nodes:
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY
This will execute NHC every 60 minutes on nodes in ANY state, see the slurm.conf documentation about the Health* variables.
There are other criteria for when to execute NHC as defined by HealthCheckNodeState in slurm.conf: ALLOC, ANY, CYCLE, IDLE, MIXED.
At our site we add the following lines in the NHC configuration file /etc/nhc/nhc.conf
for nodes in the domain nifl.fysik.dtu.dk:
* || NHC_RM=slurm
# Flag df to list only local filesystems (omit NFS mounts)
* || DF_FLAGS="-Tkl"
* || DFI_FLAGS="-Til"
# Setting short hostname for compute nodes (default in our Slurm setup)
*.nifl.fysik.dtu.dk || HOSTNAME=$HOSTNAME_S
# Busy batch nodes may take a long time to run nhc
*.nifl.fysik.dtu.dk || TIMEOUT=120
# Check OmniPath/Infiniband link
x*.nifl.fysik.dtu.dk || check_hw_ib 100
If you want to receive E-mail alerts from NHC, you can add a crontab entry to execute the nhc-wrapper script, see the NHC page section Periodic Execution.
For example, to execute the NHC check once per hour with a specified E-mail interval of 1 day, add this to the system’s crontab:
# Node Health Check
3 * * * * /usr/sbin/nhc-wrapper -X 1d
NHC and GPU nodes
The NHC has a check for Nvidia GPU health, namely check_nv_healthmon.
Unfortunately, it seems that Nvidia no longer offers the tool nvidia-healthmon for this purpose.
Nvidia has a new Data Center GPU Manager (DCGM) suite of tools which includes NVIDIA Validation Suite (NVVS). Download of DCGM requires membership of the Data Center GPU Manager (DCGM) Program. Install the RPM by:
dnf install datacenter-gpu-manager-1.7.1-1.x86_64.rpm
Run the NVVS tool:
nvvs -g -l /tmp/nvvs.log
The (undocumented?) log file (-l) seems to be required.
It does not seem obvious how to use NVVS as a fast running tool under NHC.
Perhaps it may be useful instead to check for the presence of the GPU devices with a check similar to this (for 4 GPU devices):
gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
It seems that these device files do not get created automatically at reboot, but only if you run this (for example, in /etc/rc.local):
/usr/bin/nvidia-smi
The physical presence of Nvidia devices can be tested by this command:
# lspci | grep NVIDIA
NHC bugs
It may be necessary to force the NHC configuration file /etc/nhc/nhc.conf to use the Slurm scheduler by adding this line near the top:
* || NHC_RM=slurm
because NHC (version 1.4.2) may autodetect NHC_RM=pbs if the file /usr/bin/pbsnodes is present (see issue 20).
Also, NHC 1.4.2 has a bug for Slurm multi-node jobs (see issue 15), so you have to comment out any lines in nhc.conf calling:
# check_ps_unauth_users
Both bugs should be fixed in NHC 1.4.3 (when it becomes available).
Reboot option
Nodes may occasionally have to be rebooted after firmware or kernel upgrades.
Reboot the nodes automatically as they become idle using the RebootProgram as configured in slurm.conf, see the scontrol reboot option and explanation in the man-page:
scontrol reboot [ASAP] [NodeList]
The ASAP flag is available from Slurm 17.02; see man scontrol for earlier versions.
Add this line to slurm.conf:
RebootProgram="/usr/sbin/reboot"
The path to reboot may be different on other OSes.
Notice: Command arguments to RebootProgram like:
RebootProgram="/sbin/shutdown -r now"
seem to be ignored for Slurm 16.05 until 17.02.3, see bug_3612.
Timeout options
A number of Timeout options may be configured in slurm.conf.
Bug_3941 discusses the problem of nodes being drained because the killing of jobs takes too long to complete. To extend this timeout you can configure the UnkillableStepTimeout parameter in slurm.conf, for example:
UnkillableStepTimeout=180
Ensure that UnkillableStepTimeout is at least 5 times larger than MessageTimeout (default is 10 seconds). This may also be accompanied by a custom command UnkillableStepProgram. If this timeout is reached, the node will also be drained with reason batch job complete failure.
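As an illustrative example of the factor-of-5 rule, a MessageTimeout of 30 seconds calls for an UnkillableStepTimeout of at least 150 seconds:
MessageTimeout=30
UnkillableStepTimeout=180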
ReturnToService option
The ReturnToService option in slurm.conf controls when a DOWN node will be returned to service, see slurm.conf and the FAQ Why is a node shown in state DOWN when the node has registered for service?.
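For example, a common (illustrative) choice is ReturnToService=2, which lets a DOWN node become available again as soon as it registers with a valid configuration:
ReturnToService=2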
MaxJobCount limit
In slurm.conf is defined:
MaxJobCount
The maximum number of jobs Slurm can have in its active database at one time.
Set the values of MaxJobCount and MinJobAge to insure the slurmctld daemon does not exhaust its memory or other resources.
Once this limit is reached, requests to submit additional jobs will fail.
The default value is 10000 jobs.
If you exceed 10000 jobs in the queue users will get an error when submitting jobs:
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: error: Batch job submission failed: Resource temporarily unavailable
Add a higher value to slurm.conf, for example:
MaxJobCount=20000
Another parameter in slurm.conf may perhaps need modification along with a higher MaxJobCount:
MinJobAge
The minimum age of a completed job before its record is purged from Slurm's active database.
Set the values of MaxJobCount and MinJobAge to insure the slurmctld daemon does not exhaust its memory or other resources.
The default value is 300 seconds.
In addition, it may be a good idea to implement MaxSubmitJobs and MaxJobs resource_limits for user associations or QOSes, for example:
sacctmgr modify user where name=<username> set MaxJobs=100 MaxSubmitJobs=500
Job arrays
The job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits).
A slurm.conf configuration parameter controls the maximum job array size:
MaxArraySize.
Be mindful about the value of MaxArraySize as job arrays offer an easy way for users to submit large numbers of jobs very quickly.
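An illustrative slurm.conf setting (the default MaxArraySize is 1001, and the highest usable array task index is MaxArraySize-1):
MaxArraySize=10001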
Requeueing of jobs
Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. The following parameter in slurm.conf may be changed for the default ability for batch jobs to be requeued:
JobRequeue=0
The slurm.conf page describes this parameter:
If JobRequeue is set to a value of 1, then batch job may be requeued unless explicitly disabled by the user.
If JobRequeue is set to a value of 0, then batch job will not be requeued unless explicitly enabled by the user.
The default value is 1.
Use:
sbatch --requeue
sbatch --no-requeue
to change the default behavior for individual jobs.
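An administrator can also requeue (or requeue and hold) a specific job explicitly with scontrol, for example (the job ID is illustrative):
scontrol requeue 12345
scontrol requeuehold 12345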
Power monitoring and management
Slurm can be configured to monitor the power and energy usage of compute nodes, see the SLUG’18 presentation Workload Scheduling and Power Management. This paper also describes Slurm power management. See also the Slurm Power Management Guide.
The Slurm configuration file for the acct_gather plugins such as acct_gather_energy, acct_gather_profile and acct_gather_interconnect is described in acct_gather.conf.
RAPL CPU+DIMM power monitoring
On most types of processors one may activate Running Average Power Limit (RAPL) sensors for CPUs and RAM memory.
Notice: Please beware that the power monitoring may or may not cover entire compute node cabinets and other infrastructure! For example, the RAPL method described below monitors CPUs and RAM only, and does not cover other power usage within the node such as GPUs, motherboard, fans, power supplies, PCIe network and storage adapters.
With Slurm several AcctGatherEnergyType types are defined in the slurm.conf manual page. RAPL data gathering can be enabled in Slurm by:
# Power and energy monitoring
AcctGatherEnergyType=acct_gather_energy/rapl
AcctGatherNodeFreq=30
and do a scontrol reconfig.
Building IPMI power monitoring into Slurm
Many types of Baseboard Management Controllers (BMC) permit the reading of power consumption values using the IPMI DCMI extensions.
Note that Slurm version 23.02.7 (or later) should be used for correct functionality, see bug_17639.
Install the FreeIPMI prerequisite packages version 1.6.12 or later on the Slurm RPM-building server. FreeIPMI version 1.6.14 is available with RockyLinux and AlmaLinux (EL8) 8.10:
dnf install freeipmi freeipmi-devel
Then build Slurm RPM packages including the freeipmi libraries:
rpmbuild -ta slurm-<version>.tar.bz2 --with mysql --with freeipmi
When installing the slurm RPM packages, the freeipmi packages are now going to be required as prerequisites.
Note that the Slurm quickstart admin guide states:
IPMI Energy Consumption: The acct_gather_energy/ipmi accounting plugin will be built if the freeipmi development library is present.
See also the discussion about IPMI Data Center Manageability Interface (DCMI) in bug bug_17704.
You can check whether Slurm has been built with the acct_gather_energy/ipmi accounting plugin, and verify that the libfreeipmi.so.* library file is also available on the system:
$ ldd /usr/lib64/slurm/acct_gather_energy_ipmi.so | grep ipmi
libipmimonitoring.so.6 => /usr/lib64/libipmimonitoring.so.6 (0x00001552d1fa4000)
libfreeipmi.so.17 => /usr/lib64/libfreeipmi.so.17 (0x00001552d186f000)
$ ls -l /usr/lib64/libfreeipmi.so*
lrwxrwxrwx 1 root root 22 Apr 6 17:05 /usr/lib64/libfreeipmi.so.17 -> libfreeipmi.so.17.2.12
-rwxr-xr-x 1 root root 5469832 Apr 6 17:05 /usr/lib64/libfreeipmi.so.17.2.12
Using IPMI power monitoring (from Slurm 23.02.7)
IMPORTANT:
The acct_gather_energy/ipmi plugin should not be used with Slurm prior to 23.02.7! The reason is that this plugin has a bug where file descriptors in slurmd are not closed when making IPMI DCMI library calls. This issue was fixed in bug_17639 starting with Slurm 23.02.7.
On each type of compute node to be monitored, test whether the power values can be read by the commands:
ipmi-dcmi --get-dcmi-capability-info
ipmi-dcmi --get-system-power-statistics
ipmi-dcmi --get-enhanced-system-power-statistics
Slurm can be configured for IPMI power monitoring by slurmd in the compute nodes by this slurm.conf configuration:
AcctGatherEnergyType=acct_gather_energy/ipmi
At the same time you must configure the acct_gather.conf file in /etc/slurm/:
EnergyIPMIPowerSensors=Node=DCMI
EnergyIPMIFrequency=30
However, avoid the EnergyIPMICalcAdjustment parameter in acct_gather.conf, see bug_20207 Comment 26.
Set also this slurm.conf parameter, where example values may be:
JobAcctGatherFrequency=task=30,energy=30
as described in the manual page:
The default value for task sampling interval is 30 seconds.
The default value for all other intervals is 0.
Smaller (non-zero) values have a greater impact upon job performance, but a value of 30 seconds is not likely to be noticeable for applications having less than 10,000 tasks.
The JobAcctGatherFrequency should be >= EnergyIPMIFrequency, see bug_20207.
IMPORTANT:
You must configure the acct_gather_energy/ipmi parameters in slurm.conf and at the same time create the above file acct_gather.conf. All slurmd's may crash if one is configured without the other! If done incorrectly, the slurmd.log will report: fatal: Could not open/read/parse acct_gather.conf file ...
When the above configuration files are ready and have been distributed to all nodes (not needed with Configless), then perform a reconfiguration:
scontrol reconfigure
As a test you can monitor some power values as shown in the section below.
Energy accounting of individual jobs
When power monitoring has been enabled as shown above, it becomes possible to do energy accounting of individual jobs. The sacct accounting command has an output field ConsumedEnergyRaw that can be specified using the --format option:
ConsumedEnergyRaw: Total energy consumed by all tasks in a job, in joules. Note: Only in the case of an exclusive job allocation does this value reflect the job's real energy consumption.
However, job energy accounting is not fully reliable as of Slurm 23.11.8 (July 2024) due to a number of issues in slurmd that are tracked in bug_20207, see the list of issues in Comment 31.
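For example, a query for a single job's energy consumption could look like this (the job ID is illustrative):
sacct -j 12345 --format=JobID,Elapsed,ConsumedEnergyRaw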
Note: Joule is the unit of energy equal to the power in Watts multiplied by time. One kilowatt-hour (i.e., 1000 Watts consumed for 3600 seconds) is 3.6 megajoules.
Non DCMI compliant BMCs
Some vendors' BMCs (verified January 2024: Huawei and Xfusion) do NOT currently support reading power usage values with the IPMI DCMI extensions, which you can verify by this command:
[xfusion]$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported
The slurmd.log may contain IPMI DCMI error messages such as:
error: _get_dcmi_power_reading: get DCMI power reading failed: command invalid or unsupported
For such BMC types it is unfortunately not possible to perform power readings with the IPMI DCMI extensions, which is what has been implemented by Slurm. The scontrol show node command will report zero values for CurrentWatts and AveWatts for such nodes (note the definition of Watt).
For nodes which do not support the IPMI DCMI extensions, some error messages may be logged to slurmd.log:
error: _get_joules_task: can't get info from slurmd
error: slurm_get_node_energy: Zero Bytes were transmitted or received
This issue has been fixed in Slurm 23.11.8.
Monitoring power with Slurm
After reconfiguring the power values become available:
$ scontrol show node n123
...
CurrentWatts=641 AveWatts=480
Note the definition of Watt.
Notice some potentially incorrect power and CPU load values:
bug_17759: scontrol show node shows CurrentWatts and CPULoad greater than zero for nodes that are powered off (fixed in Slurm 23.11).
Beware that the Slurm bug_9956 states: RAPL plugin: incorrect *Watts and ConsumedEnergy values.
A convenient script showpower is available for printing node power values as well as the total/average for sets of nodes with 1 line per node:
Usage: showpower < -w node-list | -p partition(s) | -a | -h > [ -S sorting-variable ]
where:
-w node-list: Print this node-list
-p partition(s): Print this partition
-a: All nodes in the cluster
-h: Print help information
-S: Sort output by this column (e.g. CurrentWatts)
An example output is:
$ showpower -w d[001-005]
NodeName #CPUs CPU- Current Average Cap ExtSensor ExtSensor
load Watts Watts Watts Watts Joules
d001 56 56.7 681 605 n/a 0 n/s
d002 56 56.5 646 579 n/a 0 n/s
d003 56 56.8 655 582 n/a 0 n/s
d004 56 56.6 544 408 n/a 0 n/s
d005 56 56.6 643 415 n/a 0 n/s
NodeName #CPUs CPU- Current Average Cap ExtSensor ExtSensor
load Watts Watts Watts Watts Joules
TOTAL 280 283.2 3169 2589 0 0 0
Average 56 56.6 633 517 0 0 0
turbostat utility
A CLI utility turbostat is provided by the kernel-tools package for reporting processor topology, frequency, idle power-state statistics, temperature, and power usage on Intel® 64 processors, for example:
$ turbostat --quiet --Summary
The turbostat utility reads the model-specific registers (MSRs) via /dev/cpu/CPUNUM/msr, see man 4 msr.
Power saving configuration
Slurm provides an integrated power_save mechanism for powering down idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode, which can reduce power consumption or fully power down the node. The nodes will be restored to normal operation once work is assigned to them.
We describe the power_save configuration in the Slurm_cloud_bursting page section on Configuring slurm.conf for power saving.
Slurm head server configuration
The following must be done on the Slurm Head node. Create the spool and log directories and make them owned by the slurm user:
mkdir /var/spool/slurmctld /var/log/slurm
chown slurm: /var/spool/slurmctld /var/log/slurm
chmod 755 /var/spool/slurmctld /var/log/slurm
Create log files:
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
Create the (Linux default) accounting files:
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
NOTICE: If you plan to enable job accounting, it is mandatory to configure the database and accounting as explained in the Slurm accounting page.
slurmctld daemon
Start and enable the slurmctld daemon:
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
Copy slurm.conf to all nodes
This section is not relevant when running a Configless Slurm setup.
Copy /etc/slurm/slurm.conf to all compute nodes:
clush -bw <node-list> --copy /etc/slurm/slurm.conf --dest /etc/slurm/slurm.conf
It is important to keep this file identical on both the Head server and all Compute nodes. Remember to include all of the NodeName= lines for all compute nodes.
Compute node configuration
The following must be done on each compute node. Create the slurmd spool and log directories and make the correct ownership:
mkdir /var/spool/slurmd /var/log/slurm
chown slurm: /var/spool/slurmd /var/log/slurm
chmod 755 /var/spool/slurmd /var/log/slurm
Create log files:
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
Executing the command:
slurmd -C
on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which must be added to the global slurm.conf file. For example a node may be defined as:
NodeName=test001 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8010 TmpDisk=32752 Feature=xeon
Warning: You should configure the RealMemory value slightly less than what is reported by slurmd -C, because kernel upgrades may give a slightly lower RealMemory value in the future and cause problems with the node's health status.
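As an illustrative sketch, if slurmd -C reports RealMemory=385380, you might round down in slurm.conf to leave some headroom (node name and value are placeholders):
NodeName=test001 ... RealMemory=384000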
For recent Xeon and EPYC CPUs, the Sub NUMA Cluster (SNC) BIOS setting has been shown to improve performance, see BIOS characterization for HPC with Intel Cascade Lake processors. This will cause each processor socket to have two NUMA domains, one for each of the memory controllers, so a dual-socket server will have 4 NUMA domains, for example:
$ slurmd -C
slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=385380
Here the TmpDisk is defined in slurm.conf as the size of the TmpFS file system (default: /tmp).
It is possible to define another temporary file system in slurm.conf, for example:
TmpFS=/scratch
Start and enable the slurmd daemon:
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
Kernel configuration
It is recommended to review and tune some of the Linux kernel's default limits.
The High_Throughput_Computing_Administration_Guide contains Slurm administrator information specifically for high throughput computing, namely the execution of many short jobs. See also the Large_Cluster_Administration_Guide.
If configurations in /etc/sysctl.conf are updated, you need to run:
sysctl -p
Configure ARP cache for large networks
If the number of network devices (including cluster nodes, BMCs, servers, switches, etc.) approaches or exceeds the value 512, you must consider the Linux kernel's limited dynamic ARP_Cache size, see the arp_command manual page.
ARP (Address Resolution Protocol) is the Linux kernel’s mapping between IP_address (such as 10.1.2.3) and Ethernet MAC_address (such as 00:08:02:8E:05:F2).
If the soft maximum number of entries to keep in the ARP_Cache, gc_thresh2=512, is exceeded, the kernel will try to remove ARP_Cache entries by a garbage collection process.
This is going to hit you in terms of sporadic loss of connectivity between pairs of nodes.
No garbage collection will take place if the ARP_Cache has fewer than gc_thresh1=128 entries, so you should be safe if your network is smaller than this number.
Documentation is in the kernel page for ip-sysctl.
The best solution to this ARP_Cache thrashing problem is to increase the kernel's ARP_Cache garbage collection (gc) parameters by adding these lines to /etc/sysctl.conf:
# Don't allow the arp table to become bigger than this (clusters containing 1024 nodes or more).
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Adjust to arp table gc to clean-up more often
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
Display the current ARP_Cache values by:
sysctl net.ipv4.neigh.default
You may also consider increasing the SOMAXCONN limit (see Large_Cluster_Administration_Guide):
# Limit of socket listen() backlog, known in userspace as SOMAXCONN
net.core.somaxconn = 2048
Configure maximum number of open files
We strongly recommend to increase significantly the kernel's fs.file-max limit on all Slurm compute nodes!
The default slurmd service is configured with a Systemd limit on the number of open files in the service file /usr/lib/systemd/system/slurmd.service:
LimitNOFILE=131072
A customized service file /etc/systemd/system/slurmd.service may also be used and takes precedence.
Please note that the usual limits defined in /etc/security/limits.conf are not relevant to jobs running under the slurmd service! The LimitNOFILE puts a limit on individual Slurm job steps. A compute node may run multiple jobs, each of which may have LimitNOFILE open files. If up to N jobs might run on each node, the Linux kernel must allow for N * LimitNOFILE open files, in addition to the open files used by the OS. Therefore a line should be configured in /etc/sysctl.conf, for example 100 times the LimitNOFILE:
fs.file-max = 13107200
System default values of fs.file-max:
The EL8 fs.file-max calculated by the kernel at boot time is approximately 1/10 of the physical RAM size in units of MB (no explanation is given).
The EL9 fs.file-max is set to the maximum value itself, which is 9223372036854775807 (2^63-1).
Partition limits
If EnforcePartLimits is set to "ALL" in slurm.conf, then jobs which exceed a partition's size and/or limits will be rejected at submission time:
EnforcePartLimits=ALL
NOTE: The partition limits being considered are its configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold.
Job limits
By default, Slurm will propagate all user limits from the submitting node (see ulimit -a) to be effective also within batch jobs.
It is important to configure slurm.conf so that the locked memory limit isn’t propagated to the batch jobs:
PropagateResourceLimitsExcept=MEMLOCK
as explained in https://slurm.schedmd.com/faq.html#memlock. A possible memory limit error with Omni-Path network fabric by Cornelis Networks was discussed in Slurm bug 3363.
In fact, if you have imposed any non-default limits in /etc/security/limits.conf or /etc/security/limits.d/*.conf on the login nodes, you probably want to prohibit these from the batch jobs by configuring:
PropagateResourceLimitsExcept=ALL
See the slurm.conf page for the list of all PropagateResourceLimitsExcept limits.
PAM module restrictions
On Compute nodes you may optionally install the slurm-pam_slurm RPM package, which can prevent rogue users from logging in.
A more important function is the containment of SSH tasks spawned, for example, by MPI libraries that do not use Slurm for launching tasks.
The pam_slurm_adopt module makes sure that child SSH tasks are controlled by Slurm on the job’s master node.
SELinux may conflict with pam_slurm_adopt, so it might need to be disabled by this command:
setenforce 0
Disable SELinux permanently in /etc/selinux/config
:
SELINUX=disabled
For further details, the pam_slurm_adopt module is described by its author in Caller ID: Handling ssh-launched processes in Slurm. Features include:
This module restricts access to compute nodes in a cluster where Slurm is in use. Access is granted to root, any user with a Slurm-launched job currently running on the node, or any user who has allocated resources on the node according to the Slurm database.
Usage of pam_slurm_adopt is described in the source files pam_slurm_adopt. There is also a nice description in bug_4098. Documentation of pam_slurm_adopt is discussed in bug_3567.
The PAM usage of, for example, /etc/pam.d/system-auth on RHEL and clones is configured through the authconfig command.
Configure PrologFlags
Warning: Do NOT configure UsePAM=1 in slurm.conf (this advice can be found on the net). Please see bug_4098 (comment 3).
You need to configure slurm.conf with:
PrologFlags=contain
Then distribute the slurm.conf file to all nodes. Reconfigure the slurmctld service:
scontrol reconfigure
This can be done while the cluster is in production, see bug_4098 (comment 3).
PAM configuration
Warnings:
First make the PrologFlags=contain configuration described above.
DO NOT configure UsePAM=1 in slurm.conf!
Reconfiguration of the PAM setup should only be done on compute nodes that can't run jobs (for example, drained nodes).
You should only configure this on Slurm 17.02.2 or later.
First make sure that you have installed this Slurm package:
rpm -q slurm-pam_slurm
Create a new file in /etc/pam.d/ where the line with pam_systemd.so has been removed:
cd /etc/pam.d/
grep -v pam_systemd.so < password-auth > password-auth-no-systemd
The reason is (quoting pam_slurm_adopt) that:
pam_systemd.so is known to not play nice with Slurm's usage of cgroups. It is recommended that you disable it or possibly add pam_slurm_adopt.so after pam_systemd.so.
Insert some new lines in the file /etc/pam.d/sshd at this place:
...
account required pam_nologin.so
# - PAM config for Slurm - BEGIN
account sufficient pam_slurm_adopt.so
account required pam_access.so
# - PAM config for Slurm - END
account include password-auth
...
and also replace the line:
session include password-auth
by:
# - PAM config for Slurm - BEGIN
session include password-auth-no-systemd
# - PAM config for Slurm - END
Options to the pam_slurm_adopt.so module are documented in the pam_slurm_adopt page.
Now append these lines to /etc/security/access.conf (see man access.conf or the access.conf page for further possibilities):
+ : root : ALL
- : ALL : ALL
so that pam_access.so will:
Allow access to the root user.
Deny access to ALL other users.
This can be tested immediately by trying to make SSH logins to the node. Normal user logins should be rejected with the message:
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by <IP address>
Logins may also fail if SELinux got enabled by accident, check that it is disabled with:
$ getenforce
Disabled
slurmd systemd limits
MPI jobs and other tasks using the InfiniBand or Omni-Path (Cornelis Networks) fabrics must have unlimited locked memory, see above.
Limits defined in /etc/security/limits.conf or /etc/security/limits.d/*.conf are not effective for systemd services, see https://access.redhat.com/solutions/1257953, so any limits must be defined in the service file, see man systemd.exec.
For slurmd running under systemd, the default limits are configured in /usr/lib/systemd/system/slurmd.service as:
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
If you want to modify/override these limits, create a new service file rather than editing the slurmd.service file.
For example, create a file /etc/systemd/system/slurmd.service.d/core_limit.conf with the contents:
[Service]
LimitCORE=0
and do:
systemctl daemon-reload
systemctl restart slurmd
This file could be distributed to all compute nodes from a central location.
The possible process limit parameters are documented in the systemd.exec page section on Process Properties. The list is:
LimitCPU=, LimitFSIZE=, LimitDATA=, LimitSTACK=, LimitCORE=, LimitRSS=, LimitNOFILE=, LimitAS=, LimitNPROC=, LimitMEMLOCK=, LimitLOCKS=, LimitSIGPENDING=, LimitMSGQUEUE=, LimitNICE=, LimitRTPRIO=, LimitRTTIME=
To ensure that job tasks running under Slurm have the desired configuration, verify the slurmd daemon's limits by:
cat /proc/$(pgrep -u 0 slurmd)/limits
If slurmd has a memory lock limit less than expected, it may be because slurmd was started at boot time by the old init-script /etc/init.d/slurm rather than by systemctl.
To remedy this problem see the section Starting slurm daemons at boot time above.
Temporary job directories
Jobs may store temporary files in /tmp, /scratch, and /dev/shm/.
These directories may fill up, and by default no clean-up is done after the job exits.
There are several possible solutions discussed below.
The job_container_tmpfs plugin
You should read the tmpfs_jobcontainer FAQ as well as bug_11183 and bug_11135 for further details. The job_container_tmpfs plugin uses Linux_namespaces.
WARNING:
NFS automount and job_container/tmpfs do not play well together prior to Slurm 23.02:
If a directory does not exist when the tmpfs is created, then that directory cannot be accessed by the job, see bug_14344 and bug_12567.
The issue has been resolved in Slurm 23.02 according to bug_12567.
The job_container.conf configuration file /etc/slurm/job_container.conf must be created, and an example is:
AutoBasePath=true
BasePath=/scratch Dirs=/tmp,/var/tmp,/dev/shm Shared=true
It is important to use the new 23.02 option Shared=true since it enables using autofs on the node.
The slurm.conf must be configured for the job_container_tmpfs plugin:
JobContainerType=job_container/tmpfs
PrologFlags=Contain
The auto_tmpdir plugin
The auto_tmpdir SPANK plugin provides automated handling of temporary directories for jobs (see also this page).
A great advantage of this plugin is that it actually works correctly with NFS home directories automounted by autofs, in contrast to Slurm's job_container_tmpfs plugin prior to 23.02 (see more below). However, it is a bit more complicated to install and maintain third-party plugins.
You can build a customized RPM package for the auto_tmpdir plugin:
CMake version 3.6 (or greater) is required. Make sure the EPEL repo is enabled, then install this package:
dnf install epel-release
dnf install cmake
Download the source:
git clone git@github.com:University-of-Delaware-IT-RCI/auto_tmpdir.git
or:
git clone https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir.git
cd auto_tmpdir
mkdir builddir
cd builddir
Configure the node local temporary directory as /scratch/slurm-<slurm_jobid> (choose whatever scratch disk is appropriate for your cluster installation):
cmake3 -DSLURM_PREFIX=/usr -DSLURM_MODULES_DIR=/usr/lib64 -DCMAKE_BUILD_TYPE=Release -DAUTO_TMPDIR_DEFAULT_LOCAL_PREFIX=/scratch/slurm- ..
make package
Here the .. just refers to the parent directory. The generated RPM package may have a name similar to auto_tmpdir-1.0.1-23.11.8.el8.x86_64.rpm.
.Note: If you are upgrading Slurm to a new major version (like 23.11 to 24.05), you must use a test node to build the new auto_tmpdir RPM:
Uninstall any preexisting RPM:
dnf remove auto_tmpdir
Upgrade Slurm to the new version.
Rebuild the auto_tmpdir RPM as shown above.
Copy the auto_tmpdir RPM to where you keep the Slurm RPMs so that you can upgrade compute nodes with the slurm-* as well as auto_tmpdir packages simultaneously.
Install the auto_tmpdir RPM package on all slurmd compute nodes, as well as all submit/login nodes (see notes below).
Now you can create the file /etc/slurm/plugstack.conf (see the SPANK page) with contents:
required auto_tmpdir.so mount=/tmp mount=/var/tmp
Notes:
The /etc/slurm/plugstack.conf file name can be changed by the PlugStackConfig parameter in slurm.conf.
If you use configless Slurm, the /etc/slurm/plugstack.conf file is automatically distributed from the slurmctld host.
It is not required that plugstack.conf is identical or even installed on every node in the cluster, since Slurm does not check for that. Therefore you can have different configurations on different nodes (except when you use configless Slurm).
If the plugstack.conf file is installed on a submit/login or compute node, it is mandatory that all plugins listed in the file are actually installed as well, otherwise user commands or slurmd will fail with errors. See a discussion in bug_14483.
Quickly restart the slurmd service on all compute nodes to actually activate the /etc/slurm/plugstack.conf feature:
systemctl restart slurmd
This is required in order for new srun commands etc. to run correctly with the SPANK plugin. See the SPANK manual page:
Note: Plugins loaded in slurmd context persist for the entire time slurmd is running, so if configuration is changed or plugins are updated, slurmd must be restarted for the changes to take effect.
For information about Linux_namespaces currently mounted on the compute nodes use:
lsns -t mnt
Other tmpdir solutions
Another SPANK plugin is at https://github.com/hpc2n/spank-private-tmp. This plugin does not do any cleanup, so cleanup will have to be handled separately.
A manual cleanup of temporary files could be made (if needed) by a crontab job on the compute node, for example for the /scratch directory:
# Remove files > 7 days old under /scratch/XXX (mindepth=2)
find /scratch -depth -mindepth 2 -mtime +7 -exec rm -rf {} \;
Login node configuration
The login nodes should have the Slurm packages installed as described in the Slurm installation and upgrading page. See also the Login node firewall section.
Bash command completion for Slurm
The Bash shell includes a TAB bash_command_completion feature (see also bash-completion on GitHub). On EL8/EL9 Linux enable this feature by:
dnf install bash-completion
Slurm includes a slurm_completion_help script which offers completion for user commands like squeue, sbatch etc. It is installed by the slurm-contribs package starting from Slurm 24.11 (see bug_20932). The installed file is /usr/share/bash-completion/completions/slurm_completion.sh.
To enable the slurm_completion_help script on Slurm 24.05 or older, you may manually copy the slurm_completion.sh file to the /etc/bash_completion.d/ folder.
When upgrading to Slurm 24.11 (or later), remember to remove the file again:
rm /etc/bash_completion.d/slurm_completion.sh
Configure Prolog and Epilog scripts
It may be necessary to execute Prolog and/or Epilog scripts on the compute nodes when slurmd executes a task step (by default none are executed). See also the Prolog and Epilog Guide.
In the slurm.conf page this is described:
Prolog
Fully qualified pathname of a program for the slurmd to execute whenever it is asked to run a job step from a new job allocation (e.g. /usr/local/slurm/prolog). A glob pattern (see glob(7)) may also be used to specify more than one program to run (e.g. /etc/slurm/prolog.d/*). The slurmd executes the prolog before starting the first job step. The prolog script or scripts may be used to purge files, enable user login, etc. By default there is no prolog. Any configured script is expected to complete execution quickly (in less time than MessageTimeout).
If the prolog fails (returns a non-zero exit code), this will result in the node being set to a DRAIN state and the job being requeued in a held state, unless nohold_on_prolog_fail is configured in SchedulerParameters. See Prolog and Epilog Scripts for more information.
TaskProlog
Fully qualified pathname of a program to be executed as the slurm job's owner prior to initiation of each task. Besides the normal environment variables, this has SLURM_TASK_PID available to identify the process ID of the task being started. Standard output from this program can be used to control the environment variables and output for the user program. (Further details are in the slurm.conf page.)
TaskEpilog
Fully qualified pathname of a program to be executed as the slurm job's owner after termination of each task. See TaskProlog for execution order details.
See also the items:
PrologEpilogTimeout
PrologFlags
SrunEpilog
Prolog and epilog examples
An example script is shown in the FAQ https://slurm.schedmd.com/faq.html#task_prolog:
#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#
if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
echo "print =========================================="
echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
echo "print SLURM_NODELIST = $SLURM_NODELIST"
echo "print =========================================="
fi
The script is supposed to output commands to be read by slurmd:
The task prolog is executed with the same environment as the user tasks to be initiated. The standard output of that program is read and processed as follows:
export name=value - sets an environment variable for the user task
unset name - clears an environment variable from the user task
print … - writes to the task’s standard output.
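A minimal TaskProlog sketch using the export and print keywords described above (the MY_SCRATCH variable name and the path are hypothetical):
#!/bin/sh
# Export an environment variable into the user's task
echo "export MY_SCRATCH=/scratch/$SLURM_JOB_ID"
# Write a line to the task's standard output
echo "print TaskProlog: MY_SCRATCH has been set"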
Configure partitions
System partitions are configured in slurm.conf, for example:
PartitionName=xeon8 Nodes=a[070-080] Default=YES DefaultTime=50:00:00 MaxTime=168:00:00 State=UP
Partitions may overlap so that some nodes belong to several partitions.
Access to partitions is configured in slurm.conf using AllowAccounts, AllowGroups, or AllowQos.
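An illustrative example restricting a partition to a Unix group (all names and values are placeholders):
PartitionName=bigmem Nodes=b[001-010] AllowGroups=bigmem_users MaxTime=48:00:00 State=UP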
If some partition (like big memory nodes) should have a higher priority, this is controlled in slurm.conf using the multifactor plugin, for example:
PartitionName ... PriorityJobFactor=10
PriorityWeightPartition=1000
Configure multiple nodes and their features
Some defaults may be configured in slurm.conf for similar compute nodes, for example:
NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Weight=1
NodeName=q001
NodeName=q002
...
Node features (similar to node properties used in the Torque resource manager) are defined for each NodeName in slurm.conf by:
Feature:
A comma delimited list of arbitrary strings indicative of some characteristic associated with the node. There is no value associated with a feature at this time, a node either has a feature or it does not. If desired a feature may contain a numeric component indicating, for example, processor speed. By default a node has no features.
Some examples are:
NodeName=DEFAULT Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Feature=xeon8,ethernet Weight=1
NodeName=q001
NodeName=q002
NodeSet configuration
A new NodeSet configuration is available in slurm.conf. The nodeset configuration allows you to define a name for a specific set of nodes which can be used to simplify the partition configuration section, especially for heterogeneous or condo-style systems. Each nodeset may be defined by an explicit list of nodes, and/or by filtering the nodes by a particular configured feature.
This can be used to simplify partitions in slurm.conf, and some examples are:
NodeSet=a_nodes Nodes=a[001-100]
NodeSet=gpu_nodes Feature=GPU
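The NodeSet names can then be used in the Nodes= list of a partition, for example (partition names are illustrative):
PartitionName=gpu Nodes=gpu_nodes Default=NO MaxTime=48:00:00 State=UP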
Node weight
For clusters with heterogeneous node hardware it is useful to assign different Weight values to each type of node, see this slurm.conf parameter:
Weight
The priority of the node for scheduling purposes. All things being equal, jobs will be allocated the nodes with the lowest weight which satisfies their requirements.
This enables prioritization based upon a number of hardware parameters such as GPUs, RAM memory size, CPU clock speed, CPU core number, CPU generation. For example, GPU nodes should be avoided for non-GPU jobs.
A nice method was provided by Kilian Cavalotti of SRCC where a weight mask is used in slurm.conf. Each digit in the weight mask represents a hardware parameter of the node (a weight prefix of 1 is prepended in order to avoid octal conversion). For example, the following weight mask example puts a higher weight on GPUs, then RAM memory, then number of cores, and finally the CPU generation:
# (A weight prefix of "1" is prepended)
# #GRES Memory #Cores CPU_generation
# none: 0 24 GB: 0 8: 0 Nehalem: 1
# 1 GPU: 1 48 GB: 1 16: 1 Sandy Bridge: 2
# 2 GPU: 2 64 GB: 2 24: 2 Ivy Bridge: 3
# 3 GPU: 3 128 GB: 3 32: 3 Broadwell: 4
# 4 GPU: 4 256 GB: 4 36: 4 Skylake: 5
# Example: Broadwell (=4) with 24 cores (=2), 128 GB memory (=3), and 0 GPUs (=0): Weight=10324
This example would be used to assign a Weight value in slurm.conf for the relevant nodes:
NodeName=xxx Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 Weight=10324
A different prioritization of hardware can be selected with different columns and numbers in the mask, but a fixed number is the result of the mask calculation for each type of node.
Generic resources (GRES) and GPUs
Generic resources (GRES) are specified as a comma delimited list of generic resource specifications for a node. Such resources may be consumed by jobs; GPU accelerators are a typical example. In this case you must also configure the gres.conf file.
An example with a gpu GRES may be a gres.conf file:
Nodename=h[001-002] Name=gpu Type=K20Xm File=/dev/nvidia[0-3]
If GRES is used, you must also configure slurm.conf, so define the named GRES in slurm.conf:
GresTypes=gpu
and append a list of GRES resources in the slurm.conf NodeName specifications:
NodeName=h[001-002] Gres=gpu:K20Xm:4
See also the examples in the gres.conf page.
Configure network topology
Slurm can be configured to support topology-aware resource allocation to optimize job performance, see the Topology_Guide and the topology.conf manual page.
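A minimal topology.conf sketch with two leaf switches and one spine switch (all switch and node names are illustrative):
SwitchName=leaf1 Nodes=a[001-018]
SwitchName=leaf2 Nodes=a[019-036]
SwitchName=spine Switches=leaf[1-2]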
Check the consistency of /etc/slurm/topology.conf with the nodelist in /etc/slurm/slurm.conf using the checktopology tool.
Configure firewall for Slurm daemons
The Slurm compute nodes must be allowed to connect to the Head node’s slurmctld daemon. In the configuration file these ports are by default (see slurm.conf):
SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321
Install firewalld by:
dnf install firewalld firewall-config
Head node
Open port 6817 (slurmctld):
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --reload
Alternatively, completely whitelist the compute nodes’ private subnet (here: 10.2.x.x):
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT_direct 0 -s 10.2.0.0/16 -j ACCEPT
firewall-cmd --reload
The configuration is stored in the file /etc/firewalld/direct.xml.
Database (slurmdbd) node
The slurmdbd service by default listens to port 6819, see slurmdbd.conf.
Open port 6819 (slurmdbd):
firewall-cmd --permanent --zone=public --add-port=6819/tcp
firewall-cmd --reload
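To check that the expected ports have been opened in the public zone, list them with:
firewall-cmd --zone=public --list-ports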
Compute node firewall must be off
Quoting Moe Jette from [slurm-dev] No route to host: Which ports are used?:
Other communications (say between srun and the spawned tasks) are intended to operate within a cluster and have no port restrictions.
The simplest solution is to ensure that the compute nodes have no firewall enabled:
systemctl stop firewalld
systemctl disable firewalld
However, you may run a firewall service, as long as you ensure that all ports are open between the compute nodes.
Login node firewall
A login node doesn’t need any special firewall rules for Slurm because no such daemons should be running on login nodes.
Warning: The srun command only works if the login node can:
Connect to the Head node port 6817.
Resolve the DNS name of the compute nodes.
Connect to the Compute nodes port 6818.
Therefore interactive jobs with srun are not possible if your compute nodes are on an isolated private network relative to the Login node.
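To test these requirements from a login node, you can check DNS resolution and TCP connectivity by hand. A sketch, assuming the bind-utils and nmap-ncat packages are installed and using hypothetical host names:
host node001
nc -zv slurm-master 6817
nc -zv node001 6818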
Firewall between slurmctld and slurmdbd
See advice from the Slurm_publications presentation Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018).
SchedMD recommends running the slurmctld and slurmdbd daemons on separate servers, see the My Preferred Deployment Pattern slides in the presentation. If you use this configuration, the firewall is an important issue. See the Related Networking Notes slides in the presentation:
This is almost always an issue with a firewall in between slurmctld and slurmdbd.
slurmdbd opens a new connection to slurmctld to push changes.
If you’ve firewalled that off, the update will not be propagated.
Conclusion:
Open firewall between servers
On these servers, insert a firewalld direct_rule so that any incoming packet from a specific source IP_address (A.B.C.D) gets accepted, for example:
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT_direct 0 -s A.B.C.D/32 -j ACCEPT
Then reload the firewall for any changes to take effect:
firewall-cmd --reload
List the rules by:
firewall-cmd --permanent --direct --get-all-rules
Checking the Slurm daemons
Check the configured daemons using the scontrol command:
scontrol show daemons
To verify the basic cluster partition setup:
scontrol show partition
To display the Slurm configuration:
scontrol show config
To display the compute nodes:
scontrol show nodes
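You can also verify that the controller daemon(s) respond; scontrol ping reports the status of the primary and any backup slurmctld:
scontrol ping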
One may also run the daemons interactively as described in Slurm_Quick_Start (Starting the Daemons): use one window to execute slurmctld -D -vvvvvv and a second window to execute slurmd -D -vvvvv.
Slurm plugins
A Slurm plugin is a dynamically linked code object which is loaded explicitly at run time by the Slurm libraries. A plugin provides a customized implementation of a well-defined API connected to tasks such as authentication, interconnect fabric, and task scheduling.
For plugin documentation see the items in the section Slurm Developers in the Slurm_documentation page.
Plugins include:
Slurm scheduler plugins (schedplugins) implement the Slurm scheduler API.
SPANK - Slurm Plug-in Architecture for Node and job (K)control.
cli_filter Plugin API provides programmatic hooks during the execution of the salloc, sbatch, and srun command line interface (CLI) programs.
The site_factor plugin provides the site with a way to build a custom multifactor priority factor, and it will only be loaded and operate alongside PriorityType=priority/multifactor.
Job submit plugins
The Job_Submit_Plugin (a Lua plugin) will execute a Lua script named /etc/slurm/job_submit.lua on the slurmctld host.
Some clarification of the documentation is needed, however, see bug_14472 and bug_14500.
Sample Lua scripts can be copied from the Slurm source distribution in the directories contribs/lua/ and etc/.
We also provide a job submit plugin in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/plugins
Please note that job_submit.lua.example has an issue with the use of log.user() in job_modify() prior to Slurm 23.02, see bug_14539.
On the slurmctld server you may start with this example:
cp ~/rpmbuild/BUILD/slurm-23.11.8/etc/job_submit.lua.example /etc/slurm/job_submit.lua
(replace the 23.11 version number) and read about Lua programming in the Lua_manual. Also install the Lua package:
dnf install lua
Inspiration for writing your custom job_submit.lua script can be found in:
https://funinit.wordpress.com/2018/06/07/how-to-use-job_submit_lua-with-slurm/
https://github.com/edf-hpc/slurm-llnl-misc-plugins/blob/master/job_submit/job_submit.lua
It is strongly recommended to check your Lua code before using it with Slurm! Any error in the code might cause the slurmctld to crash! If possible, verify the code on a test cluster before using it in a production cluster.
A good starting point is to make a syntax check with the luac compiler:
luac -p /etc/slurm/job_submit.lua
Other Lua syntax checker tools can also be found on the net.
Lua functions for the job_submit plugin
When writing the Job_Submit_Plugin Lua script it is nice to have an overview of available functions and variables. This is not well documented at present.
We have discovered the following functions (TODO: is there a list of all functions?):
slurm.log_info
slurm.log_debug
slurm.log_debug2
slurm.log_debug3
slurm.log_user
The function _get_job_req_field in job_submit_lua.c lists all available job descriptor fields in job_desc; for example, the following may be useful:
job_desc.partition
job_desc.script
job_desc.environment
job_desc.gres
job_desc.num_tasks
job_desc.max_nodes
job_desc.cpus_per_task
job_desc.tres_per_node
job_desc.tres_per_socket
job_desc.tres_per_task
job_desc.user_name
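As a minimal sketch (not the official example; the log message is hypothetical), such fields can be used inside the slurm_job_submit() function, for example to log the submitting user and the requested partition:
-- Minimal sketch for /etc/slurm/job_submit.lua using a few job_desc fields
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- Wrap fields in tostring() since they may be nil if not set by the user
   slurm.log_info("job_submit: user %s requested partition %s",
      tostring(job_desc.user_name), tostring(job_desc.partition))
   return slurm.SUCCESS
end
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end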
NOTE: If some field is undefined in the user’s job script, for example max_nodes, slurmctld sets an “invalid” value (see bug_15012) which can be tested for in /etc/slurm/job_submit.lua:
Numeric values (a Lua double), if absent, will be set to slurm.NO_VAL (32-bit, as defined in /usr/include/slurm/slurm.h). For completeness, there are 16-, 32-, and 64-bit integer values NO_VAL16, NO_VAL, and NO_VAL64 defined in slurm.h and used in the struct job_desc_msg_t.
Slurm error symbols ESLURM* and their corresponding numeric values are defined in the file /usr/include/slurm/slurm_errno.h, see also bug_14500. Note that only a few selected ESLURM* symbols are exposed to the Lua script, but from Slurm 23.02 all the error codes in /usr/include/slurm/slurm_errno.h are exposed.
Your /etc/slurm/job_submit.lua script can test for undefined values as in this example:
-- This error symbol may not be exposed to Lua prior to Slurm 23.02,
-- so its numeric value from slurm_errno.h is defined explicitly:
slurm.ESLURM_INVALID_PARTITION_NAME=2000
if (job_desc.partition == nil) then
   slurm.log_user("No partition specified, please specify partition")
   return slurm.ESLURM_INVALID_PARTITION_NAME
end
if (job_desc.max_nodes == slurm.NO_VAL) then
   slurm.log_user("No max_nodes specified, please specify a number of nodes")
   -- The same error code is reused here for simplicity
   return slurm.ESLURM_INVALID_PARTITION_NAME
end
It is worth noting that the Lua version 5.1.4 from EL7 does not handle nil values well in all cases as discussed in bug_19564: When printing a string with a nil value an error such as bad argument #2 to ‘format’ (string expected, got nil) may occur. Therefore arguments to a print function must be checked for nil values when using Lua 5.1.4. The only known solution is to upgrade Lua to version 5.3.4 (available in EL8).
Configure Slurm for Lua JobSubmitPlugins
The Job_Submit_Plugin will only execute the Lua script named /etc/slurm/job_submit.lua on the slurmctld host; it is not used by any other nodes.
Then configure slurm.conf with this parameter:
JobSubmitPlugins=lua
which will make Slurm use the /etc/slurm/job_submit.lua script.
Make sure to distribute slurm.conf to all nodes (or use a configless setup).
Then reconfigure slurmctld:
scontrol reconfigure
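You can verify that the plugin is active by querying the running configuration:
scontrol show config | grep JobSubmitPlugins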
If slurmctld gets an error when executing /etc/slurm/job_submit.lua, it will use any previously cached script and henceforth ignore the file on disk (see comment 15 in bug_14472).
WARNING: If slurmctld does not have a cached script (because it was just restarted, for example), it may crash!