- Slurm configuration and slurm.conf
- Configless Slurm
- Configurator for slurm.conf
- Starting slurm daemons at boot time
- Reconfiguration of slurm.conf
- Cgroup configuration
- Node Health Check
- Reboot option
- Timeout options
- ReturnToService option
- MaxJobCount limit
- Job arrays
- Requeueing of jobs
- High throughput configuration or large clusters
- Head/Master server configuration
- Compute node configuration
- Configure Prolog and Epilog scripts
- Configure partitions
- Sharing nodes
- Configure multiple nodes and their features
- Configure network topology
- Configure firewall for Slurm daemons
- Firewall between slurmctld and slurmdbd
- Checking the Slurm daemons
- Configure ARP cache for large networks
- Slurm plugins
Starting from Slurm 17.11 you probably want to look at the example configuration files found in this RPM:
rpm -q slurm-example-configs
It is mandatory that slurm.conf is identical on all nodes in the system!
Copy the HTML files to your $HOME directory, for example:
mkdir $HOME/slurm/
cp -rp /usr/share/doc/slurm-*/html $HOME/slurm/
With Slurm 20.02 there is a new configless feature that allows the compute nodes — specifically the slurmd process — and user commands running on login nodes to pull configuration information directly from the slurmctld instead of from a pre-distributed local file.
- Slurm versions 20.02.0 and 20.02.1 had a slurm_pam_adopt issue when using configless mode, see bug_8712.
- Slurm versions up to and including 20.11.7 may start the slurmd service before the network is fully up, causing slurmd to fail. This has been observed on some CentOS 8 systems, see bug_11878. The workaround is to restart the slurmd service manually.
The order of precedence for determining what configuration source to use is listed in the configless page.
systemctl set-environment SLURMD_OPTIONS="-M --conf-server <name of slurmctld server>"
systemctl show-environment
systemctl restart slurmd
Another way is to use systemctl edit slurmd to create an override file, see the systemctl manual page.
_slurmctld._tcp 3600 IN SRV 10 0 6817 slurm-backup
_slurmctld._tcp 3600 IN SRV 0 0 6817 slurm-master
Install these RPMs with tools needed below:
yum install bind-utils hostname
Look up the SRV record with either of:
dig +short -t SRV -n _slurmctld._tcp.`dnsdomainname`
host -t SRV _slurmctld._tcp.`dnsdomainname`
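The two SRV records above differ in their priority field (10 for slurm-backup, 0 for slurm-master). By the usual DNS SRV convention the record with the lowest priority value is tried first. The following is a minimal sketch, not part of Slurm, that orders answers in the `dig +short` format (priority, weight, port, target) accordingly. The function name `srv_by_priority` is hypothetical, and the demo uses canned records rather than a live DNS query.

```shell
#!/bin/sh
# Hypothetical helper: order SRV answers so the preferred slurmctld comes first.
# Input on stdin: lines in "dig +short -t SRV" format: priority weight port target
srv_by_priority() {
    # Sort numerically on the priority field, strip the trailing dot of the
    # target name, and print "target:port".
    sort -k1,1n | awk '{ sub(/\.$/, "", $4); print $4 ":" $3 }'
}

# Demo with canned records matching the example zone entries above
printf '%s\n' '10 0 6817 slurm-backup.' '0 0 6817 slurm-master.' | srv_by_priority
```

On a real system the input could come from the dig command shown above instead of the canned printf lines.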
We generally suggest that you run a slurmd to manage the configs on those nodes that run client commands, including submit or login nodes.
The simplest way to achieve this is described in bug_9832:
Add the login and submit nodes to slurm.conf as default-configured nodes, for example:
and do not add these nodes to any partitions!
Remember to add these nodes to the topology.conf file as well, for example:
and open the firewall on these nodes (see the firewall section below).
Install the slurm-slurmd RPM on the login nodes and make sure to create the logging directory:
mkdir /var/log/slurm
chown slurm:slurm /var/log/slurm
Then start the slurmd service:
systemctl enable slurmd
systemctl start slurmd
Verify that the Slurm config files have been downloaded:
ls -l /run/slurm/conf
You can generate an initial slurm.conf file using several tools:
- The Slurm Version 17.02 Configuration Tool configurator.
- The Slurm Version 17.02 Configuration Tool - Easy Version configurator.easy.
- Build a configuration file using your favorite web browser and open file://$HOME/slurm/html/configurator.html or the simpler file configurator.easy.html.
- Copy the more extensive sample configuration file .../etc/slurm.conf.example from the source tar-ball and use it as a starting point.
Save the resulting output to /etc/slurm/slurm.conf.
The parameters are documented in man slurm.conf and slurm.conf, and it's recommended to read through the long list of parameters.
In slurm.conf it's essential that the important spool directories and the slurm user are defined correctly:
SlurmUser=slurm
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
NOTE: These spool directories must be created manually and owned by user slurm (see below), as they are not part of the RPM installation.
Enable startup of services as appropriate for the given node:
systemctl enable slurmd      # Compute node
systemctl enable slurmctld   # Master/head server
systemctl enable slurmdbd    # Database server
The systemd service files are /usr/lib/systemd/system/slurm*.service.
The Slurm 16.05 RPM packages install and configure the init boot script /etc/init.d/slurm, even on systems like RHEL/CentOS 7 which use systemd (see bug 3371). The bug has been fixed in Slurm 17.02.
If you have Slurm 16.05 (or older) on RHEL/CentOS 7, check if you have enabled the init script:
chkconfig --list slurm
We should modify this setup to use systemd exclusively. First disable the init script on all nodes, including login-nodes:
chkconfig --del slurm
In order to avoid accidentally starting services with /etc/init.d/slurm, it is best to also remove the offending script:
rm -f /etc/init.d/slurm
Then enable the services properly as shown above.
Beware that any update of the Slurm 16.05 RPMs will recreate the missing /etc/init.d/slurm file, so you must remember to remove it after every update.
If there is any question about:
- The availability and sanity of the daemons' spool directories (perhaps on remote storage)
- The MySQL database
- If Slurm has been upgraded to a new version
it may be a good idea to start each service manually instead of automatically as shown above. For example:
Watch the output for any signs of problems. If the daemon looks sane, type Control-C and start the service in the normal way:
systemctl start slurmctld
From the scontrol man-page about the reconfigure option:
- Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. This mechanism would be used to modify configuration parameters (Epilog, Prolog, SlurmctldLogFile, SlurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the request to all other daemons (slurmd daemon on each compute node). Running jobs continue execution.
- Most configuration parameters can be changed by just running this command; however, Slurm daemons should be shut down and restarted if any of these parameters are to be changed:
- AuthType, BackupAddr, BackupController, ControlAddr, ControlMach, PluginDir, StateSaveLocation, SlurmctldPort or SlurmdPort.
- The slurmctld daemon and all slurmd daemons must be restarted if nodes are added to or removed from the cluster.
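The rule quoted above can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `needs_restart` is hypothetical, the parameter list is copied from the man-page excerpt above, and treating `NodeName` as a restart trigger is an interpretation of the "nodes are added to or removed from the cluster" bullet.

```shell
#!/bin/sh
# Hypothetical helper: report whether a change to the given slurm.conf
# parameter calls for a daemon restart or only "scontrol reconfigure".
needs_restart() {
    case "$1" in
        AuthType|BackupAddr|BackupController|ControlAddr|ControlMach) echo "restart" ;;
        PluginDir|StateSaveLocation|SlurmctldPort|SlurmdPort)         echo "restart" ;;
        NodeName)  echo "restart" ;;      # assumption: stands for adding/removing nodes
        *)         echo "reconfigure" ;;  # everything else: scontrol reconfigure
    esac
}

needs_restart SlurmdPort
needs_restart Epilog
```

The list of restart-only parameters may differ between Slurm versions, so check the scontrol man-page of the installed version before relying on it.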
According to the scontrol man-page, when adding or removing nodes to slurm.conf, it is necessary to restart slurmctld. However, it is also necessary to restart the slurmd daemon on all nodes, see bug_3973:
It is also possible to add nodes to slurm.conf with a state of future:
FUTURE Indicates the node is defined for future use and need not exist when the Slurm daemons are started. These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon. After these nodes are made available, change their State in the slurm.conf file. Until these nodes are made available, they will not be seen using any Slurm commands, nor will any attempt be made to contact them.
However, such future nodes must not be members of any partition.
Control Groups (Cgroups v1) provide a Linux kernel mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
Documentation about the usage of Cgroups:
To list current Cgroups use the command:
lscgroup
lscgroup -g cpu:/
ps --no-headers -eo pid,user,comm,cgroup | egrep -vw 'root|freezer:/slurm.*devices:/slurm.*cpuacct,cpu:/slurm.*memory:/slurm|cpuset:/slurm.*|dbus-daemon|munged|ntpd|gmond|polkitd|chrony|smmsp|rpcuser|rpc'
- proctrack (process tracking)
- task (task management)
- jobacct_gather (job accounting statistics)
If you use jobacct_gather, change the default ProctrackType in slurm.conf:
otherwise you'll get this warning in the slurmctld log:
WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
Notice: Linux kernel 2.6.38 or greater is strongly recommended, see the Cgroups_Guide General Usage Notes.
In this example we want to constrain jobs to the number of CPU cores as well as RAM memory requested by the job.
For a discussion see bug 3853.
You should probably also configure this (unless you have lots of short running jobs):
see the section ProctrackType of slurm.conf.
Create cgroup.conf file:
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
Edit the file to change these lines:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
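If you prefer to script this edit rather than use an editor, a minimal sketch could look like the following. The helper name `set_conf_key` is hypothetical and not a Slurm tool. It assumes one `Key=value` setting per line, replaces an existing (possibly commented-out) line for the key, and appends the setting if the key is absent.

```shell
#!/bin/sh
# Hypothetical helper: set Key=value in a simple "Key=value" config file.
# Usage: set_conf_key FILE KEY VALUE
set_conf_key() {
    file=$1 key=$2 value=$3
    if grep -q "^[#[:space:]]*${key}=" "$file"; then
        # Replace the existing (possibly commented-out) line for this key
        sed "s|^[#[:space:]]*${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        # Key not present: append it
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example against the file created above (run as root on the node):
#   for key in ConstrainCores ConstrainRAMSpace ConstrainSwapSpace; do
#       set_conf_key /etc/slurm/cgroup.conf "$key" yes
#   done
```

Review the resulting file before distributing it to the compute nodes.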
The cgroup.conf page defines:
ConstrainCores If configured to "yes" then constrain allowed cores to the subset of allocated resources. It uses the cpuset subsystem.
ConstrainRAMSpace If configured to "yes" then constrain the job's RAM usage. The default value is "no", in which case the job's RAM limit will be set to its swap space limit. Also see AllowedSwapSpace, AllowedRAMSpace and ConstrainSwapSpace.
ConstrainSwapSpace If configured to "yes" then constrain the job's swap space usage. The default value is "no". Note that when set to "yes" and ConstrainRAMSpace is set to "no", AllowedRAMSpace is automatically set to 100% in order to limit the RAM+Swap amount to 100% of job's requirement plus the percent of allowed swap space. This amount is thus set to both RAM and RAM+Swap limits. This means that in that particular case, ConstrainRAMSpace is automatically enabled with the same limit as the one used to constrain swap space. Also see AllowedSwapSpace.
You may also consider defining MemSpecLimit in slurm.conf:
- MemSpecLimit Amount of memory, in megabytes, reserved for system use and not available for user allocations. If the task/cgroup plugin is configured and that plugin constrains memory allocations (i.e. TaskPlugin=task/cgroup in slurm.conf, plus ConstrainRAMSpace=yes in cgroup.conf), then Slurm compute node daemons (slurmd plus slurmstepd) will be allocated the specified memory limit. The daemons will not be killed if they exhaust the memory allocation (i.e. the Out-Of-Memory Killer is disabled for the daemon's memory cgroup). If the task/cgroup plugin is not configured, the specified memory will only be unavailable for user allocations.
See an interesting discussion in bug 2713.
If compute nodes mount Lustre or NFS file systems, it may be a good idea to configure cgroup.conf with:
See the cgroup.conf man-page, bug_3874 and [slurm-dev] Interaction between cgroups and NFS. This requires Slurm 17.02.5 or later, see NEWS. After distributing the cgroup.conf file to all nodes, run scontrol reconfigure.
There may be some problems with Cgroups.
Jobs may crash with an error like:
slurmstepd: error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_207887' : No space left on device
This is explained in bug_3890. It may be a kernel bug (CentOS 7 has kernel 3.10), see:
Workaround: Reboot the node.
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY
This will execute NHC every 60 minutes on nodes in any state, see the slurm.conf documentation about Health* variables. There are other criteria for when to execute NHC as defined by HealthCheckNodeState in slurm.conf: ALLOC, ANY, CYCLE, IDLE, MIXED.
We add the following lines in the NHC configuration file /etc/nhc/nhc.conf for nodes in the domain nifl.fysik.dtu.dk:
* || NHC_RM=slurm
# Flag df to list only local filesystems (omit NFS mounts)
* || DF_FLAGS="-Tkl"
* || DFI_FLAGS="-Til"
# Setting short hostname for compute nodes (default in our Slurm setup)
*.nifl.fysik.dtu.dk || HOSTNAME=$HOSTNAME_S
# Busy batch nodes may take a long time to run nhc
*.nifl.fysik.dtu.dk || TIMEOUT=120
# Check OmniPath/Infiniband link
x*.nifl.fysik.dtu.dk || check_hw_ib 100
For example, to execute the NHC check once per hour with a specified E-mail interval of 1 day, add this to the system's crontab:
# Node Health Check
3 * * * * /usr/sbin/nhc-wrapper -X 1d
Nvidia has a new Data Center GPU Manager (DCGM) suite of tools which includes NVIDIA Validation Suite (NVVS). Download of DCGM requires membership of the Data Center GPU Manager (DCGM) Program. Install the RPM by:
yum install datacenter-gpu-manager-1.7.1-1.x86_64.rpm
Run the NVVS tool:
nvvs -g -l /tmp/nvvs.log
The (undocumented?) log file (-l) seems to be required.
It may be useful instead to check for the presence of the GPU devices with a check similar to this (for 4 GPU devices):
gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
It seems that these device files do not get created automatically at reboot, but only if you run this (for example, in /etc/rc.local):
The physical presence of Nvidia devices can be tested by this command:
# lspci | grep NVIDIA
* || NHC_RM=slurm
because NHC (version 1.4.2) may autodetect NHC_RM=pbs if the file /usr/bin/pbsnodes is present (see issue 20).
Both bugs should be fixed in NHC 1.4.3 (when it becomes available).
Nodes may occasionally have to be rebooted after firmware or kernel upgrades.
scontrol reboot [ASAP] [NodeList]
The ASAP flag is available from Slurm 17.02, see man scontrol for earlier versions.
Add this line to slurm.conf:
The path to reboot may be different on other OSes.
Notice: Command arguments to RebootProgram like:
RebootProgram="/sbin/shutdown -r now"
A number of Timeout options may be configured in slurm.conf.
Values above 127 should not be used, see bug_11103.
This may also be accompanied by a custom command UnkillableStepProgram. If this timeout is reached, the node will also be drained with reason batch job complete failure.
The ReturnToService option in slurm.conf controls when a DOWN node will be returned to service, see slurm.conf and the FAQ Why is a node shown in state DOWN when the node has registered for service?.
In slurm.conf is defined:
MaxJobCount The maximum number of jobs Slurm can have in its active database at one time. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. Once this limit is reached, requests to submit additional jobs will fail. The default value is 10000 jobs.
If you exceed 10000 jobs in the queue, users will get an error when submitting jobs:
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
sbatch: error: Batch job submission failed: Resource temporarily unavailable
Add a higher value to slurm.conf, for example:
Another parameter in slurm.conf may perhaps need modification with higher MaxJobCount:
MinJobAge The minimum age of a completed job before its record is purged from Slurm's active database. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. The default value is 300 seconds.
In addition, it may be a good idea to implement MaxSubmitJobs and MaxJobs resource_limits for user associations or QOSes, for example:
sacctmgr modify user where name=<username> set MaxJobs=100 MaxSubmitJobs=500
The job_arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits).
A slurm.conf configuration parameter controls the maximum job array size:
Be mindful about the value of MaxArraySize as job arrays offer an easy way for users to submit large numbers of jobs very quickly.
Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. The following parameter in slurm.conf may be changed for the default ability for batch jobs to be requeued:
It works as follows:
- If JobRequeue is set to a value of 1, then batch jobs may be requeued unless explicitly disabled by the user.
- If JobRequeue is set to a value of 0, then batch jobs will not be requeued unless explicitly enabled by the user.
- The default value is 1.
Use:
sbatch --no-requeue or --requeue
to change the default behavior for individual jobs.
The following document contains Slurm administrator information specifically for high throughput computing, namely the execution of many short jobs. Getting optimal performance for high throughput computing does require some tuning and this document should help you off to a good start:
The following document contains Slurm administrator information specifically for clusters containing 1,024 nodes or more:
The following must be done on the Head/Master node. Create the spool and log directories and make them owned by the slurm user:
mkdir /var/spool/slurmctld /var/log/slurm
chown slurm: /var/spool/slurmctld /var/log/slurm
chmod 755 /var/spool/slurmctld /var/log/slurm
Create log files:
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
Create the (Linux default) accounting file:
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
NOTICE: If you plan to enable job accounting, it is mandatory to configure the database and accounting as explained in the Slurm_accounting page.
Start and enable the slurmctld daemon:
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
Warning: With Slurm 14.x and a compute node running RHEL 7 there is a known bug, "systemctl start/stop does not work on RHEL 7". This problem has apparently been resolved in Slurm 15.08.
Finally copy /etc/slurm/slurm.conf to all compute nodes:
scp -p /etc/slurm/slurm.conf nodeXXX:/etc/slurm/slurm.conf
It's convenient to use the pdsh command, see PDSH.
It is important to keep this file identical on both the Head/Master server and all Compute nodes. Remember to include all of the NodeName= lines for all compute nodes.
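One way to check that two copies of slurm.conf really are identical is to compare checksums. The sketch below is an illustration only. The function names `conf_digest` and `compare_conf` are hypothetical, and it assumes the `sha256sum` tool from GNU coreutils is installed.

```shell
#!/bin/sh
# Hypothetical helper: print only the SHA-256 digest of a file
conf_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Hypothetical helper: compare a reference copy with another copy.
# Prints "identical" or "DIFFERENT".
compare_conf() {
    if [ "$(conf_digest "$1")" = "$(conf_digest "$2")" ]; then
        echo "identical"
    else
        echo "DIFFERENT"
    fi
}

# Example (file names are placeholders):
#   compare_conf /etc/slurm/slurm.conf /tmp/slurm.conf.from-node
```

With pdsh installed, the digests of all nodes can be collected in one go, for example by running sha256sum /etc/slurm/slurm.conf through pdsh and folding identical answers with dshbak -c. The node list to pass to pdsh is site-specific.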
The following must be done on each compute node. Create the slurmd spool and log directories and make the correct ownership:
mkdir /var/spool/slurmd /var/log/slurm
chown slurm: /var/spool/slurmd /var/log/slurm
chmod 755 /var/spool/slurmd /var/log/slurm
Create log files:
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
Executing the command:
slurmd -C
on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which must be added to the global slurm.conf file. For example a node may be defined as:
NodeName=test001 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8010 TmpDisk=32752 Feature=xeon
For recent Xeon and EPYC CPUs, the Sub NUMA Cluster (SNC) BIOS setting has been shown to improve performance, see BIOS characterization for HPC with Intel Cascade Lake processors. This will cause each processor socket to have two NUMA domains, one for each of the memory controllers, so a dual-socket server will have 4 NUMA domains, for example:
$ slurmd -C
slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=385380
NodeName=test001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8010 TmpDisk=32752 Feature=xeon
This has been fixed in Slurm 20.02.4.
Start and enable the slurmd daemon:
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
If EnforcePartLimits is set to "ALL" then jobs which exceed a partition's size and/or limits will be rejected at submission time:
NOTE: The partition limits being considered are its configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold.
By default, Slurm will propagate all user limits from the submitting node (see ulimit -a) to be effective also within batch jobs.
It is important to configure slurm.conf so that the locked memory limit isn't propagated to the batch jobs:
In fact, if you have imposed any non-default limits in /etc/security/limits.conf or /etc/security/limits.d/\*.conf in the login nodes, you probably want to prohibit these from the batch jobs by configuring:
See the slurm.conf page for the list of all PropagateResourceLimitsExcept limits.
On Compute nodes you may additionally install the slurm-pam_slurm RPM package to prevent rogue users from logging in. A more important function is the containment of SSH tasks, for example, those started by some MPI libraries not using Slurm for spawning tasks. The pam_slurm_adopt module makes sure that child SSH tasks are controlled by Slurm on the job's master node.
SELinux may conflict with pam_slurm_adopt, so it might need to be disabled by this command:
Disable SELinux permanently in /etc/selinux/config:
For further details, the pam_slurm_adopt module is described by its author in Caller ID: Handling ssh-launched processes in Slurm. Features include:
- This module restricts access to compute nodes in a cluster where Slurm is in use. Access is granted to root, any user with a Slurm-launched job currently running on the node, or any user who has allocated resources on the node according to Slurm.
The PAM usage of, for example, /etc/pam.d/system-auth on CentOS/RHEL is configured through the authconfig command.
You need to configure slurm.conf with:
This can be done while the cluster is in production, see bug_4098 (comment 3).
- First make the PrologFlags=contain configuration described above.
- Do NOT configure UsePAM=1 in slurm.conf.
- Reconfiguration of the PAM setup should only be done on compute nodes that can't run jobs (for example, drained nodes).
- You should only configure this on Slurm 17.02.2 or later.
First make sure that you have installed this Slurm package:
rpm -q slurm-pam_slurm
Create a new file in /etc/pam.d/ where the line with pam_systemd.so has been removed:
cd /etc/pam.d/
grep -v pam_systemd.so < password-auth > password-auth-no-systemd
The reason is (quoting pam_slurm_adopt) that:
- pam_systemd.so is known to not play nice with Slurm's usage of cgroups. It is recommended that you disable it or possibly add pam_slurm_adopt.so after pam_systemd.so.
Insert some new lines in the file /etc/pam.d/sshd at this place:
...
account    required     pam_nologin.so
# - PAM config for Slurm - BEGIN
account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so
# - PAM config for Slurm - END
account    include      password-auth
...
and also replace the line:
session    include      password-auth
with:
# - PAM config for Slurm - BEGIN
session    include      password-auth-no-systemd
# - PAM config for Slurm - END
Options to the pam_slurm_adopt.so module are documented in the pam_slurm_adopt page.
Now append these lines to /etc/security/access.conf (see man access.conf or access.conf for further possibilities):
+ : root : ALL
- : ALL : ALL
so that pam_access.so will:
- Allow access to the root user.
- Deny access to ALL other users.
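Because pam_access uses the first matching entry, the root rule must come before the deny-all rule, and re-running a setup script should not duplicate either line. A minimal sketch of an idempotent append follows. The helper name `append_once` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append LINE to FILE unless an identical line exists.
# Usage: append_once FILE LINE
append_once() {
    file=$1 line=$2
    # -x: whole-line match, -F: fixed string, -e: allow lines starting with "-"
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root on a compute node); the order of the calls matters:
#   append_once /etc/security/access.conf '+ : root : ALL'
#   append_once /etc/security/access.conf '- : ALL : ALL'
```

If access.conf already contains other rules, check that they do not match before the two lines appended here.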
This can be tested immediately by trying to make SSH logins to the node. Normal user logins should be rejected with the message:
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by <IP address>
MPI jobs and other tasks using the Infiniband or OmniPath fabrics must have unlimited locked memory, see above. Limits defined in /etc/security/limits.conf or /etc/security/limits.d/\*.conf are not effective for systemd services, see https://access.redhat.com/solutions/1257953, so any limits must be defined in the service file, see man systemd.exec.
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
If you want to modify/override these limits, create a drop-in override file rather than editing the slurmd.service file. For example, create a file /etc/systemd/system/slurmd.service.d/core_limit.conf with the contents:
systemctl daemon-reload
systemctl restart slurmd
This file could be distributed to all compute nodes from a central location.
The possible process limit parameters are documented in the systemd.exec page section on Process Properties. The list is:
LimitCPU=, LimitFSIZE=, LimitDATA=, LimitSTACK=, LimitCORE=, LimitRSS=, LimitNOFILE=, LimitAS=, LimitNPROC=, LimitMEMLOCK=, LimitLOCKS=, LimitSIGPENDING=, LimitMSGQUEUE=, LimitNICE=, LimitRTPRIO=, LimitRTTIME=
To ensure that job tasks running under Slurm have the desired configuration, verify the slurmd daemon's limits by:
cat /proc/$(pgrep -u 0 slurmd)/limits
If slurmd has a locked memory limit lower than expected, it may be because slurmd was started at boot time by the old init-script /etc/init.d/slurm rather than by systemctl. To remedy this problem see the section Starting slurm daemons at boot time above.
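To check just the locked memory limit instead of reading the whole table, the limits file can be filtered. This is a minimal sketch. The function name `locked_mem_soft_limit` is hypothetical, and it assumes the usual /proc/PID/limits layout in which the soft limit is the fourth whitespace-separated field of the "Max locked memory" row.

```shell
#!/bin/sh
# Hypothetical helper: print the soft "Max locked memory" limit found in a
# /proc/<pid>/limits style file given as the first argument.
locked_mem_soft_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Example on a compute node (assumes exactly one slurmd process owned by root):
#   locked_mem_soft_limit "/proc/$(pgrep -u 0 slurmd)/limits"
```

For MPI jobs on Infiniband or OmniPath fabrics the value should be unlimited, matching LimitMEMLOCK=infinity in the service file.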
By default jobs started by slurmd do not use PAM and therefore do not honor the /etc/security/limits.conf file. This behavior may be changed by adding to slurm.conf (see the man-page):
Then you can create a file /etc/pam.d/slurm containing:
auth      required  pam_localuser.so
account   required  pam_unix.so
session   required  pam_limits.so
In the slurm.conf page this is described:
Fully qualified pathname of a program for the slurmd to execute whenever it is asked to run a job step from a new job allocation (e.g. /usr/local/slurm/prolog). A glob pattern (See glob(7)) may also be used to specify more than one program to run (e.g. /etc/slurm/prolog.d/*). The slurmd executes the prolog before starting the first job step. The prolog script or scripts may be used to purge files, enable user login, etc.
By default there is no prolog. Any configured script is expected to complete execution quickly (in less time than MessageTimeout).
If the prolog fails (returns a non-zero exit code), this will result in the node being set to a DRAIN state and the job being requeued in a held state, unless nohold_on_prolog_fail is configured in SchedulerParameters. See Prolog and Epilog Scripts for more information.
TaskProlog Fully qualified pathname of a program to be executed as the slurm job's owner prior to initiation of each task. Besides the normal environment variables, this has SLURM_TASK_PID available to identify the process ID of the task being started. Standard output from this program can be used to control the environment variables and output for the user program (further details in the slurm.conf page).
TaskEpilog Fully qualified pathname of a program to be executed as the slurm job's owner after termination of each task. See TaskProlog for execution order details.
See also the items:
An example script is shown in the FAQ https://slurm.schedmd.com/faq.html#task_prolog:
#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#

if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
fi
The script is supposed to output commands to be read by slurmd:
- The task prolog is executed with the same environment as the user tasks to be initiated.
The standard output of that program is read and processed as follows:
- export name=value - sets an environment variable for the user task
- unset name - clears an environment variable from the user task
- print ... - writes to the task's standard output.
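The three directives listed above can be combined in a single script. The following is a minimal sketch, not a recommended production prolog. The function name `task_prolog` and the choice of variables (OMP_NUM_THREADS, DISPLAY) are illustrative assumptions. SLURM_CPUS_PER_TASK and SLURM_PROCID are taken from the task environment when set, with fallbacks so the sketch also runs outside Slurm.

```shell
#!/bin/sh
# Sketch of a TaskProlog: each line written to stdout is a directive for slurmd.
task_prolog() {
    # "export": set a variable in the user task's environment
    echo "export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}"
    # "unset": clear a variable from the user task's environment
    echo "unset DISPLAY"
    # "print": write a line to the task's standard output
    echo "print Task ${SLURM_PROCID:-0} starting on $(uname -n)"
}

task_prolog
```

Run by hand, the script simply prints the three directive lines, which is a convenient way to debug it before configuring TaskProlog in slurm.conf.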
System partitions are configured in slurm.conf, for example:
PartitionName=xeon8 Nodes=a[070-080] Default=YES DefaultTime=50:00:00 MaxTime=168:00:00 State=UP
Partitions may overlap so that some nodes belong to several partitions.
Access to partitions is configured in slurm.conf using AllowAccounts, AllowGroups, or AllowQos.
If some partition (like big memory nodes) should have a higher priority, this is controlled in slurm.conf using the multifactor plugin, for example:
PartitionName ... PriorityJobFactor=10
PriorityWeightPartition=1000
Some defaults may be configured in slurm.conf for similar compute nodes, for example:
NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Weight=1
NodeName=q001
NodeName=q002
...
Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error messages, see bug_9241. Use this instead:
NodeName=DEFAULT Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Weight=1
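For a slurm.conf with many node definitions, the same rewrite can be applied mechanically. The following is a minimal sketch that assumes single-board nodes only (Boards=1), where Sockets equals SocketsPerBoard. The function name `to_sockets` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite "Boards=1 SocketsPerBoard=N" as "Sockets=N".
# Only valid for single-board nodes; other lines pass through unchanged.
to_sockets() {
    sed -e 's/Boards=1 SocketsPerBoard=\([0-9][0-9]*\)/Sockets=\1/'
}

# Demo on the node definition used above
echo 'NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1' | to_sockets
```

The filter reads stdin and writes stdout, so it can be tried on a copy of slurm.conf and the result reviewed before replacing the real file.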
A comma delimited list of arbitrary strings indicative of some characteristic associated with the node. There is no value associated with a feature at this time, a node either has a feature or it does not. If desired a feature may contain a numeric component indicating, for example, processor speed. By default a node has no features.
Some examples are:
NodeName=DEFAULT Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8000 TmpDisk=32752 Feature=xeon8,ethernet Weight=1
NodeName=q001
NodeName=q002
From Slurm 20.02 a new NodeSet configuration is available in slurm.conf.
The nodeset configuration allows you to define a name for a specific set of nodes which can be used to simplify the partition configuration section, especially for heterogeneous or condo-style systems. Each nodeset may be defined by an explicit list of nodes, and/or by filtering the nodes by a particular configured feature.
This can be used to simplify partitions in slurm.conf, and some examples are:
NodeSet=a_nodes Nodes=a[001-100]
NodeSet=gpu_nodes Feature=GPU
For clusters with heterogeneous node hardware it is useful to assign different Weight values to each type of node, see this slurm.conf parameter:
Weight The priority of the node for scheduling purposes. All things being equal, jobs will be allocated the nodes with the lowest weight which satisfies their requirements.
This enables prioritization based upon a number of hardware parameters such as GPUs, RAM memory size, CPU clock speed, CPU core number, CPU generation. For example, GPU nodes should be avoided for non-GPU jobs.
A nice method was provided by Kilian Cavalotti of SRCC where a weight mask is used in slurm.conf. Each digit in the weight mask represents a hardware parameter of the node (a weight prefix of 1 is prepended in order to avoid octal conversion). For example, the following weight mask example puts a higher weight on GPUs, then RAM memory, then number of cores, and finally the CPU generation:
# (A weight prefix of "1" is prepended)
#
#GRES        Memory       #Cores    CPU_generation
# none: 0    24 GB: 0     8: 0      Nehalem: 1
# 1 GPU: 1   48 GB: 1     16: 1     Sandy Bridge: 2
# 2 GPU: 2   64 GB: 2     24: 2     Ivy Bridge: 3
# 3 GPU: 3   128 GB: 3    32: 3     Broadwell: 4
# 4 GPU: 4   256 GB: 4    36: 4     Skylake: 5
# Example: Broadwell (=4) with 24 cores (=2), 128 GB memory (=3), and 0 GPUs (=0): Weight=10324
This example would be used to assign a Weight value in slurm.conf for the relevant nodes:
NodeName=xxx Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=128000 Weight=10324
A different prioritization of hardware can be selected with different columns and numbers in the mask, but a fixed number is the result of the mask calculation for each type of node.
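The mask calculation is just a concatenation of digits: the prefix 1, then the GRES, memory, cores and CPU generation digits in that order. A minimal sketch follows. The function name `node_weight` is hypothetical, and the digit meanings follow the example table above.

```shell
#!/bin/sh
# Hypothetical helper: compose a node Weight from four mask digits.
# Usage: node_weight GRES MEMORY CORES CPU_GENERATION   (each a single digit 0-9)
node_weight() {
    for d in "$@"; do
        case "$d" in
            [0-9]) ;;
            *) echo "single digit expected, got: $d" >&2; return 1 ;;
        esac
    done
    [ "$#" -eq 4 ] || { echo "four digits expected" >&2; return 1; }
    # The leading 1 avoids a leading zero, which could be read as octal
    printf '1%d%d%d%d\n' "$1" "$2" "$3" "$4"
}

# Broadwell (=4) with 24 cores (=2), 128 GB memory (=3), and 0 GPUs (=0)
node_weight 0 3 2 4   # prints 10324
```

Keeping each column to a single digit is what makes the weights comparable, so a mask with more than ten classes in one column would need a wider field.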
Generic resources (GRES) are configured as a comma-delimited list of GRES specifications for a node. Such resources, for example GPU accelerators, may be occupied by jobs. In this case you must also configure the gres.conf file.
Nodename=h[001-002] Name=gpu Type=K20Xm File=/dev/nvidia[0-3]
See also the examples in the gres.conf page.
Check consistency of /etc/slurm/topology.conf with nodelist in /etc/slurm/slurm.conf using the checktopology tool.
SlurmctldPort=6817
SlurmdPort=6818
SchedulerPort=7321
The CentOS7/RHEL7 default firewall service is firewalld and not the well-known iptables service. The dynamic firewall daemon firewalld provides a dynamically managed firewall with support for network “zones” to assign a level of trust to a network and its associated connections and interfaces. See Introduction to firewalld.
A nice introduction is RHEL7: How to get started with Firewalld.
Install firewalld by:
yum install firewalld firewall-config
Open port 6817 (slurmctld):
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --reload
Alternatively, completely whitelist the compute nodes' private subnet (here: 10.2.x.x):
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT_direct 0 -s 10.2.0.0/16 -j ACCEPT
firewall-cmd --reload
The configuration is stored in the file /etc/firewalld/direct.xml.
Open port 6819 (slurmdbd):
firewall-cmd --permanent --zone=public --add-port=6819/tcp
firewall-cmd --reload
Quoting Moe Jette from [slurm-dev] No route to host: Which ports are used?:
Other communications (say between srun and the spawned tasks) are intended to operate within a cluster and have no port restrictions.
The simplest solution is to run no firewall at all on the compute nodes:
systemctl stop firewalld
systemctl disable firewalld
However, you may run a firewall service, as long as you ensure that all ports are open between the compute nodes.
A login node doesn't need any special firewall rules for Slurm because no Slurm daemons should be running on login nodes.
Warning: The srun command only works if the login node can:
- Connect to the Head node port 6817.
- Resolve the DNS name of the compute nodes.
- Connect to the Compute nodes port 6818.
Therefore interactive batch jobs with srun seem to be impossible if your compute nodes are on an isolated private network relative to the Login node.
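The three requirements above can be checked from a login node with a short shell sketch. The host names `head.example.com` and `node001.example.com` and the function names are hypothetical placeholders, and the ports assume the default SlurmctldPort=6817 and SlurmdPort=6818. The TCP test relies on bash's `/dev/tcp` feature and the coreutils `timeout` command being available:

```shell
# Sketch: verify from a login node the three things srun needs.
# HEAD and NODE are hypothetical host names; replace them with your own.
HEAD=${HEAD:-head.example.com}
NODE=${NODE:-node001.example.com}

# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
check_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Print one message for each requirement that is not met.
check_srun_path() {
    check_port "$HEAD" 6817 || echo "cannot reach slurmctld on $HEAD:6817"
    getent hosts "$NODE" >/dev/null || echo "cannot resolve $NODE"
    check_port "$NODE" 6818 || echo "cannot reach slurmd on $NODE:6818"
}
```

Running `check_srun_path` prints nothing when all three checks pass. Any message it prints points at the firewall or DNS item that would block srun.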
See advice from the Slurm_publications presentation Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim Wickberg, SchedMD (2018).
If you use this configuration, the firewall is an important issue.
See the Related Networking Notes slides in the presentation:
- This is almost always an issue with a firewall in between slurmctld and slurmdbd.
- slurmdbd opens a new connection to slurmctld to push changes.
- If you’ve firewalled that off, the update will not be propagated.
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT_direct 0 -s A.B.C.D/32 -j ACCEPT
Then reload the firewall for any changes to take effect:
firewall-cmd --reload
List the rules by:
firewall-cmd --permanent --direct --get-all-rules
Check the configured daemons using the scontrol command:
scontrol show daemons
To verify the basic cluster partition setup:
scontrol show partition
To display the Slurm configuration:
scontrol show config
To display the compute nodes:
scontrol show nodes
One may also run the daemons interactively as described in Slurm_Quick_Start (Starting the Daemons). You can use one window to execute slurmctld -D -vvvvvv, and a second window to execute slurmd -D -vvvvv.
If the number of network devices (cluster nodes plus switches etc.) approaches or exceeds 512, you must consider the Linux kernel's limited dynamic ARP-cache size. Please read the man-page man 7 arp about the kernel's ARP-cache.
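To see how close a host currently is to the limit, the number of neighbour entries can be compared with the kernel's hard maximum, gc_thresh3. This is only a rough sketch: `arp_headroom` is a hypothetical helper name, and the count from `ip -4 neigh show` includes stale and incomplete entries:

```shell
# Remaining room in the ARP cache: gc_thresh3 ($2) minus current entries ($1).
arp_headroom() {
    echo $(( $2 - $1 ))
}

# Count the current IPv4 neighbour entries and read the hard limit.
# Both values fall back to 0 if the tools are unavailable on this host.
entries=$(ip -4 neigh show 2>/dev/null | wc -l)
limit=$(sysctl -n net.ipv4.neigh.default.gc_thresh3 2>/dev/null || echo 0)
echo "ARP entries: $entries  gc_thresh3: $limit  headroom: $(arp_headroom "$entries" "$limit")"
```

A small or negative headroom on the head node suggests that the thresholds need to be raised.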
The best solution to this ARP-cache thrashing problem is to increase the kernel's ARP-cache garbage collection (gc) parameters by adding these lines to /etc/sysctl.conf:
# Don't allow the arp table to become bigger than this
# (clusters containing 1024 nodes or more).
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Adjust to arp table gc to clean-up more often
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
You may also consider increasing the SOMAXCONN limit:
# Limit of socket listen() backlog, known in userspace as SOMAXCONN
net.core.somaxconn = 1024
Then reread this configuration file:
sysctl -p
A Slurm plugin is a dynamically linked code object which is loaded explicitly at run time by the Slurm libraries. A plugin provides a customized implementation of a well-defined API connected to tasks such as authentication, interconnect fabric, and task scheduling.
- Slurm scheduler plugins (schedplugins) are Slurm plugins that implement the Slurm scheduler API.
- SPANK - Slurm Plug-in Architecture for Node and job (K)control
- cli_filter Plugin API
- The site_factor plugin is designed to give the site a way to build a custom multifactor priority factor. It will only be loaded and operate alongside PriorityType=priority/multifactor.