.. _Slurm_operations:

================
Slurm operations
================

.. Contents::

Jump to our top-level Slurm page: :ref:`SLURM`

.. _Slurm: https://www.schedmd.com/
.. _Slurm_homepage: https://www.schedmd.com/
.. _Slurm_docs: https://slurm.schedmd.com/
.. _Slurm_Quick_Start: https://slurm.schedmd.com/quickstart_admin.html
.. _Command_Summary: https://slurm.schedmd.com/pdfs/summary.pdf
.. _Slurm_FAQ: https://slurm.schedmd.com/faq.html
.. _Slurm_download: https://slurm.schedmd.com/download.html
.. _Slurm_mailing_lists: https://lists.schedmd.com/cgi-bin/dada/mail.cgi/list
.. _slurm_devel_archive: https://groups.google.com/forum/#!forum/slurm-devel
.. _Slurm_publications: https://slurm.schedmd.com/publications.html
.. _Slurm_tutorials: https://slurm.schedmd.com/tutorials.html
.. _Slurm_bugs: https://bugs.schedmd.com
.. _Slurm_man_pages: https://slurm.schedmd.com/man_index.html
.. _slurm.conf: https://slurm.schedmd.com/slurm.conf.html
.. _scontrol: https://slurm.schedmd.com/scontrol.html
.. _sacctmgr: https://slurm.schedmd.com/sacctmgr.html
.. _slurmctld: https://slurm.schedmd.com/slurmctld.html
.. _slurmdbd: https://slurm.schedmd.com/slurmdbd.html
.. _slurmd: https://slurm.schedmd.com/slurmd.html
.. _EPEL: https://fedoraproject.org/wiki/EPEL

Testing basic functionality
===========================

We assume that you have carried out the above deployment along the lines of
:ref:`Slurm_installation`, :ref:`Slurm_configuration`, :ref:`Slurm_database`,
:ref:`Slurm_accounting` and :ref:`Slurm_scheduler`.

From the *Head/Master* node try to submit an interactive job::

  srun -N1 /bin/hostname

If srun_ hangs, check the firewall settings described in :ref:`Slurm_configuration`.

Please note that interactive jobs submitted from login nodes may be impossible if your
compute nodes are on an isolated private network relative to the login node.

To display the job queue::

  scontrol show jobs

To submit a batch job script using sbatch_::

  sbatch -N1 <script>

.. _sbatch: https://slurm.schedmd.com/sbatch.html
.. _srun: https://slurm.schedmd.com/srun.html

System information
------------------

Useful sysadmin commands:

* sinfo_ - view information about Slurm nodes and partitions
* squeue_ - view information about jobs located in the Slurm scheduling queue
* scancel_ - signal or cancel jobs or job steps
* smap_ - graphically view information about Slurm jobs, partitions, and configuration parameters
* sview_ - graphical user interface to view and modify Slurm state (requires gtk2)
* scontrol_ - view and modify Slurm configuration and state

.. _sinfo: https://slurm.schedmd.com/sinfo.html
.. _squeue: https://slurm.schedmd.com/squeue.html
.. _scancel: https://slurm.schedmd.com/scancel.html
.. _smap: https://slurm.schedmd.com/smap.html
.. _sview: https://slurm.schedmd.com/sview.html

Slurm test suite
----------------

There is a large *test suite*, see the *Testing* section of the Slurm_Quick_Start_
Administrator Guide.
The test suite is in the source ``.../testsuite/expect/`` directory, see the file *README*.
The test suite should be copied to the shared filesystem, for example
``/home/$USER/testsuite/``, and run by a non-root user::

  cd testsuite/expect
  ./regression

MPI setup
=========

MPI use under Slurm depends upon the type of MPI being used, see MPI_and_UPC_Users_Guide_.
The current versions of Slurm and OpenMPI_ support task launch using the srun_ command,
see the MPI_Guide_OpenMPI_.
For PMIx_ please see the PMIx_Slurm_support_ page.
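As a point of reference, a minimal batch script for an MPI job might look like the
following sketch.
The program name, node and task counts are placeholders, and the ``--mpi=pmi2`` flag
is explained below::

  #!/usr/bin/env bash
  #SBATCH --job-name=mpi_test
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4
  #SBATCH --time=00:10:00
  # Launch the (hypothetical) MPI program with srun; see the --mpi discussion below
  srun --mpi=pmi2 ./my_mpi_program

Submit it with ``sbatch my_mpi_job.sh`` and follow it in the queue with squeue_.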
.. _MPI_and_UPC_Users_Guide: https://slurm.schedmd.com/mpi_guide.html
.. _MPI_Guide_OpenMPI: https://slurm.schedmd.com/mpi_guide.html#open_mpi
.. _OpenMPI: https://www.open-mpi.org/
.. _PMIx: https://pmix.org/
.. _PMIx_Slurm_support: https://pmix.org/support/how-to/slurm-support/

You must add these flags when building OpenMPI_::

  --with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr

The Slurm RPM installs header files in ``/usr/include/slurm`` and libraries in ``/usr/lib64``.
Using the OpenMPI tools, verify the installation of the *slurm* as well as *pmi* modules,
for example::

  # ompi_info | egrep -i 'slurm|pmi'
        MCA db: pmi (MCA v2.0.0, API v1.0.0, Component v1.10.3)
       MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)
       MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.3)
   MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)
       MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
       MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.3)
    MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)

Since Slurm provides both the PMI and PMI-2 interfaces, this advice in MPI_Guide_OpenMPI_
is important::

  If the pmi2 support is enabled then the command line options '--mpi=pmi2' has to be
  specified on the srun command line.

Hence you must invoke srun_ like::

  srun --mpi=pmi2

It may alternatively be convenient to add this line to slurm.conf_::

  MpiDefault=pmi2

See the FAQ *Running jobs under Slurm* and the *Process Management Interface* (PMI_) page.

.. _PMI: https://www.open-mpi.org/projects/pmix/

MPI locked memory
-----------------

MPI stacks running over Infiniband or :ref:`OmniPath` require the ability to allocate
more locked memory than the default limit.
Unfortunately, user processes on login nodes may have a small memory limit (check it by
``ulimit -a``), which by default is propagated into Slurm jobs and hence causes fabric
errors for MPI.
See the *memlock* entry in the Slurm_FAQ_.
This is fixed by adding to slurm.conf_::

  PropagateResourceLimitsExcept=MEMLOCK

You can view the limits of the running slurmd_ process by::

  cat "/proc/$(pgrep -u 0 slurmd)/limits"

CPU management
==============

It is important to understand how Slurm manages nodes, CPUs, tasks etc.
This is documented in the cpu_management_ page.

.. _cpu_management: https://slurm.schedmd.com/cpu_management.html

GPU accelerators
================

Configure Slurm for GPU_ accelerators as described in the :ref:`Slurm_configuration`
page under the GRES_ section.
The *AutoDetect* configuration in gres.conf_ can be used to detect GPU hardware
(currently Nvidia and AMD).

.. _GPU: https://en.wikipedia.org/wiki/Graphics_processing_unit
.. _gres.conf: https://slurm.schedmd.com/gres.conf.html
.. _GRES: https://slurm.schedmd.com/gres.html

You should set the default number of CPUs allocated per allocated GPU (**DefCpuPerGPU**)
for each partition containing GPUs in the slurm.conf_ file, for example::

  PartitionName=xxx DefCpuPerGPU=4 ...

For accounting of GPU usage you must add to the **AccountingStorageTRES** in slurm.conf_,
for example::

  AccountingStorageTRES=gres/gpu,gres/gpu:tesla

and restart slurmctld_ so that these new fields are added to the database.

Nvidia GPUs
-----------

It is possible to build Slurm packages which include the Nvidia NVML_ library for easy
handling of GPU hardware.
NVML automatically detects GPUs, their type, cores, and NVLinks.
Quoting the GRES_ page::

  If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is
  installed on the node and was found during Slurm configuration, configuration details
  will automatically be filled in for any system-detected NVIDIA GPU. This removes the
  need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf
  is still required in order to tell slurmctld how many GRES to expect.

However, it is **not necessary** to include the NVML_ library in your Slurm packages,
since you can configure gres.conf_ manually for the GPU_ hardware in your nodes.
See the mailing list thread *Building Slurm RPMs with NVIDIA GPU support?*.

.. _NVML: https://developer.nvidia.com/nvidia-management-library-nvml
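As an illustration of the manual approach, a gres.conf_ and the matching slurm.conf_
lines for a node with four Tesla GPUs might look like the following sketch.
The node name, GPU type and device files are placeholders that must match your actual
hardware::

  # gres.conf on the compute node (manual configuration, no AutoDetect)
  Name=gpu Type=tesla File=/dev/nvidia[0-3]

  # Corresponding node definition in slurm.conf
  NodeName=gpunode01 Gres=gpu:tesla:4 ...

After distributing the files, restart the slurmd_ and slurmctld_ daemons and verify the
result with ``scontrol show node gpunode01``.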
Nvidia drivers
..............

Download Nvidia drivers from https://www.nvidia.com/Download/index.aspx and select the
appropriate GPU_ version and host operating system.
Installation instructions are provided on the download page::

  rpm -i nvidia-diag-driver-local-repo-rhel7-375.66-1.x86_64.rpm
  yum clean all
  yum install cuda-drivers
  reboot

You can also download and install the Nvidia *UNIX drivers*, and the CUDA toolkit from
https://developer.nvidia.com/cuda-downloads.

To verify the availability of GPU_ accelerators in a node run the command::

  nvidia-smi -L

which is installed with the *xorg-x11-drv-nvidia* RPM package.

RPC rate limiting
=================

It is common to experience users who bombard the slurmctld_ server by executing commands
such as squeue_, sinfo_, sbatch_ or the like with many requests per second.
This can potentially make the slurmctld_ unresponsive and therefore affect the entire
cluster.

The ability to do ``RPC rate limiting`` on a per-user basis is a new feature with
Slurm_ 23.02.
It acts as a virtual bucket of tokens that users consume with *Remote Procedure Calls*
(RPC_).
Enable this feature in slurm.conf_ by adding ``rl_enable`` and other parameters such as
``rl_refill_period``, for example::

  SlurmctldParameters=rl_enable,rl_refill_period=5

**NOTE:** After changing ``SlurmctldParameters`` run ``scontrol reconfig`` or restart
the ``slurmctld``, see bug_18067_.
The correct action seems to be undocumented as of 23.02.6.

This allows users to submit a large number of requests in a short period of time, but
not a sustained high rate of requests that would add stress to the slurmctld_.
You can define:

* the maximum number of tokens with ``rl_bucket_size``,
* the rate at which new tokens are added with ``rl_refill_rate``,
* the frequency with which tokens are refilled with ``rl_refill_period``,
* and the number of entities to track with ``rl_table_size``.
* New in 23.11: the ``rl_log_freq`` option limits the number of *RPC limit exceeded...*
  messages that are logged.

When this is enabled you may find lines in ``slurmctld.log`` such as::

  [2023-10-06T10:22:32.893] RPC rate limit exceeded by uid 2851 with REQUEST_SUBMIT_BATCH_JOB, telling to back off

We have written a small script sratelimit_ for summarizing such log entries.

.. _RPC: https://en.wikipedia.org/wiki/Remote_procedure_call
.. _sratelimit: https://github.com/OleHolmNielsen/Slurm_tools/blob/master/jobs/sratelimit
.. _bug_17835: https://bugs.schedmd.com/show_bug.cgi?id=17835
.. _bug_18067: https://bugs.schedmd.com/show_bug.cgi?id=18067
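If you just need a quick per-user count of rate-limited requests before reaching for
sratelimit_, a one-liner along these lines can be used (the log file path is a
placeholder for your local ``SlurmctldLogFile`` location)::

  grep -o 'RPC rate limit exceeded by uid [0-9]*' /var/log/slurm/slurmctld.log | sort | uniq -c | sort -rn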
Utilities for Slurm
===================

Here we list some third-party utilities that Slurm administrators or users may find
useful:

* A comprehensive list of tools on the Slurm_download_ page.
* Slurm tools by Ole Holm Nielsen: https://github.com/OleHolmNielsen/Slurm_tools including:

  - pestat_ prints a node status list (1 host per line) with information about jobids,
    users and CPU loads.

* *SlurmCommander* is a simple, lightweight, no-dependencies text-based user interface
  (TUI) to your cluster.
  It ties together multiple Slurm commands to provide a simple and efficient interaction
  point with Slurm.
* *STUBL - SLURM Tools and UBiLities*.
* *birc-aeh/slurm-utils*: *gnodes* gives a visual representation of your cluster, and
  *jobinfo* tries to collect information for a full job.
* *slurm_showq* - a *showq* style job summary utility for Slurm.
* schedtop_ cluster monitoring tool (see also bug_1868_).
  Build a new RPM by::

    rpmbuild --rebuild --with slurm schedtop-5.02-1.sdl6.src.rpm
    yum install ~/rpmbuild/RPMS/x86_64/slurmtop-5.02-1.el7.centos.x86_64.rpm

  Then run::

    slurmtop

.. _schedtop: https://svn.princeton.edu/schedtop/
.. _pestat: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
.. _bug_1868: https://bugs.schedmd.com/show_bug.cgi?id=1868

Graphical monitoring tools
--------------------------

There exist a few Open Source tools for graphical monitoring of Slurm:

* Open XDMoD_ is an open source tool to facilitate the management of high performance
  computing resources.
* *Graphing sdiag with Graphite* using Graphite_.
  See also the *slurm-diamond-collector* project.
* *Prometheus Slurm Exporter* with a Grafana_ Slurm_dashboard_.
* *Slurmbrowser* - a really thin web layer above Slurm.
  This tool requires *Ganglia*.
  Install first the RPMs ``python-virtualenv python2-bottle``.

.. _XDMoD: https://open.xdmod.org/
.. _Graphite: https://graphite.readthedocs.org/en/latest/
.. _Grafana: https://grafana.com/
.. _Slurm_dashboard: https://grafana.com/dashboards/4323

Working with Compute nodes
==========================

Slurm power saving scripts
--------------------------

Slurm provides an integrated power saving mechanism for powering down idle nodes and
starting them again when jobs need to be scheduled, see the Slurm_Power_Saving_Guide_.
We provide some Slurm_power_saving_scripts_ which may be useful for power management
using IPMI_ or with cloud services.

.. _Slurm_Power_Saving_Guide: https://slurm.schedmd.com/power_save.html
.. _Slurm_power_saving_scripts: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
.. _IPMI: https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface

Expanding and collapsing host lists
-----------------------------------

Slurm displays node/host lists in a compact format, for example ``node[001-123]``.
Sometimes you want to expand the host list, for example in scripts, to list all nodes
individually.

You can use this command to output hostnames one line at a time::

  scontrol show hostnames node[001-123]

or rewrite the list into a single line with paste_::

  scontrol show hostnames node[001-123] | paste -s -d ,

.. _paste: https://en.wikipedia.org/wiki/Paste_(Unix)

To contract expanded host lists::

  # scontrol show hostlistsorted h003,h002,h001
  h[001-003]
  # scontrol show hostlist h003,h002,h001
  h[003,002,001]
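Inside a batch job the same command is handy for looping over the nodes allocated to the
job; a small sketch, assuming a bash job script::

  #!/usr/bin/env bash
  # Expand the node list of the current job and visit each node
  for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
      echo "Allocated node: $host"
  done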
When the server does not have the *slurm* RPM installed, or for more sophisticated host
list processing, some non-Slurm tools may be used as shown below.

The hostlist command
....................

The python-hostlist_ tool is very convenient for expanding or compressing node lists.
To install this tool (make sure to download the latest release)::

  wget https://www.nsc.liu.se/~kent/python-hostlist/python-hostlist-1.21.tar.gz
  rpmbuild -ta python-hostlist-1.21.tar.gz
  yum install python3-devel
  yum install ~/rpmbuild/RPMS/noarch/python2-hostlist-1.21-1.noarch.rpm
  yum install ~/rpmbuild/RPMS/noarch/python3-hostlist-1.21-1.noarch.rpm

For usage see the python-hostlist_ page; a useful example is::

  # hostlist --expand --sep " " n[001-012]
  n001 n002 n003 n004 n005 n006 n007 n008 n009 n010 n011 n012

.. _python-hostlist: https://www.nsc.liu.se/~kent/python-hostlist/

The nodeset command
...................

The nodeset_ command from the ClusterShell_tool_ (see below) enables easy manipulation
of node sets, as well as node groups, at the command line level.
For example::

  $ nodeset --expand node[13-15,17-19]
  node13 node14 node15 node17 node18 node19

SSH keys for password-less access to cluster nodes
--------------------------------------------------

Users may have a need for SSH access to Slurm compute nodes, for example, if their MPI
library is using SSH instead of Slurm to start MPI tasks.
However, it is a good idea to configure the slurm-pam-adopt_ module on the nodes to
control and restrict SSH access.

The SSH_ (*Secure Shell*) configuration files, including the server private/public keys,
are in the ``/etc/ssh/`` folder.
The file ``/etc/ssh/ssh_known_hosts`` containing the SSH *public keys* of all nodes
should be created on the central server and distributed to all Slurm nodes.
The ssh-keyscan_ tool is very convenient for gathering the SSH *public keys* of the
cluster nodes, some examples are::

  ssh-keyscan -t ssh-ed25519 node001 node002   # Scan nodes node001+node002 for key type ssh-ed25519
  scontrol show hostnames node[001-022] | ssh-keyscan -f - 2>/dev/null | sort   # Scan nodes node[001-022], pipe comments to /dev/null, and sort the output
  sinfo -Nho %N | uniq | ssh-keyscan -f - 2>/dev/null | sort   # Scan all Slurm nodes (uniq suppresses duplicates)

Remember to set the SELinux context correctly for the files in ``/etc/ssh``::

  chcon system_u:object_r:etc_t:s0 /etc/ssh/ssh_known_hosts

When all SSH *public keys* of the Slurm nodes are available in
``/etc/ssh/ssh_known_hosts``, each individual user can configure their password-less SSH
login.
First the user must generate SSH keys (placed in the ``$HOME/.ssh/`` folder) using the
ssh-keygen_ tool.
Each user may use the convenient tool authorized_keys_ for generating SSH keys and adding
them to the ``$HOME/.ssh/authorized_keys`` file.
For external computers the personal SSH_authorized_keys_ (preferably with a *passphrase*
or *Multi-Factor Authentication*) should be used.

For the servers running the slurmctld_ and slurmdbd_ services it is strongly recommended
**not** to permit login by normal users, because they have no business on those servers!
To restrict which users can log in to the management hosts, append this line to the SSH
server ``/etc/ssh/sshd_config`` file::

  AllowUsers root

You can add more trusted system managers to this line if needed.
Then restart the SSH service::

  systemctl restart sshd

.. _slurm-pam-adopt: https://slurm.schedmd.com/pam_slurm_adopt.html
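Returning to the ``/etc/ssh/ssh_known_hosts`` file discussed above: once it has been
generated on the central server, it can be pushed to all compute nodes, for example with
the ClusterShell clush_ command described below (the node range is a placeholder)::

  clush -bw node[001-099] --copy /etc/ssh/ssh_known_hosts --dest /etc/ssh/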
Host-based authentication
.........................

Another way to enable password-less SSH login is to configure login nodes and compute
nodes in the cluster to allow Host-based_Authentication_.

Please beware that:

* For security reasons it is strongly recommended **not** to include the Slurm
  slurmctld_ and slurmdbd_ servers in the Host-based_Authentication_, because normal
  users have no business on those servers!
* For security reasons the **root** user is not allowed to use
  Host-based_Authentication_.
  You can add root's public key to the ``/root/.ssh/authorized_keys`` file on all
  compute nodes for easy SSH access.
* Furthermore, personal computers and other computers outside the cluster **must not be
  trusted** by the cluster nodes!
  For external computers the personal SSH_authorized_keys_ (preferably with a
  *passphrase* or Multi_Factor_Authentication_) should be used.
* You need to understand that Host-based_Authentication_ is a *bad idea in general*, but
  that it is a good and secure solution within a single Linux cluster's security
  perimeter, see for example:

  * *Implementing ssh hostbased authentication*.
  * The mailing list thread at https://lists.schedmd.com/pipermail/slurm-users/2020-June/005578.html

* It is a good idea to configure the slurm-pam-adopt_ module on the nodes to control and
  restrict SSH access.

Here are the steps for configuring Host-based_Authentication_ on CentOS 7 systems:

1. First populate all SSH keys in the file ``/etc/ssh/ssh_known_hosts`` as shown above.

2. Configure **only** these lines in the SSH client configuration
   ``/etc/ssh/ssh_config`` on all nodes::

     HostbasedAuthentication yes
     EnableSSHKeysign yes

   These lines do not work inside *Host* or *Match* statements, but must be defined at
   the global level.

   You may also configure *PreferredAuthentications* (the order of authentication
   methods) so that the *hostbased* method is preferred for the nodes in the cluster's
   domain name (replace ``<domainname>`` by your DNS domain).
   Furthermore *GSSAPI* and *ForwardX11Trusted* may be configured::

     Host *.<domainname>
       PreferredAuthentications gssapi-keyex,gssapi-with-mic,hostbased,publickey,keyboard-interactive,password
       GSSAPIAuthentication yes
       ForwardX11Trusted yes

   The ssh_config_ manual page explains the configuration keywords.
   *GSSAPI* (Generic Security Service Application Program Interface (GSS-API)
   Authentication and Key Exchange for the Secure Shell (SSH) Protocol) is defined in
   rfc4462_.

3. Add these lines to the SSH server ``/etc/ssh/sshd_config`` file on all nodes::

     HostbasedAuthentication yes
     UseDNS yes

   and restart the SSH service::

     systemctl restart sshd

4. Populate the file ``/etc/ssh/shosts.equiv`` with 1 line for **every** node in the
   cluster listed in ``/etc/ssh/ssh_known_hosts``, **including** the full DNS domain
   name, for example::

     node001.<domainname>
     node002.<domainname>
     ...

   Wildcard hostnames are not possible, so you must list **all** hosts, one per line.
   To list all cluster nodes::

     sinfo -Nho %N | uniq | awk '{print $1 ".domainname"}' > /etc/ssh/shosts.equiv

   where you must substitute your own *domainname*.

Remember to set the SELinux context correctly for the files in ``/etc/ssh``::

  chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts

A normal (non-root) user should now be able to log in from a node to itself, for
example::

  testnode$ ssh -v testnode

and the verbose output should inform you::

  debug1: Authentication succeeded (hostbased).

.. _SSH: https://en.wikipedia.org/wiki/Secure_Shell
.. _ssh-keyscan: https://linux.die.net/man/1/ssh-keyscan
.. _ssh-keygen: https://www.ssh.com/ssh/keygen/
.. _ssh_config: https://linux.die.net/man/5/ssh_config
.. _SSH_authorized_keys: https://www.ssh.com/ssh/authorized_keys/openssh
.. _Host-based_Authentication: https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication
.. _authorized_keys: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/SSH
.. _rfc4462: https://www.ietf.org/rfc/rfc4462.txt
.. _Multi_Factor_Authentication: https://www.redhat.com/sysadmin/mfa-linux

.. _clustershell:

ClusterShell
------------

The ClusterShell_tool_ is an alternative to pdsh_ (see below) which is more actively
maintained and has some better features.
There is a ClusterShell_manual_ and a ClusterShell_configuration_ guide.

.. _ClusterShell_manual: https://clustershell.readthedocs.io/en/latest/
.. _ClusterShell_configuration: https://clustershell.readthedocs.io/en/latest/config.html

Install the ClusterShell_tool_ from the EPEL_ repository::

  yum install epel-release
  yum install clustershell

On CentOS/RHEL 7 this will be using the system default Python 2.7.
To install the Python 3 version::

  yum install python36-clustershell

Copy the ClusterShell example file for Slurm bindings::

  cp /etc/clustershell/groups.conf.d/slurm.conf.example /etc/clustershell/groups.conf.d/slurm.conf

You should define *slurm* as the default group source in
``/etc/clustershell/groups.conf``::

  [Main]
  # Default group source
  default: slurm

It is convenient to add a Slurm binding for all running jobs belonging to a specific
user.
Append to ``/etc/clustershell/groups.conf.d/slurm.conf`` the lines::

  #
  # SLURM user job bindings
  #
  [slurmuser,su]
  map: squeue -h -u $GROUP -o "%N" -t running
  list: squeue -h -o "%i" -t R
  reverse: squeue -h -w $NODE -o "%i"
  cache_time: 60

This feature was included in version 1.8.1.
You may encounter some surprising zero-padding_ behavior in node names, see also
issue_293_.

.. _zero-padding: https://clustershell.readthedocs.io/en/latest/tools/nodeset.html#zero-padding
.. _issue_293: https://github.com/cea-hpc/clustershell/issues/293

ClusterShell usage
..................

You can list all node groups including hostnames and node counts using this
ClusterShell_tool_ command::

  cluset -LLL

Simple usage of clush_::

  clush -w node[001-003] date

For a Slurm partition::

  clush -g <partition> date

If option *-b* or *--dshbak* (like with PDSH_) is specified, clush_ waits for command
completion while displaying a progress indicator and then displays gathered output
results::

  clush -b -g <partition> date

To execute a command only on nodes with a specified Slurm state (here: ``drained``)::

  clush -w@slurmstate:drained date
  clush -bw@slurmstate:down 'uname -r; dmidecode -s bios-version'

To execute a command only on nodes running a particular Slurm JobID (here: 123456)::

  clush -w@sj:123456 <command>

To execute a command only on nodes running jobs for a particular username (requires the
above mentioned *slurmuser* configuration)::

  clush -w@su:username <command>

If you want to run commands on hosts **not** under Slurm, select a group source defined
in ``/etc/clustershell/groups`` (see ``man clush``)::

  clush -s GROUPSOURCE or --groupsource=GROUPSOURCE

For example::

  clush -s local -g testcluster

The nodeset_ command enables easy manipulation of node sets, as well as node groups, at
the command line level.
For example::

  $ nodeset --expand node[13-15,17-19]
  node13 node14 node15 node17 node18 node19

.. _ClusterShell_tool: https://clustershell.readthedocs.io/en/latest/intro.html
.. _clush: https://clustershell.readthedocs.io/en/latest/tools/clush.html
.. _nodeset: https://clustershell.readthedocs.io/en/latest/tools/nodeset.html
Copying files with ClusterShell
...............................

When the ClusterShell_tool_ has been set up, it is very simple to copy files and folders
to the nodes, see the clush_ manual page.
Example::

  clush -bw node[001-099] --copy /etc/slurm/slurm.conf --dest /etc/slurm/

PDSH - Parallel Distributed Shell
---------------------------------

A crucial tool for the sysadmin is to execute commands in parallel on the compute nodes.
The widely used pdsh_ tool may be used for this (see also the ClusterShell_tool_ above).

.. _pdsh: https://github.com/grondo/pdsh

The pdsh_ RPM package may be installed from the EPEL_ repository, but unfortunately the
Slurm module hasn't been built in.
Therefore you must manually rebuild the pdsh_ RPM:

* Download the pdsh_ version 2.31 source RPM from
  https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/::

    wget https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/pdsh-2.31-1.el7.src.rpm

* Install prerequisite packages::

    yum install libnodeupdown-devel libgenders-devel whatsup

* Rebuild the pdsh_ RPMs::

    rpmbuild --rebuild --with=slurm --without=torque pdsh-2.31-1.el7.src.rpm

  Notice: On CentOS 5 and 6 you must apparently remove the "=" signs due to a bug in
  rpmbuild.

* Install the relevant (according to your needs) RPMs::

    cd $HOME/rpmbuild/RPMS/x86_64/
    yum install pdsh-2.31-1* pdsh-mod-slurm* pdsh-rcmd-ssh* pdsh-mod-dshgroup* pdsh-mod-nodeupdown*

The pdsh_ command now knows about Slurm partitions and jobs::

  pdsh -P <partition> date
  pdsh -j <jobid> date

See ``man pdsh`` for further details.
The ``whatsup`` command may also be useful, see ``man whatsup`` for further details.

Listing nodes
-------------

Use sinfo_ to list nodes that are responding (for example, to be used in pdsh_
scripts)::

  sinfo -r -h -o '%n'
  sinfo --responding --noheader --format='%n'

List reasons nodes are in the down, drained, fail or failing state::

  sinfo -R
  sinfo --list-reasons
  sinfo -lRN

List of nodes with features and status::

  sinfo --format="%25N %.40f %.6a %.10A"

Use scontrol_ to list node properties::

  scontrol -o show nodes

Listing node resources used
---------------------------

Use sinfo_ to see what resources are used/remaining on a per node basis::

  sinfo -Nle -o '%n %C %t'

The flag ``-p <partition>`` may be added.
Node states listed with a ``*`` suffix mean that the node is **not responding**.

Note the *STATE* column:

* State of the nodes.
  Possible states include: allocated, completing, down, drained, draining, fail,
  failing, future, idle, maint, mixed, perfctrs, power_down, power_up, reserved, and
  unknown, plus their abbreviated forms: alloc, comp, down, drain, drng, fail, failg,
  futr, idle, maint, mix, npc, pow_dn, pow_up, resv, and unk, respectively.
  Note that the suffix "*" identifies nodes that are presently not responding.

Resume an offline node
----------------------

A node may get stuck in an offline mode for several reasons.
For example, you may see this::

  # scontrol show node q007
  NodeName=q007 Arch=x86_64 CoresPerSocket=2
  ...
  State=DOWN ThreadsPerCore=1 TmpDisk=32752 Weight=1 Owner=N/A
  ...
  Reason=NO NETWORK ADDRESS FOUND [slurm@2015-12-08T09:25:32]

Node states listed with a ``*`` suffix mean that the node is **not responding**.

It is very difficult to find documentation on how to clear such an offline state.
The solution is to use the scontrol_ command (section *SPECIFICATIONS FOR UPDATE
COMMAND, NODES*)::

  scontrol update nodename=a001 state=down reason="undraining"
  scontrol update nodename=a001 state=resume

See also *How to "undrain" slurm nodes in drain state*, where it is recommended to avoid
the *down* state (the 1st command above).
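As a sketch, several nodes can be resumed in one operation by giving a hostlist
expression to scontrol_ (the node names below are placeholders)::

  # List nodes that are down/drained together with the reason
  sinfo -R
  # Resume a whole range of nodes in one command
  scontrol update nodename=a[001-010] state=resume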
Slurm trigger information
-------------------------

Triggers include events such as:

* a node failing,
* a daemon stopping or restarting,
* a job reaching its time limit,
* a job terminating.

These events can cause actions such as the execution of an arbitrary script.
Typical uses include notifying system administrators of node failures and gracefully
terminating a job when its time limit is approaching.
A hostlist expression for the nodelist or job ID is passed as an argument to the program.

* strigger_ - used to set, get or clear Slurm trigger information.

An example script using this is notify_nodes_down_.
To set up the trigger as the *slurm* user::

  slurm# strigger --set --node --down --program=/usr/local/bin/notify_nodes_down

To display enabled triggers::

  strigger --get

.. _notify_nodes_down: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers
.. _strigger: https://slurm.schedmd.com/strigger.html

Add and remove nodes
--------------------

Nodes can be added or removed by modifying the slurm.conf_ file and distributing it to
all nodes.
If you use the topology.conf_ configuration, that file must also be updated and
distributed to all nodes.
If you run a :ref:`configless-slurm-setup` setup, then the configuration files are served
automatically to the nodes by the slurmctld_.

Starting in Slurm 22.05, nodes can be dynamically added and removed from Slurm, see
dynamic_nodes_.

.. _dynamic_nodes: https://slurm.schedmd.com/dynamic_nodes.html

If nodes must initially be unavailable for starting jobs, define them in slurm.conf_
with a *State* and optionally a *Reason* parameter::

  NodeName=xxx ... State=DRAIN Reason="Not yet ready"
  NodeName=xxx ... State=FUTURE

For convenience the command::

  slurmd -C

can be used on each compute node to print its physical configuration (sockets, cores,
real memory size, etc.) for inclusion into slurm.conf_.

An entire new partition may also be made unavailable using a *State* not equal to *UP*::

  PartitionName=xxx ... State=INACTIVE
  PartitionName=xxx ... State=DRAIN

.. _topology.conf: https://slurm.schedmd.com/topology.conf.html

However, the slurmctld_ daemon must then be restarted::

  systemctl restart slurmctld

As stated in the scontrol_ page under the *reconfigure* option:

* The slurmctld_ daemon must be restarted if nodes are added to or removed from the
  cluster.

Furthermore, the slurmd_ service on all compute nodes must also be restarted in order to
pick up the changes in slurm.conf_, for example::

  clush -ba systemctl restart slurmd

See the advice from the Slurm_publications_ talk *Technical: Field Notes Mark 2: Random
Musings From Under A New Hat* by Tim Wickberg, SchedMD (2018) on the *Safe procedure*:

1. Stop slurmctld
2. Change configs
3. Restart all slurmd processes
4. Start slurmctld

Less safe, but usually okay, procedure:

1. Change configs
2. Restart slurmctld
3. Restart all slurmd processes really quickly

See also https://thread.gmane.org/gmane.comp.distributed.slurm.devel/3039 (comment by
Moe Jette).
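As a concrete sketch of the *Safe procedure* above, assuming ClusterShell is available
for the cluster-wide restart::

  systemctl stop slurmctld              # 1. Stop the controller
  # 2. Edit slurm.conf (and topology.conf) and distribute them to all nodes,
  #    unless you run a configless setup
  clush -ba systemctl restart slurmd    # 3. Restart slurmd on all compute nodes
  systemctl start slurmctld             # 4. Start the controller again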
Rebooting nodes
---------------

Slurm can reboot nodes by::

  scontrol reboot [ASAP] [NodeList]

This reboots all nodes in the system when they become idle, using the *RebootProgram* as
configured in slurm.conf_.
The option "ASAP" prevents initiation of additional jobs so the node can be rebooted and
returned to service "As Soon As Possible" (i.e. ASAP).
An optional list of nodes to reboot is accepted; by default all nodes are rebooted.

**NOTE**: The reboot request will be ignored for hosts in the following states:
``FUTURE, POWER_DOWN, POWERED_DOWN, POWERING_DOWN, REBOOT_ISSUED, REBOOT_REQUESTED``,
see bug_18505_.
Currently, no warning is issued in such cases.
From Slurm_ 24.08 an error message will be printed by ``scontrol reboot`` when a node
reboot request is ignored due to the current node state.

.. _bug_18505: https://bugs.schedmd.com/show_bug.cgi?id=18505

Compute node OS and firmware updates
------------------------------------

Regarding the question of methods for Slurm compute node OS and firmware updates, we
have for a long time used rolling updates while the cluster is in full production, so
that we do not waste any resources.
When entire partitions are upgraded in this way, there is no risk of starting new jobs
on nodes with differing states of OS and firmware, while running jobs continue on the
not-yet-updated nodes.

The basic idea (which was provided by Niels Carl Hansen, ncwh -at- cscaa.dk) is to run a
crontab script ``update.sh`` whenever a node is rebooted.
Use scontrol_ to reboot the nodes as they become idle, thereby performing the updates
that you want.
Remove the crontab job as part of the ``update.sh`` script.

The ``update.sh`` script and instructions for usage are in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

Shared Memory cleanup
---------------------

Certain jobs allocate Shared Memory resources but do not release them before job
completion.
For example, the number of shared memory segments may hit the system limit (typically
4096); see the system limit by::

  $ sysctl kernel.shmmni
  kernel.shmmni = 4096

Error messages such as this one may occur::

  getshmem_C in getshmem.c: cannot create shared segment 8
  No space left on device

See also Bug_7232_.

Information on the inter-process communication facilities::

  ipcs -a

Users and root can clean up unused data by::

  ipcrm -a

.. _Bug_7232: https://bugs.schedmd.com/show_bug.cgi?id=7232

Resource Reservation
====================

Compute nodes can be reserved for a number of purposes.
Read the reservations_ guide.

.. _reservations: https://slurm.schedmd.com/reservations.html

For example, to **reserve a set of nodes** for a testing purpose with a duration of 720
hours::

  scontrol create reservation starttime=now duration=720:00:00 ReservationName=Test1 Flags=MAGNETIC nodes=x[049-096] user=user1,user2

Ignore currently running jobs when creating the reservation by adding this flag::

  flags=ignore_jobs

**Magnetic reservations** were introduced in Slurm 20.02, see the scontrol_ man page::

  Flags=MAGNETIC   # This flag allows jobs to be considered for this reservation even if they didn't request it.

Jobs will be eligible to run in such reservations even if they did not specify
``--reservation``.

To **reserve nodes for maintenance** for 72 hours::

  scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs nodes=x[145-168] user=root

A specification of **nodes=ALL** will reserve all nodes.
If you want to reserve an **entire partition**, it is recommended to **not** specify
nodes, but a partition instead::

  scontrol create reservation starttime=2017-06-19T12:00:00 duration=72:00:00 ReservationName=Maintenance flags=maint,ignore_jobs partitionname=xeon16 user=root

To list all reservations::

  scontrol show reservations

and also previous reservations some weeks back in time::

  scontrol show reservations start=now-5weeks

Batch jobs submitted for the reservation must explicitly refer to it, for example::

  sbatch --reservation=Test1 -N4 my.script

One may also specify explicitly some nodes::

  sbatch --reservation=Test1 -N2 --nodelist=x188,x140 my.script

Working with jobs
=================

Tutorial pages about Slurm job management:

* *Slurm 101: Basic Slurm Usage for Linux Clusters*
* *Convenient SLURM Commands*

Interactive jobs
----------------

Using srun_, users can launch interactive jobs on compute nodes through Slurm.
See the Slurm_FAQ_ entry *How can I get shell prompts in interactive mode?*::

  srun --pty bash -i [additional options]

If you need to run MPI tasks, see MPI_Guide_OpenMPI_.
It is required to invoke srun_ with pmi2 or pmix support as shown above in the MPI
section, for example::

  srun --pty --mpi=pmi2 bash -i [additional options]

Job arrays
----------

Slurm job_arrays_ offer a mechanism for submitting and managing collections of similar
jobs quickly and easily.
It is important to understand that the tasks of a job array only become independent jobs
(similar to non-array jobs) with their own unique JobIDs at the moment each individual
task starts running.
To see the relationship between job arrays and JobIDs, this is a useful command for a
specified ArrayJobID::

  $ squeue -j 3394902 -O ArrayJobID,JobArrayID,JobID,State
  ARRAY_JOB_ID        JOBID               JOBID      STATE
  3394902             3394902_[34-91]     3394902    PENDING
  3394902             3394902_30          3394932    RUNNING
  3394902             3394902_28          3394930    RUNNING

.. _job_arrays: https://slurm.schedmd.com/job_array.html
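A minimal job array sketch might look like this (the array range, resources and input
file naming are placeholders)::

  #!/usr/bin/env bash
  #SBATCH --job-name=array_test
  #SBATCH --array=1-10
  #SBATCH --ntasks=1
  # Each array task gets its own index in the SLURM_ARRAY_TASK_ID variable
  echo "Processing input file input.${SLURM_ARRAY_TASK_ID}"

Submit it with sbatch_ as usual; squeue_ will then show the pending tasks collapsed into
a single ``jobid_[range]`` line as in the example above.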
Useful commands
---------------

See the overview of Slurm_man_pages_ as well as the individual command man-pages.

.. list-table::
   :widths: 4 4

   * - **Command**
     - **Function**
   * - squeue_
     - List jobs
   * - squeue_ --start
     - List starting times of jobs
   * - sbatch_ --wrap="some-command"
     - Submit a job running just ``some-command`` (without a script file)
   * - scontrol_ show job xxx
     - Get job details
   * - scontrol_ --details show job xxx
     - Get more job details
   * - scontrol_ suspend xxx
     - Suspend a job (root only)
   * - scontrol_ resume xxx
     - Resume a job (root only)
   * - scontrol_ hold xxx
     - Hold a job
   * - scontrol_ uhold xxx
     - User-Hold a job
   * - scontrol_ release xxx
     - Release a held job
   * - scontrol_ update jobid=10208 nice=-10000
     - Increase a job's priority (Slurm managers only)
   * - scontrol_ update jobid=10208 nice=5000
     - Decrease a job's priority (users and managers)
   * - scontrol_ top 10208
     - Move the job to the top of the user's queue
   * - scontrol_ update jobid=10208 priority=50000
     - Set a job's priority value
   * - scontrol_ hold jobid=10208; scontrol_ release jobid=10208
     - Reset a job's explicit priority=xxx value
   * - scontrol_ update jobid=1163 EndTime=2022-04-27T08:30:00
     - Modify a job's end time
   * - scontrol_ update jobid=1163 timelimit=12:00:00
     - Modify a job's time limit
   * - scontrol_ update jobid=1163 qos=high
     - Set the job QOS to *high* (QOS list: ``sacctmgr show qos``)
   * - scontrol_ listpids (on the node running a job)
     - Print a listing of the process IDs in a job step
   * - scontrol_ write batch_script job_id optional_filename
     - Write the batch script for a given job_id to a file or to stdout
   * - scontrol_ show config
     - Print the Slurm configuration and running parameters
   * - scontrol_ write config optional_filename
     - Write the current Slurm configuration to a file
   * - scancel_ xxx
     - Kill job xxx
   * - sjobexitmod_ -l jobid
     - Display job exit codes
   * - sstat_
     - Display various status information of a running job/step
   * - scontrol_ show assoc_mgr
     - Display the slurmctld_'s internal cache for users, associations and/or QOS, such as GrpTRESRunMins, GrpTRESMins etc.
   * - scontrol_ -o show assoc_mgr users=xxx accounts=yyy flags=assoc
     - Display the association limits and current values for user xxx in account yyy as a one-liner
   * - sacctmgr_ show user -s xxx
     - Display information about user xxx from the Slurm database
   * - sacctmgr_ add user xxx Account=zzzz
     - Add user xxx to the non-default account zzzz, see the accounting_ page
   * - sacctmgr_ modify qos normal set priority=50
     - Modify the QOS_ named *normal* to set a new priority value
   * - sacctmgr_ modify user where name=xxx set MaxSubmitJobs=NN
     - Update the user's maximum number of submitted jobs to NN. NN=0 blocks submissions, NN=-1 removes the limit
   * - sacctmgr_ -nP list associations user=xxx format=fairshare
     - Print the fairshare number of user xxx
   * - sacctmgr_ show event
     - Display information about events such as downed or draining nodes on clusters
   * - sshare_ -lU -u xxx
     - Print the various fairshare values of user xxx

.. _sjobexitmod: https://slurm.schedmd.com/job_exit_code.html
.. _sstat: https://slurm.schedmd.com/sstat.html
.. _sshare: https://slurm.schedmd.com/sshare.html
.. _accounting: https://slurm.schedmd.com/accounting.html
.. _QOS: https://slurm.schedmd.com/qos.html
squeue usage
............

The squeue_ command has a huge number of parameters for listing jobs.
Here are some suggestions for usage of squeue_:

* The long display gives more details::

    squeue -l
    # is equivalent to:
    squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"

* Add columns for job priority (%Q) and CPU count (%C) and make some columns wider::

    squeue -o "%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"

* Set the output format by an environment variable::

    export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.10T %.9Q %.10M %.9l %.6D %.6C %R"

  or using the new output format::

    export SQUEUE_FORMAT2="JobID:8,Partition:11,QOS:7,Name:10,UserName:9,Account:9,State:8,PriorityLong:9,ReasonList:16,TimeUsed:12,SubmitTime:19,TimeLimit:10,tres-alloc:"

* List pending jobs in the same order considered for scheduling by Slurm (see the
  squeue_ man page under ``--priority``)::

    squeue --priority --sort=-p,i --states=PD

Slurm debugging
===============

Change the debug level of the slurmctld_ daemon::

  scontrol setdebug LEVEL

where LEVEL may be: "quiet", "fatal", "error", "info", "verbose", "debug", "debug2",
"debug3", "debug4", or "debug5".
See the *OPTIONS* section of the scontrol_ man page.
For example::

  scontrol setdebug debug2

This value is temporary and will be overwritten whenever the slurmctld_ daemon reads the
slurm.conf_ configuration file (e.g. when the daemon is restarted or ``scontrol
reconfigure`` is executed).

Add or remove DebugFlags of the slurmctld_ daemon::

  scontrol setdebugflags [+|-]FLAG

For example::

  scontrol setdebugflags +backfill

See the slurm.conf_ *PARAMETERS* section for the full list of supported DebugFlags.

**NOTE**: Changing the value of some DebugFlags will have no effect without restarting
the slurmctld_ daemon, which would set DebugFlags based upon the contents of the
slurm.conf_ configuration file.
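To verify the currently active values afterwards, one possibility is to grep them out of
the running configuration (the grep pattern is just a suggestion)::

  scontrol show config | grep -E 'SlurmctldDebug|DebugFlags'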