Slurm job scheduler


Prerequisites

Before configuring the Multifactor_Priority_Plugin scheduler, you must first configure Slurm_accounting.

Scheduler configuration

The SchedulerType configuration parameter controls how queued jobs are executed. See the Scheduling_Configuration_Guide.

Backfill scheduler

We use the backfill scheduler in slurm.conf:

SchedulerType=sched/backfill
SchedulerParameters=kill_invalid_depend,defer,bf_continue

There are additional backfill parameters that should be considered (see the slurm.conf man page), for example:

...bf_interval=60,bf_max_job_start=20,bf_resolution=600,bf_window=11000

The importance of bf_window is explained as:

  • The default value is 1440 minutes (one day). A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation. In order to limit the amount of data managed by the backfill scheduler, if the value of bf_window is increased, then it is generally advisable to also increase bf_resolution.

Therefore bf_window must be configured to be at least as long as the longest MaxTime over all partitions in slurm.conf:

PartitionName= ... MaxTime=XXX
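As a sketch of the sizing rule above, the snippet below converts slurm.conf time strings (days-hours:minutes:seconds) to minutes and takes the maximum as a lower bound for bf_window. The helper function and the example MaxTime values are illustrative, not part of Slurm:

```python
# Sketch: derive a lower bound for bf_window (in minutes) from the
# longest partition MaxTime. slurm_time_to_minutes is a hypothetical
# helper for the slurm.conf "days-hours:minutes:seconds" syntax.

def slurm_time_to_minutes(t):
    """Convert a Slurm time string like '7-00:00:00' or '36:00:00' to minutes."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    # Round any leftover seconds up to a whole minute:
    return days * 24 * 60 + hours * 60 + minutes + (1 if seconds else 0)

max_times = ["7-00:00:00", "2-12:00:00", "36:00:00"]  # hypothetical partition MaxTime values
bf_window = max(slurm_time_to_minutes(t) for t in max_times)
print(bf_window)  # 10080, i.e. the 7-day partition dominates
```

With a 7-day MaxTime the example bf_window=11000 shown earlier is comfortably large enough.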

Multifactor Priority Plugin scheduler

A sophisticated Multifactor_Priority_Plugin provides a very versatile facility for ordering the queue of jobs waiting to be scheduled. See the PriorityXXX parameters in the slurm.conf file.

Multifactor configuration

The Fairshare is configured with PriorityX parameters in the Configuration section of the Multifactor_Priority_Plugin page, also documented in the slurm.conf page:

  • PriorityType
  • PriorityDecayHalfLife
  • PriorityCalcPeriod
  • PriorityUsageResetPeriod
  • PriorityFavorSmall
  • PriorityMaxAge
  • PriorityWeightAge
  • PriorityWeightFairshare
  • PriorityWeightJobSize
  • PriorityWeightPartition
  • PriorityWeightQOS
  • PriorityWeightTRES

An example slurm.conf fairshare configuration may be:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityFavorSmall=NO
PriorityMaxAge=10-0
PriorityWeightAge=100000
PriorityWeightFairshare=1000000
PriorityWeightJobSize=100000
PriorityWeightPartition=100000
PriorityWeightQOS=100000

PropagateResourceLimitsExcept=MEMLOCK
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
AccountingStorageEnforce=associations,limits,qos,safe
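The effect of PriorityDecayHalfLife=7-0 in the example above can be sketched as simple exponential decay: historical usage from 7 days ago counts at 50%, from 14 days ago at 25%, and so on. This is a simplified model of Slurm's usage bookkeeping, not its exact implementation:

```python
# Sketch: exponential decay of recorded usage under a 7-day half-life,
# as configured by PriorityDecayHalfLife=7-0. Numbers are hypothetical.

def decayed_usage(raw_usage, age_days, half_life_days=7.0):
    return raw_usage * 0.5 ** (age_days / half_life_days)

print(decayed_usage(1000.0, 0))   # 1000.0 (usage accrued today)
print(decayed_usage(1000.0, 7))   # 500.0  (one half-life old)
print(decayed_usage(1000.0, 14))  # 250.0  (two half-lives old)
```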

PriorityWeightXXX values are all 32-bit integers. The final Job Priority is a 32-bit integer.

IMPORTANT: Set the PriorityWeightXXX values high in order to generate a wide range of job priorities.
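The multifactor plugin computes job priority as a weighted sum of normalized factors, each in the range 0.0 to 1.0 (see the Multifactor_Priority_Plugin page). A sketch using the weights from the example slurm.conf above, with hypothetical per-job factor values:

```python
# Sketch of the multifactor priority sum: each factor is a float in
# [0, 1]; the weights are the PriorityWeightXXX values from the example
# configuration. The per-job factor values below are hypothetical.

weights = {
    "age": 100000,
    "fairshare": 1000000,
    "jobsize": 100000,
    "partition": 100000,
    "qos": 100000,
}

factors = {
    "age": 0.5,         # job has waited half of PriorityMaxAge
    "fairshare": 0.25,  # user is over their fair share
    "jobsize": 0.1,
    "partition": 1.0,
    "qos": 0.0,
}

job_priority = int(sum(weights[k] * factors[k] for k in weights))
print(job_priority)  # 410000
```

Note how the large PriorityWeightFairshare dominates the sum, which is why wide weight ranges matter.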

Fairshare

The available fairshare algorithms are explained in the Multifactor_Priority_Plugin page. Note the meaning of the special value fairshare=parent:

  • If all users in an account are configured with:

    FairShare=parent

    the result is that all jobs drawing from that account get the same fairshare priority, based on the account's total usage. No additional fairness is added based on each user's individual usage.
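Since the example configuration sets PriorityFlags=FAIR_TREE, fairshare is computed with the Fair Tree algorithm, where each association's "Level FS" is its normalized shares divided by its normalized usage among siblings (see Slurm's Fair Tree documentation). A sketch with hypothetical share/usage numbers:

```python
# Sketch of the Fair Tree "Level FS" value among sibling associations:
# normalized shares divided by normalized usage. Values > 1 mean the
# association is under-served, < 1 over-served. Numbers are hypothetical.

siblings = {            # account -> (raw shares, raw usage)
    "alice": (100, 4000),
    "bob":   (100, 1000),
    "carol": (200, 3000),
}

total_shares = sum(s for s, _ in siblings.values())
total_usage = sum(u for _, u in siblings.values())

level_fs = {
    name: (s / total_shares) / (u / total_usage)
    for name, (s, u) in siblings.items()
}
for name in sorted(level_fs, key=level_fs.get, reverse=True):
    print(name, round(level_fs[name], 3))
```

Here bob (equal shares, least usage) ranks highest, alice (equal shares, most usage) lowest.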

Quality of Service (QOS)

One can specify a Quality of Service (QOS) for each job submitted to Slurm. A description and example are in the QOS page.

Example:

sacctmgr show qos format=name,priority

To require that user jobs specify a QOS, you must (at least) have:

AccountingStorageEnforce=qos

See the slurm.conf and Resource_Limits documents. The AccountingStorageEnforce options include:

  • associations - This will prevent users from running jobs if their association is not in the database. This option will prevent users from accessing invalid accounts.
  • limits - This will enforce limits set to associations. By setting this option, the 'associations' option is also set.
  • qos - This will require all jobs to specify (either overtly or by default) a valid qos (Quality of Service). QOS values are defined for each association in the database. By setting this option, the 'associations' option is also set.
  • safe - This will ensure a job is only started if it is able to run to completion within the group's limits (e.g. GrpTRESMins). By setting this option, the 'limits' and 'associations' options are also set.

A non-zero PriorityWeightQOS must also be defined in slurm.conf, for example:

PriorityWeightQOS=100000

Resource Limits

To enable any limit enforcement you must at least have:

AccountingStorageEnforce=limits

in your slurm.conf, otherwise, even if you have limits set, they will not be enforced. Other options for AccountingStorageEnforce and the explanation for each are found on the Resource_Limits document.

Limiting (throttling) jobs in the queue

It is desirable to prevent individual users from flooding the queue with jobs: if idle nodes are available, all of those jobs may start at once and block future jobs by other users. Note:

With Slurm it appears that the only way to achieve user job throttling is the following:

  • Using the GrpTRESRunMins parameter defined in the Resource_Limits document. See also the TRES definition.

  • The GrpTRESRunMins limits can be applied to associations (accounts or users) as well as QOS. Set the limit by:

    sacctmgr modify association where name=XXX set GrpTRESRunMin=cpu=1000000   # For an account/user association
    sacctmgr modify qos where name=some_QOS set GrpTRESRunMin=cpu=1000000      # For a QOS
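The throttling effect of such a limit can be sketched as follows: the sum over all running jobs of (allocated CPUs × remaining minutes of time limit) must stay within the GrpTRESRunMins budget, so a new job only starts if its own CPU-minutes fit. The job numbers below are hypothetical:

```python
# Sketch of GrpTRESRunMin=cpu=1000000 throttling: the cpu-minutes
# "in flight" across running jobs are capped, so long/wide jobs from
# one user cannot monopolize future capacity. Numbers are hypothetical.

LIMIT_CPU_MINUTES = 1_000_000

running_jobs = [  # (allocated CPUs, remaining minutes of time limit)
    (128, 2880),  # 2-day job
    (256, 1440),  # 1-day job
]

used = sum(cpus * minutes for cpus, minutes in running_jobs)
print(used)  # 737280 cpu-minutes currently in flight

def can_start(cpus, time_limit_minutes):
    return used + cpus * time_limit_minutes <= LIMIT_CPU_MINUTES

print(can_start(64, 1440))   # True  (737280 + 92160 fits the budget)
print(can_start(512, 2880))  # False (would add 1474560 cpu-minutes)
```

As running jobs burn down their remaining time, budget is freed and queued jobs become eligible.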

Partition factor priority

To give some partition XXX (for example, big-memory nodes) a higher priority, use the partition term of the priority formula explained in Multifactor_Priority_Plugin:

(PriorityWeightPartition) * (partition_factor) +

The Partition factor is controlled in slurm.conf, for example:

PartitionName=XXX ... PriorityJobFactor=10
PriorityWeightPartition=1000
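A sketch of the resulting priority contribution, under the assumption (verify against the Multifactor_Priority_Plugin page) that the partition factor is the partition's PriorityJobFactor normalized by the largest PriorityJobFactor configured on any partition:

```python
# Sketch: partition term of the priority sum, assuming (an assumption,
# not confirmed by this page) normalization by the largest configured
# PriorityJobFactor. Partition names and factors are hypothetical.

PRIORITY_WEIGHT_PARTITION = 1000

partitions = {"normal": 1, "bigmem": 10}  # PriorityJobFactor per partition
max_factor = max(partitions.values())

for name, factor in partitions.items():
    contribution = PRIORITY_WEIGHT_PARTITION * factor / max_factor
    print(name, int(contribution))
```

With these numbers, jobs in the bigmem partition gain 1000 priority points versus 100 for the normal partition.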

Scheduling commands

View scheduling information for the Multifactor_Priority_Plugin with the following commands:

  • sprio - view the factors that comprise a job's scheduling priority:

    sprio     # List job priorities
    sprio -l  # List job priorities including username etc.
    sprio -w  # List weight factors used by the multifactor scheduler
  • sshare - Tool for listing the shares of associations to a cluster:

    sshare
    sshare -l    # Long listing with additional information
    sshare -a    # Listing that also includes user information
  • sdiag - Scheduling diagnostic tool for Slurm

Niflheim: Slurm_scheduler (last edited 2018-04-08 20:00:27 by OleHolmNielsen)