Batch jobs

This page gives an introduction to submitting batch jobs to Niflheim.

Batch system

The batch job queueing system on Niflheim is the open-source Torque resource manager (an old version, 2.3.7). Torque is based on an earlier version of OpenPBS (PBS = Portable Batch System).

There is extensive documentation of Torque; see the Torque_Administrator_Guide. Of special interest is the chapter on Job_submission.

The prioritization of jobs is done by a separate tool, the Maui Cluster Scheduler version 3.2.6p21 provided by Adaptive_Computing.

Batch queues

We have defined the following batch queues on NIFLHEIM:

  • small: Time < 25 minutes.
  • medium: 25 minutes < Time < 2 hours 15 minutes.
  • long: 2 hours 15 minutes < Time < 13 hours.
  • verylong: 13 hours < Time < 50 hours.

The Time is defined as wall-clock time, irrespective of the actual CPU-time or the number of nodes used. When you submit your batch job, you can either specify the queue name as given above, or the maximum wall-clock time.
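
For example, you can select a queue explicitly, or specify the maximum wall-clock time instead (your-script is a placeholder for your own job script):

qsub -q medium your-script                (submit to the medium queue)
qsub -l walltime=1:30:00 your-script      (request 1 hour 30 minutes of wall-clock time)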

Submitting batch jobs

Job scripts are submitted to the Torque batch system from the login nodes by qsub as in the examples shown below. If you're more familiar with other batch systems, there is an overview of commands in different batch systems in the page Rosetta Stone of Workload Managers.

Note: GPAW jobs need to be submitted using the gpaw-qsub script. The syntax of the command-line options of gpaw-qsub is the same as for qsub.
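
As a minimal sketch (assuming your GPAW Python script is named script.py, a placeholder, and is given as the final argument):

gpaw-qsub -q long -l nodes=1:ppn=8:xeon8 script.py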

A script file to be submitted with qsub might begin with lines like:

#!/bin/sh
### Note: No commands may be executed until after the #PBS lines
### Job name (comment out the next line to get the name of the script used as the job name)
#PBS -N test
### Output files (comment out the next 2 lines to get the job name used instead)
#PBS -e test.err
#PBS -o test.log
### Send mail when job is aborted or terminates normally
#PBS -m ae
### Queue name (small, medium, long, verylong)
#PBS -q long
### Number of nodes
#PBS -l nodes=1:ppn=8:xeon
### Requesting time - 12 hours - overrides the **long** queue setting
#PBS -l walltime=12:00:00

# Go to the directory from where the job was submitted (initial directory is $HOME)
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

### Here follows the user commands:
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS CPU cores

The $PBS... variables are set for the batch job by Torque. The complete list of variables is documented in Exported_Batch_Environment_Variables.

Further examples of Torque batch job submission are documented in Job_submission.
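
As a minimal usage sketch (assuming the job script above has been saved under the hypothetical name run_test.sh):

qsub run_test.sh     (prints the ID of the newly submitted job)
qstat -f <jobid>     (inspect the job using the returned job ID)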

Specifying a different project account

If you run jobs under different projects, you can make sure that each project gets accounted for separately in the system's accounting statistics. This feature may be required for certain projects.

By default all your jobs will be accounted under your primary UNIX group account. You can specify a non-default project account (for example, proj123) for each individual job by using this flag to the qsub command:

qsub -A proj123 ...

or in the job script file add a line like this one near the top:

#PBS -A proj123

Please use project names only by agreement with your project owner, and please inform support@fysik.dtu.dk of any project accounts you intend to use. Please restrict project account names to a maximum of 8 characters.
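
To check which UNIX group is your primary one (and hence your default accounting group), you can run:

id -gn               (prints your primary UNIX group)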

Requesting a minimum memory size

A number of node features can be requested; see the Torque Job_submission page and the Requesting_resources page.

For example, you may want a minimum physical memory size by requesting:

qsub -l nodes=2:ppn=16:xeon16,mem=120gb your-script  # (2 entire nodes with 16 CPU cores each; the total memory of all nodes must be at least 120 GB RAM)

Warning: The mem= parameter may possibly be ignored by Torque if the number of nodes is >1, so it may only have an effect on single-node jobs!
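
As an illustrative single-node example, where the mem= limit does take effect (the memory size and script name are placeholders):

qsub -l nodes=1:ppn=8:xeon8,mem=20gb your-script  # (1 entire node with 8 CPU cores and at least 20 GB of RAM)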

Do not request the maximum physical amount of RAM, since the memory available to users is slightly less than the physical RAM.

To see the available RAM memory sizes on the different nodes types see the Hardware page.

Waiting for specific jobs

It is possible to specify that a job should only run after another job has completed successfully; please see the -W flags on the qsub page.

To run your-script after job 12345 has completed successfully:

qsub -W depend=afterok:12345 your-script

Be sure that the exit status of job 12345 is meaningful: if it exits with status 0, your second job will run. If it exits with any other status, your second job will be cancelled.

It is also possible to run a job if another job fails (afternotok) or after another job completes, regardless of status (afterany). Be aware that the keyword after (as in -W depend=after:12345) means run after job 12345 has started.
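
For example (the job ID 12345 and the script names are placeholders):

qsub -W depend=afternotok:12345 cleanup-script    (run only if job 12345 fails)
qsub -W depend=afterany:12345 your-script         (run when job 12345 finishes, regardless of its exit status)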

Batch job node properties

Users must request the different node types explicitly:

  • xeon8: The HP DL160 or SL2x170z G6 server nodes with 8 Intel Nehalem Xeon X5500 series CPU cores and 24 GB of RAM.
  • xeon16: The HP SL230 Gen8 server nodes with 16 Intel Sandy Bridge Xeon E5-2650 CPU cores and 64, 128 or 256 GB of RAM and Infiniband network interconnect.

See the Hardware page for an overview of the RAM memory sizes of the different types of nodes.

Submitting jobs to 16-CPU Intel nodes

We define the dual-processor, 8-core Intel Sandy Bridge Xeon E5-2650 nodes (16 CPU cores in total) to have the node property xeon16 (nodes g001-g076).

If you run parallel jobs, it is obviously most efficient if you can parallelize over 16 CPUs in order to achieve maximum communication bandwidth.

Parallel jobs using more than 1 node must use an OpenMPI communications library built with Infiniband libraries. Please contact support@fysik.dtu.dk if you intend to use your own MPI library instead of the ones provided by our system.

You could submit a batch job like in these examples:

1) qsub -l nodes=2:ppn=16:xeon16 your-script            # (2 entire Xeon nodes with 16 CPUs each, for a total of 32 CPU cores)

2) qsub -l nodes=g038:ppn=16 your-script                # (explicitly the g038 node with 16 CPU cores)

3) qsub -l nodes=2:ppn=16:xeon16,mem=120gb your-script  # (2 entire nodes with 16 CPUs each; the total memory of all nodes must be at least 120 GB RAM)

Submitting jobs to 8-CPU Intel nodes

We define the dual-processor, quad-core Intel Nehalem Xeon X5500 series nodes (8 CPU cores in total) to have the node property xeon8 (nodes a001-a140, b001-b140, c001-c132, d001-d116).

If you run parallel jobs, it is obviously most efficient if you can parallelize over 8 CPUs in order to achieve maximum communication bandwidth.

You could submit a batch job like in these examples:

1) qsub -l nodes=2:ppn=8:xeon8 your-script        # (2 entire Xeon nodes with 8 CPUs each, for a total of 16 CPU cores)

2) qsub -l nodes=a038:ppn=8 your-script           # (explicitly the a038 node with 8 CPU cores)

Submitting 1-CPU jobs

You could submit a batch job like in this example:

1) qsub -l nodes=1:ppn=1:xeon8 your-script            # 1 CPU on an 8-CPU Xeon node

Please use the older 8-core nodes for serial (1-CPU) jobs: a single 1-CPU job should not block an entire 16-CPU node when it could block only an 8-CPU node instead.

More memory needed?

Each CPU core in Niflheim currently has 3, 4 or 8 GB of RAM available. On the multi-CPU servers the total RAM is of course shared among the running processes.

If your job exceeds the physical RAM size per process, we may decide to kill it because you're abusing the resources (see Monitoring batch jobs for more details)!

If you need additional RAM per process for your job, what do you do? The solution is to run jobs with a limited number of processes per node. For example, you may submit an NN-process job (NN being an integer) to NN xeon8 nodes:

qsub -l nodes=NN:ppn=8:xeon8 your-script

The job script should be constructed to use only the needed number of processes. This can be achieved by specifying, instead of simply mpiexec executable:

mpiexec -np NN --loadbalance executable

In this way NN processes will be distributed among NN nodes, running one process per node and giving each process the total memory available on its node.

By specifying:

mpiexec -np 2*NN --loadbalance executable

each process gets half of the memory available on its node. The --loadbalance option balances the total number of processes across all allocated nodes.
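
As a minimal job script sketch for this approach (assuming NN=4 and a placeholder executable name ./your_program; adjust the numbers to your own case):

#!/bin/sh
### Request 4 entire xeon8 nodes, but start only 4 MPI processes,
### i.e. one process per node, so each process can use the whole node's RAM
#PBS -l nodes=4:ppn=8:xeon8
cd $PBS_O_WORKDIR
mpiexec -np 4 --loadbalance ./your_program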

Monitoring batch jobs

The Torque command qstat is used to inquire about the status of one or more jobs:

qstat -f <jobid>     (Inquire about a particular jobid)
qstat -r             (List all running jobs)
qstat -a             (List all jobs)

In addition, the Maui scheduler can be inquired using the showq command:

showq -r             (List all running jobs)
showq                (List all jobs)

If you want to check the status of a particular job-id use:

checkjob <jobid>

Badly behaving jobs

Another useful command for monitoring batch jobs is pestat:

pestat -f # show status of badly behaving jobs, with bad fields marked by star (*)

Note: one of the most common kinds of bad behaviour of batch jobs is exhausting the available RAM. Please use RAM estimation tools before running programs! For GPAW consult https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html.

An example of usage of pestat:

pestat | grep 263945
q008  excl  4.08    7974   4  18628   1275  1/1    4    263945 user
q037  excl  4.02    7974   4  18628   1285  1/1    4    263945 user

The example job above is behaving correctly. Please consult the script located at /usr/local/bin/pestat for the description of the fields. The most important fields are:

  • second: Torque state; a node can be free (not all cores used), excl (all cores used), or down.
  • third: CPU load average.
  • seventh: Resident (used) memory, i.e. the total memory in use on the given node (the value reported under RES by the "top" command).

If the memory used (resident, field seven) exceeds the physical RAM on the node (field four), or the CPU load (field three) is significantly lower than the number of CPUs (field five), the job becomes a candidate for being killed.

An example of a job exceeding physical memory:

pestat -f | grep 128081
m016  busy* 4.00    7990   4  23992   9937* 1/1    4    128081 user
m018  excl  4.00    7990   4  23992   9755* 1/1    4    128081 user

An example of a job with incorrect CPU load:

pestat -f | grep 129284
a014  excl  7.00*  24098   8  72097   2530  1/1    8    129284 user

Searching for free resources

Show what resources are available for immediate use (see Batch_jobs#batch-job-node-properties for more options):

  • xeon8:

    showbf -f xeon8
  • xeon16:

    showbf -f xeon16

pestat can also be used to check what resources are free:

pestat  | grep free
n085  free  3.00    7990   4  23992    358  3/2    3    292171 user1 293290 user1 293857 user2
...
c001  free  0.00   24098   8  72097    199  0/0    0

The node n085 is occupied by 3 jobs (9th column) and two users (8th column), each requesting 1 core. The node c001 is completely free.

To find out what type of resources will become free soon, run the following (it needs to be run under [ba]sh):

sh
for id in `showq | head -10 | grep "Running" | cut -d " " -f 1`; do qstat -f1 $id |  grep exec_host | cut -d "=" -f 2 | awk -F "+" '{printf "%3d %.5s\n", NF, $1}'; done
80  q134
32  d010
32  d095
 4  n005
16  d048
16  d026
16  c052
exit

The first column contains the total number of cores for the 7 jobs at the top of the queue, and the second column the corresponding master node of each job. This information can be used to determine the node properties using Batch_jobs#batch-job-node-properties.

Estimating needed disk space

Find the 10 largest files under your home directory (this may take a long time):

find ${HOME} -printf "%s,'%p'\n" | sort -r -n -k 1 | head -10 | cut -d, -f2 | xargs du -h

Note: the purpose of the quoting in the above command is to handle filenames containing spaces.
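
A simpler, though less precise, alternative is to summarize disk usage per top-level entry of your home directory (this assumes the GNU sort -h option for human-readable sizes is available; hidden dot-directories are not included):

du -sh ${HOME}/* 2>/dev/null | sort -h | tail -10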
