This page gives an introduction to submitting batch jobs to Niflheim.

.. Contents::

Batch system

The batch job queueing system on Niflheim is the Open Source Torque_ resource manager (an old version, 2.3.7).
Torque is based on an earlier `OpenPBS <http://www.openpbs.org/about.html>`_ version (PBS = *Portable Batch System*).

There is extensive documentation of Torque_, see the Torque_Administrator_Guide_.
Of special interest is the chapter on Job_submission_.

.. _Torque: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/0-about/introduction.htm
.. _Torque_Administrator_Guide: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/0-about/guideOverview.htm
.. _Job_submission: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/2-jobs/jobSubmission.htm
.. _Exported_Batch_Environment_Variables: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/2-jobs/exportedBatchEnvVar.htm

The prioritization of jobs is done by a separate tool, the Maui_ Cluster Scheduler version 3.2.6p21 provided by Adaptive_Computing_.

.. _Adaptive_Computing: http://www.adaptivecomputing.com/
.. _Maui: http://docs.adaptivecomputing.com/maui/index.php

Batch queues

We have defined the following batch queues on Niflheim:

    * **small**: Time < 25 minutes.
    * **medium**: 25 minutes < Time < 2 hours 15 minutes.
    * **long**: 2 hours 15 minutes < Time < 13 hours.
    * **verylong**: 13 hours < Time < 50 hours.

The *Time* is defined as wall-clock time, irrespective of the actual CPU-time or the number of nodes used.
When you submit your batch job, you can either specify the queue name as given above, or the maximum wall-clock time.
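
The queue limits above can be expressed as a small shell helper that maps a requested wall-clock time to the matching queue (``pick_queue`` is an illustrative name for this sketch, not a Torque command):

```shell
# pick_queue MINUTES - print the Niflheim queue matching a wall-clock time.
# The limits come from the queue table above:
# 25 min, 2h15m = 135 min, 13h = 780 min, 50h = 3000 min.
pick_queue() {
  local min=$1
  if   [ "$min" -lt 25 ];   then echo small
  elif [ "$min" -lt 135 ];  then echo medium
  elif [ "$min" -lt 780 ];  then echo long
  elif [ "$min" -le 3000 ]; then echo verylong
  else echo "exceeds the 50-hour limit" >&2; return 1
  fi
}
```

For example, ``pick_queue 300`` prints ``long``.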

Submitting batch jobs

Job scripts are submitted to the Torque batch system from the login nodes
by qsub_ as in the examples shown below.
If you are more familiar with other batch systems, see the `Rosetta Stone of Workload Managers <http://www.schedmd.com/slurmdocs/rosetta.html>`_ page for an overview of equivalent commands in different batch systems.

.. _qsub: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/commands/qsub.htm

**Note**: GPAW jobs need to be submitted using the ``gpaw-qsub`` script. The syntax of the command-line options of ``gpaw-qsub`` is the same as for qsub_.

A script for a file to be submitted with qsub_ might begin with lines like::

  ### Note: No commands may be executed until after the #PBS lines
  ### Job name (comment out the next line to get the name of the script used as the job name)
  #PBS -N test
  ### Output files (comment out the next 2 lines to get the job name used instead)
  #PBS -e test.err
  #PBS -o test.log
  ### Send mail when job is aborted or terminates normally
  #PBS -m ae
  ### Queue name (small, medium, long, verylong)
  #PBS -q long
  ### Number of nodes
  #PBS -l nodes=1:ppn=8:xeon
  ### Requesting time - 12 hours - overwrites **long** queue setting
  #PBS -l walltime=12:00:00

  # Go to the directory from which the job was submitted (the initial directory is $HOME)
  echo Working directory is $PBS_O_WORKDIR
  cd $PBS_O_WORKDIR

  ### Here follows the user commands:
  # Define the number of processors allocated by Torque
  NPROCS=`wc -l < $PBS_NODEFILE`
  echo This job has allocated $NPROCS CPU cores

The *$PBS...* variables are set for the batch job by Torque.
The complete list of variables is documented in Exported_Batch_Environment_Variables_.

Further examples of Torque batch job submission are documented in Job_submission_.

Specifying a different project account

If you run jobs under different projects, you can make sure that each project gets accounted for separately in the system's accounting statistics.
This feature may be required for certain projects.

By default all your jobs will be accounted under your primary UNIX group account.
You can specify a non-default project account (for example, *proj123*) for each individual job by using this flag to the qsub_ command::

  qsub -A proj123 ...

or in the job script file add a line like this one near the top::

  #PBS -A proj123

Please use project names only by agreement with your project owner, and please inform support@fysik.dtu.dk of any project accounts you intend to use.
Please restrict yourself to **max 8 characters** in any project account names.

Requesting a minimum memory size

A number of node features can be requested, see the Torque_ Job_submission_ page
and the Requesting_resources_ page.

.. _Requesting_resources: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/2-jobs/requestingRes.htm

For example, you may want a minimum physical memory size by requesting::

  qsub -l nodes=2:ppn=16:xeon16,mem=120gb your-script # (2 entire nodes with 16 CPU cores each, the total memory of all nodes => 120 GB RAM)

**Warning:** The *mem=* parameter may possibly be ignored by Torque_ if the number of nodes is >1, so it may only have effect on single-node jobs!

Do not request the maximum physical amount of RAM, since the RAM memory available to users is slightly less than the physical RAM memory.

To see the available RAM memory sizes on the different node types, see the Hardware_ page.

Waiting for specific jobs

It is possible to specify that a job should only run after another job has completed successfully; please see the *-W* flag in the qsub_ page.

To run your-script after job 12345 has completed successfully::

  qsub -W depend=afterok:12345 your-script

Be sure that the exit status of job 12345 is meaningful: if it exits with status 0, your second job will run.
If it exits with any other status, your second job will be cancelled.

It is also possible to run a job if another job fails (``afternotok``) or after another job completes, regardless of status (``afterany``).
Be aware that the keyword ``after`` (as in ``-W depend=after:12345``) means run after job 12345 has *started*.
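
Since qsub_ prints the new job's id on stdout, a dependency chain can be built without typing ids by hand. Below is a sketch (``first-step.sh`` and ``second-step.sh`` are placeholder script names); the small helper merely strips the server suffix from a full Torque job id:

```shell
# jobid_num FULL_ID - strip the server suffix from a Torque job id,
# e.g. "12345.server.example.com" -> "12345"
jobid_num() {
  echo "${1%%.*}"
}

# Typical usage (requires a Torque installation):
#   FIRST=$(qsub first-step.sh)                       # e.g. 12345.server
#   qsub -W depend=afterok:$(jobid_num "$FIRST") second-step.sh
```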

Batch job node properties

Users **must** request the different node types explicitly:

 * **xeon8**: The HP DL160 or SL2x170z G6 server nodes with 8 Intel *Nehalem* Xeon X5500 series CPU cores and 24 GB of RAM.

 * **xeon16**: The HP SL230 Gen8 server nodes with 16 Intel *Sandy Bridge* Xeon E5-2650 CPU cores and 64, 128 or 256 GB of RAM and Infiniband network interconnect.

See the Hardware_ page for an overview of the RAM memory sizes of the different types of nodes.

Submitting jobs to 16-CPU Intel nodes

The dual-processor, 8-core Intel *Sandy Bridge* Xeon E5-2650 nodes (16 CPU cores in total) have the node property ``xeon16`` (nodes g001-g076).

If you run parallel jobs, it is obviously most efficient if you can parallelize over 16 CPUs in order to achieve maximum communication bandwidth.

Parallel jobs using more than 1 node **must** use an OpenMPI communications library built with Infiniband libraries.
Please contact support@fysik.dtu.dk if you intend to use your own MPI library instead of the ones provided by our system.

You could submit a batch job like in these examples::

 1) qsub -l nodes=2:ppn=16:xeon16 your-script # (2 entire Xeon nodes with 16 CPUs each, for a total of 32 CPU cores)

 2) qsub -l nodes=g038:ppn=16 your-script # (explicitly the g038 node with 16 CPU cores)

 3) qsub -l nodes=2:ppn=16:xeon16,mem=120gb your-script # (2 entire nodes with 16 CPUs each, the total memory of all nodes => 120 GB RAM)

Submitting jobs to 8-CPU Intel nodes

The dual-processor, quad-core Intel *Nehalem* Xeon X5500 series nodes (8 CPU cores in total) have the node property ``xeon8`` (nodes a001-a140,b001-b140,c001-c132,d001-d116).

If you run parallel jobs, it is obviously most efficient if you can parallelize over 8 CPUs in order to achieve maximum communication bandwidth.

You could submit a batch job like in these examples::

 1) qsub -l nodes=2:ppn=8:xeon8 your-script # (2 entire Xeon nodes with 8 CPUs each, for a total of 16 CPU cores)

 2) qsub -l nodes=a038:ppn=8 your-script # (explicitly the a038 node with 8 CPU cores)

Submitting 1-CPU jobs

You could submit a batch job like in this example::

 1) qsub -l nodes=1:ppn=1:xeon8 your-script # 1 CPU on an 8-CPU Xeon node

Please use the older 8-core nodes for serial (1-CPU) jobs: a single 1-CPU job should not block an entire 16-CPU node, so it is better to occupy only an 8-CPU node.

More memory needed?

Each CPU core in Niflheim currently has 3, 4 or 8 GB of RAM available.
On the multi-CPU servers the total RAM is of course shared among the running processes.

If your job exceeds the physical RAM size per process, we may decide to **kill your job** because you're abusing the resources
(see `Monitoring batch jobs`_ for more details)!

If you need additional RAM per process for your job, what do you do?
The solution is to run jobs with a limited number of processes per node.
For example, you may submit an NN-process job (NN being an integer) to NN xeon8 nodes::

  qsub -l nodes=NN:ppn=8:xeon8 your-script

The job script should be constructed to use only the needed number of processes.
This can be achieved by specifying, instead of simply ``mpiexec executable``::

  mpiexec -np NN --loadbalance executable

In this way NN processes will be distributed among NN nodes, running one process per node
and giving each process the total memory available on its node.

By specifying::

  mpiexec -np 2*NN --loadbalance executable

you get half of the node's memory per process.
The ``--loadbalance`` option balances the total number of processes across all allocated nodes.
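
The memory-per-process arithmetic above can be sketched as a small shell helper (``mem_per_proc`` is an illustrative name, not a system command):

```shell
# mem_per_proc NODE_MEM_GB NN NP - GB of RAM available to each of NP
# processes spread evenly (as with --loadbalance) over NN identical nodes.
mem_per_proc() {
  local node_mem_gb=$1 nn=$2 np=$3
  echo $(( node_mem_gb * nn / np ))
}
```

With 24 GB xeon8 nodes, ``mem_per_proc 24 4 4`` gives 24 GB per process (one process per node), while ``mem_per_proc 24 4 8`` gives 12 GB (two processes per node).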

Monitoring batch jobs

The Torque command qstat_ is used to inquire about the status of one or more jobs::

  qstat -f <jobid> (Inquire about a particular jobid)
  qstat -r (List all running jobs)
  qstat -a (List all jobs)

.. _qstat: http://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/commands/qstat.htm

In addition, the Maui_ scheduler can be inquired using the showq_ command::

  showq -r (List all running jobs)
  showq (List all jobs)

If you want to check the status of a particular job-id use::

  checkjob <jobid>

.. _showq: http://docs.adaptivecomputing.com/maui/commands/showq.php

Badly behaving jobs

Another useful command for monitoring batch jobs is pestat_::

  pestat -f # show status of badly behaving jobs, with bad fields marked by star (*)

.. _pestat: http://www.clusterresources.com/pipermail/torqueusers/2007-September/006188.html

**Note**: one of the most common kinds of bad behaviour in batch jobs is exhausting the available RAM memory.
Please use RAM memory estimate tools before running programs!
For GPAW consult https://wiki.fysik.dtu.dk/gpaw/documentation/parallel_runs/parallel_runs.html.

An example of usage of pestat_::

  pestat | grep 263945
  q008 excl 4.08 7974 4 18628 1275 1/1 4 263945 user
  q037 excl 4.02 7974 4 18628 1285 1/1 4 263945 user

The example job above is behaving correctly.
Please consult the script located at ``/usr/local/bin/pestat`` for the description of the fields.
The most important fields are:

- second: **Torque state** - the node can be *free* (not all cores in use), *excl* (all cores in use), or *down*,

- third: **CPU load average**,

- seventh: **Resident (used) memory** - the total memory in use on the given node (the value reported under RES by the ``top`` command).

If the resident (used) memory exceeds the physical RAM on the node (fourth field), or the CPU load is significantly lower than the number of CPU cores (fifth field), the job becomes a candidate for being killed.

An example of a job exceeding physical memory::

  pestat -f | grep 128081
  m016 busy* 4.00 7990 4 23992 9937* 1/1 4 128081 user
  m018 excl 4.00 7990 4 23992 9755* 1/1 4 128081 user

An example of a job with incorrect CPU load::

  pestat -f | grep 129284
  a014 excl 7.00* 24098 8 72097 2530 1/1 8 129284 user
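
The two checks above can be sketched as a small awk filter over a pestat output line (``pestat_check`` is an illustrative helper, not part of pestat, and the 0.5 load margin is an arbitrary assumption):

```shell
# pestat_check LINE - print "ok", or list the checks a pestat line fails.
# Fields used (see the description above): 3=CPU load, 4=physical RAM (MB),
# 5=number of CPU cores, 7=resident memory (MB).
pestat_check() {
  echo "$1" | awk '{
    bad = ""
    if ($7 + 0 > $4 + 0) bad = bad " mem"     # resident memory > physical RAM
    if ($3 + 0 < $5 - 0.5) bad = bad " load"  # load well below the core count
    out = (bad == "") ? "ok" : "bad:" bad
    print out
  }'
}
```

Applied to the examples above, the q008 line prints ``ok``, the m016 line ``bad: mem``, and the a014 line ``bad: load``.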

Searching for free resources

Show what resources are available for immediate use (see `Batch_jobs#batch-job-node-properties`_ for more options):

- xeon8::

   showbf -f xeon8

- xeon16::

   showbf -f xeon16

pestat_ can also be used to check what resources are free::

  pestat | grep free
  n085 free 3.00 7990 4 23992 358 3/2 3 292171 user1 293290 user1 293857 user2
  c001 free 0.00 24098 8 72097 199 0/0 0

The node n085 is occupied by 3 jobs (9th column) and two users (8th column) each requesting 1 core.
The node c001 is totally free.

To find what type of resources will become free soon, run (needs to be run under [ba]sh)::

  for id in `showq | head -10 | grep "Running" | cut -d " " -f 1`; do qstat -f1 $id | grep exec_host | cut -d "=" -f 2 | awk -F "+" '{printf "%3d %.5s\n", NF, $1}'; done
  80 q134
  32 d010
  32 d095
   4 n005
  16 d048
  16 d026
  16 c052

The first column contains the total number of cores for the 7 jobs at the top of the queue, the second column the corresponding master node of these jobs.
This information can be used to determine the node properties using `Batch_jobs#batch-job-node-properties`_.

Estimating needed disk space

Find the 10 largest files under home directory (may take a long time)::

  find ${HOME} -printf "%s,'%p'\n" | sort -r -n -k 1 | head -10 | cut -d, -f2 | xargs du -h

**Note**: the purpose of the quoting in the above command is to handle filenames containing spaces.
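
If per-file granularity is not needed, a quicker overview can be had per directory (a sketch assuming GNU ``du`` and ``sort -h``; ``top_usage`` is an illustrative name):

```shell
# top_usage DIR - show the 10 largest entries one level below DIR.
top_usage() {
  du -h --max-depth=1 "$1" 2>/dev/null | sort -rh | head -10
}
```

For example, ``top_usage ${HOME}`` lists the biggest directories in your home directory.
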
Please see the Niflheim7_Getting_started_ page.

Niflheim: Batch_jobs (last edited 2018-04-06 14:15:58 by OleHolmNielsen)