Torque is used here as the batch system, with Maui as the job scheduler.

This configuration has been tested with torque-2.3.6 and maui-3.2.6p21.

On dulak-server

  • if not yet done, go to configuring rpmbuild,

  • download torque-2.3.6.tar.gz, and:

    cd /tmp
    tar zxf ~/torque-2.3.6.tar.gz
    chown -R root.root torque-2.3.6    # security
    chmod -R go-w torque-2.3.6         # security
    cd torque-2.3.6
    ./configure --disable-rpp --disable-gui --without-tcl

    Note that this will configure torque to install under /usr/local.

  • build RPMs by doing:

    make rpm

    which will create the RPMs under /root/RPMS/i386/torque-*.

  • Skip this step if not installing on "dulak-server": copy the RPMs to the dulak-server:/home/dulak-server/rpm directory, and:

    cd /home/dulak-server/rpm

  • install the RPMs with yum so that you get an installation log:

    cd /root/RPMS/i386/
    yum localinstall --nogpgcheck torque-2*.rpm torque-mom-2*.rpm torque-client-2*.rpm \
                                  torque-docs-2*.rpm torque-devel-2*.rpm torque-server-2*.rpm

    Skip this step if not installing on "dulak-server": Make sure that /var/spool/torque/server_name contains dulak-server.dulak-cluster.fysik.dtu.dk.

  • download torque.csh and torque.sh to /etc/profile.d:

    chmod go+r /etc/profile.d/torque.*

  • Skip this step if not installing on "dulak-server": copy to the "Golden Client":

    scp /etc/profile.d/torque.* n001:/etc/profile.d

  • Initialize the PBS/Torque server once only with:

    pbs_server -t create

  • use the contents of qmgr_print_server as a base for configuration of a Routing Queue:

    qmgr < qmgr_print_server

    Note: this will set up the following queues: hour (default), day, halfday, twodays, and week.

    For a workstation installation, remove the lines that reference dulak-cluster and replace dulak-server with localhost.

  • download the node definitions file, nodes, to /var/spool/torque/server_priv/.

    Note that the "Golden Client" participates in the node pool. If you prefer to remove it from the pool, make sure that pbs_mom is not running on it: ssh n001 "service pbs_mom stop".

    The nodes have the pentium property (properties are useful if the cluster is later extended with other types of nodes, to avoid load-balancing problems).

  • Download (you have to register) maui-3.2.6p21.tar.gz to ~/rpmbuild/SOURCES, and change the following in the maui specfile (~/rpmbuild/SPECS/maui-*.spec):

    --with-key='fys-Jul-19-2007'

    and keep it secret.

  • build RPMS:

    cd ~/rpmbuild/SPECS
    rpmbuild -bb maui-3.2.6p21.spec

  • Skip this step if not installing on "dulak-server": copy the RPMs to the dulak-server:/home/dulak-server/rpm directory.

  • install:

    yum localinstall --nogpgcheck /root/RPMS/*/maui*.rpm

  • make sure that the following files/directories exist:

    cd /var/spool/maui/traces/; touch Resource.Trace1 Workload.Trace1
    mkdir /var/spool/maui/log

  • change all occurrences of localhost in /var/spool/maui/maui.cfg into dulak-server.dulak-cluster.fysik.dtu.dk (use your own hostname if installing a workstation):

    sed -i 's/localhost/dulak-server.dulak-cluster.fysik.dtu.dk/g' /var/spool/maui/maui.cfg

  • change /var/log/maui.log in /var/spool/maui/maui.cfg into /var/spool/maui/log/maui.log:

    sed -i 's#/var/log/maui.log#/var/spool/maui/log/maui.log#' /var/spool/maui/maui.cfg

  • if needed (check the file /etc/init.d/maui first), update the paths in /etc/init.d/maui to reflect the installation under /usr/local:

    sed -i 's#/usr/sbin#/usr/local/sbin#g' /etc/init.d/maui
    sed -i 's#/usr/bin/schedctl#/usr/local/bin/schedctl#' /etc/init.d/maui
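
The two maui.cfg substitutions above can be wrapped in a small helper that keeps a backup of the original file. This is only a minimal sketch, not something shipped with Torque or Maui: the function name patch_maui_cfg and the .orig backup suffix are assumptions made for illustration, and sed -i assumes GNU sed, as used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the maui package): apply the two maui.cfg
# edits from the steps above to the file given as $1, using the hostname in $2.
patch_maui_cfg() {
    cfg=$1
    host=$2
    if [ ! -f "$cfg" ]; then
        echo "patch_maui_cfg: no such file: $cfg" >&2
        return 1
    fi
    # Keep a one-time backup so that the original file can be restored.
    [ -e "$cfg.orig" ] || cp -p "$cfg" "$cfg.orig"
    # Replace every occurrence of localhost with the server hostname.
    sed -i "s/localhost/$host/g" "$cfg"
    # Point the log file into /var/spool/maui/log.
    sed -i 's#/var/log/maui.log#/var/spool/maui/log/maui.log#' "$cfg"
}

# Example (run as root on the server):
# patch_maui_cfg /var/spool/maui/maui.cfg dulak-server.dulak-cluster.fysik.dtu.dk
```

Running the helper a second time leaves the file unchanged, because neither search pattern matches the already edited text.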

On Golden Client

  • install RPMS:

    cd /home/dulak-server/rpm
    rpm -ivh torque-2*.rpm torque-mom-2*.rpm torque-client-2*.rpm

    Make sure that /var/spool/torque/server_name contains dulak-server.dulak-cluster.fysik.dtu.dk.

  • synchronize ntp with "dulak-server" (needed by maui) by adding:

    server 10.3.0.2

    in /etc/ntp.conf

    and (see https://bugzilla.redhat.com/show_bug.cgi?id=456743):

    service ntpd restart

    wait ~30 minutes (until you see time reset in /var/log/messages).
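
The ntp.conf edit above can be made repeatable, so that running it twice does not add a duplicate line. The following is a minimal sketch under stated assumptions: the function name ensure_ntp_server is hypothetical, and the file is assumed to end with a newline so that the appended line starts on its own line.

```shell
#!/bin/sh
# Hypothetical helper: add "server <address>" to an ntp.conf-style file unless
# the same server entry is already present. $1 = file, $2 = server address.
ensure_ntp_server() {
    conf=$1
    ip=$2
    if [ ! -f "$conf" ]; then
        echo "ensure_ntp_server: no such file: $conf" >&2
        return 1
    fi
    # awk exits with status 0 only if a matching "server <address>" line exists.
    if awk -v ip="$ip" '$1 == "server" && $2 == ip { found = 1 } END { exit !found }' "$conf"; then
        echo "server $ip already present in $conf"
    else
        echo "server $ip" >> "$conf"
        echo "added server $ip to $conf"
    fi
}

# Example (run as root on the Golden Client), followed by the ntpd restart:
# ensure_ntp_server /etc/ntp.conf 10.3.0.2 && service ntpd restart
```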

On dulak-server

Continue:

  • download config, epilogue, and health_check_script, set permissions, and (when installing a cluster) copy to "Golden Client":

    chmod u+x config epilogue health_check_script
    scp config epilogue health_check_script n001:/var/spool/torque/mom_priv/

    Note the line:

    $usecp *:/home/dulak-server /home/dulak-server

    in config: it is necessary for the batch system to be able to copy job output files back. If your nodes have more than one core, you must change the $ideal_load and $max_load variables.

    When installing on a workstation, remove the $usecp line, use $pbsserver localhost, and change arch to match your architecture. Copy the modified config directly to the workstation's /var/spool/torque/mom_priv/. The other two files need no modification.

  • Warning: in torque-2.3.6 the reload case in /etc/rc.d/init.d/pbs_mom contains a bug: replace SIGHUP with HUP.

  • Skip this step if not installing on "dulak-server": copy the fixed file to the "Golden Client":

    scp /etc/rc.d/init.d/pbs_mom n001:/etc/rc.d/init.d/

  • start PBS mom on the "Golden Client" (run service pbs_mom restart locally when installing a workstation):

    ssh n001 "service pbs_mom restart"

    Note that on a production system, after changing the config file on compute nodes, pbs_mom must not be restarted with service pbs_mom restart while it is still running a batch job!

    Instead, a restart can in principle be scheduled (although this does not seem to work!):

    momctl -q enablemomrestart=1 -h :ALL

    so an immediate reload needs to be performed:

    service pbs_mom reload

  • restart pbs_server:

    qterm -t quick
    pbs_server

  • check node status:

    pbsnodes -a

    Make sure that all the running nodes (on dulak-cluster: currently n001 - the "Golden Client") have free status.

  • start maui:

    service maui start
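
The node status check above can be automated by filtering the pbsnodes output for nodes that are not free. This is a minimal sketch: the function name list_non_free_nodes is hypothetical, and it assumes the usual pbsnodes -a layout, where each node name starts in the first column and is followed by indented "state = ..." lines.

```shell
#!/bin/sh
# Hypothetical helper: read `pbsnodes -a` style output on stdin and print
# "<node> <state>" for every node whose state is not exactly "free".
list_non_free_nodes() {
    awk '
        # An unindented, non-empty line is assumed to start a new node record.
        /^[^[:space:]]/ { node = $1 }
        # Indented attribute line of the form: state = <value>
        $1 == "state" && $2 == "=" {
            if ($3 != "free") print node, $3
        }
    '
}

# Example (run on the server); no output means that all nodes are free:
# pbsnodes -a | list_non_free_nodes
```

Note that a node running jobs reports a state other than free, so this check is mainly useful right after installation, before any jobs are submitted.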

Go to installing software.

Niflheim: Building_a_Cluster_-_Tutorial/installing_and_configuring_batch_system (last edited 2010-11-04 13:02:22 by OleHolmNielsen)