Linux cluster system administration

Some topics of relevance for cluster system administrators are described on this page. The NIFLHEIM project required the use of a number of techniques, and below we document in some detail how we install and manage a cluster of 900+ nodes.

Cluster installation software

There are many ways to install Linux on a number of nodes, and many toolkits exist for this purpose. The NIFLHEIM cluster uses the SystemImager toolkit (see below).

Cluster installation toolkits which we have seen over the years include the following:

Cloning of nodes with SystemImager

The NIFLHEIM cluster uses the SystemImager toolkit on a central server to create an image of a Golden Client node that has been installed in the usual way from a distribution on CD-ROM (CentOS Linux in our case). SystemImager is subsequently used to install identical images of the Golden Client on all of the nodes (changing, of course, the hostname and network parameters).

Installing SystemImager

We have some notes on SystemImager_Installation.

When you have downloaded the Golden Client disk image to the image server, you can find a suitable Linux kernel and initial ram-disk, the so-called UYOK (Use Your Own Kernel) files in SystemImager parlance, in this directory:

/var/lib/systemimager/images/<image-name>/etc/systemimager/boot/

Copy the files kernel and initrd.img to the image server directory /tftpboot, possibly renaming them so that they describe the type of Golden Client on which they were generated (you may end up with a number of such kernel and initrd.img files over time).
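For example (a minimal sketch; the naming scheme with the image name as suffix is just an illustration):

cd /var/lib/systemimager/images/<image-name>/etc/systemimager/boot
cp kernel /tftpboot/kernel.<image-name>
cp initrd.img /tftpboot/initrd.img.<image-name>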

PXE network booting of nodes

SystemImager allows you to boot and install nodes using the nodes' Ethernet network interface. You will be using PXE, the Intel-defined Pre-Boot eXecution Environment which is implemented in all modern Ethernet chips. The following advice works correctly for Ethernet chips with built-in PXE, but for older versions of PXE you may have to install a pxe daemon RPM on the server (not discussed any further here); the pxe daemon is not necessary with modern PXE versions.

Please consult this page for detailed information about PXE:

Node's BIOS boot order

We recommend configuring the client node BIOS with a boot order similar to the following:

  1. Diskette
  2. USB devices
  3. CD-ROM
  4. Network (PXE)
  5. Hard disk

The first three items enable you to perform node diagnostics and configuration. The Network (PXE) option will be the normal boot mode. The final hard disk option is only used if all the preceding ones fail (for troubleshooting only).

Configuring PXELINUX

Please consult this page for detailed information about PXE:

DHCP setup

The DHCP server daemon must be configured correctly for booting and installation of client nodes; please see the SYSLINUX documentation. Our configuration in /etc/dhcpd.conf is (partially):

# option-140 is the IP address of your SystemImager image server
option option-140 code 140 = text;
filename "pxelinux.0";          # For i386
#filename "elilo.efi";          # For ia64
subnet 10.1.0.0 netmask 255.255.0.0 {
  option domain-name "dcsc.fysik.dtu.dk";
  option domain-name-servers 10.1.128.5;
  option routers 10.1.128.5;            # Fake default gateway
  option ntp-servers    10.1.128.5;     # NTP server
  option time-offset    3600;           # Middle European Time (MET), in seconds
  option option-140 "10.1.128.5";
  next-server 10.1.128.5;               # next-server is your network boot server
  option log-servers 10.1.128.5;        # log-servers

  host n001 { hardware ethernet 00:11:25:c4:8e:82; fixed-address n001.dcsc.fysik.dtu.dk;}
  # Lots of additional hosts...
  }

Of course, you have to change the IP addresses and domain names for your own cluster.

If your cluster is on a private network (such as the 10.x.y.z net) and your DHCP server has multiple network interfaces, you must make sure that your DHCP server doesn't offer DHCP service to the non-cluster networks (a sure way to find a lot of angry colleagues before long :-). Edit the Red Hat configuration file /etc/sysconfig/dhcpd to contain:

DHCPDARGS=eth1

(where eth1 is the interface connected to your cluster) and restart the dhcpd daemon (service dhcpd restart).

Registering client node MAC addresses

The client nodes' Ethernet MAC addresses must be configured in the /etc/dhcpd.conf file. Alternatively, you can let the DHCP server hand out IP addresses freely, but then you may lose the ability to identify nodes physically from their IP addresses.

We recommend using statically assigned IP addresses in /etc/dhcpd.conf. This can be achieved by the following procedure:

  1. Configure the DHCP server without the clients' MAC-addresses and use the deny unknown-clients DHCP option in /etc/dhcpd.conf.
  2. Connect the client nodes to the network and turn them on one by one. In the NIFLHEIM installation we did this as part of the setup process, at the same time as we customized the BIOS settings.
  3. For each client node, copy the client's Ethernet MAC-address from the server's /var/log/messages file. Label each client node with an adhesive label showing the correct node name.
  4. In a file listing the client node names, add each MAC-address to the corresponding node's line.
  5. When all nodes have been registered, use a simple awk-script or similar (see the sketch after this list) to convert the list into lines for the /etc/dhcpd.conf file, such as this one:

    host n001 { hardware ethernet 00:08:02:8e:05:f2; fixed-address n001.dcsc.fysik.dtu.dk;}
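A minimal awk sketch, assuming the list file (here called nodelist, a hypothetical name) contains one node per line in the format "n001 00:08:02:8e:05:f2"; paste the output into the subnet block of /etc/dhcpd.conf:

awk '{ printf "host %s { hardware ethernet %s; fixed-address %s.dcsc.fysik.dtu.dk;}\n", $1, $2, $1 }' nodelist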

Network installation of nodes

With the above setup you're now ready to boot and install a fresh node across the network using SystemImager.

Make sure that the PC BIOS has been set up with a boot order where network/PXE boot precedes booting from hard disk. Attach a monitor to watch the installation process (for the first node or two, at least). Monitor the DHCP server's /var/log/messages file to ensure that the client node actually requests and is assigned a proper IP address, and that the client downloads the kernel and initrd.img files successfully by TFTP.
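For example, follow the relevant log lines on the server with (a sketch; the daemon names logged may vary with your distribution):

tail -f /var/log/messages | grep -Ei 'dhcpd|tftp'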

The client node's PXE firmware will now transfer the small Linux kernel and ram-disk and begin the installation process by transferring the Golden Client disk image using rsync.

After the installation is complete, and if you don't use the Automated network installation below, you must change the boot order so that network/PXE booting no longer precedes booting from hard disk. Reboot the node and watch it boot Linux from its own hard disk. The IP address should be assigned correctly by the DHCP server.

Automated network installation

Having to watch the installation process and finally change the client nodes' BIOS setup is cumbersome when you have more than a dozen or two client nodes.

After having tested the network installation process manually as described above, you can automate it completely using the pxeconfig toolkit written by Bas van der Vlies. A client node installation then becomes as simple as configuring on the central server whether the node should perform a network installation or simply boot from hard disk: when the node is turned on, everything happens automatically with no operator intervention at all! The BIOS boot order must still have PXE/network before the hard disk. The SystemImager toolkit actually contains a similar utility called si_netbootmond, but in our opinion the pxeconfig toolkit is a better solution.

Please see the following page for information about the pxeconfig toolkit:

Fixing the node system date

When you install brand new node hardware for the first time, the hardware real-time clock is probably off by a few hours. It is mandatory to have a correct system date on all compute nodes and servers; otherwise the Torque batch system will have problems, and NFS may break as well.

Use pdsh (see the section on parallel commands below) to examine the system date on all nodes in question, for example nodes a001-a140:

pdsh -w a[001-140] 'date +%R' | dshbak -c
----------------
a[001-140]
----------------
14:20

Here the clocks are OK.

To synchronize all hardware clocks you need an NTP time server; let's assume it is named ntpserver. Update the clocks on the desired nodes, for example:

pdsh -w a[001-140] 'service ntpd stop; ntpdate ntpserver; service ntpd start'

If the Torque client daemon is already running, you need to restart it after fixing the system date, for example:

pdsh -w a[001-140] 'service pbs_mom restart'

Turning PC nodes into servers

Many PCs can be turned into compute cluster nodes by configuring their BIOS to operate without keyboard and mouse, and to perform network/PXE booting at power-up. It is also important to be able to save and restore the BIOS configuration to removable media (such as diskette) for reliable replication of the BIOS setup.

When selecting PCs as cluster nodes, the hardware ought to be suitable for mounting on shelves. We therefore use these criteria for selecting appropriate PC hardware:

  • Cabinet must be small enough, but not so small as to prevent efficient cooling. With modern fast and hot PCs, cooling is the single most critical factor for reliable operation! The cabinet must be able to stand on its side without a floor stand (because of the physical space required).
  • Air flow must be front to back, as in true servers, and not a mish-mash of fans blowing air at a number of places around the cabinet. The HP/Compaq EVO d530 comes to mind...

Troubleshooting of node hardware

With a substantial number of nodes in a cluster, hardware failures are inevitable and must be dealt with efficiently. We list some useful tools below.

Ultimate Boot CD

If you would like a comprehensive set of tools, including diagnostics tools, you may want to take a look at the Ultimate Boot CD project. You can download an ISO image for burning your own CD.

Disk problems

The primary disk analysis tool for HP/Compaq PCs is available from the BIOS menus (press F10 at boot) under the item Storage->IDE DPS Self-test. This built-in diagnostic can scan a disk for errors.

In addition, the various hard disk vendors have their own diagnostics tools. We refer to some of their home pages:

Memory problems

Both memory errors and CPU errors can be detected by a very useful tool, Memtest86 - A Stand-alone Memory Diagnostic, which is available under the GNU General Public License. The most modern version of this tool is Memtest86+.

This tool is usually booted from a diskette or CD-ROM drive. The memory tester runs numerous tests and will loop indefinitely. Typically, serious errors show up immediately, whereas some errors have shown up only intermittently after testing for 12-24 hours. The Memtest86 tool is much better than the vendors' built-in memory diagnostics.

Network booting Memtest86

It is possible to boot a PC using PXE network-booting and immediately run the Memtest86 executable from the network. First, please refer to our SystemImager page for how to set up PXE booting and possibly automate the selection of boot images from the central server. Second, define a new Memtest86 PXE-boot method by creating the file /tftpboot/pxelinux.cfg/default.memtest with the following content:

default memtest
label memtest
kernel memtest86

The memtest86 kernel to be booted should be copied from the Memtest86+ source tarball available at http://www.memtest.org. Unpack the tarball and copy the file precomp.bin to the file /tftpboot/memtest86. In some versions of Memtest86+ the precomp.bin file is outdated, and you need to run make and copy the file memtest.bin instead.

Now you use the PXE-booting tools described above to let the central server determine which image is booted when the PC does a PXE-boot. Basically, the hex-encoded IP-address of the PXE-client must be a symbolic link to the file default.memtest, thus causing the PXE-client to boot into Memtest86.
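As an illustration (a sketch; the IP address 10.1.0.1 for node n001 is an assumed example), the PXELINUX configuration file name is the client's IP address in upper-case hexadecimal:

# 10.1.0.1 -> 0A010001
printf '%02X%02X%02X%02X\n' 10 1 0 1
cd /tftpboot/pxelinux.cfg
ln -sf default.memtest 0A010001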

Batch software

Torque (the Open Source version of the Portable Batch System) is used for batch job management, and there is extensive documentation of Torque. The MAUI batch scheduler is used for sophisticated batch job policies in conjunction with Torque (MAUI also works with many other batch systems).

Networking considerations

Using multiple network adapters

Some machines, especially servers, are equipped with dual Ethernet ports on the motherboard. In order to use both ports for increased bandwidth and/or redundancy, Linux must be configured appropriately.

We have a page about MultipleEthernetCards.

SSH setup

In order to run parallel codes we use the MPI message-passing interface (see the NIFLHEIM Cluster software - RPMS page). A prerequisite is the ability for all users to start processes on remote nodes without having to enter their password. This is accomplished using the Secure Shell (SSH) remote login in combination with a globally available /etc/hosts.equiv file that controls the way nodes permit password-less logins.
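A minimal sketch for generating such a hosts.equiv file, assuming node names n001 through n940 (adjust the range and naming to your cluster):

# One node name per line
seq -w 1 940 | sed 's/^/n/' > /etc/hosts.equiv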

The way we have chosen to configure SSH within the NIFLHEIM cluster is to clone the SystemImager Golden Client's SSH configuration files in the /etc/ssh directory on all nodes, meaning that all nodes have identical SSH keys. In addition, the SSH public-key database file ssh_known_hosts contains a single line for all cluster nodes, where all nodes have identical public keys.

When you have determined the Golden Client's public key, you can automatically generate the ssh_known_hosts file using our simple C-code clusterlabel.c (define the SSH_KEY constant in the code using your own public key). Place the resulting ssh_known_hosts file in the /etc/ssh directory on all nodes; this is easily accomplished on the Golden Client first, before cloning the other nodes (alternatively, the file can be distributed later).
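If you prefer a shell alternative to clusterlabel.c, here is a minimal sketch, assuming node names n001 through n940 and the Golden Client's RSA host key in the standard location /etc/ssh/ssh_host_rsa_key.pub:

# Build a single ssh_known_hosts line listing all nodes with the shared host key
HOSTS=$(seq -w 1 940 | sed 's/^/n/' | paste -sd, -)
KEY=$(cut -d' ' -f1-2 /etc/ssh/ssh_host_rsa_key.pub)
echo "$HOSTS $KEY" > ssh_known_hosts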

The root superuser is a special case, since /etc/hosts.equiv is ignored for this user. The best method for password-less root logins is to create public keys on the (few) central servers from which you wish to allow password-less root login to all cluster nodes. We have made a useful script authorized_keys for this purpose, usable for any user including root. In the case of the root user, the contents of the file /root/.ssh/id_rsa.pub are appended to /root/.ssh/authorized_keys, and this file must be distributed to all client nodes, thereby enabling password-less root access.
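A sketch of the manual steps behind this (standard OpenSSH file locations; pdcp is the parallel copy companion of pdsh, described below):

# On the central server: create a key pair (if absent) and authorize it
ssh-keygen -t rsa -f /root/.ssh/id_rsa -N ''
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
# Push the file to all nodes listed in /etc/machines
pdcp -a /root/.ssh/authorized_keys /root/.ssh/authorized_keys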

Alternatively, you can create the file /root/.shosts on all client nodes, containing a line for each of the central servers.

Kernel ARP cache

If the number of network devices (cluster nodes plus switches etc.) approaches or exceeds 512, you must consider the Linux kernel's limited dynamic ARP-cache size. Please read the man-page man 7 arp about the kernel's ARP-cache.

ARP (Address Resolution Protocol) is the kernel's mapping between IP-addresses (such as 10.1.2.3) and Ethernet MAC-addresses (such as 00:08:02:8E:05:F2). If the soft maximum number of entries to keep in the ARP cache, gc_thresh2=512, is exceeded, the kernel will try to remove ARP-cache entries by a garbage collection process. This is going to hit you in the form of sporadic loss of connectivity between pairs of nodes. No garbage collection will take place if the ARP-cache has fewer than gc_thresh1=128 entries, so you should be safe if your network is smaller than this number.
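To see how close you are to these thresholds, count the current number of ARP-cache entries on a node:

/sbin/arp -an | wc -l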

The best solution to this ARP-cache thrashing problem is to increase the kernel's ARP-cache garbage collection (gc) parameters by adding these lines to /etc/sysctl.conf:

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Adjust to arp table gc to clean-up more often
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

Then run /sbin/sysctl -p to reread this configuration file.

Another solution, although more cumbersome in daily administration, is to create a static ARP database, customarily kept in the file /etc/ethers. It may look like this (see man 5 ethers):

00:08:02:8E:05:F2 n001
00:08:02:89:9E:5E n002
00:08:02:89:62:E6 n003
...

This file can easily be created from the DHCP configuration file /etc/dhcpd.conf by extracting the hostname and MAC-address fields (using awk, for example; see the sketch below). In order to add this information to the permanent ARP-cache, run the command arp -f /etc/ethers.
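A minimal awk sketch, assuming one-line host entries in the format shown earlier:

# Extract "MAC hostname" pairs from dhcpd.conf into /etc/ethers
awk '/hardware ethernet/ { mac=$6; sub(";","",mac); print mac, $2 }' /etc/dhcpd.conf > /etc/ethers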

In order to do this at boot time, the Red Hat Linux file /etc/rc.local can be used. Add these lines to /etc/rc.local:

# Load the static ARP cache from /etc/ethers, if present
if test -f /etc/ethers; then
  /sbin/arp -f /etc/ethers
fi

This configuration should be performed on all nodes and servers in the cluster, as well as any other network device that can be configured in this way.

It doesn't hurt to use this configuration also on clusters with 128-512 network devices, since the dynamic ARP-cache will then have less work to do. However, you must keep /etc/ethers consistent with the nodes defined in /etc/dhcpd.conf, and you must rerun the arp command every time /etc/ethers is modified (for example, when a node's network card is replaced).

Parallel commands

It is often necessary to execute a command on all compute nodes as the superuser, for monitoring or maintenance. A serial loop over the nodes may suffice for a small cluster, but beyond a dozen or two nodes it becomes much more convenient to use a parallel command tool.

pdsh

The Parallel Distributed Shell (pdsh) is used to execute commands in parallel. Download the source RPM and rebuild the binary RPMs. We install these pdsh packages:

pdsh
pdsh-rcmd-ssh
pdsh-mod-dshgroup
pdsh-mod-machines

See man pdsh for the rather scarce documentation. Some useful commands are:

  • The command:

    pdsh -w host1,host2 <command>

    executes the command on host1,host2 nodes.

  • The command:

    pdsh -a <command>

    executes on "all" nodes. The meaning of "all" seems undocumented, but it's the hostnames in the file /etc/machines (1 hostname per line).

  • The command:

    pdsh -g <groupname>

    executes on all nodes in the group groupname. The groupname is a file with hostnames in the directory /etc/dsh/group/.

  • See the HOSTLIST EXPRESSIONS in the man-page for flexible hostname patterns.

  • Make compact output listings by piping the output of pdsh through dshbak, for example:

    pdsh -a <command> | dshbak -c

    See man dshbak for this command.

Optimizing Linux services

The standard desktop/server services provided by your Linux installation should be pruned so that services not strictly required on a compute node are disabled. This improves stability and node performance, because daemon processes won't interfere with system operation and cause operating-system "jitter".

On a CentOS5/RHEL5 compute node we recommend disabling the following standard services:

chkconfig hidd off
chkconfig avahi-daemon off
chkconfig haldaemon off
chkconfig bluetooth off
chkconfig cups off
chkconfig ip6tables off
chkconfig iptables off
chkconfig xfs off
chkconfig yum-updatesd off
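Note that chkconfig only affects subsequent boots; here is a short sketch that also stops the running services immediately (same service names as above):

for svc in hidd avahi-daemon haldaemon bluetooth cups ip6tables iptables xfs yum-updatesd; do
  chkconfig $svc off
  service $svc stop
done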

CPU speed daemon

Another standard Linux service to consider is the cpuspeed daemon. cpuspeed dynamically controls CPUFreq, slowing down the CPU to conserve power and reduce heat when the system is idle, on battery power or overheating, and speeding up the CPU when the system is busy and more processing power is needed.

While you may not want cpuspeed to slow down your compute nodes, cpuspeed serves a special function on the Intel Nehalem architecture (and successors) with Turbo_Boost features.

If you wish your compute nodes to use Turbo_Boost mode, you must turn on the cpuspeed daemon:

chkconfig cpuspeed on

(it may be a good idea to reboot the node). There are some pages from Intel which describe the configuration of Turbo_Boost mode under Linux:
