Intel OmniPath network fabric

The Niflheim cluster has a 100 Gbit/s Intel OmniPath high-speed network fabric in its new installation. This page assumes a RHEL/CentOS 7 Linux system.

OmniPath software and documentation

General marketing material is found on the Intel OmniPath fabric homepage.

To download software and documentation:

There is an Omni-Path User Group (OPUG) for public discussions about OmniPath.

OmniPath switches

Please see the OmniPath_switches page.

Hardware installation

Read the document Intel Omni-Path Host Fabric Interface Installation Guide, especially the section Hardware Installation.

In the BIOS, the PCIe speed must be set to Auto (the exact setting name may vary with BIOS vendor). For a PCIe Gen3 x16 adapter the PCIe bus speed should be 8 GT/s, whereas Gen2 speed would only be 5 GT/s. Older versions of this manual incorrectly required a Gen2 speed setting.

Please verify your adapter's PCIe speed and link width. This can be done from the Linux OS using pdsh:

pdsh -w <node-list> 'lspci -vvv -s 04:00.0 | grep LnkSta:' | dshbak -c

The output may look like:

LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Here the PCIe device ID is 04:00.0; you can determine it with:

lspci | grep Omni-Path
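
On a node with an OmniPath HFI the output may look something like the following (the bus address, description, and revision are only illustrative and will vary with your system):

04:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)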

Software installation

OmniPath HFI adapter hardware is supported on compute nodes with the following Intel processors:

Please note that older processors are not supported. However, the OmniPath adapter may well work on older Xeon servers, even though it's not officially supported by Intel. For example, we have tested OmniPath on an old Sandy Bridge server successfully.

Also note that OmniPath software versions must be identical on all compute nodes, or differ by at most one minor version (such as 10.6 and 10.5); we have not found this requirement clearly stated in the documentation.
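
A simple way to check the installed versions across nodes is to query a representative OPA package (here opa-basic-tools, installed by the Basic package) with pdsh, for example:

pdsh -w <node-list> 'rpm -q opa-basic-tools' | dshbak -c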

The following software installation packages are available for an Intel® Omni-Path Fabric:

  • Intel® Omni-Path Fabric Host Software: This is the basic installation package that installs the Intel® Omni-Path Fabric Host Software components needed to set up compute, I/O, and service nodes with drivers, stacks, and basic tools for local configuration and monitoring.
  • Intel® Omni-Path Fabric Suite (IFS) Software: This installation package provides special features and includes the Intel® Omni-Path Fabric Host Software package, along with the Intel® Omni-Path Fabric Suite FastFabric Toolset (FastFabric) and the Intel® Omni-Path Fabric Suite Fabric Manager (Fabric Manager).
  • Intel® Omni-Path Fabric Suite Fabric Manager GUI (Fabric Manager GUI): This installation package provides a set of features for viewing and monitoring the fabric or multiple fabrics, and is installed on a computer outside of the fabric.

Operating Systems supported are listed in the Release Notes and include CentOS 7.2 and RHEL 7.2 and above. CentOS/RHEL 7.3 now includes full support for Intel® Omni-Path Architecture (OPA) kernel driver, see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.3_Release_Notes/new_features_kernel.html.

Download the latest version Intel® Omni-Path Fabric Software (Including Intel® Omni-Path HFI Driver) from the OmniPath_software page:

  • IntelOPA-Basic.RHEL72-x86_64.10.X.*.tgz for compute nodes.
  • IntelOPA-IFS.RHEL72-x86_64.10.X.*.tgz for the management node.

Read the Intel® Omni-Path Fabric Software Installation Guide from the Publications page.

Install RHEL/CentOS 7 prerequisites

For RHEL/CentOS 7 the following base prerequisite packages must be installed on login and compute nodes. There are two distinct situations:

  1. The server contains only OmniPath adapters:

    yum install libibmad libibumad libibumad-devel libibverbs librdmacm libibcm libpfm.i686 ibacm qperf perftest rdma infinipath-psm infinipath-psm-devel libhfi1 expat elfutils-libelf-devel libstdc++-devel gcc-gfortran atlas tcl expect tcsh sysfsutils pciutils bc opensm-devel opensm-libs rpm-build redhat-rpm-config kernel-devel papi.i686
  2. The server contains both Mellanox Infiniband as well as OmniPath adapters. Go to the next section Mellanox OFED installation.

Mellanox OFED installation

In case the server contains both Mellanox Infiniband as well as OmniPath adapters, the required order of installation is:

  1. Install RHEL/CentOS 7 prerequisites:

    yum install expect tcl tk

    The mlnxofedinstall script will tell you if any prerequisites are missing.
  2. Install the Mellanox_OFED software before you install any OmniPath software. Read the Mellanox OFED for Linux User Manual and perform the software installation:

    mlnxofedinstall
  3. Install OmniPath software as described below.

Prevent yum update from overwriting Intel OPA packages

As of Intel OPA software release 10.6 (late 2017), the Intel RPMs are still not installed with yum, but instead by a brain-dead operation:

rpm -i --force --nodeps <rpm list>

Since all CentOS/RHEL 7 packages are installed with yum, the yum database contains no record of the Intel OPA RPMs.

When you subsequently update the OS by:

yum update

the CentOS/RHEL 7 distribution's own OPA RPMs (version 10.3) will replace several of the Intel OPA RPMs (opa-*) previously installed. This will of course cause havoc in your OPA installation.

Until Intel has solved this problem, it is mandatory to exclude all OPA RPM updates from the distribution by appending these rules to /etc/yum.conf:

exclude=opa-* libpsm2* libfabric* hfi1*
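
For example, the rule can be appended from the command line (a simple sketch; adjust the pattern if later OPA releases add other package names):

echo 'exclude=opa-* libpsm2* libfabric* hfi1*' >> /etc/yum.conf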

Install OPA IFS software on manager node

Follow the Intel® Omni-Path Fabric Software Installation Guide chapter 4.0 Install the Intel® Omni-Path Software for installation details.

On IFS servers the following packages are also required:

yum install libibverbs-devel libibmad-devel librdmacm-devel ibacm-devel openssl-devel libuuid-devel expat-devel valgrind-devel

Unpack the IntelOPA-IFS.<DISTRO>-x86_64.10.<version>.tgz tar-ball and run the INSTALL script, for example for RHEL/CentOS 7.3:

tar xf IntelOPA-IFS.RHEL73-x86_64.10.3.0.0.81.tgz
cd IntelOPA-IFS.RHEL73-x86_64.10.3.0.0.81
./INSTALL

Select all appropriate software components to be installed. The Fabric Manager node requires the FastFabric and OPA FM components in addition to the BASIC components; see the software installation guide chapter Upgrade from IntelOPA-Basic to IntelOPA-IFS.

If the manager node should run the OPA Fabric Manager service, make sure to enable this Intel OPA Autostart item:

OPA FM (opafm)

The opafm service can also be enabled and started using systemd:

systemctl enable opafm
systemctl start opafm

Alternatively, just run the CLI version to install the basic software manually as shown below. Then install and enable opafm and fastfabric:

./INSTALL -i opafm -i fastfabric -E opafm

You must make sure this host's Static hostname is set correctly (not just localhost.localdomain):

hostnamectl
hostnamectl set-hostname <hostname>.<domainname>

The node must be rebooted after the install to activate new kernel modules and set the correct hostname.

NOTE: It is important to permit the installation to update the file /etc/security/limits.conf with memory locking limits:

* hard memlock unlimited
* soft memlock unlimited

This file is read by PAM when users log in. However, system daemons started during the boot process do not use /etc/security/limits.conf, and the correct memory limits must be set inside the daemon startup scripts. This is especially important for batch job services.

Install OPA software on all nodes using opafastfabric

NOTE: This uses Intel's installation tools, but you may alternatively use the manual installation method described below.

Follow the Intel® Omni-Path Fabric Software Installation Guide chapter 7.0 Install Host Software on the Remaining Hosts Using the FastFabric TUI. Run this on the manager node and select Host Setup:

opafastfabric

Run the following menu items in this order:

3) Host Setup
2) Set Up Password-Less SSH/SCP
1) Verify Hosts Pingable

The good nodes are listed in the file /etc/sysconfig/opa/good.

The tar-ball IntelOPA-BASIC.<DISTRO>-x86_64.10.<version>.tgz must be available on the Manager node for installation on the compute nodes. Now install the OPA software on all good nodes:

5) Install/Upgrade OPA Software
6) Configure IPoIB IP Address

At the end of the installation select to reboot the nodes:

8) Reboot Hosts

IPoIB device ib0 not present

We have seen an error when upgrading the OPA software stack from 10.2 to 10.3. The ib0 network interface is defined correctly in /etc/sysconfig/network-scripts/ifcfg-ib0, yet the ib0 network device doesn't exist and an error is printed:

/etc/sysconfig/network-scripts/ifup-ib[3239]: Device ib0 does not seem to be present, delaying initialization.

The following OPA software INSTALL script menu selections:

3) Reconfigure Driver Autostart
   3) OFA IP over IB   [Enable ]

will fix this error after a reboot.

This can also be done with the opaconfig command:

# opaconfig -E delta_ipoib
Configuring autostart for Selected installed OPA Drivers
Enabling autostart for OFA IP over IB
Done OPA Driver Autostart Configuration.

To verify ping over IPoIB connectivity, use a Manager node with the IFS software:

/usr/sbin/opahostadmin -f /etc/sysconfig/opa/allhosts ipoibping

Manual software installation on a single node

When individual compute nodes are installed from scratch, the OPA software must be installed from the command line in the Kickstart post-install scripts. The Intel OPA documentation does not describe this procedure, so we had to discover it by trial and error; a sketch of a possible %post fragment is shown after the list of steps below.

Start by reading the manual Intel® Omni-Path Fabric Software Installation Guide, chapter 4.

The installation steps are:

  1. Copy the Basic tar-ball to the system root and unpack it:

    cp (some location)/IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81.tgz /root/
    tar xzf IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81.tgz
    cd IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81
  2. You can run the INSTALL TUI script to learn about menu items. Then install the basic software:

    ./INSTALL -i opa_stack -i intel_hfi -i delta_ipoib -i oftools

    The installation log will be in /var/log/opa.log.

  3. The PSM2 library libpsm2 is not installed by any of the above components, so install it manually:

    cd ./IntelOPA-OFED_DELTA.RHEL73-x86_64.10.3.0.0.82/RPMS/redhat-ES73
    yum install libpsm2-10.X*rpm libpsm2-devel*rpm

    It seems that the libpsm2-compat RPM is not needed because it conflicts with the required infinipath-psm RPM.

  4. The IPoIB network script /etc/sysconfig/network-scripts/ifcfg-ib0 must be edited manually, see the section IPoIB Configuration below.

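As a sketch (not taken from the Intel documentation), the above steps could be wrapped in a Kickstart %post fragment along these lines; the tar-ball location /mnt/install and the version number are assumptions that must be adapted to your site:

%post --log=/root/opa-install.log
# Hypothetical %post sketch: copy and unpack the Basic tar-ball, then run the CLI install
cp /mnt/install/IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81.tgz /root/
cd /root
tar xzf IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81.tgz
cd IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81
./INSTALL -i opa_stack -i intel_hfi -i delta_ipoib -i oftools
%end
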
NOTE: The INSTALL TUI script installs RPM packages not by using yum, but directly with the rpm command, for example as seen in /var/log/opa.log:

/bin/rpm -U --force --nodeps  ./IntelOPA-OFED_DELTA.RHEL73-x86_64.10.3.0.0.82/RPMS/redhat-ES73/kmod-ifs-kernel-updates-3.10.0_514.el7.x86_64-123.x86_64.rpm

It is not good practice to install packages with --force --nodeps (forcing installation without checking for dependencies)! Unfortunately, the installed RPMs will not be logged to /var/log/yum.log, as they would be with yum.

Uninstallation of OPA software

To uninstall all OPA software use the INSTALL script option:

  • -u - uninstall all ULPs and drivers with default options

The command is:

./INSTALL -u

Installation of Fabric Manager GUI

For the GUI download the RPM package IntelOPA-FMGUI.linux-<VERSION>.noarch.rpm (or similar) and install with:

yum install IntelOPA-FMGUI.linux-10.3.0.0.60.noarch.rpm

Read the Intel Omni-Path Fabric Software Installation Guide chapter 14 Install Intel Omni-Path Fabric Suite Fabric Manager GUI. The file /etc/opa-fm/opafm.xml must be edited to enable running the GUI on localhost without SSL encryption:

<SslSecurityEnable>0</SslSecurityEnable>

Also enable the Fabric Executive (FE) component of the Fabric Manager:

<Start>1</Start> <!-- default FE startup for all instances -->

Then restart the Fabric Manager:

systemctl restart opafm

Now run the GUI (a Java applet):

fmgui

Configure fmgui:

  • When running fmgui on the Fabric Manager node itself, enter localhost as the hostname.
  • If you connect to a remote server, enter its hostname. In that case you should also enable SSL.

The remote FM GUI requires port 3245 to be open on the Fabric Manager node, so you may have to open it in the firewall (if any):

firewall-cmd --zone=public --add-port=3245/tcp --permanent
firewall-cmd --reload

The next step is:

  • Menu item Subnet, select Connect To and click the network name you defined above.

Read the Intel Omni-Path Fabric Suite Fabric Manager GUI User Guide.

OPA kernel modules

During the above installation the INSTALL script installs a RPM package with OPA kernel modules. In /var/log/opa.log this is logged as:

installing kmod-ifs-kernel-updates-3.10.0_514.el7.x86_64-123.x86_64...
  /bin/rpm -U --force --nodeps  ./IntelOPA-OFED_DELTA.RHEL73-x86_64.10.3.0.0.82/RPMS/redhat-ES73/kmod-ifs-kernel-updates-3.10.0_514.el7.x86_64-123.x86_64.rpm

The source RPM file is:

./IntelOPA-OFED_DELTA.RHEL73-x86_64.10.3.0.0.82/SRPMS/ifs-kernel-updates-3.10.0_514.el7.x86_64-123.src.rpm

The RPM contains the following files:

# rpm -ql kmod-ifs-kernel-updates
/etc/depmod.d/ifs-kernel-updates.conf
/lib/modules/3.10.0-514.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko
/lib/modules/3.10.0-514.el7.x86_64/extra/ifs-kernel-updates/rdmavt.ko

The problem with this package is that the kernel modules do not get updated when you update the Linux kernel! We are awaiting Intel's response to this problem. One good method would be to use Dynamic Kernel Module Support (DKMS).

OPA configuration files

On the management node, the OPA configuration files are stored in this directory:

/etc/sysconfig/opa/

OPA srpd services

The service srpd (SCSI RDMA Protocol over InfiniBand) is not used on compute nodes, so turn it off:

systemctl stop srpd
systemctl disable srpd

Managing the OPA fabric

Read the Intel® Omni-Path Fabric Suite Fabric Manager User Guide.

Check the fabric

On each host you can verify the OPA HFI adapter revision by:

opahfirev

(installed by the opa-basic-tools RPM package).

Check the OPA link quality on a list of nodes using pdsh:

pdsh -w <node-list>  'opainfo  | grep Link' | dshbak -c

Also, the opa-fastfabric RPM package (part of the IFS software package) contains a useful host checking script:

/usr/lib/opa/samples/hostverify.sh
/usr/share/opa/samples/hostverify.sh    # From OPA software version 10.7

You may copy this from an IFS host to other hosts and run it. To see available options run:

hostverify.sh --help
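
As a hypothetical example (assuming pdcp from the pdsh suite is available), the script could be distributed to and run on a list of nodes like this:

pdcp -w <node-list> /usr/share/opa/samples/hostverify.sh /root/
pdsh -w <node-list> 'bash /root/hostverify.sh' | dshbak -c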

Check the Fabric Manager (FM)

The OPA FM Fabric Manager was installed above on the Manager node. Manage the opafm service by:

systemctl status opafm
systemctl enable opafm
systemctl start opafm
systemctl restart opafm
systemctl stop opafm

The /usr/lib/opa-fm/bin/opafmctrl command allows the user to manage the instances of the FM that are running after the opafm service has been started.

The OPA FM configuration file is /etc/opa-fm/opafm.xml. Other OPA configuration files are in /etc/sysconfig/opa/.

Fabric Manager commands

See chapter 8 of the FM user guide. Useful commands are:

  • opafmconfigcheck: Parses and verifies the configuration file of a Fabric Manager (FM). Displays debugging and status information.
  • opafabricinfo: Provides a brief summary of the components in the fabric.
  • opatop: Fabric Performance Monitor menu to display performance, congestion, and error information about a fabric.
  • opareport: Provides powerful fabric analysis and reporting capabilities.
  • opafmcmd: Executes a command to a specific instance of the Fabric Manager (FM).

The opareport command displays information about nodes and links in the fabric, see the man-page or the FM user guide. For example, to list the Master Subnet Manager host in the fabric:

opareport -F sm

To also list the standby subnet manager hosts, it is simpler to run:

opareport | tail

To display link problems:

opareport -o errors -o slowlinks
opareport --clear     # Clears the port counters

Requirement of setting static hostname

Unfortunately, the OPA driver by default uses the hostname localhost.localdomain instead of the node name obtained from DHCP.

The hostname and SM Name fields shown by the opareport command are taken from the host's Static hostname, which by default is localhost.localdomain (see man hostnamectl and the file /etc/hostname).

This is rather inconvenient, so you must set the Static hostname to the correct hostname using one of these commands:

hostnamectl set-hostname <hostname>.<domainname>
hostnamectl set-hostname `hostname`

Then you have to reboot the system to reinitialize the OPA driver setup.

Redundant Fabric Manager hosts

You may want to run the FM on two hosts, an active Master and an inactive Slave FM. The Intel manual doesn't describe this scenario in any detail, so we have to experiment:

  • The first opafm service running on a host will be the master.
  • When several hosts/switches run opafm, an election will decide the master.
  • Any switches running a FM instance will have a lower priority and yield to a host-based master.
  • If the master's opafm is stopped, one of the inactive slaves will become the new master after some timeout.
  • One can flexibly add and remove opafm hosts, as long as there is one host or switch that will act as the master.

Intel PSM2 Sample Program

To verify the basic functionality of the OmniPath network, copy the Intel® PSM2 Sample Program code from the PDF documentation file Intel® Performance Scaled Messaging 2 (PSM2) Programmer’s Guide in Intel's End User Publications web page. We attach the file psm2-demo.c for convenience.

Make sure the PSM2 packages have been installed:

rpm -q libpsm2 libpsm2-devel

and compile the code:

gcc psm2-demo.c -o psm2-demo -lpsm2

Now run two instances (server and client) on the same or different nodes:

./psm2-demo -s  # Server
./psm2-demo     # Client

If you get an error, see the Memory limits section below.

OpenMPI configuration

Optimized performance with OPA requires the PSM2 interface, see https://www.open-mpi.org/faq/?category=building#build-p2p. Search in the Intel documentation (link at the top of this page) for the document entitled Intel® Performance Scaled Messaging 2 (PSM2) Programmer’s Guide.

Intel® Performance Scaled Messaging 2 (PSM2) is only available on RHEL/CentOS 7.2 or later; see https://github.com/01org/opa-psm2/blob/master/README which states:

Building PSM2 is possible on RHEL 7.2 as it ships with hfi1 kernel driver.

On CentOS 7 you must have these prerequisite packages, which are installed as above by the Intel OPA software:

rpm -q libpsm2 libpsm2-devel

If you get OpenMPI runtime errors like:

mca: base: components_open: component pml / cm open function failed

then you may also need to install these packages before building OpenMPI:

yum install infinipath-psm infinipath-psm-devel

see [OMPI users] Issue about cm PML and rocks 6.2 infiniband.

Build OpenMPI on RHEL/CentOS 7.2 or later with the configuration flags:

--with-psm2=/usr # Build support for the PSM 2 library (starting with the v1.10 series).
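
A minimal build sketch, assuming the OpenMPI source tree has been unpacked and using the hypothetical installation prefix /opt/openmpi-psm2:

./configure --prefix=/opt/openmpi-psm2 --with-psm2=/usr
make -j 8
make install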

Note, however, the following section in the RHEL 7.2 Release Notes:

  • PSM2 MTL disabled to avoid conflicts between PSM and PSM2 APIs:

    The new libpsm2 package provides the PSM2 API for use with Intel Omni-Path devices, which overlaps with the Performance Scaled Messaging (PSM) API installed by the infinipath-psm package for use with Truescale devices. The API overlap results in undefined behavior when a process links to libraries provided by both packages. This problem affects Open MPI if the set of its enabled MCA modules includes the psm2 Matching Transport Layer (MTL) and one or more modules that directly or indirectly depend on the libpsm_infinipath.so.1 library from the infinipath-psm package.

The older PSM library is not available on CentOS 7, so do not use this flag:

--with-psm=<dir> # Build support for the PSM library.

Intel OpenMPI

The IntelOPA-Basic.RHEL73-x86_64.10.3.0.0.81 package contains Intel's builds of OpenMPI using the GCC compiler. Install the hfi versions of RPMs to use OmniPath, for example:

cd IntelOPA-OFED_DELTA.RHEL73-x86_64.10.3.0.0.82/RPMS/redhat-ES73
yum install openmpi_gcc_hfi-1.10.4-9.x86_64.rpm mpi-selector-1.0.3-1.x86_64.rpm mpitests_openmpi_gcc_hfi-3.2-930.x86_64.rpm

To use the Intel OpenMPI see the Intel Omni-Path Fabric Performance Tuning User Guide chapter 5 MPI Performance:

  • Load the environment variables:

    source /usr/mpi/gcc/openmpi-1.10.4-hfi/bin/mpivars.sh
  • Use the options in your mpirun command to specify the use of PSM2 with OpenMPI:

    mpirun -mca pml cm -mca mtl psm2 ...

Using OpenMPI with OmniPath

First make the correct version of OpenMPI available to your applications. If you use software modules (see the EasyBuild_modules page) load the appropriate module, for example:

 # module load foss
 # module list
 Currently Loaded Modules:
 1) EasyBuild/3.0.1
 2) GCCcore/5.4.0
 3) binutils/2.26-GCCcore-5.4.0
 4) GCC/5.4.0-2.26
 5) numactl/2.0.11-GCC-5.4.0-2.26
 6) hwloc/1.11.3-GCC-5.4.0-2.26
 7) OpenMPI/1.10.3-GCC-5.4.0-2.26
 8) OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1
 9) gompi/2016b
10) FFTW/3.3.4-gompi-2016b
11) ScaLAPACK/2.0.2-gompi-2016b-OpenBLAS-0.2.18-LAPACK-3.6.1
12) foss/2016b

Now verify that the psm2 component has been built into OpenMPI:

# ompi_info | grep psm2
  MCA mtl: psm2 (MCA v2.0.0, API v2.0.0, Component v1.10.3)

MPI performance tuning

The Intel Omni-Path Fabric Performance Tuning User Guide discusses MPI performance in chapter 5, MPI Performance.

  • Use the options in your mpirun command to specify the use of PSM2 with OpenMPI:

    mpirun -mca pml cm -mca mtl psm2 ...

OpenMPI tests

The Intel RPM mpitests_openmpi_gcc_hfi contains a number of MPI testing codes in the /usr/mpi/gcc/openmpi-1.10.4-hfi/tests subdirectories, for example:

  • intel/deviation - MPI bandwidth and latency deviations from Intel MPI Benchmarks (IMB).
  • osu_benchmarks-3.1.1/osu_bibw - Bidirectional Bandwidth Test from OSU_benchmarks (see the example run below).
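
A hypothetical invocation of the OSU bidirectional bandwidth test between two nodes (node001 and node002 are example hostnames):

source /usr/mpi/gcc/openmpi-1.10.4-hfi/bin/mpivars.sh
mpirun -np 2 -host node001,node002 -mca pml cm -mca mtl psm2 \
  /usr/mpi/gcc/openmpi-1.10.4-hfi/tests/osu_benchmarks-3.1.1/osu_bibw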

Memory limits

OmniPath requires all user processes to have unlimited locked memory. For normal users starting a shell, this is configured in /etc/security/limits.conf by adding the lines:

* hard memlock unlimited
* soft memlock unlimited

This file is read by PAM when users log in. However, system daemons started during the boot process do not use /etc/security/limits.conf, and the correct memory limits must be set inside the daemon startup scripts. This is especially important for batch job services.

Users may verify the correct locked memory limits by the command:

# ulimit -l
unlimited

If the locked memory limit is too low, a rather strange error will be printed by the PSM2 library:

PSM2 can't open hfi unit: -1 (err=23)
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
  Error: Failure in initializing endpoint
hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

There will be system syslog messages as well like:

psm2-demo: (hfi/PSM)[4982]: PSM2 can't open hfi unit: -1 (err=23)
kernel: cache_from_obj: Wrong slab cache. kmalloc-64(382:step_batch) but object is from kmem_cache_node

In the libpsm2 source code the error originates from the function hfi_userinit() in the file libpsm2-10.*/opa/opa_proto.c.

Slurm configuration

MPI jobs and other tasks using the OmniPath fabric must have unlimited locked memory, see above. For slurmd running under systemd the limits are configured in /usr/lib/systemd/system/slurmd.service as:

LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

Limits defined in /etc/security/limits.conf or /etc/security/limits.d/*.conf are not effective for systemd services (see https://access.redhat.com/solutions/1257953), so any limits must be defined in the service file; see man systemd.exec.
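
If you prefer not to edit the packaged unit file, a systemd drop-in override is a possible alternative (a sketch, assuming the service is named slurmd):

mkdir -p /etc/systemd/system/slurmd.service.d
cat > /etc/systemd/system/slurmd.service.d/limits.conf <<'EOF'
[Service]
LimitMEMLOCK=infinity
LimitSTACK=infinity
EOF
systemctl daemon-reload
systemctl restart slurmd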

To ensure that job tasks running under Slurm have this configuration, verify the slurmd daemon's limits by:

# grep locked /proc/$(pgrep -u 0 slurmd)/limits
Max locked memory         unlimited            unlimited            bytes

Also, the slurm.conf file must have this configuration:

PropagateResourceLimitsExcept=MEMLOCK

as explained in https://slurm.schedmd.com/faq.html#memlock.

The memory limit error with OmniPath was discussed in Slurm bug 3363.

IPoIB configuration

The role of IPoIB is to provide an IP network emulation layer on top of InfiniBand RDMA networks, see Understanding_InfiniBand_and_RDMA_technologies.

DNS hostnames: Our convention for IPoIB interfaces is to append -opa to the hostname.
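
For example, hypothetical DNS or /etc/hosts entries for two compute nodes might look like this (the addresses only illustrate the convention and match the ifcfg example below):

10.4.128.107   node107-opa
10.4.128.108   node108-opa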

To configure IPoIB on CentOS/RHEL 7 see:

For convenience we provide a script which will help you configure OmniPath and/or Mellanox Infiniband adapters on a CentOS/RHEL 7 system:

WARNING: You cannot use the instructions below if you have also installed the Mellanox_OFED distribution, since Mellanox_OFED replaces many RHEL/CentOS system utilities. However, the ibstat command is still the best way to display adapter information.

You must also configure /etc/rdma/rdma.conf as shown in Configuring_the_Base_RDMA_Subsystem. Suggested parameters (including NFS service) are:

IPOIB_LOAD=yes
SRP_LOAD=no
SRPT_LOAD=no
ISER_LOAD=no
ISERT_LOAD=no
RDS_LOAD=no
XPRTRDMA_LOAD=yes
SVCRDMA_LOAD=yes
FIXUP_MTRR_REGS=no
ARPTABLE_TUNING=yes

IPoIB devices

The IPoIB network devices must be configured carefully by hand, since this is not done automatically, and there are no standard device names.

If you use only a single Infiniband adapter and network interface, it will probably be named ib0, and you do not necessarily have to perform any device configuration. Then go to the IPoIB network configuration below.

However, if you have:

  • Multiple Infiniband and/or OmniPath adapters,
  • Multiple ports per adapter,
  • You want to control the device names instead of the default ib0, ib1, etc.,

then you must configure the Infiniband devices carefully:

  • Install the prerequisite RPM:

    yum install infiniband-diags

    and then discover the link/infiniband hardware addresses by:

    ibstat

    You can also get the link/infiniband hardware address of all network interfaces by:

    ip link show
  • Select device names for the IPoIB devices, since there doesn't seem to be any naming standard for these (for Ethernet there is a Consistent_Network_Device_Naming standard).

    The ibstat command lists adapter names:

    mlx4_0, mlx4_1 etc. for Mellanox adapters no. 0 and 1.
    hfi1_0, hfi1_1 etc. for Intel OmniPath adapters no. 0 and 1.

    The adapter ports may either be configured for Infiniband or for Ethernet, so it may be reasonable to name the IPoIB ports as XXXib0, XXXib1 etc., since there may be several adapters. The kernel's internal device names ib0 etc. should not be reused in a manual configuration.

    Suggested interface names might be concatenating adapter and port names like:

    mlx4_0ib0
    mlx4_0ib1
    hfi1_0ib0
  • Edit the udev file /etc/udev/rules.d/70-persistent-ipoib.rules as explained in Usage_of_70-persistent-ipoib using the last 8 bytes of each link/infiniband hardware address. An example file may be:

    ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*70:10:6f:ff:ff:a0:74:71", NAME="mlx4_0ib0"
    ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*70:10:6f:ff:ff:a0:74:72", NAME="mlx4_0ib1"
    ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*00:11:75:01:01:7a:ff:df", NAME="hfi1_0ib0"

    It is perfectly possible for OmniPath adapters to coexist with Mellanox adapters in this way.

  • You can force the IPoIB interfaces to be renamed without performing a reboot by removing the ib_ipoib kernel module and then reloading it as follows:

    rmmod ib_ipoib
    modprobe ib_ipoib
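
    After reloading the module you can check that the renamed interfaces exist, for example (using the hypothetical interface names from the udev example above):

    ip -o link show | grep -E 'mlx4_0ib|hfi1_0ib'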

IPoIB network configuration

When you have set up the Infiniband devices, the next step is to configure the IPoIB interfaces:

  • See Configure_IPoIB_Using_the_command_line about creating ifcfg files in /etc/sysconfig/network-scripts/. Notice these points:

    • The DEVICE field must match the custom name created in any udev renaming rules.
    • The NAME entry need not match the device name. If the GUI connection editor is started, the NAME field is what is used to present a name for this connection to the user.
    • The TYPE field must be InfiniBand in order for InfiniBand options to be processed properly.
    • CONNECTED_MODE is either yes or no, where yes will use connected mode and no will use datagram mode for communications, see https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt. The value yes should be used for performance reasons.
  • An example ifcfg file ifcfg-OmniPath would be:

    NM_CONTROLLED=no
    CONNECTED_MODE=yes
    TYPE=InfiniBand
    BOOTPROTO=none
    IPADDR=10.4.128.107
    PREFIX=16
    DEFROUTE=no
    IPV4_FAILURE_FATAL=yes
    IPV6INIT=no
    NAME=OmniPath
    DEVICE=hfi1_0ib0
    ONBOOT=yes
    MTU=65520

With the above configurations in place you can restart the network service:

systemctl restart network

and display all network interfaces:

ifconfig -a

where the OmniPath and/or Infiniband interfaces should now be shown.

Monitoring IPoIB interfaces

Install these RPMs:

yum install libibverbs-utils infiniband-diags

Then you can list available Infiniband-like devices:

ibv_devices
ibv_devinfo

and see the device status:

ibstat

To display the OPA device ib0 IP address information on a list of nodes:

pdsh -w <node-list> '/sbin/ip -4 -o addr show label ib0' | sort

Performance tuning

Download the manual Intel® Omni-Path Performance Tuning User Guide. See Chapter 2.0 BIOS Settings for recommended settings, which include:

  • CPU power and performance policy = Performance or Balanced performance.
  • Enhanced Intel SpeedStep Technology = Enabled.
  • Intel Turbo Boost Technology = Enabled.
  • Intel VT for Directed I/O (VT-d) = Disabled.
  • CPU C-State = Enabled.
  • Processor C3 = Disabled.
  • Processor C6 = Enabled.
  • IOU Non-posted Prefetch = Disabled (where available).
  • Cluster-on-Die = Disabled.
  • Early Snoop = Disabled.
  • Home Snoop = Enabled.
  • NUMA Optimized = Enabled.
  • MaxPayloadSize = Auto or 256B.
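
Some of these settings can be cross-checked from the Linux OS. For example, the PCIe MaxPayloadSize negotiated for the HFI adapter (using the PCIe device ID found earlier) can be inspected with lspci; this is just an illustrative check, and the exact field names may vary with the lspci version:

lspci -vvv -s 04:00.0 | grep MaxPayload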
