Lenovo SD665_V3 server

This page contains information about Lenovo SD665_V3 servers deployed in our cluster. The Lenovo ThinkSystem SD665_V3 is a 2-socket ½U server that features the AMD EPYC 9004 “Genoa” family of processors.

The nodes are housed in the upgraded ThinkSystem DW612S enclosure.

NVIDIA InfiniBand Adapter (SharedIO)

The SD665_V3 has a water-cooled NVIDIA 2-Port PCIe Gen5 x16 InfiniBand Adapter (SharedIO), namely the ThinkSystem NVIDIA ConnectX-7 NDR200 InfiniBand QSFP112 adapter. The adapter is located in the right-hand SD665_V3 node and connects both servers in the tray.

There is important information regarding SharedIO for older SD650 servers in the article Considerations when using ThinkSystem SD650, SD650 V2, SD650 V3 and ConnectX-6 HDR, ConnectX-7 NDR SharedIO. The issues have apparently been resolved in the SD665_V3 system.

Please note that several InfiniBand tools, such as ibnetdiscover, fail with an error message when executed on the SD665_V3 “auxiliary” (left-hand) node. Such tools must be executed on the “primary” (right-hand) node (private communication with a Lenovo support person).

Documentation and software

Lenovo provides SD665_V3 information and downloads:

There is a Product Home page for downloads.

The EasyBuild software module OpenMPI seems to have issues with the Mellanox libraries. Setting these variables, which exclude the openib and ofi components from Open MPI's transport selection, may be a workaround:

export OMPI_MCA_btl='^openib,ofi'
export OMPI_MCA_mtl='^ofi'
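
The workaround can also be applied inside a job script and checked before the MPI program is launched. The following is a minimal sketch only: the function name set_ompi_workaround is hypothetical, and the two variable values are the ones given above.

```shell
#!/bin/bash
# Hypothetical helper: apply the Open MPI workaround variables from this page.
set_ompi_workaround() {
    # A leading "^" means "exclude": do not use the openib and ofi BTL components
    export OMPI_MCA_btl='^openib,ofi'
    # Likewise, do not use the ofi MTL component
    export OMPI_MCA_mtl='^ofi'
}

set_ompi_workaround
# Show the MCA parameters that Open MPI will pick up from the environment
env | grep '^OMPI_MCA_' | sort
```

The same parameters can be given on the command line instead, for example mpirun --mca btl '^openib,ofi' --mca mtl '^ofi' ./a.out (where ./a.out stands for your own program).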

Booting and BIOS configuration

See the Lenovo BIOS settings common to servers page.

See the Lenovo XClarity (XCC) BMC page.

There is a document Lenovo ThinkSystem SR645 Recommended UEFI and OS settings for Lenovo Scalable Infrastructure (LeSI) which recommends:

  • For best performance, set the Operating Mode to Maximum Performance first, then set it to Custom Mode.

OFED software and drivers

The OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA and kernel bypass applications, as provided by the OpenFabrics Alliance. Mellanox provides some information about the inbox drivers shipped by various OS vendors, but it does not state whether they can be used in place of the Mellanox drivers described below.

NVIDIA’s Red Hat Enterprise Linux (RHEL) Inbox Driver documentation has the statement:

Warning
ConnectX-7 is only supported as technical preview (i.e., the feature is not fully supported for production).

Since the SD665_V3 nodes have ConnectX-7 adapters, the RHEL inbox drivers are NOT SUPPORTED for production use on these nodes at present!

Install these prerequisite packages:

dnf -y install libibverbs rdma libmlx4 libibverbs-utils infiniband-diags librdmacm librdmacm-utils ibacm
dnf -y install tk gcc-gfortran kernel-modules-extra

For the Mellanox InfiniBand adapters it is recommended to download the .tgz file from the Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) page. Unpack the tarball and run the installer, for example:

tar xzf MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64.tgz
cd MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64
./mlnxofedinstall

The installer script has some options:

./mlnxofedinstall --help
./mlnxofedinstall -q          # Set quiet - no messages will be printed
yes | ./mlnxofedinstall       # Answer yes to all questions

The installer attempts to update the adapter firmware, but this may fail with the following warning:

Attempting to perform Firmware update...
The firmware for this device is not distributed inside Mellanox driver: 42:00.0 (PSID: LNV0000000049)
To obtain firmware for this device, please contact your HW vendor.
Failed to update Firmware.

It may therefore be a good idea to skip firmware updates by adding this flag:

./mlnxofedinstall --without-fw-update
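
The unpack and install steps can be combined into one small script. This is a sketch under assumptions, not a tested procedure: the function names ofed_dir and install_ofed are hypothetical, and it assumes the tarball unpacks into a directory with the same name as the file minus the .tgz suffix, as in the example above.

```shell
#!/bin/bash
# Hypothetical helper: name of the directory that the tarball unpacks into.
# Assumption: it equals the tarball file name without the .tgz suffix.
ofed_dir() {
    basename "$1" .tgz
}

# Hypothetical helper: unpack the tarball and run the installer without
# firmware updates, answering yes to all questions. Must be run as root.
install_ofed() {
    local tarball=$1
    # The subshell keeps the caller's working directory unchanged
    (
        tar xzf "$tarball" &&
        cd "$(ofed_dir "$tarball")" &&
        yes | ./mlnxofedinstall --without-fw-update
    )
}

# Usage:
#   install_ofed MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64.tgz
```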

Installation instructions are in the User Manual from the Mellanox documentation.

Verify that the Mellanox driver RPMs have been installed and the openibd service started:

rpm -qa | grep mlnx
systemctl status openibd
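
For checking many nodes, the verification can be wrapped in a function that returns a non-zero exit status on failure. The sketch below is illustrative: the function names ofed_version and check_ofed are hypothetical, and it assumes that the ofed_info -s command installed by MLNX_OFED prints a single line of the form MLNX_OFED_LINUX-24.01-0.3.3.1: (with a trailing colon).

```shell
#!/bin/bash
# Extract the version string from "ofed_info -s" style input on stdin.
# Assumed input format: "MLNX_OFED_LINUX-24.01-0.3.3.1:" (trailing colon).
ofed_version() {
    sed -n 's/^MLNX_OFED_LINUX-\(.*\):$/\1/p'
}

# Report the driver version and service state; non-zero status on any problem.
check_ofed() {
    local ver
    command -v ofed_info >/dev/null || { echo "ofed_info not found"; return 1; }
    ver=$(ofed_info -s | ofed_version)
    [ -n "$ver" ] || { echo "cannot parse ofed_info output"; return 1; }
    systemctl is-active --quiet openibd || { echo "openibd is not active"; return 1; }
    echo "MLNX_OFED $ver installed, openibd active"
}

# Usage on a node:
#   check_ofed || echo "node needs attention"
```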

If your kernel version does not match any of the offered pre-built RPMs, you can add support for your kernel version by using the mlnx_add_kernel_support.sh script located inside the MLNX_OFED package.

Notices:

  • On Red Hat and SLES distributions with an errata kernel installed there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.

  • OFED software includes kernel modules for the running kernel, and these must be rebuilt if the kernel is upgraded!
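
After a kernel upgrade, one way to spot modules built for a different kernel is to compare the module's vermagic string with the running kernel release. This is only a heuristic sketch: the function name built_for_kernel is hypothetical, and a mismatch is not necessarily an error, because modules linked through the weak-updates mechanism keep the vermagic of the kernel they were built for.

```shell
#!/bin/bash
# Return 0 if a module's vermagic string was built for the given kernel release.
# A vermagic string looks like
#   "4.18.0-513.5.1.el8_9.x86_64 SMP mod_unload modversions"
# and its first word is the kernel release the module was compiled against.
built_for_kernel() {
    local vermagic=$1 kernel=$2
    [ "${vermagic%% *}" = "$kernel" ]
}

# Usage on a node (mlx5_core is the kernel driver used by ConnectX-7):
#   built_for_kernel "$(modinfo -F vermagic mlx5_core)" "$(uname -r)" \
#       || echo "mlx5_core was built for another kernel"
```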