Supermicro servers

4029GP-TRT2 servers

We have some Supermicro 4029GP-TRT2 servers with Nvidia GPUs (installed in December 2020).

BIOS configuration

Startup menus:

  • Press ESC or DEL at startup to enter BIOS settings menus.

  • Press F11 at startup to enter the boot menu.

  • Press F12 at startup to perform network booting.

  • One-time boot settings may also be selected in the BIOS Save&Exit screen from the list below Boot override.

Note: The NIC MAC addresses must be read from the BMC web interface, or from the printed server configuration report.

Boot menu

In Boot mode select, choose UEFI to avoid Legacy booting. The default is Dual.

In FIXED BOOT ORDER priorities the Hard disk should be first and the Network:IBA second (or if desired the other way around).

NOTE: UEFI network booting will not work immediately with this setup! You must first configure Onboard LAN Option ROM Type as shown below!
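Once Linux is running, a quick check can confirm that the server really booted in UEFI mode. The following is a minimal sketch, relying on the fact that the Linux kernel exposes the directory /sys/firmware/efi only on UEFI boots. The function name boot_mode and its optional argument are illustrative and not part of any tool:

```shell
# Report whether the running Linux system was booted via UEFI or Legacy BIOS.
# An optional first argument overrides the sysfs root (handy for testing).
boot_mode() {
    if [ -d "${1:-/sys}/firmware/efi" ]; then
        echo UEFI
    else
        echo Legacy
    fi
}

boot_mode
```

If this prints Legacy although UEFI was selected, the Boot mode select setting is worth rechecking.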

Advanced menu

Boot Features

  • Quiet Boot: Enabled (to get a startup screen with information)

  • Bootup Numlock State: Off

CPU configuration

  • Hyper-Threading: Enable (disable if desired).

Chipset configuration

To enable Sub NUMA Cluster (SNC):

  • Advanced->Chipset configuration->North bridge->UPI configuration->SNC=Enable
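After enabling SNC and rebooting, Linux should report more NUMA nodes than CPU sockets (with SNC each socket is typically split into two NUMA domains). A minimal sketch for counting the nodes the kernel sees follows. The helper name numa_nodes is hypothetical, and lscpu or numactl --hardware should give the same information:

```shell
# Count the NUMA nodes known to the Linux kernel.
# An optional first argument overrides the sysfs directory (handy for testing).
numa_nodes() {
    n=0
    for d in "${1:-/sys/devices/system/node}"/node[0-9]*; do
        # The glob stays unexpanded if nothing matches, so test each entry.
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

numa_nodes
```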

PCIe/PCI/PnP Configuration

NOTE: This is where you must configure UEFI network booting for the LAN adapter:

  • Onboard LAN1 Option ROM (OROM): EFI

  • Network stack configuration:

    • IPv6 PXE support: Disabled.

IPMI

BMC Network Configuration

  • IPMI LAN Selection: Dedicated

Connect a BMC LAN cable to the dedicated BMC port.

BMC controller

The BMC network port is by default set to Shared, and this should be changed to Dedicated in the IPMI BIOS setup menu.

Read the BMC Ethernet MAC address from the BIOS interface or from the label on the chassis.
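If Linux is already running on the server, the BMC MAC address can usually also be read in-band with ipmitool. This is a sketch under assumptions: the helper name bmc_mac is made up for illustration, LAN channel 1 is assumed, and the ipmitool package and root privileges are required:

```shell
# Extract the BMC MAC address from "ipmitool lan print" output on stdin.
# The relevant line looks like "MAC Address             : aa:bb:cc:dd:ee:ff".
bmc_mac() {
    awk -F' : ' '/^MAC Address/ { print $2; exit }'
}

# Typical use on the server itself (channel 1 is an assumption):
#   ipmitool lan print 1 | bmc_mac
```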

Since 2019, Supermicro servers no longer ship with the default ADMIN/ADMIN BMC login, see https://www.supermicro.com/en/support/BMC_Unique_Password

The system unique password for the ADMIN user is located on the top cover of the cabinet in the front left corner.

Note: Servers delivered by Nextron have a modified BMC password: Nextronipmi1

BMC reboot

The menu item for rebooting the BMC is under the web GUI item Maintenance->Unit Reset.

BMC Remote Console

In the BMC web GUI go to the Remote control->Remote console window. Click on the here link in this text:

To set the Remote Console default interface, please click here.
Current interface: HTML5

and set the interface to HTML5 (the default seems to be the Java plug-in). Strangely, after changing this option HTML5 only works once the BMC has been rebooted. You can do this from the Maintenance->iKVM Reset menu or with the Linux CLI command:

ipmitool bmc reset cold
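The same reset can also be sent over the network from another host, which helps when the server's operating system is down. This is a sketch, assuming the BMC accepts IPMI-over-LAN (lanplus) connections. The function name and the DRY_RUN switch are illustrative only, and the password is taken from the IPMI_PASSWORD environment variable through ipmitool's -E option:

```shell
# Hypothetical helper: cold-reset a BMC over the network.
# Usage: bmc_reset_remote <bmc-hostname> <user>
# Export IPMI_PASSWORD beforehand so the password stays off the command line.
bmc_reset_remote() {
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty,
    # so DRY_RUN=1 prints the command instead of running it.
    ${DRY_RUN:+echo} ipmitool -I lanplus -H "$1" -U "$2" -E bmc reset cold
}

# Example (host name is a placeholder):
#   DRY_RUN=1 bmc_reset_remote bmc01 ADMIN
```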

Firmware update licenses

It is possible to upgrade BIOS and BMC/IPMI firmware from the BMC web interface. Check the Miscellaneous->Activate Licenses screen where Node Product Key status should be Activated. Otherwise you must buy an Out of Band (OOB) license, which can then be typed in here.

Firmware and BIOS update can be performed under the Maintenance pull-down menu.

BIOS and BMC firmware upgrades

BIOS and BMC firmware can be downloaded from the Supermicro product page for the server. Unzip the firmware files.

Any remote BMC console sessions will be terminated when the firmware updates start!

Log into the BMC web page and go to the Maintenance tab:

  1. The BMC firmware upgrade menu is Firmware Update. There will be a warning message:

    Do you want to enter update mode? You will not be able to perform any other tasks until firmware upgrade is complete and the device is rebooted.
    

    Browse for the firmware file (named something like BMC_X11AST2500-4101MS_20240624_01.74.15_STDsp.bin) and start the upgrade.

    NOTE: The IPMI Firmware Update PDF document states:

    NOTE !!! Uncheck preserve configuration box during flashing (very important step for FW to work properly). All settings will be reset to default.
    Uncheck "Preserve configuration" and "Preserve SDR".
    

    Uncheck this box:

    Preserve configuration
    

    Unfortunately, this means that the BMC login and password are reset to the factory default values printed on the cabinet label! You must then rerun the Kickstart script 55_ipmi (copied from the niflnet2 server), which sets our BMC password!

    Keep these checked settings:

    Preserve SDR
    Preserve SSL certificate (Unchecking this option will restore the default SSL certificate.)
    
  2. The BIOS upgrade menu is BIOS Upgrade. The BIOS firmware file name may be like BIOS_X11DPG-OT-1A06_20240716_4.4_STD.bin.

    The following check boxes are displayed (the meaning is undocumented):

    Preserve ME Region  (do not check)
    Preserve NVRAM      (do not check)
    Preserve SMBIOS     (checked by default)
    

    If you check the first 2 boxes, the server may be unable to boot. In this case you must flash the BIOS again!

    After the upgrade all BIOS settings are reset to their defaults!! There does not seem to be any way to preserve BIOS settings.

    When the update is completed, a popup window asks for confirmation: BIOS update complete. Do you wish to reset the system? Curiously, it seems that you still need to restart the server or reset the power manually!

    After the BIOS has been upgraded, connect to the system console (the BMC’s Remote HTML5 Console) and redo all the BIOS configuration settings shown above for a new server.

Nvidia RTX3090 GPUs

Drivers for Nvidia GPUs can be downloaded from https://www.nvidia.com/en-us/drivers/unix/. The Latest Production Branch Version 450.80.02 (or later) is required for the RTX3090.

Defective GPUs

If a GPU is defective, it may be missing from the hardware list. There are two places to see this:

  • The DMI command dmidecode lists all system slots. A slot that holds a GPU but shows Current Usage: Available may indicate a GPU not registering with the system, for example:

    System Slot Information
        Designation: CPU1 Slot2 PCI-E 3.0 X16
        Type: x16 PCI Express 3 x16
        Current Usage: In Use
        ...
    
    System Slot Information
        Designation: CPU1 Slot3 PCI-E 3.0 X16
        Type: x16 PCI Express 3 x16
        Current Usage: Available
        ...
    
  • The BMC web interface menu System->Hardware Information should list all GPUs and their status. Check for missing GPUs.
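The dmidecode check above can be narrowed down to just the slots reported as Available. This is a minimal sketch: the helper name available_slots is hypothetical, and dmidecode -t slot (which needs root) limits the output to the System Slot Information records. Note that slots which are intentionally empty also show up as Available, so the result must be compared with where GPUs are physically installed:

```shell
# Print the designation of every slot whose usage is "Available"
# in dmidecode output read from stdin.
available_slots() {
    awk '/Designation:/ { sub(/^[ \t]*Designation: /, ""); name = $0 }
         /Current Usage: Available/ { print name }'
}

# Typical use on the server (as root):
#   dmidecode -t slot | available_slots
```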

Nvidia drivers

Download Nvidia drivers from https://www.nvidia.com/Download/index.aspx and select the appropriate GPU version and host operating system. Installation instructions are provided on the download page:

rpm -i nvidia-diag-driver-local-repo-rhel7-375.66-1.x86_64.rpm
yum clean all
yum install cuda-drivers
reboot

You can also download and install the Nvidia UNIX drivers and the CUDA toolkit from https://developer.nvidia.com/cuda-downloads.

To verify the availability of GPU accelerators in a node run the command:

nvidia-smi -L

which is installed with the xorg-x11-drv-nvidia RPM package.
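Since a defective GPU may simply be absent from this list, it can help to compare the number of listed GPUs with the number physically installed. A small sketch follows. The helper name gpu_count is an assumption for illustration, and the expected count in the example is only a placeholder:

```shell
# Count the GPUs in "nvidia-smi -L" output read from stdin.
# Each GPU is printed on its own line starting with "GPU <index>:".
gpu_count() {
    grep -c '^GPU [0-9]'
}

# Example (the expected number is site-specific; 8 is only a placeholder):
#   [ "$(nvidia-smi -L | gpu_count)" -eq 8 ] || echo "GPU missing?"
```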

Verify the loaded kernel module version:

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.86.05  Fri Jul 14 20:46:33 UTC 2023
GCC version:  gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)
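After a driver upgrade without a reboot, the loaded kernel module can be older than the driver installed on disk. The sketch below extracts the loaded version so it can be compared with the installed one. The helper name nvrm_version is hypothetical, and the comparison with modinfo is a suggestion rather than an established procedure:

```shell
# Print the driver version found on the "NVRM version:" line of stdin.
nvrm_version() {
    grep '^NVRM version:' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Typical use on a GPU node:
#   nvrm_version < /proc/driver/nvidia/version   # version of the loaded module
#   modinfo -F version nvidia                    # version of the module on disk
```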

CUDA

The CUDA toolkit can be downloaded from https://developer.nvidia.com/cuda-downloads. There is an installation guide at http://docs.nvidia.com/cuda/cuda-installation-guide-linux

Download the repo file and install the CUDA tools:

yum install cuda-repo-rhel7-8.0.61-1.x86_64.rpm
yum clean all
yum install cuda

To install a specific CUDA version with the standalone runfile installer:

wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
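After a runfile installation the toolkit is typically placed under /usr/local/cuda-12.2 and is not on the default search paths. The sketch below follows the post-installation step described in Nvidia's installation guide. The CUDA_HOME variable name is just a convention chosen here, and the install prefix is an assumption that should be checked on the actual system:

```shell
# Assumed install prefix of the runfile installer; adjust if it differs.
CUDA_HOME=/usr/local/cuda-12.2

# Prepend the CUDA tools and libraries to the search paths,
# avoiding a stray ":" when a variable was previously empty.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Afterwards the compiler should be found:
#   nvcc --version
```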