Linux Software-RAID disks
In order to save money on true hardware-RAID controllers, it is sometimes useful to have Software-RAID disks, where Linux itself takes care of the mirroring (RAID-1) or RAID-5. This is certainly less safe than hardware-RAID controllers, especially for bootable Linux partitions, which are difficult to boot from in the event of a disk failure.
Listing SCSI disk devices
In order to list the SCSI disk devices in the system, use the lsscsi command. Install relevant software by:
yum install lsscsi smp_utils sg3_utils
The disk listing may look like:
# lsscsi
[0:0:0:0]    disk    ATA      ST3250410AS      3.AH  /dev/sda
[1:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sdb
[2:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sdc
[3:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sdd
[4:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sde
The smp_utils package: Utilities for the Serial Attached SCSI (SAS) Serial Management Protocol (SMP)
The sg3_utils package: Utilities that send SCSI commands to devices
LSI RAID and HBA controllers
Our server rsnap1 has an LSI SAS 9207-8e HBA disk controller.
LSI RAID and HBA controllers may use some LSI tools:
Userworld Tool (SAS2IRCU) for RAID configuration on LSI SAS2 Controllers that is designed to run from a host.
lsiutil tool (may be obsolete), see download hints.
SAS2IRCU may be searched for on the LSI web-pages. The current version is SAS2IRCU_P17. See the SAS2IRCU User Guide. Unpack the SAS2IRCU zip-file and copy the relevant binary utility to /usr/local/bin.
To list LSI controllers:

sas2ircu list

To list disks on controller 0:
sas2ircu 0 display
Disk drive blinking LED
To turn on the blinking disk drive LED use the command:
sas2ircu <controller #> LOCATE <Encl:Bay> <Action>
where <controller #> is:  A controller number between 0 and 255.
where <Encl:Bay> is:      A valid Enclosure and Bay pair to identify the drive
where <Action> is:        ON  - turn ON the drive's LED
                          OFF - turn OFF the drive's LED
sas2ircu 0 locate 3:18 ON
Notice: ON and OFF must be in Upper Case!
The script lsi_device_list.sh uses the sas2ircu command to list devices in a readable format.
Software RAID documentation
You should read the Software-RAID HOWTO.
The Wikipedia article about the mdadm command is extremely useful.
Also, the on-line manual for the mdadm command is useful.
Creating a RAID array from command line
For a running system, use the mdadm command to create and manage RAID disks.
First partition all the disks to be used for the RAID array (we use disk /dev/sdXX in this example):

# parted /dev/sdXX
(parted) mklabel gpt                 # Makes a "GPT" label permitting large filesystems
(parted) mkpart primary xfs 0 100%   # Allocate 100% of the disk for the XFS filesystem
(parted) set 1 raid on               # Configure partition 1 for RAID
(parted) set 1 boot on               # Configure as bootable (optional)
(parted) print                       # Check the partition table
(parted) quit
If you need to wipe any preexisting partitions on the disk, this may be done by zeroing the first few blocks on the disk:
dd if=/dev/zero of=/dev/sdXX bs=512 count=10
Create a RAID 5 volume from 3 partitions of exactly or nearly exactly the same size (for example):
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdd1 /dev/sde1 /dev/sdf1
Warning: anaconda (kickstart) creates partitions in random order, see https://bugzilla.redhat.com/show_bug.cgi?id=733791. There is no guarantee that /dev/sda1 is created first - always make sure you select the correct partitions for the /dev/mdX device!
Configure RAID volume for reboot
First identify all current RAID devices by:
mdadm --examine --scan
To record all RAID devices in /etc/mdadm.conf so that they are recognized the next time you boot:

mdadm --examine --scan > /etc/mdadm.conf
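The resulting lines in /etc/mdadm.conf look roughly like this (the UUID and name shown here are made-up placeholders):

```
ARRAY /dev/md0 metadata=1.2 UUID=a1b2c3d4:e5f6a7b8:c9d0e1f2:a3b4c5d6 name=rsnap1:0
```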
Monitoring RAID disks with mdmonitor
RAID device events can be monitored by the daemon service mdmonitor, see the Monitor section of the mdadm man-page.
First you must define the notification E-mail address or program in /etc/mdadm.conf, see man 5 mdadm.conf, for example:
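A minimal sketch (MAILADDR and PROGRAM are the documented mdadm.conf keywords; the script path is a made-up placeholder):

```
MAILADDR root
# Alternatively, run a program for every RAID event:
# PROGRAM /usr/local/bin/handle-mdadm-events
```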
Then start the mdmonitor service:
chkconfig mdmonitor on
service mdmonitor start
Monitor disk errors in syslog
A disk may be partly failing, but not so badly that it’s kicked out of a RAID set. To monitor the syslog for kernel messages such as:
Feb 24 09:16:39 ghost309 kernel: ata2.00: failed command: READ FPDMA QUEUED
(and many others), insert the following crontab job:
# Report any kernel syslog messages (maybe broken ATA disks)
0 3 * * * /bin/grep kernel: /var/log/messages
A script to look only for md or ata errors from today is:
TODAY=`date +'%b %e'`
SYSLOG=/var/log/messages
/bin/grep "$TODAY.*kernel:.*md:" $SYSLOG
/bin/grep "$TODAY.*kernel:.*ata" $SYSLOG
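The two grep commands can be combined into a single pattern. A sketch, wrapped in a function (the name scan_raid_errors and the log-file argument are my own additions, so it can also be pointed at rotated logs):

```shell
# Combined scan for today's md (software RAID) and ata kernel messages.
# Assumes the classic RHEL/CentOS syslog timestamp format, e.g. "Feb 24".
scan_raid_errors() {
    local syslog=${1:-/var/log/messages}
    local today
    today=$(date +'%b %e')
    # "|| true": finding no errors is the normal, successful case
    grep -E "^$today .*kernel:.*(md:|ata)" "$syslog" || true
}
```

Call scan_raid_errors with no argument for /var/log/messages, or e.g. scan_raid_errors /var/log/messages.1 for the previous rotation.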
Weekly RAID checks
The mdadm RPM package includes a cron script for weekly checks of the RAID devices in the file:
# Run system wide raid-check once a week on Sunday at 1am by default
0 1 * * Sun root /usr/sbin/raid-check
The raid-check configuration file is
To make the checks occur sequentially (a good idea for RAID devices on the same controller) use this setting:
You can disable the raid checks by setting:
Set the check nice level:
To cancel a running test, use:
echo idle > /sys/devices/virtual/block/md1/md/sync_action
Increasing speed of RAID check
The default RAID check speed is controlled by these kernel parameter default values:
# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
1000
200000
Minimum of 1000 kB/second per disk device.
Maximum of 200,000 kB/second for the RAID set.
The kernel will report this in the syslog:
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
See also http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html.
Since 200 MB/sec is quite modest and designed to keep the system responsive, the maximum speed can be increased at the cost of system resources, for example:
echo 100000 > /proc/sys/dev/raid/speed_limit_min
echo 1000000 > /proc/sys/dev/raid/speed_limit_max
which sets the minimum to 100 MB/s for each disk and maximum to 1 GB/s for the RAID array.
This can be configured at boot time in /etc/sysctl.conf, for example:
#################NOTE ################
## You are limited by CPU and memory too #
###########################################
dev.raid.speed_limit_min = 50000
## good for 4-5 disks based array ##
dev.raid.speed_limit_max = 2000000
## good for large 6-12 disks based array ###
dev.raid.speed_limit_max = 5000000
Monitoring RAID disks with logwatch
The RHEL6/CentOS6 logwatch tool doesn’t have scripts for RAID disk monitoring with mdadm.
Later versions of logwatch (7.4?) have scripts in the
But these seem to need debugging for RHEL systems.
Performance optimization of RAID5/RAID6
The Linux kernel by default allocates kernel buffers that are much too small for efficient RAID5 or RAID6 operation. See for example:
5 Tips To Speed Up Linux Software Raid Rebuilding And Re-syncing
To increase the kernel read-ahead of a disk device:
blockdev --setra 20480 /dev/md0
To check the current value:
blockdev --report /dev/md0
To change the cache kernel buffer size of RAID device md0:
echo 8192 > /sys/block/md0/md/stripe_cache_size
To test RAID I/O performance:
cd <RAID-disk dir>
time dd bs=1M count=65536 if=/dev/zero of=test conv=fdatasync
The md man-page says:
This is only available on RAID5 and RAID6. It records the size (in pages per device) of the stripe cache which is used for synchronising all write operations to the array and all read operations if the array is degraded. The default is 256. Valid values are 17 to 32768. Increasing this number can increase performance in some situations, at some cost in system memory. Note, setting this value too high can result in an “out of memory” condition for the system.
memory_consumed = system_page_size * nr_disks * stripe_cache_size
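Plugging numbers into this formula (a hypothetical 3-disk RAID5 with stripe_cache_size raised to 8192, and the usual x86_64 page size of 4096 bytes):

```shell
# memory_consumed = system_page_size * nr_disks * stripe_cache_size
PAGE_SIZE=4096          # bytes; see `getconf PAGESIZE`
NR_DISKS=3              # hypothetical 3-disk RAID5
STRIPE_CACHE_SIZE=8192  # pages per device
MEMORY_CONSUMED=$((PAGE_SIZE * NR_DISKS * STRIPE_CACHE_SIZE))
echo "$((MEMORY_CONSUMED / 1024 / 1024)) MiB"   # prints: 96 MiB
```

So raising stripe_cache_size from the default 256 to 8192 on such an array costs about 96 MiB of kernel memory.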
Disable NCQ on SATA disks in mdadm RAID arrays
See advice in:
This loop may be put in
for i in sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar sdas sdat \
         sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz
do
    echo 1 > /sys/block/$i/device/queue_depth
done
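The same effect can be had without maintaining the long device list by hand, by globbing the sysfs tree. A sketch (the function name and the sysfs-root parameter are my own additions; the parameter exists only so the loop is easy to dry-run outside /sys):

```shell
# Disable NCQ (queue_depth=1) on every sd* disk found in sysfs.
disable_ncq() {
    local sysfs=${1:-/sys}
    local qd
    for qd in "$sysfs"/block/sd*/device/queue_depth; do
        # Skip non-matching globs and read-only entries
        [ -w "$qd" ] && echo 1 > "$qd"
    done
    return 0
}
```

In production, simply call disable_ncq with no argument to operate on the real /sys.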
How to replace a failed RAID disk
The mdadm monitoring may send mail about a failed disk. To see the status of a RAID array do:
mdadm --detail /dev/md0
...
    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       0        0        2      removed
       3       8       65        3      active sync   /dev/sde1
Make sure the failed disk is marked as faulty:
mdadm --manage /dev/md0 --fail /dev/sdd1
and removed from the array:
mdadm --manage /dev/md0 --remove /dev/sdd1
This may need to be performed for all the partitions on the failed physical disk.
Only working devices should be listed by cat /proc/mdstat now.
You now have to physically identify the failed hard disk. The first system disk may be /dev/sda, the second /dev/sdb and so on, and the system board may show you which disk is SATA0, SATA1 and so on.
For a simple system with a few disk drives mounted externally, one can identify the working drives by their activity:
cat /dev/sdX >/dev/null
Power down the system and remove the failed disk. If the failed disk was the boot device, replacing it with a clean disk will prevent booting. In this case one has to physically switch the order of the disks so that the system boots from the first disk (is there a workaround?). On hot-swap systems you can boot from a single working disk and add the new disk afterwards. Boot up the system and check the RAID status as above.
Replacing a hot-swap disk
You can blink the drive LED on an LSI controller as described above.
If your system supports hot-swap disks, swap the disk and list all devices:
If the disk does not appear as /dev/sdX after inserting it, force a rescan of the SCSI bus:
echo "- - -" >/sys/class/scsi_host/host<n>/scan # for all n
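The "for all n" can be written as a loop over all SCSI hosts; the "- - -" wildcard means all channels, all targets, all LUNs. A sketch (the function name and the sysfs-root parameter are my own additions; the parameter exists only to make the loop testable outside /sys):

```shell
# Rescan every SCSI host for newly inserted disks.
rescan_scsi_hosts() {
    local sysfs=${1:-/sys}
    local scan
    for scan in "$sysfs"/class/scsi_host/host*/scan; do
        # Skip non-matching globs and read-only entries
        [ -w "$scan" ] && echo "- - -" > "$scan"
    done
    return 0
}
```

In production, call rescan_scsi_hosts with no argument to operate on the real /sys.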
If the disk contains data, you may clear the partitions on the new disk (remember that cat /proc/mdstat lists only active disks now):
dd if=/dev/zero of=/dev/sdd bs=512 count=10
We have had cases where the disk drive did not appear on the SCSI bus, and we had to reboot the server.
Partition the new disk (for example, /dev/sdd1) for RAID as shown above, or clone the partition table of the working disk (here /dev/sdc) onto it:

sfdisk -d /dev/sdc | sfdisk --force /dev/sdd
Note: for GPT partition tables one is supposed to use gdisk (yum install gdisk), but this didn't work for me:
sgdisk -R /dev/sdd /dev/sdc   # clone - note the order of arguments!
sgdisk -G /dev/sdd            # randomize UUID of /dev/sdd
Now you can add all the new disk's partitions to all the RAID devices:
mdadm /dev/md0 -a /dev/sdd1
mdadm --detail /dev/md0
The rebuild of the newly added disk begins automatically (see man mdadm). Its progress can be monitored like this:
# mdadm --detail /dev/md0 | grep Rebuild
 Rebuild Status : 8% complete
# cat /proc/mdstat