ZFS filesystems quick configuration guide

The ZFS filesystem is an alternative to XFS. Although originally introduced in the Solaris OS, ZFS has been ported to Linux as ZFS_on_Linux. See also the OpenZFS developers page.

ZFS documentation

NOTICE: Aaron_Toponce’s documentation has apparently been unavailable since early 2024. You may find a copy of this documentation in the ZFS_web_archive.

Installation of ZFS

We assume an EL8 OS in this page. Following the RHEL-based-distro guide, enable the zfs-release repo from ZFS_on_Linux:

dnf install https://zfsonlinux.org/epel/zfs-release-2-2$(rpm --eval "%{dist}").noarch.rpm

(The rpm --eval "%{dist}" command simply prints .el8 or similar for your OS).

Use the DKMS kernel module installation method:

dnf install epel-release
dnf install kernel-devel
dnf install zfs

Then activate the ZFS kernel module:

/sbin/modprobe zfs

The alternative kABI-tracking kmod installation method may break the ZFS_on_Linux software after kernel upgrades.

Ansible management of ZFS

See the page on Ansible configuration of Linux servers and desktops. There are Ansible modules for ZFS management:

There does not seem to be any module for zpool management, however.

List disks in the system

The disks in the system must be identified. The following commands are useful for listing disk block devices:

lsblk
lsscsi --wwn --size

List HPE server’s disks

If using an HPE HBA controller, the disks in the system can be displayed using the ssacli command from the ssacli RPM package. See the HPE Proliant SmartArray page.

Example usage may be:

$ /usr/sbin/ssacli
=> controller all show status
=> ctrl slot=1 pd all show status
=> ctrl slot=1 physicaldrive 2I:1:29 show detail

Smart HBA H240 in Slot 1 (HBA Mode)

 HBA Drives

    physicaldrive 2I:1:29
       Port: 2I
       Box: 1
       Bay: 29
       Status: OK
       Drive Type: HBA Mode Drive
       Interface Type: SAS
       Size: 6 TB
       Drive exposed to OS: True
       Logical/Physical Block Size: 512/512
       Rotational Speed: 7200
       Firmware Revision: HPD7
       Serial Number: 1EK2RLEJ
       WWID: 5000CCA232AE1049
       Model: HP      MB6000FEDAU
       .....
       Disk Name: /dev/sdac

Here you can read the disk name, serial number etc., and compare disk names with lists from lsblk and lsscsi as shown above as well as zpool status.

If a replacement disk is hidden from the OS, it may be because it was previously attached to a RAID adapter (see https://serverfault.com/questions/1142870/hp-smart-array-p812-hba-mode-masked-drives). This can be changed as in this example:

$ /usr/sbin/ssacli
=> ctrl slot=1 physicaldrive 2I:1:29 modify clearconfigdata

Trying out ZFS

Aaron_Toponce’s page has some initial examples.

Create a simple zpool named tank with 4 unused drives (sde sdf sdg sdh):

zpool create tank sde sdf sdg sdh
zpool status tank
df -Ph /tank

Define the mount point for the dataset by adding this option:

-m <mountpoint>

A mirrored pool where all data are mirrored across all 4 disks (a 4-way mirror):

zpool create tank mirror sde sdf sdg sdh

A RAID 1+0 pool with 2+2 disks (data are striped across two mirrored pairs):

zpool create tank mirror sde sdf mirror sdg sdh

Destroy the testing zpool created above with zpool-destroy:

zpool destroy tank

WARNING: The zpool-destroy command will destroy your ZFS pool without any warnings!

Configuring ZFS

The sections below describe how we have configured ZFS.

List disks in the system

First identify the disk device WWN names and the corresponding /dev/sd… device names:

$ ls -l /dev/disk/by-id/wwn* | sed /part/d | awk '{print $9 " is disk " $11}' | sort -k 4
/dev/disk/by-id/wwn-0x600508b1001cf4b3e98de44628d4583c is disk ../../sda
...

or use one of the following commands:

lsblk
lsscsi --wwn --size

For ZFS usage it is recommended to use the permanent hardware-based WWN names instead of the Linux disk device names, which may change between reboots. You should make a record of the above mapping of WWN names to Linux disk device names.

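The mapping can be recorded with a small shell function. This is only a sketch: the function name wwn_map is hypothetical, and it assumes that the wwn-* entries are symbolic links to the /dev/sd… devices as in the listing above.

```shell
# Print "<wwn-name> <device>" for every whole-disk WWN link in a directory.
# The directory defaults to /dev/disk/by-id; partition links (-partN) are skipped.
wwn_map() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/wwn-*; do
        # Skip the unexpanded glob (no matches) and anything that is not a symlink.
        [ -L "$link" ] || continue
        case "$link" in
            *-part[0-9]*) continue ;;
        esac
        printf '%s %s\n' "$(basename "$link")" "$(basename "$(readlink "$link")")"
    done
}
```

Running wwn_map and redirecting its output to a file of your choice keeps a record that can be compared with zpool status later.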
Create RAIDZ disks

Read the zpool_concepts page about VDEV devices, Hot_spare etc.

To set up a RAIDZ pool <poolname> with RAIDZ-1, we use zpool-create with the “raidz1” VDEV, for example:

zpool create <poolname> raidz1 sde sdf sdg

The recommended disk naming with WWN names must include the wwn- string before the disks’ WWN names, for example:

zpool create <poolname> raidz1 wwn-0x5000c500ec6e2b9f wwn-0x5000c500f294ad3f wwn-0x5000c500f29d1a3b
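To avoid forgetting the wwn- prefix, the command line can be assembled from the bare WWN identifiers. The following is a minimal sketch; the function name raidz_create_cmd is hypothetical, and it only prints the command so that it can be reviewed before being run by hand.

```shell
# Print (do not run) a zpool create command for a single VDEV.
# Arguments: <poolname> <vdev-type> <wwn-id>...
# Each bare WWN id (e.g. 0x5000c500ec6e2b9f) gets the required "wwn-" prefix.
raidz_create_cmd() {
    pool="$1"
    level="$2"
    shift 2
    cmd="zpool create $pool $level"
    for wwn in "$@"; do
        cmd="$cmd wwn-$wwn"
    done
    printf '%s\n' "$cmd"
}
```

For example, raidz_create_cmd tank raidz1 0x5000c500ec6e2b9f 0x5000c500f294ad3f 0x5000c500f29d1a3b prints a command of the same form as the one shown above.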

To set up a RAIDZ pool with RAIDZ-2, we use the “raidz2” VDEV:

zpool create <poolname> raidz2 sde sdf sdg sdh

You can also create a pool with multiple VDEV devices, so that each VDEV doesn’t contain too many physical disks, for example:

zpool create <poolname>   raidz2 sde sdf sdg sdh   raidz2 sdi sdj sdk sdl

or add a new VDEV device with zpool-add to an existing pool:

zpool add <poolname>   raidz2 sdi sdj sdk sdl

You may even assign one or more Hot_spare disks to the pool, for example a single spare disk sdm:

zpool create <poolname>   raidz2 sde sdf sdg sdh   raidz2 sdi sdj sdk sdl   spare sdm

Check the status of the pools:

zpool status

Adding disks for an SLOG

Read about the Separate Intent Logging Device (SLOG) in the ZFS Intent Log (ZIL) page. The disks should be as fast as possible, such as NVMe or SSD.

To correlate a namespace to a disk device use one of the following commands:

lsblk
lsscsi --wwn --size

Use /dev/disk/by-id/* disk names with ZFS instead of /dev/sd* names, which may change.

Add SLOG and ZIL disks

This section shows how to configure a mirrored SLOG and an L2ARC_cache on 2 disk devices.

Assume that the 2 disks /dev/sdb and /dev/sdc will be used. First partition the disks:

parted /dev/sdb unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
parted /dev/sdc unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G

Note: It may be necessary to run parted interactively and enter the commands individually, like:

parted /dev/sdb
(parted) unit s
(parted) mklabel gpt
(parted) mkpart primary 2048 4G
(parted) mkpart primary 4G 120G
(parted) print
(parted) quit

Use /dev/disk/by-id/* disk names with ZFS instead of /dev/sd* names, which may change.

To add 2 disks, for example /dev/sdb and /dev/sdc, to the SLOG, first identify the device WWN names:

ls -l /dev/disk/by-id/* | egrep 'sdb|sdc' | grep wwn

The disks and their partitions partN may be listed as in this example:

/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee -> ../../sdb
/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee-part1 -> ../../sdb1
/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee-part2 -> ../../sdb2
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b -> ../../sdc
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b-part1 -> ../../sdc1
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b-part2 -> ../../sdc2

When the partitions have been created, add the disk partitions 1 and 2 as a ZFS mirrored log and cache, respectively:

zpool add <pool-name> log mirror /dev/disk/by-id/wwn-<name>-part1 /dev/disk/by-id/wwn-<name>-part1 cache /dev/disk/by-id/wwn-<name>-part2 /dev/disk/by-id/wwn-<name>-part2

where the WWN names found above must be used.
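Because the two disks have different WWN names, it may help to generate the command from them. This is a sketch under the assumption that both disks were partitioned as above (partition 1 for the log, partition 2 for the cache); the function name slog_cache_add_cmd is hypothetical, and the command is printed rather than executed.

```shell
# Print (do not run) the zpool add command for a mirrored log on partition 1
# and two cache devices on partition 2 of the given disks.
# Arguments: <pool-name> <wwn-name-of-disk-1> <wwn-name-of-disk-2>
slog_cache_add_cmd() {
    pool="$1"
    a="/dev/disk/by-id/$2"
    b="/dev/disk/by-id/$3"
    printf 'zpool add %s log mirror %s-part1 %s-part1 cache %s-part2 %s-part2\n' \
        "$pool" "$a" "$b" "$a" "$b"
}
```

The WWN names are passed including their wwn- prefix, exactly as they appear under /dev/disk/by-id.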

Cache devices and mirrored log devices can be removed, if necessary, by the zpool-remove command, for example:

zpool remove <pool-name> <mirror>
zpool remove <pool-name> /dev/disk/by-id/wwn-<name>-part2

where the disks are listed by the zpool-status command.

Add SLOG and ZIL on Optane NVDIMM persistent memory

Setting up NVDIMM persistent memory is described in NVDIMM Optane persistent memory setup. Install these packages:

dnf install ndctl ipmctl

Display NVDIMM devices by:

ipmctl show -dimm

This section shows how to configure a mirrored SLOG and an L2ARC_cache using NVDIMM 3D_XPoint, known as Intel Optane persistent memory DIMM modules.

Partition the NVDIMM disks:

parted /dev/pmem0 unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
parted /dev/pmem1 unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G

and then add the disk partitions 1 and 2 as ZFS mirrored log and cache, respectively:

zpool add <pool-name> log mirror /dev/pmem0p1 /dev/pmem1p1 cache /dev/pmem0p2 /dev/pmem1p2

ZFS pool capacity should be under 80%

From the Best_practices page:

  • Keep ZFS pool capacity under 80% for best performance. Due to the copy-on-write nature of ZFS, the filesystem gets heavily fragmented.

  • Email reports of capacity at least monthly.

Use these commands to view the ZFS pool capacity:

zpool list
zpool list -H -o name,capacity

This crontab job for Monday mornings might be useful:

# ZFS list capacity
0 6 * * 1 /sbin/zpool list
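Instead of reading the full list every week, the report could be limited to pools that exceed the 80% guideline. The following is a minimal sketch, assuming the tab-separated output of zpool list -H -o name,capacity (for example "tank" followed by "45%"); the function name pools_over_capacity is hypothetical.

```shell
# Read "name<TAB>capacity%" lines on stdin and print the pools whose
# capacity is at or above the threshold (default: 80 percent).
pools_over_capacity() {
    limit="${1:-80}"
    awk -v limit="$limit" '{
        cap = $2
        sub(/%$/, "", cap)          # strip the trailing percent sign
        if (cap + 0 >= limit + 0)
            print $1 " " cap "%"
    }'
}
```

A possible use is /sbin/zpool list -H -o name,capacity | pools_over_capacity 80, which prints nothing as long as all pools are below the limit.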

ZFS Compression

Compression is transparent with ZFS if you enable it, see the Compression_and_Deduplication page. This means that every file you store in your pool can be compressed. From your point of view as an application, the file does not appear to be compressed, but appears to be stored uncompressed.

To enable compression on a dataset, we just need to modify the compression property. The valid values for that property include: “on”, “off”, “lzjb”, “lz4”, “gzip”, “gzip-[1-9]”, and “zle”:

zfs set compression=lz4 <pool-name>

Monitor compression:

zfs get compressratio <pool-name>

Create ZFS filesystems

You can create multiple separate filesystems within a ZFS pool, for example:

zfs create -o mountpoint=/u/test1 zfspool1/test1

ZFS filesystems can be unmounted and mounted manually by zfs_mount commands:

zfs unmount ...
zfs mount ...

ZFS Snapshots and clones

A zfs_snapshot is similar to a Linux LVM snapshot, see Snapshots_and_clones.

You can list snapshots by two methods:

zfs list -t all
cd <mountpoint>/.zfs ; ls -l

You can access the files in a snapshot by mounting it, for example:

mount -t zfs zfstest/zfstest@finbul1-20230131080810 /mnt

The files will be visible in /mnt. Remember to unmount /mnt afterwards.

To destroy a snapshot use zfs-destroy:

zfs destroy [-Rdnprv] filesystem|volume@snap[%snap[,snap[%snap]]]

WARNING: The zfs-destroy command will destroy your ZFS volume without any warnings!

It is recommended to create a zfs_snapshot and use zfs-hold to protect it against accidental destruction by zfs-destroy, see prevent dataset/zvol from accidental destroy.

For example create a snapshot and hold it:

zfs snapshot tank@snapshot1
zfs list -t snapshot
zfs hold for_safety tank@snapshot1
zfs holds tank@snapshot1

General snapshot advice:

  • Snapshot frequently and regularly.

  • Snapshots are cheap, and can keep a plethora of file versions over time.

  • Consider using something like the zfs-auto-snapshot script.
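For regular snapshots it is convenient to include a timestamp in the snapshot name, similar to the name used in the mount example above. This is a minimal sketch; the function name snapshot_name and the label argument are hypothetical.

```shell
# Print a snapshot name of the form <dataset>@<label>-<YYYYmmddHHMMSS>.
# Arguments: <dataset> <label>
snapshot_name() {
    printf '%s@%s-%s\n' "$1" "$2" "$(date +%Y%m%d%H%M%S)"
}
```

It could be used as zfs snapshot "$(snapshot_name tank daily)" from a crontab job.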

ZFS backups

Backup of ZFS filesystems to a remote storage may be done by Sending_and_receiving_filesystems.

A ZFS snapshot can be sent to a remote system like this example:

zfs send tank/test@tuesday | ssh user@server.example.com "zfs receive pool/test"

There are several tools for performing such backups:

zfs-autobackup

See the zfs-autobackup Getting Started Wiki page.

On the remote source machine, we set the autobackup:offsite1 zfs property to true as follows:

[root@remote ~]# zfs set autobackup:offsite1=true <poolname>
[root@remote ~]# zfs get -t filesystem,volume autobackup:offsite1

Running a pull backup from the remote host:

zfs-autobackup -v --ssh-source <remote> offsite1 <poolname>

Since the path to zfs-autobackup is /usr/local/bin and ZFS commands are in /usr/sbin, you must add these paths when running crontab jobs, for example:

0 4 * * * PATH=$PATH:/usr/sbin:/usr/local/bin; zfs-autobackup args...

It is convenient to list all snapshots created by zfs-autobackup:

zfs list -t all

You can mount a snapshot as shown above.

There is a zfs-autobackup troubleshooting page. We have seen the error:

cannot receive incremental stream: destination has been modified since most recent snapshot

which was resolved by zfs_rollback:

zfs rollback <problem-snapshot-name>

Useful ZFS commands

List ZFS filesystems and their properties:

zfs list
zpool list
zpool status <pool-name>
zpool get all <pool-name>
mount -l -t zfs

See the sub-command manual pages for details (for example man zpool-list).

Display logical I/O statistics for ZFS storage pools with zpool-iostat:

zpool iostat -v

Get and set a mountpoint:

zfs get mountpoint <pool-name>
zfs set mountpoint=/u/zfs <pool-name>

E-mail notifications

Using the ZFS Event Daemon (see ZED or man zed), ZFS can send E-mail messages when zpool-events occur. Check the status of ZED by:

systemctl status zed

The ZED configuration file /etc/zfs/zed.d/zed.rc defines variables such as the E-mail address of the zpool administrator for receipt of notifications; multiple addresses can be specified if they are delimited by whitespace:

ZED_EMAIL_ADDR="root"

You should change root into a system administrator E-mail address, otherwise the address root@localhost.localdomain will be used. You may need to run systemctl restart zed after changing the zed.rc file (this has not been verified).
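To verify which address is currently configured, the variable can be read back from the file. This is a sketch only: the function name zed_email_addr is hypothetical, and it assumes the setting is written on a single line in double quotes as shown above (commented-out lines are ignored).

```shell
# Print the value of ZED_EMAIL_ADDR from a zed.rc file.
# The file defaults to /etc/zfs/zed.d/zed.rc.
zed_email_addr() {
    rc="${1:-/etc/zfs/zed.d/zed.rc}"
    sed -n 's/^ZED_EMAIL_ADDR="\(.*\)"$/\1/p' "$rc"
}
```

If nothing is printed, the variable is either commented out or written in a different form.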

Scrub and Resilver disks

With ZFS on Linux, detecting and correcting silent data errors is done through scrubbing the disks, see the Scrub_and_Resilver page.

Scrubbing can be run regularly with crontab, for example monthly:

0 2 1 * * /sbin/zpool scrub <pool-name>

or alternatively, on machines using Systemd, scrub timers can be enabled on a per-pool basis. See the systemd.timer(5) manual page. Weekly and monthly timer units are provided:

systemctl enable zfs-scrub-weekly@<pool-name>.timer --now
systemctl enable zfs-scrub-monthly@<pool-name>.timer --now

Replacing defective disks

Detecting broken disks is explained in the Scrub_and_Resilver page. Use zpool-status to check whether any disks have failed:

zpool status
zpool status -x       # Only pools with errors
zpool status -e       # Only VDEVs with errors
zpool status -L       # Display real paths for vdevs resolving all symbolic links
zpool status -P       # Display full paths for vdevs
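For scripted checks, the health property of each pool is easier to parse than the full status output. The following is a minimal sketch, assuming the tab-separated output of zpool list -H -o name,health; the function name unhealthy_pools is hypothetical.

```shell
# Read "name<TAB>health" lines on stdin and print every pool whose
# health is anything other than ONLINE (for example DEGRADED or FAULTED).
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}
```

A possible use is zpool list -H -o name,health | unhealthy_pools, followed by zpool status on any pool that is reported.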

The RHEL page How to rescan the SCSI bus to add or remove a SCSI device without rebooting the computer has useful information about Adding a Storage Device or a Path. You may scan the system for disk changes using /usr/bin/rescan-scsi-bus.sh from the sg3_utils package. Unfortunately, it may sometimes be necessary to reboot the server so that the OS will discover the replaced /dev/sd??? disk device.

Use the zpool-replace command to replace a failed disk, for example disk sde:

zpool replace <pool-name> sde(old) sde(new)
zpool replace -f <pool-name> sde(old) sde(new)

The -f flag may be required in case of errors such as invalid vdev specification.

Hot spare disks will not be added to the VDEV to replace a failed drive by default. You MUST enable this feature. Set the autoreplace feature to on, for example:

zpool set autoreplace=on <pool-name>

Replacing disks can come with big problems, see How to force ZFS to replace a failed drive in place.

ZFS troubleshooting

There is a useful Troubleshooting page which includes a discussion of ZFS_events. Some useful commands are:

zpool events -v
zpool history

If a normal user, or the daily logwatch scripts, tries to execute zpool status, an error message may appear:

Permission denied the ZFS utilities must be run as root

This seems to be a Systemd issue, see permissions issues with openzfs #28653. There seems to be a fix in Udev vs tmpfiles take 2 #28732, however, this has not been tested on EL8 yet.

Disk quotas for ZFS

Read the zfs-userspace manual page to display space and quotas of a ZFS dataset. We assume a ZFS filesystem <pool-name> and a specific user’s name <username> in the examples below.

Define a user’s disk quota userquota and number-of-files quota userobjquota:

zfs set userquota@<username>=1TB userobjquota@<username>=1M <pool-name>

Using a quota value of none will remove the quota.

We have written some Tools_for_managing_ZFS_disk_quotas providing, for example, commands similar to the standard Linux commands repquota and quota.

The superuser can view the user disk usage and quotas, see the zfs-userspace manual page:

zfs userspace filesystem|snapshot|path|mountpoint
zfs userspace -p filesystem|snapshot|path|mountpoint
zfs userspace -H -p -o name,quota,used,objquota,objused filesystem|snapshot|path|mountpoint

The -p flag prints parseable numbers and the -H flag omits the heading. The -o flag displays only specific columns, which could be used to calculate quota warnings.
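A quota warning could be calculated along the following lines. This is a sketch, not a tested production tool: the function name quota_warnings is hypothetical, and it assumes tab-separated input whose first three columns are name, quota and used in bytes, as requested with -H -p -o name,quota,used. Rows whose quota column is not a positive number are treated as having no quota.

```shell
# Read "name<TAB>quota<TAB>used[...]" lines on stdin and print the users
# whose usage is at or above the given percentage of their quota (default: 90).
quota_warnings() {
    pct="${1:-90}"
    awk -F '\t' -v pct="$pct" '
        $2 ~ /^[0-9]+$/ && $2 > 0 && $3 * 100 >= $2 * pct {
            printf "%s uses %d%% of quota\n", $1, $3 * 100 / $2
        }'
}
```

For example: zfs userspace -H -p -o name,quota,used <pool-name> | quota_warnings 90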

Normal users are not allowed to read quotas with the above commands. The following command allows a normal user to print disk usage and quotas:

/usr/sbin/zfs get userquota@$USER,userused@$USER,userobjquota@$USER,userobjused@$USER <pool-name>

Default quotas

Unfortunately, OpenZFS has no default user quota option. This is only available in the Oracle_Solaris_ZFS implementation, see the defaultuserquota page:

zfs set defaultuserquota=30gb <pool-name>

So with Linux OpenZFS you must set disk quotas individually for each user as shown above.
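Setting the same quota for many users can be scripted. The following is a minimal sketch that prints one zfs set command per user name read from standard input, so that the commands can be reviewed before being run; the function name quota_set_cmds is hypothetical.

```shell
# Read user names (one per line) on stdin and print a zfs set command
# applying the same userquota to each of them.
# Arguments: <pool-name> <quota>
quota_set_cmds() {
    pool="$1"
    quota="$2"
    while IFS= read -r user; do
        [ -n "$user" ] || continue      # skip empty lines
        printf 'zfs set userquota@%s=%s %s\n' "$user" "$quota" "$pool"
    done
}
```

The list of user names could come from a file of your own or from a command such as getent passwd filtered to the relevant accounts.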

NFS sharing ZFS file systems

The zfsprops manual page explains about the NFS sharenfs option:

  • A file system with a sharenfs property of off is managed with the exportfs command and entries in the /etc/exports file. Otherwise, the file system is automatically shared and unshared with the zfs share and zfs unshare commands.

Alternatively to the exports file, use the zfs set/get sharenfs command to set or list the sharenfs property like in this example:

zfs set sharenfs='rw=192.168.122.203' pool1/fs1
zfs get sharenfs pool1/fs1

ZFS will update its /etc/zfs/exports file automatically. Never edit this file directly!

There are some discussions on NFS with ZFS:

NFS tuning

Make sure that a sufficient number of nfsd threads are started by setting this parameter in the [nfsd] section of the /etc/nfs.conf file:

threads=32

This number might be around the number of CPU cores in the server. A systemctl restart nfs-server is required to update the parameters.

For optimizing the transfer of large files, increase the NFS read and write size in the NFS mount command on NFS clients, see man 5 nfs:

rsize=32768,wsize=32768

Larger values (powers of 2, such as 131072) may also be tried.

See also Optimizing Your NFS Filesystem.

ZFS quotas over NFS

The quota tools for Linux have absolutely no knowledge about ZFS quotas, nor does rquotad, and hence clients mounting via NFS are also unable to obtain this information. See a hack at https://aaronsplace.co.uk/blog/2019-02-12-zfsonline-nfs-quota.html