ZFS filesystems quick configuration guide
The ZFS filesystem is an alternative to XFS. Although originally introduced in the Solaris OS, ZFS has been ported to Linux as ZFS_on_Linux. See also the OpenZFS developers page.
ZFS documentation
NOTICE: Aaron_Toponce's documentation has apparently been unavailable since early 2024. You may find a ZFS_web_archive copy of this documentation.
Getting_Started with ZFS including an RHEL-based-distro guide.
First-time OpenZFS users are encouraged to check out Aaron_Toponce's excellent documentation.
Best_practices and caveats.
OpenZFS_Newcomers documentation and FAQ.
Lustre uses ZFS, see https://wiki.lustre.org/ZFS.
A JBOD Setup page.
zpool_concepts overview of ZFS storage pools.
ZFS 101—Understanding ZFS storage and performance and ZFS fans, rejoice—RAIDz expansion will be a thing very soon.
ZFS_checksums are a key feature of ZFS and an important differentiator for ZFS over other RAID implementations and filesystems.
Installation of ZFS
We assume an EL8 OS in this page. Following the RHEL-based-distro guide, enable the zfs-release repo from ZFS_on_Linux:
dnf install https://zfsonlinux.org/epel/zfs-release-2-2$(rpm --eval "%{dist}").noarch.rpm
(The rpm --eval "%{dist}" command simply prints .el8 or similar for your OS).
Use the DKMS kernel module installation method:
dnf install epel-release
dnf install kernel-devel
dnf install zfs
Then activate the ZFS kernel module:
/sbin/modprobe zfs
The alternative kABI-tracking kmod installation method may break the ZFS_on_Linux software after kernel upgrades.
Ansible management of ZFS
See the page on Ansible configuration of Linux servers and desktops. There are Ansible modules for ZFS management:
https://docs.ansible.com/ansible/2.9/modules/zfs_module.html
Additional ZFS modules are found at https://docs.ansible.com/ansible/latest/collections/community/general/
There does not seem to be any module for zpool management, however.
List disks in the system
The disks in the system must be identified. The following commands are useful for listing disk block devices:
lsblk
lsscsi --wwn --size
List HPE server’s disks
If using an HPE HBA controller, the disks in the system can be displayed using the ssacli command from the ssacli RPM package.
See the HPE Proliant SmartArray page.
Example usage may be:
$ /usr/sbin/ssacli
=> controller all show status
=> ctrl slot=1 pd all show status
=> ctrl slot=1 physicaldrive 2I:1:29 show detail
Smart HBA H240 in Slot 1 (HBA Mode)
HBA Drives
physicaldrive 2I:1:29
Port: 2I
Box: 1
Bay: 29
Status: OK
Drive Type: HBA Mode Drive
Interface Type: SAS
Size: 6 TB
Drive exposed to OS: True
Logical/Physical Block Size: 512/512
Rotational Speed: 7200
Firmware Revision: HPD7
Serial Number: 1EK2RLEJ
WWID: 5000CCA232AE1049
Model: HP MB6000FEDAU
.....
Disk Name: /dev/sdac
Here you can read the disk name, serial number etc., and compare disk names with the lists from lsblk and lsscsi shown above, as well as zpool status.
If a replacement disk is hidden from the OS, it may be because it was previously attached to a RAID adapter, see https://serverfault.com/questions/1142870/hp-smart-array-p812-hba-mode-masked-drives. This can be modified as in this example:
$ /usr/sbin/ssacli
=> ctrl slot=1 physicaldrive 2I:1:29 modify clearconfigdata
Trying out ZFS
Aaron_Toponce's page has some initial examples.
Create a simple zpool named tank with 4 unused drives (sde sdf sdg sdh):
zpool create tank sde sdf sdg sdh
zpool status tank
df -Ph /tank
Define the mount point for the dataset by adding this option:
-m <mountpoint>
A mirrored pool where all data are stored on all 4 disks (a 4-way mirror):
zpool create tank mirror sde sdf sdg sdh
A RAID 1+0 pool (a stripe across two mirrors) with 2+2 disks:
zpool create tank mirror sde sdf mirror sdg sdh
Destroy the testing zpool created above with zpool-destroy:
zpool destroy tank
WARNING: The zpool-destroy command will destroy your ZFS pool without any warning!
Configuring ZFS
The sections below describe how we have configured ZFS.
List disks in the system
First identify the disk device WWN names and the corresponding /dev/sd… device names:
$ ls -l /dev/disk/by-id/wwn* | sed /part/d | awk '{print $9 " is disk " $11}' | sort -k 4
/dev/disk/by-id/wwn-0x600508b1001cf4b3e98de44628d4583c is disk ../../sda
...
or use one of the following commands:
lsblk
lsscsi --wwn --size
For ZFS usage it is recommended to use the permanent hardware-based WWN names instead of the Linux disk device names, which may change. You should make a record of the above mapping of WWN names to Linux disk device names.
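The record of WWN names and device names can also be produced by a small shell function. This is only a sketch: the function name list_wwn_map is made up for this example, and it assumes that readlink -f is available (as in GNU coreutils on EL8).

```shell
# Print "WWN-name device" pairs for whole disks (partition links are skipped).
# list_wwn_map is a hypothetical helper name, not part of ZFS.
list_wwn_map() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/wwn-*; do
        # Skip anything that is not a symbolic link (also covers an unmatched glob)
        [ -L "$link" ] || continue
        name="${link##*/}"
        # Skip partition links such as wwn-...-part1
        case "$name" in
            *-part*) continue ;;
        esac
        printf '%s %s\n' "$name" "$(readlink -f "$link")"
    done
}
```

Running list_wwn_map > wwn-map.txt would save the mapping to a file (the file name is just an example).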
Create RAIDZ disks
Read the zpool_concepts page about VDEV devices, Hot_spare etc.
To set up a RAIDZ pool <poolname> with RAIDZ-1, we use zpool-create with the “raidz1” VDEV, for example:
zpool create <poolname> raidz1 sde sdf sdg
The recommended disk naming with WWN names must include the wwn- string before the disks’ WWN names, for example:
zpool create <poolname> raidz1 wwn-0x5000c500ec6e2b9f wwn-0x5000c500f294ad3f wwn-0x5000c500f29d1a3b
To set up a RAIDZ pool with RAIDZ-2, we use the “raidz2” VDEV:
zpool create <poolname> raidz2 sde sdf sdg sdh
You can also create a pool with multiple VDEV devices, so that each VDEV doesn’t contain too many physical disks, for example:
zpool create <poolname> raidz2 sde sdf sdg sdh raidz2 sdi sdj sdk sdl
or add a new VDEV device with zpool-add to an existing pool:
zpool add <poolname> raidz2 sdi sdj sdk sdl
You may even designate one or more Hot_spare disks to the pool, for example a single spare disk sdm:
zpool create <poolname> raidz2 sde sdf sdg sdh raidz2 sdi sdj sdk sdl spare sdm
Check the status of the pools:
zpool status
Adding disks for an SLOG
Read about the Separate Intent Logging Device (SLOG) in the ZFS Intent Log (ZIL) page. The disks should be as fast as possible, such as NVMe or SSD.
To correlate a namespace to a disk device use one of the following commands:
lsblk
lsscsi --wwn --size
Use /dev/disk/by-id/* disk names with ZFS instead of /dev/sd* names, which may be renamed.
Add SLOG and ZIL disks
This section shows how to configure a mirrored SLOG and an L2ARC_cache on 2 disk devices.
Assume that the 2 disks /dev/sdb and /dev/sdc will be used.
First partition the disks:
parted /dev/sdb unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
parted /dev/sdc unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
Note: It may be necessary to use the interactive parted command line and enter the commands individually, like:
parted /dev/sdb
(parted) unit s
(parted) mklabel gpt
(parted) mkpart primary 2048 4G
(parted) mkpart primary 4G 120G
(parted) print
(parted) quit
Use /dev/disk/by-id/* disk names with ZFS instead of /dev/sd* names, which may be renamed.
To add 2 disks, for example /dev/sdb and /dev/sdc, to the SLOG, first identify the device WWN names:
ls -l /dev/disk/by-id/* | grep -E 'sdb|sdc' | grep wwn
The disks and their partitions partN may be listed as in this example:
/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee -> ../../sdb
/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee-part1 -> ../../sdb1
/dev/disk/by-id/wwn-0x600508b1001c5db0139e52b3964d02ee-part2 -> ../../sdb2
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b -> ../../sdc
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b-part1 -> ../../sdc1
/dev/disk/by-id/wwn-0x600508b1001c45bf78142b67cda9c82b-part2 -> ../../sdc2
When the partitions have been created, add the disk partitions 1 and 2 as a ZFS mirrored log and cache, respectively:
zpool add <pool-name> log mirror /dev/disk/by-id/wwn-<name>-part1 /dev/disk/by-id/wwn-<name>-part1 cache /dev/disk/by-id/wwn-<name>-part2 /dev/disk/by-id/wwn-<name>-part2
where the WWN names found above must be used.
Cache devices and log mirrors can be removed, if necessary, with the zpool-remove command, for example:
zpool remove <pool-name> <mirror>
zpool remove <pool-name> /dev/disk/by-id/wwn-<name>-part2
where the disks are listed by the zpool-status command.
Add SLOG and ZIL on Optane NVDIMM persistent memory
Setting up NVDIMM persistent memory is described in NVDIMM Optane persistent memory setup. Install these packages:
dnf install ndctl ipmctl
Display NVDIMM devices by:
ipmctl show -dimm
This section shows how to configure a SLOG and an L2ARC_cache using NVDIMM 3D_XPoint, known as Intel Optane persistent memory DIMM modules.
Partition the NVDIMM disks:
parted /dev/pmem0 unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
parted /dev/pmem1 unit s mklabel gpt mkpart primary 2048 4G mkpart primary 4G 120G
and then add the disk partitions 1 and 2 as ZFS log and cache, respectively:
zpool add <pool-name> log mirror /dev/pmem0p1 /dev/pmem1p1 cache /dev/pmem0p2 /dev/pmem1p2
ZFS pool capacity should be under 80%
From the Best_practices page:
Keep ZFS pool capacity under 80% for best performance. Due to the copy-on-write nature of ZFS, the filesystem gets heavily fragmented.
Email reports of capacity at least monthly.
Use this command to view the ZFS pool capacity:
zpool list
zpool list -H -o name,capacity
This crontab job for Monday mornings might be useful:
# ZFS list capacity
0 6 * * 1 /sbin/zpool list
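Instead of mailing the full list, a cron job could report only the pools at or above the 80% limit. A minimal sketch, assuming the tab-separated name,capacity output of the zpool list -H command above, with the capacity printed in a form like 85%; check_capacity is a hypothetical helper name:

```shell
# Read "name<TAB>capacity%" lines on stdin and warn about pools that are
# at or above the limit (default 80 percent).
# check_capacity is a hypothetical helper name, not part of ZFS.
check_capacity() {
    limit="${1:-80}"
    while read -r name cap; do
        pct="${cap%\%}"            # strip the trailing % sign
        # Skip values that are not plain integers (for example "-")
        case "$pct" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$pct" -ge "$limit" ]; then
            printf 'WARNING: pool %s is %s%% full\n' "$name" "$pct"
        fi
    done
}
# Example: /sbin/zpool list -H -o name,capacity | check_capacity 80
```

Since cron only sends mail when a job prints output, such a job would stay silent while all pools are below the limit.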
ZFS Compression
Compression is transparent with ZFS if you enable it, see the Compression_and_Deduplication page. This means that every file you store in your pool can be compressed. From your point of view as an application, the file does not appear to be compressed, but appears to be stored uncompressed.
To enable compression on a dataset, we just need to modify the compression property.
The valid values for that property include: “on”, “off”, “lzjb”, “lz4”, “gzip”, “gzip-[1-9]”, and “zle”:
zfs set compression=lz4 <pool-name>
Monitor compression:
zfs get compressratio <pool-name>
Create ZFS filesystems
You can create multiple separate filesystems within a ZFS pool, for example:
zfs create -o mountpoint=/u/test1 zfspool1/test1
ZFS filesystems can be unmounted and mounted manually by zfs_mount commands:
zfs unmount ...
zfs mount ...
ZFS Snapshots and clones
A zfs_snapshot is similar to a Linux LVM snapshot, see Snapshots_and_clones.
You can list snapshots by two methods:
zfs list -t all
cd <mountpoint>/.zfs ; ls -l
You can access the files in a snapshot by mounting it, for example:
mount -t zfs zfstest/zfstest@finbul1-20230131080810 /mnt
The files will be visible in /mnt.
Remember to unmount /mnt afterwards.
To destroy a snapshot use zfs-destroy:
zfs destroy [-Rdnprv] filesystem|volume@snap[%snap[,snap[%snap]]]
WARNING: The zfs-destroy command will destroy your ZFS volume without any warning!
It is recommended to create a zfs_snapshot and use zfs-hold to prevent zfs-destroy from destroying it accidentally, see prevent dataset/zvol from accidental destroy.
For example create a snapshot and hold it:
zfs snapshot tank@snapshot1
zfs list -t snapshot
zfs hold for_safety tank@snapshot1
zfs holds tank@snapshot1
General snapshot advice:
Snapshot frequently and regularly.
Snapshots are cheap, and can keep a plethora of file versions over time.
Consider using something like the zfs-auto-snapshot script.
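Regular snapshots need unique names. A minimal sketch that appends a timestamp to the snapshot name, similar in form to the finbul1-20230131080810 name shown above; the function name snapshot_name and the prefix daily are only illustrative:

```shell
# Print a snapshot name of the form <dataset>@<prefix>-YYYYMMDDhhmmss.
# snapshot_name is a hypothetical helper name, not part of ZFS.
snapshot_name() {
    dataset="$1"
    prefix="${2:-manual}"
    printf '%s@%s-%s\n' "$dataset" "$prefix" "$(date +%Y%m%d%H%M%S)"
}
# Example (as root): zfs snapshot "$(snapshot_name tank daily)"
```

Timestamped names sort chronologically, which makes old snapshots easy to identify in zfs list -t snapshot.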
ZFS backups
Backup of ZFS filesystems to a remote storage may be done by Sending_and_receiving_filesystems.
A ZFS snapshot can be sent to a remote system like this example:
zfs send tank/test@tuesday | ssh user@server.example.com "zfs receive pool/test"
There are several tools for performing such backups:
zfs-autobackup creates ZFS snapshots on a source machine and then replicates those snapshots to a target machine via SSH.
https://serverfault.com/questions/842531/how-to-perform-incremental-continuous-backups-of-zfs-pool
zfs-autobackup
See the zfs-autobackup Getting Started Wiki page.
On the remote source machine, we set the autobackup:offsite1 zfs property to true as follows:
[root@remote ~]# zfs set autobackup:offsite1=true <poolname>
[root@remote ~]# zfs get -t filesystem,volume autobackup:offsite1
Running a pull backup from the remote host:
zfs-autobackup -v --ssh-source <remote> offsite1 <poolname>
Since zfs-autobackup is installed in /usr/local/bin and the ZFS commands are in /usr/sbin, you must add these paths when running crontab jobs, for example:
0 4 * * * PATH=$PATH:/usr/sbin:/usr/local/bin; zfs-autobackup args...
It is convenient to list all snapshots created by zfs-autobackup:
zfs list -t all
You can mount a snapshot as shown above.
There is a zfs-autobackup troubleshooting page. We have seen the error:
cannot receive incremental stream: destination has been modified since most recent snapshot
which was resolved by zfs_rollback:
zfs rollback <problem-snapshot-name>
Useful ZFS commands
List ZFS filesystems and their properties:
zfs list
zpool list
zpool status <pool-name>
zpool get all <pool-name>
mount -l -t zfs
See the sub-command manual pages for details (for example man zpool-list).
Display logical I/O statistics for ZFS storage pools with zpool-iostat:
zpool iostat -v
Get and set a mountpoint:
zfs get mountpoint <pool-name>
zfs set mountpoint=/u/zfs <pool-name>
E-mail notifications
Using the ZFS Event Daemon (see ZED or man zed), ZFS can send E-mail messages when zpool-events occur.
Check the status of ZED by:
systemctl status zed
The ZED configuration file /etc/zfs/zed.d/zed.rc defines variables such as the E-mail address of the zpool administrator for receipt of notifications; multiple addresses can be specified if they are delimited by whitespace:
ZED_EMAIL_ADDR="root"
You should change root into a system administrator E-mail address, otherwise the address root@localhost.localdomain will be used.
Perhaps you need to do systemctl restart zed after changing the zed.rc file(?).
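To check which address is currently configured, the zed.rc file can be read from a script. A sketch, assuming that zed.rc consists of plain shell variable assignments like the ZED_EMAIL_ADDR line above; zed_email_addr is a hypothetical helper name:

```shell
# Print the value of ZED_EMAIL_ADDR from a zed.rc file.
# zed_email_addr is a hypothetical helper name, not part of ZFS.
zed_email_addr() {
    rc="${1:-/etc/zfs/zed.d/zed.rc}"
    # Source the file in a subshell so that its variables do not leak
    # into the calling shell. Pass a path containing a slash, since
    # "." searches PATH for bare file names.
    (
        ZED_EMAIL_ADDR=""
        . "$rc"
        printf '%s\n' "$ZED_EMAIL_ADDR"
    )
}
```

An empty result means that the variable is not set in the file, and a result of root means that it has not yet been changed to an administrator address.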
Scrub and Resilver disks
With ZFS on Linux, detecting and correcting silent data errors is done through scrubbing the disks, see the Scrub_and_Resilver page.
Scrubbing can be run regularly with crontab, for example monthly:
0 2 1 * * /sbin/zpool scrub <pool-name>
or alternatively, on machines using Systemd, scrub timers can be enabled on a per-pool basis.
See the systemd.timer(5) manual page.
Weekly and monthly timer units are provided:
systemctl enable zfs-scrub-weekly@<pool-name>.timer --now
systemctl enable zfs-scrub-monthly@<pool-name>.timer --now
Replacing defective disks
Detecting broken disks is explained in the Scrub_and_Resilver page. Use zpool-status to see whether any disks have failed:
zpool status
zpool status -x # Only pools with errors
zpool status -e # Only VDEVs with errors
zpool status -L # Display real paths for vdevs resolving all symbolic links
zpool status -P # Display full paths for vdevs
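For scripted checks, the health property of each pool can be listed in a parseable form. A minimal sketch, assuming the tab-separated output of zpool list -H -o name,health; unhealthy_pools is a hypothetical helper name:

```shell
# Read "name<TAB>health" lines on stdin and print the pools whose
# health is anything other than ONLINE.
# unhealthy_pools is a hypothetical helper name, not part of ZFS.
unhealthy_pools() {
    awk '$2 != "ONLINE" { print $1 ": " $2 }'
}
# Example: zpool list -H -o name,health | unhealthy_pools
```

The function prints nothing when all pools are ONLINE, so it can be used in a cron job that only sends mail when there is a problem.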
The RHEL page How to rescan the SCSI bus to add or remove a SCSI device without rebooting the computer has useful information about Adding a Storage Device or a Path.
You may scan the system for disk changes using /usr/bin/rescan-scsi-bus.sh from the sg3_utils package.
Unfortunately, it may sometimes be necessary to reboot the server so that the OS will discover the replaced /dev/sd??? disk device.
Use the zpool-replace command to replace a failed disk <old-disk> (for example sde) by a new disk <new-disk>:
zpool replace <pool-name> <old-disk> <new-disk>
zpool replace -f <pool-name> <old-disk> <new-disk>
The -f flag may be required in case of errors such as invalid vdev specification.
Hot spare disks will not be added to the VDEV to replace a failed drive by default.
You MUST enable this feature.
Set the autoreplace property to on, for example:
zpool set autoreplace=on <pool-name>
Replacing disks can come with big problems, see How to force ZFS to replace a failed drive in place.
ZFS troubleshooting
There is a useful Troubleshooting page which includes a discussion of ZFS_events. Some useful commands are:
zpool events -v
zpool history
If a normal user, or the daily logwatch scripts, tries to execute zpool status, an error message may appear:
Permission denied the ZFS utilities must be run as root
This seems to be a Systemd issue, see permissions issues with openzfs #28653. There seems to be a fix in Udev vs tmpfiles take 2 #28732, however, this has not been tested on EL8 yet.
Disk quotas for ZFS
Read the zfs-userspace manual page to display space and quotas of a ZFS dataset.
We assume a ZFS filesystem <pool-name> and a specific user’s name <username> in the examples below.
Define a user’s disk quota userquota and number-of-files quota userobjquota:
zfs set userquota@<username>=1TB userobjquota@<username>=1M <pool-name>
Using a quota value of none will remove the quota.
We have written some Tools_for_managing_ZFS_disk_quotas providing, for example, commands similar to the standard Linux commands repquota and quota.
The superuser can view the user disk usage and quotas, see the zfs-userspace manual page:
zfs userspace filesystem|snapshot|path|mountpoint
zfs userspace -p filesystem|snapshot|path|mountpoint
zfs userspace -H -p -o name,quota,used,objquota,objused filesystem|snapshot|path|mountpoint
The -p flag prints parseable numbers, and the -H flag omits the heading.
The -o flag displays only specific columns, which could be used to calculate quota warnings.
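A quota warning calculation could look like the following sketch. It assumes that the columns arrive in the order name, quota, used (as selected with -o name,quota,used), and it skips every row whose quota is not a positive integer, so users without a quota are ignored. quota_warnings is a hypothetical helper name:

```shell
# Read "name quota used" lines on stdin (numbers in bytes) and print the
# users whose usage is at or above the given percentage of their quota
# (default 90 percent).
# quota_warnings is a hypothetical helper name, not part of ZFS.
quota_warnings() {
    awk -v pct="${1:-90}" '
        $2 ~ /^[0-9]+$/ && $2 > 0 && $3 * 100 >= $2 * pct {
            printf "%s uses %d%% of quota\n", $1, $3 * 100 / $2
        }'
}
# Example (as root):
#   zfs userspace -H -p -o name,quota,used <pool-name> | quota_warnings 90
```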
Normal users are not allowed to read quotas with the above commands. The following command allows a normal user to print disk usage and quotas:
/usr/sbin/zfs get userquota@$USER,userused@$USER,userobjquota@$USER,userobjused@$USER <pool-name>
Default quotas
Unfortunately, OpenZFS has no default user quota option. This is only available in the Oracle_Solaris_ZFS implementation, see the defaultuserquota page:
zfs set defaultuserquota=30gb <pool-name>
So with Linux OpenZFS you must set disk quotas individually for each user as shown above.
NFS sharing ZFS file systems
The zfsprops manual page explains about the NFS sharenfs option:
A file system with a sharenfs property of off is managed with the exportfs command and entries in the /etc/exports file. Otherwise, the file system is automatically shared and unshared with the zfs share and zfs unshare commands.
As an alternative to the exports file, use the zfs set/get sharenfs command to set or list the sharenfs property, as in this example:
zfs set sharenfs='rw=192.168.122.203' pool1/fs1
zfs get sharenfs pool1/fs1
ZFS will update its /etc/zfs/exports file automatically.
Never edit this file directly!
There are some discussions on NFS with ZFS:
NFS tuning
Make sure that a sufficient number of nfsd threads are started by configuring the [nfsd] section of the /etc/nfs.conf file:
threads=32
This number might be around the number of CPU cores in the server.
A systemctl restart nfs-server is required to update the parameters.
For optimizing the transfer of large files, increase the NFS read and write size in the NFS mount command on NFS clients, see man 5 nfs:
rsize=32768,wsize=32768
Larger values (powers of 2, such as 131072) may also be tried.
See also Optimizing Your NFS Filesystem.
ZFS quotas over NFS
The quota tools for Linux have absolutely no knowledge about ZFS quotas, nor does rquotad, and hence clients mounting via NFS are also unable to obtain this information. See a hack at https://aaronsplace.co.uk/blog/2019-02-12-zfsonline-nfs-quota.html.