Parallel ZFS send/receive

I've been doing some work recently centered on how to back up a continuously updated directory structure. The workload generates large numbers of files and directories over time. UFS would have no chance of handling this workload, but ZFS seems to handle it quite well. The data so far shows that we can use ZFS send/receive to back up this data efficiently, even while a heavy workload is running. There are some tricks needed, though.

But first, a quick review of ZFS file systems. ZFS is designed with a different philosophy than many other file systems. In ZFS, physical devices are assigned to a storage pool, and file systems (plural) are created in the pool. In traditional file system design, there is a 1:1 relationship between the file system and a physical device or a physical-device look-alike (which is how RAID systems are traditionally implemented). In all modern file systems, directories (sometimes called folders) are used to manage collections of files, and ZFS file systems work similarly. In some of the early ZFS documents, you might notice words to the effect of "in ZFS, file systems are as easy to manage as directories." To a large extent this is true: it is as easy to create a ZFS file system as it is to create a directory, and for the vast majority of applications there is no perceptible difference between a hierarchy of ZFS file systems and a hierarchy of directories.
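
For example, creating a new file system in an existing pool is a one-line operation, much like creating a directory (a quick sketch; the pool name tank and the file system name projects are placeholders):

# zfs create tank/projects

By default the new file system is mounted at /tank/projects and inherits its properties from its parent, so applications see it as just another directory.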

However easy it is to create ZFS file systems, they do increase the complexity of systems administration. Many tools for managing files and backups treat file system boundaries differently than directory boundaries. To help systems administrators maintain sanity, I recommend creating a separate file system when you need to implement a different policy. For example, you may wish to make a file system read-only instead of read-write. The most common policy seems to be quotas, and ZFS today implements quotas only on file systems. A complete list of the policies you might consider is available by looking at the properties you can set on a file system:

# zfs get all my/file/system
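
Setting a policy is then just a matter of setting the corresponding property. A quick sketch, reusing the placeholder dataset name from above:

# zfs set readonly=on my/file/system
# zfs set quota=100G my/file/system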

The results of my testing on ZFS replication using send/receive add another reason to use file systems: sends can be done in parallel. This has subtle but important implications. In my test workload, new files and directories are constantly being created. I used the ZFS auto-snapshot feature to make regularly scheduled snapshots, and then experimented with rsync and zfs send/receive to copy the data to a backup pool. I found:
  • Both rsync and ZFS send/receive can make incremental backups
  • Both rsync and ZFS send/receive are I/O bound in performance. rsync also has the ability to throttle its own bandwidth, which ZFS send does not, yet (this can be managed by an entity placed in the pipeline, though). However, in this workload the backup was IOPS bound, not bandwidth bound, so rsync throttling would probably not help much.
  • Both rsync and ZFS send/receive work on a per-file-system basis. This is an option for rsync, but an inherent constraint for ZFS. Sends can be recursive with the -R option, though that will also replicate the file system properties -- more on that later.
  • rsync will traverse the directory structure and stat(2) every file. For my workload, this would get slower and slower over time because millions more files are added over time. The performance of those stats suffers most under an IOPS-bound workload such as mine. Note: DNLC size is important for my workload and requires tuning.
  • ZFS send/receive sends the differences between snapshots of the dataset, which doesn't really care about individual files or directories (see the sketch after this list).
  • ARC size can impact backup performance. The ongoing workload is IOPS bound, with many application reads and writes occurring continuously while the backup is being made. Recent writes will likely still be in the ARC, as long as the ARC is large enough. For my system, careful monitoring of the ARC size confirmed that I could devote a substantial amount of RAM to the ARC while still meeting application requirements. No L2ARC was used, mostly because operational constraints limited me to Solaris 10 10/08, which does not have the L2ARC feature. Backups using ZFS send/receive should be even more efficient with a reasonably large L2ARC -- 100+ GBytes.
  • Over time, I believe rsync will get totally bogged down traversing the directory structure and stat'ing files, whereas ZFS send/receive should take nominally the same time -- proportional to the amount of change during the interval rather than the total number of files.
  • If something goes amiss and for some reason the backups are not completed for a long period of time, then it will take a long time to catch up with either choice. For this workload, data tends to be added and removed, not rewritten in place. For a rewrite-heavy case, such as you might find with a database, more work will have to be done to fully understand the implications of incremental, real-time replication.
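
To make this concrete, an incremental backup of a single file system has roughly the following shape (a sketch only; the dataset names tank/projects and backup/projects and the snapshot names snap1 and snap2 are placeholders, and in my setup the auto-snapshot feature creates the snapshots on a schedule):

# zfs snapshot tank/projects@snap1
# zfs send tank/projects@snap1 | zfs receive backup/projects
# zfs snapshot tank/projects@snap2
# zfs send -i tank/projects@snap1 tank/projects@snap2 | zfs receive -F backup/projects

The first send is a full copy; each later send with -i transfers only the changes between the two snapshots, and -F rolls the target back to its last received snapshot in case it was touched in between. If bandwidth throttling were ever needed, a rate-limiting utility (pv -L, for example) could be placed between the send and the receive.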

The big trick to making this work well is the parallelism obtained by backing up multiple file systems in parallel. The bottleneck for the parallel replication is the IOPS load, which will increase for rsync over time while remaining more-or-less constant for ZFS send/receive. Having an IOPS-bound workload means that latency is more important than bandwidth, and while one of the parallel sends is waiting, another can be working. In a sense, this is very similar to the way chip multithreading (CMT) works, except that ZFS is waiting on disks, not memory.
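
The sketch below shows the general idea, assuming the data is spread across several file systems under tank and a small shell loop drives one send/receive pipeline per file system (the file system names, snapshot names, and degree of parallelism are placeholders; a real scheduler would also track which snapshots had already been received):

# for fs in proj1 proj2 proj3 proj4
> do
>   zfs send -i tank/$fs@snap1 tank/$fs@snap2 | zfs receive -F backup/$fs &
> done
# wait

While one pipeline is blocked waiting on the disks, another can make progress, so the number of concurrent streams is something to tune against the IOPS the pool can sustain.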

There is a lot more work to be done here, but I feel this is a pretty positive start. It may take months or years for my workload to reach a steady state that can be fully characterized. I look forward to seeing how it goes.
