Friday, January 30, 2009


A few weeks ago I was looking at a workload that I suspected generated a fair amount of synchronous writes on a ZFS file system. The general recommendation for this case is to use a separate ZIL log, preferably on a device which has a nonvolatile write cache or a write-optimized SSD. However, once you add a separate log to a pool, you cannot easily remove it. So before you make that move, it is a better idea to look at your workload and answer the question, "how much ZIL write activity is generated by my workload?"
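For reference, adding a separate log is the easy part; it's removing one that you can't do. A dry-run sketch (the pool name tank and device c4t0d0 are hypothetical, and the command is printed rather than executed unless you set ZPOOL=zpool):

```shell
# Hypothetical names; dry-run by default so nothing touches a real pool.
ZPOOL=${ZPOOL:-echo zpool}   # set ZPOOL=zpool to actually run the command
$ZPOOL add tank log c4t0d0   # attach a separate ZIL log device to the pool
```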

The result of this pondering is a script I call zilstat. You can get it here.

Sunday, January 25, 2009

Parallel ZFS send/receive

I've been doing some work recently centered around how to back up a continuously updated directory structure. The workload generates large numbers of files and directories over time. UFS would have no chance of handling this workload, but ZFS seems to handle it quite well. The current data shows that we can use ZFS send/receive to back up this data efficiently, even when large workloads are present. There are some tricks needed, though.

But first, a quick review of ZFS file systems. ZFS is designed with a different philosophy than many other file systems. In ZFS, physical devices are assigned to a storage pool. File systems (plural) are created in the pool. In traditional file system design, there is a 1:1 relationship between the file system and a physical device or physical device look-alike (which is how RAID systems are traditionally implemented). In all modern file systems, directories (sometimes called folders) are used to manage collections of files. In ZFS, file systems work similarly. In some of the early ZFS documents, you might notice words to the effect of "in ZFS file systems are as easy to manage as directories." To a large extent, this is true: it is as easy to create a ZFS file system as it is to create a directory, and for the vast majority of applications, there is no perceptible difference between a hierarchy of ZFS file systems and a hierarchy of directories.

As easy as ZFS file systems are to create, they do add to the complexity of systems administration. Many tools for managing files and backups treat the file system boundary differently than directories. To help systems administrators maintain sanity, I recommend creating a file system when you need to implement a different policy. For example, you may wish to make a file system read-only instead of read-write. The most common such policy seems to be quotas, and ZFS today implements quotas only on file systems. A complete list of the possible policies you might consider is available by looking at the parameters you can set on a file system:

# zfs get all my/file/system
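For example, a couple of per-file-system policies, sketched as a dry run (the dataset names are hypothetical, and the commands are printed rather than executed unless you set ZFS=zfs):

```shell
# Hypothetical dataset names; dry-run by default (commands are printed).
ZFS=${ZFS:-echo zfs}               # set ZFS=zfs to apply them for real
$ZFS set readonly=on tank/archive  # policy: this tree takes no writes
$ZFS set quota=50G tank/projects   # policy: cap the space this tree can use
```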

The results of my testing on ZFS replication using send/receive add another reason to use file systems: sends can be done in parallel. This has subtle but important implications. In my test workload, new files and directories are constantly being created. I used the ZFS auto-snapshot feature to make regularly scheduled snapshots. I then experimented with rsync and zfs send/receive to copy the data to a backup pool. I found:
  • Both rsync and ZFS send/receive can make incremental backups
  • Both rsync and ZFS send/receive are I/O bound in performance. rsync can also throttle its own bandwidth, which ZFS send cannot yet do (though a throttle placed in the pipeline can serve the same purpose). However, this workload was iops bound, not bandwidth bound, so rsync throttling would probably not work well.
  • Both rsync and ZFS send/receive work on a per-filesystem basis. This is an option for rsync, but an inherent constraint for ZFS. Sends can be recursive, with the -R option, though that will also replicate the file system parameters -- more on that later.
  • rsync will traverse the directory structure and stat(2) every file. For my workload, this would get slower and slower over time because millions more files would be added. The performance of those stats suffers most under iops-bound workloads, such as mine. Note: DNLC size is important for my workload and requires tuning.
  • ZFS send/receive sends the differences between snapshots of the dataset, and at that level ZFS doesn't really care about individual files or directories.
  • ARC size can impact backup performance. The ongoing workload is iops-bound, with many application reads and writes occurring continuously while the backup is being made. Recent writes will likely still be in the ARC, as long as the ARC is large enough. For my system, careful monitoring of the ARC size confirmed that I could devote a substantial amount of RAM to the ARC while still meeting application requirements. No L2ARC was used, mostly because operational constraints limited me to Solaris 10 10/08, which does not have the L2ARC feature. Backups using ZFS send/receive should be even more efficient with a reasonably large L2ARC -- 100+ GBytes.
  • Over time, I believe rsync will get totally bogged down traversing the directory structure and stat'ing files, whereas ZFS send/receive will take nominally the same time, based only on the amount of change during the time interval.
  • If something goes amiss, and for some reason the backups are not completed for a long period of time, then it will take a long time to catch up, for either choice. For this workload, the data tends to be added and removed, not reused. For a reuse case, such as you might find with a database, more work will have to be done to fully understand the implications of incremental, real-time replication.
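Putting the send/receive points above together, one incremental backup cycle for a single file system looks roughly like this (pool, dataset, and snapshot names are all hypothetical, and the commands are printed rather than executed unless you set ZFS=zfs):

```shell
ZFS=${ZFS:-echo zfs}   # dry-run by default; set ZFS=zfs to run for real
POOL=tank              # hypothetical source pool
BACKUP=backup          # hypothetical backup pool

# Take the next snapshot (the auto-snapshot service does this on a schedule).
$ZFS snapshot $POOL/data@snap2

# Send only the differences between the previous and current snapshots;
# no directory traversal or per-file stat(2) calls are involved.
$ZFS send -i $POOL/data@snap1 $POOL/data@snap2 | $ZFS receive -F $BACKUP/data
```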

The big trick to getting this to work well is the parallelism obtained by backing up multiple file systems in parallel. The bottleneck for the parallel replication is the iops load, which will increase for rsync over time while remaining more-or-less constant for ZFS send/receive. Having an iops-bound workload means that latency is more important than bandwidth, and while one of the parallel sends is waiting, another can be working. In a sense, this is very similar to the way that chip multithreading (CMT) works, except ZFS is waiting on disks, not memory.
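The parallel version is just that same cycle run as background jobs, one per file system (again, all names are hypothetical and the commands are printed rather than executed unless you set ZFS=zfs):

```shell
ZFS=${ZFS:-echo zfs}               # dry-run by default; set ZFS=zfs to run
POOL=tank                          # hypothetical source pool
BACKUP=backup                      # hypothetical backup pool
FILESYSTEMS="home mail projects"   # hypothetical file systems in the pool

# One incremental send per file system, all in flight at once: while one
# stream is waiting on disk iops, another can be doing useful work.
for fs in $FILESYSTEMS; do
    $ZFS send -i $POOL/$fs@snap1 $POOL/$fs@snap2 | \
        $ZFS receive -F $BACKUP/$fs &
done
wait   # block until every parallel stream has finished
```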

There is a lot more work to be done here, but I feel this is a pretty positive start. It may take months or years to really get my workload to a state where the steady state is such that it can be fully characterized. I look forward to seeing how it goes.

It's a conspiracy...

The other day I blogged about some blown capacitors on my graphics card. Well, around the same time, my wife noticed a strange burning smell near her computer. Sure enough, it smelled like burned electronics. Since there is a bunch of electronics near her desk and everything seemed to be working ok, it took a while to track down the source of the smell.

Argv! Another blown graphics card. This time the fan motor blew and sent the fan tumbling around inside the case. The card seemed to keep working ok, though it is not clear how long that would have been the case. Fortunately, she doesn't stress the 3-D graphics too much, except for the occasional screen blanker.
But now I think there is a conspiracy... how could two different graphics cards from different vendors, in different computers, connected to different power conditioners, both fail at around the same time -- and yet keep working for a while? It must've been related to those strange lights and humming noise we saw moving rapidly over the hilltops... no, those were F/A-18s from Miramar... time to call the ghostbusters...

Friday, January 23, 2009

POP! What was that???

I was sitting in my office yesterday afternoon, with Rosie, our Australian Shepherd, asleep at my feet, when I heard a loud POP! It sounded like a fire cracker. Rosie didn't budge, which is no longer surprising as she has been losing her hearing lately. I looked around to see if I could find something amiss. Nothing. No smoke. No funny smells. No new holes in the window. Nothing. So I went about my work as usual.
Today, when I went into the office and sat down to read the morning's e-mail, my monitors were all goofed up. Argv! This looks like what I would expect from failed hardware: gobbledegook on the screen. So I pulled the video card, and the problem was self-evident.

The tops of three capacitors were blown off! I'm not really surprised the capacitors blew -- that happens on these cheap devices. What surprised me is that I worked well into the evening yesterday with no problems. I presume that when I retired for the evening and the screen blanker kicked in, that was the end of proper functioning. Fortunately, I also have plenty of spares hanging around... always be prepared for failures.

Cross over and link back

I'm continuing my previous blog on this new site.