Saturday, January 9, 2010

I/O Reduction and the ZIL

I came across an interesting microbenchmark this week. It shows that some workloads produce confusing results, or head fakes, that make benchmark results difficult to interpret. In this case, a method we use for finding the performance envelope for ZFS is not effective.

Before I dive into the microbenchmark, a few words about the ZFS Intent Log (ZIL). ZFS is a transactional file system, which means that it collects I/O into a transaction group (txg) and commits that txg to persistent storage. In later ZFS implementations, that txg commit occurs every 30 seconds. However, if an application needs to ensure that an I/O is written to persistent storage immediately, often called synchronous writes (though that is arguably not the best descriptive term), then waiting for up to 30 seconds is not an option. This is where the ZIL enters the picture. In the synchronous write case, ZFS will write the record to the ZIL and later commit the record with the txg. This ensures the synchronous write agreement between the application and ZFS is honored -- a good thing. Neil Perrin offers a more detailed description in his famous lumberjack blog posting.

Synchronous writes are the bane of high performance. Really. We see this every day. They cause performance guys to gnash their teeth and cuss. When a microbenchmark wanders along and does a lot of synchronous writes, complaints about how " sucks" and "I can't believe those file system developers could be so insensitive" come pouring forth.

To determine the performance envelope of a benchmark, it is relatively easy to disable the ZIL. This is neither a safe nor recommended option for production systems or people who like their data. But for benchmarking, it allows a performance engineer to quickly determine the best possible performance for the given system configuration. The ZIL is then re-enabled and the work can concentrate on how to approach that performance goal. Tools like zilstat are designed to help with this endeavor, and can save you a lot of time when you suspect synchronous write performance might be an issue.

But disabling the ZIL can also hide important behavior. That is why this microbenchmark could be a poster child for benchmarking that doesn't do what you expect. Here it is:

while true; do
    echo "blah" > outputfile
done

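To turn the loop above into a number you can compare across configurations, one option is to count how many overwrites complete in a fixed window. This is only a sketch: the function name count_iterations and its two arguments are made up for illustration, and it assumes a date command that understands +%s (GNU and BSD do, but not every platform does). Calling date on every pass adds overhead of its own, so treat the result as a rough rate. Something like `count_iterations outputfile 10` would print the approximate iterations per second over ten seconds.

```shell
#!/bin/sh
# Hypothetical helper: overwrite a file repeatedly for a fixed number of
# seconds and print the approximate iterations per second.
# Assumes `date +%s` is available (not guaranteed on every platform).
count_iterations() {
    target=$1      # file to overwrite, e.g. a file on the NFS mount
    seconds=$2     # length of the measurement window, in whole seconds
    start=$(date +%s)
    end=$((start + seconds))
    n=0
    while [ "$(date +%s)" -lt "$end" ]; do
        echo "blah" > "$target"
        n=$((n + 1))
    done
    # Integer division: a rough rate, not a precise measurement.
    echo $((n / seconds))
}
```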
When run on a Solaris NFS client with a Solaris NFS server using default NFS settings, this will cause the following to occur:
  1. outputfile is LOOKUPed
  2. outputfile is OPENed
  3. ACCESS to outputfile is checked
  4. The data is written to the file with WRITE
  5. The data is COMMITted
  6. outputfile is CLOSEd
This will also, by default, cause the file to be synchronously written, the so-called "sync-on-close" operation.

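If you want to generate a similar stream of synchronous overwrites without an NFS client in the picture, a rough local stand-in is to flush the file after every write. The sketch below is an approximation, not the same thing as the NFS WRITE and COMMIT sequence: it assumes a dd that supports conv=fsync (GNU coreutils does), and the function name sync_overwrite is made up for illustration. On a ZFS file system, that kind of flush request is what the ZIL services. Calling it in a loop, for example `while true; do sync_overwrite outputfile; done`, gives a local variant of the microbenchmark.

```shell
#!/bin/sh
# Hypothetical local stand-in for the sync-on-close pattern:
# write the data, then have dd fsync the output file before it exits.
# Assumes a dd that supports conv=fsync (GNU coreutils does).
sync_overwrite() {
    # dd truncates the output file by default, so each call overwrites it.
    printf 'blah\n' | dd of="$1" conv=fsync 2>/dev/null
}
```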
Argv! This simple microbenchmark actually makes a lot of synchronous writes to the file system. zilstat will happily show that the ZIL is working hard when running this microbenchmark. If you run this, then you can experiment with various pool or separate (ZIL) log configurations to your heart's content.

However, if you disable the ZIL, then the number of I/O operations is reduced to just a handful every 30 seconds. Why? Because ZFS is clever enough to recognize that the same file is being overwritten, and it is only concerned with physically committing the last version in the transaction group. In other words, the amount of I/O traffic to the pool is dramatically reduced. When this happens, you are no longer measuring only the effect of ZIL I/O; you are also measuring the effect of the missing main pool I/O. The results look something like this:
  1. ZIL enabled, no separate log = 100 iterations/second
  2. ZIL enabled, separate log on a fast SSD = 1,000 iterations/second
  3. ZIL disabled = 10,000 iterations/second
In other words, the effect of eliminating the pool I/O in addition to the ZIL I/O made the system faster! Hurray! But wait just a dog-gone second. That means that the benchmark is basically useless -- it does almost zero physical I/O when the ZIL is disabled. This is kinda like redirecting all of the data to /dev/null -- a fun trick to amuse your friends at parties, but otherwise completely useless.

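To put the three rates in perspective, it helps to convert them into time per iteration and into the number of overwrites that pile up in one txg interval. The arithmetic below is illustrative only: it uses the approximate rates quoted above and the 30-second txg interval mentioned earlier, and the function name summarize_rates is made up. At 10,000 iterations/second, roughly 300,000 overwrites of the same file fall into a single txg, yet only the last one has to reach the pool.

```shell
#!/bin/sh
# Illustrative arithmetic only, using the approximate rates quoted in the
# post and a 30-second txg interval.
summarize_rates() {
    for rate in 100 1000 10000; do
        awk -v r="$rate" -v txg=30 'BEGIN {
            # Average time per iteration in milliseconds, and the number
            # of logical overwrites that land in one txg interval.
            printf "%d iter/s: %.1f ms per iteration, %d overwrites per txg\n",
                   r, 1000 / r, r * txg
        }'
    done
}
summarize_rates
```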
The moral of this tale is: beware of microbenchmarks and how they can confuse your understanding of the real system behavior.

P.S. Don't disable the ZIL.

P.P.S. I really mean it, don't disable the ZIL. Seriously. I might cut you some slack for benchmark purposes, but other than that, don't disable the ZIL. Period. End of discussion.


  1. Hi Richard,

    Interesting post, but regarding the ZIL, having 1 storage server and 2 servers using NFS, where the storage server is the NFS Server, connected to a no-break system that lasts for 4 hours in case of power-off, the ZIL still should not be disabled?

  2. @it4it, even if you have power, that does not preserve the data if the cause of the outage is not the power subsystem. For example, a reset, panic, or catastrophic mobo crash would result in data loss. Do your data a favor, don't disable the ZIL.