zilstat improved

I've improved upon the first release of zilstat by adding a few columns that show the distribution of operations by size. To understand why this might be important for your situation, I refer you to the source code comments for zfs_immediate_write_sz (link good until the source changes :-)
    /*
     * Writes are handled in three different ways:
     *
     * WR_INDIRECT:
     *    In this mode, if we need to commit the write later, then the block
     *    is immediately written into the file system (using dmu_sync),
     *    and a pointer to the block is put into the log record.
     *    When the txg commits the block is linked in.
     *    This saves additionally writing the data into the log record.
     *    There are a few requirements for this to occur:
     *    - write is greater than zfs_immediate_write_sz
     *    - not using slogs (as slogs are assumed to always be faster
     *      than writing into the main pool)
     *    - the write occupies only one block
     * WR_COPIED:
     *    If we know we'll immediately be committing the
     *    transaction (FSYNC or FDSYNC), then we allocate a larger
     *    log record here for the data and copy the data in.
     * WR_NEED_COPY:
     *    Otherwise we don't allocate a buffer, and *if* we need to
     *    flush the write later then a buffer is allocated and
     *    we retrieve the data using the dmu.
     */
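
The selection among the three cases boils down to a handful of tests in zfs_log_write(). Here is a C sketch paraphrasing the OpenSolaris logic of this era; variable names such as resid, slogging, and ioflag follow the source as I recall it, so treat this as a sketch rather than a verbatim quote:

    itx_wr_state_t write_state;
    boolean_t slogging = spa_has_slogs(zilog->zl_spa);  /* separate log device? */

    if (resid > zfs_immediate_write_sz &&   /* write is big enough */
        !slogging &&                        /* no separate log in the pool */
        resid <= zp->z_blksz)               /* write occupies only one block */
            write_state = WR_INDIRECT;      /* block written via dmu_sync(), pointer logged */
    else if (ioflag & (FSYNC | FDSYNC))
            write_state = WR_COPIED;        /* copy the data into the log record now */
    else
            write_state = WR_NEED_COPY;     /* fetch via the dmu only if a commit needs it */

Only the WR_INDIRECT case keeps the data itself out of the log, which is why the size of synchronous writes matters when deciding whether a separate log will help.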

zilstat can see the size of each write and compare it to zfs_immediate_write_sz, but it is not so easy to implement the rest of the logic. To get past this difficulty, let's return to the original reason for writing zilstat in the first place: to answer the question, “how much ZIL write activity is generated by my workload, and will a separate log help?”

To get there from here, we can look at the size distribution of the ZIL writes. I've implemented this in an updated version of zilstat using three bins, as follows (a sketch of the exact bin edges appears after the list):
  • Itty-bitty writes, 4 kBytes or less. These might suggest a workload that synchronously updates a lot of small files, or perhaps a lot of metadata writes (though not all metadata writes fit in small spaces).
  • Medium-sized writes, 4-32 kBytes. These are harder to pin down, so they get a bin of their own.
  • Larger writes, 32 kBytes and up. If you do not have a separate log, the data for these is written directly to the pool (the WR_INDIRECT case above) rather than into the ZIL.
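To make the bin edges explicit, here is a small illustrative C helper that matches the column headers in the output below. zilstat itself is a DTrace script wrapped in ksh, so this is my sketch of the logic, not code lifted from the tool; the inclusive edges are my reading of the <=4kB and >=32kB headers:

    #include <stdint.h>

    /*
     * Illustrative sketch: classify a ZIL write by size into the
     * three bins reported by zilstat (<=4kB, 4-32kB, >=32kB).
     */
    static int
    zil_size_bin(uint64_t bytes)
    {
            if (bytes <= 4096)
                    return (0);     /* <=4kB  : itty-bitty writes */
            else if (bytes < 32768)
                    return (1);     /* 4-32kB : medium-sized writes */
            else
                    return (2);     /* >=32kB : larger writes */
    }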
Using these bins allows better observability into the work that would be moved by adding a separate log. Here is what a sample output looks like:

# ./zilstat.ksh -t 60
TIME                 N-Bytes  N-Bytes/s  N-Max-Rate  B-Bytes  B-Bytes/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
2009 Feb 6 14:25:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:26:22   287368       4789      283064   618496      10308      524288    7      0       2       5
2009 Feb 6 14:27:22    10304        171       10304    94208       1570       94208    2      0       0       2
2009 Feb 6 14:28:22   223120       3718      213152   618496      10308      544768    7      0       2       5
2009 Feb 6 14:29:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:30:22     3336         55        1264    28672        477       12288    7      7       0       0
2009 Feb 6 14:31:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:32:22   149768       2496      145480   294912       4915      282624    8      1       2       5
2009 Feb 6 14:33:22      248          4         248     4096         68        4096    1      1       0       0


For this workload, it looks like large synchronous writes are more common than small ones, and since I do not have a separate log on any of my pools, these large writes go directly to the pool. I can't draw many conclusions from this sample because the rate is quite low. More experiments with busier workloads may reveal some best-practices guidelines. Let me know what you discover.
