Saturday, February 21, 2009

Tour goes to Palomar

This year the AMGEN Tour of California is (finally) coming to San Diego County to climb Palomar Mountain. Palomar Mountain is not too far from the Ranch, and the Palomar Volunteer Fire Department has put out the call for volunteers to help the department with whatever might happen as thousands of race fans converge on the mountain. The Ramona Outback Amateur Radio Society and Ramona-CERT are providing volunteers to help with communications for the department. I've spent quite a few hours recently with the fine folks on Palomar Mountain helping to coordinate communications for the event.

Today, we held a pre-race communications exercise as part of the 2009 San Diego County CERT Spring Exercise. Dozens of hams and CERT folks surveyed and characterized communications capabilities on the mountain. What is surprising to many people who don't regularly travel to the back country is that cell phone service is likely to be non-existent. And if you have a need for emergency response, you are likely to be an hour, not minutes, from the nearest hospital. There are many dangerous areas, such as high cliffs, sharp curves, and places where you could have an accident and not be found for some time. By providing extra eyes and ears on the mountain while thousands of fans arrive to witness the Tour of California, we're hoping for an accident-free event.

If you happen to be in the crowd coming to the mountain, prepare to arrive early. There are only two roads which are practical for getting to the top and the highway patrol is planning to monitor them closely. Fortunately, the Palomar Mountain Volunteer Fire Department is also sponsoring a bar-b-que, so be sure to stop by and visit.

I look forward to seeing everyone who climbs the mountain, even those arriving in cars. Stop by and visit for a while. And who knows, you might see yourself on Versus!

Saturday, February 7, 2009

zilstat improved

I've improved upon the first release of zilstat by adding a few columns to show the distribution of operations by size. To understand why this might be important for your situation, I refer to the source code comments for zfs_immediate_write_sz (link good until the source changes :-)
    /*
     * Writes are handled in three different ways:
     *
     * WR_INDIRECT:
     *    In this mode, if we need to commit the write later, then the block
     *    is immediately written into the file system (using dmu_sync),
     *    and a pointer to the block is put into the log record.
     *    When the txg commits the block is linked in.
     *    This saves additionally writing the data into the log record.
     *    There are a few requirements for this to occur:
     *      - write is greater than zfs_immediate_write_sz
     *      - not using slogs (as slogs are assumed to always be faster
     *        than writing into the main pool)
     *      - the write occupies only one block
     * WR_COPIED:
     *    If we know we'll immediately be committing the
     *    transaction (FSYNC or FDSYNC), the we allocate a larger
     *    log record here for the data and copy the data in.
     * WR_NEED_COPY:
     *    Otherwise we don't allocate a buffer, and *if* we need to
     *    flush the write later then a buffer is allocated and
     *    we retrieve the data using the dmu.
     */
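
To make the three cases in that comment easier to follow, here is a minimal Python sketch of the decision it describes. This is not the ZFS code: classify_write, its parameters, and the 32 kByte default threshold are my own illustrative assumptions, and the order of the checks simply follows the order of the comment. The names WR_INDIRECT and WR_NEED_COPY are the labels the ZFS source uses for the first and third cases. Note that the sketch takes as inputs facts that the kernel knows at the time of the write, such as whether a slog is in use.

```python
from enum import Enum

# Assumed default for zfs_immediate_write_sz, in bytes (32 kBytes).
ZFS_IMMEDIATE_WRITE_SZ = 32 * 1024


class WriteState(Enum):
    WR_INDIRECT = "block written to the pool, log record holds a pointer"
    WR_COPIED = "data copied into a larger log record right away"
    WR_NEED_COPY = "data fetched through the dmu only if a flush is needed"


def classify_write(size, using_slog, fits_one_block, sync_requested,
                   immediate_write_sz=ZFS_IMMEDIATE_WRITE_SZ):
    """Hypothetical helper: pick the write mode described in the comment."""
    # All three requirements listed in the comment must hold for WR_INDIRECT.
    if size > immediate_write_sz and not using_slog and fits_one_block:
        return WriteState.WR_INDIRECT
    # The transaction will be committed immediately (FSYNC or FDSYNC).
    if sync_requested:
        return WriteState.WR_COPIED
    # Otherwise no buffer is allocated until a flush is actually needed.
    return WriteState.WR_NEED_COPY


# A 64 kByte synchronous write to a pool without a separate log.
print(classify_write(64 * 1024, using_slog=False,
                     fits_one_block=True, sync_requested=True).name)
```

With a separate log in use, the same 64 kByte synchronous write would fall through to WR_COPIED in this sketch, which is the case the rest of this post is interested in.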

zilstat can see the size of the write and compare it to zfs_immediate_write_sz, but it is not so easy to implement the rest of the logic. To get past this difficulty, let's return to the original reason for writing zilstat in the first place: to answer the question, “how much ZIL write activity is generated by my workload, and will a separate log help?”

To get there from here, we could take a look at the size distribution of the ZIL writes. I've implemented this in an updated version of zilstat as follows:
  • Itty-bitty writes, those less than 4 kBytes. These might suggest a workload which is updating a lot of small files synchronously or perhaps a lot of metadata writes (though not all metadata writes may fit in small spaces).
  • Medium-sized writes, 4-32 kBytes. It might be more difficult to pin these down, so they get a bin.
  • Larger writes, > 32 kBytes. If you do not have a separate log, then these will be written to the pool and not the ZIL.
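
The three bins above can be sketched in a few lines of Python. This is only an illustration and not how zilstat itself is written; size_bin and bin_writes are hypothetical names, the sample sizes are made up, and the handling of writes that land exactly on 4 kBytes or 32 kBytes is my assumption.

```python
from collections import Counter

SMALL_LIMIT = 4 * 1024    # assumed upper edge of the small bin, in bytes
LARGE_LIMIT = 32 * 1024   # assumed lower edge of the large bin, in bytes


def size_bin(nbytes):
    """Return the bin label for one write of nbytes bytes."""
    if nbytes <= SMALL_LIMIT:
        return "<=4kB"
    if nbytes < LARGE_LIMIT:
        return "4-32kB"
    return ">=32kB"


def bin_writes(sizes):
    """Count writes per bin, keeping empty bins in the result."""
    counts = Counter({"<=4kB": 0, "4-32kB": 0, ">=32kB": 0})
    counts.update(size_bin(n) for n in sizes)
    return dict(counts)


sample_sizes = [512, 4096, 8192, 32768, 131072]  # made-up write sizes
print(bin_writes(sample_sizes))
```

Keeping the empty bins in the result means every interval reports the same three columns, even when nothing was written.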
Using these bins will allow better observability into the work that would be moved by adding a separate log. Here is what a sample output looks like:

# ./zilstat.ksh -t 60
TIME                 N-Bytes  N-Bytes/s  N-Max-Rate  B-Bytes  B-Bytes/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
2009 Feb 6 14:25:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:26:22   287368       4789      283064   618496      10308      524288    7      0       2       5
2009 Feb 6 14:27:22    10304        171       10304    94208       1570       94208    2      0       0       2
2009 Feb 6 14:28:22   223120       3718      213152   618496      10308      544768    7      0       2       5
2009 Feb 6 14:29:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:30:22     3336         55        1264    28672        477       12288    7      7       0       0
2009 Feb 6 14:31:22        0          0           0        0          0           0    0      0       0       0
2009 Feb 6 14:32:22   149768       2496      145480   294912       4915      282624    8      1       2       5
2009 Feb 6 14:33:22      248          4         248     4096         68        4096    1      1       0       0

For this workload, it looks like large synchronous writes are more common than small ones, and these will go directly to the pool, since I do not have a separate log on any of my pools. I really can't draw many conclusions from this sample, because the rate is quite low. More experiments with busier workloads may reveal some best-practice guidelines. Let me know what you discover.
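
If you collect a longer run, totaling the last four columns is a quick way to check an impression like the one above. Here is a small Python sketch that does so for the sample shown in this post. total_bins is a hypothetical helper of my own, and it assumes the operation count and the three size bins are always the last four fields on each data line.

```python
# The sample rows from this post, one interval per line.
SAMPLE = """\
TIME N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
2009 Feb 6 14:25:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:26:22 287368 4789 283064 618496 10308 524288 7 0 2 5
2009 Feb 6 14:27:22 10304 171 10304 94208 1570 94208 2 0 0 2
2009 Feb 6 14:28:22 223120 3718 213152 618496 10308 544768 7 0 2 5
2009 Feb 6 14:29:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:30:22 3336 55 1264 28672 477 12288 7 7 0 0
2009 Feb 6 14:31:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:32:22 149768 2496 145480 294912 4915 282624 8 1 2 5
2009 Feb 6 14:33:22 248 4 248 4096 68 4096 1 1 0 0
"""

BIN_COLUMNS = ("ops", "<=4kB", "4-32kB", ">=32kB")


def total_bins(text):
    """Sum the ops column and the three size-bin columns over all data lines."""
    totals = dict.fromkeys(BIN_COLUMNS, 0)
    for line in text.splitlines():
        fields = line.split()
        last = fields[-len(BIN_COLUMNS):]
        # Skip the header, blank lines, and anything that is not a data row.
        if len(last) < len(BIN_COLUMNS) or not all(f.isdigit() for f in last):
            continue
        for name, value in zip(BIN_COLUMNS, last):
            totals[name] += int(value)
    return totals


totals = total_bins(SAMPLE)
print(totals)
```

For the sample above this gives 32 operations in total, of which 17 fall in the largest bin, 6 in the middle bin, and 9 in the smallest.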