Tuesday, August 5, 2014

kstat changes in illumos

One of the nice changes to the kstat (kernel statistics) command in illumos is its conversion to C from perl. There were several areas in the illumos (nee OpenSolaris) code where perl had been used. But these were too few to maintain critical mass and it is difficult for interpreted runtimes to change at the pace of an OS, so keeping the two in lockstep is simply not worthwhile. So the few places where parts of illumos were written in perl have been replaced by native C implementations.

The kstat(1m) command rewritten in C was contributed by David Höppner, an active member of the illumos community. It is fast and efficient at filtering and printing kstats. By contrast, the old perl version had to start perl (an interpreter), find and load the kstat-to-perl module, and then filter and print the kstats. Internal to the kernel, kstats are stored as a name-value list (nvlist) containing strongly-typed data. Many of these are 64-bit integers. This poses a problem for the version of perl used (5.12) as the 64-bit support is dependent on the compiled version and illumos can be compiled for both 32 and 64 bit processors. To compensate for this mismatch, the following was added to the man page for kstat(3perl):

Several of the statistics provided by the kstat facility are stored as 64-bit integer values. Perl 5 does not yet internally support 64-bit integers, so these values are approximated in this module. There are two classes of 64-bit value to be dealt with: 64-bit intervals and times

These are the crtime and snaptime fields of all the statistics hashes, and the wtime, wlentime, wlastupdate, rtime, rlentime and rlastupdate fields of the kstat I/O statistics structures. These are measured by the kstat facility in nanoseconds, meaning that a 32-bit value would represent approximately 4 seconds. The alternative is to store the values as floating-point numbers, which offer approximately 53 bits of precision on present hardware. 64-bit intervals and timers as floating point values expressed in seconds, meaning that time-related kstats are being rounded to approximately microsecond resolution.

It is not useful to store these values as 32-bit values. As noted above, floating-point values offer 53 bitsof precision. Accordingly, all 64-bit counters are stored as floating-point values.

For consumers of the kstat(1m) command output, this means that kstat I/O statistics are stored as seconds (floating point) instead of nanoseconds. For example, with formatting adjusted for readability:
Perl-based kstat(1m)
# kstat -p sd:0:sd0
sd:0:sd0:class        disk
sd:0:sd0:crtime       1855326.99995062
sd:0:sd0:nread        380919301
sd:0:sd0:nwritten     1984175104
sd:0:sd0:rcnt         0
sd:0:sd0:reads        18455
sd:0:sd0:rlastupdate  2371703.49260763
sd:0:sd0:rlentime     147.154123471
sd:0:sd0:rtime        49.399890683
sd:0:sd0:snaptime     2371828.77138052
sd:0:sd0:wcnt         0
sd:0:sd0:wlastupdate  2371703.49174494
sd:0:sd0:wlentime     2.425675727
sd:0:sd0:writes       103602
sd:0:sd0:wtime        1.43643661

C-based kstat(1m)
# kstat -p sd:0:sd0
sd:0:sd0:class        disk
sd:0:sd0:crtime       244.271312204
sd:0:sd0:nread        25549493
sd:0:sd0:nwritten     1698218496
sd:0:sd0:rcnt         0
sd:0:sd0:reads        4043
sd:0:sd0:rlastupdate  104543293563241
sd:0:sd0:rlentime     68750036336
sd:0:sd0:rtime        64365048052
sd:0:sd0:snaptime     104509.092582653
sd:0:sd0:wcnt         0
sd:0:sd0:wlastupdate  104543293482995
sd:0:sd0:wlentime     4569934990
sd:0:sd0:writes       289766
sd:0:sd0:wtime        4551425719

I find kstat(1m) output to be very convenient for historical tracking and use it often. If you do too, then be aware of this conversion.

One of the best features of the new C-based kstat(1m) is the ability to get kstats as JSON. This is even more useful than the "parseable" output shown previously.
# kstat -j sd:0:sd0
  "module": "sd",
  "instance": 0,
  "name": "sd0",
  "class": "disk",
  "type": 3,
  "snaptime": 104547.013504692,
  "data": {
    "crtime": 244.271312204,
    "nread": 25549493,
    "nwritten": 1700980224,
    "rcnt": 0,
    "reads": 4043,
    "rlastupdate": 104733296813446,
    "rlentime": 68901598866,
    "rtime": 64513785819,
    "snaptime": 104547.013504692,
    "wcnt": 0,
    "wlastupdate": 104733296708770,
    "wlentime": 4579560895,
    "writes": 290404,
    "wtime": 4561051625

Using JSON has the added advantage of being easy to parse without making assumptions about the data. For example, did you know that some kernel modules use ':' in the kstat module, instance, name, or statistic? This makes using the parseable output a little bit tricky. The JSON output is not affected and is readily and consistently readable or storable in the many tools that support JSON.

Now you can see how to take advantage of the kstat(1m) command and how it has evolved under illumos to be more friendly to building tools and taking quick measurements. Go forth and kstat!

Monday, August 4, 2014

ZIO Scheduling and Resilver Workloads

Latency and performance problems in storage subsystems can be tricky to understand and tune. If you've ever been stuck in a traffic jam or waited in line to get into a concert, you know that queues can be frustrating to understand and trying on your patience. In modern computing systems, there are many different queues and any time we must share a constrained resource, one or more queues will magically appear in the architecture. Thus in order to understand the impacts of the constrained resource, you need to understand the queues.

The Good Old ZFS I/O Scheduler

Looking at the queues the old ZFS I/O (ZIO) scheduler deserves investigation first, because it still represents the vast majority of the installed base, including all Solaris installations. The new OpenZFS write throttle changes this area of the code and deserves its own treatment in a separate post. Adding a resilvering workload into the mix shows how overall performance is affected by the queues and constrained resources.

The ZIO scheduler sends I/O requests to the virtual devices (vdevs) based upon weighted priority. For old ZFS implementations, the priority table maps an I/O type to a value useful for sorting incoming I/O workload requests. The priority table has grown over time, but the general approach is to favor the "I need it now!" demands from the "process this when you get a chance" workloads. The table usually looked like this, with higher priority entries having lower numbers:

0 - Priority NOW
0 - Sync read
0 - Sync write
0 - Log write
1 - Cache fill
2 - Dedup Table Prefetch
4 - Free
4 - Async write
6 - Async read
10 - Resilver
20 - Scrub

Background I/Os such as resilver and scrub are scheduled with a lower priority so they do not interfere with more immediate needs, such as sync reads or writes. The ZIO scheduler does a good job scheduling high priority I/Os over resilver and scrub I/Os.

But the ZIO scheduler only works for one queue -- and there are many queues in the system -- some of which don't have priority-based schedulers. And this is where I begin to grumble about queues... grumble, grumble, grumble.

Consider the following, simplified model of the ZIO scheduler: a resilver workload, a normal (I need it now!) workload, and a pool built of mirrors.

The ZIO scheduler prioritizes work in the ZIO pipeline. But once the I/Os have been passed along to the vdevs or disks, the ZIO scheduler can do nothing. Also, the typical queue wrapped around a constrained physical storage device, such as a HDD, doesn't have an easy-to-use, priority-based scheduler.

The depth of the queues down the stack can be quite large. It is not uncommon to see HBAs capable of queuing thousands of I/Os. Modern enterprise-grade SSDs can also have thousands of concurrent I/Os. Even modest HDDs can handle multiple concurrent I/Os.

To maintain some control over the number of I/Os passed along to the disks, old ZFS has a kernel tunable named zfs_vdev_max_pending. In very old ZFS, this was set to 35. This turned out to be a bad default choice. In old ZFS, it is 10, a better choice than 35 for many systems. In modern OpenZFS, the ZIO scheduler is replaced with a better algorithm altogether. Concentrating on old ZFS, this means 10 I/Os will be passed along to the disks and the remainder will hang around in a queue and be scheduled by the ZIO scheduler. In the previous queue diagram, this is represented by the sigma (Σ) as the sum of the depth of the queues.

The two-queue model for the disks shown in the diagram also works well for analysis in Solaris-based OSes, where the iostat command has a two-queue model. The internal kstats implement a Riemann algorithm with two queues and this can give some insight to the workings of the I/O subsystem. In iostat, these are known as the wait and active queues, for historical reasons. 

An important assumption in this model is the service time for handling the I/Os is constant. If you actually measure this, you will find a very different service time for SSDs vs HDDs. For example, most Flash-based SSDs really like many concurrent I/Os and have write pipelines especially well suited for coalescing many write I/Os. For these devices, deeper queues can result in lower per-I/O response time, a very good thing. By contrast, HDDs perform poorly as the queue depth increases, but in an opposite manner than Flash SSDs. By default, HDDs buffer writes so their service time tends to be more consistent. But HDD reads do require physical movement and media rotation leading to response times less predictable than writes.

The biggest issue with queuing concurrent I/Os to HDDs is the elevator algorithms used tend to result in a few I/Os getting starved, while the HDD tries to optimize for the majority. Further, these I/Os are not prioritized, so it is quite possible for a ZFS sync read I/O to be starved while the disk optimizes for the resilver I/Os. This starvation is not under control of the OS or ZFS. 

What might this starvation look like? Benchmarks and statistics will clearly show us. We know there is a scheduler in the disk device and we hope it will work for us. To find out, we'll create a full-stroke, totally random, 4KB-sized, 100% read or 100% write workload. Our workload includes:
  • Full-stroke gives us the average worst case workload, but also one validating the vendor's datasheet for average seek and average rotational delay
  • 4KB isn't particularly magic, it is just small enough that we won't reach bandwidth saturation for any component in the system: memory, CPU, PCIexpress, HBA, SAS/SATA link, or media bandwidth
  • 100% read workload because these workloads are difficult to cache, thus providing better insight into where the real performance limits lie
  • 100% write workload to show how deferred writes can lull you into a false sense of performance, due to caching
In this workload experiment, we're measuring latency. Bandwidth isn't saturated and IOPS are meaningless. To prove both, we'll take a look at the workload running with a typical enterprise-grade 7,200 rpm HDD.

For this device, the average rotational delay is 4.167ms (7,200 rpm) and the datasheet says the average read seek time is just over 8ms. We expect the average 4KB full-stroke random read to be about 12.5ms or 80 IOPS. And there it is! For a single thread (single concurrent I/O) 100% read workload, the measured response time is close to 12.5 ms and 80 IOPS. Woohoo!

The bandwidth consumed by servicing 80 IOPS at 4KB is a whopping 320 KB/sec or around 2.5 Mbits/sec. This is, at best, about 1/1000th of the bandwidth of the disk interconnect, HBA, media, or memory. Clearly, there is no bandwidth constraint for this experiment.

Unfortunately, vendor datasheets are not sufficient for explaining the constraints demonstrated by the experiment. The 100% write workload has both lower latency and higher IOPS than the read for all measured numbers of concurrent operations (threads). The writes are also buffered and only committed to permanent media when we issue a synchronize cache command. In this test, we don't issue synchronize cache commands, so the decision to put on media is left to the HDD. We know there is a scheduler ultimately trying to make the best decision on writes to the media, so the latency is a measure of the efficiency of the HDD scheduler with the constraints placed by a continuous stream of small, random writes. The benefits of the scheduler and deferred commits to permanent media for writes are shown by the lower latency and higher IOPS than the 100% read workload.

The clue that there is trouble in paradise is not in the number of measured IOPS. In all cases, the IOPS increases as the concurrency increases -- a good thing. Unfortunately, the latency also increases rather significantly -- a very bad thing. 

All IOPS are Not Created Equal

The measurements shown in the previous graph are averages and do not show a linear relationship between IOPS and latency. Also, if we were to try a linear regression, we would have a problem fitting the mixed workload cases. This is evident from the data, in which a linear relationship could fit the curve well, but the read and write cases have very different slopes. 

We need to dig further into the latency and queus to understand what lies beneath. For this, we examine the latency of each and every I/O operation. I've long been a fan of iosnoop and will draw your attention to Brendan's porting of iosnoop to Linux. Now you can run iosnoop on Mac OSX, illumos and Solaris distros, and now Linux. Toss in your favorite stats graphing program and you no longer have any excuse for not understanding the distribution of the latency of your storage systems. Let's take a look at our data distributions by read vs write and one vs ten concurrent I/Os. We'll measure in units of microseconds, which is useful for storage devices.

We begin by looking at the Concurrent I/Os=1 cases. For both reads and writes, we see a really nice example of the normal distribution; this is what your stats teacher was trying to drill into your head while you were daydreaming in class. Rejoice when you see these in real life; in computer systems, they are few and far between. From a practical perspective, this means the median and averages are almost identical and common intuition about response time will apply. 

The mean of the reads is 11.8 ms, very close to the datasheet prediction of 12.5ms. Writes show a mean of 5.1ms, due primarily to the deferred writes and device scheduler optimizations. The 90% percentile for writes is also much better than the reads, also affirming the benefits of the device's scheduler.

By contrast, issuing 10 concurrent I/Os shows a different picture of the latency. The read workload looks staggering similar to an exponential distribution. The large difference between mean and median latency is an indictor the device's scheduling algorithms are starving some reads. The 90% percentile has ballooned to 148 ms, and some poor reads took more than 300 ms. 

The write workload still seems normal, but not quite the bell curve we expect from normal distributions. Writes fares better than the reads, with the worse case write taking slightly more than 100 ms.

Device Queues and the ZIO Scheduler

Now that we understand how HDDs react to concurrent workloads, we can revisit the ZIO scheduler. The resilver workload is typically described as one or more reads for each write:
  • Resilver I/Os are constantly running in the background, injecting their workload into the the ZIO pipeline along with the "I need it now!" workload
  • Once issued to the device, the ZIO scheduler cannot adjust the priority
  • At the device, more concurrent I/Os leads to longer latency, with extreme latency 10x longer than the single-I/O case
To prevent normal workload I/Os from getting stuck behind resilver I/Os after being issued to the device, we need to reduce the depth of the queue at the device by reducing zfs_vdev_max_pending. The perfect value is based on the expected workload. For latency-sensitive workloads, a smaller number results in more consistency and less variance. For bandwidth-intensive workloads, having a few more I/Os can help.

To quiet my grumbling, here are my rules-of-thumb for zfs_vdev_max_pending:

1 = the lower bound, but exposes all of the serialization in the stack
2 = a good alternative to 1 for HDDs -- offers the disk an opportunity to optimize, but doesn't overwhelm
3 = not a bad choice
4 = upper bound for HDDs, IMHO
16 = good for many Flash SSDs
30+ = good for RAID arrays

Herein lies the dilemma: HDDs like fewer concurrent I/Os, but SSDs like more concurrent I/Os, especially for write-intensive workloads. But there is only one tunable for all disks in all pools on the system. Fortunately, you can change this on the fly, without rebooting, and measure the impact on your workload. For Solaris or older illumos-based distributions, this example sets the value to 2:

echo zfs_vdev_max_pending/W0t2 | mdb -kw

to return to the modern default:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

The effects will take place almost immediately. Monitor the latency changes using iostat -x or another disk latency measure. For direct measurements of all I/Os, as shown previously, consider using iosnoop and a stats package.

In some storage appliances based on ZFS, there are options for setting this value. Consult your documentation for further information. In some cases, the term zfs_vdev_max_pending might be disguised as another name, so look for a tunable related to the number of concurrent I/Os sent to disks.

Understanding queues and their impact on constrained resources can be tricky. Not all of the behaviors follow normal distributions and simple math. Measuring the latency of constrained resources while varying the workload allows you to make adjustments, improving the overall performance of the system. Tuning ZFS device queues is easier when using my tips and proven techniques.

May your traffic jams be short, may you be in the front of the line at the next concert, and may all of your I/O have low latency variance!