Monday, August 4, 2014

ZIO Scheduling and Resilver Workloads

Latency and performance problems in storage subsystems can be tricky to understand and tune. If you've ever been stuck in a traffic jam or waited in line to get into a concert, you know that queues can be frustrating to understand and trying on your patience. In modern computing systems, there are many different queues and any time we must share a constrained resource, one or more queues will magically appear in the architecture. Thus in order to understand the impacts of the constrained resource, you need to understand the queues.

The Good Old ZFS I/O Scheduler

Looking at the queues the old ZFS I/O (ZIO) scheduler deserves investigation first, because it still represents the vast majority of the installed base, including all Solaris installations. The new OpenZFS write throttle changes this area of the code and deserves its own treatment in a separate post. Adding a resilvering workload into the mix shows how overall performance is affected by the queues and constrained resources.

The ZIO scheduler sends I/O requests to the virtual devices (vdevs) based upon weighted priority. For old ZFS implementations, the priority table maps an I/O type to a value useful for sorting incoming I/O workload requests. The priority table has grown over time, but the general approach is to favor the "I need it now!" demands from the "process this when you get a chance" workloads. The table usually looked like this, with higher priority entries having lower numbers:

0 - Priority NOW
0 - Sync read
0 - Sync write
0 - Log write
1 - Cache fill
2 - Dedup Table Prefetch
4 - Free
4 - Async write
6 - Async read
10 - Resilver
20 - Scrub

Background I/Os such as resilver and scrub are scheduled with a lower priority so they do not interfere with more immediate needs, such as sync reads or writes. The ZIO scheduler does a good job scheduling high priority I/Os over resilver and scrub I/Os.

But the ZIO scheduler only works for one queue -- and there are many queues in the system -- some of which don't have priority-based schedulers. And this is where I begin to grumble about queues... grumble, grumble, grumble.

Consider the following, simplified model of the ZIO scheduler: a resilver workload, a normal (I need it now!) workload, and a pool built of mirrors.


The ZIO scheduler prioritizes work in the ZIO pipeline. But once the I/Os have been passed along to the vdevs or disks, the ZIO scheduler can do nothing. Also, the typical queue wrapped around a constrained physical storage device, such as a HDD, doesn't have an easy-to-use, priority-based scheduler.

The depth of the queues down the stack can be quite large. It is not uncommon to see HBAs capable of queuing thousands of I/Os. Modern enterprise-grade SSDs can also have thousands of concurrent I/Os. Even modest HDDs can handle multiple concurrent I/Os.

To maintain some control over the number of I/Os passed along to the disks, old ZFS has a kernel tunable named zfs_vdev_max_pending. In very old ZFS, this was set to 35. This turned out to be a bad default choice. In old ZFS, it is 10, a better choice than 35 for many systems. In modern OpenZFS, the ZIO scheduler is replaced with a better algorithm altogether. Concentrating on old ZFS, this means 10 I/Os will be passed along to the disks and the remainder will hang around in a queue and be scheduled by the ZIO scheduler. In the previous queue diagram, this is represented by the sigma (Σ) as the sum of the depth of the queues.

The two-queue model for the disks shown in the diagram also works well for analysis in Solaris-based OSes, where the iostat command has a two-queue model. The internal kstats implement a Riemann algorithm with two queues and this can give some insight to the workings of the I/O subsystem. In iostat, these are known as the wait and active queues, for historical reasons. 

An important assumption in this model is the service time for handling the I/Os is constant. If you actually measure this, you will find a very different service time for SSDs vs HDDs. For example, most Flash-based SSDs really like many concurrent I/Os and have write pipelines especially well suited for coalescing many write I/Os. For these devices, deeper queues can result in lower per-I/O response time, a very good thing. By contrast, HDDs perform poorly as the queue depth increases, but in an opposite manner than Flash SSDs. By default, HDDs buffer writes so their service time tends to be more consistent. But HDD reads do require physical movement and media rotation leading to response times less predictable than writes.


The biggest issue with queuing concurrent I/Os to HDDs is the elevator algorithms used tend to result in a few I/Os getting starved, while the HDD tries to optimize for the majority. Further, these I/Os are not prioritized, so it is quite possible for a ZFS sync read I/O to be starved while the disk optimizes for the resilver I/Os. This starvation is not under control of the OS or ZFS. 

What might this starvation look like? Benchmarks and statistics will clearly show us. We know there is a scheduler in the disk device and we hope it will work for us. To find out, we'll create a full-stroke, totally random, 4KB-sized, 100% read or 100% write workload. Our workload includes:
  • Full-stroke gives us the average worst case workload, but also one validating the vendor's datasheet for average seek and average rotational delay
  • 4KB isn't particularly magic, it is just small enough that we won't reach bandwidth saturation for any component in the system: memory, CPU, PCIexpress, HBA, SAS/SATA link, or media bandwidth
  • 100% read workload because these workloads are difficult to cache, thus providing better insight into where the real performance limits lie
  • 100% write workload to show how deferred writes can lull you into a false sense of performance, due to caching
In this workload experiment, we're measuring latency. Bandwidth isn't saturated and IOPS are meaningless. To prove both, we'll take a look at the workload running with a typical enterprise-grade 7,200 rpm HDD.


For this device, the average rotational delay is 4.167ms (7,200 rpm) and the datasheet says the average read seek time is just over 8ms. We expect the average 4KB full-stroke random read to be about 12.5ms or 80 IOPS. And there it is! For a single thread (single concurrent I/O) 100% read workload, the measured response time is close to 12.5 ms and 80 IOPS. Woohoo!

The bandwidth consumed by servicing 80 IOPS at 4KB is a whopping 320 KB/sec or around 2.5 Mbits/sec. This is, at best, about 1/1000th of the bandwidth of the disk interconnect, HBA, media, or memory. Clearly, there is no bandwidth constraint for this experiment.

Unfortunately, vendor datasheets are not sufficient for explaining the constraints demonstrated by the experiment. The 100% write workload has both lower latency and higher IOPS than the read for all measured numbers of concurrent operations (threads). The writes are also buffered and only committed to permanent media when we issue a synchronize cache command. In this test, we don't issue synchronize cache commands, so the decision to put on media is left to the HDD. We know there is a scheduler ultimately trying to make the best decision on writes to the media, so the latency is a measure of the efficiency of the HDD scheduler with the constraints placed by a continuous stream of small, random writes. The benefits of the scheduler and deferred commits to permanent media for writes are shown by the lower latency and higher IOPS than the 100% read workload.

The clue that there is trouble in paradise is not in the number of measured IOPS. In all cases, the IOPS increases as the concurrency increases -- a good thing. Unfortunately, the latency also increases rather significantly -- a very bad thing. 

All IOPS are Not Created Equal

The measurements shown in the previous graph are averages and do not show a linear relationship between IOPS and latency. Also, if we were to try a linear regression, we would have a problem fitting the mixed workload cases. This is evident from the data, in which a linear relationship could fit the curve well, but the read and write cases have very different slopes. 

We need to dig further into the latency and queus to understand what lies beneath. For this, we examine the latency of each and every I/O operation. I've long been a fan of iosnoop and will draw your attention to Brendan's porting of iosnoop to Linux. Now you can run iosnoop on Mac OSX, illumos and Solaris distros, and now Linux. Toss in your favorite stats graphing program and you no longer have any excuse for not understanding the distribution of the latency of your storage systems. Let's take a look at our data distributions by read vs write and one vs ten concurrent I/Os. We'll measure in units of microseconds, which is useful for storage devices.


We begin by looking at the Concurrent I/Os=1 cases. For both reads and writes, we see a really nice example of the normal distribution; this is what your stats teacher was trying to drill into your head while you were daydreaming in class. Rejoice when you see these in real life; in computer systems, they are few and far between. From a practical perspective, this means the median and averages are almost identical and common intuition about response time will apply. 

The mean of the reads is 11.8 ms, very close to the datasheet prediction of 12.5ms. Writes show a mean of 5.1ms, due primarily to the deferred writes and device scheduler optimizations. The 90% percentile for writes is also much better than the reads, also affirming the benefits of the device's scheduler.

By contrast, issuing 10 concurrent I/Os shows a different picture of the latency. The read workload looks staggering similar to an exponential distribution. The large difference between mean and median latency is an indictor the device's scheduling algorithms are starving some reads. The 90% percentile has ballooned to 148 ms, and some poor reads took more than 300 ms. 

The write workload still seems normal, but not quite the bell curve we expect from normal distributions. Writes fares better than the reads, with the worse case write taking slightly more than 100 ms.

Device Queues and the ZIO Scheduler

Now that we understand how HDDs react to concurrent workloads, we can revisit the ZIO scheduler. The resilver workload is typically described as one or more reads for each write:
  • Resilver I/Os are constantly running in the background, injecting their workload into the the ZIO pipeline along with the "I need it now!" workload
  • Once issued to the device, the ZIO scheduler cannot adjust the priority
  • At the device, more concurrent I/Os leads to longer latency, with extreme latency 10x longer than the single-I/O case
To prevent normal workload I/Os from getting stuck behind resilver I/Os after being issued to the device, we need to reduce the depth of the queue at the device by reducing zfs_vdev_max_pending. The perfect value is based on the expected workload. For latency-sensitive workloads, a smaller number results in more consistency and less variance. For bandwidth-intensive workloads, having a few more I/Os can help.

To quiet my grumbling, here are my rules-of-thumb for zfs_vdev_max_pending:

1 = the lower bound, but exposes all of the serialization in the stack
2 = a good alternative to 1 for HDDs -- offers the disk an opportunity to optimize, but doesn't overwhelm
3 = not a bad choice
4 = upper bound for HDDs, IMHO
16 = good for many Flash SSDs
30+ = good for RAID arrays

Herein lies the dilemma: HDDs like fewer concurrent I/Os, but SSDs like more concurrent I/Os, especially for write-intensive workloads. But there is only one tunable for all disks in all pools on the system. Fortunately, you can change this on the fly, without rebooting, and measure the impact on your workload. For Solaris or older illumos-based distributions, this example sets the value to 2:

echo zfs_vdev_max_pending/W0t2 | mdb -kw

to return to the modern default:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

The effects will take place almost immediately. Monitor the latency changes using iostat -x or another disk latency measure. For direct measurements of all I/Os, as shown previously, consider using iosnoop and a stats package.

In some storage appliances based on ZFS, there are options for setting this value. Consult your documentation for further information. In some cases, the term zfs_vdev_max_pending might be disguised as another name, so look for a tunable related to the number of concurrent I/Os sent to disks.

Understanding queues and their impact on constrained resources can be tricky. Not all of the behaviors follow normal distributions and simple math. Measuring the latency of constrained resources while varying the workload allows you to make adjustments, improving the overall performance of the system. Tuning ZFS device queues is easier when using my tips and proven techniques.

May your traffic jams be short, may you be in the front of the line at the next concert, and may all of your I/O have low latency variance!

No comments:

Post a Comment