Monday, September 24, 2012

Designing ZFS at Datacenter Scale

Here is the abstract for my talk next week at zfsday!

The ZFS hybrid storage pool model is very flexible and allows many different combinations of storage technology to be used. This presents a dilemma to the systems architect: what is the best way to build and configure a pool to meet business requirements? We'll discuss modeling ZFS systems and hybrid storage pools at a datacenter scale. The models consider space, performance, dependability, and cost of the storage devices and any interconnecting networks (including SANs). We will also discuss methods for measuring the performance of the system.

I hope to see your smiling face in the audience, or you can register to see the live, streaming video. Sign up today!

Friday, September 21, 2012

OmniTI adds weight behind illumos/ZFS day

Theo and the crew at OmniTI have added their support to the illumos/ZFS day event in San Francisco next month. Thanks guys! We look forward to hearing more about the OmniOS distribution!

Tuesday, September 18, 2012

ZFS day is coming soon!

We are hosting illumos and ZFS day events in San Francisco October 1 - 3, 2012. Our good friends from DDRdrive, Delphix, Joyent, and Nexenta are also sponsoring the event. I will be talking about how to optimize the design of ZFS-based systems and explain how to get the best bang for your buck. Jason and Garrett are also on the speakers list, talking about how illumos has really taken hold as a foundation for building modern businesses.

For all of the Solaris fans and alumni, there is also a Solaris Family Reunion on Monday evening.

There is no cover charge, just register at and join us.

We look forward to seeing you there... even if you have to sneak out of a boring Oracle Open World session!

Wednesday, September 12, 2012

cifssvrtop updated

When I originally wrote cifssvrtop (top for CIFS servers), all of the systems I tested with had one thing in common: the workstations (clients) had names. Interestingly, I recently found a case where the workstations are not named, so the results were less useful than normal.

2012 Sep 11 23:50:48, load: 3.11, read: 0        KB, write: 176448   KB
Client          CIFSOPS   Reads  Writes   Rd_bw   Wr_bw    Rd_t    Wr_t  Align%
                   3391       0    3033       0  192408       0      85     100
all                3391       0    3033       0  192408       0      85     100

In this case, there are supposed to be 5 clients. But none have workstation names, so they all get lumped together under "". 

The fix is, of course, easy and obvious: make an option to discern clients by IPv4 address instead of workstation name. This is more consistent with nfssvrtop and iscsisvrtop, a good thing. Now the output looks like:

2012 Sep 12 19:52:23, load: 2.50, read: 0        KB, write: 1766632  KB
Client          CIFSOPS   Reads  Writes   Rd_bw   Wr_bw    Rd_t    Wr_t  Align%        452       0     441       0   27984       0     108     100        488       0     473       0   30072       0     101     100        505       0     490       0   31068       0     849      99        625       0     614       0   38979       0    2710      99        792       0     773       0   49002       0    4548      99
all                2864       0    2792       0  177106       0    2030      99

Here we can clearly see the clients separated by IPv4 address. The sorting is by CIFSOPS, which is the easiest way to deal with dtrace aggregations.

To implement this change, I added a new "-w" flag that will print the workstation name instead of the IPv4 address. If you prefer the previous defaults, then feel free to fork it on github.

I've updated the cifssvrtop sources in github, check it out. The code has details, the "-h" option shows usage, and there is a PDF presentation to accompany the top tools there. Finally, feedback and bug reports are always welcome!

Friday, August 10, 2012

Hello DEY Storage Systems!

Many of my friends have been asking where I've been lately and why they haven't seen me lurking around in the usual haunts. In January of this year, Jason Yoho, Garrett D'Amore, and I started a new company, DEY Storage Systems. I'm the E. And I'll take full responsibility for the uncleverness of the name, though it does lead to some fun puns (where are you when I need you, Simonton?)
We're currently heads-down, working hard on building the product. We've got terrific backing from some truly exceptional entrepreneurs, a cadre of experienced advisors, a great vision, and innovative ideas. We are creating some really cool stuff, and I'm eagerly awaiting the first product launch.

One last thought, we're hiring too!

Saturday, April 21, 2012

Performability Analysis

Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, an updated version of an old blog post on performability, I'll show one of the methods we use for performability analysis.

But before we dive into the analysis method, I need to introduce you to a scary word, DEGRADED. A simple operating definition of degraded is, something isn't working nominally. For large systems with many parts, such as the Internet, there is also a large number of components that are, at any given time, not operating nominally. The beauty of the Internet is that it was designed to continue to work while being degraded. Degraded is very different than being UP or DOWN. We often describe the availability of a system as the ratio of time UP versus total time, expressed as a percentage of a time interval, with 99.999% (five-nines) being the gold standard. Availability analysis is totally useless when describing how large systems work. As a systems engineer or architect, operation in the degraded mode is where all of the interesting work occurs.

With performability analysis, we pay close attention to how the system works when degraded so that we can improve the design and decrease the frequency, duration, or impact of operating in degraded mode. As more people are building or using cloudy systems, you can see how the move from system design focused on outage frequency (MTBF) and duration (MTTR) can be very different in design and economics than a system designed to reduce the impact (performability) of degradation.

We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a RAID storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent I/O distribution, size, and a desired performance metric expressed in IOPS. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.

Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. We bring the system to a steady state and then inject a fault into one of the controllers.
This analysis is interesting for several different reasons. We see that when the fault was injected, there was a short period where the array serviced no I/O operations. Once the fault was isolated, then a recovery phase was started during which the array was operating at approximately half of its peak performance. Once recovery was completed, the performance returned to normal, even though the system remains in the degraded state. Next we repaired the fault. After the system reconfigured itself, performance returned to normal and the system is operating nominally. 

You'll note that during the post-repair reconfiguration the array stopped servicing I/O operations and this outage was longer than the outage in the original fault. Sometimes, a trade-off is made such that the impact of the unscheduled fault is minimized at the expense of the repair activity. This is usually a good trade-off because the repair activity is usually a scheduled event, so we can limit the impact via procedures and planning. If you have ever waited for a file system check (fsck or chkdsk) to finish when booting a system, then you've felt the impact of such decisions and understand why modern file systems have attempted to minimize the performance costs of fsck, or eliminated the need for fsck altogether.

Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault. The astute operations expert will notice that this simplifying assumption is not appropriate for the well-managed systems, where even better performability is possible.

If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. We do gain some reduction of the test matrix because we know that some components (eg most, but not all, power supplies) have no impact on performance when they fail.

Once we know how the system performs while degraded, we can build a Markov model that can be used to examine trade-offs and design decisions. Solving the performability Markov model provides us with the average staying time per year in each of the states.

So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a performability graph or P-Graph for short. Here is an example of a performability graph showing the results for three different RAID array configurations.

We usually label availability targets across the top as an alternate X-axis label because many people are more comfortable with availability targets represented as "nines" than seconds or minutes. In order to show the typically small staying time, we use a log scale on the X-axis. The Y-axis shows the performance metric. I refer to the system's performability curve as a performability envelope because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.

In the example above, there are 3 products: A, B, and C. Each has a different performance capacity, redundancy, and cost. As much as engineers enjoy optimizing for performance or availability, we cannot dismiss the actual cost of the system. With performability analysis, we can help determine if a lower-cost system that tolerates degradation is better than a higher-cost system that delivers less downtime.

Suppose you have a requirement for an array that delivers 1,500 IOPS with "four-nines" availability. You can see from the performability graph that Product A and C can deliver 1,500 IOPS, Product C can deliver "four-nines" availability, but only Product A can deliver both 1,500 IOPS and "four-nines" availability. To help you understand the composition of the graph, I colored some of the states which have longer staying times.

You can see that some of the failure states have little impact on performance, whereas others will have a significant impact on performance. You can also clearly see that this system is expected to operate in a degraded mode for approximately 2 hours per year, on average. While degraded, the performance can be the same as the nominal system. See, degraded isn't really such a bad word, it is just a fact of life, or good comedy as in the case of the Black Knight.

For this array, when a power supply/battery unit fails, the write cache is placed in write through mode, which has a significant performance impact. Also, when a disk fails and is being reconstructed, the overall performance is impacted. Now we have a clearer picture of what performance we can expect from this array per year.

This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...

Monday, April 16, 2012

Latency and I/O Size: Cars vs Trains

A legacy view of system performance is that bigger I/O is better than smaller I/O. This has led many to worry about things like "jumbo" frames for Ethernet or setting the maximum I/O size for SANs. Is this worry justified? Let's take a look...

This post is the second in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.

In this experiment, the latency and bandwidth of random NFS writes is examined. Conventional wisdom says, jumbo frames and large I/Os is better than default frame size or small I/Os. If that is the case, then we expect to see a correlation between I/O size and latency. Remember, latency is what we care about for performance, not operations per second (OPS). The test case is a typical VM workload where the client is generating lots of small random write I/Os, as generated by the iozone benchmark. The operations are measured at the NFS server along with their size, internal latency, and bandwidth. The internal latency is the time required for the NFS server to respond to the NFS operation request. The NFS client will see the internal latency plus the transport latency.

If the large I/O theory holds, we expect that we will see better performance with larger I/Os. By default, the NFSv3 I/O size for the server and client in this case is 1MB. It can be tuned to something smaller, so for comparison, we also measured when the I/O size was 32KB (the NFSv2 default).

Toss the results into JMP and we get this nice chart that shows two consecutive iozone benchmark runs - the first with NFS I/O size limited to 32KB, the second with NFS I/O size the default 1MB:

The results are not as expected. What is expected is that the larger I/Os are more efficient and therefore offer better effective bandwidth while reducing overall latency. What we see is that we get higher bandwidth and significantly lower latency with the smaller I/O size! The small I/O size configuration on the left clearly outperforms the same system using large I/O sizes.

The way I like to describe this is using the cars vs trains analogy. Trains are much more efficient at moving people from one place to another. Hundreds or thousands of people can be carried on a train at high speed (except in the US, where high speed trains are unknown, but that is a different topic). By contrast cars can carry only a few people at a time, but can move about without regard to the train schedules and without having to wait as hundreds of people load or unload from the train. On the other hand, if a car and train approach a crossing at the same time, the car must wait for the train to pass. And that can take some time. The same thing happens on a network where small packets must wait until large packets pass through the interface. Hence, there is no correlation between the size of the packets and how quickly they move through the network because when large packets are moving, the small packets can be blocked - cars wait at the crossing for the train to pass.

This notion leads to a design choice that is counter to the conventional wisdom. To improve overall performance of the system, smaller I/O sizes can be better. As usual, for performance issues, there are many factors involved in performance constraints, but consider that there can be positive improvement when the I/O sizes are more like cars than trains.