Saturday, April 21, 2012

Performability Analysis

Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, an updated version of an old blog post on performability, I'll show one of the methods we use for performability analysis.

But before we dive into the analysis method, I need to introduce you to a scary word, DEGRADED. A simple operating definition of degraded is, something isn't working nominally. For large systems with many parts, such as the Internet, there is also a large number of components that are, at any given time, not operating nominally. The beauty of the Internet is that it was designed to continue to work while being degraded. Degraded is very different than being UP or DOWN. We often describe the availability of a system as the ratio of time UP versus total time, expressed as a percentage of a time interval, with 99.999% (five-nines) being the gold standard. Availability analysis is totally useless when describing how large systems work. As a systems engineer or architect, operation in the degraded mode is where all of the interesting work occurs.

With performability analysis, we pay close attention to how the system works when degraded so that we can improve the design and decrease the frequency, duration, or impact of operating in degraded mode. As more people are building or using cloudy systems, you can see how the move from system design focused on outage frequency (MTBF) and duration (MTTR) can be very different in design and economics than a system designed to reduce the impact (performability) of degradation.

We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a RAID storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent I/O distribution, size, and a desired performance metric expressed in IOPS. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.

Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. We bring the system to a steady state and then inject a fault into one of the controllers.
This analysis is interesting for several different reasons. We see that when the fault was injected, there was a short period where the array serviced no I/O operations. Once the fault was isolated, then a recovery phase was started during which the array was operating at approximately half of its peak performance. Once recovery was completed, the performance returned to normal, even though the system remains in the degraded state. Next we repaired the fault. After the system reconfigured itself, performance returned to normal and the system is operating nominally. 

You'll note that during the post-repair reconfiguration the array stopped servicing I/O operations and this outage was longer than the outage in the original fault. Sometimes, a trade-off is made such that the impact of the unscheduled fault is minimized at the expense of the repair activity. This is usually a good trade-off because the repair activity is usually a scheduled event, so we can limit the impact via procedures and planning. If you have ever waited for a file system check (fsck or chkdsk) to finish when booting a system, then you've felt the impact of such decisions and understand why modern file systems have attempted to minimize the performance costs of fsck, or eliminated the need for fsck altogether.

Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault. The astute operations expert will notice that this simplifying assumption is not appropriate for the well-managed systems, where even better performability is possible.

If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. We do gain some reduction of the test matrix because we know that some components (eg most, but not all, power supplies) have no impact on performance when they fail.

Once we know how the system performs while degraded, we can build a Markov model that can be used to examine trade-offs and design decisions. Solving the performability Markov model provides us with the average staying time per year in each of the states.

So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a performability graph or P-Graph for short. Here is an example of a performability graph showing the results for three different RAID array configurations.

We usually label availability targets across the top as an alternate X-axis label because many people are more comfortable with availability targets represented as "nines" than seconds or minutes. In order to show the typically small staying time, we use a log scale on the X-axis. The Y-axis shows the performance metric. I refer to the system's performability curve as a performability envelope because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.

In the example above, there are 3 products: A, B, and C. Each has a different performance capacity, redundancy, and cost. As much as engineers enjoy optimizing for performance or availability, we cannot dismiss the actual cost of the system. With performability analysis, we can help determine if a lower-cost system that tolerates degradation is better than a higher-cost system that delivers less downtime.

Suppose you have a requirement for an array that delivers 1,500 IOPS with "four-nines" availability. You can see from the performability graph that Product A and C can deliver 1,500 IOPS, Product C can deliver "four-nines" availability, but only Product A can deliver both 1,500 IOPS and "four-nines" availability. To help you understand the composition of the graph, I colored some of the states which have longer staying times.

You can see that some of the failure states have little impact on performance, whereas others will have a significant impact on performance. You can also clearly see that this system is expected to operate in a degraded mode for approximately 2 hours per year, on average. While degraded, the performance can be the same as the nominal system. See, degraded isn't really such a bad word, it is just a fact of life, or good comedy as in the case of the Black Knight.

For this array, when a power supply/battery unit fails, the write cache is placed in write through mode, which has a significant performance impact. Also, when a disk fails and is being reconstructed, the overall performance is impacted. Now we have a clearer picture of what performance we can expect from this array per year.

This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...

Monday, April 16, 2012

Latency and I/O Size: Cars vs Trains

A legacy view of system performance is that bigger I/O is better than smaller I/O. This has led many to worry about things like "jumbo" frames for Ethernet or setting the maximum I/O size for SANs. Is this worry justified? Let's take a look...

This post is the second in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.

In this experiment, the latency and bandwidth of random NFS writes is examined. Conventional wisdom says, jumbo frames and large I/Os is better than default frame size or small I/Os. If that is the case, then we expect to see a correlation between I/O size and latency. Remember, latency is what we care about for performance, not operations per second (OPS). The test case is a typical VM workload where the client is generating lots of small random write I/Os, as generated by the iozone benchmark. The operations are measured at the NFS server along with their size, internal latency, and bandwidth. The internal latency is the time required for the NFS server to respond to the NFS operation request. The NFS client will see the internal latency plus the transport latency.

If the large I/O theory holds, we expect that we will see better performance with larger I/Os. By default, the NFSv3 I/O size for the server and client in this case is 1MB. It can be tuned to something smaller, so for comparison, we also measured when the I/O size was 32KB (the NFSv2 default).

Toss the results into JMP and we get this nice chart that shows two consecutive iozone benchmark runs - the first with NFS I/O size limited to 32KB, the second with NFS I/O size the default 1MB:

The results are not as expected. What is expected is that the larger I/Os are more efficient and therefore offer better effective bandwidth while reducing overall latency. What we see is that we get higher bandwidth and significantly lower latency with the smaller I/O size! The small I/O size configuration on the left clearly outperforms the same system using large I/O sizes.

The way I like to describe this is using the cars vs trains analogy. Trains are much more efficient at moving people from one place to another. Hundreds or thousands of people can be carried on a train at high speed (except in the US, where high speed trains are unknown, but that is a different topic). By contrast cars can carry only a few people at a time, but can move about without regard to the train schedules and without having to wait as hundreds of people load or unload from the train. On the other hand, if a car and train approach a crossing at the same time, the car must wait for the train to pass. And that can take some time. The same thing happens on a network where small packets must wait until large packets pass through the interface. Hence, there is no correlation between the size of the packets and how quickly they move through the network because when large packets are moving, the small packets can be blocked - cars wait at the crossing for the train to pass.

This notion leads to a design choice that is counter to the conventional wisdom. To improve overall performance of the system, smaller I/O sizes can be better. As usual, for performance issues, there are many factors involved in performance constraints, but consider that there can be positive improvement when the I/O sizes are more like cars than trains.

Sunday, April 8, 2012

DTrace Conference Aura Chart Demo

In case you missed the DTrace conference on April 3, 2012, Dierdre recorded all of the sessions and is publishing the videos. I had a few minutes to discuss the Aura Graph work that was demonstrated in Nexenta's booth at VMworld 2011. The short video explains what we were visualizing and why it is useful for operators.