Saturday, April 21, 2012

Performability Analysis

Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, an updated version of an old blog post on performability, I'll show one of the methods we use for performability analysis.

But before we dive into the analysis method, I need to introduce you to a scary word, DEGRADED. A simple operating definition of degraded is, something isn't working nominally. For large systems with many parts, such as the Internet, there is also a large number of components that are, at any given time, not operating nominally. The beauty of the Internet is that it was designed to continue to work while being degraded. Degraded is very different than being UP or DOWN. We often describe the availability of a system as the ratio of time UP versus total time, expressed as a percentage of a time interval, with 99.999% (five-nines) being the gold standard. Availability analysis is totally useless when describing how large systems work. As a systems engineer or architect, operation in the degraded mode is where all of the interesting work occurs.

With performability analysis, we pay close attention to how the system works when degraded so that we can improve the design and decrease the frequency, duration, or impact of operating in degraded mode. As more people are building or using cloudy systems, you can see how the move from system design focused on outage frequency (MTBF) and duration (MTTR) can be very different in design and economics than a system designed to reduce the impact (performability) of degradation.

We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a RAID storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent I/O distribution, size, and a desired performance metric expressed in IOPS. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.

Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. We bring the system to a steady state and then inject a fault into one of the controllers.
This analysis is interesting for several different reasons. We see that when the fault was injected, there was a short period where the array serviced no I/O operations. Once the fault was isolated, then a recovery phase was started during which the array was operating at approximately half of its peak performance. Once recovery was completed, the performance returned to normal, even though the system remains in the degraded state. Next we repaired the fault. After the system reconfigured itself, performance returned to normal and the system is operating nominally. 

You'll note that during the post-repair reconfiguration the array stopped servicing I/O operations and this outage was longer than the outage in the original fault. Sometimes, a trade-off is made such that the impact of the unscheduled fault is minimized at the expense of the repair activity. This is usually a good trade-off because the repair activity is usually a scheduled event, so we can limit the impact via procedures and planning. If you have ever waited for a file system check (fsck or chkdsk) to finish when booting a system, then you've felt the impact of such decisions and understand why modern file systems have attempted to minimize the performance costs of fsck, or eliminated the need for fsck altogether.

Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault. The astute operations expert will notice that this simplifying assumption is not appropriate for the well-managed systems, where even better performability is possible.

If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. We do gain some reduction of the test matrix because we know that some components (eg most, but not all, power supplies) have no impact on performance when they fail.

Once we know how the system performs while degraded, we can build a Markov model that can be used to examine trade-offs and design decisions. Solving the performability Markov model provides us with the average staying time per year in each of the states.

So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a performability graph or P-Graph for short. Here is an example of a performability graph showing the results for three different RAID array configurations.

We usually label availability targets across the top as an alternate X-axis label because many people are more comfortable with availability targets represented as "nines" than seconds or minutes. In order to show the typically small staying time, we use a log scale on the X-axis. The Y-axis shows the performance metric. I refer to the system's performability curve as a performability envelope because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.

In the example above, there are 3 products: A, B, and C. Each has a different performance capacity, redundancy, and cost. As much as engineers enjoy optimizing for performance or availability, we cannot dismiss the actual cost of the system. With performability analysis, we can help determine if a lower-cost system that tolerates degradation is better than a higher-cost system that delivers less downtime.

Suppose you have a requirement for an array that delivers 1,500 IOPS with "four-nines" availability. You can see from the performability graph that Product A and C can deliver 1,500 IOPS, Product C can deliver "four-nines" availability, but only Product A can deliver both 1,500 IOPS and "four-nines" availability. To help you understand the composition of the graph, I colored some of the states which have longer staying times.

You can see that some of the failure states have little impact on performance, whereas others will have a significant impact on performance. You can also clearly see that this system is expected to operate in a degraded mode for approximately 2 hours per year, on average. While degraded, the performance can be the same as the nominal system. See, degraded isn't really such a bad word, it is just a fact of life, or good comedy as in the case of the Black Knight.

For this array, when a power supply/battery unit fails, the write cache is placed in write through mode, which has a significant performance impact. Also, when a disk fails and is being reconstructed, the overall performance is impacted. Now we have a clearer picture of what performance we can expect from this array per year.

This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...

No comments:

Post a Comment