Sunday, January 10, 2016

Observing cache hit/miss rates

At InterModal Data we build large systems with many components running in highly available configurations 24x7x365. For such systems, understanding how the components are working is very important. Our analytics system measures and records thousands of metrics from all components and makes these measurements readily available for performance analysis, capacity planning, and trouble shooting. Alas, having access to the records of hundreds of thousands of metrics is not enough, we need good, concise methods of showing that in meaningful ways. In this post, we'll look at the cache hit/miss data for a storage system and a few methods of observing the data.

In general, caches exist to optimize the cost vs performance of a system. For storage systems in particular, we often see RAM working as cache for drives. Drives are slow, relatively inexpensive ($/bit) and persistently store data even when powered off. By contrast, RAM is fast, relatively expensive, and volatile. Component and systems designers balance the relatively high cost of RAM against the lower cost of drives while managing performance and volatility. For the large systems we design at InterModal Data, the cache designs are very important to overall system scalability and performance.

Once we have a cache in the system, we're always interested to know how well it is working. If over-designed, adding expensive caches just raises the system cost, adding little benefit. One metric often used for this analysis is the cache hit/miss ratio. Hits are good, misses are bad. But it is impossible to always have 100% hits when volatile RAM is used. We can easily plot this over time as our workload varies.

In the following graphs, the data backing the graph is identical. The workload varies over approximately 30 hours.

Traditionally, this is tracked as the hit/miss ratio easily represented as a ratio.

Here we see lots of hits (green = good) with a few cases where the misses (red = bad) seem to rear their ugly heads. Should we be worried? We can't really tell from this graph because there is only the ratio, no magnitude. Perhaps the system is really idle and a handful of misses are measured. When presented with only hit/miss ratio, it is impractical to make any analysis, the magnitude is also needed. Many analysis systems then show you the magnitudes stacked as below.

In this view, the number of accesses are the top of the stacked lines. Under each access point we see the ratio of hits/misses expressed as magnitude. This is better than the ratio graph. Now we can see that the magnitudes are changing from a few thousand accesses/second to approximately 170,000 accesses/second. We can also see that there were times where we saw misses, but during those times the number of accesses was relatively small. If the ratio graph caused some concern, this graph removes almost all of that concern.

However, in this graph we also lose the ability to discern the hit/miss ratio because of the stacking. Consider if we had two or more levels of cache and wanted to see the overall cache effectiveness, we could quickly lose the details in the stacking.

Recall that hits are good (green) and misses are bad (red). Also consider that Wall Street has trained us to like graphs that go "up and to the right" (good). We can use this to our advantage and more easily separate the good from the bad.

Here we've graphed the misses as negative values. Hits go up to the top and are green (all good things). Misses go down and are red (all bad things). The number of accesses is the spread between the good and the bad, so as the spread increases, more work is being asked of the system. In this case we can still see that the cache misses are a relatively small portion of the overall access and, more importantly, occur early in time. As time progresses the hit ratio and accesses both increase for this workload. This is a much better view of the data.

Here is another example of the SSD read cache for this same experiment. First, the hit/miss ratio graph.

If this was the only view you see, you should be horrified: too much red and red is bad! Don't panic.

This graph clearly shows the story in the appropriate context. There are some misses and hits, but the overall magnitude is very low, especially when compared to the RAM cache graph of the same system. No need to panic, the SSD cache is doing its job, though it is not especially busy compared to the RAM cache.

This method scales to multiple cache levels and systems -- very useful for the large, scalable systems we design at InterModal Data.