Friday, July 15, 2016

Happy 10th Birthday Snapshot!

On this date, 10 years ago, I made my first ZFS snapshot.

# zfs get creation stuff/home@20060715 
NAME                 PROPERTY  VALUE                  SOURCE
stuff/home@20060715  creation  Sat Jul 15 20:20 2006  -
The stuff pool itself has changed dramatically over the decade. Originally, it was a spare 40G IDE drive I had lying around the hamshack. Today it is a mirror of 4T/6T drives from different vendors, for diversity. Over the years the pool has been upgraded, expanded, and had its drives replaced numerous times. This is a true testament to the long-term planning and management built into OpenZFS.

The original size of the stuff/home file system was 9GB. Today, it is 1.6TB, which I'll blame mostly on backups of media files. Ten years ago I had a 1.6-megapixel camera; today it is 16 megapixels plus HD video and phone cameras.

What was I working on back then?  SATA controllers, Sun X4500, ROARS Field Day, ...

Wednesday, June 1, 2016

As we're getting ready for summer at the ranch...

Lion's tails reach for the sky!

Tuesday, May 3, 2016

Observing Failover of Busy Pools

While running failover tests under load, we can easily see the system-level effects of a failover in a single chart.

But first, some background. At InterModal Data, we've built an easy-to-manage system of many nodes that can provide scalable NFS and iSCSI shares into the petabyte range. This software-defined storage system scales nicely with great hardware, such as the HPE systems shown here. An important part of the system is the site-wide analytics, where we measure many aspects of performance, usage, and environmental data. This data from both clients and servers is stored in an influxdb time-series database for analysis.

For this test, the NFS clients are throwing a sustained, mixed read/write, mixed size, mixed randomness workload at multiple shares on two pools (p115 and p116) served by two Data Nodes (sut115 and sut116). At the start of the sample, both pools are served by sut116. At 01:34 pool p115 is migrated (failed over, in cluster terminology) to sut115. The samples are taken every minute, but the actual failover time for pool p115 is 11 seconds under an initial load of 11.5k VOPS (VFS layer operations per second). After the migration, both Data Nodes are serving the workload, so the per-pool performance increases to 16.5k VOPS. The system changes from serving an aggregate of 23k VOPS to 33k VOPS -- a valid reason for re-balancing the load.
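For the curious, the aggregate VOPS is just the sum of the per-pool VOPS samples. Below is a minimal sketch of pulling those samples out of the time-series database; the measurement name (vfs_ops), field (ops_per_sec), and tag (pool) are hypothetical stand-ins, not the actual schema of our analytics system.

# Query per-pool VOPS from InfluxDB and sum them into an aggregate.
# Measurement, field, and tag names are hypothetical examples.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="analytics", port=8086, database="perf")
query = ("SELECT mean(ops_per_sec) FROM vfs_ops "
         "WHERE time > now() - 1h GROUP BY time(1m), pool")
result = client.query(query)

# Sum the per-pool means at each sample time to get the aggregate VOPS.
aggregate = {}
for (name, tags), points in result.items():
    for p in points:
        aggregate[p["time"]] = aggregate.get(p["time"], 0) + (p["mean"] or 0)

for t in sorted(aggregate):
    print(t, round(aggregate[t]))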

Sunday, January 10, 2016

Observing cache hit/miss rates

At InterModal Data we build large systems with many components running in highly available configurations 24x7x365. For such systems, understanding how the components are working is very important. Our analytics system measures and records thousands of metrics from all components and makes these measurements readily available for performance analysis, capacity planning, and troubleshooting. Alas, having access to the records of hundreds of thousands of metrics is not enough; we need good, concise methods of presenting that data in meaningful ways. In this post, we'll look at the cache hit/miss data for a storage system and a few methods of observing the data.

In general, caches exist to optimize the cost vs performance of a system. For storage systems in particular, we often see RAM working as cache for drives. Drives are slow, relatively inexpensive ($/bit) and persistently store data even when powered off. By contrast, RAM is fast, relatively expensive, and volatile. Component and systems designers balance the relatively high cost of RAM against the lower cost of drives while managing performance and volatility. For the large systems we design at InterModal Data, the cache designs are very important to overall system scalability and performance.

Once we have a cache in the system, we're always interested to know how well it is working. An over-designed cache just raises the system cost while adding little benefit. One metric often used for this analysis is the cache hit/miss ratio. Hits are good, misses are bad. But it is impossible to always have 100% hits when volatile RAM is used. We can easily plot this ratio over time as our workload varies.

In the following graphs, the data backing each graph is identical; the workload varies over approximately 30 hours.

Traditionally, this is tracked as the hit/miss ratio, easily represented as a percentage over time.


Here we see lots of hits (green = good) with a few cases where the misses (red = bad) seem to rear their ugly heads. Should we be worried? We can't really tell from this graph because there is only the ratio, no magnitude. Perhaps the system is nearly idle and only a handful of misses are measured. When presented with only the hit/miss ratio, it is impractical to make any analysis; the magnitude is also needed. Many analysis systems then show you the magnitudes stacked, as below.
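Before looking at the stacked view, here is a trivial numeric illustration (with made-up counts) of why the ratio alone can mislead:

# Two samples with identical hit ratios but very different magnitudes.
samples = [
    {"hits": 99, "misses": 1},        # nearly idle system
    {"hits": 99000, "misses": 1000},  # busy system
]
for s in samples:
    accesses = s["hits"] + s["misses"]
    print(f"accesses={accesses:6d}  hit ratio={s['hits'] / accesses:.1%}")
# Both report a 99.0% hit ratio; only the access count tells them apart.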


In this view, the number of accesses is the top of the stacked lines. Under each access point we see the ratio of hits to misses expressed as magnitudes. This is better than the ratio graph. Now we can see that the magnitudes change from a few thousand accesses/second to approximately 170,000 accesses/second. We can also see that there were times where we saw misses, but during those times the number of accesses was relatively small. If the ratio graph caused some concern, this graph removes almost all of that concern.

However, in this graph we also lose the ability to discern the hit/miss ratio because of the stacking. If we had two or more levels of cache and wanted to see the overall cache effectiveness, we could quickly lose the details in the stacking.

Recall that hits are good (green) and misses are bad (red). Also consider that Wall Street has trained us to like graphs that go "up and to the right" (good). We can use this to our advantage and more easily separate the good from the bad.


Here we've graphed the misses as negative values. Hits go up toward the top and are green (all good things). Misses go down and are red (all bad things). The number of accesses is the spread between the good and the bad, so as the spread increases, more work is being asked of the system. In this case we can still see that the cache misses are a relatively small portion of the overall accesses and, more importantly, occur early in time. As time progresses, the hit ratio and accesses both increase for this workload. This is a much better view of the data.
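If you want to reproduce this style of chart, here is a minimal matplotlib sketch; the sample data is made up and the variable names are mine, not from our analytics system.

# Plot cache hits as positive values (green) and misses as negative values
# (red); the spread between the two is the total number of accesses.
import matplotlib.pyplot as plt

# Made-up per-sample counts standing in for real analytics data.
hits   = [2000, 5000, 20000, 80000, 150000, 165000]
misses = [1500, 3000,  4000,  3000,   2000,   1000]

times = range(len(hits))
plt.fill_between(times, hits, color="green", label="hits")
plt.fill_between(times, [-m for m in misses], color="red", label="misses")
plt.axhline(0, color="black", linewidth=0.5)
plt.xlabel("sample")
plt.ylabel("accesses per second")
plt.legend()
plt.show()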

Here is another example of the SSD read cache for this same experiment. First, the hit/miss ratio graph.

If this were the only view you saw, you might be horrified: too much red, and red is bad! Don't panic.


This graph clearly shows the story in the appropriate context. There are some misses and hits, but the overall magnitude is very low, especially when compared to the RAM cache graph of the same system. No need to panic, the SSD cache is doing its job, though it is not especially busy compared to the RAM cache.

This method scales to multiple cache levels and systems -- very useful for the large, scalable systems we design at InterModal Data.

Wednesday, April 15, 2015

Spring at the Ranch

This spring is bringing new changes to the ranch. This vine surprised us with lavender-colored flowers. Other surprises include an early change in scenery, as the grasses and weeds have already reached their summer hue. Long-time readers of my blog might recognize the meaning of flowers... it's all good.

Saturday, October 4, 2014

On ZFS use at home...

The other day someone asked on the #zfs IRC (irc.freenode.net) chat about using ZFS at home. As one of the early adopters, I can say it is a great idea! I've been running ZFS at home since late 2005. The first pool of "stuff" I created has been upgraded, expanded, and had its drives replaced. In 2008 I created the latest version of "stuff" as a simple mirrored pair of HDDs. The prior version of "stuff" was transferred to the 2008 pool, which is still in use. Therefore, I do not have the actual creation date of the original "stuff," but since I used ZFS send/receive to transfer the datasets, I can definitively say the oldest snapshot was created in July 2006.

# zfs get creation stuff/home@20060715
NAME                 PROPERTY  VALUE                  SOURCE
stuff/home@20060715  creation  Sat Jul 15 20:20 2006  -

I've made many snapshots since and it seems quite impressive to know that I can roll back in time over 8 years to see how my "stuff" has evolved. Let's hear it for long-lived data!

Tuesday, August 5, 2014

kstat changes in illumos

One of the nice changes to the kstat (kernel statistics) command in illumos is its conversion from perl to C. There were several areas in the illumos (née OpenSolaris) code where perl had been used, but these were too few to maintain critical mass, and it is difficult for interpreted runtimes to change at the pace of an OS, so keeping the two in lockstep is simply not worthwhile. As a result, the few places where parts of illumos were written in perl have been replaced by native C implementations.

The kstat(1m) command rewritten in C was contributed by David Höppner, an active member of the illumos community. It is fast and efficient at filtering and printing kstats. By contrast, the old perl version had to start perl (an interpreter), find and load the kstat-to-perl module, and then filter and print the kstats. Internal to the kernel, kstats are stored as a name-value list (nvlist) containing strongly-typed data. Many of these are 64-bit integers. This poses a problem for the version of perl used (5.12), as its 64-bit support depends on how perl was compiled, and illumos can be compiled for both 32- and 64-bit processors. To compensate for this mismatch, the following was added to the man page for kstat(3perl):

Several of the statistics provided by the kstat facility are stored as 64-bit integer values. Perl 5 does not yet internally support 64-bit integers, so these values are approximated in this module. There are two classes of 64-bit value to be dealt with: 64-bit intervals and times, and 64-bit counters.

The intervals and times are the crtime and snaptime fields of all the statistics hashes, and the wtime, wlentime, wlastupdate, rtime, rlentime and rlastupdate fields of the kstat I/O statistics structures. These are measured by the kstat facility in nanoseconds, meaning that a 32-bit value would represent approximately 4 seconds. The alternative is to store the values as floating-point numbers, which offer approximately 53 bits of precision on present hardware. 64-bit intervals and timers are therefore stored as floating-point values expressed in seconds, meaning that time-related kstats are being rounded to approximately microsecond resolution.

For the 64-bit counters, it is not useful to store these values as 32-bit values. As noted above, floating-point values offer 53 bits of precision. Accordingly, all 64-bit counters are stored as floating-point values.
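To put those numbers in perspective: 2^32 nanoseconds is only about 4.3 seconds, while the 53-bit mantissa of a double can represent about 2^53 nanoseconds, or roughly 104 days, before nanosecond precision is lost. Floating point was a reasonable trade-off given perl's lack of native 64-bit integers.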

For consumers of the kstat(1m) command output, this means that the perl version reports kstat I/O statistics in seconds (floating point), while the new C version reports the raw values in nanoseconds. For example, with formatting adjusted for readability:
Perl-based kstat(1m)
# kstat -p sd:0:sd0
sd:0:sd0:class        disk
sd:0:sd0:crtime       1855326.99995062
sd:0:sd0:nread        380919301
sd:0:sd0:nwritten     1984175104
sd:0:sd0:rcnt         0
sd:0:sd0:reads        18455
sd:0:sd0:rlastupdate  2371703.49260763
sd:0:sd0:rlentime     147.154123471
sd:0:sd0:rtime        49.399890683
sd:0:sd0:snaptime     2371828.77138052
sd:0:sd0:wcnt         0
sd:0:sd0:wlastupdate  2371703.49174494
sd:0:sd0:wlentime     2.425675727
sd:0:sd0:writes       103602
sd:0:sd0:wtime        1.43643661

C-based kstat(1m)
# kstat -p sd:0:sd0
sd:0:sd0:class        disk
sd:0:sd0:crtime       244.271312204
sd:0:sd0:nread        25549493
sd:0:sd0:nwritten     1698218496
sd:0:sd0:rcnt         0
sd:0:sd0:reads        4043
sd:0:sd0:rlastupdate  104543293563241
sd:0:sd0:rlentime     68750036336
sd:0:sd0:rtime        64365048052
sd:0:sd0:snaptime     104509.092582653
sd:0:sd0:wcnt         0
sd:0:sd0:wlastupdate  104543293482995
sd:0:sd0:wlentime     4569934990
sd:0:sd0:writes       289766
sd:0:sd0:wtime        4551425719

I find kstat(1m) output to be very convenient for historical tracking and use it often. If you do too, then be aware of this conversion.
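If you have tooling that consumed the old seconds-based values, converting is just a divide by one billion for the time-related fields. Here is a minimal Python sketch using the field names from the sd:0:sd0 output above:

# Convert the nanosecond-based I/O time kstats reported by the C kstat(1m)
# back to the seconds the perl kstat(1m) used to report. Note that crtime
# and snaptime are already reported in seconds by the C version.
NS_PER_SEC = 1e9
TIME_FIELDS = {"wtime", "wlentime", "wlastupdate",
               "rtime", "rlentime", "rlastupdate"}

def to_seconds(field, value):
    return float(value) / NS_PER_SEC if field in TIME_FIELDS else float(value)

print(to_seconds("rtime", 64365048052))   # ~64.4 seconds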

One of the best features of the new C-based kstat(1m) is the ability to get kstats as JSON. This is even more useful than the "parseable" output shown previously.
# kstat -j sd:0:sd0
[{
  "module": "sd",
  "instance": 0,
  "name": "sd0",
  "class": "disk",
  "type": 3,
  "snaptime": 104547.013504692,
  "data": {
    "crtime": 244.271312204,
    "nread": 25549493,
    "nwritten": 1700980224,
    "rcnt": 0,
    "reads": 4043,
    "rlastupdate": 104733296813446,
    "rlentime": 68901598866,
    "rtime": 64513785819,
    "snaptime": 104547.013504692,
    "wcnt": 0,
    "wlastupdate": 104733296708770,
    "wlentime": 4579560895,
    "writes": 290404,
    "wtime": 4561051625
  }
}]

Using JSON has the added advantage of being easy to parse without making assumptions about the data. For example, did you know that some kernel modules use ':' in the kstat module, instance, name, or statistic? This makes using the parseable output a little bit tricky. The JSON output is not affected and is readily and consistently readable or storable in the many tools that support JSON.
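As a quick demonstration, a few lines of Python can turn the JSON output into something useful. This sketch computes the since-boot average time an I/O spent active on the device, roughly analogous to iostat's asvc_t (which is computed per interval rather than since boot):

# Parse `kstat -j sd:0:sd0` and compute the since-boot average time an I/O
# spent active on the device: rlentime (nanoseconds) / (reads + writes).
import json
import subprocess

out = subprocess.run(["kstat", "-j", "sd:0:sd0"],
                     capture_output=True, text=True, check=True).stdout
data = json.loads(out)[0]["data"]

avg_active_ms = data["rlentime"] / (data["reads"] + data["writes"]) / 1e6
print(f"average active time per I/O: {avg_active_ms:.3f} ms")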

Now you can see how to take advantage of the kstat(1m) command and how it has evolved under illumos to be friendlier for building tools and taking quick measurements. Go forth and kstat!