Monday, November 14, 2016

How Would You Use 10 Million IOPS?

This week we proudly announced a 2U storage server that delivers 10 million IOPS @ 4k over the network. If you are visiting Salt Lake City and the Supercomputing 2016 Conference, drop by our exhibition booth and take a look.

What I find more exciting than the IOPS (10 million) or the bandwidth (40 GB/sec) is that the latency penalty for going over the network is an impressive 7 µsec over Ethernet. Yes, microseconds! Over Ethernet! Amazing!

If you've never heard us discussing microseconds in the context of network storage, it is because disks were so slow we could only grumble about milliseconds. To give some perspective, a high-end SAS SSD has response times on the order of 50 - 100 µsec @ 4k. And if you run "iostat -x" to see latency on a typical Unix/Linux/OSX distro, you only get 10 µsec resolution today. This is truly a breakthrough in enabling technology for building large, scalable, and fast computing solutions.

So, how will you use 10 million IOPS?

Kudos to the Newisys/Sanmina and Kazan Networks teams for a job well done! Very impressive!

Sunday, August 28, 2016

On ZFS Copies

I tried to reply to the zfs copies=n post over at jrs-s.net (linked below), but the website wouldn't accept my reply, complaining about cookies and telling me to contact the site administrator. So I'll reply here. Internet hurrah!

For a single-device pool, the redundant copies are placed approximately 1/3 (copies=2) and 2/3 (copies=3) of the way into the LBA range of the single device. Assuming devices allocate with some diversity by LBA, this allows recovery from failures affecting a range of LBAs. For HDDs, think head-contacts-media types of failures. For the purely random failure case, you simply get random failures.

By contrast, if the pool has two top-level vdevs, such as a simple 2-drive stripe, then the copies are placed on separate drives, if possible. In this case, copies=2|3 provides protection more similar to mirroring, where the copies are on diverse devices. It is not identical to mirroring, because the pool itself depends on all top-level vdevs functioning. On the other hand, you can have different-sized devices, with some data diversely stored.

In summary, copies is useful for specifying different redundancy policies for datasets, but it is not a replacement for proper mirroring or raidz. For more background, see https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection (apologies: in the acquisition, the new regime blew away the image links) and http://jrs-s.net/2016/05/02/zfs-copies-equals-n/

For ZFS enthusiasts, you can see where the copies of blocks of your data are allocated using zdb's dataset option to show the data virtual addresses (DVAs) assigned to each copy. Here's how to do it.

1. First, create a test dataset with copies=2 and create a file with enough data to be interesting. Since we know the default recordsize is 128k, we'll write 2x128k, or two ZFS blocks.

# zfs create -o copies=2 zwimming/copies-example
# dd if=/dev/urandom of=/zwimming/copies-example/data bs=128k count=2
2+0 records in
2+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.0243667 s, 10.8 MB/s

2. Locate the object number of the file, cleverly the same as the inode number, 7 in this case.

# ls -li /zwimming/copies-example/data
7 -rw-r--r-- 1 root root 262144 Aug 28 15:20 /zwimming/copies-example/data

3. Ask zdb to show the dataset information with details about the block allocations for object 7 in dataset zwimming/copies-example.

# zdb -dddddd zwimming/copies-example 7
Dataset zwimming/copies-example [ZPL], ID 49, cr_txg 287, 537K, 7 objects, rootbp DVA[0]=<0:8000:200> DVA[1]=<0:300c200:200> DVA[2]=<0:600bc00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique triple size=800L/200P birth=291L/291P fill=7 cksum=dbfb763cf:52582f00d6d:fed1f02d1ce1:21f172427f4d22

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         7    2    16K   128K   514K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)

Here we verify that the logical size (lsize) is 256k and the on-disk size (dsize) is, nominally, 2x the logical size. Recall that we wrote random, non-compressible data, so there are no compression tricks here.

                                        168   bonus  System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED 
dnode maxblkid: 1
path /data

Verify that object 7 is our file named "data"

uid     0
gid     0
atime Sun Aug 28 15:20:43 2016
mtime Sun Aug 28 15:20:43 2016
ctime Sun Aug 28 15:20:43 2016
crtime Sun Aug 28 15:20:43 2016
gen 291
mode 100644
size 262144
parent 4
links 1
pflags 40800000004
Indirect blocks:
               0 L1  0:be00:200 0:300aa00:200 0:6003a00:200 4000L/200P F=2 B=291/291
               0  L0 0:24000:20000 0:3020800:20000 20000L/20000P F=1 B=291/291
           20000  L0 0:44000:20000 0:3040800:20000 20000L/20000P F=1 B=291/291

Here's the meat of the example. This file has one level-1 (L1) indirect block (metadata) with 3 DVAs. Why 3? Because, by default, the number of copies of the metadata is copies+1, up to a maximum of 3. With copies=2, the number of metadata copies is 3, hence the three DVAs. These DVAs consume 0x200 (512) physical bytes each, or 1.5k total. This explains why the accounting for the dsize above is 514k rather than 512k.
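
To sanity-check that accounting, here is a quick back-of-the-envelope calculation in Python (a sketch; the 0x200 indirect-block allocations are taken from the zdb output above):

    data_bytes = 2 * 2 * 128 * 1024   # 2 copies of the 2 x 128k data blocks
    metadata_bytes = 3 * 0x200        # 3 DVAs of 0x200 bytes each for the L1 indirect block
    total = data_bytes + metadata_bytes
    print(total, total / 1024.0)      # 525824 bytes, 513.5k -- reported by zdb as 514K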

Each DVA is a tuple of vdev-index:offset:size. Thus a DVA of 0:be00:200 is 512 bytes allocated to vdev-0 (there is only one vdev in this pool) at offset 0xbe00. You can see that the other two DVAs are offset further into the vdev, at 0x300aa00 and 0x6003a00. If this pool had more than one vdev, and there were enough space on them, then we would expect the copies to be spread across vdevs.

Looking at the two level-0 (L0) data blocks, we see our actual data. Each block is 128k (0x20000), and the logical size (20000L) is the same as the physical size (20000P), showing no compression. Again we see all blocks allocated to vdev-0, with the second copy offset 0x2FFC800, or 50,317,312 bytes (about 48 MiB), away from the first.
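
If you like to poke at these values programmatically, here is a minimal Python sketch (the DVA strings are copied from the zdb output above) that parses a DVA and computes the separation between the two copies of the first L0 block:

    def parse_dva(dva):
        # a zdb DVA prints as vdev:offset:size, with offset and size in hex bytes
        vdev, offset, size = dva.split(":")
        return int(vdev), int(offset, 16), int(size, 16)

    copy1 = parse_dva("0:24000:20000")    # first copy of the first L0 block
    copy2 = parse_dva("0:3020800:20000")  # second copy of the same block
    print(copy2[1] - copy1[1])            # 50317312 bytes of separation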

Referring back to the JRS Systems test, randomly corrupting data will give predictable results: simply calculate the probability of corrupting both copies of an L0 block of a given size within a given LBA range.
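
As a rough model (a sketch, not the exact methodology of that test): if a fraction p of the device's 512-byte sectors are corrupted at random, the chance that any given 128k block copy is touched is 1 - (1 - p)^256, and the chance that both copies are lost is roughly the square of that, assuming the two copies fail independently:

    def p_copy_hit(p_sector, block_sectors=256):
        # probability that at least one sector of a 128k (256-sector) block copy is corrupted
        return 1.0 - (1.0 - p_sector) ** block_sectors

    def p_block_lost(p_sector, copies=2):
        # probability that every copy of the block is corrupted, assuming independent placement
        return p_copy_hit(p_sector) ** copies

    for p in (1e-6, 1e-4, 1e-2):
        print(p, p_block_lost(p))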

But storage doesn't tend to fail randomly; failures tend to be spatially clustered. Thus copies is a reasonable redundancy technique even when the device itself is not redundant. Indeed, extra copies are routinely used for the precious metadata.

Friday, July 15, 2016

Happy 10th Birthday Snapshot!

On this date, 10 years ago, I made my first ZFS snapshot.

# zfs get creation stuff/home@20060715 
NAME                 PROPERTY  VALUE                  SOURCE
stuff/home@20060715  creation  Sat Jul 15 20:20 2006  -

The stuff pool itself has changed dramatically over the decade. Originally, it was a spare 40G IDE drive I had lying around the hamshack. Today it is a mirror of 4T/6T drives from different vendors, for diversity. Over the years the pool has been upgraded, expanded, and had its drives replaced numerous times. This is a true testament to the long-term planning and management built into OpenZFS.

The original size of the stuff/home file system was 9GB. Today, it is 1.6TB, which I'll blame mostly on backups of media files. Ten years ago I had a 1.6-megapixel camera; today it is 16 megapixels plus HD video and phone cameras.

What was I working on back then?  SATA controllers, Sun X4500, ROARS Field Day, ...

Tuesday, May 3, 2016

Observing Failover of Busy Pools

While looking at failover tests under load, we can easily see the system-level effects of failover in a single chart.

But first, some background. At InterModal Data, we've built an easy-to-manage system of many nodes that can provide scalable NFS and iSCSI shares in the petabyte range. This software-defined storage system scales nicely with great hardware, such as the HPE systems shown here. An important part of the system is the site-wide analytics, where we measure many aspects of performance, usage, and environmental data. This data from both clients and servers is stored in an InfluxDB time-series database for analysis.

For this test, the NFS clients are throwing a sustained, mixed read/write, mixed size, mixed randomness workload at multiple shares on two pools (p115 and p116) served by two Data Nodes (s115 and s116). At the start of the sample, both pools are served by s116. At 01:34, pool p115 is migrated (failed over, in cluster terminology) to s115. The samples are taken every minute, but the actual failover time for pool p115 is 11 seconds under an initial load of 11.5k VOPS (VFS-layer operations per second). After the migration, both Data Nodes are serving the workload, and the per-pool performance increases to 16.5k VOPS. The system goes from serving an aggregate of 23k VOPS to 33k VOPS -- a valid reason for re-balancing the load.
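
For the curious, a chart like this can be driven by a simple per-pool query against the time-series database. Here is a Python sketch using the influxdb client, with hypothetical measurement, field, and tag names (vfs_ops, vops, pool) standing in for the real schema:

    from influxdb import InfluxDBClient

    # hypothetical connection and schema -- adjust to your own analytics database
    client = InfluxDBClient(host="analytics.example.com", port=8086, database="metrics")
    result = client.query(
        "SELECT mean(vops) FROM vfs_ops "
        "WHERE time > now() - 2h GROUP BY time(1m), pool"
    )
    for (measurement, tags), points in result.items():
        for point in points:
            print(tags["pool"], point["time"], point["mean"])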

Sunday, January 10, 2016

Observing cache hit/miss rates

At InterModal Data we build large systems with many components running in highly available configurations 24x7x365. For such systems, understanding how the components are working is very important. Our analytics system measures and records thousands of metrics from all components and makes these measurements readily available for performance analysis, capacity planning, and troubleshooting. Alas, having access to the records of hundreds of thousands of metrics is not enough; we need good, concise methods of presenting that data in meaningful ways. In this post, we'll look at the cache hit/miss data for a storage system and a few methods of observing the data.

In general, caches exist to optimize the cost vs performance of a system. For storage systems in particular, we often see RAM working as cache for drives. Drives are slow, relatively inexpensive ($/bit) and persistently store data even when powered off. By contrast, RAM is fast, relatively expensive, and volatile. Component and systems designers balance the relatively high cost of RAM against the lower cost of drives while managing performance and volatility. For the large systems we design at InterModal Data, the cache designs are very important to overall system scalability and performance.

Once we have a cache in the system, we're always interested to know how well it is working. If over-designed, an expensive cache just raises the system cost while adding little benefit. One metric often used for this analysis is the cache hit/miss ratio. Hits are good, misses are bad. But it is impossible to always have 100% hits when volatile RAM is used. We can easily plot this ratio over time as our workload varies.

In the following graphs, the data backing each graph is identical. The workload varies over approximately 30 hours.

Traditionally, this is tracked as the hit/miss ratio, plotted simply as a ratio over time.


Here we see lots of hits (green = good) with a few cases where the misses (red = bad) seem to rear their ugly heads. Should we be worried? We can't really tell from this graph because there is only the ratio, no magnitude. Perhaps the system is nearly idle and only a handful of misses are measured. When presented with only the hit/miss ratio, it is impractical to do any real analysis; the magnitude is also needed. Many analysis systems then show you the magnitudes stacked, as below.


In this view, the number of accesses is the top of the stacked lines. Under each access point we see the hits and misses expressed as magnitudes. This is better than the ratio graph. Now we can see that the magnitudes change from a few thousand accesses/second to approximately 170,000 accesses/second. We can also see that there were times when we saw misses, but during those times the number of accesses was relatively small. If the ratio graph caused some concern, this graph removes almost all of that concern.

However, in this graph we also lose the ability to discern the hit/miss ratio because of the stacking. If we had two or more levels of cache and wanted to see the overall cache effectiveness, the details would quickly get lost in the stacking.

Recall that hits are good (green) and misses are bad (red). Also consider that Wall Street has trained us to like graphs that go "up and to the right" (good). We can use this to our advantage and more easily separate the good from the bad.


Here we've graphed the misses as negative values. Hits go up to the top and are green (all good things). Misses go down and are red (all bad things). The number of accesses is the spread between the good and the bad, so as the spread increases, more work is being asked of the system. In this case we can still see that the cache misses are a relatively small portion of the overall accesses and, more importantly, occur early in time. As time progresses, the hit ratio and accesses both increase for this workload. This is a much better view of the data.
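
If you want to reproduce this style of graph from your own hit/miss counters, a minimal matplotlib sketch (with made-up data standing in for real measurements) looks something like this:

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.arange(0, 30 * 60)                    # minutes over roughly 30 hours
    hits = 2000 + 90 * t                         # made-up: hits ramp up with the workload
    misses = np.maximum(4000 - 10 * t, 100)      # made-up: misses taper off over time

    plt.fill_between(t, hits, color="green", label="hits")
    plt.fill_between(t, -misses, color="red", label="misses")  # plot misses as negative values
    plt.axhline(0, color="black", linewidth=0.5)
    plt.xlabel("time (minutes)")
    plt.ylabel("accesses per second")
    plt.legend()
    plt.show()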

Here is another example of the SSD read cache for this same experiment. First, the hit/miss ratio graph.

If this were the only view you saw, you would be horrified: too much red, and red is bad! Don't panic.


This graph clearly shows the story in the appropriate context. There are some misses and hits, but the overall magnitude is very low, especially when compared to the RAM cache graph of the same system. No need to panic, the SSD cache is doing its job, though it is not especially busy compared to the RAM cache.

This method scales to multiple cache levels and systems -- very useful for the large, scalable systems we design at InterModal Data.

Wednesday, April 15, 2015

Spring at the Ranch

This spring is bringing new changes to the ranch. This vine surprised us with lavender-colored flowers. Other surprises include an early change in scenery as the grasses and weeds have already reached their summer hue. Long-time readers of my blog might recognize the meaning of flowers... it's all good.