Sunday, August 28, 2016

On ZFS Copies

I tried to reply there, but the website wouldn't accept my reply, complaining about cookies and telling me to contact the site administrator. So I'll reply here. Internet Hurrah!

For a single-device pool, copies=2|3 places the redundant copies approximately 1/3 (the second copy) and 2/3 (the third copy, for copies=3) of the way into the LBA range of the single device. Assuming devices allocate with some diversity by LBA, this allows recovery from a range of LBA failures. For HDDs, think head-contacts-media type of failures. For a random failure case, you get random failures.

By contrast, if the pool has two top-level vdevs, such as a simple 2-drive stripe, then the copies are placed on separate drives, if possible. In this case, copies=2|3 provides protection more similar to mirroring, where the copies are on diverse devices. It is not identical to mirroring, because the pool itself depends on all top-level vdevs functioning. On the other hand, you can have different sized devices, with some data diversely stored.

In summary, copies is useful for specifying different redundancy policies for datasets, but it is not a replacement for proper mirroring or raidz. This is why (apologies, in the acquisition, the new regime blew the image links) and

For ZFS enthusiasts, you can see where the copies of blocks of your data are allocated using zdb's dataset option to show the data virtual addresses (DVAs) assigned to each copy. Here's how to do it.

1. First, create a test dataset with copies=2 and create a file with enough data to be interesting. Since we know the default recordsize is 128k, we'll write 2x128k or two ZFS blocks in size.

# zfs create -o copies=2 zwimming/copies-example
# dd if=/dev/urandom of=/zwimming/copies-example/data bs=128k count=2
2+0 records in
2+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.0243667 s, 10.8 MB/s

2. Locate the object number of the file, cleverly the same as the inode number, 7 in this case.

# ls -li /zwimming/copies-example/data
7 -rw-r--r-- 1 root root 262144 Aug 28 15:20 /zwimming/copies-example/data

3. Ask zdb to show the dataset information with details about the block allocations for object 7 in dataset zwimming/copies-example.

# zdb -dddddd zwimming/copies-example 7
Dataset zwimming/copies-example [ZPL], ID 49, cr_txg 287, 537K, 7 objects, rootbp DVA[0]=<0:8000:200> DVA[1]=<0:300c200:200> DVA[2]=<0:600bc00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique triple size=800L/200P birth=291L/291P fill=7 cksum=dbfb763cf:52582f00d6d:fed1f02d1ce1:21f172427f4d22

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         7    2    16K   128K   514K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)

Here we verify the logical size (lsize) is 256k and the on-disk size (dsize) is, nominally, 2x the logical size. Recall we wrote random, non-compressible data, so no compression tricks here.

                                        168   bonus  System attributes
dnode maxblkid: 1
path /data

Verify that object 7 is our file named "data".

uid     0
gid     0
atime Sun Aug 28 15:20:43 2016
mtime Sun Aug 28 15:20:43 2016
ctime Sun Aug 28 15:20:43 2016
crtime Sun Aug 28 15:20:43 2016
gen 291
mode 100644
size 262144
parent 4
links 1
pflags 40800000004
Indirect blocks:
               0 L1  0:be00:200 0:300aa00:200 0:6003a00:200 4000L/200P F=2 B=291/291
               0  L0 0:24000:20000 0:3020800:20000 20000L/20000P F=1 B=291/291
           20000  L0 0:44000:20000 0:3040800:20000 20000L/20000P F=1 B=291/291

Here's the meat of the example. This file has one level-1 (L1) indirect block (metadata), with 3 DVAs. Why 3? Because, by default, the number of copies of the metadata is 1+copies, up to 3. With copies=2, the number of metadata copies=3, hence the three DVAs. Each DVA consumes 0x200 (512) physical bytes, or 1.5k for all three. This explains why the accounting for the dsize above is 514k rather than 512k: 512k of data copies plus 1.5k of indirect block copies is 513.5k.
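The dsize accounting can be rechecked with a little shell arithmetic. This is only a sketch using the sizes read from the zdb output above; the variable names are made up for illustration.

```shell
# Sizes are in bytes, taken from the zdb output for object 7.
data_block=$((0x20000))   # physical size of one L0 data block (128K)
data_blocks=2             # the file has two L0 blocks
data_copies=2             # copies=2, so each L0 block has two DVAs
meta_block=$((0x200))     # physical size of the L1 indirect block (512 bytes)
meta_copies=3             # the L1 block has three DVAs

dsize_bytes=$(( data_block * data_blocks * data_copies + meta_block * meta_copies ))

# 525824 bytes is 513.5K, which zdb shows as 514K.
echo "$dsize_bytes bytes"
```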

Each DVA is a tuple of vdev-index:offset:size. Thus a DVA of 0:be00:200 is 512 bytes allocated to vdev-0 (there is only one vdev in this pool) at offset 0xbe00. You can see that the other 2 DVAs are offset further into the vdev, at 0x300aa00 and 0x6003a00. If this pool had more than one vdev, and there was enough space on them, then we would expect the diversity to be across vdevs.
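If you want the DVA fields in decimal, a small shell function can split the tuple. This is a minimal sketch: parse_dva is a hypothetical helper, not part of zdb, and it assumes the offset and size fields are hexadecimal as printed above.

```shell
# parse_dva (hypothetical helper): split "vdev:offset:size" and print
# the offset and size converted from hex to decimal bytes.
parse_dva() {
    dva=$1
    vdev=${dva%%:*}        # text before the first colon
    rest=${dva#*:}         # "offset:size"
    offset=${rest%%:*}
    size=${rest#*:}
    echo "vdev=$vdev offset=$((0x$offset)) size=$((0x$size))"
}

# The three DVAs of the L1 indirect block from the zdb output.
for dva in 0:be00:200 0:300aa00:200 0:6003a00:200; do
    parse_dva "$dva"
done
```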

Looking at the two level-0 (L0) data blocks, we see our actual data. Each block is 128K (0x20000) and the logical (20000L) size is the same as the physical size (20000P), showing no compression. Again we see all blocks allocated to vdev-0, and the offset for the second copy is 0x2FFC800 or 50,317,312 bytes away (98,276 sectors, where sectors=512 bytes).
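The distance between the two copies of the first L0 block can be recomputed from its DVAs. A minimal sketch, where dva_offset is a hypothetical helper and the DVA offsets are taken to be in bytes (consistent with the 0x200 = 512 byte sizes above):

```shell
# dva_offset (hypothetical helper): print the offset field of a
# "vdev:offset:size" DVA as decimal bytes.
dva_offset() {
    rest=${1#*:}
    echo $((0x${rest%%:*}))
}

first=$(dva_offset 0:24000:20000)      # first copy of the block at file offset 0
second=$(dva_offset 0:3020800:20000)   # second copy of the same block
gap=$(( second - first ))

printf 'gap: 0x%X = %d bytes = %d sectors of 512 bytes\n' "$gap" "$gap" $(( gap / 512 ))
```

The second L0 block (0:44000 and 0:3040800) gives the same gap.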

Referring back to JRS System's test, randomly corrupting data will give predictable results. Simply calculate the probability of corrupting two L0 blocks of a given size for a given LBA range.
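One way to sketch that calculation is below. The model is my assumption, not something taken from the referenced test: n corruption events, each landing independently on one uniformly random sector out of N, and two non-overlapping copies of b sectors each. The helper name and the example numbers are made up for illustration.

```shell
# both_copies_hit (hypothetical helper): probability that both copies of a
# block are hit at least once, under the assumed model above.
# By inclusion-exclusion:
#   P(both hit) = 1 - 2*(1 - b/N)^n + (1 - 2*b/N)^n
# Arguments: N = sectors on the device, b = sectors per copy, n = events.
both_copies_hit() {
    awk -v N="$1" -v b="$2" -v n="$3" 'BEGIN {
        p = 1 - 2 * (1 - b / N) ^ n + (1 - 2 * b / N) ^ n
        printf "%.6g\n", p
    }'
}

# Assumed example: a 1 GiB device (2097152 sectors of 512 bytes),
# 128K blocks (256 sectors), 1000 randomly corrupted sectors.
both_copies_hit 2097152 256 1000
```

With a single corruption event the result is 0, since one bad sector cannot land in both copies.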

But storage doesn't tend to fail randomly; failures tend to be spatially clustered. Thus copies is a reasonable use of redundancy techniques even when the device is not redundant. Indeed, copies is routinely used for the precious metadata.