Thursday, August 20, 2009

Variance and cloud computing

Recently, I was asked about what I've done for cloud computing. Personally, I think "cloud computing" is just the latest marketing buzzword, and represents a passing fad. But the concepts people use when trying to describe the cloud are a good foundation for providing computing services. Many of these concepts have been in place for 25-30 years, at least in the engineering workstation market, but perhaps not widely applied to other markets. At their core, these concepts aim to reduce cost and complexity by reducing variance -- a noble goal well served by Six Sigma-like approaches.

For example, in the conversation the problem of rapid deployment or provisioning arose. 20 years ago this month, when I was the Manager of Network Support for the Auburn University College of Engineering, I had the task of taking a classroom full of boxes containing workstations and deploying them to the departments and classrooms before school started. The only sane way to approach this was to reduce the variance between the systems. I produced a golden image of the OS and started cranking out workstations. But we still had variance problems. The network was not yet implemented in every building, so some deployments were unique. Later, when we were able to connect all of the buildings (and eventually, we even connected to the internet :-) I quickly resolved the unique systems and brought them into the fold. By the time I left Auburn, we could deploy a new workstation in about 20 minutes, including physically unpacking and installing the hardware, with the workstation fully usable in the ubiquitous environment and connected to the internet. Sometimes it would take longer than 20 minutes because we allowed the faculty member or staff to choose their hostname -- a process that could take days. Since the hostname was the only variable, it was no wonder that it cost time to resolve. But it is good to have real names instead of numbers for such things. Today, with a well designed "cloud" you could expect to deploy a New Thing in just a few moments -- after resolving the unique properties, such as billing information.

Rapid deployment is a Good Thing, but it isn't a new thing. We've been doing that for many years. What seems to be new is that people who do not have the net in their DNA are starting to figure it out. I can loosely classify these as PC people, those who have been exposed to the software installation and configuration issues in the disconnected, fat client world -- aka PCs.

This is where SaaS comes in. The problem with installing software on each and every fat client is that software changes over time. By the time a software vendor becomes successful in the fat client market, they could have dozens of versions of the software installed in the client base. Keeping track of all of the versions, and the features, bugs, and platform support, is a nightmare. At Sun we constantly had customers asking for a compatibility matrix, which I call a sparse matrix: there were so many products and platforms that were never tested together that it was impossible to make sure everything worked together at any one point in time, let alone as they changed over time. The best way to tackle this variance problem is to not install software on the fat client at all. I know, I know, this is the same mantra we sang about 20 years ago when the network was the computer, but it really is all about decreasing costs and complexity by getting rid of variance.

In a SaaS environment, you can roll out a new, improved version of your software and know that all of your customers will be on the same version. This is a huge reduction of variance in the installed base. It is also a huge win for the customer service group in your company. Many studies at Sun found that a large fraction of bug reports already had known solutions. Many enterprise customers are late adopters, which further compounds the problem of variance in the installed base.

So if it takes a marketing buzzword, like "cloud computing" or "SaaS," to help point people in the direction of removing variance, then I can live with it. Reducing variance is a Good Thing.
Thanks, Martha.

Wednesday, August 19, 2009

Backups for file systems with millions of files


Recently, there have been a number of discussions about how to back up an active file system with millions of files. This is a challenge because traditional backup tools do a file system walk -- traversing the file system from top to bottom and checking the modification time of each file. This works well for file systems with a modest number of files. For example, one of my OpenSolaris systems has around 62,000 files in the root file system and backups run at media speed. But when you get into millions of files with a deep hierarchy, the time required to manage the file system walk begins to dominate the backup time. This can be a bad thing, especially if you back up to tape.
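
To make the walk concrete, a traditional incremental backup amounts to something like the sketch below; the paths, timestamp file, and tape device are hypothetical, and every file in the tree must be visited just to test its modification time.

    # classic incremental backup: walk the whole tree, testing each file's
    # mtime against the time of the previous backup (paths are hypothetical)
    touch /var/run/backup.timestamp.new
    find /export/home -newer /var/run/backup.timestamp -type f -print \
        | cpio -oc > /dev/rmt/0
    mv /var/run/backup.timestamp.new /var/run/backup.timestamp

With millions of files, the find(1) pass alone can take hours, even if only a handful of files actually changed.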

We didn't notice this problem with UFS because, quite simply, UFS can't handle millions of files in a single file system. ZFS, now becoming mainstream, does not have this limitation, and people are taking advantage of the ease of managing large datasets with it.

One successful approach to solving this problem uses ZFS snapshots to back up an active file system to a backup file system. For diversity, the backup file system can be located on another disk, array, or host. For cost efficiency, the backup file system can have different properties than the active file system -- compression is often a good idea. The trick is that the active file system can be optimized for high IOPS while the backup file system is optimized for low cost per byte.
This might look familiar to you. Many people have performed backups from a replica of the production data, but most of those implementations perform the replication at the block level. ZFS replicates at the file system level, which allows the block-level policy or configuration -- such as the RAID layout or data compression -- to differ between the two sides.
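
As a minimal sketch, assuming a production file system named tank/data and a hypothetical backup pool named backup, the basic replication is just a snapshot followed by a send/receive:

    # the backup pool can have different properties, such as compression
    zfs set compression=gzip backup

    # snapshot the active file system, then replicate the snapshot
    zfs snapshot tank/data@2009-08-19
    zfs send tank/data@2009-08-19 | zfs receive backup/data

The full stream creates backup/data on the receiving side, and it inherits the compression setting from its parent, so the data is stored compressed on the inexpensive pool.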

With block-level replicators, such as the Sun StorageTek Availability Suite, the replica has no knowledge of the context of the data. The replica's view is block-for-block identical to the original. With ZFS replicas, the file systems can have different data retention policies. For example, the production site may have a snapshot retention policy of 24 hours and the replica may have a retention policy of 31 days. As long as there is a latest, common snapshot between the production site and replica, any later snapshots can be replicated. Try doing that on a block-level replicator!
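
Incremental replication follows the same pattern. As long as the latest common snapshot exists on both sides, only the changes since that snapshot travel; a hedged sketch, continuing the hypothetical names above:

    # take today's snapshot and send only the changes since yesterday's,
    # this time to a backup pool on another host
    zfs snapshot tank/data@2009-08-20
    zfs send -i tank/data@2009-08-19 tank/data@2009-08-20 | \
        ssh backuphost zfs receive backup/data

    # retention can differ per side: expire aggressively on the production
    # pool while the replica keeps its snapshots for 31 days
    zfs destroy tank/data@2009-08-13

The incremental stream only works while tank/data@2009-08-19 still exists on both sides, which is exactly the latest-common-snapshot requirement described above.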

There is much more to this story. In particular, retention policies and performance optimization can get very complex. I'm working on a good example which goes into more detail and discusses performance concerns... more on that later...

Friday, August 14, 2009

Justifying new compression algorithms in ZFS?

Denis Ahrens recently posted a compression comparison of LZO and LZJB to the OpenSolaris ZFS forum. This is interesting work and there are plenty of opportunities for research and development of new and better ways of compressing data. But when does it make sense to actually implement a new compression scheme in ZFS?

The first barrier is the religious arguments surrounding licensing. I'd rather not begin to go down that rat hole. Suffice it to say, if someone really wants to integrate, they will integrate.

The second barrier is patents. Algorithms can be patented, and in the US patents have real value ($). This is another rat hole, so let's assume that monies are exchanged and the lawyers are held at bay.

The third barrier is integration into the OS. Changes to a file system, especially a file system used for boot, take time to integrate with all of the other parts of the OS: installation, backup, upgrades, boot loaders, etc. This isn't especially hard, but it does take time and involves interacting with many different people.

Now we can get down to the nitty-gritty engineering challenges.

Today, disks use a 512 byte sector size. This is the smallest unit you can write to the disk, so compressing below 512 bytes gains nothing. Similarly, if compressing a larger record does not reduce its overall size by at least 512 bytes, then it isn't worth compressing. Also, compression algorithms can increase the size of a record, depending on the data and algorithm. ZFS implements the policy that if compression does not reduce the record's size by more than 12.5% (1/8), then the record is written uncompressed. This prevents inflation and provides a minimum limit for evaluating compression effectiveness. Since 12.5% of 4 kBytes is exactly one 512-byte sector, the smallest record size of interest is 8 sectors of 512 bytes, or 4 kBytes.
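
That policy is easy to model. The sketch below is not the ZFS implementation, just arithmetic mirroring the rule described above, applied to one hypothetical record:

    # decide whether a single record is stored compressed, per the 12.5% rule
    # (sizes in bytes; the numbers are made up for illustration)
    recordsize=131072
    compressed=120000
    threshold=$(( recordsize - recordsize / 8 ))
    if [ "$compressed" -lt "$threshold" ]; then
        echo "store compressed: $(( (compressed + 511) / 512 )) sectors"
    else
        echo "store uncompressed: $(( recordsize / 512 )) sectors"
    fi

With these made-up numbers the record shrinks by only about 8.4%, so it fails the threshold and is written uncompressed, occupying the full 256 sectors.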

ZFS compresses each record of a file rather than the whole file. If a file contains some compressible records and some incompressible records, then parts of the file may still be stored compressed, depending on how the compressible data is distributed through the file. This seems like an odd thing to say, but it is needed to understand that the maximum record size is 128 kBytes. When evaluating a compression algorithm for ZFS, the record sizes to be tested should range from 4 kBytes to 128 kBytes. In Denis' example, the test data is in the 200 MBytes to 801 MBytes range. Interesting, but it would be better to measure with the same policy that ZFS implements. Also, two of Denis' tests were on a tarball comprised of files. Again, this is interesting, but it will not be representative of the compression of the untarred files, especially files smaller than 4 kBytes.

Now we can build a test profile that compares the effectiveness of compression for ZFS. The records should range from 4 kBytes to 128 kBytes. To do this easily with existing files, split them into record-sized pieces, compress each piece, and compare the results after applying ZFS's policy. The results should also be compared in 512-byte sectors, not file length. To demonstrate, I'll use an example. I took the zfs(1m) man page and split it into 4 kByte files. Then I compressed the pieces with compress(1) and gzip(1). For gzip, I used the -6 option, which is the default for ZFS when gzip compression is specified.
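
The procedure looks roughly like the sketch below; split(1) produces the xaa, xab, ... pieces shown in the table, and the sector count is just the length rounded up to a multiple of 512 bytes.

    # split the man page into 4 kByte pieces (xaa, xab, ...)
    split -b 4096 zfs.1m

    # compress each piece both ways, keeping the originals for comparison
    for f in x??; do
        compress -c "$f" > "$f.Z"
        gzip -6 -c "$f" > "$f.gz"
    done

    # report the length and 512-byte sector count of each file
    for f in x?? x??.Z x??.gz; do
        len=$(wc -c < "$f")
        echo "$f $len $(( (len + 511) / 512 ))"
    done

Comparing the sector column, rather than the lengths, is what makes the comparison honest: a 1438-byte result and a 1536-byte result both occupy 3 sectors.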


file        original           compress           gzip -6
            length   sectors   length   sectors   length   sectors
xaa           4096         8     1688         4      945         2
xab           4096         8     2198         5     1664         4
xac           4096         8     2217         5     1615         4
xad           4096         8     2081         5     1438         3
xae           4096         8     2161         5     1468         3
xaf           4096         8     2170         5     1544         4
xag           4096         8     2174         5     1386         3
xah           4096         8     2163         5     1442         3
xai           4096         8     2154         5     1477         3
xaj           4096         8     2279         5     1747         4
xak           4096         8     2100         5     1299         3
xal           4096         8     2066         5     1335         3
xam           4096         8     2177         5     1481         3
xan           4096         8     2056         5     1326         3
xao           4096         8     2025         4     1206         3
xap           4096         8     2073         5     1388         3
xaq           4096         8     1960         4     1279         3
xar           4096         8     1897         4     1272         3
xas           4096         8     1766         4     1135         3
xat           4096         8     2221         5     1478         3
xau           3721         8     1820         4     1056         3
zfs.1m       85641       168    31464        62    20582        41


It is clear that gzip -6 does a better job compressing for space on these text files. I won't measure the performance costs for this example, but in general gzip -6 uses more CPU resources than compress. The real space savings are not obvious from the raw lengths, though. A bit of spreadsheet summing shows the compressed size as a percentage of the original, measured in 512-byte sectors:


              split files   single file   all files
compress          59%           37%          48%
gzip -6           39%           24%          32%

This shows that the space savings from compression on a single, large file are much better than for smaller files. It also reiterates the issue with compression in general -- you can't accurately predict how well it will work in advance.

In conclusion, the work required to add a compressor to ZFS is largely dominated by non-technical issues. But proper evaluation of the technical issues is also needed, to be sure that the engineering results justify the time and expense of tackling the non-technical issues. This can be done with experiments prior to coding to the ZFS interfaces. Denis and others are interested in improving ZFS, which is very cool. I think you should also help improve ZFS, or at least use it.

Thursday, August 13, 2009

Whither Btrfs?

Is the best technology the pathway to success? Nope. In this post, I'll take a strategic look at the future of the Btrfs file system.

Using B-trees (or modified B-trees) for space allocation has been the rage among file system designers in the past few years. Some of the more notable efforts are ZFS, Btrfs, Reiser4, and NILFS. The availability of open source operating systems, especially BSD and Linux, has enabled explorations of interesting new ways to manage storage and implement file systems. This is a good thing. But being technologically cool does not foretell commercial success. For the purpose of evaluating file systems, I'll define commercially successful as having a large installed base for decades. The list of commercially successful file systems is fairly small: FAT, NTFS, HFS+, UFS, and ext2/3 are perhaps the most commercially successful general-purpose file systems today. The key to commercial success is to provide good value and have a good delivery channel.

Btrfs was announced by Oracle in June 2007 and is being integrated into the Linux kernel. It offers some of the more interesting features of other file systems built on B-tree notions: snapshots, efficient backups, copy-on-write, multiple file systems in a single logical volume (called subvolumes), dynamic inode allocation, multiple device support, internal mirroring, etc. These are all cool features and represent a viable technology direction. But technological feats often run into barriers to adoption which prevent them from becoming commercially successful.

The most important barrier to adoption is the delivery channel. Clearly, Microsoft dominates the industry as it carefully controls the delivery channel of software onto approximately 90% of the computer systems volume. Microsoft owns (is the proprietor of) NTFS and FAT, which dominate the market. The next major vendor by volume is Apple, which owns HFS+, the default file system for OS X. The largest Linux channels, Red Hat and SuSE, use ext2/3 and seem to be planning to use ext4 in the future. Changing the default file system for a popular OS is a very expensive, time-consuming, and disruptive event, which is why OS vendors will spend a lot of time and money to fix and incrementally improve the default file system when possible. The life cycle for a default file system is measured in decades. The development of a new file system takes time, too -- on the order of 5-6 years seems to be typical, as measured by having enough stable new features that the value of migration is greater than the inertia of the legacy. The barrier here is time to maturity and time to become the default in the channel. Since Btrfs was introduced in 2007, we can expect it to be mature in the 2012 timeframe. But what about its prospects of becoming the default in the channel?

Oracle has been trying for many years to reduce its costs by eliminating the OS vendor. Until recently, its efforts were to completely eliminate the OS vendor (raw iron) or to take Linux away from Red Hat (so-called Oracle Enterprise Linux, aka Larry Linux). Neither has been very successful. But Oracle's acquisition of Sun Microsystems changes the industry structure in many ways. Now, Oracle will have an entire solution stack: software, hardware, and services. The solution stack represents a channel for Oracle to deliver innovations, such as a spiffy new file system. Herein lies the problem for Btrfs: Oracle will now own ZFS. This means:
  • Btrfs is not mature enough to become the default file system for OEL. ZFS is more than 5 years old and stable enough to become the default file system for Solaris 10 and OpenSolaris.
  • It makes little sense for Oracle to continue funding two competing file system projects -- one trying to match features with the other. ZFS has approximately 45 associated patents, and patents have real value ($) in the US.
  • Tossing Btrfs to the open-source winds is not likely to improve its schedule or channel prospects.
There are a couple of scenarios that could still play out -- Oracle could break the GPLv2 barrier that prevents Linux from accepting ZFS in the kernel, or Oracle could take a more competitive stance against Red Hat and Novell by leveraging [Open]Solaris. Either way, I don't see a good business case for Oracle to continue to invest in Btrfs. What do you think?

Monday, August 3, 2009

Purple Rain


We were eating dinner this evening out on the deck (hot wings, one of my favorites) when nature created a beautiful sight: purple rain. Some tropical moisture flowed into the San Diego area today and made some beautiful clouds. A few of them dropped some rain, though most of it was virga. As the Sun was setting, the colors were just right for a few moments and I was able to snap a picture of the purple rain over Ramona.