Wednesday, February 17, 2010

ZFS data protection comparison

ZFS now offers triple-parity raidz3. Conceptually, raidz3 is an N+3 parity protection scheme. Today, there are few, if any, other implementations of triple parity protection, so when we say "raidz is similar to RAID-5" and "raidz2 is similar to RAID-6" there is no similar allusion for raidz3. I prefer to say "raidz3 is like raidz2 with one additional level of parity protection." But how much better is raidz3 than raidz2? To help answer that question, I used the simple Mean Time to Data Loss (MTTDL) model to calculate the data retention capabilities of the possible configurations of 12 disks under ZFS. To be fair, the same model applies to other RAID implementations, but I'll use the ZFS terminology here.

In this MTTDL model, the configuration includes N total disks. If the data protection scheme is raidz3, then the minimum N = 1 data disk + 3 parity disks = 4. You can add more data disks to increase the overall available space, so if N=6 then you have 3 data disks + 3 parity disks.

The model uses the Mean Time Between Failures (MTBF) as specified in a vendor's datasheet. It also uses a Mean Time to Repair (MTTR), which includes both the logistical repair time and any data reconstruction required. The simple model calculates MTTDL as:
For non-protected schemes (dynamic striping, RAID-0):
    MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
    MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
    MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
For triple parity schemes (4-way mirror, raidz3):
    MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
A graph of the results for combinations of 12 disks looks like:



The results are consistent with previous MTTDL analysis. The 12-disk stripe has an MTTDL of 6.7 years, which isn't very good (annualized rate = 15%), whereas the 12-disk 4-way mirror MTTDL is 2.75e+13 years (annualized rate = 3.63e-12%) and the 12-disk raidz3 MTTDL is 1.67e+11 years (annualized rate = 5.99e-10%).
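
As a rough check, the simple model can be written in a few lines of Python. This is only a sketch under stated assumptions: the MTBF of 700,000 hours and the MTTR of 24 hours are not given in this post and were picked because they reproduce the quoted results, and dividing by the number of sets for the mirrored layout (three 4-way mirrors across 12 disks) is likewise an assumption. The function names are made up for illustration.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365

# Assumed inputs (not stated in the post); they happen to reproduce its numbers.
ASSUMED_MTBF_HOURS = 700_000
ASSUMED_MTTR_HOURS = 24


def mttdl_years(n_disks, parity, sets=1,
                mtbf=ASSUMED_MTBF_HOURS, mttr=ASSUMED_MTTR_HOURS):
    """Simple MTTDL model for `sets` identical groups of `n_disks` disks,
    where each group tolerates `parity` disk failures."""
    # N * (N-1) * ... * (N-parity), the denominator term from the formulas
    ordered_failures = prod(n_disks - i for i in range(parity + 1))
    hours = mtbf ** (parity + 1) / (ordered_failures * mttr ** parity)
    # Assumption: independent sets divide the MTTDL by their count.
    return hours / sets / HOURS_PER_YEAR


def annualized_loss_percent(mttdl):
    """Annualized data-loss rate, in percent, for an MTTDL given in years."""
    return 100.0 / mttdl


for name, years in [
    ("12-disk stripe", mttdl_years(12, parity=0)),
    ("12-disk raidz3", mttdl_years(12, parity=3)),
    ("3 x 4-way mirror", mttdl_years(4, parity=3, sets=3)),
]:
    rate = annualized_loss_percent(years)
    print(f"{name}: {years:.3g} years, {rate:.3g}% per year")
```

With these assumed inputs the results round to the 6.7 years, 1.67e+11 years, and 2.75e+13 years quoted above. Different MTBF or MTTR values shift the absolute numbers but leave the ranking of the configurations unchanged.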

The theory behind raidz3 will allow more parity disks. But at some point, the system design will be dominated by common-cause failures and not the failure of independent disks. I hope this model will be useful for you to evaluate the data retention of your storage system.

Thursday, February 4, 2010

ZFS training in Atlanta, March 16-18, 2010

I will be presenting a 3-day training session for systems and storage administrators on ZFS and NexentaStor in the Atlanta area March 16-18, 2010. The team has put together a fantastic syllabus including in-depth exposure to the latest ZFS and NAS trends.


Attendees can choose to attend the three-day program, or the two-day advanced portion. The course is structured as follows:

  • Day 1: Introduction to ZFS and Nexenta Systems Storage Technologies
  • Day 2: De-duplication in a VM World
  • Day 3: Optimizing NAS Performance

Attendees should have some familiarity with storage concepts and terminology, but the course does not assume any knowledge of ZFS or familiarity with the NexentaStor storage appliance.
The course will include hands-on exercises with ZFS and NexentaStor.
Best of all, lunch will be provided each day.
To sign up or view the detailed syllabus, visit the nexenta-atlanta.eventbrite.com event registration site.

Sunday, January 31, 2010

Community helping the community, ala YouTube

The video generation is taking hold in the OpenSolaris community. Recently, Michelle Knight, a self-described general lunatic, asked for help on the OpenSolaris ZFS forum. But quite unlike most folks who get help and quietly wander away, or (hopefully) post a summary for posterity, she made a video describing what she learned and posted it to YouTube. Very cool. Well done, Michelle!

Wednesday, January 27, 2010

Magic Mouse and Ring Finger Solution

I use an Apple Magic Mouse and really do love it. I use Adobe InDesign CS4 for writing technical documents. I also have a ring finger. Each of these things works well by itself. Together, they don't work well. This is a typical systems engineering problem. Each part works as designed, but together they don't work well.
Now you are probably wondering why these three things don't work together - millions of people use mice, thousands of writers use InDesign, and almost all of the people on Earth have ring fingers. Let me explain.

  1. The Magic Mouse is so very, very cool because the entire surface is touch sensitive. It is very easy to use and allows you to do things you could never do before with a mouse. For instance, many mice have had a little scroll wheel and OSes are designed to use the scroll wheel movement to scroll up and down inside a window. Some mice have little trackballs that allow you to scroll left or right, too. The Magic Mouse is almost like giving your fingers a trackpad on top of the mouse. Implementing a multiple button click function is simply a matter of the programming that determines where your finger is when you press. Very cool. Very habit forming. In just a few short weeks, my hand is already forgetting how to use older mice.
  2. Adobe InDesign is a very powerful publishing product. I've been using FrameMaker since 1987 and find that InDesign has many of the features I've used in FrameMaker, but is even more powerful and flexible. One of the interesting concepts in InDesign is the pasteboard. Your document sits atop the pasteboard. If you want to move a frame, text, image, or other object out of your document quickly, but without deleting the object, then you can just slide it over to the pasteboard beside the page. Only the objects on the page are printed or exported to PDF, so you can use the pasteboard to keep your miscellaneous collection of stuff very easily. The pasteboard is larger than your page, and by default adds about 8 inches to each side of your document. This means that your pasteboard for a letter-sized document is around 24 inches wide. Since my screen is not 24 inches wide (is Santa listening? I'll be a good boy) the windows I use have horizontal scroll bars. For the most common case, the page is in the center of the scroll bar. I've spent a few hours trying to figure out how to make the pasteboard thinner, but none of the tricks work.
  3. My ring finger has a tendency to rest on the right side of the mouse while my index and middle fingers wander about the mouse top and click.
OK, so now you should be able to recognize the problem. My ring finger is interpreted by the Magic Mouse to do a horizontal scroll and InDesign extends the scrolling area by 60%, most of which is area I rarely use. In other words, while I'm working away, I get suddenly scrolled off into the blank area of the pasteboard. Since the document is in the middle, I have to scroll back to the center, which is harder to do than scrolling full left or full right.
The solution I've found is to put a small bit of painter's tape over the area where my ring finger rests. I could have used duct tape, and that would make a good joke, but I prefer the painter's tape for now.

So far this is working well. A programmatic way to build dead spots on the Magic Mouse would be a useful feature. InDesign could allow me to control the horizontal size of the pasteboard. All of these programming changes are perhaps not difficult, but will also not be solved soon. For now I can be highly productive without having to horizontally scroll back to center on InDesign.
Now, about those deadlines...

Integrated Systems Engineering Redux

Today, Oracle is presenting a webcast describing their strategy for the company going forward after the Sun acquisition. In the first 20 minutes there was much discussion about delivering integrated systems: applications + database + OS + hardware. This is a tremendous value proposition. It is such a tremendous value proposition that it could have been taken from the slides we put together 8 years ago in Sun's Integrated Systems Engineering group.
We had difficulties with the business of delivering such integrated solutions. Sure, there were a few technical difficulties, but working together with the different engineering groups at Sun and Oracle, we were able to deliver a good technical solution. However, the business challenges of working across different product groups and companies were insurmountable at the time. In the end, the Integrated Systems Engineering group was disbanded and the products were EOLed.
In my position as Chief Architect for the Integrated Systems Engineering group, I had the pleasure of working with many talented engineers and product marketing teams. But the experience taught me that very good technical solutions may not be successful because of the rest of the business activities needed to ensure the right products offering the right value are delivered at the right time to the right market. And those products include much more than what a systems engineering team can integrate in a lab. This is why I entered the EMBA program at USC's Marshall School of Business. I already knew how to integrate complex systems and make them simple to install and manage. But I did not know how to take such a product and make it successful in the market. I'm a lot smarter now.
I wish Oracle well in their future endeavors. The value proposition is good. The need exists. The challenges are difficult. If they can overcome the non-technical barriers, the future looks bright.

Monday, January 25, 2010

National recognition for San Diego County CERT

The AMGEN Tour of California bicycle race rode to the top of Palomar Mountain last year. I mentioned it in my blog prior to the race. This month I am pleased to announce that the CERT National Newsletter (Volume 2, issue 3) features a story about the preparation and nearly flawless execution of the event. This was truly a case where dozens of volunteers came together, at short notice, to pull off a significant event involving thousands of people and a nationwide TV audience.
Bill Leininger and the crew from Palomar Mountain Volunteer Fire Department CERT demonstrated superb leadership and I am proud to have been able to participate. I'd also like to thank all of the volunteers and groups who came together to make this event a success.
On page 2, you can see yours truly (in the red shirt, strategically located near the donuts) during the pre-event briefing.

Friday, January 15, 2010

Looking at I/O Performance with Bubbles

I am helping a client work through some performance problems and thought I might share a view with you. The data was collected for 57 seconds during a production run. The problem we are chasing is the usual performance problem: latency. In some cases the latency is close to 100ms, which would make everyone except a floppy disk user unhappy. The view of the data is intended to shed some light on where problems might exist that we need to further explore. Using summary data from tools like iostat, vmstat, mpstat, prstat, or top won't show you anything like this.


In the bubble chart, the Y axis is the size of the I/Os. Along the X axis, reads are on the left and writes are on the right. The size of the bubbles is the latency in microseconds. Big bubbles mean big performance problems. Press the play button to see the changes over time.
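
The mapping from an I/O record to a bubble can be sketched as follows. This is an illustration only: the record layout, the `to_bubble` helper, and the sample values are all hypothetical, since the post does not say how the trace was collected or stored.

```python
from collections import namedtuple

# Hypothetical record layout; the post does not describe the trace format.
IoRecord = namedtuple("IoRecord", "time op size_bytes latency_us")


def to_bubble(rec):
    """Return (x, y, bubble_size) for one I/O, following the chart layout."""
    x = 0 if rec.op == "read" else 1  # reads on the left, writes on the right
    return (x, rec.size_bytes, rec.latency_us)  # y = I/O size, size = latency


# Made-up sample records for illustration only (not the client's data).
sample = [
    IoRecord("08:49:14", "write", 8192, 52_000),
    IoRecord("08:49:14", "read", 131072, 900),
]

bubbles = [to_bubble(r) for r in sample]
worst = max(sample, key=lambda r: r.latency_us)
print(bubbles)
print("worst latency (ms):", worst.latency_us / 1000)
```

Grouping the records by timestamp before plotting gives one frame per second, which is what the play button steps through.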

There are two ZFS transaction group (txg) commits: one at 8:49:14 and another at 8:49:44. ZFS will, by default depending on the version, commit the txg every 30 seconds. When the txg commits, you will see a flurry of relatively small (8 KB) write activity. Though this may look really terrible (and it is) remember that txg commits are asynchronous, so you will rarely feel them. But in this sample, some of the txg I/Os take more than 50 milliseconds to complete. In the entire sample, the worst latency was more than 370 milliseconds (more than 1/3 of a second). For a slow HDD, 50 milliseconds might not be so bad. But in this case, the target is an expensive RAID array. More work needed to get to the bottom of this mystery...

If you would like to see this sort of analysis for your system, contact me and we can discuss an engagement.