Looking at I/O Performance with Bubbles

I am helping a client work through some performance problems and thought I would share one view of the data with you. The data was collected over 57 seconds during a production run. The problem we are chasing is the usual performance problem: latency. In some cases the latency is close to 100 ms, which would make everyone except a floppy disk user unhappy. The view is intended to shed some light on where problems might exist that we need to explore further. Summary data from tools like iostat, vmstat, mpstat, prstat, or top won't show you anything like this, because those tools report averages and totals per interval rather than individual I/Os.
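To illustrate why interval summaries hide this kind of problem, here is a minimal sketch in Python. The latency values are invented for illustration and are not the client's data: 99 fast I/Os and a single 100 ms outlier in one interval.

```python
from statistics import mean

# Hypothetical per-I/O latencies in milliseconds for one sampling interval.
# These values are made up for illustration; they are not the client's data.
latencies_ms = [0.4] * 99 + [100.0]


def summarize(samples):
    """Return (mean, max) for a list of latency samples."""
    return mean(samples), max(samples)


avg_ms, worst_ms = summarize(latencies_ms)
print(f"average: {avg_ms:.2f} ms, worst: {worst_ms:.1f} ms")
```

The average for the interval stays under 1.5 ms even though one I/O took 100 ms, which is why a per-I/O view is needed to see the outliers.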

In the bubble chart, the Y axis is the size of the I/Os. Along the X axis, reads are on the left and writes are on the right. The size of the bubbles is the latency in microseconds. Big bubbles mean big performance problems. Press the play button to see the changes over time.
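The mapping from an I/O record to a bubble can be sketched roughly as follows. This is only an assumption about how such a chart could be fed, not the tooling actually used here; the `IoRecord` type, its field names, and the `to_bubble` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IoRecord:
    """One completed I/O (hypothetical record layout)."""
    timestamp_s: int    # second in which the I/O completed
    op: str             # "R" for read, "W" for write
    size_bytes: int     # size of the I/O
    latency_us: float   # completion latency in microseconds


def to_bubble(rec):
    """Map a record to chart coordinates: reads left, writes right."""
    x = -1 if rec.op == "R" else 1
    return {
        "frame": rec.timestamp_s,  # animation frame (time)
        "x": x,                    # left = read, right = write
        "y": rec.size_bytes,       # Y axis is the I/O size
        "size": rec.latency_us,    # bubble size is the latency
    }


# Hypothetical sample records, not taken from the production run.
sample = [
    IoRecord(0, "R", 131072, 900.0),
    IoRecord(0, "W", 8192, 52000.0),
]
bubbles = [to_bubble(r) for r in sample]
print(bubbles)
```

Any charting library that supports animated bubble plots could consume a list like `bubbles`, with `frame` driving the play button.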

There are two ZFS transaction group (txg) commits: one at 8:49:14 and another at 8:49:44. ZFS commits the txg periodically; the default interval depends on the version, and here it is 30 seconds. When the txg commits, you will see a flurry of relatively small (8 KB) write activity. Though this may look really terrible (and it is), remember that txg commits are asynchronous, so you will rarely feel them. But in this sample, some of the txg I/Os take more than 50 milliseconds to complete. In the entire sample, the worst latency was more than 370 milliseconds (more than 1/3 of a second). For a slow HDD, 50 milliseconds might not be so bad. But in this case, the target is an expensive RAID array. More work is needed to get to the bottom of this mystery...
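One rough way to spot txg commits in per-I/O data is to count writes per second and flag the seconds where the count jumps. The sketch below assumes records of the form `(second, op)`; the `find_write_bursts` helper, the threshold of 100 writes per second, and the sample data are all hypothetical and chosen only to mimic two bursts 30 seconds apart.

```python
from collections import Counter


def find_write_bursts(records, min_writes_per_sec):
    """Return the first second of each run of write-heavy seconds.

    records: iterable of (second, op) tuples, op being "R" or "W".
    """
    writes = Counter(sec for sec, op in records if op == "W")
    busy = sorted(s for s, n in writes.items() if n >= min_writes_per_sec)
    starts = []
    prev = None
    for sec in busy:
        # A gap of more than one second starts a new burst.
        if prev is None or sec - prev > 1:
            starts.append(sec)
        prev = sec
    return starts


# Hypothetical data: light background writes, plus heavy write activity
# at seconds 14-15 and at second 44 (offsets, not wall-clock times).
records = [(s, "W") for s in range(60) for _ in range(2)]
records += [(14, "W")] * 500 + [(15, "W")] * 500 + [(44, "W")] * 500

bursts = find_write_bursts(records, min_writes_per_sec=100)
gaps = [b - a for a, b in zip(bursts, bursts[1:])]
print(bursts, gaps)
```

With this made-up data the bursts start at seconds 14 and 44, giving a gap of 30 seconds. On real data the threshold would need tuning against the background write rate.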

If you would like to see this sort of analysis for your system, contact me and we can discuss an engagement.