Monday, April 16, 2012

Latency and I/O Size: Cars vs Trains

A legacy view of system performance is that bigger I/O is better than smaller I/O. This has led many to worry about things like "jumbo" frames for Ethernet or setting the maximum I/O size for SANs. Is this worry justified? Let's take a look...

This post is the second in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.

In this experiment, the latency and bandwidth of random NFS writes are examined. Conventional wisdom says that jumbo frames and large I/Os are better than the default frame size or small I/Os. If that is the case, then we expect to see a correlation between I/O size and latency. Remember, latency is what we care about for performance, not operations per second (OPS). The test case is a typical VM workload where the client is generating lots of small random write I/Os, as generated by the iozone benchmark. The operations are measured at the NFS server along with their size, internal latency, and bandwidth. The internal latency is the time required for the NFS server to respond to the NFS operation request. The NFS client will see the internal latency plus the transport latency.

If the large I/O theory holds, we expect to see better performance with larger I/Os. By default, the NFSv3 I/O size for the server and client in this case is 1MB. It can be tuned to something smaller, so for comparison, we also measured with the I/O size limited to 32KB (the NFSv2 default).

Toss the results into JMP and we get this nice chart that shows two consecutive iozone benchmark runs - the first with NFS I/O size limited to 32KB, the second with NFS I/O size the default 1MB:

The results are not as expected. What is expected is that the larger I/Os are more efficient and therefore offer better effective bandwidth while reducing overall latency. What we see is that we get higher bandwidth and significantly lower latency with the smaller I/O size! The small I/O size configuration on the left clearly outperforms the same system using large I/O sizes.

The way I like to describe this is using the cars vs trains analogy. Trains are much more efficient at moving people from one place to another. Hundreds or thousands of people can be carried on a train at high speed (except in the US, where high speed trains are unknown, but that is a different topic). By contrast, cars can carry only a few people at a time, but can move about without regard to the train schedules and without having to wait as hundreds of people load or unload from the train. On the other hand, if a car and train approach a crossing at the same time, the car must wait for the train to pass. And that can take some time. The same thing happens on a network where small packets must wait until large packets pass through the interface. Hence, larger packets do not automatically move through the network more quickly: while large packets are moving, the small packets can be blocked, just as cars wait at the crossing for the train to pass.
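To put rough numbers on the crossing, here is a minimal sketch of how long a single I/O occupies a link. The 1 Gb/s link speed is an assumption chosen for illustration, not a figure from the experiment, and the calculation ignores protocol headers and any packet-level interleaving, so treat it as a worst-case wait for a small request stuck behind one large transfer.

```python
def serialization_delay_ms(size_bytes: int, link_bits_per_sec: float) -> float:
    """Time needed to clock size_bytes onto a link, in milliseconds."""
    return size_bytes * 8 / link_bits_per_sec * 1000.0


LINK_BPS = 1e9  # assumed 1 Gb/s link; not taken from the experiment

train = serialization_delay_ms(1024 * 1024, LINK_BPS)  # one 1MB I/O
car = serialization_delay_ms(32 * 1024, LINK_BPS)      # one 32KB I/O

print(f"1MB I/O occupies the link for {train:.3f} ms")
print(f"32KB I/O occupies the link for {car:.3f} ms")
print(f"waiting behind the 1MB I/O takes {train / car:.0f}x longer")
```

Under these assumptions the train holds the crossing for about 8.4 ms, against roughly 0.26 ms for the car.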

This notion leads to a design choice that is counter to the conventional wisdom. To improve overall performance of the system, smaller I/O sizes can be better. As usual with performance issues, many factors contribute to the constraints, but consider that there can be an improvement when the I/O sizes are more like cars than trains.


  1. Two related anecdotes:

    1) We used to compare T1 line performance to 24x 56k modems in parallel. You can move 1.5Mbps across both, but eventually the receipt of data has to be acknowledged. The 60ms+ turnaround time for the modems kills the comparison...

    2) When running big VDI workloads on NAS, the conventional wisdom is wrong too. In reality, tuning the NAS for "acceptable bandwidth" and "very low latency" allows for more desktops per storage pool. Traditional tuning results in extremely slow file system operations and mount resets. This type of workload is a VERY good example of your "cars waiting at the crossing" analogy...

    It stands to reason that in "cloudy" infrastructures, tuning for latency provides for a wider set of "acceptable" workloads - since workloads of many profiles will inhabit the same infrastructure (storage and network). I'd submit that NFS/NAS is more susceptible to this kind of phenomenon than block storage because both data and file system semantics are subject to the same delays. However, I've seen the same type of latency-bound performance characteristic in iSCSI systems in VDI workloads resulting in catastrophic failures (LUN timeouts, APD failures, lost resource locks, etc). The damn train sometimes takes too long to clear the crossing...

  2. Right. Larger block sizes work better for workloads that are more like "streaming", where there are few or no cars on the road. (Actually examining trains in places like Russia, where you have trains sharing the road with cars, might be an interesting analogy.) Trains work best when they are isolated.

    With respect to Jumbo frames - Ethernet is an extreme case because 1500 bytes is *so* tiny that you wind up paying a huge overhead just to send even a single 8K page. Jumbo frames are designed to minimize that overhead, and yet are still small enough (9K usually) that they look more like minivans than trains. :-)

    Another consideration with Ethernet is LSO and other TOE-ish techniques. With these, you try to "gather up" a large chunk of data (1MB) and that does cause the stoppage at the NIC, or elsewhere in the stack. Especially as there are TCP control packets that have to get involved to transmit this data. I view LSO as a sadly necessary bit of infrastructure when folks want to send large TCP segments over 10GbE -- mostly because CPUs still have some trouble keeping up with 10GbE. (As CPUs get faster, TOE should matter less -- except network pipes are getting faster too.) We used to worry a lot about this stuff at 1GbE, but modern CPUs hardly break a sweat filling a 1GbE pipe, even without *any* offloading.

    1. It is a myth that jumbo frames are 9k. In fact there is no standard jumbo frame size, which is why changing them can be frustrating.

      It is also a myth that jumbo frames are always faster. They can get stuck in other coalescing software/firmware and actually be slower for workloads where the average xfer is < 1/2 of the MTU.

      As with most cases where you deviate from the standard, testing in your environment with your workload is necessary to optimize.
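
The frame-overhead point raised in the comments can be sketched with simple arithmetic: how many Ethernet frames an 8K page needs at a given MTU, and what fraction of the bytes on the wire are headers. This is an illustration under stated assumptions: IPv4 and TCP headers with no options, standard Ethernet framing, and 9000 as one commonly used jumbo MTU (as noted above, there is no standard jumbo size). It counts bytes only; per-frame CPU and interrupt costs, which are likely the larger part of the overhead, are not modeled.

```python
import math

ETH_WIRE_OVERHEAD = 38  # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40     # IPv4 20 + TCP 20, assuming no options


def frames_and_overhead(payload_bytes: int, mtu: int) -> tuple:
    """Return (frame count, fraction of wire bytes that are overhead)."""
    per_frame_payload = mtu - IP_TCP_HEADERS
    frames = math.ceil(payload_bytes / per_frame_payload)
    overhead = frames * (ETH_WIRE_OVERHEAD + IP_TCP_HEADERS)
    return frames, overhead / (payload_bytes + overhead)


for mtu in (1500, 9000):  # 9000 is an assumed jumbo MTU, not a standard
    frames, fraction = frames_and_overhead(8192, mtu)
    print(f"MTU {mtu}: {frames} frame(s), {fraction:.1%} of wire bytes are overhead")
```

With these assumptions an 8K page takes six frames at an MTU of 1500 and a single frame at 9000, so the receiver handles one sixth as many frames per page.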