Interrupts Visualized using CPU Swimlanes

Who is running what, where? This is one of the questions that keeps performance analysts awake at night. Optimizing system performance often comes down to running the right code at the right time on the right CPU, and that placement can make a big difference. But how can we see this information? CPU swimlanes are a useful way to explore and understand deep kernel behaviors such as scheduling and interrupt handling. My previous post on CPU swimlanes showed a simple example of a macOS laptop with 4 processors. In this post we'll go into more detail and explore how the Linux interrupt balancer, irqbalance, spreads the interrupt load over many processors. This analysis is more intuitive and visually recognizable than hundreds of lines of mpstat data scrolling across a terminal window.

Background

The system under test (SUT) is running a Linux kernel and has 32 processors as seen by the OS. The kernel is NUMA-aware, so it also knows that there are two sockets, each with an 8-core Intel CPU, and that each core has two hyperthreads. The workload is a heavy server workload with millions of I/Os flowing between disks and the network.
For this workload and environment, we make a few changes to the dashboard to better represent the server's tasks:

  • Per-CPU metrics for Linux are more comprehensive than on macOS (see the /proc/stat sketch after this list):
    • user time accounts for user programs running; we expect very little user time on this SUT
    • idle time occurs when the CPU had no code to run
    • iowait occurs when there is no code to run but there is an outstanding I/O in progress
    • irq refers to the time the kernel is handling hardware interrupts
    • nice refers to the time spent running user programs at reduced (nice) priority
    • softirq refers to the time the kernel is handling software interrupts
    • steal is the time a virtual CPU spends in involuntary wait while the hypervisor is servicing another virtual CPU
    • system is the time spent running kernel code that is not servicing hardware or software interrupts
  • For this SUT, we expect to be running mostly kernel code, and that isn't a bad thing. System time is therefore colored a different shade of green than user time, which preserves the tenet that green is good, red is bad.
  • Data is collected into InfluxDB using Telegraf's CPU input plugin without modification (a sample configuration follows this list)
  • Detailed CPU data is broken out by socket
  • The relationship between the two hyperthreads that share a core is consistent for this SUT, but not immediately obvious to the casual observer. This relationship changes across processor families and is not recorded in the telemetry stream by default, though the kernel does expose it through sysfs (a discovery sketch follows this list). To illustrate the shared core, the graph is annotated.
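
To make the metric list above concrete, here is a minimal sketch of where those numbers come from. Each cpuN line in /proc/stat carries cumulative tick counters in the field order documented in proc(5); tools like Telegraf and mpstat diff two snapshots over an interval to compute percentages, but this sketch just reads one snapshot:

```python
# Minimal sketch: read per-CPU time counters from /proc/stat.
# Values are cumulative ticks since boot (USER_HZ units); real tools
# diff two snapshots over an interval to get percentages.
FIELDS = ["user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice"]

def per_cpu_times():
    """Return {"cpu0": {"user": ..., ...}, ...} for every cpuN line."""
    cpus = {}
    with open("/proc/stat") as f:
        for line in f:
            parts = line.split()
            # Keep "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
            if parts[0].startswith("cpu") and parts[0] != "cpu":
                cpus[parts[0]] = dict(zip(FIELDS, map(int, parts[1:])))
    return cpus

if __name__ == "__main__":
    for name, t in sorted(per_cpu_times().items(),
                          key=lambda kv: int(kv[0][3:])):
        total = sum(t.values()) or 1
        print(f"{name}: softirq={100 * t['softirq'] / total:.1f}% "
              f"system={100 * t['system'] / total:.1f}% "
              f"idle={100 * t['idle'] / total:.1f}%")
```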
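
For reference, the stock Telegraf configuration that produces this per-CPU stream looks roughly like the following; `percpu = true` is what emits one series per processor. The InfluxDB URL and database name here are placeholders, not the values from this SUT:

```toml
# Telegraf CPU input: percpu = true emits one measurement per CPU,
# which is what the detailed swimlanes consume.
[[inputs.cpu]]
  percpu = true
  totalcpu = true           # also keep the all-CPU summary series
  collect_cpu_time = false  # report percentages, not raw tick counts
  report_active = false

# Write to InfluxDB; urls and database are placeholders in this sketch.
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
```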
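
Because the kernel publishes the hyperthread pairing through sysfs, the core annotation can be generated rather than hand-drawn. A small sketch, assuming standard sysfs paths (the output format is illustrative):

```python
import glob
import re

def core_siblings():
    """Map each CPU number to the cpulist of hyperthreads sharing its
    core, as reported by the kernel in sysfs."""
    pairs = {}
    for path in glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(re.search(r"cpu(\d+)", path).group(1))
        with open(path) as f:
            # The file holds a cpulist such as "11,27" or "0-1".
            pairs[cpu] = f.read().strip()
    return pairs

if __name__ == "__main__":
    # On the SUT in this post, cpu11's entry should read "11,27".
    for cpu, sibs in sorted(core_siblings().items()):
        print(f"cpu{cpu} shares a core with: {sibs}")
```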

The Experiment

The goal of the experiment is to determine the optimal assignment of interrupts and services to the CPUs, and whether irqbalance can handle the task automatically. There are five timestamps where irqbalance is asked to re-balance the system, with a settling time of approximately two minutes. Though not displayed here, total system throughput increases in proportion to the amount of system time used, so a good result is more system time and thus more transactions being processed.
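
A note on mechanics: a one-time rebalance can be requested by running `irqbalance --oneshot`, and individual interrupts can also be steered by hand through procfs. A minimal sketch of manual steering follows; the IRQ number and CPU choice are illustrative, not values from this experiment:

```python
def pin_irq(irq: int, cpulist: str) -> None:
    """Steer a hardware interrupt by writing a cpulist (e.g. "4" or
    "4-7") to /proc/irq/<irq>/smp_affinity_list. Requires root, and
    the kernel rejects lists containing no online CPUs."""
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(cpulist)

# Illustrative only: move IRQ 42 onto cpu4 to relieve a hot CPU.
pin_irq(42, "4")
```

Keep in mind that a running irqbalance daemon may later move a manually pinned interrupt unless that IRQ is excluded from balancing (for example with its `--banirq` option).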

Analysis

The results clearly show that observing only the overall system CPU usage is inadequate for analyzing how the kernel's resources are used.
  1. The workload is started, and it is immediately obvious that cpu21 is completely consumed.
  2. The first irq rebalancing shifted some work from socket 0 to socket 1 CPUs, but poor cpu21 is still getting hammered, and now cpu7 is too.
  3. The second irq rebalancing reduced the load on cpu7 and cpu21, with visibly better all-around throughput.
  4. The third irq rebalancing looks fine: system time is spreading well and no CPU is being hammered by it.
  5. The fourth irq rebalancing confirms the SUT has reached an optimal balance.
Many programs, such as iostat or vmstat, show summary CPU statistics. They try to describe in numbers what is shown in the top-most summary of all CPUs. It is not surprising that something can consume 100% of one CPU while many others are idle. For this 32-CPU system, the summary data for that condition would show only about 3% softirq (see the sketch below), while all of the softirq usage is confined to cpu11. In the detailed CPU swimlanes, it is immediately obvious that cpu11 is getting hammered. Fear not! We know what this is and what we can do about it. Interestingly, cpu27 shares a core with cpu11, and irqbalance does not try to schedule more work there. IMHO this is a deficiency in irqbalance, but I can only say that with deep knowledge of what is running on cpu11 versus the rest of the test workload. In any case, there is significant idle time on almost all of the other CPUs, and subsequent experiments can show whether more work gets scheduled to cpu27 as the others become busier.
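
The dilution is simple arithmetic: averaging one saturated lane across 32 CPUs makes it nearly disappear. A toy sketch, with illustrative numbers matching the cpu11 scenario above:

```python
# One CPU at 100% softirq and 31 CPUs at 0%: the all-CPU summary
# shows roughly 3%, easy to dismiss, while cpu11 is saturated.
softirq_pct = {f"cpu{i}": 0.0 for i in range(32)}
softirq_pct["cpu11"] = 100.0

summary = sum(softirq_pct.values()) / len(softirq_pct)
print(f"all-CPU summary softirq: {summary:.2f}%")   # 3.12%
print(f"cpu11 softirq:           {softirq_pct['cpu11']:.2f}%")
```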

In this test, irqbalance did a respectable job adjusting the system to deliver better performance. However, that is not always the case: we do see instances where irqbalance de-tunes a system for a short while and then rebalances nicely. As the system nears saturation, any de-tuning can dramatically affect overall performance. With detailed CPU swimlanes, the balance can be quickly understood and correlated with other changes in system behavior.

Conclusion

CPU swimlanes are an excellent approach to understanding how work is spread across multiple CPUs in a system. If you love mpstat and systems with many CPUs, you'll love detailed CPU swimlanes.
