Diagnosing high IO wait on CentOS

Twice now I have seen machines crippled by long waits. No, not the standard joke you apply to the n00b in the office, but the Linux wait state: a process waiting, usually for access to IO. In both cases, identifying wait as the issue was key to the resolution. Identifying the processes most affected by it was less useful, as that was what had brought the matter to our attention in the first place. Nonetheless, this is an outline of how to identify a wait issue.

The first time I spotted it was through top, which, unlike the much more configurable, easier-on-the-eye and graph-tastic htop, shows 'wa' by default in the list of stats at the top of the screen. Ideally this figure should be zero: busy boxes go up and come back down occasionally; poorly boxes go up and stay up. In the specific case I was exploring the other day it was ranging between 24 and 45. That is a LOT of wait... and it felt like it.
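If you want that same figure without sitting in an interactive top session (handy for pasting into a ticket), a one-liner along these lines should do it - note that the label of the summary line varies slightly by version ('Cpu(s):' on CentOS 6, '%Cpu(s):' on CentOS 7):

top -bn1 | grep -i 'cpu(s)'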

// What is IO WAIT?

IO wait happens when a process sits in an 'uninterruptible' state while it waits on an IO device.

A process is 'uninterruptible' while it is executing certain system calls where interrupting it would lead to buggy behaviour in the application, or possible data loss (due to, for example, limitations in the driver used to access special hardware). A normal read waiting for a disc to spin up won't, I think, lead to IO wait on its own.
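If you are curious about where a particular stuck process is blocked, /proc exposes both the task's state and, on most stock kernels and as root, its kernel stack. The PID below is just a placeholder:

grep State /proc/1234/status
cat /proc/1234/stack

The first shows 'D (disk sleep)' for a process in uninterruptible sleep; the second gives a rough idea of which kernel code path it is stuck in.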

// Do we have a problem here?

To get a better picture of this over time, you can use something like:

vmstat 2

The above will output the stats every 2 seconds. The 'wa' column, second from last, is the amount of wait being experienced. As time progresses it builds a picture of what is occurring and whether the wait is sustained or merely spiking. If your machine is continually waiting for IO, it is not going to perform well at all, and some further investigation is required.
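If you would rather log a timestamp against each sample, a loop in the same spirit as the one further down works well enough. The awk column number here is an assumption - 'wa' is usually the 16th column of vmstat's output - so check the header line on your own box first:

while true; do echo "$(date) $(vmstat 1 2 | tail -1 | awk '{print "wa=" $16}')"; sleep 5; done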

// Finding the cause or most affected process

ps auxf is your friend here. The STAT column is not normally something you pay much attention to, above and beyond Z for zombie, which can often mean a restart is in your future.

So here is a breakdown of that STAT, the status of each process, when listing them:

D Uninterruptible sleep (usually IO)
R Running or runnable (on run queue)
S Interruptible sleep (waiting for an event to complete)
T Stopped, either by a job control signal or because it is being traced.
W paging (not valid since the 2.6.xx kernel)
X dead (should never be seen)
Z Defunct ("zombie") process, terminated but not reaped by its parent.

Okay, so we are looking for D here. Things that really are waiting, immovable, waiting for an answer back from a drive. Let's single those out:

while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done

This will keep generating a time-stamped list, at one-second intervals, of any processes that have their STAT set to D - just those, and no other processes. Moreover, it provides diagnostic data over time for applications stuck in D, the Uninterruptible Sleep (very Snow White).
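A useful variation on the same theme is to ask ps for the kernel wait channel (wchan), which hints at what each D-state process is actually blocked on:

ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

Matching on a regex rather than an exact string also picks up states such as D+ or Dl, which the straight comparison above would miss.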

// Conclusion

Chances are you will either see the culprit, obviously, or you will see load caused by "nothing" and wait caused by "nothing", with a bunch of processes that need the disk (MySQL, web) running slow as heck and waiting on locks. Things like yum do not run at all. General failure all over the place, limited resources, everything running like a dog, and no apparent cause or way out. The upside is that you have now done your homework: either you have the smoking gun, or you need to start asking questions about the filesystem, the controller and the storage.
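When the finger points at the storage rather than at a process, a device-level view helps. iostat, from the sysstat package, shows per-device figures every couple of seconds:

iostat -x 2

Keep an eye on the await and %util columns: a device sitting at or near 100% utilisation with high await is the bottleneck, whatever happens to be driving it.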

Good luck. Be brave. Share your experiences. Go find the cause, people!

// Postscript

In both cases where I have found myself chasing IO wait, the cause was not applications being shoddy or acting up, but drivers, hardware, the storage itself.
Here are two examples:

- Dell PERC H200, a hardware RAID1 controller, with two SSDs. TRIM not supported. The drives' performance went down the toilet, exhibiting wait of consistently 24 to 48 and peaking in the 80s. With the drives attached to the motherboard RAID controller, I struggled through running an fstrim -v /, which reported back 300G liberated (!), and the wait dropped back down to zero. Now either this was the trim itself, or it was the OS trying to trim/discard through a card that did not support it. Problem solved, with the tools above used to see the issue for what it was: no specific application at fault. We have seen this before elsewhere, resulting in drive corruption on one of the mirrors. Unpleasant. Do not use SSDs in high-churn environments with cards that do not support TRIM (some quick checks for this are sketched after these examples).

- R1soft HCPD, the kernel module that allows it to monitor changes to the disk ('deltas'). The backup process seeds the backup with a block scan, and after that deltas are recorded against it, much the same way a journal works. There was a lack of reliable support for a given set of Ubuntu kernels: reported, investigated, found to be at fault. We stopped using R1soft for a while, which resolved the issue immediately. This one presented as a constant load for very little activity, with wait hovering a lot higher than you would expect; with a USB device also connected, any large IO (backups, local or remote) spiked the wait to the point of being unusable, with the host falling over. Investigation done, module removed, worked around (see the module check below).
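For the TRIM case, it is worth confirming whether the kernel believes a device supports discard at all before trusting SSDs behind a RAID card, and then trimming manually. lsblk --discard needs a fairly recent util-linux, so the /sys check is the fallback on older boxes (substitute your own device for sda):

lsblk --discard
cat /sys/block/sda/queue/discard_max_bytes
fstrim -v /

Non-zero DISC-GRAN/DISC-MAX values, or a non-zero discard_max_bytes, mean the device advertises discard support; fstrim -v / then reports how much space it was able to release.

For the backup-agent case, the general pattern was simply to confirm the vendor's kernel module was loaded and take it out of the picture once the agent was stopped. The module name below is an assumption on my part - check the vendor documentation and your own lsmod output for the real one:

lsmod | grep -i hcp
modprobe -r hcpdriver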

References:

Many thanks to these two links:
http://www.chileoffshore.com/en/interesting-articles/126-linux-wait-io-problem
https://stackoverflow.com/questions/36796635/vmstat-what-exactly-does-cpu-wait-mean#36796991
