Saturday, February 23, 2019

I/O stalls on Azure VMs with read-write data disk cache

A few years ago I set up our cloud instances in Azure. There are three caching modes available for premium (SSD) data disks: None, ReadOnly, and ReadWrite. I didn't read up enough on the choice, but ReadWrite seemed safe for journalling filesystems or databases with proper write-ahead logs, so I figured I'd start with it and adjust if there was a problem.
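For reference, the cache mode is chosen per data disk when you attach it, and it can be changed later without recreating the disk. With the Azure CLI it looks roughly like this (the resource names are placeholders, and I'd double-check the exact flags against az --help before relying on them):

    # pick the cache mode at attach time
    az vm disk attach --resource-group my-rg --vm-name my-vm \
        --name my-data-disk --caching ReadWrite

    # or flip an existing data disk (LUN 0 here) by patching the VM model
    az vm update --resource-group my-rg --name my-vm \
        --set storageProfile.dataDisks[0].caching=ReadOnly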

Much later we migrated back to Azure after some time away, and I kept the old ReadWrite cache setting. Everything was fine for a couple of months until we introduced a feature that spilled to large temporary files. All of a sudden our apps would stop responding for a minute or more at a time. Our Docker Swarm nodes have relatively small P4 data disks (32 GiB) with an expected throughput of 25 MiB/sec; we'd be writing temp files at a much higher rate for 20-30 seconds, and then all I/O to the disk would stop for two or more minutes.

This showed up as a bunch of spikes in the disk I/O graphs, with painfully long troughs of 0 MiB/sec between the spikes:


The struggle was real for any API call that got routed to the host during one of those troughs. I chalked it up to the data disk being a silly place to write temporary files, configured everything to write to an encrypted filesystem on top of the ephemeral disk, and moved on. But it's been gnawing at me: Yes, it was a mistake to write large temporary files to a tiny managed data disk, but it can't be the only thing I got wrong. The Azure documentation talks a bit about throttling, but this is more than just throttling - it's an outage. Today I had some time to dig into it.
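Before digging in, a note on the workaround for completeness: it's nothing fancy, just plain dm-crypt with a throwaway random key over the resource disk, which Azure normally exposes as /dev/sdb1 mounted at /mnt. The device name and paths below are assumptions from my setup rather than gospel:

    # stop using the agent's plain mount of the resource disk
    umount /mnt

    # layer dm-crypt with a random key over it; the data is scratch anyway,
    # so losing the key on every reboot is a feature, not a bug
    cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 \
        --key-file /dev/urandom /dev/sdb1 scratch
    mkfs.ext4 /dev/mapper/scratch
    mkdir -p /mnt/scratch
    mount -o noatime /dev/mapper/scratch /mnt/scratch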

The original problem was caused by sequential writes in a Docker container, but it was devastatingly easy to reproduce directly on the host, even with dd if=/dev/zero (which almost never does the right thing for an I/O benchmark).

In one window I ran dd if=/dev/zero of=/var/lib/docker/bigzero bs=100k and in the other, iostat -xm 5. At first, the data disk (sdc) was humming along writing nearly 128 MiB/sec, way over the expected rate.
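Spelled out, in case you want to try this on a throwaway VM (the count and the explicit device are my additions, to cap the file at roughly 10 GiB and to keep iostat focused on the data disk):

    # terminal 1: buffered sequential writes onto the data disk's filesystem
    dd if=/dev/zero of=/var/lib/docker/bigzero bs=100k count=100000 status=progress

    # terminal 2: watch the data disk (sdc on this VM)
    iostat -xm 5 sdc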


(Big ups to iostat for adding colour coding. The big red 99.9 %util in the rightmost column is how we found the original problem in production when everything was on fire.)

After half a minute or so, the writes dropped to 0, but %util stayed at 100%:


The aqu-sz is 9, so there's something stuck in the queue (my dd writes) but nothing is getting written. A couple of minutes later it briefly picked up again, and so on forever.
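The same picture is visible without iostat, straight from sysfs, if you'd rather watch the raw counters; a rough sketch, assuming the data disk is still sdc (field positions per the kernel's Documentation/block/stat):

    # field 5 of /sys/block/sdc/stat is writes completed, field 9 is I/Os in flight
    read -r _ _ _ _ prev _ _ _ _ _ < /sys/block/sdc/stat
    while sleep 5; do
        read -r _ _ _ _ wr _ _ _ inflight _ < /sys/block/sdc/stat
        echo "writes completed in last 5s: $((wr - prev))  in flight: $inflight"
        prev=$wr
    done

During a stall the completions sit at zero while the in-flight count stays stuck above it, which is the same story aqu-sz was telling.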

On another VM I reconfigured the cache mode to ReadOnly. This gives a slower but steadier iostat display, holding right around the expected 25 MiB/sec:


The Azure portal graph looks flatter, too:

From what I understand, the read-write cache uses a local SSD on the host to buffer writes while the slower managed disk catches up. It seems like, if you've got a small managed disk with a low rated speed, all I/O stalls once that buffer fills up, and you have to wait until its contents have been flushed out to the managed disk before additional read or write requests can make it through. The back-of-envelope numbers roughly fit: half a minute of writes at nearly 128 MiB/sec leaves a few GiB sitting in the cache, and draining that at the P4's 25 MiB/sec takes a couple of minutes - about the length of each stall.

I'm not sure why it would be working this way, and I can't reproduce these stalls on a VM with ReadWrite caching on a larger P20 disk (512 GiB). I did some storage performance tests with fio when I started with Azure and didn't see this issue, but I was always using larger disks. The next time I go into a new cloud I'll be checking the storage performance with dumb old dd; you never know what it might catch.
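Those early fio runs were sequential-write jobs along these general lines (illustrative, not the exact job files, and the path is a placeholder for wherever the data disk is mounted):

    # a few GiB of sequential writes; on larger disks this never showed the stall
    fio --name=seqwrite --filename=/path/to/datadisk/fio-seqwrite --rw=write \
        --bs=1M --size=4G --ioengine=libaio --iodepth=8 --direct=1

    # the "dumb old dd" version: let it run for several minutes and watch iostat
    # in another terminal
    dd if=/dev/zero of=/path/to/datadisk/bigzero bs=1M count=16384 status=progress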

This was on a D8s v3 VM running Ubuntu 18.04. Single data disk, P4, LVM, ext4, default options with noatime. Barriers are disabled automatically - LVM doesn't support them.
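If you want to double-check the same details on your own VM, the mount options and the kernel's view of the disk's write cache are both easy to read back (sdc and the mount point are from my setup; substitute your own):

    # confirm noatime and friends on the data disk's filesystem
    findmnt -no OPTIONS /var/lib/docker

    # what the guest kernel thinks the disk's write cache is doing
    cat /sys/block/sdc/queue/write_cache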
