Saturday, February 23, 2019

I/O stalls on Azure VMs with read-write data disk cache

A few years ago I set up our cloud instances in Azure. There are three caching modes available for premium (SSD) data disks: None, ReadOnly, and ReadWrite. I didn't read up enough on the choice, but ReadWrite seemed safe for journalling filesystems or databases with proper write-ahead logs, so I figured I'd start with it and adjust if there was a problem.

Much later we migrated back to Azure after some time away and I kept the old ReadWrite cache setting. Everything was fine for a couple of months until we introduced a feature that spilled data to large temporary files. All of a sudden our apps would stop responding for a minute or more at a time. Our Docker Swarm nodes have relatively small P4 disks (32 GB), with an expected throughput of 25 MiB/sec. We'd be humming along writing temp files at a much higher rate than that for 20-30 seconds, and then all I/O to the disk would stop for 2 or more minutes.
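To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes a simple model that I haven't confirmed against Azure's actual behaviour: the cache absorbs the whole burst, and the backlog then has to drain to the disk at its rated 25 MiB/sec. The 125 MiB/sec write rate and the `drain_seconds` helper are made up for illustration; I don't have a measured figure for the burst rate.

```python
def drain_seconds(write_rate_mib_s, burst_seconds, disk_rate_mib_s=25.0):
    """Seconds needed to flush the backlog a write burst leaves behind.

    Assumed model: the cache takes writes at full speed while the disk
    drains at disk_rate_mib_s, so only the excess piles up as backlog.
    """
    excess_rate = max(0.0, write_rate_mib_s - disk_rate_mib_s)
    backlog_mib = excess_rate * burst_seconds
    return backlog_mib / disk_rate_mib_s


# Hypothetical burst: 125 MiB/sec for 30 seconds against a 25 MiB/sec disk.
print(drain_seconds(125, 30))  # 120.0
```

Under those assumptions a 30-second burst leaves about two minutes of backlog, which is at least in the same ballpark as the stalls we saw.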