March 31st, 2008

IO atomicity

I have long wondered about exactly what atomicity guarantees of read() and write() are so I did some code and posix spelunking over the weekend. The scenarios I'm talking about are as such:

1) Can readers read concurrently with readers?
2) Can readers read concurrently with writers?
3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?
4) Can writers write concurrently with other writers?
5) If writers can write concurrently what constraints are there on the resulting bytes?

So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing. There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime. So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction. No partial writes are visible. No interleaved writes are possible. Readers are concurrent.

On linux, excepting appends, the inode is not locked for io. Instead page reference counts and locks are used for individual parts of the file. You can think of this like impromptu byte-range locking. Linux allows readers to proceed with other readers and writers for overlapping byte ranges. This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu. If you read and the data is not buffered an exclusive lock on the page is used until the data is valid. The same exclusive lock protects overlapping writes to the same page. However, the results when writes span pages seem to be undefined. This is basically as concurrent as you can reasonably get.

So what does posix say?
ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)

Rationale section for the read syscall:

I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations.

There are other statements elsewhere in posix that state a read following a write in time must see the results of the write as a whole. The emphasis on time likely has to do with nfs. So posix is fairly clear. Linux is too loose but FreeBSD is too tight. We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.

Really standards have just been derived from legacy behavior of older unix in order to define the properties that they believed were important for existing applications. In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases. Clearly it is even more strict than FreeBSD.

I believe for 8.0 I will try to make this programmable on a per-file or per-system basis. Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes. However, I find it hard to imagine any application wants to see partial byte results. I suspect range locking on writes will be sufficient in almost all cases.