jeffr_tech (jeffr_tech) wrote,

file offset semantics

I'm further exploring the concurrency guarantees of file i/o in various operating systems. I've found more surprising race conditions and differences of implementation between operating systems.

Each file descriptor in UNIX has an associated offset with it. This is what allows you to say read() over and over again without specifying a position and getting later and later chunks of a file. Or to write and continue where you left off. There is the additional complicated of append mode writes but let's ignore that for a moment.

To keep things straight let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor is a 'file'. This is in keeping with how it's done in the kernel. So many threads or even processes may share a single file descriptor that points to one file, so they have a shared offset. Or many processes may have unique file descriptors and so they have unique offsets.

In the shared case we have to determine how updates to this offset are serialized. One important detail is that the offset is 64bit. On 32bit platforms this means it's written with two discrete writes. Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may set different bytes in the final offset leaving you with a corrupt or invalid offset.

Another question is, what happens with two simultaneous writes to the file? If we don't serialize the offset they will both write to the same location. If we do, they write one after the other. The same goes for the read side. If two threads in the same process read from the same file simultaneously do they get unique data or the same data? This is true of threads and processes forked with rfork().

Before about 1986 in unix there was no serialization on updates. It also was non-preemptive, uniprocessor and had 32bit offsets so you didn't have to worry about partial writes even on 16bit machines. The inode was locked after the offset was loaded and multiple readers could see the same data and multiple writers would write to the same offset. McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected offset to handle a case where a forked process was getting output mixed up.

Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes. This means writers are serialized but readers are not. It also means that offset updates in the read case on 32bit can corrupt the offset value.

Linux manipulates the offset without a lock in any case. The offset pointer is corruptible on 32bit processors. Neither readers nor writers are serialized.

FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases. This actually serializes reads done to the same fd if they don't use pread().

Posix doesn't specify this carefully enough to say what is required.

I think at a minimum solaris/linux need to protect the value on 32bit architectures. It's a once in a year type event that could lead to problems but these are the kinds of races and bugs that are impossible to track down. FreeBSD, on the other hand, could relax the restriction on read updates. It doesn't make much sense to do so for writes and this fixes the original bug encountered in 1986. I'll have to think of an elegant way to handle 64bit writes on 32bit platforms however.
  • Error

    default userpic

    Your reply will be screened

    Your IP address will be recorded