jeffr_tech ([info]jeffr_tech) wrote,
@ 2008-03-31 13:21:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
IO atomicity
I have long wondered about exactly what atomicity guarantees of read() and write() are so I did some code and posix spelunking over the weekend. The scenarios I'm talking about are as such:

1) Can readers read concurrently with readers?
2) Can readers read concurrently with writers?
3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?
4) Can writers write concurrently with other writers?
5) If writers can write concurrently what constraints are there on the resulting bytes?

So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing. There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime. So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction. No partial writes are visible. No interleaved writes are possible. Readers are concurrent.

On linux, excepting appends, the inode is not locked for io. Instead page reference counts and locks are used for individual parts of the file. You can think of this like impromptu byte-range locking. Linux allows readers to proceed with other readers and writers for overlapping byte ranges. This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu. If you read and the data is not buffered an exclusive lock on the page is used until the data is valid. The same exclusive lock protects overlapping writes to the same page. However, the results when writes span pages seem to be undefined. This is basically as concurrent as you can reasonably get.

So what does posix say?
ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)

Rationale section for the read syscall:

I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations.

There are other statements elsewhere in posix that state a read following a write in time must see the results of the write as a whole. The emphasis on time likely has to do with nfs. So posix is fairly clear. Linux is too loose but FreeBSD is too tight. We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.

Really standards have just been derived from legacy behavior of older unix in order to define the properties that they believed were important for existing applications. In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases. Clearly it is even more strict than FreeBSD.

I believe for 8.0 I will try to make this programmable on a per-file or per-system basis. Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes. However, I find it hard to imagine any application wants to see partial byte results. I suspect range locking on writes will be sufficient in almost all cases.



(7 comments) - (Post a new comment)


[info]jeffr_tech
2008-04-01 02:51 am UTC (link)
Some further clarification;

Some people do not feel that the information being in the Rationale section makes it part of the standard. I feel that it clarifies the intent of other language in the standard. It's debatable. My point isn't really whether linux is standards compliant or not, it's just to document what atomicity guarantees there are and develop ways to maintain the current guarantees while providing greater concurrency in FreeBSD.

However, digging deeper you can realize some side-effects of the linux approach that produce even weirder results than a read seeing part of a write. For example, if two threads issue writes that cross page boundaries the results after they both complete can contain some data from both threads with the separation being the page boundary.

Usually when we deal with userspace races we try to let one or the other win. Like if two threads fnctl(.. F_SETFL, ..) simultaneously only one of them can win. But you wouldn't like the resulting flags to be some mix of both? I think you'd like it even less with file data.

(Reply to this) (Thread)


[info]sas_spidey01
2008-04-01 05:01 am UTC (link)
Usually when we deal with userspace races we try to let one or the other win. Like if two threads fnctl(.. F_SETFL, ..) simultaneously only one of them can win. But you wouldn't like the resulting flags to be some mix of both? I think you'd like it even less with file data

You can say that again.

(Reply to this) (Parent)


[info]jeffr_tech
2008-04-01 03:03 am UTC (link)
Yet more clarification;

I believe solaris implements essentially the same atomicity as FreeBSD for file data. It, however, has a looser interpretation of f_offset.

(Reply to this)

RCU
(Anonymous)
2008-04-01 05:02 am UTC (link)
Just a side node: is there any chance for FreeBSD to adopt RCU? From what I hear, it's a very powerful synchronization mechanism. Asking Google, it seems that you discussed this more than 4 years ago [1] and a few months ago you even listed it as one of your "10 things i like about Linux". Are you still interested in adopting it?

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/arch/2004-03/0057.html

(Reply to this) (Thread)

Re: RCU
[info]jeffr_tech
2008-04-01 05:40 am UTC (link)
So we can't use it until the patent expires. Neither can anyone other than Linux and IBM, I believe. Or was it SGI?

Anyway, we've come up with alternate ways to get the same concurrency. We have 'rmlock' which are read-mostly locks in the tree. They allow you to read without a lock, essentially like RCU, but use a different method to synchronize updates to avoid the patent. Our update is slightly more expensive but hopefully for both systems very infrequent.

We've also been looking at lock free or entirely atomic based datastructures and have used them effectively in a few places where RCU would've been nice. I don't think we're terribly hindered by not having RCU at the moment. Of course we're working on scalability to 16-32 processors at the moment and not many dozens or hundreds. In many cases we did extremely coarse locking early on and now that we're coming back around we're doing locking that should scale to much larger machines than we have access to presently.

(Reply to this) (Parent)(Thread)

Re: RCU
(Anonymous)
2008-04-01 06:27 am UTC (link)
Yes you are right about the patent, I didn't know that, but I remember that IBM promised not to sue any Open Source project a few years ago.. well "promised" ;-) And if I remember correctly SUN holds a some patents on ZFS as well (this might be a different case since ZFS is not in your base system).

What kind of machines are you talking about when speaking of "much larger machines"? I thought there are some pretty fundamental assumptions in the FreeBSD kernel that limit you to 64 CPUs on any architecture [1], so I assume you cannot fully utilize a Niagara machine at present, but I guess that's not a big deal right now. Maybe we'll see 64-way x86 machines when Intel releases their native Octacores in the end of this year, but who knows :-)

[1] http://lists.freebsd.org/pipermail/cvs-src/2008-March/088052.html

(Reply to this) (Parent)(Thread)

Re: RCU
[info]jeffr_tech
2008-04-01 07:43 am UTC (link)
Truthfully it is about two days work to convert users from cpumask_t to cpuset_t to permit the generic kernel code to work with numbers of cpus greater than word size.

The bigger issue is just getting hardware into someones hands so that they care to do it.

Really what I was speaking of though was that the solutions we're picking now for most scalability problems should more closely reflect the real underlying serialization requirements. That is to say there isn't much left to do in some cases but ask the applications not to hit the same syscalls for the same objects on many cpus simultaneously. Eventually apps just have to use the system smarter.

(Reply to this) (Parent)


(7 comments) - (Post a new comment)

Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…