jeffr_tech ([info]jeffr_tech) wrote,
@ 2008-04-07 18:50:00
file offset semantics
I'm further exploring the concurrency guarantees of file i/o in various operating systems. I've found more surprising race conditions and differences of implementation between operating systems.

Each file descriptor in UNIX has an offset associated with it. This is what allows you to call read() over and over again without specifying a position and get later and later chunks of a file, or to write and continue where you left off. There is the additional complication of append-mode writes, but let's ignore that for a moment.
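As a minimal sketch of the implicit offset, here is the behaviour seen from userland through Python's os module, which wraps the same system calls. The scratch file and its contents are made up for the illustration.

```python
import os
import tempfile

# Create a scratch file with known contents (the path is arbitrary).
fd, path = tempfile.mkstemp()
os.write(fd, b"aaaabbbbcccc")
os.lseek(fd, 0, os.SEEK_SET)  # rewind: write() advanced the offset to 12

# Each read() starts where the previous one stopped.
first = os.read(fd, 4)
second = os.read(fd, 4)

# lseek() with a zero displacement reports the offset without moving it.
offset_after = os.lseek(fd, 0, os.SEEK_CUR)

os.close(fd)
os.unlink(path)
print(first, second, offset_after)
```

Neither read() names a position; the kernel keeps it on behalf of the descriptor.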

To keep things straight let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor a 'file'. This is in keeping with how it's done in the kernel. Many threads or even processes may share a single file descriptor that points to one file, in which case they share an offset. Or many processes may each have their own file descriptors and so their own offsets.
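The shared and unique cases can be sketched from userland like this. It assumes a POSIX system, where dup() yields a second descriptor for the same 'file' and a second open() creates a new 'file' on the same inode; the variable names are only for the illustration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.lseek(fd, 0, os.SEEK_SET)

shared = os.dup(fd)                    # same 'file', so the same offset
separate = os.open(path, os.O_RDONLY)  # new 'file' on the same inode

os.read(fd, 4)  # advances the offset seen through fd and through shared

shared_pos = os.lseek(shared, 0, os.SEEK_CUR)
separate_pos = os.lseek(separate, 0, os.SEEK_CUR)

for d in (fd, shared, separate):
    os.close(d)
os.unlink(path)
print(shared_pos, separate_pos)
```

The duplicated descriptor sees the read it never issued, while the separately opened one is still at the start of the file.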

In the shared case we have to determine how updates to this offset are serialized. One important detail is that the offset is 64bit. On 32bit platforms this means it's written with two discrete writes. Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may set different bytes in the final offset leaving you with a corrupt or invalid offset.
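To make the torn update concrete, here is a small deterministic model. It is not kernel code: the SplitOffset class and the two example values are invented for the illustration, and the "interleaving" is just four stores written out in one unlucky order.

```python
MASK32 = 0xFFFFFFFF


class SplitOffset:
    """A 64-bit offset stored as two 32-bit words, as on a 32-bit CPU."""

    def __init__(self, value):
        self.lo = value & MASK32
        self.hi = (value >> 32) & MASK32

    def value(self):
        return (self.hi << 32) | self.lo


# Two updaters want to store different 64-bit offsets.
a = 0x0000000100000000  # updater A's new offset
b = 0x00000002FFFFFFFF  # updater B's new offset

off = SplitOffset(0)

# One unlucky ordering of the four 32-bit stores:
off.lo = a & MASK32          # A stores its low word
off.lo = b & MASK32          # B stores its low word
off.hi = (b >> 32) & MASK32  # B stores its high word
off.hi = (a >> 32) & MASK32  # A stores its high word

torn = off.value()
print(hex(a), hex(b), hex(torn))
```

The final value has B's low word and A's high word, so it matches neither offset that anyone intended to store.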

Another question is, what happens with two simultaneous writes to the file? If we don't serialize the offset they will both write to the same location. If we do, they write one after the other. The same goes for the read side. If two threads in the same process read from the same file simultaneously do they get unique data or the same data? These questions apply both to threads and to processes forked with rfork().
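The two outcomes for writers can be modelled without any real concurrency. In this sketch pwrite() stands in for the kernel's internal "write at the offset I loaded" step; the function names and the two four-byte chunks are assumptions made for the example.

```python
import os
import tempfile


def racy_writes(fd, chunks):
    # Both writers sample the shared offset before either one updates it.
    start = 0
    sampled = [start for _ in chunks]
    for pos, data in zip(sampled, chunks):
        os.pwrite(fd, data, pos)


def serialized_writes(fd, chunks):
    # Each writer loads the offset only after the previous update is done.
    pos = 0
    for data in chunks:
        os.pwrite(fd, data, pos)
        pos += len(data)


def run(writer):
    fd, path = tempfile.mkstemp()
    try:
        writer(fd, [b"AAAA", b"BBBB"])
        return os.pread(fd, 16, 0)
    finally:
        os.close(fd)
        os.unlink(path)


racy = run(racy_writes)
serial = run(serialized_writes)
print(racy, serial)
```

In the unserialized case the second write lands on top of the first and one writer's data is lost; in the serialized case the file holds both chunks back to back.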

Before about 1986 in UNIX there was no serialization on updates. The kernel was also non-preemptive and uniprocessor, and had 32bit offsets, so you didn't have to worry about partial writes even on 16bit machines. The inode was locked after the offset was loaded, so multiple readers could see the same data and multiple writers would write to the same offset. McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected the offset, to handle a case where a forked process was getting its output mixed up.

Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes. This means writers are serialized but readers are not. It also means that offset updates in the read case on 32bit can corrupt the offset value.

Linux manipulates the offset without a lock in any case. The offset is corruptible on 32bit processors. Neither readers nor writers are serialized.

FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases. This actually serializes reads done to the same fd if they don't use pread().
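For reference, this is what sidestepping the shared offset with pread() looks like from userland, on a POSIX system. The file contents and positions are arbitrary.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
os.lseek(fd, 2, os.SEEK_SET)  # park the descriptor's offset at 2

# pread() takes an explicit position and leaves the offset alone.
chunk = os.pread(fd, 4, 6)
pos_after_pread = os.lseek(fd, 0, os.SEEK_CUR)

# A plain read() still uses, and advances, the descriptor's offset.
plain = os.read(fd, 4)
pos_after_read = os.lseek(fd, 0, os.SEEK_CUR)

os.close(fd)
os.unlink(path)
print(chunk, pos_after_pread, plain, pos_after_read)
```

Since pread() never touches f_offset there is nothing to serialize on, which is why readers using it are unaffected by the extra lock.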

POSIX doesn't specify this carefully enough to say what is required.

I think at a minimum Solaris and Linux need to protect the value on 32bit architectures. It's a once-a-year type of event that could lead to problems, but these are the kinds of races and bugs that are impossible to track down. FreeBSD, on the other hand, could relax the restriction on read updates. It doesn't make much sense to do so for writes, and keeping writes serialized preserves the fix for the original bug encountered in 1986. I'll have to think of an elegant way to handle 64bit writes on 32bit platforms, however.





[info]fanf
2008-04-08 07:11 am UTC (link)
This post and your previous one on atomicity are interesting and slightly scary - though I suppose I shouldn't be surprised that the details can't be relied on.

Edited at 2008-04-08 07:12 am UTC


Posix and f_pos
(Anonymous)
2008-04-08 11:34 am UTC (link)
POSIX takes a relaxed view of f_pos because that is all that was guaranteed by existing implementations. The standards people also recognized that the traditional unix API made no sense in a finely threaded world and introduced pread/pwrite.

As stdio does its own internal buffering the cases where locking f_pos would have benefit are very few indeed and the performance impact is quite high. Linux instead copies the position before use and does locking in the lseek paths to ensure a seek updates the full value correctly.

pread/pwrite do the same as read/write and lseek in fewer calls, without the need for additional locking via lockf and friends, and so have pretty much replaced traditional unix behaviour in any vaguely modern and well written code.


Re: Posix and f_pos
[info]jeffr_tech
2008-04-08 11:19 pm UTC (link)
I agree about synchronizing updates across IO. However, I don't think the lack of atomicity when updating the value on small word size platforms is acceptable.



[info]nathan
2008-04-08 07:24 am UTC (link)
dunno if you can use cmpxchg8b, but it's pretty easy to figure out how to do an interlocked increment with it.
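A rough model of that interlocked increment, for readers who haven't seen the pattern: a compare-and-swap retry loop. This is only a sketch. The Cell64 class is hypothetical, and its internal lock merely stands in for the atomicity that an instruction like cmpxchg8b provides in hardware.

```python
import threading

MASK64 = 0xFFFFFFFFFFFFFFFF


class Cell64:
    """Model of a 64-bit memory word with a compare-and-swap primitive."""

    def __init__(self, value):
        self._value = value
        self._hw = threading.Lock()  # stand-in for hardware atomicity

    def load(self):
        with self._hw:
            return self._value

    def compare_and_swap(self, expected, new):
        # Store new only if the word still holds expected.
        with self._hw:
            if self._value != expected:
                return False
            self._value = new & MASK64
            return True


def atomic_add(cell, delta):
    # Retry until no other updater slipped in between load and swap.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old


START = 0xFFFFF000  # just below 2**32, so the additions carry into the high word
cell = Cell64(START)


def worker():
    for _ in range(1000):
        atomic_add(cell, 512)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = cell.load()
print(hex(final))
```

Every addition is applied exactly once, including the ones that carry across the 32-bit boundary, because a swap that raced with another updater simply fails and is retried.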



[info]jeffr_tech
2008-04-08 10:44 am UTC (link)
We can do this on the 32bit platforms that support it. Hard to say ultimately what we'll do on the others.


[info]eqe
2008-04-08 05:30 pm UTC (link)
Does anybody have a testcase for the pointer-update-race case, or is this purely based on code inspection?

Unfortunately I don't think I have any live 32-bit multiprocessor systems so testing the testcase is hard.



[info]jeffr_tech
2008-04-09 12:58 am UTC (link)
Purely based on code inspection. It's an offset and not a pointer btw.

It could theoretically happen on UP with an interrupt between the two instructions and a preemptive kernel.

Given that this is a race between two instructions that appear as statement in the C code, it'd be very difficult to construct a test case to show this.



[info]jeffr_tech
2008-04-09 12:59 am UTC (link)
s/as statement/as one assignment/


Why 32+32 offset as lock?
(Anonymous)
2008-04-09 02:00 pm UTC (link)
The 32+32 offsets can't be used as locks. The locks are 1-bit objects.

Why not eliminate the 32-bit x86 architecture and use pure 64-bit x86-64?


Re: Why 32+32 offset as lock?
(Anonymous)
2008-04-10 04:05 pm UTC (link)
This question comes up fairly often on the Linux kernel mailing list, and there's no clear consensus that it should be fixed (http://lkml.org/lkml/2006/4/13/124).


interlock vs. vnode lock
[info]chitah
2008-07-08 12:59 am UTC (link)
this reminded me of something else you wrote:
> I found this to be quite an interesting set of circumstances and an example of what a pain in the ass
> locking and reference counting can be in large systems. If anyone read this far and is really curious
> about why we have a separate interlock and vnode lock, I could easily write 4 pages on that topic as well.

Out of curiosity, did you ever end up writing 4 pages on that as well?




