more dram access timings on two interesting architectures

Ever wonder what memory latency is like on a large, loosely connected opteron system? I lie awake at night wondering myself. Fortunately, I have access to a tyan 8-socket barcelona system. This is basically two 4-socket boards with two very slow HT links between them. I also have access to a nehalem-based box that I have timings for. The results are behind the cut.


dev summit

I just returned from two weeks of travel. One for my wedding anniversary and another for the FreeBSD developer summit which preceded BSDCan. The summit was productive but I'm very happy to be done with the travel.

There were many great discussions at the summit, with topics ranging from release engineering to TCP scalability. I participated in one on mbufs (network buffers) and one on the buffer cache (file-system buffers). For mbufs I presented a technique that I developed for Nokia, based on Kip Macy's excellent work on 10-gigabit ethernet drivers. This technique should simplify the handling of referenced data while reducing the number and temporal scope of cache lines accessed to manipulate buffers in the common case. There is still much work to do to prove it out, however.

The buf discussion lasted almost two hours and was much broader in scope. We will hopefully have a fully revamped IO path for 8.0 to address a wide variety of structural and performance problems. I'm very excited to see this work progress after many years of planning and discussion. My SoC student this year is implementing one essential piece by replacing a splay tree in the vm with a radix tree.

I also had a very interesting discussion with a new project member, Lawrence Stewart, about tcp congestion control, which he gave a talk on later in the conference. I spent the very first part of my career working in the tcp/ip group at microsoft, when tcp vegas was still relatively new. Congestion control was part of my responsibilities there and an area I pursued as a hobby. I was surprised to hear that delay-based congestion control is making a comeback in some research circles. It was nice to hear about developments in a field that I haven't followed in some time.

After all of that socializing and discussion I had a horrible flight, but I was pleased to find that Hawaii now feels very much like home to me; returning was quite a relief.

adaptive idling

One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping. We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks. This has had an unexpected influence on our idle loop.
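
As a rough illustration of the idea (the names below are invented for the example, not our actual API), an adaptive lock spins only while the owner is running on a cpu and sleeps otherwise:

/*
 * Sketch of adaptive blocking: spin while the lock owner is running on
 * another cpu, sleep once spinning stops being worthwhile.  All helper
 * functions here are hypothetical placeholders.
 */
void
adaptive_lock(struct lock *lk)
{
    for (;;) {
        if (try_acquire(lk))            /* got it without blocking */
            return;
        if (owner_is_running(lk)) {     /* owner will likely release soon */
            cpu_spinwait();             /* e.g. the pause instruction */
            continue;
        }
        sleep_until_released(lk);       /* owner is off cpu; block */
    }
}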

When a thread becomes runnable it is often desirable to run it on a cpu other than the current one. If the target cpu is in the idle loop, it may actually be waiting in a low-power state using the 'hlt' instruction or some acpi mechanism that I try to avoid. To wake up this remote cpu we currently issue an IPI (inter-processor interrupt), which is actually very expensive for both the sender and the receiver.

On CPUs that support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations. This works by giving the programmer access to the existing hardware bus-snooping interface: the sleeping cpu sees another cpu write to the memory location it is monitoring and wakes up.
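
The basic pattern looks something like the following kernel-context sketch (monitor and mwait are privileged by default, and the flag here is just a hypothetical per-cpu word, not the actual state the scheduler monitors):

/* Sketch of an mwait-based idle wait and its wakeup path. */
static volatile int cpu_idle_flag;      /* hypothetical per-cpu wakeup word */

static void
idle_mwait(void)
{
    /* Arm the monitor on the wakeup word: eax = address, ecx/edx = 0. */
    __asm __volatile("monitor" : : "a" (&cpu_idle_flag), "c" (0), "d" (0));
    /* Re-check after arming so a wakeup between check and mwait isn't lost. */
    if (cpu_idle_flag == 0)
        /* mwait: eax = sleep-state hint, ecx = extensions. */
        __asm __volatile("mwait" : : "a" (0), "c" (0));
}

static void
wakeup_remote_cpu(void)
{
    /* A plain store to the monitored line wakes the sleeping cpu; no IPI. */
    cpu_idle_flag = 1;
}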

On barcelona, mwait doesn't enter as deep a sleep as it does on the xeons. So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like, so I keep both the Xeons and the Opterons in C0 to further reduce wakeup latency.

Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. So this is now the first stage in the adaptive algorithm: we spin for a while, then sleep in a high-power state, and then sleep in a low-power state, depending on load.
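
Putting the stages together, the policy looks roughly like this; the thresholds and helper functions are invented for the sketch, and the real heuristic for choosing a stage depends on how busy the cpu has recently been:

/*
 * Rough sketch of the staged idle policy: spin briefly, then mwait in a
 * shallow state when busy, and fall back to hlt when the cpu has been
 * idle for a while.  Thresholds and helpers are hypothetical.
 */
#define IDLE_SPIN_COUNT      1000   /* hypothetical spin budget */
#define IDLE_BUSY_THRESHOLD  4      /* recent wakeups before hlt looks wrong */

static void
cpu_idle_adaptive(int recent_wakeups)
{
    int i;

    /* Stage 1: spin, hoping work shows up almost immediately. */
    for (i = 0; i < IDLE_SPIN_COUNT; i++) {
        if (work_pending())
            return;
        cpu_spinwait();
    }

    /* Stage 2: busy machine, so mwait in a shallow state for cheap wakeups. */
    if (recent_wakeups >= IDLE_BUSY_THRESHOLD) {
        idle_mwait();               /* from the sketch above */
        return;
    }

    /* Stage 3: lightly loaded, so hlt and accept the IPI wakeup cost. */
    cpu_halt();
}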

Using a 'ping-pong' threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
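
For the curious, a minimal version of this kind of ping-pong test might look like the following. This is only a sketch of the idea, not the actual benchmark that produced the numbers above:

/*
 * Sketch of a ping-pong benchmark: a single token circulates around a
 * ring of threads, so at any instant only one thread is runnable and
 * every pass wakes a sleeping thread.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ROUNDS   100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int token;                       /* id of the thread allowed to run */

static void *
ring_thread(void *arg)
{
    int id = (int)(long)arg;

    pthread_mutex_lock(&lock);
    for (int i = 0; i < ROUNDS; i++) {
        while (token != id)
            pthread_cond_wait(&cv, &lock);
        token = (id + 1) % NTHREADS;    /* pass the token on */
        pthread_cond_broadcast(&cv);
    }
    pthread_mutex_unlock(&lock);
    return (NULL);
}

int
main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, ring_thread, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%d passes complete\n", NTHREADS * ROUNDS);
    return (0);
}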

file offset semantics

I'm further exploring the concurrency guarantees of file i/o in various operating systems. I've found more surprising race conditions and differences of implementation between operating systems.

Each file descriptor in UNIX has an offset associated with it. This is what allows you to call read() over and over again without specifying a position and get later and later chunks of a file, or to write and continue where you left off. There is the additional complication of append-mode writes, but let's ignore that for a moment.

To keep things straight, let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor a 'file'. This is in keeping with how it's done in the kernel. Many threads, or even processes, may share a single file that points to one inode, and so they share an offset. Or many processes may have unique files and therefore unique offsets.
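
A quick illustration of the difference, assuming /etc/passwd exists and ignoring error checking:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char buf[4];
    int fd1 = open("/etc/passwd", O_RDONLY);
    int fd2 = dup(fd1);                      /* same file, shared offset */
    int fd3 = open("/etc/passwd", O_RDONLY); /* new file, private offset */

    read(fd1, buf, sizeof(buf));             /* advances the shared offset */
    printf("fd2 offset: %jd\n", (intmax_t)lseek(fd2, 0, SEEK_CUR)); /* 4 */
    printf("fd3 offset: %jd\n", (intmax_t)lseek(fd3, 0, SEEK_CUR)); /* 0 */
    return (0);
}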

In the shared case we have to determine how updates to this offset are serialized. One important detail is that the offset is 64 bits. On 32-bit platforms this means it's written with two discrete stores. Without some serialization, other threads can see half of an update, or in the worst case, two simultaneous updaters may each set a different half of the final offset, leaving you with a corrupt or invalid value.
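
To make the hazard concrete, here is a contrived example of what a reader could observe if it catches only the low half of an update (the values are invented for illustration):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t oldoff = 0x00000000ffffffffULL;    /* offset just below 4GB */
    uint64_t newoff = 0x0000000100000000ULL;    /* offset after one more byte */

    /* Reader sees the new low word paired with the old high word. */
    uint64_t torn = (oldoff & 0xffffffff00000000ULL) |
        (newoff & 0x00000000ffffffffULL);
    printf("torn offset: 0x%016jx\n", (uintmax_t)torn);  /* prints 0 */
    return (0);
}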

Another question is, what happens with two simultaneous writes to the file? If we don't serialize the offset they will both write to the same location. If we do, they write one after the other. The same goes for the read side: if two threads in the same process read from the same file simultaneously, do they get unique data or the same data? These questions apply both to threads and to processes that share descriptors, such as those created with rfork().

Before about 1986, unix had no serialization on offset updates. It was also non-preemptive and uniprocessor, and offsets were only 32 bits, so you didn't have to worry about partial writes even on 16-bit machines. The inode was locked only after the offset was loaded, so multiple readers could see the same data and multiple writers would write to the same offset. McKusick changed this in the CSRG sources in 1987 so that the exclusive inode lock also protected the offset, to handle a case where a forked process was getting its output mixed up.

Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes. This means writers are serialized but readers are not. It also means that offset updates in the read case on 32-bit can corrupt the offset value.

Linux manipulates the offset without a lock in any case. The offset value is corruptible on 32-bit processors. Neither readers nor writers are serialized.

FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases. This actually serializes reads done to the same fd if they don't use pread().

Posix doesn't specify this carefully enough to say what is required.

I think at a minimum solaris and linux need to protect the value on 32-bit architectures. It's a once-in-a-year type of event that could lead to problems, but these are exactly the kinds of races and bugs that are impossible to track down. FreeBSD, on the other hand, could relax the restriction on read updates. It doesn't make much sense to relax it for writes, since serializing them is what fixes the original bug encountered in 1986. I'll have to think of an elegant way to handle the 64-bit offset updates on 32-bit platforms, however.

email

I worked at an internet service provider when I was in high school and as a result got free email for life. This is my @chesapeake.net address. Unfortunately, chesapeake.net outsourced email to a bulk provider, and I've had a remarkable number of emails just plain dropped since. So what am I to do about it?

I registered jroberson.net and set the mx to point at google. Then I set my google account to let me pop and send mail via smtp. So what I don't understand is why google does this for free? I'm not looking at any targeted advertising. They're just acting as temporary storage until I pop the mail. I may never use the web interface again.

Also, I'd like to take this opportunity to point out that google is the new microsoft which was the new ibm which was probably the new something else. What I mean is, exciting and innovative company becomes large and imposing and then no one likes them. I'm surprised more people haven't seen the writing on the wall. I speculate that after 10-15 years of declining popularity and utility microsoft will eventually become a respected, productive member of society again like IBM did.

IO atomicity

I have long wondered exactly what the atomicity guarantees of read() and write() are, so I did some code and posix spelunking over the weekend. The scenarios I'm talking about are as follows:

1) Can readers read concurrently with readers?
2) Can readers read concurrently with writers?
3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?
4) Can writers write concurrently with other writers?
5) If writers can write concurrently what constraints are there on the resulting bytes?

So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing. There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime. So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction. No partial writes are visible. No interleaved writes are possible. Readers are concurrent.
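
In rough, simplified form (this is a sketch of the rule, not the actual FreeBSD code), the pattern is:

/*
 * Roughly the BSD rule: the vnode lock is taken shared around reads and
 * exclusive around writes, so each write is atomic with respect to
 * concurrent reads and writes.
 */
static int
file_read(struct vnode *vp, struct uio *uio, struct ucred *cred)
{
    int error;

    vn_lock(vp, LK_SHARED | LK_RETRY);      /* many readers at once */
    error = VOP_READ(vp, uio, 0, cred);
    VOP_UNLOCK(vp, 0);
    return (error);
}

static int
file_write(struct vnode *vp, struct uio *uio, struct ucred *cred)
{
    int error;

    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);   /* one writer, no readers */
    error = VOP_WRITE(vp, uio, 0, cred);
    VOP_UNLOCK(vp, 0);
    return (error);
}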

On linux, excepting appends, the inode is not locked for io. Instead, page reference counts and locks are used for individual parts of the file. You can think of this as impromptu byte-range locking. Linux allows readers to proceed concurrently with other readers and with writers for overlapping byte ranges. This means you can call read() and see the incomplete results of a file rewrite as it is happening on another cpu. If you read and the data is not buffered, an exclusive lock on the page is held until the data is valid. The same exclusive lock protects overlapping writes to the same page. However, the results when writes span pages seem to be undefined. This is basically as concurrent as you can reasonably get.
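
A small test like the following sketch makes the difference observable: with whole-file write atomicity the reader should never see mixed bytes, while page-granular locking can let a read catch a multi-page write halfway through. The file name and sizes are arbitrary:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192          /* two 4K pages, so writes span a page boundary */

static int fd;

static void *
writer(void *arg)
{
    char buf[BLKSZ];

    for (int i = 0; i < 100000; i++) {
        memset(buf, (i & 1) ? 'A' : 'B', BLKSZ);
        pwrite(fd, buf, BLKSZ, 0);      /* rewrite the same block */
    }
    return (NULL);
}

int
main(void)
{
    pthread_t t;
    char buf[BLKSZ];

    fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);
    memset(buf, 'A', BLKSZ);
    pwrite(fd, buf, BLKSZ, 0);          /* initial contents */
    pthread_create(&t, NULL, writer, NULL);
    for (int i = 0; i < 100000; i++) {
        pread(fd, buf, BLKSZ, 0);
        if (buf[0] != buf[BLKSZ - 1])   /* first and last byte disagree */
            printf("saw a mixed read at iteration %d\n", i);
    }
    pthread_join(t, NULL);
    return (0);
}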

So what does posix say?
ISO/IEC 9945-2:2002 (POSIX.1), Part 2: System Interfaces, page 1174 (page 1203 of the PDF)

Rationale section for the read syscall:

I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations.

There are other statements elsewhere in posix stating that a read which follows a write in time must see the results of the write as a whole. The emphasis on time likely has to do with nfs. So posix is fairly clear: Linux is too loose, but FreeBSD is too tight. We can allow concurrent writers to the same file, as long as they are non-overlapping, without violating any rules.

Really, standards have just been derived from the legacy behavior of older unix in order to define the properties believed to be important for existing applications. In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases. Clearly it is even more strict than FreeBSD.

I believe for 8.0 I will try to make this programmable on a per-file or per-system basis. Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes. However, I find it hard to imagine any application wants to see partial byte results. I suspect range locking on writes will be sufficient in almost all cases.

(no subject)

MyISAM performance is terrible in FreeBSD 7.0 due to the user-space pthread_rwlock implementation. Just a word of warning if you intend to deploy a database server based on 7.0. I am certain we will have this fixed in 7.1. It will most likely be in CURRENT in a week or two.

FreeBSD SoC

I've signed up to mentor a FreeBSD SoC project again this year. I'm most interested in sponsoring the following projects:

1) Improved schedgraph support.
2) User-space lock profiling tool.
3) Improved VM tree structures.
4) SMP-safing Giant-protected filesystems.

Maybe others. I know schedgraph and the user-space lock profiling tool may not sound as glamorous, but they have the potential for the biggest long-term impact on performance. Please email jeff at freebsd.org if you'd like to discuss these further.

The official list of project ideas is here:

http://www.freebsd.org/projects/summerofcode.html

(no subject)

People are always posting comments asking 'what about solaris!?' I'm going to install some new operating systems on an 8-way xeon (2x4). So, what about solaris? What should I install? Can I do a network install or do I have to burn dvds? Any tips? Which Linux should I install for benchmarking? I've just been using fedora. Maybe I should stick with that since I'm familiar with it.

(no subject)

I have an opteron with older, slower memory that I reproduced the pipe tests on, to see whether things were any different on a 64-bit system. I'm not going to paste the full results, but here are a couple of data points:

linux-2.6.24
64[writer]: 97.235 wall (2.031 usr, 68.674 sys), 10.531 Mb/sec
1024[writer]: 13.300 wall (0.145 usr, 9.039 sys), 76.991 Mb/sec
65536[writer]: 3.068 wall (0.001 usr, 1.718 sys), 333.766 Mb/sec

FreeBSD 8.0-CURRENT undermydesk (no cpu switch patches though)
64[writer]: 53.163 wall (1.057 usr, 42.083 sys), 19.261 Mb/sec
1024[writer]: 5.325 wall (0.118 usr, 4.146 sys), 192.284 Mb/sec
65536[writer]: 0.567 wall (0.000 usr, 0.130 sys), 1805.509 Mb/sec

So on this machine we start off 2x as fast and end up 5.5x as fast. The numbers pretty much follow a curve through those points. This verifies the data taken from the old 32-bit HTT machine they tested on. I don't intend to post configs and so on, as the original lkml thread is plenty rigorous enough.

I forgot to mention earlier: the FreeBSD Alan Cox has committed super-pages! We're seeing some great gains from that. This allows the kernel to automatically use large TLB entries for conforming regions of memory. It has a component that ensures that large, contiguous chunks of physical memory will be available to support this, and there is also a defragmenting/compacting piece. There's some great work going into FreeBSD 8.0 already!