jeffr_tech's Journal
[Most Recent Entries]
[Calendar View]
[Friends]
Below are the 20 most recent journal entries recorded in
jeffr_tech's LiveJournal:
[ << Previous 20 ]
| Wednesday, April 16th, 2008 | | 11:02 pm |
adaptive idling One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping. We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks. This has had an unexpected influence on our idle loop.
When a thread becomes runnable it is often desirable to run it on a cpu other than the current one. If the target cpu is in the idle loop, it may actually be waiting in a low power state using the 'hlt' instruction or some acpi mechanism that I try to avoid. To wake up this remote cpu we currently issue an IPI (inter-processor interrupt). This is actually very expensive for the sender and receiver.
On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations. This works by giving the programmer access to the existing hardware bus snooping interface. The sleeping cpu sees another cpu write to a memory location we're snooping and we wake up.
On barcelona mwait doesn't enter as deep of a sleep as on the xeons. So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like so I keep both the Xeon and Opterons in C0 to further reduce wakeup latency.
Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
Using a 'ping-pong' threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required. | | Monday, April 7th, 2008 | | 6:50 pm |
file offset semantics I'm further exploring the concurrency guarantees of file i/o in various operating systems. I've found more surprising race conditions and differences of implementation between operating systems.
Each file descriptor in UNIX has an associated offset with it. This is what allows you to say read() over and over again without specifying a position and getting later and later chunks of a file. Or to write and continue where you left off. There is the additional complicated of append mode writes but let's ignore that for a moment.
To keep things straight let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor is a 'file'. This is in keeping with how it's done in the kernel. So many threads or even processes may share a single file descriptor that points to one file, so they have a shared offset. Or many processes may have unique file descriptors and so they have unique offsets.
In the shared case we have to determine how updates to this offset are serialized. One important detail is that the offset is 64bit. On 32bit platforms this means it's written with two discrete writes. Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may set different bytes in the final offset leaving you with a corrupt or invalid offset.
Another question is, what happens with two simultaneous writes to the file? If we don't serialize the offset they will both write to the same location. If we do, they write one after the other. The same goes for the read side. If two threads in the same process read from the same file simultaneously do they get unique data or the same data? This is true of threads and processes forked with rfork().
Before about 1986 in unix there was no serialization on updates. It also was non-preemptive, uniprocessor and had 32bit offsets so you didn't have to worry about partial writes even on 16bit machines. The inode was locked after the offset was loaded and multiple readers could see the same data and multiple writers would write to the same offset. McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected offset to handle a case where a forked process was getting output mixed up.
Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes. This means writers are serialized but readers are not. It also means that offset updates in the read case on 32bit can corrupt the offset value.
Linux manipulates the offset without a lock in any case. The offset pointer is corruptible on 32bit processors. Neither readers nor writers are serialized.
FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases. This actually serializes reads done to the same fd if they don't use pread().
Posix doesn't specify this carefully enough to say what is required.
I think at a minimum solaris/linux need to protect the value on 32bit architectures. It's a once in a year type event that could lead to problems but these are the kinds of races and bugs that are impossible to track down. FreeBSD, on the other hand, could relax the restriction on read updates. It doesn't make much sense to do so for writes and this fixes the original bug encountered in 1986. I'll have to think of an elegant way to handle 64bit writes on 32bit platforms however. | | 1:36 am |
email I worked at an internet service provider when I was in highschool and as a result got free email for life. This is my @chesapeake.net address. Unfortunately chesapeake.net outsourced email to a bulk provider and I've had a remarkable number of emails just plain dropped since. So what am I to do about it?
I registered jroberson.net and set the mx to point at google. Then I set my google account to let me pop and send mail via smtp. So what I don't understand is why google does this for free? I'm not looking at any targeted advertising. They're just acting as temporary storage until I pop the mail. I may never use the web interface again.
Also, I'd like to take this opportunity to point out that google is the new microsoft which was the new ibm which was probably the new something else. What I mean is, exciting and innovative company becomes large and imposing and then no one likes them. I'm surprised more people haven't seen the writing on the wall. I speculate that after 10-15 years of declining popularity and utility microsoft will eventually become a respected, productive member of society again like IBM did. | | Monday, March 31st, 2008 | | 1:21 pm |
IO atomicity I have long wondered about exactly what atomicity guarantees of read() and write() are so I did some code and posix spelunking over the weekend. The scenarios I'm talking about are as such:
1) Can readers read concurrently with readers? 2) Can readers read concurrently with writers? 3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both? 4) Can writers write concurrently with other writers? 5) If writers can write concurrently what constraints are there on the resulting bytes?
So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing. There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime. So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction. No partial writes are visible. No interleaved writes are possible. Readers are concurrent.
On linux, excepting appends, the inode is not locked for io. Instead page reference counts and locks are used for individual parts of the file. You can think of this like impromptu byte-range locking. Linux allows readers to proceed with other readers and writers for overlapping byte ranges. This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu. If you read and the data is not buffered an exclusive lock on the page is used until the data is valid. The same exclusive lock protects overlapping writes to the same page. However, the results when writes span pages seem to be undefined. This is basically as concurrent as you can reasonably get.
So what does posix say? ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)
Rationale section for the read syscall:
I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic means that all the bytes from a single operation that started out together end up together, without interleaving from other I/O operations.
There are other statements elsewhere in posix that state a read following a write in time must see the results of the write as a whole. The emphasis on time likely has to do with nfs. So posix is fairly clear. Linux is too loose but FreeBSD is too tight. We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.
Really standards have just been derived from legacy behavior of older unix in order to define the properties that they believed were important for existing applications. In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases. Clearly it is even more strict than FreeBSD.
I believe for 8.0 I will try to make this programmable on a per-file or per-system basis. Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes. However, I find it hard to imagine any application wants to see partial byte results. I suspect range locking on writes will be sufficient in almost all cases. | | Saturday, March 22nd, 2008 | | 9:41 pm |
MyISAM performance is terrible in FreeBSD 7.0 due to the user-space pthread_rwlock implementation. Just a word of warning if you intend to deploy a database server based on 7.0. I am certain we will have this fixed in 7.1. It will most likely be in CURRENT in a week or two. | | 6:09 pm |
FreeBSD SoC I've signed up to mentor a FreeBSD SoC project again this year. I'm most interested in sponsoring the following projects: 1) Improved schedgraph support. 2) User-space lock profiling tool 3) Improved VM tree structures 4) SMP safing Giant protected filesystems. Maybe others. I know schedgraph and the user-space lock profiling may not sound as glamorous but they have the potential to have the biggest long-term impact on performance. Please email jeff at freebsd.org if you'd like to discuss these further. The official list of project ideas is here: http://www.freebsd.org/projects/summerofcode.html | | Saturday, March 15th, 2008 | | 11:11 pm |
People are always posting comments of 'what about solaris!'. I'm going to install some new operating systems on an 8way xeon (2x4). So what about solaris? What should I install? Can I do a network install or do I have to burn dvds? Any tips? Which Linux should I install for benchmarking? I've just been using fedora. Maybe I should stick with that since I'm familiar with it. | | Wednesday, March 12th, 2008 | | 7:22 pm |
I have an opteron with older slower memory that I reproduced the pipe tests on to see if it was any different on a 64bit system. I'm not going to paste the full results but here's a couple of data points:
linux-2.6.24 64[writer]: 97.235 wall (2.031 usr, 68.674 sys), 10.531 Mb/sec 1024[writer]: 13.300 wall (0.145 usr, 9.039 sys), 76.991 Mb/sec 65536[writer]: 3.068 wall (0.001 usr, 1.718 sys), 333.766 Mb/sec
FreeBSD 8.0-CURRENT undermydesk (no cpu switch patches though) 64[writer]: 53.163 wall (1.057 usr, 42.083 sys), 19.261 Mb/sec 1024[writer]: 5.325 wall (0.118 usr, 4.146 sys), 192.284 Mb/sec 65536[writer]: 0.567 wall (0.000 usr, 0.130 sys), 1805.509 Mb/sec
So on this machine we start of 2x as fast and end up 5.5x as fast. The numbers pretty much follow a curve through those points. This verifies the data taken from the old 32bit HTT machine they tested on. I don't intend to post configs and so on as the original lkml thread is plenty rigorous enough.
I forgot to mention earlier. The FreeBSD Alan Cox has committed super-pages! We're seeing some great gains from that. This allows the kernel to automatically use large TLBs for conforming regions of memory. It has a component that ensures that large, contiguous, chunks of physical memory will be available to support this. There is also a defragmenting/compacting piece. There's some great work going into FreeBSD 8.0 already! | | 1:59 pm |
A couple bits of news; We tracked down our problem with the performance drop above 30 threads on Nick Piggin's mysql benchmark to conservative settings for the pthread adaptive spinning. We see a big gain relative to where we were before. Frankly at this point we're splitting hairs with linux and I don't really care where we stand. We had a tremendous problem and we resolved it. Time to move on.. I removed kernel support for our M:N threading library last night. 8.0 will only support 1:1. This will open the way to a lot of optimizations in the signal and sleeping paths. Hopefully reducing the total number of locks required in the sleepq path to a minimum. There are some 'interesting' pipe benchmarks floating around. You can read about it on the lkml and the author's website: http://213.148.29.37/PipeBench/http://lkml.org/lkml/2008/3/5/61I say 'interesting' of course because FreeBSD is doing way better than linux. ;) pipes, the next battleground? I don't know but it's worth a read anyway. I also have a patch to implement cpu affinity for our callout mechanism. This is for time based callbacks. The legacy callouts may have order dependencies or may not tolerate concurrency. So by default they are all scheduled on the first callout thread. There is one callout thread per-cpu and they have a kind of 'medium' affinity for that cpu, however, if they are overloaded by some interrupt work another cpu can complete the callouts. This removes the need to do any kind of load balancing across callout handlers because the scheduler can do a better job anyway. New callouts can specify any cpu when setting a timer and then they have an affinity for that thread until a different cpu is requested. All migration is explicit. Hopefully having callout affinity will benefit our tcp stack where Robert Watson is experimenting with different kinds of affinity for tcp sessions. It will also discourage migration of threads who are sleeping on time based events like select and nanosleep(). | | Monday, March 10th, 2008 | | 4:52 pm |
ULE happenings, context switch surprise. Lately I've been able to spend a bunch of time on ULE thanks to Nokia. They use it in one of their networking products. I've been doing all of this work in 8.0-CURRENT and backporting it for them at the same time. It's a great model for both parties because users on -CURRENT shake out bugs that they'd have to find in testing otherwise and we get new development paid for.
I finished and committed the topology aware cpu scheduling that I discussed in earlier posts. I also implemented a mechanism for CPU provisioning that you can use to restrict groups of processes to sets of cpus which can be dynamically migrated. This will be useful for restricting jails to certain CPUs or dedicating some CPUs to real-time special-purpose tasks for example.
Over the last few days I cleaned up my cpu switch optimizations and got those in. The results are 25% faster context switching in a yield benchmark. Even faster than linux on the same hardware. Some day I'll put open solaris on so I have something else to compare to.
Separate from the other switch benchmarks I've been working on reimplementing amd64's context switching routine almost entirely in C. I just wanted to do it because we're putting more complex things in and it was getting hard to find registers but it turns out you can make it much faster too. The yield benchmark is another 10% faster with the C switch routine. Mostly due to enabling more complex checks, like not setting MSR_FSBASE/GSBASE if they haven't changed, and getting uncommon code out of the fast path. | | Friday, March 7th, 2008 | | 12:22 am |
More sysbench noise. http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/ Nick Piggin has been doing some benchmarking of recent linux kernels and FreeBSD 7.0 on a 2xquad core barcelona opteron. He verified that the CFS problems seem to be fixed and FreeBSD's performance on this box with mysql is really very similar up to about 20 threads. I feel confident that the test was conducted fairly and I'm happy with these results. Our stable release is doing very well even if fresh-out-of-git linux is showing better on this platform. We already have some good gains in this workload in 8.0-CURRENT as well. What's most important to me is that we stay relevant on common server hardware and we're doing a good job at that. I'm also happy to see some collaboration and competition between linux and bsd kernel developers. I hope that continues. We're really more alike than we are different. Next up, we now have a 16 way xeon and 16 way opteron system to tune and test with. More points of contention are being removed. The code marches on. | | Tuesday, February 26th, 2008 | | 9:46 pm |
| | Thursday, February 21st, 2008 | | 12:27 am |
lung tech. I, like many nerds and athletes before me, have suffered from asthma and lung problems for almost the entirety of my life. I don't have the blue-in-the-face, bronchial-spasm, send-me-to-the-hospital variety. Rather, I have a seemingly constant irritation and periodic, primarily exercise-induced, restriction of my airways that mostly just slows me down. This is actually caused by a poor immune system reaction to airborn allergins. Exercise triggers attacks because as much as 10x more air is moving over your lungs so they're likely to get 10x as irritated. In any event, this hadn't been much of a problem for me in seattle, except when I lived in a very moldy old house. However, after moving to hawaii something started really bothering me. My training started out great but after a virus I found myself unable to significantly exert myself for longer than 5 minutes or so. I went to the Dr but wasn't satisfied with their diagnosis so I bought myself a peak flow meter, blood oximeter and a few other gadgets. and so the nerding began. The peak flow meter is really the most interesting. This measures, in liters/minute, how rapidly you can force air through a constrained passage. It's just a tube you blow in with a column and a gauge. For someone of my height and age a 'normal' peak flow rate would be around 600 l/m. My actual measured flow rate very regular at 675 l/m, so 112% above predicted, not bad! However, 5 minutes of vigorous exercise on a stationary bike and that had dropped 20% to 540. A 20% reduction in the rate your lungs move air is enough to perceive as constricted and tight. Interestingly I'd still be considered in a healthy flow range, and indeed I could rest and talk and walk just fine, I just couldn't ride my damned bike. The blood oximeter also showed a 5% drop in blood oxygen saturation during the constriction. Armed with these findings I asked for a maintenance drug, advair, which has a corticosteroid to reduce inflammation. And indeed, 3 days after starting this treatment, my peak flow now measures around 775, or 13% better. This would be better than the average flow rate for a 6'8" male. And hopefully now after missing the first two races of the season, my training can begin again in earnest. And the moral of the story is; You can never have enough gadgets. | | Thursday, February 14th, 2008 | | 11:45 am |
Using inlines to reduce code duplication I recently was able to use a neat trick in my scheduler code that I thought I'd share. It might be old news to many of you and it doesn't come up a lot but it's useful when it does. The basic notion is that you can use inlines with const arguments to create a sort of parameterized function with no duplicated code post-compile. ( Read more... ) | | Sunday, January 27th, 2008 | | 4:55 pm |
cookies Last night, in remembrance of Rich Steven, I made his ultimate chocolate chip cookies: http://www.kohala.com/start/recipes/ultimatecookie.htmlI do this every couple of years and I have to say the closer you follow the directions the better they come out. I used my wife's kitchenaid mixer and they were the best yet. | | Friday, January 25th, 2008 | | 2:05 am |
non-uniform cpu scheduling. After a month sick with some random virus I'm finally starting to feel normal and get some work done again. I spent some time implementing a long standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions. Modern multi-core processors are non-uniform from a variety of perspectives. For example, the barcelona has a shared L3 among all cores on a package. So if you have two packages and you're placing two threads on cores within one package you're wasting half your cache. To take advantage of this knowledge the scheduler needs detailed information about the cpu layout and it needs to intelligently act on that information. ( Read more... ) | | Thursday, December 6th, 2007 | | 11:49 pm |
I recently collaborated with Kip Macy to mostly rewrite FreeBSD's lock profiling facility. This provides a rich set of statistics about lock acquisition and contention that is instrumental in continuing to refactor the locking in the kernel. Statistics include: max hold time, total hold time, total wait time, number of acquisitions, average hold time, average wait time, and number of times contended. It's a little tricky because these statistics are not kept on a per-lock basis, but rather, per (file, line, lock name) triple.
This means you can readily identify not just which locks are problematic but which source files are causing the problems. Issues of high latency or coarse locking readily stand out. Unfortunately, all of these statistics are quite expensive to gather. At the moment common kernel-heavy workloads slow down to about 1/5th speed. Before the rewrite it was 1/10th! The overhead is entirely due to the time keeping functions which must be called for every acquisition and release.
The goal of the rewrite was to better support shared locks. Previously some data was kept in each lock and it was assumed the locker had exclusive access to that data to update it. We changed it to have a notion of ownership records instead, so each lock ownership adds a small pre-allocated structure to a per-thread list. This structure tracks timing and contention information for this specific instance.
When the lock is released we aggregate the information into a structure that is associated with the (file, line, lock name) triple. This is found via hash lookup. I changed the hash table to be per-cpu which removed an array of locks we used to protect it before. This makes displaying statistics much more complicated because each record must be merged with any records for the same triple that may exist on another cpu. However, this is responsible for the 2x speedup.
The remainder of the overhead will go away once multiprocessor systems have reliable, synchronized time-stamp counters (TSC). This is an extremely cheap time source, on the order of dozens of cycles, compared to the hundreds of cycles to access a global system clock that you must use for reliable cross-processor timing information today. | | Saturday, November 24th, 2007 | | 2:39 pm |
Dear SQL language "designers",
Please report to my office to receive your beating.
Thanks, Management | | Tuesday, November 6th, 2007 | | 1:50 pm |
Here's my horrible hack for the day. I have two identical lcd displays that I've been using independently on different computers. I wanted to join them into one display on my main development box so I bought a dual head ATI X1300 pro card. Turns out the ati video support in x windows isn't as good as I thought. The new radeonhd driver has only been in development for a few weeks and only supports VGA connectors and clones the image to both screens. Fortunately for me all of the registers were being programed for the second CRTC (sort of like a display driver). I hacked it up to make sort of a virtual desktop area and then pointed the second CRTC at the second half of the frame buffer. Shockingly it worked with only a couple hours of hacking. To X it still looks like a virtual desktop so it tried to scroll around until I removed that with a hammer. Patch is here: http://people.freebsd.org/~jeff/rhd-multihead.diffThere's no bounds checking to make sure you actually have enough video memory. It also only works if both devices are the same resolution and display depth. Still it's not too bad for my first time working on a video driver. I'm also happy to support ATI/AMD for releasing their docs. | | Saturday, November 3rd, 2007 | | 4:34 pm |
Some scheduler updates; Long ago I got rid of slice size adjustment to facilitate different cpu allocation based on nice. I've now brought it back for a different purpose. To reduce latency for timesharing threads when there is significant load on the run-queue I now turn down the allowed slice size. Each CPU now keeps track of the sum of the interactive scores of all threads on the run-queue. This is better than a simple load count since it takes into consideration the likely runtime for each thread.
I also found that the larger default slice size in ULE actually pessimizes some workloads. For example, parallel buildworlds. I hypothesize that allowing a compiler to run for too long without allowing make or similar to run reduces the amount of potential concurrency since new jobs can't be scheduled. It's just a theory however, hard to directly measure, but cutting the slice size from ~100ms to ~50ms actually yielded a ~3% perf improvement on a parallel buildworld. Surprising!
ULE will not be the default scheduler in 7.0 but is a selectable option. It is the default in -CURRENT and will be for 7.1. |
[ << Previous 20 ]
|