|Wednesday, March 12th, 2008|
A couple of bits of news: we tracked down our problem with the performance drop above 30 threads on Nick Piggin's mysql benchmark to conservative settings for the pthread adaptive spinning. We see a big gain relative to where we were before. Frankly, at this point we're splitting hairs with linux and I don't really care where we stand. We had a tremendous problem and we resolved it. Time to move on.
I removed kernel support for our M:N threading library last night. 8.0 will only support 1:1. This will open the way to a lot of optimizations in the signal and sleeping paths, hopefully reducing the total number of locks required in the sleepq path to a minimum.
There are some 'interesting' pipe benchmarks floating around. You can read about them on the lkml and the author's website: http://188.8.131.52/PipeBench/ and http://lkml.org/lkml/2008/3/5/61
I say 'interesting' of course because FreeBSD is doing way better than linux. ;) Pipes, the next battleground? I don't know, but it's worth a read anyway.
I also have a patch to implement cpu affinity for our callout mechanism. This is for time-based callbacks. The legacy callouts may have order dependencies or may not tolerate concurrency, so by default they are all scheduled on the first callout thread. There is one callout thread per cpu and each has a kind of 'medium' affinity for that cpu. However, if a cpu is overloaded by some interrupt work, another cpu can complete its callouts. This removes the need to do any kind of load balancing across callout handlers because the scheduler can do a better job anyway. New callouts can specify any cpu when setting a timer, and then they have an affinity for that thread until a different cpu is requested. All migration is explicit.
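A rough sketch of the binding rule described above, under stated assumptions: `struct timer`, `timer_arm`, and `timer_arm_on` are hypothetical names invented for illustration and are not the kernel's callout API, and the per-cpu `pending` counters stand in for the real per-cpu callout queues. It only shows that a timer defaults to the first cpu, stays where it was last placed, and moves only when a cpu is named explicitly.

```c
#include <assert.h>

#define NCPU 4  /* assumption: a 4-cpu machine */

/* Hypothetical timer record; not the kernel's struct callout. */
struct timer {
    int cpu;      /* callout thread this timer is bound to */
    int armed;    /* nonzero while queued */
    int expires;  /* absolute tick at which it should fire */
};

/* Pending timers per cpu, standing in for per-cpu callout queues. */
static int pending[NCPU];

static void
timer_init(struct timer *t)
{
    t->cpu = 0;   /* legacy default: the first callout thread */
    t->armed = 0;
    t->expires = 0;
}

/* Arm (or re-arm) on whatever cpu the timer is already bound to. */
static void
timer_arm(struct timer *t, int now, int ticks)
{
    if (t->armed)
        pending[t->cpu]--;
    t->expires = now + ticks;
    t->armed = 1;
    pending[t->cpu]++;
}

/* Arm on a named cpu; this is the only place a timer migrates. */
static void
timer_arm_on(struct timer *t, int now, int ticks, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    if (t->armed) {
        pending[t->cpu]--;
        t->armed = 0;
    }
    t->cpu = cpu;
    timer_arm(t, now, ticks);
}
```

A timer armed with `timer_arm_on(&t, now, 5, 2)` and later re-armed with plain `timer_arm(&t, now, 5)` stays on cpu 2, which is the "affinity until a different cpu is requested" behavior.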
Hopefully having callout affinity will benefit our tcp stack, where Robert Watson is experimenting with different kinds of affinity for tcp sessions. It will also discourage migration of threads that are sleeping on time-based events like select and nanosleep().
|Monday, March 10th, 2008|
|ULE happenings, context switch surprise.
Lately I've been able to spend a bunch of time on ULE thanks to Nokia. They use it in one of their networking products. I've been doing all of this work in 8.0-CURRENT and backporting it for them at the same time. It's a great model for both parties because users on -CURRENT shake out bugs that they'd have to find in testing otherwise and we get new development paid for.
I finished and committed the topology aware cpu scheduling that I discussed in earlier posts. I also implemented a mechanism for CPU provisioning that you can use to restrict groups of processes to sets of cpus which can be dynamically migrated. This will be useful for restricting jails to certain CPUs or dedicating some CPUs to real-time special-purpose tasks for example.
Over the last few days I cleaned up my cpu switch optimizations and got those in. The result is 25% faster context switching in a yield benchmark, even faster than linux on the same hardware. Some day I'll put open solaris on so I have something else to compare to.
Separate from the other switch benchmarks, I've been working on reimplementing amd64's context switching routine almost entirely in C. I just wanted to do it because we're putting more complex things in and it was getting hard to find registers, but it turns out you can make it much faster too. The yield benchmark is another 10% faster with the C switch routine. This is mostly due to enabling more complex checks, like not setting MSR_FSBASE/GSBASE if they haven't changed, and getting uncommon code out of the fast path.
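A user-space sketch of the "don't set it if it hasn't changed" check. Everything here is a stand-in invented for illustration: `struct thread_state`, the write counter, and `fake_wrmsr` simulate what a kernel would do with a real `wrmsr` instruction, and this is not the actual FreeBSD switch routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread saved state. */
struct thread_state {
    uint64_t fsbase;
    uint64_t gsbase;
};

/* Simulated hardware registers and a count of (expensive) writes. */
static uint64_t hw_fsbase, hw_gsbase;
static unsigned long msr_writes;

/* Stand-in for wrmsr; a kernel would write the MSR here. */
static void
fake_wrmsr(uint64_t *msr, uint64_t value)
{
    *msr = value;
    msr_writes++;
}

/*
 * Switch from 'from' to 'to'. Assumes the hardware currently holds the
 * outgoing thread's values, so comparing the two saved copies is enough
 * to decide whether a write is needed.
 */
static void
switch_bases(const struct thread_state *from, const struct thread_state *to)
{
    assert(from != NULL && to != NULL);
    if (to->fsbase != from->fsbase)
        fake_wrmsr(&hw_fsbase, to->fsbase);
    if (to->gsbase != from->gsbase)
        fake_wrmsr(&hw_gsbase, to->gsbase);
}
```

When two threads share the same base values, as is common for threads that never set them, the switch performs no writes at all.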
|Friday, March 7th, 2008|
|More sysbench noise.
Nick Piggin has been doing some benchmarking of recent linux kernels and FreeBSD 7.0 on a 2xquad core barcelona opteron. He verified that the CFS problems seem to be fixed and FreeBSD's performance on this box with mysql is really very similar up to about 20 threads. I feel confident that the test was conducted fairly and I'm happy with these results. Our stable release is doing very well even if fresh-out-of-git linux is showing better on this platform. We already have some good gains in this workload in 8.0-CURRENT as well. What's most important to me is that we stay relevant on common server hardware and we're doing a good job at that.
I'm also happy to see some collaboration and competition between linux and bsd kernel developers. I hope that continues. We're really more alike than we are different.
Next up, we now have a 16 way xeon and 16 way opteron system to tune and test with. More points of contention are being removed. The code marches on.
|Tuesday, February 26th, 2008|
|Thursday, February 21st, 2008|
I, like many nerds and athletes before me, have suffered from asthma and lung problems for almost the entirety of my life. I don't have the blue-in-the-face, bronchial-spasm, send-me-to-the-hospital variety. Rather, I have a seemingly constant irritation and periodic, primarily exercise-induced, restriction of my airways that mostly just slows me down. This is actually caused by a poor immune system reaction to airborne allergens. Exercise triggers attacks because as much as 10x more air is moving over your lungs, so they're likely to get 10x as irritated.
In any event, this hadn't been much of a problem for me in seattle, except when I lived in a very moldy old house. However, after moving to hawaii something started really bothering me. My training started out great, but after a virus I found myself unable to significantly exert myself for longer than 5 minutes or so. I went to the doctor but wasn't satisfied with their diagnosis, so I bought myself a peak flow meter, blood oximeter and a few other gadgets. And so the nerding began.
The peak flow meter is really the most interesting. This measures, in liters/minute, how rapidly you can force air through a constrained passage. It's just a tube you blow in with a column and a gauge. For someone of my height and age a 'normal' peak flow rate would be around 600 l/m. My actual measured flow rate was very regular at 675 l/m, so about 112% of predicted, not bad! However, 5 minutes of vigorous exercise on a stationary bike and that had dropped 20% to 540. A 20% reduction in the rate your lungs move air is enough to perceive as constricted and tight. Interestingly I'd still be considered in a healthy flow range, and indeed I could rest and talk and walk just fine, I just couldn't ride my damned bike. The blood oximeter also showed a 5% drop in blood oxygen saturation during the constriction.
Armed with these findings I asked for a maintenance drug, advair, which has a corticosteroid to reduce inflammation. And indeed, 3 days after starting this treatment, my peak flow now measures around 775, or about 15% better than the 675 I started from. This would be better than the average flow rate for a 6'8" male. And hopefully now, after missing the first two races of the season, my training can begin again in earnest.
And the moral of the story is: you can never have enough gadgets.
|Thursday, February 14th, 2008|
|Using inlines to reduce code duplication
I recently was able to use a neat trick in my scheduler code that I thought I'd share. It might be old news to many of you and it doesn't come up a lot but it's useful when it does. The basic notion is that you can use inlines with const arguments to create a sort of parameterized function with no duplicated code post-compile.
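A minimal sketch of the idea, not the actual scheduler code: `pick_common`, `pick_lowest`, and `pick_highest` are hypothetical names. One `static inline` worker takes a flag, and each thin wrapper passes it as a literal constant. Once the worker is inlined, an optimizing compiler can fold the constant and drop the branch that doesn't apply, so the source has one body while each wrapper compiles to its own specialized code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Shared worker. 'want_max' is always a literal at the call sites below,
 * so after inlining the compiler can remove the unused branch.
 */
static inline int
pick_common(const int *prio, size_t n, const int want_max)
{
    size_t i;
    int best;

    assert(n > 0);
    best = prio[0];
    for (i = 1; i < n; i++) {
        if (want_max) {
            if (prio[i] > best)
                best = prio[i];
        } else {
            if (prio[i] < best)
                best = prio[i];
        }
    }
    return (best);
}

/* Thin wrappers: one source body, two specialized functions. */
static int
pick_lowest(const int *prio, size_t n)
{
    return (pick_common(prio, n, 0));
}

static int
pick_highest(const int *prio, size_t n)
{
    return (pick_common(prio, n, 1));
}
```

Whether the branch really disappears depends on the compiler and optimization level; inspecting the generated assembly is the way to confirm it.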
|Sunday, January 27th, 2008|
Last night, in remembrance of Rich Stevens, I made his ultimate chocolate chip cookies: http://www.kohala.com/start/recipes/ultimatecookie.html
I do this every couple of years and I have to say the closer you follow the directions the better they come out. I used my wife's kitchenaid mixer and they were the best yet.
|Friday, January 25th, 2008|
|non-uniform cpu scheduling.
After a month sick with some random virus I'm finally starting to feel normal and get some work done again. I spent some time implementing a long standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions. Modern multi-core processors are non-uniform from a variety of perspectives. For example, the barcelona has a shared L3 among all cores on a package. So if you have two packages and you're placing two threads on cores within one package you're wasting half your cache. To take advantage of this knowledge the scheduler needs detailed information about the cpu layout and it needs to intelligently act on that information.
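To make the two-package example concrete, here is a toy placement rule: prefer the least loaded package, then the least loaded core inside it. This is an illustration under assumptions (two packages of four cores, a flat `cpu_load` array, hypothetical function names), not the topology code in ULE.

```c
#include <assert.h>

#define NPKG          2  /* assumption: two packages */
#define CORES_PER_PKG 4  /* assumption: four cores sharing one L3 */
#define NCPU          (NPKG * CORES_PER_PKG)

/* Runnable threads per cpu; cpus 0-3 are package 0, cpus 4-7 package 1. */
static int cpu_load[NCPU];

static int
pkg_load(int pkg)
{
    int i, sum = 0;

    assert(pkg >= 0 && pkg < NPKG);
    for (i = 0; i < CORES_PER_PKG; i++)
        sum += cpu_load[pkg * CORES_PER_PKG + i];
    return (sum);
}

/*
 * Pick the least loaded package first, then the least loaded core in it.
 * Ties go to the lowest index.
 */
static int
pick_cpu(void)
{
    int pkg, best_pkg = 0, i, cpu, best_cpu;

    for (pkg = 1; pkg < NPKG; pkg++)
        if (pkg_load(pkg) < pkg_load(best_pkg))
            best_pkg = pkg;
    best_cpu = best_pkg * CORES_PER_PKG;
    for (i = 1; i < CORES_PER_PKG; i++) {
        cpu = best_pkg * CORES_PER_PKG + i;
        if (cpu_load[cpu] < cpu_load[best_cpu])
            best_cpu = cpu;
    }
    return (best_cpu);
}
```

With one thread already on cpu 0, the next placement lands on cpu 4 in the other package, so both L3 caches get used instead of two threads sharing one.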
|Thursday, December 6th, 2007|
I recently collaborated with Kip Macy to mostly rewrite FreeBSD's lock profiling facility. This provides a rich set of statistics about lock acquisition and contention that is instrumental in continuing to refactor the locking in the kernel. Statistics include: max hold time, total hold time, total wait time, number of acquisitions, average hold time, average wait time, and number of times contended. It's a little tricky because these statistics are not kept on a per-lock basis, but rather, per (file, line, lock name) triple.
This means you can readily identify not just which locks are problematic but which source files are causing the problems. Issues of high latency or coarse locking readily stand out. Unfortunately, all of these statistics are quite expensive to gather. At the moment common kernel-heavy workloads slow down to about 1/5th speed. Before the rewrite it was 1/10th! The overhead is entirely due to the time keeping functions which must be called for every acquisition and release.
The goal of the rewrite was to better support shared locks. Previously some data was kept in each lock and it was assumed the locker had exclusive access to that data to update it. We changed it to have a notion of ownership records instead, so each lock ownership adds a small pre-allocated structure to a per-thread list. This structure tracks timing and contention information for this specific instance.
When the lock is released we aggregate the information into a structure that is associated with the (file, line, lock name) triple. This is found via hash lookup. I changed the hash table to be per-cpu which removed an array of locks we used to protect it before. This makes displaying statistics much more complicated because each record must be merged with any records for the same triple that may exist on another cpu. However, this is responsible for the 2x speedup.
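The per-cpu aggregation and the merge at display time might look roughly like the sketch below. It is a simplified user-space model with hypothetical names (`lp_release`, `lp_merge`), a tiny fixed-size table with linear probing, and only three of the statistics. The real facility differs in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU 2  /* assumption: two cpus */
#define NREC 8  /* slots per cpu; a real table would be larger */

struct lp_key { const char *file; int line; const char *name; };

struct lp_rec {
    struct lp_key key;
    uint64_t hold_total, hold_max, acquisitions;
    int used;
};

/* One table per cpu, so updates need no lock around the table. */
static struct lp_rec table[NCPU][NREC];

static int
lp_key_eq(const struct lp_key *a, const struct lp_key *b)
{
    return (a->line == b->line && strcmp(a->file, b->file) == 0 &&
        strcmp(a->name, b->name) == 0);
}

static unsigned
lp_hash(const struct lp_key *k)
{
    unsigned h = (unsigned)k->line;
    const char *p;

    for (p = k->file; *p != '\0'; p++)
        h = h * 31 + (unsigned char)*p;
    for (p = k->name; *p != '\0'; p++)
        h = h * 31 + (unsigned char)*p;
    return (h % NREC);
}

/* Find or create this cpu's record for the triple (linear probing). */
static struct lp_rec *
lp_lookup(int cpu, const struct lp_key *k)
{
    unsigned i, slot = lp_hash(k);

    for (i = 0; i < NREC; i++) {
        struct lp_rec *r = &table[cpu][(slot + i) % NREC];
        if (!r->used) {
            r->used = 1;
            r->key = *k;
            return (r);
        }
        if (lp_key_eq(&r->key, k))
            return (r);
    }
    return (NULL);  /* table full */
}

/* Called at release time on the releasing cpu. */
static void
lp_release(int cpu, const struct lp_key *k, uint64_t held)
{
    struct lp_rec *r = lp_lookup(cpu, k);

    assert(r != NULL);
    r->hold_total += held;
    if (held > r->hold_max)
        r->hold_max = held;
    r->acquisitions++;
}

/* Display path: fold every cpu's record for the triple into one. */
static struct lp_rec
lp_merge(const struct lp_key *k)
{
    struct lp_rec sum;
    int cpu, i;

    memset(&sum, 0, sizeof(sum));
    sum.key = *k;
    for (cpu = 0; cpu < NCPU; cpu++)
        for (i = 0; i < NREC; i++) {
            const struct lp_rec *r = &table[cpu][i];
            if (!r->used || !lp_key_eq(&r->key, k))
                continue;
            sum.hold_total += r->hold_total;
            sum.acquisitions += r->acquisitions;
            if (r->hold_max > sum.hold_max)
                sum.hold_max = r->hold_max;
        }
    return (sum);
}
```

The trade is visible in the shape of the code: `lp_release` touches only its own cpu's table, while `lp_merge` has to walk all of them.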
The remainder of the overhead will go away once multiprocessor systems have reliable, synchronized time-stamp counters (TSC). This is an extremely cheap time source, on the order of dozens of cycles, compared to the hundreds of cycles to access a global system clock that you must use for reliable cross-processor timing information today.
|Saturday, November 24th, 2007|
Dear SQL language "designers",
Please report to my office to receive your beating.
|Tuesday, November 6th, 2007|
Here's my horrible hack for the day. I have two identical lcd displays that I've been using independently on different computers. I wanted to join them into one display on my main development box so I bought a dual head ATI X1300 pro card. Turns out the ati video support in x windows isn't as good as I thought.
The new radeonhd driver has only been in development for a few weeks and only supports VGA connectors and clones the image to both screens. Fortunately for me all of the registers were being programmed for the second CRTC (sort of like a display driver). I hacked it up to make sort of a virtual desktop area and then pointed the second CRTC at the second half of the frame buffer. Shockingly it worked with only a couple hours of hacking. To X it still looks like a virtual desktop so it tried to scroll around until I removed that with a hammer.
Patch is here: http://people.freebsd.org/~jeff/rhd-multihead.diff
There's no bounds checking to make sure you actually have enough video memory. It also only works if both devices are the same resolution and display depth. Still it's not too bad for my first time working on a video driver. I'm also happy to support ATI/AMD for releasing their docs.
|Saturday, November 3rd, 2007|
Some scheduler updates: long ago I got rid of slice size adjustment to facilitate different cpu allocation based on nice. I've now brought it back for a different purpose. To reduce latency for timesharing threads when there is significant load on the run-queue I now turn down the allowed slice size. Each CPU now keeps track of the sum of the interactive scores of all threads on the run-queue. This is better than a simple load count since it takes into consideration the likely runtime for each thread.
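The post doesn't give the formula, so the following is only one plausible shape for a load-dependent slice, with every constant and name an assumption: scores are taken to run from 0 to 100 with higher meaning more cpu-bound, and the slice shrinks in proportion to the summed score until it reaches a floor.

```c
#include <assert.h>

#define SLICE_MAX_MS 100  /* assumption: full slice with light load */
#define SLICE_MIN_MS 10   /* assumption: floor to bound switch overhead */
#define SCORE_UNIT   100  /* assumption: score of one fully cpu-bound thread */

/*
 * Slice for a cpu whose run-queue threads have interactivity scores
 * summing to 'score_sum'. Up to one cpu-bound thread's worth of score
 * the slice stays at its maximum; beyond that it shrinks in proportion.
 */
static int
slice_ms(unsigned score_sum)
{
    unsigned weight = score_sum > SCORE_UNIT ? score_sum : SCORE_UNIT;
    int ms = (int)(SLICE_MAX_MS * SCORE_UNIT / weight);

    if (ms < SLICE_MIN_MS)
        ms = SLICE_MIN_MS;
    assert(ms <= SLICE_MAX_MS);
    return (ms);
}
```

Under these assumptions a queue of mostly interactive threads keeps the full slice even when it is long, which is the advantage over a plain thread count.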
I also found that the larger default slice size in ULE actually pessimizes some workloads. For example, parallel buildworlds. I hypothesize that allowing a compiler to run for too long without allowing make or similar to run reduces the amount of potential concurrency since new jobs can't be scheduled. It's just a theory however, hard to directly measure, but cutting the slice size from ~100ms to ~50ms actually yielded a ~3% perf improvement on a parallel buildworld. Surprising!
ULE will not be the default scheduler in 7.0 but is a selectable option. It is the default in -CURRENT and will be for 7.1.
|Sunday, October 14th, 2007|
We made it to Maui. I forgot how many stars there are in the sky. We rented a nice house on a good size lot with 10 or so different types of fruit trees and a view of the ocean. It's even walking distance to many restaurants and fairly bike friendly. We couldn't be happier.
I will still be mostly unavailable for the next two weeks as our things are crossing the pacific. Then I will finish setting up my 8-core Xeon with Solaris, Linux, and FreeBSD for a bit of scalability testing.
|Saturday, October 6th, 2007|
Just a quick post about mysql scaling. I upgraded my mysql installation to 5.0.45 and my linux kernel to 2.6.22-rc7. The linux scaling problem is resolved in this setup. I didn't test 2.6.22-rc7 with 5.0.33 again to see if it was the kernel or the database. I just did it on a whim. I've been too busy lately to continue to pursue this. Also the system time in 5.0.45 has dropped by half so they must be doing quite a lot fewer syscalls.
iXsystems has given us access to a 4x4 Xeon to experiment with however. So we're coming up with some new CPU selection algorithms for these big quad core machines. It should give us a boost in perf at the low end. I may end up with a tunable that is power vs speed to control how load is distributed among the cores to save power or give the best perf.
|Wednesday, October 3rd, 2007|
The forecast for my last 10 days in seattle? Rain and 55F of course.
|Thursday, September 20th, 2007|
|More subjective scheduler tests.
So I've just installed fedora core 7 on my laptop so I can do a side by side comparison of scheduler performance. The test I like to do with ULE is to make -j64 kernel while playing a dvd with mplayer and browsing the web. I have verified that make -j64 in the base directory of the linux kernel gets the load average up to the same level. The actual source involved matters little. Both systems are using a gcc 4.x compiler. The laptop is an IBM T42 with a 1.8ghz Pentium M, 2 gigabytes of ram, and a 7200 rpm drive.
With FreeBSD 7.0 configured without the debugging options INVARIANTS and WITNESS, but with SCHED_ULE, I get no skipping or glitching of any kind.
With the O(1) scheduler in the default fedora core 7 kernel I get intermittent skipping but it's generally pretty tolerable.
I installed a 2.6.23-pre7 kernel built with 'make defconfig'. Feel free to suggest configuration options that may be relevant. At -j64 the ui was basically unusable. At -j4 it skipped more than the older kernel using -j64. It is my understanding that this is using the CFS scheduler.
I must say that in general the linux experience was very good. The installer was easy enough, although neither it nor the installed system seemed to detect my atheros wireless card. I didn't care to futz with it so I just plugged it into my switch directly. It sure looked pretty otherwise. Maybe I'll try PC-BSD when they update to 7.0.
The issue with CFS is that the simple algorithm works very well only as long as your interactive tasks consume less cpu than their fair share. So for example mplayer takes 10% of the cpu playing a dvd on my machine. If you have 9 cpu hogs and mplayer running, each of the 10 tasks is entitled to 10%, so without any information other than runtime or %cpu you can't distinguish mplayer from the hogs.
Whether this is an important workload or not is subject to debate.
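The arithmetic behind the mplayer example can be written down in a few lines. This models a scheduler that knows nothing but cpu usage and is not a description of how CFS is implemented. The function names are made up for the illustration.

```c
#include <assert.h>

/* Equal split of one cpu among all runnable tasks, in percent. */
static int
fair_share_pct(int nrunnable)
{
    assert(nrunnable > 0);
    return (100 / nrunnable);
}

/*
 * A scheduler that only sees runtime can favor a task only while the
 * task wants less than its fair share. Returns 1 when the task stands
 * out as light, 0 when it looks the same as a cpu hog.
 */
static int
stands_out(int demand_pct, int nrunnable)
{
    return (demand_pct < fair_share_pct(nrunnable));
}
```

With mplayer at 10% and 3 hogs, the fair share is 25% and mplayer stands out. With 9 hogs the fair share drops to 10%, mplayer's demand equals it, and usage alone no longer separates it from the compilers.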
|Tuesday, September 18th, 2007|
A number of people called bs on my last post, some publicly, some privately. So I thought I'd explain why this is possible and how the system really responds under load in my own tests.
|Friday, September 14th, 2007|
"Linux, even with CFS, it's still fairly easy to 'upset' it by just producing a fairly large (2-4) amount of load. Solaris did notably better. While it seemed to have a few quirks with scheduling in general, it could sustain a load of around 8-12 without producing video/audio frame drops. FreeBSD, with the experimental SCHED_ULE 2.0 scheduler (as of March 2007) could sustain a load of over 80 with no problems, frame drops, or even glxgears slowing down to a complete crawl"
|Wednesday, September 5th, 2007|
Netapp is suing sun for what seems to be blatant patent infringement in ZFS. WAFL, Netapp's filesystem uses a novel method for consistency using checkpoints that sun directly copied. Allegedly, this was instigated by sun trying to collect royalties on other patents. If this is true, I'm glad it's biting them now. You can't position yourself as an open and friendly player and then start suing companies to generate revenue. Especially when your hands are dirty too.
I was actually a little surprised that sun decided to use WAFL's COW check-pointing system to implement ZFS. It's a great idea but requires a lot of extra metadata writes compared to a traditional filesystem. I believe netapp uses NVRAM to store the block allocation bitmaps to eliminate the extra cost of the io associated with this scheme. However, sun is using it on normal disks so they're just sucking up the extra IOs.