(no subject)

A couple bits of news: we tracked down our problem with the performance drop above 30 threads on Nick Piggin's mysql benchmark to conservative settings for the pthread adaptive spinning. We see a big gain relative to where we were before. Frankly, at this point we're splitting hairs with linux and I don't really care where we stand. We had a tremendous problem and we resolved it. Time to move on.

I removed kernel support for our M:N threading library last night. 8.0 will only support 1:1 threading. This opens the way to a lot of optimizations in the signal and sleeping paths, hopefully reducing the total number of locks required in the sleepq path to a minimum.

There are some 'interesting' pipe benchmarks floating around. You can read about them on the lkml and the author's website:

http://213.148.29.37/PipeBench/
http://lkml.org/lkml/2008/3/5/61

I say 'interesting' of course because FreeBSD is doing way better than linux. ;) Pipes, the next battleground? I don't know, but it's worth a read anyway.

I also have a patch to implement cpu affinity for our callout mechanism, which handles time-based callbacks. The legacy callouts may have order dependencies or may not tolerate concurrency, so by default they are all scheduled on the first callout thread. There is one callout thread per cpu, and each has a kind of 'medium' affinity for that cpu; however, if it is overloaded by interrupt work, another cpu can complete the callouts. This removes the need to do any kind of load balancing across callout handlers because the scheduler can do a better job anyway. New callouts can specify any cpu when setting a timer, and they then have an affinity for that thread until a different cpu is requested. All migration is explicit.
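As a rough illustration, here's what a consumer might look like. This is a sketch, assuming an interface along the lines of callout_reset_on() that takes an explicit cpu argument; the names and exact signature here are illustrative, not the committed API.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/callout.h>

/*
 * Hypothetical periodic tick pinned to a chosen cpu.  callout_reset_on()
 * is assumed here as the variant taking an explicit cpu; once set, the
 * callout keeps its affinity until a different cpu is requested.
 */
static struct callout stat_callout;

static void
stat_tick(void *arg)
{
        int cpu = (int)(uintptr_t)arg;

        /* ... per-cpu statistics work ... */

        /* Re-arm on the same cpu; migration only happens if we ask for it. */
        callout_reset_on(&stat_callout, hz, stat_tick, arg, cpu);
}

static void
stat_start(int cpu)
{
        callout_init(&stat_callout, 1);         /* MP-safe handler */
        callout_reset_on(&stat_callout, hz, stat_tick,
            (void *)(uintptr_t)cpu, cpu);
}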

Hopefully having callout affinity will benefit our tcp stack, where Robert Watson is experimenting with different kinds of affinity for tcp sessions. It will also discourage migration of threads that are sleeping on time-based events like select() and nanosleep().

ULE happenings, context switch surprise.

Lately I've been able to spend a bunch of time on ULE thanks to Nokia. They use it in one of their networking products. I've been doing all of this work in 8.0-CURRENT and backporting it for them at the same time. It's a great model for both parties: users on -CURRENT shake out bugs that they'd otherwise have to find in testing, and we get new development paid for.

I finished and committed the topology aware cpu scheduling that I discussed in earlier posts. I also implemented a mechanism for CPU provisioning that you can use to restrict groups of processes to sets of cpus, and those sets can be changed dynamically. This will be useful for restricting jails to certain CPUs or dedicating some CPUs to real-time, special-purpose tasks, for example.
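For example, something like the following restricts the calling process, and everything it forks, to two cpus. This uses the cpuset_setaffinity(2) interface that grew out of this work; the exact constants and arguments here are taken from my reading of the manual page, so treat it as a sketch rather than gospel.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
        cpuset_t mask;

        /* Build a mask containing cpus 0 and 1. */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);

        /* -1 means "the current process" for CPU_WHICH_PID. */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");

        /* Everything from here on runs only on cpus 0 and 1. */
        return (0);
}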

Over the last few days I cleaned up my cpu switch optimizations and got those in. The result is 25% faster context switching in a yield benchmark, even faster than linux on the same hardware. Some day I'll put OpenSolaris on the box so I have something else to compare to.

Separate from the other switch benchmarks, I've been working on reimplementing amd64's context switching routine almost entirely in C. I originally did it because we're putting more complex things in the switch path and it was getting hard to find registers, but it turns out you can make it much faster too. The yield benchmark is another 10% faster with the C switch routine, mostly due to enabling more complex checks, like not setting MSR_FSBASE/GSBASE if they haven't changed, and getting uncommon code out of the fast path.
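The gist of that particular check, sketched with stand-in types so it isn't tied to the real pcb layout:

#include <stdint.h>

/* Stand-in for the per-thread hardware context; the real amd64 pcb has
 * many more fields and different names. */
struct pcb_sketch {
        uint64_t pcb_fsbase;
        uint64_t pcb_gsbase;
};

/* wrmsr() stand-in so the sketch compiles outside the kernel. */
static void wrmsr_sketch(uint32_t msr, uint64_t val) { (void)msr; (void)val; }

#define MSR_FSBASE      0xc0000100u
#define MSR_GSBASE      0xc0000101u

/*
 * Writing MSR_FSBASE/GSBASE costs hundreds of cycles, so only do it when
 * the incoming thread's base actually differs from the outgoing one's.
 */
static void
update_segment_bases(const struct pcb_sketch *oldpcb,
    const struct pcb_sketch *newpcb)
{
        if (newpcb->pcb_fsbase != oldpcb->pcb_fsbase)
                wrmsr_sketch(MSR_FSBASE, newpcb->pcb_fsbase);
        if (newpcb->pcb_gsbase != oldpcb->pcb_gsbase)
                wrmsr_sketch(MSR_GSBASE, newpcb->pcb_gsbase);
}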

More sysbench noise.

http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/

Nick Piggin has been doing some benchmarking of recent linux kernels and FreeBSD 7.0 on a 2x quad-core barcelona opteron. He verified that the CFS problems seem to be fixed, and FreeBSD 7.0's mysql performance on this box is very similar to linux's up to about 20 threads. I feel confident that the test was conducted fairly and I'm happy with these results. Our stable release is doing very well even if fresh-out-of-git linux is showing better results on this platform. We already have some good gains in this workload in 8.0-CURRENT as well. What's most important to me is that we stay relevant on common server hardware, and we're doing a good job at that.

I'm also happy to see some collaboration and competition between linux and bsd kernel developers. I hope that continues. We're really more alike than we are different.

Next up: we now have a 16-way xeon and a 16-way opteron system to tune and test with. More points of contention are being removed. The code marches on.

lung tech.

I, like many nerds and athletes before me, have suffered from asthma and lung problems for almost the entirety of my life. I don't have the blue-in-the-face, bronchial-spasm, send-me-to-the-hospital variety. Rather, I have a seemingly constant irritation and a periodic, primarily exercise-induced, restriction of my airways that mostly just slows me down. This is caused by a poor immune system reaction to airborne allergens. Exercise triggers attacks because as much as 10x more air is moving over your lungs, so they're likely to get 10x as irritated.

In any event, this hadn't been much of a problem for me in Seattle, except when I lived in a very moldy old house. However, after moving to Hawaii something started really bothering me. My training started out great, but after a virus I found myself unable to significantly exert myself for longer than 5 minutes or so. I went to the doctor but wasn't satisfied with the diagnosis, so I bought myself a peak flow meter, a blood oximeter, and a few other gadgets. And so the nerding began.

The peak flow meter is really the most interesting. It measures, in liters per minute, how rapidly you can force air through a constrained passage. It's just a tube you blow into, with a column and a gauge. For someone of my height and age a 'normal' peak flow rate would be around 600 l/m. My actual measured flow rate was very regular at 675 l/m, about 112% of predicted, not bad! However, after 5 minutes of vigorous exercise on a stationary bike it had dropped 20% to 540. A 20% reduction in the rate your lungs move air is enough to perceive as constricted and tight. Interestingly, I'd still be considered in a healthy flow range, and indeed I could rest and talk and walk just fine; I just couldn't ride my damned bike. The blood oximeter also showed a 5% drop in blood oxygen saturation during the constriction.

Armed with these findings I asked for a maintenance drug, Advair, which includes a corticosteroid to reduce inflammation. And indeed, 3 days after starting this treatment, my peak flow measures around 775, or about 15% better. That would be better than the average flow rate for a 6'8" male. Hopefully now, after missing the first two races of the season, my training can begin again in earnest.

And the moral of the story is: you can never have enough gadgets.

Using inlines to reduce code duplication

I recently was able to use a neat trick in my scheduler code that I thought I'd share. It might be old news to many of you, and it doesn't come up often, but it's useful when it does. The basic notion is that you can use inline functions with const arguments to create a sort of parameterized function with no duplicated code post-compile.
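Here's a small stand-alone illustration of the idea (not the actual scheduler code): one inline body parameterized by a const flag, with thin wrappers passing compile-time constants, so the compiler specializes the inlined body and drops the untaken branch.

#include <stdio.h>

struct counter {
        int value;
};

/*
 * One body, parameterized by a const argument.  Every caller passes a
 * compile-time constant, so after inlining the compiler eliminates the
 * branch entirely; there's no duplicated source and no runtime cost.
 */
static inline void
counter_adjust(struct counter *c, int amount, const int increment)
{
        if (increment)
                c->value += amount;
        else
                c->value -= amount;
}

static void
counter_add(struct counter *c, int amount)
{
        counter_adjust(c, amount, 1);
}

static void
counter_sub(struct counter *c, int amount)
{
        counter_adjust(c, amount, 0);
}

int
main(void)
{
        struct counter c = { 0 };

        counter_add(&c, 5);
        counter_sub(&c, 2);
        printf("%d\n", c.value);        /* prints 3 */
        return (0);
}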


non-uniform cpu scheduling.

After a month sick with some random virus, I'm finally starting to feel normal and get some work done again. I spent some time implementing a long-standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions. Modern multi-core processors are non-uniform from a variety of perspectives. For example, the barcelona has an L3 cache shared among all cores on a package, so if you have two packages and you place two threads on cores within one package, you're wasting half of your total cache. To take advantage of this knowledge, the scheduler needs detailed information about the cpu layout and it needs to act intelligently on that information.
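A rough sketch of the kind of structure I have in mind, with hypothetical names rather than the committed ones: a tree of groups, where each node describes a set of cpus and what they share.

#include <stdint.h>

#define CG_SHARE_NONE   0       /* nothing shared, e.g. separate packages */
#define CG_SHARE_L3     3       /* cores sharing an L3 cache */

struct cpu_group_sketch {
        struct cpu_group_sketch *parent;        /* enclosing group */
        struct cpu_group_sketch *children;      /* array of sub-groups */
        uint64_t                 cpu_mask;      /* cpus in this group */
        int                      nchildren;
        int                      share_level;   /* CG_SHARE_* for this node */
};

With two barcelona packages the root would have two children, each covering four cores at CG_SHARE_L3, so the scheduler can prefer to spread two runnable threads across the two L3 groups instead of stacking them behind one cache.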


(no subject)

I recently collaborated with Kip Macy to mostly rewrite FreeBSD's lock profiling facility. This provides a rich set of statistics about lock acquisition and contention that is instrumental in continuing to refactor the locking in the kernel. Statistics include: max hold time, total hold time, total wait time, number of acquisitions, average hold time, average wait time, and number of times contended. It's a little tricky because these statistics are not kept on a per-lock basis, but rather, per (file, line, lock name) triple.

This means you can readily identify not just which locks are problematic but which source files are causing the problems. Issues of high latency or coarse locking stand out immediately. Unfortunately, all of these statistics are quite expensive to gather. At the moment common kernel-heavy workloads slow down to about 1/5th speed. Before the rewrite it was 1/10th! The overhead is entirely due to the timekeeping functions, which must be called for every acquisition and release.

The goal of the rewrite was to better support shared locks. Previously, some data was kept in each lock, and it was assumed the locker had exclusive access to that data when updating it. We changed it to use a notion of ownership records instead: each lock acquisition adds a small pre-allocated structure to a per-thread list. This structure tracks timing and contention information for that specific instance.
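The ownership record looks roughly like this; the field names are illustrative, not the actual lock_profile code.

#include <sys/queue.h>
#include <stdint.h>

/*
 * Hypothetical shape of an ownership record: pre-allocated and linked onto
 * the owning thread, so even shared lockers never have to write to the
 * lock structure itself.
 */
struct lock_prof_object_sketch {
        LIST_ENTRY(lock_prof_object_sketch) links;      /* per-thread list */
        const void      *lock;          /* lock this record covers */
        const char      *file;          /* acquisition point */
        int              line;
        uint64_t         acqtime;       /* timestamp at acquisition */
        uint64_t         waittime;      /* time spent contending */
};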

When the lock is released we aggregate the information into a structure associated with the (file, line, lock name) triple, which is found via hash lookup. I changed the hash table to be per-cpu, which removed the array of locks we used to protect it before. This makes displaying statistics much more complicated because each record must be merged with any records for the same triple that exist on other cpus. However, this change is responsible for the 2x speedup.
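Conceptually, the aggregation side looks something like this; again, hypothetical names and sizes rather than the real ones.

#include <stdint.h>

#define LPROF_HASH_SIZE 256
#define MAXCPU_SKETCH   32

struct lock_prof_stat_sketch {
        const char      *file;          /* the (file, line, name) key */
        int              line;
        const char      *name;
        uint64_t         hold_total;    /* accumulated hold time */
        uint64_t         wait_total;    /* accumulated wait time */
        uint64_t         count;         /* acquisitions */
        struct lock_prof_stat_sketch *next;     /* hash chain */
};

/*
 * One hash table per cpu: a release only touches the current cpu's table,
 * so no lock protects the fast path.  Reading the statistics walks every
 * cpu's table and sums records sharing the same (file, line, name) key.
 */
static struct lock_prof_stat_sketch *
    lp_hash_sketch[MAXCPU_SKETCH][LPROF_HASH_SIZE];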

The remainder of the overhead will go away once multiprocessor systems have reliable, synchronized time-stamp counters (TSC). The TSC is an extremely cheap time source, on the order of dozens of cycles, compared to the hundreds of cycles it takes to access the global system clock you must use today for reliable cross-processor timing.
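For reference, reading the TSC on x86 is just two registers and no memory traffic, which is why it's so much cheaper than a shared hardware clock:

#include <stdint.h>

/* Read the time-stamp counter; rdtsc returns the low 32 bits in eax and
 * the high 32 bits in edx. */
static inline uint64_t
rdtsc_sketch(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return (((uint64_t)hi << 32) | lo);
}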