The old ULE selection algorithm was based on the notion that you would want to share load evenly across processors. It was relatively simple and produced decent results, depending on the workload. One of the primary concerns, which turns out to be invalid, was that we should not touch a remote CPU's run-queue and should get by without evaluating remote state very much.
This algorithm fell down in various cases. Inspiration came from Solaris, which tries to pick the best priority thread to run at any given time. This attempts to mimic what a UP scheduler would've chosen to run at any given time. It turns out that running things in priority order, even globally, is important for application performance, and the benefit outweighs the extra expense of constantly examining what other cpus are doing.
I'll outline the new algorithm here. This is called to pick the cpu that a thread will run on when it is woken up or rescheduled.
1) If the last cpu we ran on was idle, run there.
2) If the thread ran recently enough and its priority is better than the current thread on the last cpu, run there.
3) If the cpu that woke us up (current cpu) is about to be idle, run here.
4) Check a bitmask of idle cpus; if any cpu is entirely idle, run there.
5) If there are no idle cpus, see if the current cpu is running something with a lower priority, and if so run here.
6) Scan all cpus finding the cpu executing the lowest priority thread and run there.
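The steps above can be sketched in C roughly as follows. This is a simplified model, not the real sched_ule.c code: the struct layouts, helper names, and the AFFINITY_TICKS constant (standing in for the sysctl tunable mentioned below) are all hypothetical. Note that in FreeBSD a lower numeric priority value is better, so "lowest priority thread" means the highest cur_pri.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4
#define AFFINITY_TICKS 10   /* hypothetical stand-in for the sysctl tunable */

struct cpu_state {
	bool idle;          /* cpu is currently idle */
	bool about_to_idle; /* cpu will idle once its current thread blocks */
	int  cur_pri;       /* priority of the running thread; lower is better */
};

struct thread_state {
	int last_cpu;       /* cpu the thread last ran on */
	int last_tick;      /* tick when it last ran */
	int pri;            /* thread priority; lower is better */
};

/* Pick a cpu for a thread being woken on 'cur_cpu' at time 'now'. */
static int
pickcpu(struct cpu_state cpus[NCPU], struct thread_state *td,
    int cur_cpu, int now)
{
	int i, best;

	/* 1) Last cpu is idle: run there and keep affinity for free. */
	if (cpus[td->last_cpu].idle)
		return (td->last_cpu);
	/* 2) Cache still warm and we outrank the thread running there. */
	if (now - td->last_tick < AFFINITY_TICKS &&
	    td->pri < cpus[td->last_cpu].cur_pri)
		return (td->last_cpu);
	/* 3) The waking cpu is about to idle: run here. */
	if (cpus[cur_cpu].about_to_idle)
		return (cur_cpu);
	/* 4) Any entirely idle cpu. */
	for (i = 0; i < NCPU; i++)
		if (cpus[i].idle)
			return (i);
	/* 5) Prefer the current cpu if we outrank its running thread. */
	if (td->pri < cpus[cur_cpu].cur_pri)
		return (cur_cpu);
	/* 6) Otherwise run where the lowest priority thread executes. */
	best = 0;
	for (i = 1; i < NCPU; i++)
		if (cpus[i].cur_pri > cpus[best].cur_pri)
			best = i;
	return (best);
}
```

In the real scheduler step 4 is a cheap bitmask lookup rather than a loop, and the scans must cope with the remote state changing underneath them.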
Steps 1 and 2 implement affinity. The affinity from step 2 is tunable with a sysctl. If a thread hasn't run in N ticks we can assume that it has a cold cache and there is no penalty for running it elsewhere. If it hasn't run in a while, but that cpu is idle, that's as good a place as any for it, and this actually considerably speeds up some tests which are mostly idle.
Step 3 and 5 implement a preference for the current cpu. This is because scheduling it on the current cpu is faster and may actually allow the waking thread and the woken thread to share some information in the cache. Imagine one thread writing to a pipe and waking up the receiver. The written data may still be in cache on the current cpu.
Step 6 is really pretty simple. If there are no idle cpus and no cpu affinity we just find the cpu which is executing the lowest priority task. When we assign to this cpu we may actually send an IPI (Inter-Processor Interrupt) to interrupt the current task and run the new one. This reduces scheduling latency for the highest priority task.
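The IPI decision itself reduces to a priority comparison. A minimal sketch, assuming the same lower-is-better priority convention (the struct and function names are hypothetical, and the real kernel posts the interrupt through machine-dependent code):

```c
#include <assert.h>
#include <stdbool.h>

struct remote_cpu {
	int  cur_pri;     /* priority of the running thread; lower is better */
	bool ipi_pending; /* an IPI has been posted to this cpu */
};

/*
 * Assign a thread of priority 'new_pri' to 'cpu'.  If it outranks the
 * running thread, post an IPI so the cpu reschedules immediately rather
 * than waiting for its next scheduling event; returns true if an IPI
 * was sent.
 */
static bool
assign_and_maybe_ipi(struct remote_cpu *cpu, int new_pri)
{
	if (new_pri < cpu->cur_pri) {
		cpu->ipi_pending = true;
		return (true);
	}
	return (false);
}
```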
One effect of this algorithm is that when all cpus are busy you can get load imbalances. Imagine all other cpus are running a real-time priority task while one is running a task with poor priority. The real-time tasks then keep scheduling successively higher priority tasks, all of which will be assigned to the cpu running the poor-priority task. Once the real-time tasks stop, most of the threads are now assigned to the cpu that was executing lower priority tasks. This can lead to poor utilization due to short-term imbalances and higher execution latency for the remaining tasks.
To solve the load imbalance problem ULE now tracks which CPUs have more load than a configurable 'busy' threshold. When a CPU idles it looks for busy cpus and tries to steal some work from them. We don't want this threshold set too low because it destroys affinity.
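A sketch of the victim-selection half of that stealing logic, with STEAL_THRESH standing in for the configurable 'busy' threshold (the names and the choice of "most loaded" as the tiebreaker are my assumptions, not the kernel's exact policy):

```c
#include <assert.h>

#define NCPU 4
#define STEAL_THRESH 2  /* hypothetical stand-in for the 'busy' sysctl */

/*
 * Called when cpu 'self' goes idle: find a cpu whose run-queue load is
 * at or above the busy threshold to steal work from, or -1 if none is.
 * Setting the threshold too low would steal warm-cache threads away.
 */
static int
find_steal_victim(const int load[NCPU], int self)
{
	int i, victim = -1;

	for (i = 0; i < NCPU; i++) {
		if (i == self || load[i] < STEAL_THRESH)
			continue;
		/* Prefer the most loaded cpu above the threshold. */
		if (victim == -1 || load[i] > load[victim])
			victim = i;
	}
	return (victim);
}
```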
That's the new algorithm in a nutshell. Hopefully with some time I'll be able to tune the various parameters to the point that ULE is a speedup in almost all cases and negligible slowdown in others.