I was never very happy with the double-queue mechanism for ensuring fairness that I borrowed from the Linux O(1) scheduler. Briefly, O(n) schedulers typically iterate over every thread that is waiting to run and slightly boost its priority to ensure that it eventually runs. Think of a thread that has used 100% CPU for some time: it will have a poor priority and be placed at the end of the queue. If many new threads with better priorities now want to run, they will starve the original thread until their priorities also decay. Initially that's what you want, but eventually the original thread should be allowed to run again, and sooner than it takes the other threads' priorities to decay on their own; otherwise the latency is intolerable.
Obviously this algorithm forces us to periodically visit each scheduled thread, and to keep an accurate CPU% estimate we would visit every thread, scheduled or not. To get O(1) scheduling, Linux and ULE 1.0 employed a double-queue mechanism where threads are inserted into one queue and run from another. The queues are swapped when the running queue is empty. Some high-priority threads are always inserted directly into the running queue.
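To make the mechanism concrete, here is a minimal sketch of the two-queue trick. The names and structures are made up for illustration; they are not the actual Linux or ULE data structures, and real run queues hold lists per bucket rather than a single thread id.

```c
#include <assert.h>

#define NPRI 64

/* One bucket per priority; one thread id per bucket for brevity (-1 = empty). */
struct rq {
	int count;		/* threads on this queue */
	int heads[NPRI];
};

struct sched {
	struct rq a, b;
	struct rq *cur;		/* queue we run from */
	struct rq *next;	/* queue we insert into */
};

static void
sched_init(struct sched *s)
{
	s->a.count = s->b.count = 0;
	for (int i = 0; i < NPRI; i++)
		s->a.heads[i] = s->b.heads[i] = -1;
	s->cur = &s->a;
	s->next = &s->b;
}

/* Batch threads go on the next queue; they only run after a swap. */
static void
sched_add_batch(struct sched *s, int tid, int pri)
{
	s->next->heads[pri] = tid;
	s->next->count++;
}

/* Pick the best-priority thread, swapping queues when the current one runs dry. */
static int
sched_choose(struct sched *s)
{
	if (s->cur->count == 0) {
		struct rq *tmp = s->cur;
		s->cur = s->next;
		s->next = tmp;
	}
	for (int pri = 0; pri < NPRI; pri++) {
		if (s->cur->heads[pri] != -1) {
			int tid = s->cur->heads[pri];
			s->cur->heads[pri] = -1;
			s->cur->count--;
			return (tid);
		}
	}
	return (-1);		/* nothing runnable */
}
```

Nothing here ever iterates over waiting threads to age them, which is the whole point; the cost is the round-robin flavor described below.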
This gives us O(1) but also some peculiar scheduling behavior. Firstly, it's very difficult to implement nice correctly. When I evaluated the Linux scheduler years ago, it simply punted on the problem. This is basically because every task that is not always inserted into the active queue (that is, everything but the interactive tasks) ends up being scheduled in a semi-priority-order round-robin. They all get one turn per flip of the queues, and each turn is allocated about the same time quantum. If your priority is better you simply get your quantum sooner, and then you wait for the flip again. I gave nice values some effect by scaling the scheduling quantum with the nice value, but this was more complicated than it sounds.
For fair scheduling, what we'd really like is to give every userspace thread a proportional share of the CPU available for batch jobs, with the option of user adjustment via nice. To do this, the old BSD scheduler essentially models how much CPU each process has used recently. The priority is adjusted to reflect this, and all threads are run in priority order. The problem with 4BSD was that it made no distinction between batch and interactive tasks; threads simply got higher-priority treatment when they ran infrequently. In ULE I solved this with an interactivity score that takes into account how much time the thread ran versus how much time it voluntarily slept. This differs from comparing its runtime against overall system time, and that is an important distinction.
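One plausible shape for such a score, folding the run/sleep ratio into a bounded range where low means interactive, might look like the sketch below. This is an illustration of the idea, not ULE's actual formula, and the constants are invented.

```c
#include <assert.h>

#define INTERACT_MAX	100
#define INTERACT_HALF	(INTERACT_MAX / 2)

/*
 * Score a thread by how much it ran versus how much it voluntarily
 * slept.  Mostly-sleeping threads land in 0..49 (interactive);
 * mostly-running threads land in 51..100 (batch).
 */
static int
interact_score(unsigned long runtime, unsigned long sleeptime)
{
	if (sleeptime > runtime)
		return (INTERACT_HALF * runtime / sleeptime);
	if (runtime > sleeptime)
		return (INTERACT_MAX - INTERACT_HALF * sleeptime / runtime);
	return (INTERACT_HALF);
}
```

Note that total system time never appears: a thread that sleeps waiting for input scores as interactive even on an otherwise idle machine, which is exactly the distinction the paragraph above draws.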
Back to the two queues. There were other practical problems with them. Firstly, they are expensive in memory and cache footprint. The priority queues are big data structures, and accessing many of them per CPU has a measurable impact. Secondly, having threads on multiple queues caused problems for the rest of the system. If a thread waiting to run needs its priority elevated for priority propagation, do we have to move it from one queue to the other? How does that affect our queue-flipping and starvation strategy?
So between the poor performance, the quirky batch-job scheduling behavior, and the difficulty of manipulating priorities for the rest of the system, I was never quite satisfied with the double queues. To replace them, I've used a circular queue inspired by a calendar queue I used in a real-time callout system. Imagine one bucket per priority, along with an insertion index; a thread's priority is relative to the insertion index. As time moves on, the insertion index advances, and a thread that was inserted with a worse priority is elevated relative to the index without our specifically visiting each thread. With only one queue there is little extra CPU overhead, and we have more flexibility in deciding how much runtime each thread gets relative to those with a lower priority. ULE only uses this queue for batch jobs. Jobs that ULE considers interactive are placed on the same queue as realtime and interrupt threads; they are always serviced before any non-realtime threads, and they are quickly punted to the batch queue if they overuse the CPU.
In the old system, priorities were decided by the interactivity score, which is the ratio of run time to voluntary sleep time. If you have many threads that run without voluntarily sleeping, they quickly reach identical priorities regardless of what share of the CPU they have actually received. This was convenient because I was already calculating interactivity, but ULE now also accurately calculates CPU utilization over the last 10 seconds. I now base the priority on this metric, which gives a better response with many competing CPU hogs: old hogs temporarily aren't allowed to run until new hogs have evened things out.
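A utilization estimate over a sliding window can be kept cheaply with per-tick accounting and periodic decay. The sketch below shows one such scheme; the constants and names are assumptions for illustration, not ULE's actual bookkeeping.

```c
#include <assert.h>

#define CPU_WINDOW_TICKS 1000	/* e.g. 10 s at 100 Hz */

struct cpustat {
	unsigned long runticks;		/* ticks this thread was on cpu */
	unsigned long totalticks;	/* ticks elapsed in the window */
};

/* Called from the clock tick; 'running' is 1 if the thread was on cpu. */
static void
cpustat_tick(struct cpustat *cs, int running)
{
	cs->runticks += running;
	cs->totalticks++;
	/* Halve the history once the window fills, so old usage decays. */
	if (cs->totalticks >= CPU_WINDOW_TICKS) {
		cs->runticks /= 2;
		cs->totalticks /= 2;
	}
}

/* Percentage of the recent window this thread spent running. */
static int
cpustat_pct(const struct cpustat *cs)
{
	if (cs->totalticks == 0)
		return (0);
	return (int)(100 * cs->runticks / cs->totalticks);
}
```

Basing priority on a figure like this, rather than on the interactivity ratio alone, is what lets competing hogs with identical run/sleep behavior still be told apart by how much CPU they have actually received.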
ULE is also now slightly faster than 4BSD on uniprocessor systems; the balance used to tip in 4BSD's favor on UP. The SMP balancing algorithms are unaffected by this latest change. I still have some tuning and adjusting to do, but I expect to commit this within the next week or so. I've been running some version of the circular queue for a number of months now, although I've only recently refined the design to the point of general appeal.
p.s. I sort of like programming again.