jeffr_tech (jeffr_tech) wrote,

ULE cpu selection.

With the rewrite of the CPU selection algorithm, ULE 2.0 is something like a 70% rewrite. Read on for my new, more efficient way of choosing target cpus. It's surprisingly more complicated than you might expect. With these changes ULE is about 300% faster than 4BSD on a MySQL benchmark on my 8-core Opteron. It's about 10% faster on a 2-core Xeon. 4BSD does a reasonable job on 2-processor machines but it doesn't scale well.


The old ULE selection algorithm was based on the notion that you would want to share load evenly across processors. It was relatively simple and produced decent results depending on the workload. One of the primary concerns, which turns out to be invalid, was that we not touch remote CPUs' run-queues and that we get by without evaluating their state very much.

This algorithm fell down in various cases. Inspiration came from Solaris, which tries to pick the best-priority thread to run at any given time, mimicking what a UP scheduler would have chosen. It turns out that it's important for application performance to run things in priority order even globally. This is worth more than the extra expense of constantly examining what the other cpus are doing.

I'll outline the new algorithm here. This is called to pick the cpu that a thread will run on when it is woken up or rescheduled.

1) If the last cpu we ran on was idle, run there.
2) If the thread ran recently enough and its priority is better than that of the current thread on the last cpu, run there.
3) If the cpu that woke us up (current cpu) is about to be idle, run here.
4) Check for an entirely idle cpu in a bitmask of idle cpus, run on any idle cpu.
5) If there are no idle cpus, see if the current cpu is running something with a lower priority, and run here.
6) Scan all cpus finding the cpu executing the lowest priority thread and run there.
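
To make that concrete, here's a minimal sketch of the cascade in C. This is not the actual sched_ule.c code: the types, the idle bitmask, the tick counter, and the AFFINITY_TICKS constant are all simplified stand-ins, and I'm assuming FreeBSD's convention that a lower numeric priority value is a better priority.

/*
 * Simplified sketch of the six-step selection above.  Everything here
 * is illustrative; lower numeric priority is better, as in FreeBSD.
 */
#define NCPU            8
#define AFFINITY_TICKS  10          /* sysctl-tunable in the real scheduler */

struct cpu_info {
    int cur_pri;                    /* priority of the thread running there */
    int about_to_idle;              /* nonzero if it will idle shortly */
};

struct thr_info {
    int last_cpu;                   /* cpu the thread last ran on */
    int last_tick;                  /* tick when it last ran */
    int pri;                        /* this thread's priority */
};

static struct cpu_info cpus[NCPU];
static unsigned int idle_mask;      /* bit i set when cpu i is idle */
static int now;                     /* current tick count */

static int
pick_cpu(struct thr_info *td, int cur_cpu)
{
    int last = td->last_cpu;
    int cpu, worst;

    /* 1) The last cpu we ran on is idle: keep the warm cache. */
    if (idle_mask & (1u << last))
        return (last);
    /* 2) Ran recently and beats what the last cpu is running now. */
    if (now - td->last_tick < AFFINITY_TICKS && td->pri < cpus[last].cur_pri)
        return (last);
    /* 3) The waking (current) cpu is about to go idle. */
    if (cpus[cur_cpu].about_to_idle)
        return (cur_cpu);
    /* 4) Any entirely idle cpu will do. */
    for (cpu = 0; cpu < NCPU; cpu++)
        if (idle_mask & (1u << cpu))
            return (cpu);
    /* 5) Nothing idle: prefer the current cpu if we beat its thread. */
    if (td->pri < cpus[cur_cpu].cur_pri)
        return (cur_cpu);
    /* 6) Otherwise pick the cpu running the worst priority. */
    worst = 0;
    for (cpu = 1; cpu < NCPU; cpu++)
        if (cpus[cpu].cur_pri > cpus[worst].cur_pri)
            worst = cpu;
    return (worst);
}

The numbered comments map one to one onto the list above; the real code also has to deal with locking and the per-cpu run-queues, which the sketch ignores.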

Steps 1 and 2 implement affinity. The affinity from step 2 is tunable with a sysctl. If a thread hasn't run in N ticks we can assume that it has a cold cache and there is no penalty for running it elsewhere. If it hasn't run in a while but its last cpu is idle, that's as good a place as any for it, and this actually considerably speeds up some tests which are mostly idle.
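
If you want to poke at that tunable from userland, sysctl(3) will do it. A tiny sketch, assuming the knob is exported as kern.sched.affinity (the name it carries in later FreeBSD trees; treat it as an assumption here):

/* Read the affinity window from userland.  The sysctl name is an
 * assumption, as noted above. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
    int val;
    size_t len = sizeof(val);

    if (sysctlbyname("kern.sched.affinity", &val, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        return (1);
    }
    printf("kern.sched.affinity = %d\n", val);
    return (0);
}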

Steps 3 and 5 implement a preference for the current cpu. This is because scheduling on the current cpu is faster and may actually allow the waking thread and the woken thread to share some information in the cache. Imagine one thread writing to a pipe and waking up the receiver. The written data may still be in cache on the current cpu.

Step 6 is really pretty simple. If there are no idle cpus and no cpu affinity we just find the cpu which is executing the lowest priority task. When we assign to this cpu we may actually send an IPI (Inter-Processor Interrupt) to interrupt the current task and run the new one. This reduces scheduling latency for the highest priority task.
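
The scan itself already appeared in the sketch above; here it is again as a standalone toy with the IPI decision added. ipi_preempt() is only a stand-in for whatever notification primitive the kernel actually uses.

/* Sketch of step 6: pick the cpu running the worst (numerically
 * highest) priority and, if the new thread beats it, poke that cpu
 * so it reschedules immediately.  Everything here is illustrative. */
#include <stdio.h>

#define NCPU 8

static int cur_pri[NCPU];           /* priority running on each cpu */

static void
ipi_preempt(int cpu)
{
    /* Placeholder: the kernel would send an interprocessor interrupt
     * here, forcing "cpu" into its scheduler to pick up the new,
     * higher-priority thread right away. */
    printf("IPI -> cpu %d\n", cpu);
}

static int
assign_lowest_pri_cpu(int new_pri)
{
    int cpu, worst = 0;

    for (cpu = 1; cpu < NCPU; cpu++)
        if (cur_pri[cpu] > cur_pri[worst])
            worst = cpu;
    /* Lower numeric value means better priority, so preempt only if
     * the new thread is strictly better than what is running there. */
    if (new_pri < cur_pri[worst])
        ipi_preempt(worst);
    return (worst);
}

int
main(void)
{
    int i;

    for (i = 0; i < NCPU; i++)
        cur_pri[i] = 100 + i;       /* cpu 7 runs the worst priority */
    /* Priority 90 beats the 107 on cpu 7, so an IPI goes out. */
    printf("chose cpu %d\n", assign_lowest_pri_cpu(90));
    return (0);
}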

One effect of this algorithm is that when all cpus are busy you can get load imbalances. Imagine every other cpu is running a real-time priority task while one is running a task with a poor priority. The real-time tasks then schedule successively higher-priority tasks, all of which will be assigned to the cpu running the poor-priority task. Once the real-time tasks stop, most of the threads are now assigned to the cpu that was executing the lower-priority tasks. This can lead to poor utilization due to short-term imbalances and higher execution latency for the remaining tasks.


To solve the load imbalance problem ULE now tracks which CPUs have more load than a configurable 'busy' threshold. When a CPU idles it looks for busy cpus and tries to steal some work from them. We don't want this threshold set too low because it destroys affinity.
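
A minimal sketch of that idle-time steal, again with made-up names; a plain per-cpu counter stands in for ULE's real load tracking and steal_thread() glosses over the run-queue locking:

/* Sketch of the idle-time rebalancing described above.  busy_thresh
 * plays the role of the configurable 'busy' threshold; the load
 * counters and steal_thread() are simplified stand-ins. */
#define NCPU 8

static int load[NCPU];              /* runnable threads per cpu */
static int busy_thresh = 2;         /* steal only from cpus above this */

/* Placeholder for dequeueing one thread from "from" and enqueueing it
 * on "to"; the real code has to lock both run-queues to do this. */
static void
steal_thread(int from, int to)
{
    load[from]--;
    load[to]++;
}

/* Called when "self" runs out of work.  Look for the busiest cpu over
 * the threshold and pull one thread from it.  Returns the cpu stolen
 * from, or -1 if nothing was busy enough to bother. */
static int
idle_steal(int self)
{
    int cpu, busiest = -1;

    for (cpu = 0; cpu < NCPU; cpu++) {
        if (cpu == self || load[cpu] <= busy_thresh)
            continue;
        if (busiest == -1 || load[cpu] > load[busiest])
            busiest = cpu;
    }
    if (busiest != -1)
        steal_thread(busiest, self);
    return (busiest);
}

Keeping busy_thresh above a thread or two is exactly the "not too low" caution above: an idle cpu shouldn't yank work that would otherwise have kept its affinity.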

That's the new algorithm in a nutshell. Hopefully with some time I'll be able to tune the various parameters to the point that ULE is a speedup in almost all cases and a negligible slowdown in the others.