| jeffr_tech ( @ 2008-01-25 02:05:00 |
non-uniform cpu scheduling.
After a month sick with some random virus I'm finally starting to feel normal and get some work done again. I spent some time implementing a long standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions. Modern multi-core processors are non-uniform from a variety of perspectives. For example, the barcelona has a shared L3 among all cores on a package. So if you have two packages and you're placing two threads on cores within one package you're wasting half your cache. To take advantage of this knowledge the scheduler needs detailed information about the cpu layout and it needs to intelligently act on that information.
Other non types of non-uniformity are shared system bus or memory controller links, hyper-threaded or symmetric multi-threaded cores, etc. All of these factors stack up to mean simply picking the most idle cpu is no longer an effective solution. For example, if a thread has affinity for one cpu and it's not available, it is sensible to pick another cpu that shares cache with the desired cpu. It's also important to balance load across system bus links and caches not just processors.
Optimizing for these characteristics is not a novel idea. Linux and Solaris both make some attempts to do so. I have not analyzed them in depth but I have noticed that they are both fairly static approaches requiring changes for new architectures or missing opportunities to optimize odd ones such as intel's initial quad core (two sets of dualcore cpus with a large l2 each but a shared system bus). ULE even had some primitive improvements to optimize for HTT. However, my new solution is very dynamic, allowing the machine dependent code to describe arbitrary topographies with slightly abstracted levels of sharing.
The basic datastructure is a tree that describes the relationships between cpus. What resource they share and which cpus are in the set. An example tree might have a root that describes 4 sockets which share nothing (independent bus links), with another layer describing a package with an L3, and another describing a par of cores within that package sharing an L2. Yet another layer may then describe 4 SMT threads per-core. Most systems are actually less complex and require only one or two levels in the tree which reduces algorithm cost.
When searching for a cpu with affinity the scheduler will walk the tree backwards from the last cpu scheduled to the highest level which the thread has affinity for. It then recursively evaluates the sub-tree to find the most idle match with the best affinity. To find an idle cpu, we scan the whole tree finding the least loaded CPU on the least loaded path through the tree. This balances load across tree levels and hence all cpu resources.
The periodic load balancer now recursively iterates the tree, load balancing every level. A search routine simultaneously finds the most and least loaded core via each path and rectifies their loads. The idle time load balancer which runs when a cpu runs out of work will scan starting from the nearest cpu outward to pick the best cpu to steal a thread from.
The majority of the algorithms are quite generic, however, I may implement some special casing for SMT/HTT as there really is a much stronger disincentive to scheduling these cores if others are available unless there is significant cache overlap between threads. When there is cache overlap between threads the scheduler can now place them near-by each other in the topology to decrease cache coherency messages that must be broadcast off package.
So after all that, how much does it help? Some charts:
http://people.freebsd.org/~kris/scaling/p gsql-16cpu.png
http://people.freebsd.org/~kris/scaling/m ysql-16cpu.png
These are from a 16core 4x4 xeon machine. The pgsql graph has a lot of lines but you're interested in ule + topo vs 16 cores, all active. mysql is a little more straightforward. Interestingly it scales well on 8 cpus but runs into terrible user-land contention on 16. I'm not sure if this is running with adaptive pthreads mutexes or not.
Anyway, the patch is still young yet and it's performing very well on this very asymmetric machine while not hindering others. Expect to see this in 8.0 and maybe backported for 7.1 to alleviate the poor scheduling behavior that all those of you with quad quad core machines are experiencing. ;)
After a month sick with some random virus I'm finally starting to feel normal and get some work done again. I spent some time implementing a long standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions. Modern multi-core processors are non-uniform from a variety of perspectives. For example, the barcelona has a shared L3 among all cores on a package. So if you have two packages and you're placing two threads on cores within one package you're wasting half your cache. To take advantage of this knowledge the scheduler needs detailed information about the cpu layout and it needs to intelligently act on that information.
Other non types of non-uniformity are shared system bus or memory controller links, hyper-threaded or symmetric multi-threaded cores, etc. All of these factors stack up to mean simply picking the most idle cpu is no longer an effective solution. For example, if a thread has affinity for one cpu and it's not available, it is sensible to pick another cpu that shares cache with the desired cpu. It's also important to balance load across system bus links and caches not just processors.
Optimizing for these characteristics is not a novel idea. Linux and Solaris both make some attempts to do so. I have not analyzed them in depth but I have noticed that they are both fairly static approaches requiring changes for new architectures or missing opportunities to optimize odd ones such as intel's initial quad core (two sets of dualcore cpus with a large l2 each but a shared system bus). ULE even had some primitive improvements to optimize for HTT. However, my new solution is very dynamic, allowing the machine dependent code to describe arbitrary topographies with slightly abstracted levels of sharing.
The basic datastructure is a tree that describes the relationships between cpus. What resource they share and which cpus are in the set. An example tree might have a root that describes 4 sockets which share nothing (independent bus links), with another layer describing a package with an L3, and another describing a par of cores within that package sharing an L2. Yet another layer may then describe 4 SMT threads per-core. Most systems are actually less complex and require only one or two levels in the tree which reduces algorithm cost.
When searching for a cpu with affinity the scheduler will walk the tree backwards from the last cpu scheduled to the highest level which the thread has affinity for. It then recursively evaluates the sub-tree to find the most idle match with the best affinity. To find an idle cpu, we scan the whole tree finding the least loaded CPU on the least loaded path through the tree. This balances load across tree levels and hence all cpu resources.
The periodic load balancer now recursively iterates the tree, load balancing every level. A search routine simultaneously finds the most and least loaded core via each path and rectifies their loads. The idle time load balancer which runs when a cpu runs out of work will scan starting from the nearest cpu outward to pick the best cpu to steal a thread from.
The majority of the algorithms are quite generic, however, I may implement some special casing for SMT/HTT as there really is a much stronger disincentive to scheduling these cores if others are available unless there is significant cache overlap between threads. When there is cache overlap between threads the scheduler can now place them near-by each other in the topology to decrease cache coherency messages that must be broadcast off package.
So after all that, how much does it help? Some charts:
http://people.freebsd.org/~kris/scaling/p
http://people.freebsd.org/~kris/scaling/m
These are from a 16core 4x4 xeon machine. The pgsql graph has a lot of lines but you're interested in ule + topo vs 16 cores, all active. mysql is a little more straightforward. Interestingly it scales well on 8 cpus but runs into terrible user-land contention on 16. I'm not sure if this is running with adaptive pthreads mutexes or not.
Anyway, the patch is still young yet and it's performing very well on this very asymmetric machine while not hindering others. Expect to see this in 8.0 and maybe backported for 7.1 to alleviate the poor scheduling behavior that all those of you with quad quad core machines are experiencing. ;)