When a thread becomes runnable it is often desirable to run it on a CPU other than the current one. If the target CPU is in the idle loop, it may be waiting in a low-power state, entered with the 'hlt' instruction or through some ACPI mechanism that I try to avoid. To wake this remote CPU we currently issue an IPI (inter-processor interrupt), which is very expensive for both the sender and the receiver.
On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote CPU using only memory operations. They work by giving the programmer access to the existing hardware bus-snooping interface: the sleeping CPU watches a memory location, and when another CPU writes to that location the sleeper wakes up.
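To make the wakeup path concrete, here is a minimal user-space sketch of the sender's side. It is not the kernel code: monitor and mwait are privileged instructions, so the sketch only models the per-CPU word that the idle CPU would be watching. The names cpu_idle, cpu_wake and the three states are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical idle states for one CPU (illustrative names only). */
enum cpu_idle_state {
    CPU_RUNNING,         /* not idle, nothing to wake */
    CPU_IDLE_MONITORED,  /* idle and watching the state word (mwait-style) */
    CPU_IDLE_HALTED      /* idle in hlt, only an interrupt wakes it */
};

/* The word the idle CPU is assumed to be monitoring. */
struct cpu_idle {
    _Atomic int state;
};

/*
 * Try to wake a remote CPU with a plain memory write.
 * Returns true if the caller still has to send an IPI.
 */
static bool cpu_wake(struct cpu_idle *ci)
{
    int expected = CPU_IDLE_MONITORED;

    assert(ci != NULL);

    /* If the target is watching the word, the store itself is the wakeup. */
    if (atomic_compare_exchange_strong(&ci->state, &expected, CPU_RUNNING))
        return false;

    /* The exchange failed, so expected now holds the real state. */
    return expected == CPU_IDLE_HALTED;
}
```

The point of the compare-and-swap is that the sender only pays for an IPI when the target really is halted; a CPU that is already running or is watching the word costs a single memory operation.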
On Barcelona, mwait doesn't enter as deep a sleep as it does on the Xeons, so I decided to use an adaptive algorithm that uses mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like, so I keep both the Xeons and the Opterons in C0 to further reduce wakeup latency.
Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. This is now the first stage in the adaptive algorithm: we spin for a while, then sleep in a high-power state, and then sleep in a low-power state, depending on load.
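The three stages can be sketched as a small decision function. This is only an illustration under my own assumptions, not the scheduler's actual code: the names idle_policy, idle_choose, spin_limit and busy_threshold are hypothetical, load is approximated by a count of recent wakeups, and I am reading the high-power state as mwait and the low-power state as hlt.

```c
#include <assert.h>
#include <stddef.h>

/* What the idle loop should do next (illustrative names only). */
enum idle_action {
    IDLE_SPIN,   /* stage 1: poll for work without sleeping */
    IDLE_MWAIT,  /* stage 2: shallow sleep, woken by a memory write */
    IDLE_HLT     /* stage 3: deeper sleep, woken by an interrupt */
};

/* Hypothetical tunables; real values would need measurement. */
struct idle_policy {
    unsigned spin_limit;      /* spin iterations before sleeping at all */
    unsigned busy_threshold;  /* recent wakeups at or above this mean "busy" */
};

/*
 * Pick the next idle action from how long we have already spun
 * and how often this CPU has been woken recently.
 */
static enum idle_action
idle_choose(const struct idle_policy *p, unsigned spins_done,
    unsigned recent_wakeups)
{
    assert(p != NULL);

    if (spins_done < p->spin_limit)
        return IDLE_SPIN;
    if (recent_wakeups >= p->busy_threshold)
        return IDLE_MWAIT;
    return IDLE_HLT;
}
```

With a policy of { 100, 4 }, for example, a CPU that has spun 100 times and seen 4 recent wakeups would pick IDLE_MWAIT, while one with only 3 recent wakeups would fall through to IDLE_HLT.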
Using a 'ping-pong' threads program that sends a single token around a ring of threads, I see a 20% performance improvement over the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, entering and exiting the idle states wastes a lot of time and energy, so the adaptive approach might even improve power consumption under load by reducing the total CPU time required.