One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping. We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks. This has had an unexpected influence on our idle loop.
When a thread becomes runnable it is often desirable to run it on a cpu other than the current one. If the target cpu is in the idle loop, it may actually be waiting in a low power state using the 'hlt' instruction or some acpi mechanism that I try to avoid. To wake up this remote cpu we currently issue an IPI (inter-processor interrupt). This is actually very expensive for the sender and receiver.
On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations. This works by giving the programmer access to the existing hardware bus snooping interface. The sleeping cpu sees another cpu write to a memory location we're snooping and we wake up.
On barcelona mwait doesn't enter as deep of a sleep as on the xeons. So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like so I keep both the Xeon and Opterons in C0 to further reduce wakeup latency.
Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
Using a 'ping-pong' threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.