jeffr_tech ([info]jeffr_tech) wrote,
@ 2008-04-16 23:02:00
adaptive idling
One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping. We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks. This has had an unexpected influence on our idle loop.
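As a rough illustration of "spin before sleeping", here is a minimal user-space sketch in Python. It is not the kernel's adaptive mutex code: the function name `adaptive_acquire` and the `max_spins` budget are made up for this example, and real adaptive locks typically spin only while the lock owner is running on another cpu rather than for a fixed count.

```python
import threading

def adaptive_acquire(lock, max_spins=1000):
    """Take `lock` by spinning briefly, then fall back to a blocking wait.

    Returns "spin" if the lock was obtained while spinning, "sleep" otherwise.
    `max_spins` is an arbitrary illustrative budget, not a tuned value.
    """
    for _ in range(max_spins):
        # Non-blocking attempt: returns immediately with True or False.
        if lock.acquire(blocking=False):
            return "spin"
    # Spin budget exhausted: block until the holder releases the lock.
    lock.acquire()
    return "sleep"

if __name__ == "__main__":
    lock = threading.Lock()
    print(adaptive_acquire(lock))  # uncontended, so the first try succeeds
    lock.release()
```

The point of the spin phase is that a short wait is cheaper than a full sleep and wakeup when the lock is about to be released anyway.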

When a thread becomes runnable it is often desirable to run it on a cpu other than the current one. If the target cpu is in the idle loop, it may be waiting in a low power state, entered with the 'hlt' instruction or with some ACPI mechanism that I try to avoid. To wake up this remote cpu we currently issue an IPI (inter-processor interrupt). This is very expensive for both the sender and the receiver.

On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations. This works by giving the programmer access to the existing hardware bus snooping interface. The sleeping cpu sees another cpu write to the memory location it is monitoring and wakes up.

On Barcelona, mwait doesn't enter as deep a sleep as it does on the Xeons. So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like, so I keep both the Xeons and Opterons in C0 to further reduce wakeup latency.

Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. So this is now the first stage in the adaptive algorithm: we spin for a while, then sleep in a high power state, and then, depending on load, sleep in a low power state.
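The staged policy can be modelled as a small decision function. This is only a sketch in Python of the idea described above, not the actual idle loop: the thresholds `SPIN_BUDGET_US` and `BUSY_WAKEUPS_PER_SEC`, the use of wakeup rate as the load measure, and the name `choose_idle_action` are all assumptions, since the post does not say how load is measured or what the cutoffs are.

```python
from enum import Enum

class IdleAction(Enum):
    SPIN = "spin"    # busy-wait, polling for new work
    MWAIT = "mwait"  # shallow sleep, woken by a write to the monitored address
    HLT = "hlt"      # deeper sleep, woken by an interrupt such as an IPI

# Hypothetical tuning knobs; the real values are not given in the post.
SPIN_BUDGET_US = 50          # how long to spin before sleeping at all
BUSY_WAKEUPS_PER_SEC = 1000  # at or above this rate the cpu counts as busy

def choose_idle_action(idle_us, wakeups_per_sec):
    """Pick the idle stage from time spent idle so far and recent load."""
    if idle_us < SPIN_BUDGET_US:
        # Stage 1: work may arrive any moment, so keep polling.
        return IdleAction.SPIN
    if wakeups_per_sec >= BUSY_WAKEUPS_PER_SEC:
        # Stage 2: busy system, prefer the cheap-to-wake sleep.
        return IdleAction.MWAIT
    # Stage 3: mostly idle system, prefer the lower power sleep.
    return IdleAction.HLT

if __name__ == "__main__":
    print(choose_idle_action(10, 5000))
    print(choose_idle_action(500, 5000))
    print(choose_idle_action(500, 10))
```

Each stage trades wakeup latency for power: spinning reacts fastest and burns the most, hlt saves the most and needs an IPI to wake.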

Using a 'ping-pong' threads program that sends a single token around a ring of threads, I see a 20% perf improvement vs the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, entering and exiting the idle states wastes a lot of time and energy, so the adaptive scheme might even improve power under load by reducing the total cpu time required.
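The benchmark program itself is not shown, so the following is only a sketch of the general shape of such a test, written in Python with one queue per thread. The name `run_ring` and its parameters are invented here. Python's interpreter overhead dominates the timing, so this shows the structure of a token ring and is not a way to reproduce the 20% figure.

```python
import queue
import threading
import time

def run_ring(num_threads=4, laps=100):
    """Pass one token around a ring of threads; return (hops, seconds)."""
    # One inbox per thread; thread i forwards to thread (i + 1) % num_threads.
    inboxes = [queue.Queue() for _ in range(num_threads)]
    total_hops = num_threads * laps
    final = []
    done = threading.Event()

    def worker(i):
        while True:
            hops = inboxes[i].get()   # block until the token (or None) arrives
            if hops is None:          # shutdown marker from the main thread
                return
            if hops == total_hops:    # token has completed every lap
                final.append(hops)
                done.set()
            else:
                inboxes[(i + 1) % num_threads].put(hops + 1)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()

    start = time.perf_counter()
    inboxes[0].put(0)                 # inject the single token
    done.wait()
    elapsed = time.perf_counter() - start

    for box in inboxes:               # tell every worker to exit
        box.put(None)
    for t in threads:
        t.join()
    return final[0], elapsed

if __name__ == "__main__":
    hops, seconds = run_ring()
    print(hops, "hops in", seconds, "seconds")
```

Because only one thread is runnable at any time, every hop is a wakeup of a sleeping thread, which is exactly the path the idle loop changes affect.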



(6 comments) - (Post a new comment)


[info]zenspider
2008-04-17 10:09 am UTC (link)
Totally unrelated, but would you have any clue why my FreeBSD 4.11 parallels slice running a lot more stuff (postfix, mailman, apache 1, cronjobs) has a substantially lower CPU time (on the parallels side) than my much less active FreeBSD 6.3 parallels slice (only running a single static blog on lighttpd)?

Would 7 stable be any better in this regard?



[info]evan
2008-04-17 03:54 pm UTC (link)
Can you measure CPU power consumption by anything other than a dongle on the wall socket?

It seems it would be hard to tune algorithms/data structures to different CPU characteristics -- like maybe another CPU has cheap IPIs.

(Reply to this) (Thread)

IPI fundamentally expensive
(Anonymous)
2008-04-18 12:19 am UTC (link)
An IPI is fundamentally expensive because it causes the destination CPU to trap. This is likely to take hundreds of cycles.

An IPI could be implemented nicely on the bus because it doesn't have to worry about cache coherency issues (on the other hand, CPU/bus designers have far far more motive to improve cache coherency than to improve IPI performance).

The ideal primitive would be a synchronous wait-for-interrupt-number instruction that would not trap but simply return with a flag set if the given interrupt triggers. OTOH this might add additional complexity to the CPU's interrupt handling paths...

(Reply to this) (Parent)


[info]mjg59
2008-05-09 12:28 am UTC (link)
It's not generally possible to measure CPU power consumption directly unless you have an instrumented board. You can retrieve information on how many C-state transitions have occurred and how long the CPU stayed in those states, which (assuming you have reasonable knowledge about the processor) lets you come up with a pretty reasonable estimate.
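One way such an estimate could be computed is sketched below in Python: time in each state multiplied by that state's power draw, plus a per-transition cost. The function name and every number in the example are made up for illustration. Real per-state power and transition energy would have to come from the processor's documentation or from measurement.

```python
def estimate_energy_joules(residency_s, power_w, transitions, transition_j):
    """Estimate energy from C-state residency and transition counts.

    residency_s:  state name -> seconds spent in that state
    power_w:      state name -> average power in that state, in watts
    transitions:  state name -> number of times the state was entered
    transition_j: state name -> energy per enter/exit pair, in joules
    """
    steady = sum(residency_s[s] * power_w[s] for s in residency_s)
    switching = sum(transitions[s] * transition_j[s] for s in transitions)
    return steady + switching

if __name__ == "__main__":
    # Entirely hypothetical figures for a 10 second window.
    energy = estimate_energy_joules(
        residency_s={"C0": 2.0, "C1": 8.0},
        power_w={"C0": 50.0, "C1": 10.0},
        transitions={"C1": 1000},
        transition_j={"C1": 0.001},
    )
    print(energy)  # about 181 J with these made-up inputs
```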

(Reply to this) (Parent)

Off point, have you heard about the BFQ I/O scheduler?
(Anonymous)
2008-04-17 04:24 pm UTC (link)
Hello Jeff, have you heard about the BFQ (Budget Fair Queuing) I/O scheduler from a few days ago? I found it interesting to read, so maybe you will too.

http://kerneltrap.org/Linux/Budget_Fair_Queuing_IO_Scheduler
http://algo.ing.unimo.it/people/paolo/disk_sched/



[info]sas_spidey01
2008-04-18 08:04 am UTC (link)
What about non-x86/AMD64 architectures?


Not that I've actually seen one in person since 256K of ram was considered fairly big... lol.


