jeffr_tech ([info]jeffr_tech) wrote,
@ 2007-08-06 01:06:00
Previous Entry  Add to memories!  Tell a Friend!  Next Entry
I took a little while to learn more about SD and CFS to see what the linux guys were up to. I have a couple of interesting comments. Including some discussion of increased algorithm complexity in the CFS scheduler; it's no longer O(1). Please keep in mind that these comments are entirely from reading code and documentation. I make no claim of knowing which scheduler actually makes the best scheduling decisions.


There were two things about the earlier linux scheduler that I didn't like, both of which I had tried in earlier versions of ULE. The first was the double-queue mechanism, which I only replaced in the "ULE 2.0" commits. This made fair scheduling incredibly difficult. It also made nice very difficult. The second thing was using variable slice sizes to adjust cpu time allocation. It just seemed hacky and pessimized throughput in tasks with small slices.

The SD scheduler kept both of these things. It then used an very complex priority bitmap to determine how many slices a thread would be assigned on the current run-queue. It did remove many of the interactivity related heuristics in an attempt to simplify the scheduler. I understand it had favorable behavior on the desktop, but I imagine there were many pathological behaviors with it. In general I found it overcomplicated, as most of the attempts I made with the double-queue system were.

CFS got rid of the variable slice sizes and the double-queue. However, it also increased the algorithmic complexity on insert/removal to O(log n) through the use of a rb tree rather than a run-queue The mainline scheduler was much touted for it's O(1) complexity. The historic scheduler, prior to O(1), was only O(n) in a loop that ran once per-second to recalculate task priorities. This is similar to the 4BSD scheduler algorithm. They were otherwise O(1) for insert/removal.

I'm curious to see if they find the increased algorithm complexity will show up on many profiles. You can imagine that when dealing with hundreds or thousands of running threads the cache misses when walking log n could be quite expensive. Especially because you'd have many, many, more scheduling events with this number of threads.

The other interesting thing in CFS is nanosecond resolution precise accounting of time. We've actually been discussing this for a while in bsd as we already keep some of these stats for getrusage(). Hooking it up in the scheduler would make several issues nicer and would remove the possibility of timing heuristic based scheduler attacks.

I have not looked at what heuristics were employed to determine which processes are interactive in the mainline scheduler. However, ULE's interactivity heuristic adapts with CPU load and I can't imagine abandoning it as they have with CFS. ULE and CFS both attempt fair cpu distribution over some period modulo nice values. In this regard ULE with the interactivity disabled (sysctl kern.sched.interact=0) should be as 'completely fair' as CFS. As was the older 4BSD scheduler. However, FreeBSD users report a marked improvement in responsiveness with the interactivity hint in ULE.



(Post a new comment)

some comments
[info]sniket
2007-08-06 07:04 pm UTC (link)
Hey Jeff, I've got two questions and one comment ;)

Ingo claims that with the current PID limit of the kernel (32k) O(log n) *should* be more or less the same for *most* workloads because CFS's algorithm never goes deeper than 15-20 in the rbtree. However, if that's the case, one might wonder what the big fuss about O(1) was all about when he released his new scheduler in January 2002.. He explains this in more detail at
http://linux.slashdot.org/article.pl?sid=07/07/10/2335217&tid=185 (make sure to read all of his comments).

Now for the questions..

1.)
"Hooking it up in the scheduler would make several issues nicer" <- for example?

2.)
Now that you've looked at CFS' architecture, would it be much work to replace the rbtree with the circular queues you used in ULE?

(Reply to this) (Thread)

Re: some comments
[info]wyrm_brother
2007-08-06 07:46 pm UTC (link)
Unfortunately, Ingo is wrong here. It is O(long n), because the big-O notation means the behaviour when n approaches infinity. When he limits the PID to a maximum number, the old O(n) algorithm would be O(1), and even O(n^2) would be O(1).

So to be correct, the algorithm still is O(log n). But that does not mean much. Its the benchmarks that count. O(log n) is pretty good, and most of the times fast enough.

(Reply to this) (Parent)(Thread)

Re: some comments
[info]sniket
2007-08-06 07:59 pm UTC (link)
That's what I was trying to say.. Of course the algorithm is still O(log n) (and Ingo did not deny that), but for *most* practical workloads it *should* be enough, that's what you said as well :)

(Reply to this) (Parent)

Re: some comments
(Anonymous)
2007-08-06 09:18 pm UTC (link)
Copying my post from OSNews....

This kind of stuff really ticks me off.

O(log(n)) does not equal O(1) for small values of n!!!

That would be like saying okay, its really O(n^2) but since n has an upper limit of 32,000 the worst case is O(1,024,000,000) which is still O(1).

EVERY algorithm has limits...that doesn't make them constant time algorithms!

(Reply to this) (Parent)(Thread)

Re: some comments
(Anonymous)
2007-08-07 03:56 am UTC (link)
You were just stupid. O(#@$#@(n)) is only meaningful for big n. If n is limited to small number, no one should care about O. being it O(n**n) or O(1).

No one Cares of O for small n.

(Reply to this) (Parent)(Thread)

Re: some comments
[info]jeffr_tech
2007-08-07 04:02 am UTC (link)
Given that it can cause the switch time to increase by 20-30% in extreme but possible cases I don't think it's 'stupid' to characterize the cause of this using big O.

It's important to be realistic about the numbers. Yes, it is O(log N), no you can't call it O(1), and no, it probably doesn't make a difference in the real world.

(Reply to this) (Parent)(Thread)


[info]daphnesikuh
2008-07-17 06:53 am UTC (link)
Personally, I don't think it's just a matter of semantics, but I can understand why some posters see at as such.

(Reply to this) (Parent)

Re: some comments
(Anonymous)
2007-08-07 12:10 pm UTC (link)
Okay, if no one should care about big-O for small values of n, then DON'T MENTION IT! And certainly don't lie about it calling it O(1). Just say "the new scheduler is really fast", don't flat out lie.

(Reply to this) (Parent)

Re: some comments
(Anonymous)
2007-08-07 12:13 pm UTC (link)
I think you do not understand what O(n) means. Is n = 32K small? Small for what ?
O(1) means does not depend on n (time constant)
O(n) means the execution time scales with n.
O(n^2) means the execution time scales with the square of n.
Period.
Now if you use O(N) algorithm to find the N-th element you need to do (in the worst case) N comparison, in case of O(N^2) then you need to do N^2 comparison.
In case of CFS you need to do O(log N) comparison.
Do you want some numbers?
In case of Linux PID limit of 32K then means:
O(N) => 32K Comparison
O(N^2) => 1048576K Comparison

As you can see the difference between O(N) and O(2) is 1E5 expensive than the other one.


(Reply to this) (Parent)(Thread)

Re: some comments
(Anonymous)
2007-08-10 09:47 am UTC (link)
Yes, but O(N) (or O(N^N) for that matter) for fixed N is O(1).

None is really interested in the time needed to schedule when the maximum processes are in used, because it will (or at least should) take the same amount of time every time, the interesting thing is how it scales when increasing the number of processes from 20 to 50 and so on, and that's when big-O notation becomes interesting.

(Reply to this) (Parent)


[info]paulahuqik
2008-07-16 06:13 am UTC (link)
So N (natural numbers) and O (odd numbers) are equal in size because there exists at least one function that uniquely correlates all the members of N with all the members of O.

(Reply to this) (Parent)


[info]nicilepo
2008-07-16 06:45 pm UTC (link)
I always ended in something like O(n. Log(n)) IIRC. But maybe this is overkill for real life anyway.

(Reply to this) (Parent)

Re: some comments
[info]bilmiyorum1
2009-06-18 10:09 pm UTC (link)
Good my dream teacher

Porno izle

(Reply to this) (Parent)

Re: some comments
[info]jeffr_tech
2007-08-06 09:10 pm UTC (link)
15-20 cache misses to schedule a thread is pretty significant. I can't say whether it will show up on profiles or not but I suspect so given my experience with similar tree datastructures in the VM. Consider that with the ULE queue you'll get somewhere around 3 cache misses. Certainly there are more i-cache lines touched due to code abstraction than d-cache lines touched on insert. There might be one less d-cache line touched with cfs and 1-3 threads runnable. I think it was a suboptimal choice given that other algorithms are readily available.

In response to your first question; We use clocks with a resolution in the hundreds of picoseconds on modern cpus for determining runtime for rusage. However, to determine how much of that runtime was user/system/interrupt we sample based on a ~1khz clock. So this causes some trouble for accurate rusage information. It also means we always need a 1khz clock running. If we were to sample on every state transition and exposed this information to the scheduler it could do more precise accounting for rusage as well as scheduling decisions. This would also allow us to more easily kill that timer to do variable timer based callout processing to allow processors to idle for longer as linux has recently done.

And finally; yes I believe you could easily replace the rbtree with a queue similar to the one in ULE. In fact, in this case it's a direct application of a calender queue as BSD uses in the callout infrastructure. The trick is to size your buckets appropriately so you can avoid sorting within the buckets. If threads have a ms or less difference in runtime, for example, you could place them all in the same bucket and have little impact on scheduling decisions.

The other odd thing about CFS is that relative priorities become meaningless. This prohibits things like priority inherited mutexes from working. I know they don't have that in mainline, but I wonder about rtlinux and others? I'd guess they replace the scheduler anyway?

(Reply to this) (Parent)(Thread)

Re: some comments
[info]jeffr_tech
2007-08-06 09:31 pm UTC (link)
According to ingo himself having 1,000 runnable tasks increases context switch cost by 20%.. That's a 10 deep tree. At worst you'd expect another 5 levels in the tree, so 30% total. It's not killing you but it's not ideal either.

I mostly just find it ironic given all of the fanfair surrounding the O(1) scheduler. I don't expect it's performance problem for real workloads.

(Reply to this) (Parent)(Thread)

Re: some comments
(Anonymous)
2007-08-07 04:05 am UTC (link)
Obviously the fanfair surrounding the O(1) scheduler was in its contrast to
the O(n) scheduler. O(logn) is undoubtedly still worse than O(1) if everything
else can be held constant, but the magnitude is still far smaller than O(logn)
is O(n).

(Reply to this) (Parent)(Thread)

Re: some comments
[info]jeffr_tech
2007-08-07 07:48 am UTC (link)
Well this is a misconception about what we're describing the complexity of.

The old scheduler certainly wasn't O(n) on every insert/removal. It was O(n) once or a few times a second. It was O(1) for insert and removal.

The new scheduler is O(log(n)) for every insert/removal.

As far as algorithm complexity it's a step behind the older scheduler.

(Reply to this) (Parent)(Thread)

Re: some comments
(Anonymous)
2007-08-08 03:35 am UTC (link)
Wrong.

The old scheduler was O(1) on insert/removal, but O(n) on every context
switch.

Given that after a task is inserted on the runqueue, it doesn't get removed
until it goes through a context switch. Total complexity to wake/run/sleep M
times is O(n*m).

With CFS, it is O(logn*logm) -- far lower.

However, even the old scheduler was not bad -- throughput-wise -- in most
workloads where don't have thousands of runnable tasks. The biggest problem
was the single scheduler lock.

(Reply to this) (Parent)(Thread)

Re: some comments
[info]jeffr_tech
2007-08-08 04:06 am UTC (link)
Every runnable task was examined on every context switch? really? Who is this btw?

I thought I had looked at it and it was essentially a variation on the same oldish unix scheduler. The one we call SCHED_4BSD in FreeBSD. This one is O(1) when we switch and only O(n) periodically.

What did the old algorithm do that was so silly as to traverse all runnable processes on every switch?

(Reply to this) (Parent)(Thread)

Re: some comments
[info]jeffr_tech
2007-08-08 04:51 am UTC (link)
repeat_schedule:
        /*
         * Default process to select..
         */
        next = idle_task(this_cpu);
        c = -1000;
        list_for_each(tmp, &runqueue_head) {
                p = list_entry(tmp, struct task_struct, run_list);
                if (can_schedule(p, this_cpu)) {
                        int weight = goodness(p, this_cpu, prev->active_mm);
                        if (weight > c)
                                c = weight, next = p;
                }
        }



Incredible.

(Reply to this) (Parent)(Thread)

Re: some comments
(Anonymous)
2007-08-08 05:20 am UTC (link)
Yep. And it even gets worse than that: there actually _is_ a "once every
so often" loop in schedule(), but that loops over ALL tasks in the system!

And even then, as I say, the old scheduler was never really _too_ great a
problem for real workloads (the single lock was probably the biggest one,
followed by the for_each_process loop, followed by the loop you quote
above). So I hope CFS shouldn't be too much worse than O(1).

Actually, CFS is apparently 10-20% faster at context switching with small
numbers of runnable processes (after some debug code is turned off). As
this is generally the most realistic situation, hopefully most common cases
will be actually improved with CFS.

(Reply to this) (Parent)


[info]a_s_g_a_r_d
2007-08-07 05:16 pm UTC (link)
hm, quite interesting, thanks for comprehensible argumentation.

p/s
i found that Michal Piotrowski made several comprassions of sd and cfs schedulers:
here, there and this one.

(Reply to this) (Thread)


[info]jeffr_tech
2007-08-07 06:55 pm UTC (link)
Do you know where I might find these benchmarking tools so I can compare FreeBSD's schedulers?

(Reply to this) (Parent)(Thread)


[info]a_s_g_a_r_d
2007-08-07 07:35 pm UTC (link)
this tool called interbench and unfortunately it disigned for linux kernel only. it would be great if you can make such compression.

(Reply to this) (Parent)(Thread)


[info]jeffr_tech
2007-08-07 10:59 pm UTC (link)
I wrote something very similar to this for use in my ULE paper. Although Con's tool does a bit more. It was very simple to port to BSD. I will get results comparing this on BSD vs Linux with several schedulers on the same machine.

(Reply to this) (Parent)(Thread)


[info]a_s_g_a_r_d
2007-08-08 01:40 am UTC (link)
I wrote something very similar to this for use in my ULE paper. Although Con's tool does a bit more. It was very simple to port to BSD. I will get results comparing this on BSD vs Linux with several schedulers on the same machine.

i've red this paper few hours ago. it would be quite interesting to look at the results.

(Reply to this) (Parent)(Thread)


[info]jeffr_tech
2007-08-08 01:55 am UTC (link)
Unfortunately a lot of the information in that paper is no longer accurate. For more detail on recent ULE developments you can read back entries in my journal.

Specifically, see here:

http://jeffr-tech.livejournal.com/3729.html#cutid1

(Reply to this) (Parent)


[info]jeffr_tech
2007-08-07 11:31 pm UTC (link)
Interesting. This test has problems with ULE because it goes to some length to penalize parents for spawning expensive children. The subsequent forked children from that parent are further penalized. So all of the tests are eventually run starting out with very poor priorities.

I'll have to make it sleep a few seconds between runs to get meaningful data.

(Reply to this) (Parent)(Thread)


[info]jeffr_tech
2007-08-07 11:45 pm UTC (link)
No it seems to be totally broken on BSD. Reporting maximum latencies of 16 minutes on a test that only runs for a minute or so.

(Reply to this) (Parent)


[info]sniket
2007-08-07 07:48 pm UTC (link)
You can find massive_intr here -> http://people.redhat.com/mingo/cfs-scheduler/tools/massive_intr.c

(Reply to this) (Parent)

CFS immature
(Anonymous)
2007-08-23 10:53 pm UTC (link)
Not a big fan of O(1) et al really, but you should consider there are problems showing in CFS implementation immediately upon it's 'release'. CK was definately bang-on when he spoke about the maturity of SD.

Here's a specific example of CFS failing to live up to SD standards when doing low latency audio work off of a ReiserFS v3 filesystem (which is _still_ in the Big Kernel Lock [BKL]! - amusingly? the only other in BKL atm is TTY):
http://bhhdoa.org.au/pipermail/ck/2007-July/008326.html

One 'corner case' exposed already. Developer reactions? Well one would assume (I would anyway) a desire to rectify this scenario with the scheduler - however the initial reaction to it is one of, "oh Reiser 3 is BKL, nothing to see here - move along".

SD performs more than just fine in that scenario, CFS doesn't.

(Reply to this) (Thread)

Re: CFS immature
[info]jeffr_tech
2007-08-24 02:18 am UTC (link)
Thanks for the link. I'll have to look into it.

I have investigated and there are still more cases than just tty that use the bkl. Even some generic filesystem code still uses it.

(Reply to this) (Parent)


[info]sence1
2009-03-12 10:05 pm UTC (link)
Koxp, Koxp 1726, Koxp Merkezi

Koxp

(Reply to this)

thanks
[info]cacala
2009-06-27 05:39 pm UTC (link)
sesli sohbet
sesli chat
sesli chat
seslichat
sesli panel
tatil otelleri
ucuz oteller
kiralık tekne
tekne kiralama
ajans
oyuncu
tekirdağ nakliyat
tekirdağ evden eve nakliyat
evden eve nakliyat
sağlık

(Reply to this)


Create an Account
Forgot your login?
Login w/ OpenID
English • Español • Deutsch • Русский…