jeffr_tech ([info]jeffr_tech) wrote,
@ 2007-03-17 09:21:00
Update on the linux scaling situation.
Many people have posted this link, http://ozlabs.org/~anton/linux/sysbench/, which details how switching to the google malloc fixed the problem. However, if you follow the lkml thread a little deeper, here: http://lkml.org/lkml/2007/3/13/159, you'll see that there is more to the issue than userland malloc.

The google malloc, for reasons unknown to me, is apparently not a good general purpose allocator. Further on in the thread it's revealed that the problem may actually be that a process global semaphore is taken on every contended futex acquire and release. So it may very well be that a reasonable allocator design is being penalized by kernel contention. FreeBSD avoids this contention by having two tiers of userland locks (we call ours umtx; futex is the linux equivalent). One is process local and is used by default for most threaded applications, and the other is system global. In the system global case we also have to traverse the vm data structures with an expensive lock; however, we don't default to that.
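For readers who want to see the two tiers from userland: POSIX exposes the same process-local vs. system-wide distinction through the mutex `pshared` attribute. What follows is only a sketch of that portable interface, not FreeBSD's umtx code or the linux futex path, and the helper names `init_mutex` and `default_pshared` are made up for this example.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: initialise *m with the requested sharing mode.
 * pshared is PTHREAD_PROCESS_PRIVATE (process local) or
 * PTHREAD_PROCESS_SHARED (usable across processes).
 * Returns 0 on success or an errno-style error code. */
static int init_mutex(pthread_mutex_t *m, int pshared)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setpshared(&attr, pshared);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: report the sharing mode a fresh attribute object
 * starts with. Returns -1 if the attribute calls fail. */
static int default_pshared(void)
{
    pthread_mutexattr_t attr;
    int pshared = -1;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_getpshared(&attr, &pshared) != 0)
        pshared = -1;

    pthread_mutexattr_destroy(&attr);
    return pshared;
}
```

POSIX specifies `PTHREAD_PROCESS_PRIVATE` as the default, so `default_pshared()` should report the process-local tier unless the program asks otherwise. A mutex that is really meant to be shared between processes also has to live in memory both processes map (for example a `MAP_SHARED` mapping); the sketch leaves that part out.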

I suspect that linux ended up with the process global lock being used in all cases to support the older clone() based linuxthreads, which was really a bad idea for more reasons than this one.

On the FreeBSD front we've been improving our file descriptor locking and replacing my tophalf patch with a more general approach via improved sx locks and sleep APIs. This has resulted in another 4% or so perf boost with a slightly flatter falloff for us. I'm still working on my patch to break up the scheduler lock, which should give us another nice boost.



(16 comments)


[info]evan
2007-03-17 05:59 pm UTC (link)
These links are pretty interesting.

What do you mean by "not a good general purpose allocator"? (Not disputing it, just don't quite know what you mean.) If you mean "not good for normal apps", then yeah, there's a good reason for that. If you haven't seen it, the tcmalloc people have written a nice doc explaining how it works. In general, Google code tends to waste a bit more memory to gain performance. There's always a trade-off available in that space, and operating systems are typically constrained to wasting the minimum amount of memory per process 'cause you need to handle people spawning processes left and right. But for Google servers you know you'll be one of the only processes on the machine.



[info]jeffr_tech
2007-03-17 06:37 pm UTC (link)
"for reasons unknown to me". I was really only quoting what I read in lkml. I don't know anything at all about their allocator. I just wanted people to realize that the issue wasn't as simple as 'glibc allocator sux'.

I'll check out the google link.



[info]trhodes
2007-03-17 08:20 pm UTC (link)
Now, is your patch at all available at least to the FreeBSD developer community? I'd be interested in not only seeing it but testing it out.

--
Tom Rhodes



[info]jeffr_tech
2007-03-17 08:46 pm UTC (link)
http://people.freebsd.org/~jeff/tophalf.diff

This isn't the diff that will go into the tree. jhb and rookie are working on something better.



[info]smkelly
2007-03-18 07:28 pm UTC (link)
This may be an imposter Tom Rhodes. I suggest extreme caution.



[info]trhodes
2007-03-18 09:22 pm UTC (link)
Ack, I've been caught. I'm actually The Real Tom Rhodes. Anyway, along with being the real Tom Rhodes, I'm also really confused. Might I ask why you are working so hard on a patch that isn't going to be used? Perhaps to prove that one method may yield more performance over another? Thanks,

--
Tom Rhodes



[info]jeffr_tech
2007-03-19 04:47 pm UTC (link)
Oh, are you asking about the schedlock patch or the tophalf patch? I thought you wanted to see the tophalf patch. I only put about a half hour into that. My scheduler lock patch is threadlock2.diff at http://people.freebsd.org/~jeff/



[info]trhodes
2007-03-23 08:28 am UTC (link)
Wow, ace man, truly ace. Thanks!

--
Tom Rhodes


(Reply from suspended user)

[info]evan
2007-03-17 06:18 pm UTC (link)
For those reading at home, here's a better link for that LKML thread. (Argh, I wish someone had a more pleasant archive for this stuff... even gmane is a pain.)


some comments
(Anonymous)
2007-03-18 03:49 am UTC (link)
No, the mmap_sem acquisition in futex is not the problem. (I instrumented it on
the MySQL regression load and it definitely doesn't make a difference). It is a
non-exclusive lock anyway, and generally only becomes a problem at higher core
counts, due to cacheline ping pong.

The problem really is a combination of the single MySQL heap lock and the malloc
global lock. My guess is that MySQL on FreeBSD might be avoiding this issue by
setting the process to SCHED_RR realtime scheduling so that it doesn't get preempted
while holding the heap mutex (which it doesn't do for Linux). You should run a test
without that code in.

Finally, tcmalloc is not a good GP allocator AFAIK because it does not release
memory back to the system. This is probably done to make locking a bit simpler,
but it could probably be improved.


Re: some comments
[info]jeffr_tech
2007-03-19 04:46 pm UTC (link)
If mysql has a global heap mutex why does switching allocators resolve the problem? Wouldn't contention be just as high, minus some time assuming the google malloc is more efficient? Any locks in malloc are irrelevant if mysql does have a global heap lock. It would prevent any internal allocator contention.

I'll look into the SCHED_RR bit later this week. I know we support it, I don't know if it's having an effect here.


Re: some comments
(Anonymous)
2007-03-20 04:53 am UTC (link)
Because the malloc mutex is also contended, and it gets taken inside the heap
mutex. So now a lot of your malloc lock contention gets transferred to the
heap mutex.

I'm sure malloc is used in places other than simply the MySQL heap, which is
why the heap mutex serialisation does not mask the malloc lock.
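The nesting described above can be reproduced in a few lines. This is a toy sketch, not MySQL or glibc code: `heap_lock` stands in for the application heap mutex and `malloc_lock` for an allocator-wide lock, and every name here is made up for illustration. The caller plays the part of some other malloc user that currently holds the allocator lock; the worker enters the heap, then stalls on the allocator lock while still holding the heap mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the two locks discussed in the thread. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_heap; /* set to 1 once the worker holds heap_lock */

/* Allocator with a single global lock. */
static void *locked_malloc(size_t n)
{
    pthread_mutex_lock(&malloc_lock);
    void *p = malloc(n);
    pthread_mutex_unlock(&malloc_lock);
    return p;
}

/* Application heap: takes its own mutex, then calls the allocator. */
static void *heap_alloc(size_t n)
{
    pthread_mutex_lock(&heap_lock);
    atomic_store(&in_heap, 1);
    void *p = locked_malloc(n); /* may block here with heap_lock held */
    pthread_mutex_unlock(&heap_lock);
    return p;
}

static void *worker(void *arg)
{
    (void)arg;
    free(heap_alloc(64));
    return NULL;
}

/* Hold malloc_lock, let a worker enter the heap and stall on the
 * allocator, then probe heap_lock. Returns the trylock result seen
 * while the worker is stalled, or -1 if the thread cannot be created. */
static int heap_state_during_malloc_contention(void)
{
    pthread_t t;
    int rc;

    atomic_store(&in_heap, 0);
    pthread_mutex_lock(&malloc_lock);
    if (pthread_create(&t, NULL, worker, NULL) != 0) {
        pthread_mutex_unlock(&malloc_lock);
        return -1;
    }
    while (!atomic_load(&in_heap))
        sched_yield(); /* wait until the worker owns heap_lock */

    rc = pthread_mutex_trylock(&heap_lock);
    if (rc == 0)
        pthread_mutex_unlock(&heap_lock);

    pthread_mutex_unlock(&malloc_lock); /* let the worker finish */
    pthread_join(t, NULL);
    return rc;
}
```

While the worker waits on `malloc_lock`, the probe of `heap_lock` comes back `EBUSY`: anyone else who wants the heap is now queued behind an allocator stall, which is the "contention gets transferred" effect. Once the allocator lock is released and the worker exits, the heap mutex is free again.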



[info]jennyrexac
2008-07-17 04:45 am UTC (link)
A significant amount of work has been done to make the locking SMPng-friendly and to cut down on kernel stack abuse.


GNUkFreeBSD
(Anonymous)
2007-03-18 06:56 pm UTC (link)
It would be interesting if someone with a 4-8 core box could run this test on GNU/kFreeBSD, which has glibc + FreeBSD's kernel. That might give some clues about how much of the problem is glibc's malloc and how much is the Linux kernel.

http://glibc-bsd.alioth.debian.org/doc/



