<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:lj="http://www.livejournal.com">
  <id>urn:lj:livejournal.com:atom1:jeffr_tech</id>
  <title>jeffr_tech</title>
  <subtitle>jeffr_tech</subtitle>
  <author>
    <name>jeffr_tech</name>
  </author>
  <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/"/>
  <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom"/>
  <updated>2009-02-01T00:09:54Z</updated>
  <lj:journal userid="6769489" username="jeffr_tech" type="personal"/>
  <link rel="service.feed" type="application/x.atom+xml" href="http://jeffr-tech.livejournal.com/data/atom" title="jeffr_tech"/>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:22432</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/22432.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=22432"/>
    <title>New UMA features for more efficient memory layout</title>
    <published>2009-02-01T00:09:54Z</published>
    <updated>2009-02-01T00:09:54Z</updated>
    <content type="html">I have wanted to write for some time about UMA changes I recently made.  UMA is the "universal memory allocator" which serves as FreeBSD's kernel memory allocator.  I initially wrote this 7 years ago or so and many other people have since contributed.  I named it 'universal' because at that time FreeBSD had 3 separate kernel memory allocators and this unified them.  The two new features relate to network performance work I've been doing lately and allow a more efficient layout of network buffers.&lt;br /&gt;&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The first feature expands on the 'keg' concept, first introduced by Bosko to support network buffer allocations, by allowing zones to back-end to multiple kegs.  UMA is a zone or slab allocator wherein a zone is created to allocate a given type of memory and contains parameters specific to that type.  The type may be all allocations of a specific size, as is used for the malloc() front-end, or it may be a complex object type that has initializers, finalizers, specific alignment requirements, etc.  The keg is a refinement of this concept where a keg provides the backend allocation and storage description while the zone controls the contents of the individual allocated items and provides a caching layer, client API, etc.  The difference is subtle but important.  Restated, the keg describes the format of the page or pages that the item lives in while the zone describes the format within each allocated item.  In this way multiple kegs can provide items which meet the client's requirements while varying the format and source of the memory that they come from.&lt;br /&gt;&lt;br /&gt;Now that a zone may contain multiple kegs, you may have multiple back-end sources for the allocation, but the consumers are not required to differentiate.  For example, I have two kegs for 2 kilobyte network buffers.  
One allocates from single pages and the other allocates from hardware-supported large pages.  If large, contiguous pages are available, the large-page keg is used.  This optimizes access to network buffers by allowing fewer TLB entries to describe them.  However, due to fragmentation, we don't always have large aligned chunks of memory available, and in this case we can fall back to the single-page keg transparently.  This keg concept is also being used to implement NUMA support.  There is one keg per NUMA node and the search function can be coded to be topology aware.  Items from multiple compatible kegs can exist in the same fast cache in the zone and will be automatically retired to the correct source.&lt;br /&gt;&lt;br /&gt;The second feature builds on the first and allows for much more efficiently aligned data structures.  This is akin to cache-coloring but goes one step further.  The start address of each allocated item is aligned such that it falls on a different cache line than the previous allocation.  Simple cache-coloring typically ignores naturally aligned allocations or only colorizes each page or slab.  In this scheme a large contiguous block of memory is allocated and each item is padded until it reaches a new color.  For large, uniformly sized network buffers, this has a tremendous benefit.  Consider a 2k allocation always ending up on a 2k boundary.  This uses only 1/(allocation size / line size) of the available lines for the most accessed bit at the beginning.  So for a 2k buffer and a 64-byte line size the start addresses fall on only 1/32 of the available lines.  Essentially the buffer is padded by the line size so that the start addresses alternate lines, and the number of allocations required to hit every line is computed to determine the storage requirements.  
Using a large contiguous block of memory ensures that this is as effective for virtually indexed L1 caches as it is for physically indexed L2 and L3 caches.&lt;br /&gt;&lt;br /&gt;This has a secondary effect of improving utilization on striped memory controllers.  The exact details of how a physical address maps to a channel/bank/rank of DRAM are not published.  However, it is clear that predictable access patterns aligned on a large power-of-two size will strongly favor a small subset of the available DIMMs, just as they favor a small subset of the available cache lines.  By alternating the start addresses on cache-line boundaries we can be assured that we are uniformly loading the available DIMMs because a cache line is the smallest unit of memory transfer for all practical purposes.&lt;br /&gt;&lt;br /&gt;These optimizations are only useful in workloads where memory and cache pressure are the significant bottleneck.  Unfortunately, I don't have any benchmarks for stock FreeBSD to share at this time.  I'm not certain that I'm permitted to share details about the yields in the proprietary stack this was implemented for.  I will say that it was on the order of 10% in an already heavily optimized environment where traditional profile-guided software optimizations were yielding much less.&lt;br /&gt;&lt;br /&gt;I should also mention that this work was primarily funded by Nokia and most graciously donated to FreeBSD.&lt;br /&gt;</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:22106</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/22106.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=22106"/>
    <title>more dram access timings on two interesting architectures</title>
    <published>2009-01-16T02:38:29Z</published>
    <updated>2009-01-16T02:39:26Z</updated>
    <content type="html">Ever wonder what memory latency is like on a large loosely connected Opteron system?  I lie awake at night wondering myself.  Fortunately, I have access to a Tyan 8 socket Barcelona system.  This is basically two 4 socket boards with two very slow HT links between them.  I also have access to a Nehalem based box that I have timings for.  The results are behind the cut.&lt;br /&gt;&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;The test code simply allocates a user specified size of memory, prefaults it, mlocks it, and then uses rdtsc to count the cycles of individual random memory accesses (read only).  I also have a small calibration loop to try to determine the rough cost of the rdtsc instruction and remove it from the total cycles counted.  With very large blocks of memory you see DRAM performance and with small blocks of memory you can see different stages of the cache hierarchy.&lt;br /&gt;&lt;br /&gt;Here are the results as a histogram for a 32way 8 socket Opteron sampling 1GB of memory:&lt;br /&gt;&lt;img src="http://people.freebsd.org/~jeff/opteron32way.png"&gt;&lt;br /&gt;&lt;br /&gt;Since not all processors even in a 4 socket system are directly connected to each other, you usually see 2-3 peaks depending on which cpu you're scheduled on.  The fastest is always local memory and then you see one and perhaps two hops to remote nodes on the same board.  Here we see a smallish peak for local memory and a larger peak as more random samples hit the other three sockets on our board.&lt;br /&gt;&lt;br /&gt;Then there is the horrible clump once we go over the slow HT links to the other board where we regularly see memory access latencies of up to 0.5us!  Incredible.  The other thing worth noting is that even the local access is unfortunately slow.  
I believe this has to do with the cache coherency protocol still requiring us to query every other socket before owning the line.&lt;br /&gt;&lt;br /&gt;Next we have a simpler, two socket Nehalem sampling 128MB of memory:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://people.freebsd.org/~jeff/nehalem.png"&gt;&lt;br /&gt;&lt;br /&gt;This is really very clean.  In fact, we can pick out particular features of DRAM by looking at this graph.  First, we see two tall peaks, representing local and remote DRAM.  The second is taller only because the first has more minor variance and is therefore wider.  So it's roughly 85ns for a local access and 135ns for remote.  The other peaks we see are likely due to two causes.  The short peak after the DRAM timing is likely covering requests which occur during a DRAM refresh cycle.  The penalty is about the right amount of time for that.  The short peak before is likely due to back-to-back requests coming in for the same row. &lt;br /&gt;&lt;br /&gt;Information like this helps us understand the relative trade-offs for different optimizations related to memory organization and locality.&lt;br /&gt;</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:21805</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/21805.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=21805"/>
    <title>What happened to jeffr_tech?</title>
    <published>2008-11-20T01:32:22Z</published>
    <updated>2008-11-20T01:32:22Z</updated>
    <content type="html">Although you probably aren't, you might be wondering &amp;quot;What happened to jeffr? He doesn't post anymore!&amp;quot;.  Last year I was posting somewhat regularly, I was active in FreeBSD, and I was cycling seriously again.  Over this past year my health started to falter.  I started working the bare minimum, couldn't keep up with my open-source responsibilities, and couldn't even ride my bike up the hill to get home.  I seemed to have trouble concentrating at all.  My memory was weakening and my motivation was nil.&lt;br /&gt;&lt;br /&gt;As it turns out, my brain stopped producing a sufficient amount of a hormone that regulates water secretion from the kidneys (&lt;a href="http://en.wikipedia.org/wiki/Vasopressin"&gt;Vasopressin&lt;/a&gt;).  I was slowly becoming seriously dehydrated despite drinking as much as 7 liters of water a day.  Eventually, my blood volume became so low that my heart rate was 160 bpm at rest and I started to develop hypothermia because I couldn't heat my body any longer.&lt;br /&gt;&lt;br /&gt;After spending a few days in the cardiac ward of the hospital they finally injected me with a synthetic analog of vasopressin.  The doctors seemed very suspicious of my story and wanted to attribute my problems to psychological causes.  The disorder I have, &lt;a href="http://www.diabetesinsipidus.org/"&gt;Diabetes Insipidus&lt;/a&gt;, has a prevalence of only 1 in 25,000.  There are perhaps 5 other people on the island that may have it.  I am very fortunate that I had begun to suspect and research this illness prior to being admitted to the ER.  I had to resist some potentially very dangerous treatment and insist they look closer at DI.&lt;br /&gt;&lt;br /&gt;Finally I am getting the treatment I need.  I gained back 7 pounds of water weight in the first two days.  Vasopressin has a positive effect on memory and recall and I feel like I can think clearly for the first time in a year.  
I'm still weak and tired from such a long period of inactivity and stress on my body but my prognosis is good.  Aside from having to take a synthetic hormone and avoid diuretics for the rest of my life I should suffer no long-term effects.&lt;br /&gt;&lt;br /&gt;I am very eager to return to doing some work which is worthy of posting.  I am even more eager to be able to cycle again.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:21654</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/21654.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=21654"/>
    <title>dev summit</title>
    <published>2008-05-18T23:58:33Z</published>
    <updated>2008-05-18T23:58:33Z</updated>
    <content type="html">I just returned from two weeks of travel.  One for my wedding anniversary and another for the FreeBSD developer summit which preceded BSDCan.  The summit was productive but I'm very happy to be done with the travel.&lt;br /&gt;&lt;br /&gt;There were many great discussions at the summit with topics ranging from release engineering to TCP scalability.  I participated in one on mbufs (network buffers) and one on the buffer cache (file-system buffers).  For mbufs I presented a technique that I developed for Nokia based on Kip Macy's excellent work on 10-gigabit Ethernet drivers.  This technique should simplify referenced data while reducing the number and temporal scope of cache lines accessed to manipulate buffers in the common case.  There is still much work to do to prove it out, however.&lt;br /&gt;&lt;br /&gt;The buf discussion lasted almost two hours and was much broader in scope.  We will hopefully have a fully revamped IO path for 8.0 to address a wide variety of structural and performance problems.  I'm very excited to see this work progress after many years of planning and discussion.  My SoC student this year is implementing one essential piece by replacing a splay tree in the vm with a radix tree. &lt;br /&gt;&lt;br /&gt;I also had a very interesting discussion with a new project member, Lawrence Stewart, about TCP congestion control, on which he gave a talk later in the conference.  I spent the very first part of my career working in the TCP/IP group at Microsoft when TCP Vegas was still relatively new.  Congestion control was one part of my work responsibilities and an area I pursued as a hobby.  I was surprised to hear that delay based congestion control was making a comeback in some research circles.  
It was nice to hear about developments in this field that I haven't followed in some time.&lt;br /&gt;&lt;br /&gt;After all of that socializing and discussion I had a horrible flight but was pleased to find that Hawaii now feels very much like home to me and returning was quite a relief.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:21310</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/21310.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=21310"/>
    <title>adaptive idling</title>
    <published>2008-04-17T09:16:19Z</published>
    <updated>2008-04-17T09:16:19Z</updated>
    <content type="html">One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping.  We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks.  This has had an unexpected influence on our idle loop.&lt;br /&gt;&lt;br /&gt;When a thread becomes runnable it is often desirable to run it on a cpu other than the current one.  If the target cpu is in the idle loop, it may actually be waiting in a low power state using the 'hlt' instruction or some acpi mechanism that I try to avoid.  To wake up this remote cpu we currently issue an IPI (inter-processor interrupt).  This is actually very expensive for both the sender and the receiver.&lt;br /&gt;&lt;br /&gt;On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations.  This works by giving the programmer access to the existing hardware bus snooping interface.  The sleeping cpu wakes up when it sees another cpu write to the memory location it is snooping.&lt;br /&gt;&lt;br /&gt;On Barcelona mwait doesn't enter as deep a sleep as on the Xeons.  So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not.  With mwait you can actually specify the power state you'd like so I keep both the Xeons and Opterons in C0 to further reduce wakeup latency.&lt;br /&gt;&lt;br /&gt;Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period.  So this is now the first stage in the adaptive algorithm: we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.&lt;br /&gt;&lt;br /&gt;Using a 'ping-pong' threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism.  
In most cases we're still idling in hlt as well, so there should be no negative effect on power.  In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:21014</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/21014.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=21014"/>
    <title>file offset semantics</title>
    <published>2008-04-08T05:11:53Z</published>
    <updated>2008-04-08T05:11:53Z</updated>
    <content type="html">I'm further exploring the concurrency guarantees of file i/o in various operating systems.  I've found more surprising race conditions and differences of implementation between operating systems.&lt;br /&gt;&lt;br /&gt;Each file descriptor in UNIX has an offset associated with it.  This is what allows you to call read() over and over again without specifying a position and get later and later chunks of a file.  Or to write and continue where you left off.  There is the additional complication of append mode writes but let's ignore that for a moment.&lt;br /&gt;&lt;br /&gt;To keep things straight let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor a 'file'.  This is in keeping with how it's done in the kernel.  So many threads or even processes may share a single file descriptor that points to one file, so they have a shared offset.  Or many processes may have unique file descriptors and so they have unique offsets.&lt;br /&gt;&lt;br /&gt;In the shared case we have to determine how updates to this offset are serialized.  One important detail is that the offset is 64bit.  On 32bit platforms this means it's written with two discrete writes.  Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may each set a different half of the final offset, leaving you with a corrupt or invalid offset.&lt;br /&gt;&lt;br /&gt;Another question is, what happens with two simultaneous writes to the file?  If we don't serialize the offset they will both write to the same location.  If we do, they write one after the other.  The same goes for the read side.  If two threads in the same process read from the same file simultaneously do they get unique data or the same data?  This question applies to threads and to processes forked with rfork().&lt;br /&gt;&lt;br /&gt;Before about 1986 in unix there was no serialization on updates.  
It also was non-preemptive and uniprocessor and had 32bit offsets so you didn't have to worry about partial writes even on 16bit machines.  The inode was locked after the offset was loaded, so multiple readers could see the same data and multiple writers would write to the same offset.  McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected the offset to handle a case where a forked process was getting output mixed up.&lt;br /&gt;&lt;br /&gt;Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes.  This means writers are serialized but readers are not.  It also means that offset updates in the read case on 32bit can corrupt the offset value.&lt;br /&gt;&lt;br /&gt;Linux manipulates the offset without a lock in any case.  The offset pointer is corruptible on 32bit processors.  Neither readers nor writers are serialized.&lt;br /&gt;&lt;br /&gt;FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases.  This actually serializes reads done to the same fd if they don't use pread().&lt;br /&gt;&lt;br /&gt;POSIX doesn't specify this carefully enough to say what is required.&lt;br /&gt;&lt;br /&gt;I think at a minimum Solaris and Linux need to protect the value on 32bit architectures.  It's a once-a-year type of event that could lead to problems but these are the kinds of races and bugs that are impossible to track down.  FreeBSD, on the other hand, could relax the restriction on read updates.  It doesn't make much sense to do so for writes, since the write restriction is what fixes the original bug encountered in 1986.  I'll have to think of an elegant way to handle 64bit writes on 32bit platforms however.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:20824</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/20824.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=20824"/>
    <title>email</title>
    <published>2008-04-07T11:43:00Z</published>
    <updated>2008-04-07T11:43:00Z</updated>
    <content type="html">I worked at an internet service provider when I was in high school and as a result got free email for life.  This is my @chesapeake.net address.  Unfortunately chesapeake.net outsourced email to a bulk provider and I've had a remarkable number of emails just plain dropped since.  So what am I to do about it?&lt;br /&gt;&lt;br /&gt;I registered jroberson.net and set the mx to point at Google.  Then I set my Google account to let me pop and send mail via smtp.  What I don't understand is why Google does this for free.  I'm not looking at any targeted advertising.  They're just acting as temporary storage until I pop the mail.  I may never use the web interface again.&lt;br /&gt;&lt;br /&gt;Also, I'd like to take this opportunity to point out that Google is the new Microsoft which was the new IBM which was probably the new something else.  What I mean is, an exciting and innovative company becomes large and imposing and then no one likes them.  I'm surprised more people haven't seen the writing on the wall.  I speculate that after 10-15 years of declining popularity and utility Microsoft will eventually become a respected, productive member of society again like IBM did.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:20707</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/20707.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=20707"/>
    <title>IO atomicity</title>
    <published>2008-03-31T23:50:10Z</published>
    <updated>2008-03-31T23:50:10Z</updated>
    <content type="html">I have long wondered exactly what the atomicity guarantees of read() and write() are so I did some code and POSIX spelunking over the weekend.  The scenarios I'm talking about are as follows:&lt;br /&gt;&lt;br /&gt;1) Can readers read concurrently with readers?&lt;br /&gt;2) Can readers read concurrently with writers?&lt;br /&gt;3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?&lt;br /&gt;4) Can writers write concurrently with other writers?&lt;br /&gt;5) If writers can write concurrently what constraints are there on the resulting bytes?&lt;br /&gt;&lt;br /&gt;So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing.  There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime.  So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction.  No partial writes are visible.  No interleaved writes are possible.  Readers are concurrent.&lt;br /&gt;&lt;br /&gt;On Linux, excepting appends, the inode is not locked for io.  Instead page reference counts and locks are used for individual parts of the file.  You can think of this like impromptu byte-range locking.  Linux allows readers to proceed with other readers and writers for overlapping byte ranges.  This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu.  If you read and the data is not buffered, an exclusive lock on the page is held until the data is valid.  The same exclusive lock protects overlapping writes to the same page.  However, the results when writes span pages seem to be undefined.  
This is basically as concurrent as you can reasonably get.&lt;br /&gt;&lt;br /&gt;So what does POSIX say?&lt;br /&gt;ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)&lt;br /&gt;&lt;br /&gt;Rationale section for the read syscall:&lt;br /&gt;&lt;br /&gt;I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic&lt;br /&gt;means that all the bytes from a single operation that started out together&lt;br /&gt;end up together, without interleaving from other I/O operations.&lt;br /&gt;&lt;br /&gt;There are other statements elsewhere in POSIX that state a read following a write in time must see the results of the write as a whole.  The emphasis on time likely has to do with NFS.  So POSIX is fairly clear.  Linux is too loose but FreeBSD is too tight.  We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.&lt;br /&gt;&lt;br /&gt;Really, the standards were derived from the legacy behavior of older unix in order to define the properties that their authors believed were important for existing applications.  In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases.  Clearly it is even more strict than FreeBSD.&lt;br /&gt;&lt;br /&gt;I believe for 8.0 I will try to make this programmable on a per-file or per-system basis.  Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes.  However, I find it hard to imagine any application wants to see partial byte results.  I suspect range locking on writes will be sufficient in almost all cases.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:20425</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/20425.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=20425"/>
    <title>jeffr_tech @ 2008-03-22T21:41:00</title>
    <published>2008-03-23T07:42:06Z</published>
    <updated>2008-03-23T07:42:06Z</updated>
    <content type="html">MyISAM performance is terrible in FreeBSD 7.0 due to the user-space pthread_rwlock implementation.  Just a word of warning if you intend to deploy a database server based on 7.0.  I am certain we will have this fixed in 7.1.  It will most likely be in CURRENT in a week or two.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:20087</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/20087.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=20087"/>
    <title>FreeBSD SoC</title>
    <published>2008-03-23T04:13:47Z</published>
    <updated>2008-03-23T04:13:47Z</updated>
    <content type="html">I've signed up to mentor a FreeBSD SoC project again this year.  I'm most interested in sponsoring the following projects:&lt;br /&gt;&lt;br /&gt;1)  Improved schedgraph support.&lt;br /&gt;2)  User-space lock profiling tool&lt;br /&gt;3)  Improved VM tree structures&lt;br /&gt;4)  SMP safing Giant protected filesystems.&lt;br /&gt;&lt;br /&gt;Maybe others.  I know schedgraph and the user-space lock profiling may not sound as glamorous but they have the potential to have the biggest long-term impact on performance.  Please email jeff at freebsd.org if you'd like to discuss these further.&lt;br /&gt;&lt;br /&gt;The official list of project ideas is here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.freebsd.org/projects/summerofcode.html"&gt;http://www.freebsd.org/projects/summerofcode.html&lt;/a&gt;</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:19908</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/19908.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=19908"/>
    <title>jeffr_tech @ 2008-03-15T23:11:00</title>
    <published>2008-03-16T09:12:53Z</published>
    <updated>2008-03-16T09:12:53Z</updated>
    <content type="html">People are always posting comments of 'what about solaris!'.  I'm going to install some new operating systems on an 8way xeon (2x4).  So what about solaris?  What should I install?  Can I do a network install or do I have to burn dvds?  Any tips?  Which Linux should I install for benchmarking?  I've just been using fedora.  Maybe I should stick with that since I'm familiar with it.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:19466</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/19466.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=19466"/>
    <title>jeffr_tech @ 2008-03-12T19:22:00</title>
    <published>2008-03-13T05:31:21Z</published>
    <updated>2008-03-13T05:32:32Z</updated>
    <content type="html">I have an Opteron with older slower memory that I reproduced the pipe tests on to see if it was any different on a 64bit system.  I'm not going to paste the full results but here are a couple of data points:&lt;br /&gt;&lt;br /&gt;linux-2.6.24&lt;br /&gt; 64[writer]: 97.235 wall (2.031 usr, 68.674 sys), 10.531 Mb/sec&lt;br /&gt;1024[writer]: 13.300 wall (0.145 usr, 9.039 sys), 76.991 Mb/sec&lt;br /&gt;65536[writer]: 3.068 wall (0.001 usr, 1.718 sys), 333.766 Mb/sec&lt;br /&gt;&lt;br /&gt;FreeBSD 8.0-CURRENT undermydesk (no cpu switch patches though)&lt;br /&gt;64[writer]: 53.163 wall (1.057 usr, 42.083 sys), 19.261 Mb/sec&lt;br /&gt;1024[writer]: 5.325 wall (0.118 usr, 4.146 sys), 192.284 Mb/sec&lt;br /&gt;65536[writer]: 0.567 wall (0.000 usr, 0.130 sys), 1805.509 Mb/sec&lt;br /&gt;&lt;br /&gt;So on this machine we start off about 2x as fast and end up about 5.5x as fast.  The numbers pretty much follow a curve through those points.  This verifies the data taken from the old 32bit HTT machine they tested on.  I don't intend to post configs and so on as the original lkml thread is plenty rigorous.&lt;br /&gt;&lt;br /&gt;I forgot to mention earlier.  The FreeBSD Alan Cox has committed super-pages!  We're seeing some great gains from that.  This allows the kernel to automatically use large pages for conforming regions of memory.  It has a component that ensures that large, contiguous chunks of physical memory will be available to support this.  There is also a defragmenting/compacting piece.  There's some great work going into FreeBSD 8.0 already!</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:19210</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/19210.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=19210"/>
    <title>jeffr_tech @ 2008-03-12T13:59:00</title>
    <published>2008-03-13T00:17:02Z</published>
    <updated>2008-03-13T00:17:02Z</updated>
    <content type="html">A couple of bits of news:  We tracked down our problem with the performance drop above 30 threads on Nick Piggin's mysql benchmark to conservative settings for the pthread adaptive spinning.  We see a big gain relative to where we were before.  Frankly at this point we're splitting hairs with linux and I don't really care where we stand.  We had a tremendous problem and we resolved it.  Time to move on.&lt;br /&gt;&lt;br /&gt;I removed kernel support for our M:N threading library last night.  8.0 will only support 1:1.  This will open the way to a lot of optimizations in the signal and sleeping paths.  Hopefully this will reduce the total number of locks required in the sleepq path to a minimum.&lt;br /&gt;&lt;br /&gt;There are some 'interesting' pipe benchmarks floating around.  You can read about them on the lkml and the author's website:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://213.148.29.37/PipeBench/"&gt;http://213.148.29.37/PipeBench/&lt;/a&gt;&lt;br /&gt;&lt;a href="http://lkml.org/lkml/2008/3/5/61"&gt;http://lkml.org/lkml/2008/3/5/61&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I say 'interesting' of course because FreeBSD is doing way better than linux. ;)  Pipes, the next battleground?  I don't know but it's worth a read anyway.&lt;br /&gt;&lt;br /&gt;I also have a patch to implement cpu affinity for our callout mechanism.  This is for time-based callbacks.  The legacy callouts may have order dependencies or may not tolerate concurrency, so by default they are all scheduled on the first callout thread.  There is one callout thread per cpu and each has a kind of 'medium' affinity for that cpu; however, if one is overloaded by some interrupt work another cpu can complete the callouts.  This removes the need to do any kind of load balancing across callout handlers because the scheduler can do a better job anyway.  New callouts can specify any cpu when setting a timer and then they have an affinity for that thread until a different cpu is requested.  
All migration is explicit.&lt;br /&gt;&lt;br /&gt;Hopefully having callout affinity will benefit our tcp stack where Robert Watson is experimenting with different kinds of affinity for tcp sessions.  It will also discourage migration of threads that are sleeping on time-based events like select() and nanosleep().</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:19139</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/19139.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=19139"/>
    <title>ULE happenings, context switch surprise.</title>
    <published>2008-03-11T02:59:39Z</published>
    <updated>2008-03-11T02:59:39Z</updated>
    <content type="html">Lately I've been able to spend a bunch of time on ULE thanks to Nokia.  They use it in one of their networking products.  I've been doing all of this work in 8.0-CURRENT and backporting it for them at the same time.  It's a great model for both parties because users on -CURRENT shake out bugs that they'd have to find in testing otherwise and we get new development paid for.&lt;br /&gt;&lt;br /&gt;I finished and committed the topology aware cpu scheduling that I discussed in earlier posts.  I also implemented a mechanism for CPU provisioning that you can use to restrict groups of processes to sets of cpus which can be dynamically migrated.  This will be useful for restricting jails to certain CPUs or dedicating some CPUs to real-time special-purpose tasks for example.&lt;br /&gt;&lt;br /&gt;Over the last few days I cleaned up my cpu switch optimizations and got those in.  The results are 25% faster context switching in a yield benchmark.  Even faster than linux on the same hardware.  Some day I'll put open solaris on so I have something else to compare to.&lt;br /&gt;&lt;br /&gt;Separate from the other switch benchmarks I've been working on reimplementing amd64's context switching routine almost entirely in C.  I just wanted to do it because we're putting more complex things in and it was getting hard to find registers but it turns out you can make it much faster too.  The yield benchmark is another 10% faster with the C switch routine.  Mostly due to enabling more complex checks, like not setting MSR_FSBASE/GSBASE if they haven't changed, and getting uncommon code out of the fast path.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:18706</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/18706.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=18706"/>
    <title>More sysbench noise.</title>
    <published>2008-03-07T10:32:14Z</published>
    <updated>2008-03-07T10:32:14Z</updated>
    <content type="html">&lt;a href="http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/"&gt;http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Nick Piggin has been doing some benchmarking of recent linux kernels and FreeBSD 7.0 on a 2x quad-core barcelona opteron.  He verified that the CFS problems seem to be fixed and FreeBSD's performance on this box with mysql is really very similar up to about 20 threads.  I feel confident that the test was conducted fairly and I'm happy with these results.  Our stable release is doing very well even if fresh-out-of-git linux is showing better on this platform.  We already have some good gains in this workload in 8.0-CURRENT as well.  What's most important to me is that we stay relevant on common server hardware and we're doing a good job at that.&lt;br /&gt;&lt;br /&gt;I'm also happy to see some collaboration and competition between linux and bsd kernel developers.  I hope that continues.  We're really more alike than we are different.&lt;br /&gt;&lt;br /&gt;Next up, we now have a 16 way xeon and 16 way opteron system to tune and test with.  More points of contention are being removed.  The code marches on.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:18568</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/18568.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=18568"/>
    <title>jeffr_tech @ 2008-02-26T21:46:00</title>
    <published>2008-02-27T07:46:41Z</published>
    <updated>2008-02-27T07:46:41Z</updated>
    <content type="html">&lt;a href="http://www.onlamp.com/pub/a/bsd/2008/02/26/whats-new-in-freebsd-70.html?page=1"&gt;http://www.onlamp.com/pub/a/bsd/2008/02/26/whats-new-in-freebsd-70.html?page=1&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Federico Biancuzzi did a collection of short interviews with many key FreeBSD contributors.  I'm even in there at the end.  Anyway, it's a good summary of some of the exciting technical things that are in 7.0.  I even learned some things while reading it.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:18215</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/18215.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=18215"/>
    <title>lung tech.</title>
    <published>2008-02-21T10:54:33Z</published>
    <updated>2008-02-21T10:54:33Z</updated>
    <content type="html">I, like many nerds and athletes before me, have suffered from asthma and lung problems for almost the entirety of my life.  I don't have the blue-in-the-face, bronchial-spasm, send-me-to-the-hospital variety.  Rather, I have a seemingly constant irritation and periodic, primarily exercise-induced, restriction of my airways that mostly just slows me down.  This is actually caused by a poor immune system reaction to airborne allergens.  Exercise triggers attacks because as much as 10x more air is moving over your lungs so they're likely to get 10x as irritated.&lt;br /&gt;&lt;br /&gt;In any event, this hadn't been much of a problem for me in Seattle, except when I lived in a very moldy old house.  However, after moving to Hawaii something started really bothering me.  My training started out great but after a virus I found myself unable to significantly exert myself for longer than 5 minutes or so.  I went to the doctor but wasn't satisfied with their diagnosis so I bought myself a peak flow meter, blood oximeter and a few other gadgets, and so the nerding began.&lt;br /&gt;&lt;br /&gt;The peak flow meter is really the most interesting.  This measures, in liters/minute, how rapidly you can force air through a constrained passage.  It's just a tube you blow in with a column and a gauge.  For someone of my height and age a 'normal' peak flow rate would be around &lt;a href="http://www.healthcaresouth.com/pages/asthmaaverpeak.htm"&gt;600 l/m&lt;/a&gt;.  My actual measured flow rate was very regular at 675 l/m, so 112% of predicted, not bad!  However, 5 minutes of vigorous exercise on a stationary bike and that had dropped 20% to 540.  A 20% reduction in the rate your lungs move air is enough to perceive as constricted and tight.  Interestingly I'd still be considered in a healthy flow range, and indeed I could rest and talk and walk just fine, I just couldn't ride my damned bike.  
The blood oximeter also showed a 5% drop in blood oxygen saturation during the constriction.&lt;br /&gt;&lt;br /&gt;Armed with these findings I asked for a maintenance drug, Advair, which has a corticosteroid to reduce inflammation.  And indeed, 3 days after starting this treatment, my peak flow now measures around 775, or 13% better.  This would be better than the average flow rate for a 6'8" male.  And hopefully now, after missing the first two races of the season, my training can begin again in earnest.&lt;br /&gt;&lt;br /&gt;And the moral of the story is: you can never have enough gadgets.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:18170</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/18170.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=18170"/>
    <title>Using inlines to reduce code duplication</title>
    <published>2008-02-14T22:13:23Z</published>
    <updated>2008-02-14T22:13:23Z</updated>
    <content type="html">I recently was able to use a neat trick in my scheduler code that I thought I'd share.  It might be old news to many of you and it doesn't come up a lot but it's useful when it does.  The basic notion is that you can use inlines with const arguments to create a sort of parameterized function with no duplicated code post-compile.&lt;br /&gt;&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;Consider a common set of steps with minor deviations that you don't want to duplicate but you also don't want to pay the cost of extra conditionals in a very frequently executed function.  In my case, it's the routine that searches the cpu tree for load conditions.  It may have to find the least or most loaded cpu, or both least and most loaded cpus given some parameters.  So I create an inline function that handles all three cases that takes an argument which specifies which cases are desired.&lt;br /&gt;&lt;br /&gt;When this inline is invoked at a particular call site the required cases are specified with a constant so the dead code elimination phase of the compiler kills everything in the cases we don't care about.  So I'm able to maintain a single function which traverses the tree without paying the cost of having a fully generic algorithm.  In this case the function is actually recursive so I create three non-inlined functions which handle the three variations by calling the inline with the right parameters.  The inline itself then has a switch statement to select the correct invocation.  The switch actually just compiles away into nothing.&lt;br /&gt;&lt;br /&gt;The code, abbreviated, looks something like this:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;
#define CPU_SEARCH_LOWEST       0x1
#define CPU_SEARCH_HIGHEST      0x2
#define CPU_SEARCH_BOTH         (CPU_SEARCH_LOWEST|CPU_SEARCH_HIGHEST)

static inline int
cpu_search(struct cpu_group *cg, struct cpu_search *low,
    struct cpu_search *high, const int match)
{

        if (cg-&amp;gt;cg_children) {
                /* boilerplate goes here */
                for (i = 0; i &amp;lt; cg-&amp;gt;cg_children; i++) {
                        if (match &amp; CPU_SEARCH_LOWEST) {
                                ...
                        }
                        if (match &amp; CPU_SEARCH_HIGHEST) {
                                ...
                        }
                        switch (match) {
                        case CPU_SEARCH_LOWEST:
                                load = cpu_search_lowest(child, &amp;lgroup);
                                break;
                        case CPU_SEARCH_HIGHEST:
                                load = cpu_search_highest(child, &amp;hgroup);
                                break;
                        case CPU_SEARCH_BOTH:
                                load = cpu_search_both(child, &amp;lgroup, &amp;hgroup);
                                break;
                        }
                        if (match &amp; CPU_SEARCH_LOWEST) {
                                ...
                        }
                        if (match &amp; CPU_SEARCH_HIGHEST) {
                                ...
                        }
                }
        } else {
                int cpu;

                CPUMASK_FOREACH(cpu, cg-&amp;gt;cg_mask)
                        total += cpu_compare(cpu, low, high, match);
        }
        return (total);
}

int
cpu_search_lowest(struct cpu_group *cg, struct cpu_search *low)
{
        return cpu_search(cg, low, NULL, CPU_SEARCH_LOWEST);
}

int
cpu_search_highest(struct cpu_group *cg, struct cpu_search *high)
{
        return cpu_search(cg, NULL, high, CPU_SEARCH_HIGHEST);
}

int
cpu_search_both(struct cpu_group *cg, struct cpu_search *low,
    struct cpu_search *high)
{
        return cpu_search(cg, low, high, CPU_SEARCH_BOTH);
}


&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;cpu_compare() is a small inline which does the same trick.  I have verified that even at low optimization settings gcc will spit out three small functions with only the required code in each rather than one huge function with all of the code and lots of conditionals.&lt;br /&gt;&lt;br /&gt;</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:17766</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/17766.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=17766"/>
    <title>cookies</title>
    <published>2008-01-28T02:54:49Z</published>
    <updated>2008-01-28T02:54:49Z</updated>
    <content type="html">Last night, in remembrance of Rich Stevens, I made his ultimate chocolate chip cookies: &lt;a href="http://www.kohala.com/start/recipes/ultimatecookie.html"&gt;http://www.kohala.com/start/recipes/ultimatecookie.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I do this every couple of years and I have to say the closer you follow the directions the better they come out.  I used my wife's KitchenAid mixer and they were the best yet.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:17426</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/17426.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=17426"/>
    <title>non-uniform cpu scheduling.</title>
    <published>2008-01-25T12:36:42Z</published>
    <updated>2008-01-25T12:36:42Z</updated>
    <content type="html">After a month sick with some random virus I'm finally starting to feel normal and get some work done again.  I spent some time implementing a long-standing idea I've had for a more flexible and dynamic CPU topology to improve scheduling decisions.  Modern multi-core processors are non-uniform from a variety of perspectives.  For example, the barcelona has a shared L3 among all cores on a package.  So if you have two packages and you're placing two threads on cores within one package you're wasting half your cache.  To take advantage of this knowledge the scheduler needs detailed information about the cpu layout and it needs to intelligently act on that information.&lt;br /&gt;&lt;br /&gt;&lt;a name="cutid1"&gt;&lt;/a&gt;&lt;br /&gt;Other types of non-uniformity are shared system bus or memory controller links, hyper-threaded or simultaneous multi-threaded cores, etc.  All of these factors stack up to mean simply picking the most idle cpu is no longer an effective solution.  For example, if a thread has affinity for one cpu and it's not available, it is sensible to pick another cpu that shares cache with the desired cpu.  It's also important to balance load across system bus links and caches, not just processors.&lt;br /&gt;&lt;br /&gt;Optimizing for these characteristics is not a novel idea.  Linux and Solaris both make some attempts to do so.  I have not analyzed them in depth but I have noticed that they are both fairly static approaches requiring changes for new architectures or missing opportunities to optimize odd ones such as intel's initial quad core (two sets of dual-core cpus with a large L2 each but a shared system bus).  ULE even had some primitive improvements to optimize for HTT.  
However, my new solution is very dynamic, allowing the machine-dependent code to describe arbitrary topologies with slightly abstracted levels of sharing.&lt;br /&gt;&lt;br /&gt;The basic data structure is a tree that describes the relationships between cpus: what resource they share and which cpus are in the set.  An example tree might have a root that describes 4 sockets which share nothing (independent bus links), with another layer describing a package with an L3, and another describing a pair of cores within that package sharing an L2.  Yet another layer may then describe 4 SMT threads per core.  Most systems are actually less complex and require only one or two levels in the tree, which reduces algorithm cost.&lt;br /&gt;&lt;br /&gt;When searching for a cpu with affinity the scheduler will walk the tree backwards from the last cpu scheduled to the highest level which the thread has affinity for.  It then recursively evaluates the sub-tree to find the most idle match with the best affinity.  To find an idle cpu, we scan the whole tree finding the least loaded CPU on the least loaded path through the tree.  This balances load across tree levels and hence all cpu resources.&lt;br /&gt;&lt;br /&gt;The periodic load balancer now recursively iterates the tree, load balancing every level.  A search routine simultaneously finds the most and least loaded core via each path and rectifies their loads.  The idle time load balancer, which runs when a cpu runs out of work, will scan starting from the nearest cpu outward to pick the best cpu to steal a thread from.&lt;br /&gt;&lt;br /&gt;The majority of the algorithms are quite generic; however, I may implement some special casing for SMT/HTT as there really is a much stronger disincentive to scheduling these cores if others are available unless there is significant cache overlap between threads.  
When there is cache overlap between threads the scheduler can now place them near each other in the topology to decrease cache coherency messages that must be broadcast off package.&lt;br /&gt;&lt;br /&gt;So after all that, how much does it help?  Some charts:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://people.freebsd.org/~kris/scaling/pgsql-16cpu.png"&gt;http://people.freebsd.org/~kris/scaling/pgsql-16cpu.png&lt;/a&gt;&lt;br /&gt;&lt;a href="http://people.freebsd.org/~kris/scaling/mysql-16cpu.png"&gt;http://people.freebsd.org/~kris/scaling/mysql-16cpu.png&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;These are from a 16-core 4x4 xeon machine.  The pgsql graph has a lot of lines but you're interested in ule + topo vs 16 cores, all active.  mysql is a little more straightforward.  Interestingly it scales well on 8 cpus but runs into terrible user-land contention on 16.  I'm not sure if this is running with adaptive pthreads mutexes or not.&lt;br /&gt;&lt;br /&gt;Anyway, the patch is still young and it's already performing very well on this very asymmetric machine while not hindering others.  Expect to see this in 8.0 and maybe backported for 7.1 to alleviate the poor scheduling behavior that all those of you with quad quad-core machines are experiencing. ;)&lt;br /&gt;&lt;br /&gt;</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:17305</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/17305.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=17305"/>
    <title>jeffr_tech @ 2007-12-06T23:49:00</title>
    <published>2007-12-07T09:59:23Z</published>
    <updated>2007-12-07T09:59:23Z</updated>
    <content type="html">I recently collaborated with Kip Macy to mostly rewrite FreeBSD's lock profiling facility.  This provides a rich set of statistics about lock acquisition and contention that is instrumental in continuing to refactor the locking in the kernel.  Statistics include: max hold time, total hold time, total wait time, number of acquisitions, average hold time, average wait time, and number of times contended.  It's a little tricky because these statistics are not kept on a per-lock basis, but rather, per (file, line, lock name) triple.&lt;br /&gt;&lt;br /&gt;This means you can readily identify not just which locks are problematic but which source files are causing the problems.  Issues of high latency or coarse locking readily stand out.  Unfortunately, all of these statistics are quite expensive to gather.  At the moment common kernel-heavy workloads slow down to about 1/5th speed.  Before the rewrite it was 1/10th!  The overhead is entirely due to the time keeping functions which must be called for every acquisition and release.&lt;br /&gt;&lt;br /&gt;The goal of the rewrite was to better support shared locks.  Previously some data was kept in each lock and it was assumed the locker had exclusive access to that data to update it.  We changed it to have a notion of ownership records instead, so each lock ownership adds a small pre-allocated structure to a per-thread list.  This structure tracks timing and contention information for this specific instance.&lt;br /&gt;&lt;br /&gt;When the lock is released we aggregate the information into a structure that is associated with the (file, line, lock name) triple.  This is found via hash lookup.  I changed the hash table to be per-cpu which removed an array of locks we used to protect it before.  This makes displaying statistics much more complicated because each record must be merged with any records for the same triple that may exist on another cpu.  
However, this is responsible for the 2x speedup.&lt;br /&gt;&lt;br /&gt;The remainder of the overhead will go away once multiprocessor systems have reliable, synchronized time-stamp counters (TSC).  This is an extremely cheap time source, on the order of dozens of cycles, compared to the hundreds of cycles to access a global system clock that you must use for reliable cross-processor timing information today.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:16929</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/16929.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=16929"/>
    <title>jeffr_tech @ 2007-11-24T14:39:00</title>
    <published>2007-11-25T00:38:23Z</published>
    <updated>2007-11-25T00:38:23Z</updated>
    <content type="html">Dear SQL language "designers",&lt;br /&gt;&lt;br /&gt;Please report to my office to receive your beating.&lt;br /&gt;&lt;br /&gt;Thanks,&lt;br /&gt;Management</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:16860</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/16860.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=16860"/>
    <title>jeffr_tech @ 2007-11-06T13:50:00</title>
    <published>2007-11-06T23:57:03Z</published>
    <updated>2007-11-06T23:57:03Z</updated>
    <content type="html">Here's my horrible hack for the day.  I have two identical LCD displays that I've been using independently on different computers.  I wanted to join them into one display on my main development box so I bought a dual head ATI X1300 pro card.  Turns out the ATI video support in x windows isn't as good as I thought.&lt;br /&gt;&lt;br /&gt;The new radeonhd driver has only been in development for a few weeks and only supports VGA connectors and clones the image to both screens.  Fortunately for me all of the registers were being programmed for the second CRTC (sort of like a display driver).  I hacked it up to make sort of a virtual desktop area and then pointed the second CRTC at the second half of the frame buffer.  Shockingly it worked with only a couple of hours of hacking.  To X it still looks like a virtual desktop so it tried to scroll around until I removed that with a hammer.&lt;br /&gt;&lt;br /&gt;Patch is here: &lt;a href="http://people.freebsd.org/~jeff/rhd-multihead.diff"&gt;http://people.freebsd.org/~jeff/rhd-multihead.diff&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There's no bounds checking to make sure you actually have enough video memory.  It also only works if both devices are the same resolution and display depth.  Still it's not too bad for my first time working on a video driver.  I'm also happy to support ATI/AMD for releasing their docs.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:16453</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/16453.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=16453"/>
    <title>jeffr_tech @ 2007-11-03T16:34:00</title>
    <published>2007-11-04T00:40:20Z</published>
    <updated>2007-11-04T00:40:20Z</updated>
    <content type="html">Some scheduler updates;  Long ago I got rid of slice size adjustment to facilitate different cpu allocation based on nice.  I've now brought it back for a different purpose.  To reduce latency for timesharing threads when there is significant load on the run-queue I now turn down the allowed slice size.  Each CPU now keeps track of the sum of the interactive scores of all threads on the run-queue.  This is better than a simple load count since it takes into consideration the likely runtime for each thread.&lt;br /&gt;&lt;br /&gt;I also found that the larger default slice size in ULE actually pessimizes some workloads.  For example, parallel buildworlds.  I hypothesize that allowing a compiler to run for too long without allowing make or similar to run reduces the amount of potential concurrency since new jobs can't be scheduled.  It's just a theory however, hard to directly measure, but cutting the slice size from ~100ms to ~50ms actually yielded a ~3% perf improvement on a parallel buildworld.  Surprising!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;ULE will not be the default scheduler in 7.0 but is a selectable option.  It is the default in -CURRENT and will be for 7.1.</content>
  </entry>
  <entry>
    <id>urn:lj:livejournal.com:atom1:jeffr_tech:16258</id>
    <link rel="alternate" type="text/html" href="http://jeffr-tech.livejournal.com/16258.html"/>
    <link rel="self" type="text/xml" href="http://jeffr-tech.livejournal.com/data/atom/?itemid=16258"/>
    <title>jeffr_tech @ 2007-10-14T19:36:00</title>
    <published>2007-10-15T02:38:54Z</published>
    <updated>2007-10-15T02:38:54Z</updated>
    <content type="html">We made it to Maui.  I forgot how many stars there are in the sky.  We rented a nice house on a good size lot with 10 or so different types of fruit trees and a view of the ocean.  It's even walking distance to many restaurants and fairly bike friendly.  We couldn't be happier.&lt;br /&gt;&lt;br /&gt;I will still be mostly unavailable for the next two weeks as our things are crossing the Pacific.  Then I will finish setting up my 8-core Xeon with Solaris, Linux, and FreeBSD for a bit of scalability testing.</content>
  </entry>
</feed>
