<?xml version='1.0' encoding='utf-8' ?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/' xmlns:media='http://search.yahoo.com/mrss/' xmlns:atom10='http://www.w3.org/2005/Atom'>
<channel>
  <title>jeffr_tech</title>
  <link>http://jeffr-tech.livejournal.com/</link>
  <description>jeffr_tech - LiveJournal.com</description>
  <lastBuildDate>Tue, 10 May 2011 03:41:23 GMT</lastBuildDate>
  <generator>LiveJournal / LiveJournal.com</generator>
  <lj:journal>jeffr_tech</lj:journal>
  <lj:journalid>6769489</lj:journalid>
  <lj:journaltype>personal</lj:journaltype>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/24794.html</guid>
  <pubDate>Tue, 10 May 2011 03:41:23 GMT</pubDate>
  <title>Asynchronous partial truncation</title>
  <link>http://jeffr-tech.livejournal.com/24794.html</link>
  <description>I have spent a month of my life on partial truncation.  Softupdates has always handled complete truncation of a file asynchronously, as happens when you delete a file.  The operation is scheduled, the in-memory inode is updated, and the whole thing proceeds in the background.  However, when you truncated to a non-zero length it performed many blocking operations while truncating synchronously.  When I wrapped this synchronous operation for SUJ, I did not do it quite correctly, and as a result SUJ could leak blocks if you crashed during a partial truncation.  This could actually lead to filesystem corruption if the checker mistook a regular disk block for an indirect block and started freeing random pointers.&lt;br /&gt;&lt;br /&gt;To resolve this, I modified the truncation machinery to handle partial truncation.  This is hard because you may have many indirect blocks involved with many child blocks in different states.  An indirect is a filesystem block that does nothing but point to other blocks, like a page table does for memory.  It also had to handle ffs&apos;s somewhat complex fragment rules as well as zeroing partially empty blocks.  It handles all of this now and supports an arbitrary number of partial truncations to the same file without any blocking operations.  It always keeps the on-disk copy safe while the in-memory copy is free to grow again and indeed be truncated again after that.  New pointers are not recorded in an indirect until the prior truncation completes, so there is no ambiguity about what revision of the file the blocks are from.  This brings more complexity to fsync(), which must now flush all pending truncations to disk before it can return.&lt;br /&gt;&lt;br /&gt;The truncation code is a kind of asynchronous state machine that operates on leaf blocks first and then walks backwards up the tree until it reaches the root.  This ensures that we always have a valid path to a block in case we crash.  
Indirects are only freed when all of their children are freed.  For a partial truncate, the block is zeroed only once the child pointers that need to be freed have been.  Finally, when all blocks have been freed the journal space can be reclaimed.&lt;br /&gt;&lt;br /&gt;This post cannot convey how complex this work was.  It may not sound very dramatic or impressive but it truly has been one of the most complex projects I have ever undertaken.</description>
  <comments>http://jeffr-tech.livejournal.com/24794.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>3</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/24357.html</guid>
  <pubDate>Tue, 05 Apr 2011 03:43:17 GMT</pubDate>
  <title>Performance problems in SUJ</title>
  <link>http://jeffr-tech.livejournal.com/24357.html</link>
  <description>SUJ has been around for a year now and 9.0 will release with it this summer.  In preparation I am working on the few known performance problems.  The problems are sufficiently general to softupdates that they may be of interest to those who study different filesystem consistency mechanisms.&lt;br /&gt;&lt;br /&gt;The new code and dependencies add some extra CPU overhead to each filesystem operation but in practice this has been negligible.  However once disks reach ops per second rates similar to that of network interface cards we will have to re-evaluate filesystems entirely.  Back on topic, the two classes of problems we have encountered relate to synchronous journal writes and excessive rollbacks.&lt;br /&gt;&lt;br /&gt;You may recall that softupdates uses rollbacks to revert metadata operations that are not yet safe when a buffer is written to disk.  When the write completes the change is rolled forward in memory and the buffer is marked dirty again.  This allows us to separate potentially circular dependencies, rolling back some while writing others, allowing the filesystem state to move forward.  This eliminates the types of journaling problems that can occur when many operations are allowed to aggregate for efficiency reasons which may lead to waiting on unrelated IO when fsync() is called.  Our notion of a transaction is less simplistic.&lt;br /&gt;&lt;br /&gt;The journaling code adds new dependencies and new rollbacks to the filesystem.  Most importantly, the allocation bitmaps are now rolled back.  In some cases we may discover that one filesystem operation undoes another and softupdates handles this by canceling all of the dependencies after reverting the metadata changes.  It turned out there were some cases where the time between canceling the dependencies and the actual reversion of the changes could be longer than I expected.  
This would leave a dependency that was unsatisfied which would hold a cylinder-group dirty for several seconds.  The solution was to simply allow the journal record to proceed even when we decide to cancel the operation.  If the operation is undone before the write is issued we will still eliminate it, however, there is no harm in journaling an operation that does not happen.  The checker will discover the true state of all the metadata and take no action.&lt;br /&gt;&lt;br /&gt;The second problem has to do with blocking journal writes.  There are some cases where rollbacks would be impractical so instead we detect them and force a synchronous journal write.  There are very few instances of this in the filesystem but one that remains is particularly egregious.  The checker requires that a new block allocation is journaled before the block is actually written.  The filesystem assumes that it can write to datablocks in any order and indeed it does so before the allocation bits hit the cylinder group.  These are not compatible so a new block which is immediately written to disk after allocation will wait first for the journal write and second for the block write, doubling the latency.  This is tricky because we only need to block in the case that the previous identity of the block was as an indirect block for a file whose truncation still exists in the journal.  The new record must first be written so the checker doesn&apos;t attempt to interpret the block as a table of indirect block pointers.&lt;br /&gt;&lt;br /&gt;I haven&apos;t yet solved this second problem.  My intent is to cache the list of recently freed indirect blocks in some fashion but I need to do it with the least memory and cpu overhead I can.  My hope is to solve this soon.  Experimental kernels where this restriction is relaxed perform as well as softupdates without journaling in all of the tests I&apos;ve tried.</description>
  <comments>http://jeffr-tech.livejournal.com/24357.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>6</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/24280.html</guid>
  <pubDate>Wed, 30 Mar 2011 02:04:06 GMT</pubDate>
  <title>Interactivity score in ULE</title>
  <link>http://jeffr-tech.livejournal.com/24280.html</link>
  <description>I sometimes speak with Con Kolivas who is known for several Linux schedulers.  Con is an interesting fellow because his background is not CS and he is very pragmatic about desktop performance.  He doesn&apos;t care for the interactivity boost that ULE and previous Linux schedulers use in various forms.  He periodically challenges me to consider the interactivity algorithm and whether it is ultimately necessary and effective.  Below I present some analysis done when constructing the algorithm in use in ULE and why I believe it is effective and necessary while not suffering many of the pitfalls of earlier approaches.&lt;br /&gt;&lt;br /&gt;Firstly, let me define the properties of what I believe is a good interactivity algorithm.  These were my guiding principles in creating the ULE algorithm.&lt;br /&gt;&lt;br /&gt;1)  Any interactivity boost is gained slowly and lost quickly.&lt;br /&gt;2)  Interactivity should be harder to achieve the greater the system load.&lt;br /&gt;3)  The algorithm should not be exploitable to achieve an unfair share of the CPU.&lt;br /&gt;4)  The algorithm should be cheap to maintain and compute.&lt;br /&gt;5)  There should be sufficient history to permit bursty applications like web browsers.&lt;br /&gt;&lt;br /&gt;The ULE algorithm uses a decaying history of voluntary sleep time and run time.  This is similar to %cpu; however, involuntary sleep time is not considered.  That is to say, threads that are waiting due to contention for CPU resources are not given an interactivity boost for their time waiting.  That allows the algorithm to work properly regardless of CPU load, whereas if you consider only %cpu, eventually all threads on a busy system will look interactive.&lt;br /&gt;&lt;br /&gt;The algorithm scales the ratio of run time to sleep time to a value between 1 and 100.  This is quite awkward in the kernel where we can&apos;t use floating point math.  
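&lt;br /&gt;&lt;br /&gt;A minimal integer-only sketch of such a score follows.  It is an illustration under stated assumptions, not the kernel code: the names interact_score, is_interactive, INTERACT_MAX, INTERACT_HALF and INTERACT_THRESH are hypothetical, the midpoint of 50 and the 0 to 100 range are assumed, and only the threshold of 20 comes from this post.&lt;br /&gt;&lt;br /&gt;

```python
# Hypothetical constants: only the threshold of 20 is stated in the post.
INTERACT_MAX = 100
INTERACT_HALF = INTERACT_MAX // 2
INTERACT_THRESH = 20


def interact_score(run_time, sleep_time):
    """Scale run time against voluntary sleep time using integer math only."""
    if run_time > sleep_time:
        # Mostly running: the divisor is derived from the run time and the
        # score aims at the upper half (integer rounding can undershoot).
        div = max(1, run_time // INTERACT_HALF)
        return INTERACT_HALF + (INTERACT_HALF - sleep_time // div)
    if sleep_time > run_time:
        # Mostly sleeping: the divisor is derived from the sleep time and the
        # score aims at the lower half (integer rounding can overshoot).
        div = max(1, sleep_time // INTERACT_HALF)
        return run_time // div
    # Equal histories sit at the midpoint; a thread with no history scores 0.
    return INTERACT_HALF if run_time else 0


def is_interactive(run_time, sleep_time):
    # Assumption: a score at or above the threshold is not interactive.
    return INTERACT_THRESH > interact_score(run_time, sleep_time)
```

For example, interact_score(100, 1000) gives 5, well under the threshold, while interact_score(1000, 100) gives 95.  Using larger input values reduces the rounding noted in the comments.&lt;br /&gt;&lt;br /&gt;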
It decides the divisor depending on which value is larger, giving a sort of bimodal distribution.&lt;br /&gt;&lt;br /&gt;Here is a graph of what we theoretically would like the score to produce before we switch the divisor around:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/score.png&quot;&gt;&lt;br /&gt;&lt;br /&gt;And here is a graph generated by running the algorithm with a matrix of inputs:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/realscore.png&quot;&gt;&lt;br /&gt;&lt;br /&gt;The second graph uses larger numbers as we do in the kernel to reduce rounding effects.  You can see an irregularity at 45 degrees where we switch divisors when the run time exceeds the sleep time.  In practice these are never computed, as we define a threshold of 20 above which tasks are not considered interactive, so there is no point in computing the score when run time exceeds sleep time unless this threshold is moved.&lt;br /&gt;&lt;br /&gt;Going from left to right, run time is increasing.  From background to foreground, sleep time is increasing.  A thread would trace a path forward and to the right depending on its behavior.  When they increase equally the score quickly reaches an equilibrium well above the threshold for interactive scheduling.  A thread looking to abuse the system couldn&apos;t use much more than 20% of the cpu in a steady state.  This can be adjusted by reducing the interactive threshold.  On a busy system this 20% dwindles depending on load, ultimately providing no advantage to a would-be exploiter.  A thread running right out of the gate raises its score super-linearly to 50 within milliseconds, while a recently awoken thread climbs linearly as it accumulates cpu time.&lt;br /&gt;&lt;br /&gt;The algorithm requires a lot of sleep time to be accumulated before a thread can be considered interactive.  
This remembered sleep time is capped at a few seconds so it only takes a few hundred milliseconds before we discover that a thread is no longer interactive.  It does permit interactive UI applications to wake up with the lowest possible latency since they have a very high priority.  If they then abuse this benefit for very long they are scheduled round-robin based on cpu utilization like other bulk tasks.  In practice we have picked values that keep desktop user applications interactive as well as is possible.</description>
  <comments>http://jeffr-tech.livejournal.com/24280.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>7</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/23876.html</guid>
  <pubDate>Wed, 23 Mar 2011 01:37:37 GMT</pubDate>
  <title>OFED, 10gigE, and SUJ</title>
  <link>http://jeffr-tech.livejournal.com/23876.html</link>
  <description>I have merged the OFED 1.5.3 Infiniband stack into FreeBSD CURRENT.  We have achieved feature and performance parity with the Linux stack using a combination of wrappers and re-implementation of sensitive pieces.  lwn wrote an article about the wrapper work &lt;a href=&quot;http://lwn.net/Articles/421601/&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;.  Some FreeBSD developers are understandably concerned about growing a Linux kernel compat layer and how that could lower the quality of FreeBSD drivers.  I don&apos;t foresee this as a real complication but only time will tell.&lt;br /&gt;&lt;br /&gt;I&apos;m working on bringing in support for Mellanox&apos;s 10gigE adapters now.  It&apos;s always interesting for me to explore different directions operating systems take to accomplish the same features.  Network buffering is one of those areas that is starkly different across operating systems.  Maybe that deserves its own post.&lt;br /&gt;&lt;br /&gt;I am now looking for bug reports for SUJ for another round of bug-fixing before 9.0 ships.  There are a couple of areas where performance isn&apos;t great due to latency involved in blocking journal writes.  I know how to eliminate these but it will take some time to implement.  We are hoping to ship SUJ as the default in 9.0 and then I may provide an official backport to 8.x.</description>
  <comments>http://jeffr-tech.livejournal.com/23876.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>5</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/23569.html</guid>
  <pubDate>Fri, 28 Jan 2011 00:38:30 GMT</pubDate>
  <title>wherein I replicate my feet</title>
  <link>http://jeffr-tech.livejournal.com/23569.html</link>
  <description>I sometimes do things unrelated to computers that are probably still considered geeky.  I have made reference to being a cyclist before.  As you know cyclists are obsessed with all things carbon fiber and I am no different.  With the help of a boat builder friend of mine, I finally had an opportunity to make my own composite parts in pursuit of the perfectly fitting shoe.  Behind the cut are some pictures and a description of the process.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;First, to identify the goal: my feet have very high arches and long toes.  This presents a number of challenges which are best solved with orthotic insoles.  Unfortunately I couldn&apos;t find any I really liked.  Cycling is also unusual in that a perfectly stiff insole is not undesirable.  The average force on your feet is much lower than in most sports and impacts are relatively non-existent.&lt;br /&gt;&lt;br /&gt;The solution was to make a plaster of paris mold of my feet.  From here I shaped the plaster and then built a plug, or positive mold, inside of it.  The plug was made of polyester resin and fiberglass.  After curing, the plaster was removed from the plug and then the plug was shaped further and sanded smooth.  I essentially eliminated the toes beyond the metatarsal joints and kept everything else very close to my actual foot shape.&lt;br /&gt;&lt;br /&gt;Once I had the mold I made a negative mold from it.  Once this was also cured and shaped I sandwiched 8 layers of carbon and epoxy between the two and let them cure overnight.  After a lot of sanding and shaping this was the result:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/bikes/carbonfeet.jpg&quot;&gt;&lt;br /&gt;&lt;br /&gt;Ultimately I had to completely remove the toe area so that this would fit in the shoe.  I have since re-added the toe area using 2mm EVA foam.  
I further used a double-sided adhesive roll to adhere some suede to the insole to add a degree of comfort and traction along with a better looking finish.&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/bikes/suedefeet.jpg&quot;&gt;&lt;br /&gt;&lt;br /&gt;Using a 3d motion capture technology I have actually shown that these insoles straightened out my feet and removed lateral motion from my knees.  So not only are they exceptionally comfortable but they also improve performance by eliminating some forces applied tangential to the motion of the pedal.  The total cost of goods was around $100 and it probably took 15hrs of working time.&lt;br /&gt;&lt;br /&gt;&lt;a name=&apos;cutid1-end&apos;&gt;&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/23569.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/23467.html</guid>
  <pubDate>Mon, 03 Jan 2011 23:02:10 GMT</pubDate>
  <title>Year in review</title>
  <link>http://jeffr-tech.livejournal.com/23467.html</link>
  <description>I have not posted in a very long time.  I have been busy though and I&apos;ll try to summarize the last year here.&lt;br /&gt;&lt;br /&gt;Firstly, I collaborated with my good friends at fairwheel bikes to work on a modification to Shimano&apos;s new electronic shifting group.  You can read about that at &lt;a href=&quot;http://www.cyclingnews.com/features/fairwheel-bikes-creates-stunning-sequential-shifting-di2-equipped-hardtail&quot; rel=&quot;nofollow&quot;&gt;cyclingnews.com&lt;/a&gt;.  I replaced the stock computer with my own micro-controller that enables some advanced shifting features.  I&apos;m trying to turn this into a commercial enterprise with a friend.  There is a chance that a pro team will be using it next year.&lt;br /&gt;&lt;br /&gt;The majority of my year has been occupied with a port of the OpenFabrics Enterprise Distribution infiniband stack from Linux to FreeBSD.  This is dual BSD/GPL licensed which permits the port.  In pursuit of this I have created a 10,000 line Linux kernel api compatibility layer which allows us to run the vast majority of the infiniband code unmodified.  As I mentioned on arch@freebsd the following pieces are emulated:&lt;br /&gt;&lt;br /&gt;&amp;gt; atomics, types, bitops, byte order conversion, character devices, pci &lt;br /&gt;&amp;gt; devices, dma, non-device files, idr tables, interrupts, ioremap, hashes, &lt;br /&gt;&amp;gt; kobjects, radix trees, lists, modules, notifier blocks, rbtrees, rwlock, &lt;br /&gt;&amp;gt; rwsem, semaphore, schedule, spinlocks, kalloc, wait queues, workqueues, &lt;br /&gt;&amp;gt; timers, etc.&lt;br /&gt;&lt;br /&gt;Additionally I have worked more on SUJ, mostly bug fixing.  Kirk and Kostik have been most helpful in that and really did most of the work.  
There were some nasty bugs but we&apos;ve whittled them down and now there are only a few performance regressions (and improvements) to concern ourselves with.&lt;br /&gt;&lt;br /&gt;I wish I hadn&apos;t let this journal go for so long.  If anyone has any specific interests let me know and I will try to post more frequently.</description>
  <comments>http://jeffr-tech.livejournal.com/23467.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>11</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/23165.html</guid>
  <pubDate>Sun, 31 Jan 2010 01:53:05 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/23165.html</link>
  <description>Just a short update: I will commit SUJ to CURRENT soon and there are already backports to 8 and 7 available.  Things are really very stable.&lt;br /&gt;&lt;br /&gt;I have done some fsck benchmarking.  I recovered an 80% full 250GB volume that was doing a buildworld with 8 parallel processes in 0.9 seconds.  The traditional fsck took 24 minutes.  For the vast majority of workloads an unsafe shutdown will not prolong the boot by more than a few seconds.  I also shortened the time that entries stayed valid in the journal which allowed me to trim the maximum journal size to 32MB.  This will limit the maximum possible recovery time while still providing for 1 million outstanding transactions.&lt;br /&gt;&lt;br /&gt;The project to date has taken 370 hours.  The patch adds 11,000 lines and removes 2,000.  This is much bigger than I was anticipating, although of course lines of code can be misleading.</description>
  <comments>http://jeffr-tech.livejournal.com/23165.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>24</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/22933.html</guid>
  <pubDate>Fri, 11 Dec 2009 22:37:28 GMT</pubDate>
  <title>What&apos;s in a journal anyway?</title>
  <link>http://jeffr-tech.livejournal.com/22933.html</link>
  <description>In this post I&apos;ll detail the contents of the journal and the recovery operation.  Since we know that softupdates only leaves two inconsistencies, leaked inodes and leaked blocks, we only have to journal conditions which can create these circumstances.  In truth we have to track all link count changes to an inode since they can have multiple named references via hardlinks.  Blocks are somewhat simpler although ffs fragments complicate them considerably.  At recovery time we verify whether links or pointers to blocks exist and use this information to free them if necessary.  There are only 4 journal records (add ref, rem ref, new block, free block) and each is only 32 bytes.  This is effectively an intent log; there is no copy of the metadata in each record.  Sounds simple, no?&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;In the addref/remref case the journal record contains the inode number, directory inode number, 64-bit offset within the directory for the new/removed link, and the link count before the adjustment.  At recovery time when we find one of these records we verify whether the path exists or not and adjust the link appropriately.  The path may not exist if the parent inode doesn&apos;t point at the directory block that this filename occupied, or if the directory write didn&apos;t happen in time, or any number of other scenarios.  If we adjust a link down to 0 we free the inode and any blocks it owns, and if it is a directory, we recursively decrement the link counts of any children also potentially freeing them.  This only happens when you crash immediately after adding a tree of files as with tar, etc.  The directory offset tells us the exact place this should exist, we don&apos;t need to know the actual name, and this is how we handle multiple links to the same inode within the same directory.  
The recovery operation actually finds all valid journal records for each inode and sorts them in a list to remove duplicates before operating on an inode, so we know if a name was added and immediately removed, or if it was added twice, and we do not adjust the link twice, etc.&lt;br /&gt;&lt;br /&gt;For adding and removing blocks we record the inode, logical block number (like a file offset), and disk block address.  The lbn may be negative to indicate indirect blocks, which are blocks that hold pointers to data blocks for large files.  If we discover that a block does not exist at the indicated lbn we may recursively free indirect block children.  This allows us to truncate huge files with a very small number of journal entries, no more than 15 which is the number of direct and indirect block pointers contained in an inode.  There is an additional test to be certain that the freed block was not allocated to a new file after this record was written.&lt;br /&gt;&lt;br /&gt;In ffs, the filesystem is partitioned into &apos;cylinder groups&apos; which partition the data blocks and inodes for locality.  Each of these CGs has summary information describing where there are fragments, large clusters of available blocks, how many inodes are free, etc.  Some of this summary information is copied into the superblock so that we can quickly find a CG with free blocks, inodes, etc.  As a final stage in the recovery operation any CG that was modified recomputes its summary information and updates the superblock copy.&lt;br /&gt;&lt;br /&gt;So how fast is it?  In my tests so far it looks like less than 2 seconds per megabyte of journal in use.  A megabyte of journal space describes 32,768 filesystem operations!  Even on a machine with a very large amount of memory it&apos;s unlikely that you could have more than a few hundred thousand operations outstanding.  So this is really quite acceptable.  
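&lt;br /&gt;&lt;br /&gt;The arithmetic behind those figures can be checked with a few lines of Python.  This is only a back-of-the-envelope sketch: it assumes a megabyte means 2**20 bytes, it treats the 2 seconds per megabyte figure as an upper bound, and the function names are hypothetical.&lt;br /&gt;&lt;br /&gt;

```python
RECORD_BYTES = 32          # size of one journal record, from the post
MIB = 1024 * 1024          # assumption: "megabyte" means 2**20 bytes
SECONDS_PER_MIB = 2        # measured upper bound quoted in the post


def records_per_mib():
    """Number of 32-byte journal records that fit in one MiB."""
    return MIB // RECORD_BYTES


def journal_bytes(outstanding_ops):
    """Journal space consumed by a number of outstanding operations."""
    return outstanding_ops * RECORD_BYTES


def recovery_seconds_bound(outstanding_ops):
    """Rough upper bound on recovery time at 2 seconds per MiB of journal."""
    return journal_bytes(outstanding_ops) * SECONDS_PER_MIB / MIB
```

Here records_per_mib() returns 32768, matching the figure above.  By this estimate 300,000 outstanding operations would occupy a little over 9 MiB of journal and recover in under 20 seconds.&lt;br /&gt;&lt;br /&gt;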
Furthermore, the recovery operation is currently generating a text log of every decision it makes that is several times the size of the binary log.  Disabling it will probably halve the recovery time again.&lt;br /&gt;&lt;a name=&apos;cutid1-end&apos;&gt;&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/22933.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>12</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/22716.html</guid>
  <pubDate>Wed, 09 Dec 2009 04:29:02 GMT</pubDate>
  <title>Journaling softupdates, SU+J</title>
  <link>http://jeffr-tech.livejournal.com/22716.html</link>
  <description>I suspect that most people who read this blog are familiar with journaling as a mechanism for ensuring post-crash filesystem consistency.  Some of you may also be familiar with copy-on-write and log structured filesystems as an alternative to journaling.  BSD&apos;s ffs, an extension of the original unix filesystem, has used an alternate approach called soft-updates to handle filesystem consistency for around 10 years.  For the past few months, I have been creating a hybrid journaled softupdate system to deal with inadequacies in the existing softdep system.  This work is opensource and will be available to FreeBSD-current users sometime this month.  Behind the cut I briefly describe the tradeoffs in each consistency mechanism and motivation for this work.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;If filesystem technologies are the subject of fads, it seems everything in the human experience must be.  Lately we&apos;ve seen a resurgence of copy-on-write (COW) filesystems which have been popularized seemingly by ZFS but have existed at least as far back as 1990&apos;s Rosenblum paper on LFS.  These may be particularly attractive on flash media where seek time is not a problem and where clustering writes together can improve erase block fragmentation issues.  However, on spinning media COW tends to fragment the drive and reduce the quality of allocation decisions by essentially requiring at least two versions of any modified data to be reachable at once.&lt;br /&gt;&lt;br /&gt;Journaled filesystems were of course the earlier fad.  In this mechanism a copy of any metadata and sometimes data that is to be modified is kept in a journal, a fixed area of the disk or another disk, that logs each modification.  In this mechanism the journal is replayed on reboot and the filesystem is left in a consistent state.  
The problem with journaling is either it&apos;s very simple and uses a tremendous amount of space and I/O with a strict transaction model that prevents some concurrency, or it becomes incredibly complex, to the tune of over 20,000 lines of code in xfs.  Still I have questions about concurrency when multiple transactions affect a block in xfs, but I need to dig deeper to understand this.&lt;br /&gt;&lt;br /&gt;Soft-updates is an alternative to this scheme where the filesystem keeps a list of dependencies that must be satisfied before a change to the filesystem can be visible on disk.  For example, you wouldn&apos;t want to write a directory entry pointing at an inode until the inode was initialized on disk and marked allocated.  Softdep handles this by rolling back changes to metadata that don&apos;t yet have their dependencies satisfied when we try to write a block.  In this way we can commit any completed &apos;transactions&apos; while keeping the disk state consistent.  Softdep also allows these dependencies to discover operations which cancel each other out and thus nothing makes it to disk.  For example, let&apos;s say you create a temporary file and then remove it after writing some blocks, which compilers often do, if it all happens within the interval of the syncer nothing will make it to disk.&lt;br /&gt;&lt;br /&gt;Soft-updates guarantees that the only filesystem inconsistencies on unclean shutdown are leaked blocks and inodes.  To resolve this you can run a background fsck or you can ignore it until you start to run out of space.  We also could&apos;ve written a mark and sweep garbage collector but never did.  Ultimately, the bgfsck is too expensive and people did not like the uncertainty of not having run fsck.  To resolve these issues, I have added a small journal to softupdates.  However, I only have to journal block allocation and free, and inode link count changes since softdep guarantees the rest.  
My journal records are each only 32 bytes which is incredibly compact compared to any other journaling solution.  We still get the great concurrency and ability to ignore writes which have been canceled by new operations.  But now we have recovery time that is around 2 seconds per megabyte of journal in use.  That&apos;s 32,768 blocks allocated, files created, links added, etc. per megabyte of journal.&lt;br /&gt;&lt;br /&gt;This work is being funded by a group of companies, iXsystems, Yahoo!, and Juniper Networks.  I&apos;m interested to see if this kind of project is feasible in the future where companies can share development costs for specific opensource projects.&lt;br /&gt;&lt;br /&gt;The work is being done in collaboration with Kirk McKusick, the original author of ffs and softupdates.  We will likely present a paper at BSDCan and then at a more formal venue following that.  The code will be publicly available within two weeks.&lt;br /&gt;&lt;br /&gt;&lt;a name=&apos;cutid1-end&apos;&gt;&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/22716.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>51</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/22432.html</guid>
  <pubDate>Sun, 01 Feb 2009 00:09:54 GMT</pubDate>
  <title>New UMA features for more efficient memory layout</title>
  <link>http://jeffr-tech.livejournal.com/22432.html</link>
  <description>I have wanted to write for some time about UMA changes I recently made.  UMA is the &quot;universal memory allocator&quot; which serves as FreeBSD&apos;s kernel memory allocator.  I initially wrote this 7 years ago or so and many other people have since contributed.  I named it &apos;universal&apos; because at that time FreeBSD had 3 separate kernel memory allocators and this unified them.  The two new features relate to network performance work I&apos;ve been doing lately and allow the use of more efficient layout of network buffers.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The first feature expands on the &apos;keg&apos; concept, first introduced by Bosko to support network buffer allocations, by allowing zones to back-end to multiple kegs.  UMA is a zone or slab allocator wherein a zone is created to allocate a given type of memory and contains parameters specific to that type.  The type may be all allocations of a specific size, as is used for the malloc() front-end, or it may be a complex object type that has initializers, finalizers, specific alignment requirements, etc.  The keg is a refinement on this concept where a keg provides the backend allocation and storage description while the zone controls the contents of the individual allocated items and provides a caching layer, client api, etc.  The difference is subtle but important.  Restated, the keg describes the format of the page or pages that the item lives in while the zone describes the format within each allocated item.  In this way multiple kegs can provide items which meet the client&apos;s requirements while varying the format and source of memory that they come from.&lt;br /&gt;&lt;br /&gt;Now that a zone may contain multiple kegs you may have multiple back-end sources for the allocation but the consumers are not required to differentiate.  For example, I have two kegs for 2 kilobyte network buffers.  
One allocates from single pages and the other allocates from hardware supported large pages.  If large, contiguous pages are available the large-page keg is used.  This optimizes access to network buffers by allowing the use of fewer hardware TLB entries to describe them.  However, due to fragmentation, we don&apos;t always have large aligned chunks of memory available and in this case we can fall back to the other keg transparently.  This keg concept is also being used to implement NUMA support.  There exists one keg per NUMA node and the search function can be coded to be topology aware.  Items from multiple compatible kegs can exist in the same fast cache in the zone and will be automatically retired to the correct source.&lt;br /&gt;&lt;br /&gt;The second feature builds on the first and allows for much more efficiently aligned data structures.  This is akin to cache-coloring but goes one step further.  The start address of each allocated item is aligned such that it falls on a different cache-line than the previous allocation.  Simple cache-coloring typically ignores naturally aligned allocations or only colorizes each page or slab.  In this scheme a large contiguous block of memory is allocated and each item is padded until it reaches a new color.  For large, uniformly sized network buffers, this has a tremendous benefit.  Consider a 2k allocation always ending up on a 2k boundary.  This uses only 1/(allocation size / line size) of the available lines for the most accessed part at the beginning.  So for a 2k buffer and a 64 byte line size the start addresses fall on only 1/32 of the available lines.  Essentially the buffer is padded by the line size so that the start addresses alternate lines and the number of allocations required to hit every line is computed to determine the storage requirements.  
Using a large contiguous block of memory ensures that this is as effective for L1 virtually indexed caches as it is for L2-L3 physically indexed caches.&lt;br /&gt;&lt;br /&gt;This has a secondary effect of improving utilization on striped memory controllers.  The exact details of how a physical address maps to a channel/bank/rank of DRAM are not published.  However, it is clear that predictable access patterns aligned on a large power of two size will strongly favor a small set of the available DIMMs, just as they favor a small set of the available cache lines.  By alternating the start addresses on cache-line boundaries we can be assured that we are uniformly loading the available DIMMs because a cache line is the smallest unit of memory transfer for all practical purposes.&lt;br /&gt;&lt;br /&gt;These optimizations are only useful in workloads where memory and cache pressure are the significant bottleneck.  Unfortunately, I don&apos;t have any benchmarks for stock FreeBSD to share at this time.  I&apos;m not certain that I&apos;m permitted to share details about the yields in the proprietary stack this was implemented for.  I will say that the gain was on the order of 10% in an already heavily optimized environment where traditional profile-guided software optimizations were yielding much less.&lt;br /&gt;&lt;br /&gt;I should also mention that this work was primarily funded by Nokia and most graciously donated to FreeBSD.&lt;br /&gt;&lt;a name=&apos;cutid1-end&apos;&gt;&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/22432.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>15</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/22106.html</guid>
  <pubDate>Fri, 16 Jan 2009 02:38:29 GMT</pubDate>
  <title>more dram access timings on two interesting architectures</title>
  <link>http://jeffr-tech.livejournal.com/22106.html</link>
  <description>Ever wonder what memory latency is like on a large loosely connected opteron system?  I lie awake at night wondering myself.  Fortunately, I have access to a tyan 8 socket barcelona system.  This is basically two 4 socket boards with two very slow HT links between them.  I also have access to a nehalem based box that I have timings for.  The results are behind the cut.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;The test code simply allocates a user specified size of memory, prefaults it, mlocks it, and then uses rdtsc to count the cycles of individual random memory accesses (read only).  I also have a small calibration loop to try to determine the rough cost of the rdtsc instruction and remove it from the total cycles counted.  With very large blocks of memory you see dram performance and with small blocks of memory you can see different stages of the cache hierarchy.&lt;br /&gt;&lt;br /&gt;Here are the results as a histogram for a 32way 8 socket opteron sampling 1gb of memory:&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/opteron32way.png&quot;&gt;&lt;br /&gt;&lt;br /&gt;Since not all processors are directly connected to each other, even in a 4 socket system, you usually see 2-3 peaks depending on which cpu you&apos;re scheduled on.  The fastest is always local memory and then you see one and perhaps two hops to remote nodes on the same board.  Here we see a smallish peak for local memory and a larger peak as more random samples hit the other three cores on our board.&lt;br /&gt;&lt;br /&gt;Then there is the horrible clump once we go over the slow HT links to the other board where we regularly see memory access latencies of up to .5us!  Incredible.  The other thing worth noting is that even the local access is unfortunately slow.  
I believe this has to do with the cache coherency protocol still requiring us to query every other socket before owning the line.&lt;br /&gt;&lt;br /&gt;Next we have a simpler, two socket nehalem sampling 128mb of memory:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://people.freebsd.org/~jeff/nehalem.png&quot;&gt;&lt;br /&gt;&lt;br /&gt;This is really very clean.  In fact, we can pick out particular features of dram looking at this graph.  First, we see two tall peaks, representing local and remote dram.  The second is only taller because the first has more minor variance and so is wider.  So it&apos;s roughly 85ns for a local access and 135ns for remote.  The other peaks we see are likely due to two causes.  The short peak after the dram timing is likely covering requests which occur during a dram refresh cycle.  The penalty is about the right amount of time for that.  The short peak before is likely due to back to back requests coming in for the same row.&lt;br /&gt;&lt;br /&gt;Information like this helps us understand the relative trade-offs for different optimizations related to memory organization and locality.&lt;br /&gt;&lt;a name=&apos;cutid1-end&apos;&gt;&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/22106.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>43</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/21654.html</guid>
  <pubDate>Sun, 18 May 2008 23:58:33 GMT</pubDate>
  <title>dev summit</title>
  <link>http://jeffr-tech.livejournal.com/21654.html</link>
  <description>I just returned from two weeks of travel.  One for my wedding anniversary and another for the FreeBSD developer summit which preceded BSDCan.  The summit was productive but I&apos;m very happy to be done with the travel.&lt;br /&gt;&lt;br /&gt;There were many great discussions at the summit with topics ranging from release engineering to TCP scalability.  I participated in one on mbufs (network buffers) and one on the buffer cache (file-system buffers).  For mbufs I presented a technique that I developed for Nokia based on Kip Macy&apos;s excellent work on 10gigabit ethernet drivers.  This technique should simplify referenced data while reducing the number and temporal scope of cache lines accessed to manipulate buffers in the common case.  There is still much work to do to prove it out however.&lt;br /&gt;&lt;br /&gt;The buf discussion lasted almost two hours and was much broader in scope.  We will hopefully have a fully revamped IO path for 8.0 to address a wide variety of structural and performance problems.  I&apos;m very excited to see this work progress after many years of planning and discussion.  My SoC student this year is implementing one essential piece by replacing a splay tree in the vm with a radix tree. &lt;br /&gt;&lt;br /&gt;I also had a very interesting discussion with a new project member, Lawrence Stewart, about tcp congestion control which he gave a talk on later in the conference.  I spent the very first part of my career working in the tcp/ip group at microsoft when tcp vegas was still relatively new.  Congestion control was one part of my work responsibilities and an area I pursued as a hobby.  I was surprised to hear that delay based congestion control was making a comeback in some research circles.  
It was nice to hear about developments in this field that I haven&apos;t followed in some time.&lt;br /&gt;&lt;br /&gt;After all of that socializing and discussion I had a horrible flight but was pleased to find that Hawaii now feels very much like home to me and returning was quite a relief.</description>
  <comments>http://jeffr-tech.livejournal.com/21654.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>27</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/21310.html</guid>
  <pubDate>Thu, 17 Apr 2008 09:16:19 GMT</pubDate>
  <title>adaptive idling</title>
  <link>http://jeffr-tech.livejournal.com/21310.html</link>
  <description>One lesson learned from working on synchronization primitives is that it&apos;s often profitable to spin before sleeping.  We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks.  This has had an unexpected influence on our idle loop.&lt;br /&gt;&lt;br /&gt;When a thread becomes runnable it is often desirable to run it on a cpu other than the current one.  If the target cpu is in the idle loop, it may actually be waiting in a low power state using the &apos;hlt&apos; instruction or some acpi mechanism that I try to avoid.  To wake up this remote cpu we currently issue an IPI (inter-processor interrupt).  This is actually very expensive for the sender and receiver.&lt;br /&gt;&lt;br /&gt;On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations.  This works by giving the programmer access to the existing hardware bus snooping interface.  The sleeping cpu sees another cpu write to a memory location it is snooping and wakes up.&lt;br /&gt;&lt;br /&gt;On barcelona mwait doesn&apos;t enter as deep a sleep as on the xeons.  So I decided to use an adaptive algorithm that would mwait when we&apos;re busy and hlt when we&apos;re not.  With mwait you can actually specify the power state you&apos;d like so I keep both the Xeons and Opterons in C0 to further reduce wakeup latency.&lt;br /&gt;&lt;br /&gt;Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period.  So this is now the first stage in the adaptive algorithm: we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.&lt;br /&gt;&lt;br /&gt;Using a &apos;ping-pong&apos; threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism.  
In most cases we&apos;re still idling in hlt as well, so there should be no negative effect on power.  In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.</description>
  <comments>http://jeffr-tech.livejournal.com/21310.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>47</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/21014.html</guid>
  <pubDate>Tue, 08 Apr 2008 05:11:53 GMT</pubDate>
  <title>file offset semantics</title>
  <link>http://jeffr-tech.livejournal.com/21014.html</link>
  <description>I&apos;m further exploring the concurrency guarantees of file i/o in various operating systems.  I&apos;ve found more surprising race conditions and differences of implementation between operating systems.&lt;br /&gt;&lt;br /&gt;Each file descriptor in UNIX has an offset associated with it.  This is what allows you to call read() over and over again without specifying a position and get later and later chunks of a file.  Or to write and continue where you left off.  There is the additional complication of append mode writes but let&apos;s ignore that for a moment.&lt;br /&gt;&lt;br /&gt;To keep things straight let&apos;s call the actual file representation the inode (in FreeBSD it&apos;s a &quot;vnode&quot;) and the open descriptor a &apos;file&apos;.  This is in keeping with how it&apos;s done in the kernel.  So many threads or even processes may share a single file descriptor that points to one file, so they have a shared offset.  Or many processes may have unique file descriptors and so they have unique offsets.&lt;br /&gt;&lt;br /&gt;In the shared case we have to determine how updates to this offset are serialized.  One important detail is that the offset is 64bit.  On 32bit platforms this means it&apos;s written with two discrete writes.  Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may set different bytes in the final offset leaving you with a corrupt or invalid offset.&lt;br /&gt;&lt;br /&gt;Another question is, what happens with two simultaneous writes to the file?  If we don&apos;t serialize the offset they will both write to the same location.  If we do, they write one after the other.  The same goes for the read side.  If two threads in the same process read from the same file simultaneously do they get unique data or the same data?  
This is true of threads and processes forked with rfork().&lt;br /&gt;&lt;br /&gt;Before about 1986 in unix there was no serialization on updates.  It also was non-preemptive, uniprocessor and had 32bit offsets so you didn&apos;t have to worry about partial writes even on 16bit machines.  The inode was locked after the offset was loaded and multiple readers could see the same data and multiple writers would write to the same offset.  McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected offset to handle a case where a forked process was getting output mixed up.&lt;br /&gt;&lt;br /&gt;Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes.  This means writers are serialized but readers are not.  It also means that offset updates in the read case on 32bit can corrupt the offset value.&lt;br /&gt;&lt;br /&gt;Linux manipulates the offset without a lock in any case.  The offset pointer is corruptible on 32bit processors.  Neither readers nor writers are serialized.&lt;br /&gt;&lt;br /&gt;FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases.  This actually serializes reads done to the same fd if they don&apos;t use pread().&lt;br /&gt;&lt;br /&gt;Posix doesn&apos;t specify this carefully enough to say what is required.&lt;br /&gt;&lt;br /&gt;I think at a minimum solaris/linux need to protect the value on 32bit architectures.  It&apos;s a once in a year type event that could lead to problems but these are the kinds of races and bugs that are impossible to track down.  FreeBSD, on the other hand, could relax the restriction on read updates.  It doesn&apos;t make much sense to do so for writes and this fixes the original bug encountered in 1986.  I&apos;ll have to think of an elegant way to handle 64bit writes on 32bit platforms however.</description>
  <comments>http://jeffr-tech.livejournal.com/21014.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>83</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/20824.html</guid>
  <pubDate>Mon, 07 Apr 2008 11:43:00 GMT</pubDate>
  <title>email</title>
  <link>http://jeffr-tech.livejournal.com/20824.html</link>
  <description>I worked at an internet service provider when I was in high school and as a result got free email for life.  This is my @chesapeake.net address.  Unfortunately chesapeake.net outsourced email to a bulk provider and I&apos;ve had a remarkable number of emails just plain dropped since.  So what am I to do about it?&lt;br /&gt;&lt;br /&gt;I registered jroberson.net and set the mx to point at google.  Then I set my google account to let me pop and send mail via smtp.  What I don&apos;t understand is why google does this for free.  I&apos;m not looking at any targeted advertising.  They&apos;re just acting as temporary storage until I pop the mail.  I may never use the web interface again.&lt;br /&gt;&lt;br /&gt;Also, I&apos;d like to take this opportunity to point out that google is the new microsoft which was the new ibm which was probably the new something else.  What I mean is, an exciting and innovative company becomes large and imposing and then no one likes them.  I&apos;m surprised more people haven&apos;t seen the writing on the wall.  I speculate that after 10-15 years of declining popularity and utility microsoft will eventually become a respected, productive member of society again like IBM did.</description>
  <comments>http://jeffr-tech.livejournal.com/20824.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>38</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/20707.html</guid>
  <pubDate>Mon, 31 Mar 2008 23:50:10 GMT</pubDate>
  <title>IO atomicity</title>
  <link>http://jeffr-tech.livejournal.com/20707.html</link>
  <description>I have long wondered exactly what the atomicity guarantees of read() and write() are so I did some code and posix spelunking over the weekend.  The scenarios I&apos;m talking about are as follows:&lt;br /&gt;&lt;br /&gt;1) Can readers read concurrently with readers?&lt;br /&gt;2) Can readers read concurrently with writers?&lt;br /&gt;3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?&lt;br /&gt;4) Can writers write concurrently with other writers?&lt;br /&gt;5) If writers can write concurrently what constraints are there on the resulting bytes?&lt;br /&gt;&lt;br /&gt;So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing.  There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime.  So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction.  No partial writes are visible.  No interleaved writes are possible.  Readers are concurrent.&lt;br /&gt;&lt;br /&gt;On linux, excepting appends, the inode is not locked for io.  Instead page reference counts and locks are used for individual parts of the file.  You can think of this like impromptu byte-range locking.  Linux allows readers to proceed with other readers and writers for overlapping byte ranges.  This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu.  If you read and the data is not buffered an exclusive lock on the page is used until the data is valid.  The same exclusive lock protects overlapping writes to the same page.  However, the results when writes span pages seem to be undefined.  
This is basically as concurrent as you can reasonably get.&lt;br /&gt;&lt;br /&gt;So what does posix say?&lt;br /&gt;ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)&lt;br /&gt;&lt;br /&gt;Rationale section for the read syscall:&lt;br /&gt;&lt;br /&gt;I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic&lt;br /&gt;means that all the bytes from a single operation that started out together&lt;br /&gt;end up together, without interleaving from other I/O operations.&lt;br /&gt;&lt;br /&gt;There are other statements elsewhere in posix that state a read following a write in time must see the results of the write as a whole.  The emphasis on time likely has to do with nfs.  So posix is fairly clear.  Linux is too loose but FreeBSD is too tight.  We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.&lt;br /&gt;&lt;br /&gt;Really standards have just been derived from legacy behavior of older unix in order to define the properties that they believed were important for existing applications.  In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases.  Clearly it is even more strict than FreeBSD.&lt;br /&gt;&lt;br /&gt;I believe for 8.0 I will try to make this programmable on a per-file or per-system basis.  Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes.  However, I find it hard to imagine any application wants to see partial byte results.  I suspect range locking on writes will be sufficient in almost all cases.</description>
  <comments>http://jeffr-tech.livejournal.com/20707.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>44</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/20425.html</guid>
  <pubDate>Sun, 23 Mar 2008 07:42:06 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/20425.html</link>
  <description>MyISAM performance is terrible in FreeBSD 7.0 due to the user-space pthread_rwlock implementation.  Just a word of warning if you intend to deploy a database server based on 7.0.  I am certain we will have this fixed in 7.1.  It will most likely be in CURRENT in a week or two.</description>
  <comments>http://jeffr-tech.livejournal.com/20425.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>46</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/20087.html</guid>
  <pubDate>Sun, 23 Mar 2008 04:13:47 GMT</pubDate>
  <title>FreeBSD SoC</title>
  <link>http://jeffr-tech.livejournal.com/20087.html</link>
  <description>I&apos;ve signed up to mentor a FreeBSD SoC project again this year.  I&apos;m most interested in sponsoring the following projects:&lt;br /&gt;&lt;br /&gt;1)  Improved schedgraph support.&lt;br /&gt;2)  User-space lock profiling tool&lt;br /&gt;3)  Improved VM tree structures&lt;br /&gt;4)  SMP safing Giant protected filesystems.&lt;br /&gt;&lt;br /&gt;Maybe others.  I know schedgraph and the user-space lock profiling may not sound as glamorous but they have the potential to have the biggest long-term impact on performance.  Please email jeff at freebsd.org if you&apos;d like to discuss these further.&lt;br /&gt;&lt;br /&gt;The official list of project ideas is here:&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.freebsd.org/projects/summerofcode.html&quot; rel=&quot;nofollow&quot;&gt;http://www.freebsd.org/projects/summerofcode.html&lt;/a&gt;</description>
  <comments>http://jeffr-tech.livejournal.com/20087.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>53</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/19908.html</guid>
  <pubDate>Sun, 16 Mar 2008 09:12:53 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/19908.html</link>
  <description>People are always posting comments of &apos;what about solaris!&apos;.  I&apos;m going to install some new operating systems on an 8way xeon (2x4).  So what about solaris?  What should I install?  Can I do a network install or do I have to burn dvds?  Any tips?  Which Linux should I install for benchmarking?  I&apos;ve just been using fedora.  Maybe I should stick with that since I&apos;m familiar with it.</description>
  <comments>http://jeffr-tech.livejournal.com/19908.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>54</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/19466.html</guid>
  <pubDate>Thu, 13 Mar 2008 05:31:21 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/19466.html</link>
  <description>I have an opteron with older slower memory that I reproduced the pipe tests on to see if it was any different on a 64bit system.  I&apos;m not going to paste the full results but here&apos;s a couple of data points:&lt;br /&gt;&lt;br /&gt;linux-2.6.24&lt;br /&gt; 64[writer]: 97.235 wall (2.031 usr, 68.674 sys), 10.531 Mb/sec&lt;br /&gt;1024[writer]: 13.300 wall (0.145 usr, 9.039 sys), 76.991 Mb/sec&lt;br /&gt;65536[writer]: 3.068 wall (0.001 usr, 1.718 sys), 333.766 Mb/sec&lt;br /&gt;&lt;br /&gt;FreeBSD 8.0-CURRENT undermydesk (no cpu switch patches though)&lt;br /&gt;64[writer]: 53.163 wall (1.057 usr, 42.083 sys), 19.261 Mb/sec&lt;br /&gt;1024[writer]: 5.325 wall (0.118 usr, 4.146 sys), 192.284 Mb/sec&lt;br /&gt;65536[writer]: 0.567 wall (0.000 usr, 0.130 sys), 1805.509 Mb/sec&lt;br /&gt;&lt;br /&gt;So on this machine we start off 2x as fast and end up 5.5x as fast.  The numbers pretty much follow a curve through those points.  This verifies the data taken from the old 32bit HTT machine they tested on.  I don&apos;t intend to post configs and so on as the original lkml thread is plenty rigorous.&lt;br /&gt;&lt;br /&gt;I forgot to mention earlier.  The FreeBSD Alan Cox has committed super-pages!  We&apos;re seeing some great gains from that.  This allows the kernel to automatically use large TLBs for conforming regions of memory.  It has a component that ensures that large, contiguous, chunks of physical memory will be available to support this.  There is also a defragmenting/compacting piece.  There&apos;s some great work going into FreeBSD 8.0 already!</description>
  <comments>http://jeffr-tech.livejournal.com/19466.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>49</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/19210.html</guid>
  <pubDate>Thu, 13 Mar 2008 00:17:02 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/19210.html</link>
  <description>A couple bits of news:  We tracked down our problem with the performance drop above 30 threads on Nick Piggin&apos;s mysql benchmark to conservative settings for the pthread adaptive spinning.  We see a big gain relative to where we were before.  Frankly at this point we&apos;re splitting hairs with linux and I don&apos;t really care where we stand.  We had a tremendous problem and we resolved it.  Time to move on.&lt;br /&gt;&lt;br /&gt;I removed kernel support for our M:N threading library last night.  8.0 will only support 1:1.  This will open the way to a lot of optimizations in the signal and sleeping paths, hopefully reducing the total number of locks required in the sleepq path to a minimum.&lt;br /&gt;&lt;br /&gt;There are some &apos;interesting&apos; pipe benchmarks floating around.  You can read about them on the lkml and the author&apos;s website:&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://213.148.29.37/PipeBench/&quot; rel=&quot;nofollow&quot;&gt;http://213.148.29.37/PipeBench/&lt;/a&gt;&lt;br /&gt;&lt;a href=&quot;http://lkml.org/lkml/2008/3/5/61&quot; rel=&quot;nofollow&quot;&gt;http://lkml.org/lkml/2008/3/5/61&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I say &apos;interesting&apos; of course because FreeBSD is doing way better than linux. ;)  pipes, the next battleground?  I don&apos;t know but it&apos;s worth a read anyway.&lt;br /&gt;&lt;br /&gt;I also have a patch to implement cpu affinity for our callout mechanism.  This is for time based callbacks.  The legacy callouts may have order dependencies or may not tolerate concurrency.  So by default they are all scheduled on the first callout thread.  There is one callout thread per-cpu and they have a kind of &apos;medium&apos; affinity for that cpu, however, if they are overloaded by some interrupt work another cpu can complete the callouts.  This removes the need to do any kind of load balancing across callout handlers because the scheduler can do a better job anyway.  
New callouts can specify any cpu when setting a timer and then they have an affinity for that thread until a different cpu is requested.  All migration is explicit.&lt;br /&gt;&lt;br /&gt;Hopefully having callout affinity will benefit our tcp stack where Robert Watson is experimenting with different kinds of affinity for tcp sessions.  It will also discourage migration of threads that are sleeping on time based events like select() and nanosleep().</description>
  <comments>http://jeffr-tech.livejournal.com/19210.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>36</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/19139.html</guid>
  <pubDate>Tue, 11 Mar 2008 02:59:39 GMT</pubDate>
  <title>ULE happenings, context switch surprise.</title>
  <link>http://jeffr-tech.livejournal.com/19139.html</link>
  <description>Lately I&apos;ve been able to spend a bunch of time on ULE thanks to Nokia.  They use it in one of their networking products.  I&apos;ve been doing all of this work in 8.0-CURRENT and backporting it for them at the same time.  It&apos;s a great model for both parties because users on -CURRENT shake out bugs that they&apos;d have to find in testing otherwise and we get new development paid for.&lt;br /&gt;&lt;br /&gt;I finished and committed the topology aware cpu scheduling that I discussed in earlier posts.  I also implemented a mechanism for CPU provisioning that you can use to restrict groups of processes to sets of cpus which can be dynamically migrated.  This will be useful for restricting jails to certain CPUs or dedicating some CPUs to real-time special-purpose tasks for example.&lt;br /&gt;&lt;br /&gt;Over the last few days I cleaned up my cpu switch optimizations and got those in.  The results are 25% faster context switching in a yield benchmark.  Even faster than linux on the same hardware.  Some day I&apos;ll put open solaris on so I have something else to compare to.&lt;br /&gt;&lt;br /&gt;Separate from the other switch benchmarks I&apos;ve been working on reimplementing amd64&apos;s context switching routine almost entirely in C.  I just wanted to do it because we&apos;re putting more complex things in and it was getting hard to find registers but it turns out you can make it much faster too.  The yield benchmark is another 10% faster with the C switch routine.  Mostly due to enabling more complex checks, like not setting MSR_FSBASE/GSBASE if they haven&apos;t changed, and getting uncommon code out of the fast path.</description>
  <comments>http://jeffr-tech.livejournal.com/19139.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>55</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/18706.html</guid>
  <pubDate>Fri, 07 Mar 2008 10:32:14 GMT</pubDate>
  <title>More sysbench noise.</title>
  <link>http://jeffr-tech.livejournal.com/18706.html</link>
  <description>&lt;a href=&quot;http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/&quot; rel=&quot;nofollow&quot;&gt;http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/ &lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Nick Piggin has been doing some benchmarking of recent linux kernels and FreeBSD 7.0 on a 2xquad core barcelona opteron.  He verified that the CFS problems seem to be fixed and FreeBSD&apos;s performance on this box with mysql is really very similar up to about 20 threads.  I feel confident that the test was conducted fairly and I&apos;m happy with these results.  Our stable release is doing very well even if fresh-out-of-git linux is showing better on this platform.  We already have some good gains in this workload in 8.0-CURRENT as well.  What&apos;s most important to me is that we stay relevant on common server hardware and we&apos;re doing a good job at that.&lt;br /&gt;&lt;br /&gt;I&apos;m also happy to see some collaboration and competition between linux and bsd kernel developers.  I hope that continues.  We&apos;re really more alike than we are different.&lt;br /&gt;&lt;br /&gt;Next up, we now have a 16 way xeon and 16 way opteron system to tune and test with.  More points of contention are being removed.  The code marches on.</description>
  <comments>http://jeffr-tech.livejournal.com/18706.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>29</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/18568.html</guid>
  <pubDate>Wed, 27 Feb 2008 07:46:41 GMT</pubDate>
  <link>http://jeffr-tech.livejournal.com/18568.html</link>
  <description>&lt;a href=&quot;http://www.onlamp.com/pub/a/bsd/2008/02/26/whats-new-in-freebsd-70.html?page=1&quot; rel=&quot;nofollow&quot;&gt;http://www.onlamp.com/pub/a/bsd/2008/02/26/whats-new-in-freebsd-70.html?page=1&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Federico Biancuzzi did a collection of short interviews with many key FreeBSD contributors.  I&apos;m even in there at the end.  Anyway, it&apos;s a good summary of some of the exciting technical things that are in 7.0.  I even learned some things while reading it.</description>
  <comments>http://jeffr-tech.livejournal.com/18568.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>16</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://jeffr-tech.livejournal.com/18215.html</guid>
  <pubDate>Thu, 21 Feb 2008 10:54:33 GMT</pubDate>
  <title>lung tech.</title>
  <link>http://jeffr-tech.livejournal.com/18215.html</link>
  <description>I, like many nerds and athletes before me, have suffered from asthma and lung problems for almost the entirety of my life.  I don&apos;t have the blue-in-the-face, bronchial-spasm, send-me-to-the-hospital variety.  Rather, I have a seemingly constant irritation and periodic, primarily exercise-induced, restriction of my airways that mostly just slows me down.  This is actually caused by a poor immune system reaction to airborne allergens.  Exercise triggers attacks because as much as 10x more air is moving over your lungs, so they&apos;re likely to get 10x as irritated.&lt;br /&gt;&lt;br /&gt;In any event, this hadn&apos;t been much of a problem for me in Seattle, except when I lived in a very moldy old house.  However, after moving to Hawaii something started really bothering me.  My training started out great, but after a virus I found myself unable to significantly exert myself for longer than 5 minutes or so.  I went to the doctor but wasn&apos;t satisfied with their diagnosis, so I bought myself a peak flow meter, a blood oximeter and a few other gadgets, and so the nerding began.&lt;br /&gt;&lt;br /&gt;The peak flow meter is really the most interesting.  This measures, in liters/minute, how rapidly you can force air through a constrained passage.  It&apos;s just a tube you blow in with a column and a gauge.  For someone of my height and age a &apos;normal&apos; peak flow rate would be around &lt;a href=&quot;http://www.healthcaresouth.com/pages/asthmaaverpeak.htm&quot; rel=&quot;nofollow&quot;&gt;600 l/min&lt;/a&gt;.  My actual measured flow rate was very regular at 675 l/min, so 112% of predicted, not bad!  However, after 5 minutes of vigorous exercise on a stationary bike it had dropped 20% to 540.  A 20% reduction in the rate your lungs move air is enough to perceive as constricted and tight.  Interestingly I&apos;d still be considered in a healthy flow range, and indeed I could rest and talk and walk just fine, I just couldn&apos;t ride my damned bike.  
The blood oximeter also showed a 5% drop in blood oxygen saturation during the constriction.&lt;br /&gt;&lt;br /&gt;Armed with these findings I asked for a maintenance drug, Advair, which has a corticosteroid to reduce inflammation.  And indeed, 3 days after starting this treatment, my peak flow now measures around 775, or about 15% better.  This would be better than the average flow rate for a 6&apos;8&quot; male.  And hopefully now, after missing the first two races of the season, my training can begin again in earnest.&lt;br /&gt;&lt;br /&gt;And the moral of the story is: you can never have enough gadgets.</description>
  <comments>http://jeffr-tech.livejournal.com/18215.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>23</lj:reply-count>
</item>
</channel>
</rss>
