[Most Recent Entries]
Below are the 20 most recent journal entries recorded in
[ << Previous 20 ]
[ << Previous 20 ]
|Monday, May 9th, 2011|
|Asynchronous partial truncation
I have spent a month of my life on partial truncation. Softupdates asynchronously handles the case where you were completely truncating a file, such as is the case when you delete a file. The operation would be scheduled, the in memory inode updated, and the whole thing would proceed in the background. However, when you truncated to a non zero length it would do many blocking operations while synchronously truncating. When I wrapped this synchronous operation for SUJ, I did not do it quite correctly, and as a result SUJ could leak blocks if you crashed during a partial truncation. This could actually lead to filesystem corruption if the checker confused a regular disk block for an indirect block and started freeing random pointers.
To resolve this, I modified the truncation machinery to handle partial truncation. This is hard because you may have many indirect blocks involved with many children blocks in different states. An indirect is a filesystem block that does nothing but point to other blocks, like a page table does for memory. It also had to handle ffs's somewhat complex fragment rules as well as zeroing partially empty blocks. It handles all of this now and supports an arbitrary number of partial truncations to the same file without any blocking operations. It always keeps the on disk copy safe while the in memory copy is free to grow again and indeed be truncated again after that. New pointers are not recorded in an indirect until prior truncation completes so there is no ambiguity about what revision of the file the blocks are from. This brings more complexity to fsync() which must now flush all pending truncations to disk before it can return.
The truncation code is a kind of asynchronous state machine that operates on leaf blocks first and then walks backwards up the tree until it reaches the root. This ensures that we always have a valid path to a block in case we crash. Indirects are only freed when all of their children are freed. For partial truncate, the block is zeroed only once those child pointers that need be are freed. Finally when all blocks have been freed the journal space can be reclaimed.
This post can not convey how complex this work was. It may not sound very dramatic or impressive but it truly has been one of the most complex projects I have ever undertaken.
|Monday, April 4th, 2011|
|Performance problems in SUJ
SUJ has been around for a year now and 9.0 will release with it this summer. In preparation I am working on the few known performance problems. The problems are sufficiently general to softupdates that they may be of interest to those who study different filesystem consistency mechanisms.
The new code and dependencies add some extra CPU overhead to each filesystem operation but in practice this has been negligible. However once disks reach ops per second rates similar to that of network interface cards we will have to re-evaluate filesystems entirely. Back on topic, the two classes of problems we have encountered relate to synchronous journal writes and excessive rollbacks.
You may recall that softupdates uses rollbacks to revert metadata operations that are not yet safe when a buffer is written to disk. When the write completes the change is rolled forward in memory and the buffer is marked dirty again. This allows us to separate potentially circular dependencies, rolling back some while writing others, allowing the filesystem state to move forward. This eliminates the types of journaling problems that can occur when many operations are allowed to aggregate for efficiency reasons which may lead to waiting on unrelated IO when fsync() is called. Our notion of a transaction is less simplistic.
The journaling code adds new dependencies and new rollbacks to the filesystem. Most importantly, the allocation bitmaps are now rolled back. In some cases we may discover that one filesystem operation undoes another and softupdates handles this by canceling all of the dependencies after reverting the metadata changes. It turned out there were some cases where the time between canceling the dependencies and the actual reversion of the changes could be longer than I expected. This would leave a dependency that was unsatisfied which would hold a cylinder-group dirty for several seconds. The solution was to simply allow the journal record to proceed even when we decide to cancel the operation. If the operation is undone before the write is issued we will still eliminate it, however, there is no harm in journaling an operation that does not happen. The checker will discover the true state of all the metadata and take no action.
The second problem has to do with blocking journal writes. There are some cases where rollbacks would be impractical so instead we detect them and force a synchronous journal write. There are very few instances of this in the filesystem but one that remains is particularly egregious. The checker requires that a new block allocation is journaled before the block is actually written. The filesystem assumes that it can write to datablocks in any order and indeed it does so before the allocation bits hit the cylinder group. These are not compatible so a new block which is immediately written to disk after allocation will wait first for the journal write and second for the block write, doubling the latency. This is tricky because we only need to block in the case that the previous identity of the block was as an indirect block for a file whose truncation still exists in the journal. The new record must first be written so the checker doesn't attempt to interpret the block as a table of indirect block pointers.
I haven't yet solved this second problem. My intent is to cache the list of recently freed indirect blocks in some fashion but I need to do it with the least memory and cpu overhead I can. My hope is to solve this soon. Experimental kernels where this restriction is relaxed perform as well as softupdates without journaling in all of the tests I've tried.
|Tuesday, March 29th, 2011|
|Interactivity score in ULE
I sometimes speak with Con Kolvis who is known for several Linux schedulers. Con is an interesting fellow because his background is not CS and he is very pragmatic about desktop performance. He doesn't care for the interactivity boost that ULE and previous Linux schedulers use in various forms. He periodically challenges me to consider the interactivity algorithm and whether it is ultimately necessary and effective. Below I present some analysis done when constructing the algorithm in use in ULE and why I believe it is effective and necessary while not suffering many of the pitfalls of earlier approaches.
Firstly, let me define the properties of what I believe is a good interactivity algorithm. These were my guiding principles in creating the ULE algorithm.
1) Any interactivity boost is gained slowly and lost quickly.
2) Interactivity should be harder to achieve the greater the system load.
3) The algorithm should not be exploitable to achieve an unfair share of the CPU.
4) The algorithm should be cheap to maintain and compute.
5) There should be sufficient history to permit bursty applications like web browsers.
The ULE algorithm uses a decaying history of voluntary sleep time and run time. Similar to %cpu, however, involuntary sleep time is not considered. That is to say, threads that are waiting due to contention for CPU resources are not given an interactivity boost for their time waiting. That allows the algorithm to work properly regardless of CPU load where if you only consider %cpu eventually all threads on a busy system will look interactive.
The algorithm scales the ratio of run time to sleep time to a value between 1 and 100. This is quite awkward in the kernel where we can't use floating point math. It decides the divisor depending on which value is larger giving a sort of bimodal distribution.
Here is a graph of what we theoretically would like the score to produce before we switch the divisor around:
And here is a graph generated by running the algorithm with a matrix of inputs:
The second graph uses larger numbers as we do in the kernel to reduce rounding effects. You can see an irregularity at 45 degrees where we switch divisors when the run time exceeds the sleep time. In practice these are never computed as we define a threshold of 20 above which tasks are not considered interactive so there is no point in computing the score when run time exceeds sleep time unless this threshold is moved.
Going from left to right runtime is increasing. From background to foreground sleep time is increasing. A thread would trace a path forward and to the right depending on its behavior. When they increase equally the score quickly reaches an equilibrium well above the threshold for interactive scheduling. A thread looking to abuse the system couldn't use much more than 20% of the cpu in a steady state. This can be adjusted by reducing the interactive threshold. On a busy system this 20% dwindles depending on load, ultimately providing no advantage to a would be exploiter. A thread running right out of the gate raises its score super-linearly to 50 within milliseconds, while a recently awoken thread climbs linearly as it accumulates cpu time.
The algorithm requires a lot of sleep time to be accumulated before a thread can be considered interactive. This remembered sleep time is capped at a few seconds so it only takes a few hundred milliseconds before we discover that a thread is no-longer interactive. It does permit interactive UI applications to wake up with the lowest possible latency since they have a very high priority. If they then abuse this benefit for very long they are scheduled round-robin based on cpu utilization like other bulk tasks. In practice we have picked values that keep desktop user applications interactive as well as is possible.
|Tuesday, March 22nd, 2011|
|OFED, 10gigE, and SUJ
I have merged the OFED 1.5.3 Infiniband stack into FreeBSD CURRENT. We have achieved feature and performance parity with the Linux stack using a combination of wrappers and re-implementation of sensitive pieces. lwn wrote an article about the wrapper work here
. Some FreeBSD developers are understandably concerned about growing a Linux kernel compat layer and how that could lower the quality of FreeBSD drivers. I don't foresee this as a real complication but only time will tell.
I'm working on bringing in support for Mellanox's 10gigE adapters now. It's always interesting for me to explore different directions operating systems take to accomplish the same features. Network buffering is one of those areas that is starkly different across operating systems. Maybe that deserves its own post.
I am now looking for bug reports for SUJ for another round of bug-fixing before 9.0 ships. There are a couple of areas where performance isn't great due to latency involved in blocking journal writes. I know how to eliminate these but it will take some time to implement. We are hoping to ship SUJ as the default in 9.0 and then I may provide an official backport to 8.x.
|Thursday, January 27th, 2011|
|wherein I replicate my feet
I sometimes do things unrelated to computers that are probably still considered geeky. I have made reference to being a cyclist before. As you know cyclists are obsessed with all things carbon fiber and I am no different. With the help of a boat builder friend of mine, I finally had an opportunity to make my own composite parts in pursuit of the perfectly fitting shoe. Behind the cut are some pictures and a description of the process.( Read more...Collapse )
|Monday, January 3rd, 2011|
|Year in review
I have not posted in a very long time. I have been busy though and I'll try to summarize the last year here.
Firstly, I collaborated with my good friends at fairwheel bikes to work on a modification to Shimano's new electronic shifting group. You can read about that at cyclingnews.com
. I replaced the stock computer with my own micro-controller that enables some advanced shifting features. I'm trying to turn this into a commercial enterprise with a friend. There is a chance that a pro team will be using it next year.
The majority of my year has been occupied with a port of the OpenFabrics Enterprise Distribution infiniband stack from Linux to FreeBSD. This is dual BSD/GPL licensed which permits the port. In pursuit of this I have created a 10,000 line Linux kernel api compatibility layer which allows us to run the vast majority of the infiniband code unmodified. As I mentioned on arch@freebsd the following pieces are emulated:
> atomics, types, bitops, byte order conversion, character devices, pci
> devices, dma, non-device files, idr tables, interrupts, ioremap, hashes,
> kobjects, radix trees, lists, modules, notifier blocks, rbtrees, rwlock,
> rwsem, semaphore, schedule, spinlocks, kalloc, wait queues, workqueues,
> timers, etc.
Additionally I have worked more on SUJ, mostly bug fixing. Kirk and Kostik have been most helpful in that and really did most of the work. There were some nasty bugs but we've whittled them down and now there are only a few performance regressions (and improvements) to concern ourselves with.
I wish I hadn't let this journal go for so long. If anyone has any specific interests let me know and I will try to post more frequently.
|Saturday, January 30th, 2010|
Just a short update; I will commit SUJ to CURRENT soon and there are already backports to 8 and 7 available. Things are really very stable.
I have done some fsck benchmarking. I recovered an 80% full 250gb volume that was doing a buildworld with 8 parallel processes in .9 seconds. The traditional fsck took 24 minutes. For the vast majority of workloads an unsafe shutdown will not prolong the boot by more than a few seconds. I also shortened the time that entries stayed valid in the journal which allowed me to trim the maximum journal size to 32MB. This will limit the maximum possible recovery time while still providing for 1 million outstanding transactions.
The project to date has taken 370 hours. The patch adds 11,000 lines and removes 2,000. This is much bigger than I was anticipating, although of course lines of code can be misleading.
|Friday, December 11th, 2009|
|What's in a journal anyway?
In this post I'll detail the contents of the journal and the recovery operation. Since we know that softupdates only leaves two inconsistencies, leaked inodes and leaked blocks, we only have to journal conditions which can create these circumstances. In truth we have to track all link count changes to an inode since they can have multiple named references via hardlinks. Blocks are somewhat simpler although ffs fragments complicate them considerably. At recovery time we verify whether links or pointers to blocks exist and use this information to free them if necessary. There are only 4 journal records (add ref, rem ref, new block, free block) and each is only 32bytes. This is effectively an intent log, there is no copy of the metadata in each record. Sounds simple no?( Read more...Collapse )
|Tuesday, December 8th, 2009|
|Journaling softupdates, SU+J
I suspect that most people who read this blog are familiar with journaling as a mechanism for ensuring post-crash filesystem consistency. Some of you may also be familiar with copy-on-write and log structured filesystems as an alternative to journaling. BSD's ffs, an extension of the original unix filesystem, has used an alternate approach called soft-updates to handle filesystem consistency for around 10 years. For the past few months, I have been creating a hybrid journaled softupdate system to deal with inadequacies in the existing softdep system. This work is opensource and will be available to FreeBSD-current users sometime this month. Behind the cut I briefly describe the tradeoffs in each consistency mechanism and motivation for this work.( Read more...Collapse )
|Saturday, January 31st, 2009|
|New UMA features for more efficient memory layout
I have wanted to write for some time about UMA changes I recently made. UMA is the "universal memory allocator" which serves as FreeBSD's kernel memory allocator. I initially wrote this 7 years ago or so and many other people have since contributed. I named it 'universal' because at that time FreeBSD had 3 separate kernel memory allocators and this unified them. The two new features relate to network performance work I've been doing lately and allow the use of more efficient layout of network buffers.( Read more...Collapse )
|Thursday, January 15th, 2009|
|more dram access timings on two interesting architectures
Ever wonder what memory latency is like on a large loosely connected opteron system? I lay awake at nights wondering myself. Fortunately, I have access to a tyan 8 socket barcelona system. This is basically two 4 socket boards with two very slow HT links between them. I also have access to a nehalem based box that I have timings for. The results are behind the cut.( Read more...Collapse )
|Sunday, May 18th, 2008|
I just returned from two weeks of travel. One for my wedding anniversary and another for the FreeBSD developer summit which preceded BSDCan. The summit was productive but I'm very happy to be done with the travel.
There were many great discussions at the summit with topics ranging from release engineering to TCP scalability. I participated in one on mbufs (network buffers) and one on the buffer cache (file-system buffers). For mbufs I presented a technique that I developed for Nokia based on Kip Macy's excellent work on 10gigabit ethernet drivers. This technique should simplify referenced data while reducing the number and temporal scope of cache lines accessed to manipulate buffers in the common case. There is still much work to do to prove it out however.
The buf discussion lasted almost two hours and was much broader in scope. We will hopefully have a fully revamped IO path for 8.0 to address a wide variety of structural and performance problems. I'm very excited to see this work progress after many years of planning and discussion. My SoC student this year is implementing one essential piece by replacing a splay tree in the vm with a radix tree.
I also had a very interesting discussion with a new project member, Lawrence Stewart, about tcp congestion control which he gave a talk on later in the conference. I spent the very first part of my career working in the tcp/ip group at microsoft when tcp vegas was still relatively new. Congestion control was one part of my work responsibilities and an area I pursued as a hobby. I was surprised to hear that delay based congestion control was making a comeback in some research circles. It was nice to hear about developments in this field that I haven't followed in some time.
After all of that socializing and discussion I had a horrible flight but was pleased to find that Hawaii now feels very much like home to me and returning was quite a relief.
|Wednesday, April 16th, 2008|
One lesson learned from working on synchronization primitives is that it's often profitable to spin before sleeping. We have adaptive mutexes, rwlocks, etc. rather than simply having sleeping locks or spinlocks. This has had an unexpected influence on our idle loop.
When a thread becomes runnable it is often desirable to run it on a cpu other than the current one. If the target cpu is in the idle loop, it may actually be waiting in a low power state using the 'hlt' instruction or some acpi mechanism that I try to avoid. To wake up this remote cpu we currently issue an IPI (inter-processor interrupt). This is actually very expensive for the sender and receiver.
On some CPUs which support SSE3 there is a pair of instructions, monitor and mwait, which allow you to signal a remote cpu using only memory operations. This works by giving the programmer access to the existing hardware bus snooping interface. The sleeping cpu sees another cpu write to a memory location we're snooping and we wake up.
On barcelona mwait doesn't enter as deep of a sleep as on the xeons. So I decided to use an adaptive algorithm that would mwait when we're busy and hlt when we're not. With mwait you can actually specify the power state you'd like so I keep both the Xeon and Opterons in C0 to further reduce wakeup latency.
Then an engineer at Nokia suggested I go one step further and allow the idle thread to spin waiting for work for a short period. So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
Using a 'ping-pong' threads program that sends a single token around a ring of threads I see a 20% perf improvement vs the old non-adaptive mechanism. In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
|Monday, April 7th, 2008|
|file offset semantics
I'm further exploring the concurrency guarantees of file i/o in various operating systems. I've found more surprising race conditions and differences of implementation between operating systems.
Each file descriptor in UNIX has an associated offset with it. This is what allows you to say read() over and over again without specifying a position and getting later and later chunks of a file. Or to write and continue where you left off. There is the additional complicated of append mode writes but let's ignore that for a moment.
To keep things straight let's call the actual file representation the inode (in FreeBSD it's a "vnode") and the open descriptor is a 'file'. This is in keeping with how it's done in the kernel. So many threads or even processes may share a single file descriptor that points to one file, so they have a shared offset. Or many processes may have unique file descriptors and so they have unique offsets.
In the shared case we have to determine how updates to this offset are serialized. One important detail is that the offset is 64bit. On 32bit platforms this means it's written with two discrete writes. Without some serialization other threads can see half of the update, or in the worst case, two simultaneous updaters may set different bytes in the final offset leaving you with a corrupt or invalid offset.
Another question is, what happens with two simultaneous writes to the file? If we don't serialize the offset they will both write to the same location. If we do, they write one after the other. The same goes for the read side. If two threads in the same process read from the same file simultaneously do they get unique data or the same data? This is true of threads and processes forked with rfork().
Before about 1986 in unix there was no serialization on updates. It also was non-preemptive, uniprocessor and had 32bit offsets so you didn't have to worry about partial writes even on 16bit machines. The inode was locked after the offset was loaded and multiple readers could see the same data and multiple writers would write to the same offset. McKusick changed it in CSRG sources in 1987 so the exclusive inode lock also protected offset to handle a case where a forked process was getting output mixed up.
Solaris manipulates the offset within a shared vnode lock for reads and an exclusive lock for writes. This means writers are serialized but readers are not. It also means that offset updates in the read case on 32bit can corrupt the offset value.
Linux manipulates the offset without a lock in any case. The offset pointer is corruptible on 32bit processors. Neither readers nor writers are serialized.
FreeBSD now allows shared vnode locks on read which 4.3BSD did not, but we use a separate lock to maintain the strict f_offset protection in all cases. This actually serializes reads done to the same fd if they don't use pread().
Posix doesn't specify this carefully enough to say what is required.
I think at a minimum solaris/linux need to protect the value on 32bit architectures. It's a once in a year type event that could lead to problems but these are the kinds of races and bugs that are impossible to track down. FreeBSD, on the other hand, could relax the restriction on read updates. It doesn't make much sense to do so for writes and this fixes the original bug encountered in 1986. I'll have to think of an elegant way to handle 64bit writes on 32bit platforms however.
I worked at an internet service provider when I was in highschool and as a result got free email for life. This is my @chesapeake.net address. Unfortunately chesapeake.net outsourced email to a bulk provider and I've had a remarkable number of emails just plain dropped since. So what am I to do about it?
I registered jroberson.net and set the mx to point at google. Then I set my google account to let me pop and send mail via smtp. So what I don't understand is why google does this for free? I'm not looking at any targeted advertising. They're just acting as temporary storage until I pop the mail. I may never use the web interface again.
Also, I'd like to take this opportunity to point out that google is the new microsoft which was the new ibm which was probably the new something else. What I mean is, exciting and innovative company becomes large and imposing and then no one likes them. I'm surprised more people haven't seen the writing on the wall. I speculate that after 10-15 years of declining popularity and utility microsoft will eventually become a respected, productive member of society again like IBM did.
|Monday, March 31st, 2008|
I have long wondered about exactly what atomicity guarantees of read() and write() are so I did some code and posix spelunking over the weekend. The scenarios I'm talking about are as such:
1) Can readers read concurrently with readers?
2) Can readers read concurrently with writers?
3) If readers read concurrently with writers do they see old bytes, new bytes, or potentially a mix of both?
4) Can writers write concurrently with other writers?
5) If writers can write concurrently what constraints are there on the resulting bytes?
So first, what BSD does is hold a shared lock on the inode while reading and an exclusive lock while writing. There is an additional issue with the file descriptor offset that really should be a second post that I might do sometime. So on BSD you have a strong guarantee that readers and writers will see a write as a single atomic transaction. No partial writes are visible. No interleaved writes are possible. Readers are concurrent.
On linux, excepting appends, the inode is not locked for io. Instead page reference counts and locks are used for individual parts of the file. You can think of this like impromptu byte-range locking. Linux allows readers to proceed with other readers and writers for overlapping byte ranges. This means you can call read() and see incomplete results of a file rewrite as it is happening on another cpu. If you read and the data is not buffered an exclusive lock on the page is used until the data is valid. The same exclusive lock protects overlapping writes to the same page. However, the results when writes span pages seem to be undefined. This is basically as concurrent as you can reasonably get.
So what does posix say?
ISO IEC 9945-2 2002 POSIX.1 - 2 System Interfaces page 1174 (1203rd page in the pdf)
Rationale section for the read syscall:
I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations.
There are other statements elsewhere in posix that state a read following a write in time must see the results of the write as a whole. The emphasis on time likely has to do with nfs. So posix is fairly clear. Linux is too loose but FreeBSD is too tight. We can allow concurrent writers to the same file as long as they are non-overlapping without violating any rules.
Really standards have just been derived from legacy behavior of older unix in order to define the properties that they believed were important for existing applications. In this vein I looked at seventh edition unix, which uses an exclusive lock over the inode in all cases. Clearly it is even more strict than FreeBSD.
I believe for 8.0 I will try to make this programmable on a per-file or per-system basis. Once the basic infrastructure is in place it would be easy to define the types of locks required for the operation to permit willing applications to see a less consistent view of the bytes. However, I find it hard to imagine any application wants to see partial byte results. I suspect range locking on writes will be sufficient in almost all cases.
|Saturday, March 22nd, 2008|
MyISAM performance is terrible in FreeBSD 7.0 due to the user-space pthread_rwlock implementation. Just a word of warning if you intend to deploy a database server based on 7.0. I am certain we will have this fixed in 7.1. It will most likely be in CURRENT in a week or two.
I've signed up to mentor a FreeBSD SoC project again this year. I'm most interested in sponsoring the following projects:
1) Improved schedgraph support.
2) User-space lock profiling tool
3) Improved VM tree structures
4) SMP safing Giant protected filesystems.
Maybe others. I know schedgraph and the user-space lock profiling may not sound as glamorous but they have the potential to have the biggest long-term impact on performance. Please email jeff at freebsd.org if you'd like to discuss these further.
The official list of project ideas is here:http://www.freebsd.org/projects/summerofcode.html
|Saturday, March 15th, 2008|
People are always posting comments of 'what about solaris!'. I'm going to install some new operating systems on an 8way xeon (2x4). So what about solaris? What should I install? Can I do a network install or do I have to burn dvds? Any tips? Which Linux should I install for benchmarking? I've just been using fedora. Maybe I should stick with that since I'm familiar with it.
|Wednesday, March 12th, 2008|
I have an opteron with older slower memory that I reproduced the pipe tests on to see if it was any different on a 64bit system. I'm not going to paste the full results but here's a couple of data points:
64[writer]: 97.235 wall (2.031 usr, 68.674 sys), 10.531 Mb/sec
1024[writer]: 13.300 wall (0.145 usr, 9.039 sys), 76.991 Mb/sec
65536[writer]: 3.068 wall (0.001 usr, 1.718 sys), 333.766 Mb/sec
FreeBSD 8.0-CURRENT undermydesk (no cpu switch patches though)
64[writer]: 53.163 wall (1.057 usr, 42.083 sys), 19.261 Mb/sec
1024[writer]: 5.325 wall (0.118 usr, 4.146 sys), 192.284 Mb/sec
65536[writer]: 0.567 wall (0.000 usr, 0.130 sys), 1805.509 Mb/sec
So on this machine we start of 2x as fast and end up 5.5x as fast. The numbers pretty much follow a curve through those points. This verifies the data taken from the old 32bit HTT machine they tested on. I don't intend to post configs and so on as the original lkml thread is plenty rigorous enough.
I forgot to mention earlier. The FreeBSD Alan Cox has committed super-pages! We're seeing some great gains from that. This allows the kernel to automatically use large TLBs for conforming regions of memory. It has a component that ensures that large, contiguous, chunks of physical memory will be available to support this. There is also a defragmenting/compacting piece. There's some great work going into FreeBSD 8.0 already!