Asynchronous partial truncation

I have spent a month of my life on partial truncation. Softupdates asynchronously handles the case where you were completely truncating a file, such as is the case when you delete a file. The operation would be scheduled, the in memory inode updated, and the whole thing would proceed in the background. However, when you truncated to a non zero length it would do many blocking operations while synchronously truncating. When I wrapped this synchronous operation for SUJ, I did not do it quite correctly, and as a result SUJ could leak blocks if you crashed during a partial truncation. This could actually lead to filesystem corruption if the checker confused a regular disk block for an indirect block and started freeing random pointers.

To resolve this, I modified the truncation machinery to handle partial truncation. This is hard because you may have many indirect blocks involved with many children blocks in different states. An indirect is a filesystem block that does nothing but point to other blocks, like a page table does for memory. It also had to handle ffs's somewhat complex fragment rules as well as zeroing partially empty blocks. It handles all of this now and supports an arbitrary number of partial truncations to the same file without any blocking operations. It always keeps the on disk copy safe while the in memory copy is free to grow again and indeed be truncated again after that. New pointers are not recorded in an indirect until prior truncation completes so there is no ambiguity about what revision of the file the blocks are from. This brings more complexity to fsync() which must now flush all pending truncations to disk before it can return.

The truncation code is a kind of asynchronous state machine that operates on leaf blocks first and then walks backwards up the tree until it reaches the root. This ensures that we always have a valid path to a block in case we crash. Indirects are only freed when all of their children are freed. For partial truncate, the block is zeroed only once those child pointers that need be are freed. Finally when all blocks have been freed the journal space can be reclaimed.

This post can not convey how complex this work was. It may not sound very dramatic or impressive but it truly has been one of the most complex projects I have ever undertaken.

Performance problems in SUJ

SUJ has been around for a year now and 9.0 will release with it this summer. In preparation I am working on the few known performance problems. The problems are sufficiently general to softupdates that they may be of interest to those who study different filesystem consistency mechanisms.

The new code and dependencies add some extra CPU overhead to each filesystem operation but in practice this has been negligible. However once disks reach ops per second rates similar to that of network interface cards we will have to re-evaluate filesystems entirely. Back on topic, the two classes of problems we have encountered relate to synchronous journal writes and excessive rollbacks.

You may recall that softupdates uses rollbacks to revert metadata operations that are not yet safe when a buffer is written to disk. When the write completes the change is rolled forward in memory and the buffer is marked dirty again. This allows us to separate potentially circular dependencies, rolling back some while writing others, allowing the filesystem state to move forward. This eliminates the types of journaling problems that can occur when many operations are allowed to aggregate for efficiency reasons which may lead to waiting on unrelated IO when fsync() is called. Our notion of a transaction is less simplistic.

The journaling code adds new dependencies and new rollbacks to the filesystem. Most importantly, the allocation bitmaps are now rolled back. In some cases we may discover that one filesystem operation undoes another and softupdates handles this by canceling all of the dependencies after reverting the metadata changes. It turned out there were some cases where the time between canceling the dependencies and the actual reversion of the changes could be longer than I expected. This would leave a dependency that was unsatisfied which would hold a cylinder-group dirty for several seconds. The solution was to simply allow the journal record to proceed even when we decide to cancel the operation. If the operation is undone before the write is issued we will still eliminate it, however, there is no harm in journaling an operation that does not happen. The checker will discover the true state of all the metadata and take no action.

The second problem has to do with blocking journal writes. There are some cases where rollbacks would be impractical so instead we detect them and force a synchronous journal write. There are very few instances of this in the filesystem but one that remains is particularly egregious. The checker requires that a new block allocation is journaled before the block is actually written. The filesystem assumes that it can write to datablocks in any order and indeed it does so before the allocation bits hit the cylinder group. These are not compatible so a new block which is immediately written to disk after allocation will wait first for the journal write and second for the block write, doubling the latency. This is tricky because we only need to block in the case that the previous identity of the block was as an indirect block for a file whose truncation still exists in the journal. The new record must first be written so the checker doesn't attempt to interpret the block as a table of indirect block pointers.

I haven't yet solved this second problem. My intent is to cache the list of recently freed indirect blocks in some fashion but I need to do it with the least memory and cpu overhead I can. My hope is to solve this soon. Experimental kernels where this restriction is relaxed perform as well as softupdates without journaling in all of the tests I've tried.

Interactivity score in ULE

I sometimes speak with Con Kolvis who is known for several Linux schedulers. Con is an interesting fellow because his background is not CS and he is very pragmatic about desktop performance. He doesn't care for the interactivity boost that ULE and previous Linux schedulers use in various forms. He periodically challenges me to consider the interactivity algorithm and whether it is ultimately necessary and effective. Below I present some analysis done when constructing the algorithm in use in ULE and why I believe it is effective and necessary while not suffering many of the pitfalls of earlier approaches.

Firstly, let me define the properties of what I believe is a good interactivity algorithm. These were my guiding principles in creating the ULE algorithm.

1) Any interactivity boost is gained slowly and lost quickly.
2) Interactivity should be harder to achieve the greater the system load.
3) The algorithm should not be exploitable to achieve an unfair share of the CPU.
4) The algorithm should be cheap to maintain and compute.
5) There should be sufficient history to permit bursty applications like web browsers.

The ULE algorithm uses a decaying history of voluntary sleep time and run time. Similar to %cpu, however, involuntary sleep time is not considered. That is to say, threads that are waiting due to contention for CPU resources are not given an interactivity boost for their time waiting. That allows the algorithm to work properly regardless of CPU load where if you only consider %cpu eventually all threads on a busy system will look interactive.

The algorithm scales the ratio of run time to sleep time to a value between 1 and 100. This is quite awkward in the kernel where we can't use floating point math. It decides the divisor depending on which value is larger giving a sort of bimodal distribution.

Here is a graph of what we theoretically would like the score to produce before we switch the divisor around:



And here is a graph generated by running the algorithm with a matrix of inputs:



The second graph uses larger numbers as we do in the kernel to reduce rounding effects. You can see an irregularity at 45 degrees where we switch divisors when the run time exceeds the sleep time. In practice these are never computed as we define a threshold of 20 above which tasks are not considered interactive so there is no point in computing the score when run time exceeds sleep time unless this threshold is moved.

Going from left to right runtime is increasing. From background to foreground sleep time is increasing. A thread would trace a path forward and to the right depending on its behavior. When they increase equally the score quickly reaches an equilibrium well above the threshold for interactive scheduling. A thread looking to abuse the system couldn't use much more than 20% of the cpu in a steady state. This can be adjusted by reducing the interactive threshold. On a busy system this 20% dwindles depending on load, ultimately providing no advantage to a would be exploiter. A thread running right out of the gate raises its score super-linearly to 50 within milliseconds, while a recently awoken thread climbs linearly as it accumulates cpu time.

The algorithm requires a lot of sleep time to be accumulated before a thread can be considered interactive. This remembered sleep time is capped at a few seconds so it only takes a few hundred milliseconds before we discover that a thread is no-longer interactive. It does permit interactive UI applications to wake up with the lowest possible latency since they have a very high priority. If they then abuse this benefit for very long they are scheduled round-robin based on cpu utilization like other bulk tasks. In practice we have picked values that keep desktop user applications interactive as well as is possible.

OFED, 10gigE, and SUJ

I have merged the OFED 1.5.3 Infiniband stack into FreeBSD CURRENT. We have achieved feature and performance parity with the Linux stack using a combination of wrappers and re-implementation of sensitive pieces. lwn wrote an article about the wrapper work here. Some FreeBSD developers are understandably concerned about growing a Linux kernel compat layer and how that could lower the quality of FreeBSD drivers. I don't foresee this as a real complication but only time will tell.

I'm working on bringing in support for Mellanox's 10gigE adapters now. It's always interesting for me to explore different directions operating systems take to accomplish the same features. Network buffering is one of those areas that is starkly different across operating systems. Maybe that deserves its own post.

I am now looking for bug reports for SUJ for another round of bug-fixing before 9.0 ships. There are a couple of areas where performance isn't great due to latency involved in blocking journal writes. I know how to eliminate these but it will take some time to implement. We are hoping to ship SUJ as the default in 9.0 and then I may provide an official backport to 8.x.

wherein I replicate my feet

I sometimes do things unrelated to computers that are probably still considered geeky. I have made reference to being a cyclist before. As you know cyclists are obsessed with all things carbon fiber and I am no different. With the help of a boat builder friend of mine, I finally had an opportunity to make my own composite parts in pursuit of the perfectly fitting shoe. Behind the cut are some pictures and a description of the process.

Read more...Collapse )

Year in review

I have not posted in a very long time. I have been busy though and I'll try to summarize the last year here.

Firstly, I collaborated with my good friends at fairwheel bikes to work on a modification to Shimano's new electronic shifting group. You can read about that at cyclingnews.com. I replaced the stock computer with my own micro-controller that enables some advanced shifting features. I'm trying to turn this into a commercial enterprise with a friend. There is a chance that a pro team will be using it next year.

The majority of my year has been occupied with a port of the OpenFabrics Enterprise Distribution infiniband stack from Linux to FreeBSD. This is dual BSD/GPL licensed which permits the port. In pursuit of this I have created a 10,000 line Linux kernel api compatibility layer which allows us to run the vast majority of the infiniband code unmodified. As I mentioned on arch@freebsd the following pieces are emulated:

> atomics, types, bitops, byte order conversion, character devices, pci
> devices, dma, non-device files, idr tables, interrupts, ioremap, hashes,
> kobjects, radix trees, lists, modules, notifier blocks, rbtrees, rwlock,
> rwsem, semaphore, schedule, spinlocks, kalloc, wait queues, workqueues,
> timers, etc.

Additionally I have worked more on SUJ, mostly bug fixing. Kirk and Kostik have been most helpful in that and really did most of the work. There were some nasty bugs but we've whittled them down and now there are only a few performance regressions (and improvements) to concern ourselves with.

I wish I hadn't let this journal go for so long. If anyone has any specific interests let me know and I will try to post more frequently.

(no subject)

Just a short update; I will commit SUJ to CURRENT soon and there are already backports to 8 and 7 available. Things are really very stable.

I have done some fsck benchmarking. I recovered an 80% full 250gb volume that was doing a buildworld with 8 parallel processes in .9 seconds. The traditional fsck took 24 minutes. For the vast majority of workloads an unsafe shutdown will not prolong the boot by more than a few seconds. I also shortened the time that entries stayed valid in the journal which allowed me to trim the maximum journal size to 32MB. This will limit the maximum possible recovery time while still providing for 1 million outstanding transactions.

The project to date has taken 370 hours. The patch adds 11,000 lines and removes 2,000. This is much bigger than I was anticipating, although of course lines of code can be misleading.

What's in a journal anyway?

In this post I'll detail the contents of the journal and the recovery operation. Since we know that softupdates only leaves two inconsistencies, leaked inodes and leaked blocks, we only have to journal conditions which can create these circumstances. In truth we have to track all link count changes to an inode since they can have multiple named references via hardlinks. Blocks are somewhat simpler although ffs fragments complicate them considerably. At recovery time we verify whether links or pointers to blocks exist and use this information to free them if necessary. There are only 4 journal records (add ref, rem ref, new block, free block) and each is only 32bytes. This is effectively an intent log, there is no copy of the metadata in each record. Sounds simple no?

Read more...Collapse )

Journaling softupdates, SU+J

I suspect that most people who read this blog are familiar with journaling as a mechanism for ensuring post-crash filesystem consistency. Some of you may also be familiar with copy-on-write and log structured filesystems as an alternative to journaling. BSD's ffs, an extension of the original unix filesystem, has used an alternate approach called soft-updates to handle filesystem consistency for around 10 years. For the past few months, I have been creating a hybrid journaled softupdate system to deal with inadequacies in the existing softdep system. This work is opensource and will be available to FreeBSD-current users sometime this month. Behind the cut I briefly describe the tradeoffs in each consistency mechanism and motivation for this work.

Read more...Collapse )

New UMA features for more efficient memory layout

I have wanted to write for some time about UMA changes I recently made. UMA is the "universal memory allocator" which serves as FreeBSD's kernel memory allocator. I initially wrote this 7 years ago or so and many other people have since contributed. I named it 'universal' because at that time FreeBSD had 3 separate kernel memory allocators and this unified them. The two new features relate to network performance work I've been doing lately and allow the use of more efficient layout of network buffers.

Read more...Collapse )