April 4th, 2011

Performance problems in SUJ

SUJ has been around for a year now and 9.0 will release with it this summer. In preparation I am working on the few known performance problems. The problems are sufficiently general to softupdates that they may be of interest to those who study different filesystem consistency mechanisms.

The new code and dependencies add some extra CPU overhead to each filesystem operation but in practice this has been negligible. However once disks reach ops per second rates similar to that of network interface cards we will have to re-evaluate filesystems entirely. Back on topic, the two classes of problems we have encountered relate to synchronous journal writes and excessive rollbacks.

You may recall that softupdates uses rollbacks to revert metadata operations that are not yet safe when a buffer is written to disk. When the write completes the change is rolled forward in memory and the buffer is marked dirty again. This allows us to separate potentially circular dependencies, rolling back some while writing others, allowing the filesystem state to move forward. This eliminates the types of journaling problems that can occur when many operations are allowed to aggregate for efficiency reasons which may lead to waiting on unrelated IO when fsync() is called. Our notion of a transaction is less simplistic.

The journaling code adds new dependencies and new rollbacks to the filesystem. Most importantly, the allocation bitmaps are now rolled back. In some cases we may discover that one filesystem operation undoes another and softupdates handles this by canceling all of the dependencies after reverting the metadata changes. It turned out there were some cases where the time between canceling the dependencies and the actual reversion of the changes could be longer than I expected. This would leave a dependency that was unsatisfied which would hold a cylinder-group dirty for several seconds. The solution was to simply allow the journal record to proceed even when we decide to cancel the operation. If the operation is undone before the write is issued we will still eliminate it, however, there is no harm in journaling an operation that does not happen. The checker will discover the true state of all the metadata and take no action.

The second problem has to do with blocking journal writes. There are some cases where rollbacks would be impractical so instead we detect them and force a synchronous journal write. There are very few instances of this in the filesystem but one that remains is particularly egregious. The checker requires that a new block allocation is journaled before the block is actually written. The filesystem assumes that it can write to datablocks in any order and indeed it does so before the allocation bits hit the cylinder group. These are not compatible so a new block which is immediately written to disk after allocation will wait first for the journal write and second for the block write, doubling the latency. This is tricky because we only need to block in the case that the previous identity of the block was as an indirect block for a file whose truncation still exists in the journal. The new record must first be written so the checker doesn't attempt to interpret the block as a table of indirect block pointers.

I haven't yet solved this second problem. My intent is to cache the list of recently freed indirect blocks in some fashion but I need to do it with the least memory and cpu overhead I can. My hope is to solve this soon. Experimental kernels where this restriction is relaxed perform as well as softupdates without journaling in all of the tests I've tried.