Journaling softupdates, SU+J
If filesystem technologies are the subject of fads, it seems everything in the human experience must be. Lately we've seen a resurgence of copy-on-write (COW) filesystems, which seem to have been popularized by ZFS but have existed at least as far back as 1990's Rosenblum paper on LFS. These may be particularly attractive on flash media, where seek time is not a problem and where clustering writes together can reduce erase block fragmentation. However, on spinning media COW tends to fragment the drive and reduce the quality of allocation decisions, because it essentially requires at least two versions of any modified data to be reachable at once.
Journaled filesystems were of course the earlier fad. In this mechanism a copy of any metadata, and sometimes data, that is to be modified is kept in a journal, a fixed area of the disk or another disk, that logs each modification. On reboot the journal is replayed and the filesystem is left in a consistent state. The problem with journaling is that either it's very simple and uses a tremendous amount of space and I/O, with a strict transaction model that prevents some concurrency, or it becomes incredibly complex, to the tune of over 20,000 lines of code in xfs. I still have questions about concurrency when multiple transactions affect a block in xfs, but I need to dig deeper to understand this.
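To make the replay step concrete, here is a minimal sketch of the simple end of that spectrum: a redo log with commit markers. The record format ("write" and "commit" tuples) and the function name are my own inventions for illustration and are not taken from xfs or any other real filesystem.

```python
def replay(disk, journal):
    """Apply committed transactions in log order; drop an uncommitted tail.

    disk    -- dict mapping block name to contents (stands in for the disk)
    journal -- list of ("write", block, data) and ("commit",) records
    """
    pending = []
    for record in journal:
        if record[0] == "write":
            # Hold the write until we know its transaction committed.
            pending.append(record)
        elif record[0] == "commit":
            for _, block, data in pending:
                disk[block] = data
            pending = []
    # Anything still pending belongs to a transaction that never committed
    # before the crash, so it is ignored.
    return disk


journal = [
    ("write", "inode 5", "allocated"),
    ("write", "dir", "foo -> 5"),
    ("commit",),
    ("write", "dir", "foo -> 5, bar -> 6"),  # crash before this commit
]
disk = replay({"dir": ""}, journal)
```

After the replay the disk holds the committed create of foo and nothing from the interrupted transaction. Replaying the same journal a second time gives the same result, which is what makes it safe to crash during recovery.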
Soft-updates is an alternative to this scheme where the filesystem keeps a list of dependencies that must be satisfied before a change to the filesystem can be visible on disk. For example, you wouldn't want to write a directory entry pointing at an inode until the inode was initialized on disk and marked allocated. Softdep handles this by rolling back changes to metadata that don't yet have their dependencies satisfied when we try to write a block. In this way we can commit any completed 'transactions' while keeping the disk state consistent. Softdep also allows these dependencies to discover operations which cancel each other out, so that nothing makes it to disk. For example, let's say you create a temporary file and then remove it after writing some blocks, which compilers often do. If it all happens within the interval of the syncer, nothing will make it to disk.
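The rollback and cancellation ideas can be sketched with a toy model. This is not how the softdep code is structured; the class and method names are hypothetical, there is a single directory block and a single inode block, and only the create-side dependency (directory entry waits for its inode) is modeled.

```python
class ToySoftdep:
    """Toy model: one directory block and one inode block, cached in memory."""

    def __init__(self):
        self.mem_dir = {}         # in-memory directory block: name -> inode
        self.mem_inodes = set()   # in-memory inode block: allocated inodes
        self.disk_dir = {}        # directory block as last written to disk
        self.disk_inodes = set()  # inode block as last written to disk
        self.pending = {}         # name -> inode that must reach disk first

    def create(self, name, ino):
        self.mem_inodes.add(ino)
        self.mem_dir[name] = ino
        self.pending[name] = ino  # the entry depends on the inode write

    def remove(self, name):
        ino = self.mem_dir.pop(name)
        self.mem_inodes.discard(ino)
        # If the create never reached disk, the two operations cancel out.
        # (Ordering for an entry that is already on disk is not modeled.)
        self.pending.pop(name, None)

    def write_inode_block(self):
        self.disk_inodes = set(self.mem_inodes)
        # Dependencies on inodes that are now on disk are satisfied.
        self.pending = {n: i for n, i in self.pending.items()
                        if i not in self.disk_inodes}

    def write_dir_block(self):
        # Roll back entries whose inode is not on disk yet: they stay in the
        # in-memory block but are left out of the image that is written.
        self.disk_dir = {n: i for n, i in self.mem_dir.items()
                         if n not in self.pending}
```

If the directory block is written before the inode block, the new entry is simply absent from the disk image and appears on the next write after the inode is safe. A create followed by a remove before either block is written leaves both disk images untouched.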
Soft-updates guarantees that the only filesystem inconsistencies on unclean shutdown are leaked blocks and inodes. To resolve this you can run a background fsck or you can ignore it until you start to run out of space. We also could've written a mark and sweep garbage collector but never did. Ultimately, bgfsck is too expensive and people did not like the uncertainty of not having run fsck. To resolve these issues, I have added a small journal to softupdates. However, I only have to journal block allocations and frees and inode link count changes, since softdep guarantees the rest. My journal records are only 32 bytes each, which is incredibly compact compared to any other journaling solution. We still get the great concurrency and the ability to ignore writes which have been canceled by new operations. But now we have a recovery time that is around 2 seconds per megabyte of journal in use. That's 32,768 blocks allocated, files created, links added, etc. per megabyte of journal.
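The arithmetic behind those numbers is easy to check. In the sketch below the field layout of the record is purely hypothetical, picked only so that it packs to 32 bytes; the real on-disk format is not described here. The 2 seconds per megabyte figure is the rough estimate quoted above, not a measured constant.

```python
import struct

# Hypothetical 32-byte layout (not the real on-disk format):
# op code, link count delta, inode number, block number, sequence number.
RECORD = struct.Struct("<IiQQQ")

OP_BLOCK_ALLOC, OP_BLOCK_FREE, OP_LINK_CHANGE = 1, 2, 3

MEGABYTE = 1 << 20


def pack_record(op, delta, ino, blkno, seq):
    """Encode one journal record as exactly RECORD.size bytes."""
    return RECORD.pack(op, delta, ino, blkno, seq)


def records_per_megabyte(record_size=RECORD.size):
    """How many fixed-size records fit in one megabyte of journal."""
    return MEGABYTE // record_size


def estimated_recovery_seconds(journal_bytes_in_use, seconds_per_mb=2.0):
    """Scale the rough 2 seconds per megabyte estimate to a journal size."""
    return journal_bytes_in_use / MEGABYTE * seconds_per_mb


link_added = pack_record(OP_LINK_CHANGE, +1, ino=5, blkno=0, seq=1)
```

With 32-byte records, one megabyte holds 1,048,576 / 32 = 32,768 of them, and by the estimate above a journal with 4 megabytes in use would take about 8 seconds to recover.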
This work is being funded by a group of companies: iXsystems, Yahoo!, and Juniper Networks. I'm interested to see if this kind of project is feasible in the future, where companies can share development costs for specific open source projects.
The work is being done in collaboration with Kirk McKusick, the original author of ffs and softupdates. We will likely present a paper at BSDCan and then at a more formal venue following that. The code will be publicly available within two weeks.