May 9th, 2011

Asynchronous partial truncation

I have spent a month of my life on partial truncation. Softupdates asynchronously handles the case where you were completely truncating a file, such as is the case when you delete a file. The operation would be scheduled, the in memory inode updated, and the whole thing would proceed in the background. However, when you truncated to a non zero length it would do many blocking operations while synchronously truncating. When I wrapped this synchronous operation for SUJ, I did not do it quite correctly, and as a result SUJ could leak blocks if you crashed during a partial truncation. This could actually lead to filesystem corruption if the checker confused a regular disk block for an indirect block and started freeing random pointers.

To resolve this, I modified the truncation machinery to handle partial truncation. This is hard because you may have many indirect blocks involved with many children blocks in different states. An indirect is a filesystem block that does nothing but point to other blocks, like a page table does for memory. It also had to handle ffs's somewhat complex fragment rules as well as zeroing partially empty blocks. It handles all of this now and supports an arbitrary number of partial truncations to the same file without any blocking operations. It always keeps the on disk copy safe while the in memory copy is free to grow again and indeed be truncated again after that. New pointers are not recorded in an indirect until prior truncation completes so there is no ambiguity about what revision of the file the blocks are from. This brings more complexity to fsync() which must now flush all pending truncations to disk before it can return.

The truncation code is a kind of asynchronous state machine that operates on leaf blocks first and then walks backwards up the tree until it reaches the root. This ensures that we always have a valid path to a block in case we crash. Indirects are only freed when all of their children are freed. For partial truncate, the block is zeroed only once those child pointers that need be are freed. Finally when all blocks have been freed the journal space can be reclaimed.

This post can not convey how complex this work was. It may not sound very dramatic or impressive but it truly has been one of the most complex projects I have ever undertaken.