jeffr_tech (jeffr_tech) wrote,

What's in a journal anyway?

In this post I'll detail the contents of the journal and the recovery operation. Since we know that softupdates only leaves two inconsistencies, leaked inodes and leaked blocks, we only have to journal the operations that can create them. In truth we have to track all link count changes to an inode, since an inode can have multiple named references via hardlinks. Blocks are somewhat simpler, although ffs fragments complicate them considerably. At recovery time we verify whether links or pointers to blocks exist and use this information to free them if necessary. There are only 4 journal records (add ref, rem ref, new block, free block) and each is only 32 bytes. This is effectively an intent log; there is no copy of the metadata in each record. Sounds simple, no?
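To make the 32-byte figure concrete, here is a rough sketch of how an add ref / rem ref record could be packed. The field order, field widths, opcode values and padding are all my assumptions for illustration; this is not the actual on-disk format, only the list of fields comes from the description in this post.

```python
import struct

# Hypothetical layout, for illustration only (not the real on-disk format):
#   op, ino, parent, nlink : four unsigned 32-bit integers (16 bytes)
#   diroff                 : signed 64-bit directory offset (8 bytes)
#   8 bytes of padding to reach a fixed 32-byte record
REF_FMT = "<IIIIq8x"

# Hypothetical opcodes for the four record types named in the post.
OP_ADDREF, OP_REMREF, OP_NEWBLK, OP_FREEBLK = range(1, 5)


def pack_ref(op, ino, parent, nlink, diroff):
    """Pack one add ref / rem ref record into 32 bytes."""
    return struct.pack(REF_FMT, op, ino, parent, nlink, diroff)


def unpack_ref(buf):
    """Return (op, ino, parent, nlink, diroff) from a 32-byte record."""
    return struct.unpack(REF_FMT, buf)


rec = pack_ref(OP_ADDREF, ino=1234, parent=2, nlink=1, diroff=512)
```

Note that the record carries only identifiers and the old link count, never a copy of the inode or directory block, which is what makes it an intent log.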


In the addref/remref case the journal record contains the inode number, directory inode number, 64-bit offset within the directory for the new/removed link, and the link count before the adjustment. At recovery time when we find one of these records we verify whether the path exists or not and adjust the link count appropriately. The path may not exist if the parent inode doesn't point at the directory block that this filename occupied, or if the directory write didn't happen in time, or any number of other scenarios. If we adjust a link count down to 0 we free the inode and any blocks it owns, and if it is a directory, we recursively decrement the link counts of any children, potentially freeing them as well. This only happens when you crash immediately after adding a tree of files, as with tar, etc. The directory offset tells us the exact place the entry should exist, so we don't need to know the actual name, and this is how we handle multiple links to the same inode within the same directory. The recovery operation actually finds all valid journal records for each inode and sorts them into a list to remove duplicates before operating on that inode. That way, if a name was added and immediately removed, or if it was added twice, we know not to adjust the link count twice, etc.
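A toy model of the duplicate removal and link adjustment might look like the sketch below. Everything named here (`RefRec`, `collapse`, `recover_links`, the dictionaries standing in for the disk) is hypothetical, and it leans on a big simplifying assumption of mine: the on-disk link count already includes every journaled add and has not yet applied any journaled remove. The real recovery code uses the recorded link count and handles many more cases.

```python
from collections import namedtuple

# Hypothetical in-memory form of an add ref / rem ref record.
RefRec = namedtuple("RefRec", "seq op ino parent diroff")


def collapse(records):
    """Keep only the latest record for each (ino, parent, diroff) slot."""
    latest = {}
    for rec in sorted(records, key=lambda r: r.seq):
        latest[(rec.ino, rec.parent, rec.diroff)] = rec
    return sorted(latest.values(), key=lambda r: r.seq)


def recover_links(records, dir_entries, nlink):
    """Adjust link counts once per directory slot; return freed inodes.

    dir_entries: {(parent, diroff): ino} as found on disk after the crash.
    nlink:       {ino: link count} as found on disk (modified in place).
    Toy assumption: nlink includes every journaled add, no journaled remove.
    """
    freed = []
    for rec in collapse(records):
        present = dir_entries.get((rec.parent, rec.diroff)) == rec.ino
        if not present:
            # The name is not on disk, so the inode holds one link too many.
            nlink[rec.ino] -= 1
            if nlink[rec.ino] == 0:
                freed.append(rec.ino)
    return freed


# A name added and then removed at offset 64, plus a surviving name at 128.
journal = [
    RefRec(1, "add", 10, 2, 64),
    RefRec(2, "rem", 10, 2, 64),
    RefRec(3, "add", 10, 2, 128),
]
counts = {10: 2}
freed = recover_links(journal, {(2, 128): 10}, counts)
```

The point of collapsing first is that each directory slot is examined exactly once, no matter how many records mention it.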

For adding and removing blocks we record the inode, logical block number (like a file offset), and disk block address. The lbn may be negative to indicate indirect blocks, which are blocks that hold pointers to data blocks for large files. If we discover that a block does not exist at the indicated lbn we may recursively free the children of an indirect block. This allows us to truncate huge files with a very small number of journal entries: no more than 15, which is the number of direct and indirect block pointers contained in an inode. There is an additional test to be certain that the freed block was not allocated to a new file after this record was written.
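Here is a small sketch of why one record per inode pointer is enough. The function names and the dictionaries standing in for on-disk state are made up, and I pass the indirection level explicitly instead of decoding it from a negative lbn, since that encoding isn't described here. It also only checks the top-level block for reallocation, which is a simplification.

```python
def free_tree(blkno, level, indirects, freed):
    """Free blkno; if it is an indirect block (level > 0), free children first."""
    if level > 0:
        for child in indirects.get(blkno, []):
            free_tree(child, level - 1, indirects, freed)
    freed.append(blkno)


def recover_block(rec, pointers, reallocated, indirects):
    """Process one block record; return the list of blocks freed.

    rec:         (ino, lbn, blkno, level), level 0 meaning a data block
    pointers:    {(ino, lbn): blkno} block pointers found on disk
    reallocated: set of block addresses now owned by some other file
    indirects:   {indirect blkno: [child blknos]} contents of indirect blocks
    """
    ino, lbn, blkno, level = rec
    if pointers.get((ino, lbn)) == blkno:
        return []  # the inode still points at the block, nothing leaked
    if blkno in reallocated:
        return []  # handed to a new file after the record was written
    freed = []
    free_tree(blkno, level, indirects, freed)
    return freed


# One record for a double-indirect block frees the whole subtree under it.
tree = {100: [101, 102], 101: [5, 6], 102: [7]}
freed = recover_block((1, -1, 100, 2), {}, set(), tree)
```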

In ffs, the filesystem is divided into 'cylinder groups', which partition the data blocks and inodes for locality. Each of these CGs has summary information describing where there are fragments, large clusters of available blocks, how many inodes are free, etc. Some of this summary information is copied into the superblock so that we can quickly find a CG with free blocks, inodes, etc. As a final stage in the recovery operation any CG that was modified recomputes its summary information and updates the superblock copy.
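That final pass can be pictured roughly like this. The dictionary fields are hypothetical stand-ins, and the sketch only counts free blocks and free inodes; real CG summaries also track fragments and clusters, which I leave out.

```python
def recompute_cg(cg):
    """Recount free blocks and inodes from a CG's bitmaps (1 means free)."""
    cg["nbfree"] = sum(cg["block_free"])
    cg["nifree"] = sum(cg["inode_free"])
    return cg["nbfree"], cg["nifree"]


def update_superblock(sb, cgs, modified):
    """Refresh the superblock's per-CG copies for every modified CG."""
    for i in modified:
        sb["cg_summary"][i] = recompute_cg(cgs[i])
    sb["nbfree"] = sum(b for b, _ in sb["cg_summary"])
    sb["nifree"] = sum(n for _, n in sb["cg_summary"])


cgs = [
    {"block_free": [1, 0, 0, 1], "inode_free": [1, 1, 0]},
    {"block_free": [1, 1, 1, 0], "inode_free": [0, 0, 1]},
]
# Stale summary for CG 1, as if recovery had just freed blocks there.
sb = {"cg_summary": [(2, 2), (1, 1)], "nbfree": 3, "nifree": 3}
update_superblock(sb, cgs, modified=[1])
```

Only the CGs that recovery touched are recounted; the rest keep the summaries they already had.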

So how fast is it? In my tests so far it looks like less than 2 seconds per megabyte of journal in use. A megabyte of journal space describes 32,768 filesystem operations! Even on a machine with a very large amount of memory it's unlikely that you could have more than a few hundred thousand operations outstanding. So this is really quite acceptable. Furthermore, the recovery operation is currently generating a text log of every decision it makes, and that log is several times the size of the binary log. Disabling it will probably halve the recovery time again.
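The arithmetic behind those numbers, as a quick sketch. It assumes one 32-byte record per operation and treats the 2 seconds per megabyte figure as an upper bound; the 300,000 used below is just my stand-in for "a few hundred thousand".

```python
RECORD_SIZE = 32        # bytes per journal record
MIB = 1 << 20           # bytes in a megabyte of journal
SECONDS_PER_MIB = 2.0   # upper bound observed in the tests so far


def records_per_mib():
    """How many 32-byte records fit in one megabyte of journal."""
    return MIB // RECORD_SIZE


def worst_case_recovery_seconds(outstanding_ops):
    """Upper bound on recovery time, assuming one record per operation."""
    journal_bytes = outstanding_ops * RECORD_SIZE
    return journal_bytes / MIB * SECONDS_PER_MIB


# 300,000 outstanding operations is a hypothetical figure.
estimate = worst_case_recovery_seconds(300_000)
```

For 300,000 operations that is a bit over 9 megabytes of journal, so the bound lands somewhere under 20 seconds with the text log still enabled.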