New UMA features for more efficient memory layout
The first feature expands on the 'keg' concept, first introduced by Bosko to support network buffer allocations, by allowing zones to be backed by multiple kegs. UMA is a zone, or slab, allocator: a zone is created to allocate a given type of memory and contains parameters specific to that type. The type may be all allocations of a specific size, as used by the malloc() front-end, or it may be a complex object type with initializers, finalizers, specific alignment requirements, and so on. The keg is a refinement of this concept: the keg provides the backend allocation and storage description, while the zone controls the contents of the individual allocated items and provides a caching layer, client API, etc. The difference is subtle but important. Restated, the keg describes the format of the page or pages that an item lives in, while the zone describes the format within each allocated item. In this way multiple kegs can provide items that meet the client's requirements while varying the format and source of the memory they come from.
Now that a zone may contain multiple kegs, an allocation may have multiple back-end sources, but consumers are not required to differentiate between them. For example, I have two kegs for 2-kilobyte network buffers: one that allocates from single pages and one that allocates from hardware-supported large page sizes. If large, contiguous pages are available, the large-page keg is used. This optimizes access to network buffers by allowing fewer TLB entries to describe them. However, due to fragmentation we don't always have large aligned chunks of memory available, and in that case we fall back to the other keg transparently. The keg concept is also being used to implement NUMA support: there is one keg per NUMA node, and the search function can be coded to be topology aware. Items from multiple compatible kegs can exist in the same fast cache in the zone and will automatically be retired to the correct source.
The second feature builds on the first and allows for much more efficiently aligned data structures. This is akin to cache coloring but goes one step further: the start address of each allocated item is aligned such that it falls on a different cache line than the previous allocation. Simple cache coloring typically ignores naturally aligned allocations or only colorizes each page or slab. In this scheme a large contiguous block of memory is allocated and each item is padded until it reaches a new color. For large, uniformly sized network buffers this has a tremendous benefit. Consider a 2k allocation always ending up on a 2k boundary: only line size divided by allocation size of the available lines can ever hold the most-accessed bytes at the beginning of the buffer. So for a 2k buffer and a 64-byte line size, the start addresses fall on only 1/32 of the available lines. Essentially each buffer is padded by the line size so that the start addresses alternate lines, and the number of allocations required to hit every line is computed to determine the storage requirements. Using a large contiguous block of memory ensures that this is equally effective for virtually indexed L1 caches as it is for physically indexed L2 and L3 caches.
This has a secondary effect of improving utilization on striped memory controllers. The exact details of how a physical address maps to a channel/bank/rank of DRAM are not published. However, it is clear that predictable access patterns aligned on a large power-of-two size will strongly favor a small set of the available DIMMs, just as they favor a small set of the available cache lines. By alternating the start addresses on cache-line boundaries we can be assured that we are loading the available DIMMs uniformly, because a cache line is the smallest unit of memory transfer for all practical purposes.
These optimizations are only useful in workloads where memory and cache pressure are the significant bottleneck. Unfortunately, I don't have any benchmarks for stock FreeBSD to share at this time. I'm not certain that I'm permitted to share details about the yields in the proprietary stack this was implemented for. I will say that it was on the order of 10% in an already heavily optimized environment where traditional profile-guided software optimizations were yielding much less.
I should also mention that this work was primarily funded by Nokia and most graciously donated to FreeBSD.