July 8th, 2005

Memory access times.

I have produced a graph of memory access times on my Pentium-M laptop that shows some very strange characteristics. The data was gathered using a program which randomly loads an address from an 8MB array and records the time taken to perform the load using the TSC. Here's the relevant assembly from objdump, just so you know nothing is being snuck into the measurement:

mov %eax,%ebx
mov %edx,%esi
mov (%edi),%ecx

The mfence ensures that the memory subsystem is stable before we start measuring, and the lfence makes sure our load from (%edi) completes before we take the second timestamp. The TSC gives us a count of the number of cycles the processor has been running so far; in essence, it's a 1.8GHz timer on my laptop.

Doing a couple of million loads and stores shows us some interesting things. As you'd expect, we see the time it takes to do an L2 cache hit (~7ns), since the sample size is only 4x the cache size. We also see the time it takes to do an L2 cache miss to DRAM (~112ns). I'd also expect to see TLB miss time. This processor has 128 4KB TLB entries, which lets us keep 512KB mapped at once, so we should actually have a worse TLB hit rate than our cache hit rate, as my cache is 2MB, I believe. The page tables, however, should mostly be in cache, so a TLB miss should cost one or two cache accesses (two-level page tables), plus some processing cost, plus the cost of the main memory access itself. The results, however, aren't so clear, and have some real surprises. Look here:

The data source is here: mem.time

There is a tail that extends up to 1us! That's for an individual memory access. The tail has a real pattern too, with peaks repeating every 10ns. This is only really visible in the raw data, as the peaks are so small relative to the L2 and main memory access times. I'm really not sure what this is just yet. Anyone have any insight?