Ever wonder what memory latency is like on a large, loosely connected Opteron system? I lie awake at night wondering myself. Fortunately, I have access to a Tyan 8-socket Barcelona system. This is basically two 4-socket boards with two very slow HT links between them. I also have access to a Nehalem-based box that I have timings for. The results are behind the cut.
The test code simply allocates a user-specified amount of memory, prefaults it, mlocks it, and then uses rdtsc to count the cycles taken by individual random memory accesses (read only). I also have a small calibration loop that tries to determine the rough cost of the rdtsc instruction itself and removes it from the cycles counted. With very large blocks of memory you see DRAM performance, and with small blocks you can see the different stages of the cache hierarchy.
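For readers who want to try something similar, here is a rough sketch of that kind of measurement loop. It is not the actual test code: the names ticks, timer_overhead and sample_latency are made up for illustration, the lfence around rdtsc and the xorshift index generator are my assumptions, and the clock_gettime fallback exists only so the sketch builds on machines without rdtsc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* Fenced rdtsc so the timed load cannot drift across the timestamp reads
 * (the fencing is an assumption, not something the post specifies). */
static inline uint64_t ticks(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Fallback so the sketch builds off x86-64; units are then nanoseconds. */
static inline uint64_t ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Calibration: smallest gap seen between two back-to-back timestamp reads. */
static uint64_t timer_overhead(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 10000; i++) {
        uint64_t t0 = ticks();
        uint64_t t1 = ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Time `samples` random 8-byte reads from a buffer of `bytes` bytes.
 * Writes one overhead-corrected count per read into out[].
 * Returns 0 on success, -1 if the allocation fails. */
static int sample_latency(size_t bytes, size_t samples, uint64_t *out)
{
    assert(out != NULL && bytes >= sizeof(uint64_t));
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1;

    memset(buf, 1, n * sizeof *buf);                  /* prefault every page */
    int locked = (mlock(buf, n * sizeof *buf) == 0);  /* can fail under RLIMIT_MEMLOCK */

    volatile const uint64_t *p = buf;   /* volatile keeps the read from being optimized out */
    uint64_t overhead = timer_overhead();
    uint64_t x = 88172645463325252ull;  /* xorshift64 state, any nonzero seed works */

    for (size_t i = 0; i < samples; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        size_t idx = (size_t)(x % n);   /* pick the index outside the timed region */
        uint64_t t0 = ticks();
        uint64_t v = p[idx];            /* the one access being timed */
        uint64_t t1 = ticks();
        (void)v;
        uint64_t dt = t1 - t0;
        out[i] = dt > overhead ? dt - overhead : 0;
    }

    if (locked)
        munlock(buf, n * sizeof *buf);
    free(buf);
    return 0;
}
```

Calling sample_latency with a size well beyond the last-level cache should mostly hit DRAM, while a few kilobytes should stay in L1. A real run would also want the thread pinned to one cpu so that "local" means the same thing for every sample.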
Here are the results as a histogram for a 32-way, 8-socket Opteron sampling 1 GB of memory:
Since not all processors in a 4-socket system are directly connected to each other, you usually see two or three peaks, depending on which cpu you're scheduled on. The fastest is always local memory, and then you see one and perhaps two hops to remote nodes on the same board. Here we see a smallish peak for local memory and a larger peak, as more of the random samples land on memory attached to the other three sockets on our board.
Then there is the horrible clump once we go over the slow HT links to the other board, where we regularly see memory access latencies of up to 0.5 µs! Incredible. The other thing worth noting is that even the local access is unfortunately slow. I believe this is because the cache coherency protocol still requires us to query every other socket before we can own the line.
Next we have a simpler, two-socket Nehalem sampling 128 MB of memory:
This is really very clean. In fact, we can pick out particular features of DRAM by looking at this graph. First, we see two tall peaks, representing local and remote DRAM. The second is taller only because the first has more minor variance and is therefore wider. So it's roughly 85 ns for a local access and 135 ns for a remote one. The other peaks we see are likely due to two causes. The short peak after each DRAM peak likely covers requests that arrive during a DRAM refresh cycle; the penalty is about the right amount of time for that. The short peak before is likely due to back-to-back requests coming in for the same row.
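Since rdtsc counts cycles and the figures above are in nanoseconds, here is a small sketch of the conversion and binning step. It assumes you know the TSC frequency of your own machine; the tsc_ghz parameter and the function names cycles_to_ns and histogram_ns are hypothetical, and nothing here claims to be the script that produced the graphs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a TSC ticking at tsc_ghz GHz.
 * One cycle at f GHz lasts 1/f ns, so ns = cycles / f. */
static double cycles_to_ns(uint64_t cycles, double tsc_ghz)
{
    assert(tsc_ghz > 0.0);
    return (double)cycles / tsc_ghz;
}

/* Count samples into nbins buckets, each bin_ns wide, starting at 0 ns.
 * Anything past the last bucket is clamped into it so outliers stay visible. */
static void histogram_ns(const uint64_t *cycles, size_t n, double tsc_ghz,
                         double bin_ns, size_t *bins, size_t nbins)
{
    assert(bin_ns > 0.0 && nbins > 0);
    for (size_t b = 0; b < nbins; b++)
        bins[b] = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = (size_t)(cycles_to_ns(cycles[i], tsc_ghz) / bin_ns);
        if (b >= nbins)
            b = nbins - 1;
        bins[b]++;
    }
}
```

As a made-up example, on a 2 GHz TSC a 170-cycle access works out to 85 ns and a 270-cycle access to 135 ns, so with 10 ns buckets they land in the eighth and thirteenth buckets counting from zero.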
Information like this helps us understand the relative trade-offs for different optimizations related to memory organization and locality.