July 13th, 2005

(no subject)

No time for a full write-up on my last post, but I'll post more tomorrow. For now, here are some interesting tidbits.

1) After we hit DRAM the results arrive on more or less 10ns boundaries because the DRAM sits on a quad-pumped 100MHz bus, which means we can only issue a command every 10ns but we can get bursted results every 2.5ns. This explains why we have peaks every 10ns.

2) Cache accesses take between 1 and ~16 cycles, depending on the cache level and on whether the access causes a migration to L1 (including the time for an eviction). This causes minor peaks between the DRAM access intervals, where a DRAM access was combined with a cache access.

3) The worst case scenario for a memory access is a cache miss with a TLB miss. The TLB miss may cause two separate cache misses (the first- and second-level page tables on 32-bit x86), which may cause 2 dirty cache evictions & associated writebacks. On top of that come the data miss itself and another writeback for the data. This gives you 6 memory accesses. Add to this setting the 'accessed' bit in the PTE and you have 7 memory accesses. There is also the potential for DRAM refresh cycles interfering, etc. This puts us at least at 7*114ns = ~800ns, which is pretty close to the actual observed worst case access time.

4) Disabling ACPI, etc. yielded no difference, although I can't be certain that the laptop isn't still using SMM.

5) By inserting extra delays before the memory operation I can get rid of the peaks caused by the several-cycle DRAM access delay, since this gives the row & column strobes time to recharge. Pretty exciting stuff!
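
For anyone who wants to check the arithmetic in 1) and 3), here is a rough sketch in Python. It is not the measurement code, just the back-of-the-envelope numbers. All the function and variable names are made up for illustration, and the 114ns per-access figure is simply the one used in 3).

```python
# Back-of-the-envelope numbers for items 1) and 3).
# Names are hypothetical; the constants come straight from the post.

BUS_CLOCK_MHZ = 100    # base DRAM bus clock from item 1)
PUMP_FACTOR = 4        # "quad pumped": 4 transfers per bus clock
DRAM_ACCESS_NS = 114   # per-access figure used in item 3)


def command_interval_ns(clock_mhz):
    """One bus clock period in ns: how often a new command can go out."""
    return 1000 / clock_mhz


def burst_interval_ns(clock_mhz, pump_factor):
    """Spacing of bursted results in ns on a multi-pumped bus."""
    return command_interval_ns(clock_mhz) / pump_factor


# The worst case memory accesses counted in item 3).
WORST_CASE_ACCESSES = {
    "page table reads (2 levels)": 2,
    "dirty writebacks caused by those reads": 2,
    "data read": 1,
    "dirty writeback caused by the data read": 1,
    "setting the 'accessed' bit in the PTE": 1,
}


def worst_case_ns(accesses, access_ns):
    """Lower bound on the worst case: every access goes out to DRAM."""
    return sum(accesses.values()) * access_ns


if __name__ == "__main__":
    print("command interval:", command_interval_ns(BUS_CLOCK_MHZ), "ns")
    print("burst interval:", burst_interval_ns(BUS_CLOCK_MHZ, PUMP_FACTOR), "ns")
    print("worst case:", worst_case_ns(WORST_CASE_ACCESSES, DRAM_ACCESS_NS), "ns")
```

The worst case comes out at 798ns, which is where the ~800ns above comes from. It is only a lower bound, since DRAM refresh and the like are left out.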

Coming soon: plots from my 8-core, 4-memory-controller Opteron, where you can clearly see the differing latencies caused by the cube architecture.