Why does multithreaded access to data in the same cache line have a low cache miss rate?
It's been noted that access to data elements that fall in the same cache line performs badly due to the ping-pong effect. However, the code I wrote, tested with valgrind --tool=cachegrind, doesn't show this behaviour. Would appreciate any insights regarding this.
Attached below is the function that each pthread executes:
#include <stdio.h>
#include <stdint.h>
#include <atomic_ops.h>   /* AO_fetch_and_add from libatomic_ops */

extern volatile AO_t shared[];   /* one element per thread; adjacent elements share a cache line */

void* test_cache(void* arg)      /* pthread start routines must return void* */
{
    long id = (long) arg;
    uint32_t idx = (uint32_t) id;
    uint32_t ctr = 0;
    uint32_t total_sum = 0;
    for (; ctr < 500000; ++ctr)
    {
        total_sum += shared[idx];            /* plain read of the shared line */
        AO_fetch_and_add(&shared[idx], idx); /* atomic read-modify-write */
    }
    printf("%ld %u,\n", id, total_sum);      /* %ld/%u match long and uint32_t */
    return NULL;
}
Reads are OK (once the cache is filled); writes are not, since a write will, depending on the architecture, cause all other processors to invalidate that cache line and fetch it again from memory. (Systems that do cache-line snooping can avoid that penalty.)
The initial cache-line load also carries a penalty, since one load per cache is required (shared caches fare better here), with the worst case on NUMA systems (a fetch from a distant processor's memory).
If you are running on a "dual core" part, you are hitting a shared cache, so both threads see the same line without invalidation traffic. You need separate physical CPUs to see the ping-pong effect. Include your hardware spec in the question.