difference in CPU time for two similar lines

2023-03-18 22:23 问答作者：

There is a while loop in my program, where IterZNext, IterZ are pointers to nodes in a list. The nodes in the list are of type struct with a field called "Index".

double xx = 20.0;
double yy = 10000.0;
double zz;      
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
                NextIndex1 = IterZNext->Index; // line (*)
        NextIndex = IterZNext->Index;  // line (**)
        IterZNext->Index;
    }
}

When I profiled my program, I found the line (*)

NextIndex1 = IterZNext->Index;

consumes most of CPU time (2.193s), while the line (**)

NextIndex = IterZNext->Index;

which is all most the same with the line (*) only uses 0.093s. I used the Intel VTune Amplifier to see the assembly of these two lines, which is as follows:

Address Line    Assembly                   CPU Time Instructions Retired
Line (*):
0x1666  561 mov eax, dword ptr [ebp-0x44]   0.015s  50,000,000
0x1669  561 mov ecx, dword ptr [eax+0x8]        
0x166c  561 mov dword ptr [ebp-0x68], ecx   2.178s  1,614,000,000

Line (**):
0x166f  562 mov byte ptr [ebp-0x155], 0x1   0.039s  80,000,000
0x1676  562 mov eax, dword ptr [ebp-0x44]   0.027s  44,000,000
0x1679  562 mov ecx, dword ptr [eax+0x8]        
0x167c  562 mov dword ptr [ebp-0x5c], ecx   0.026s  94,000,000

If I change the order of the line () and the line (*), then the program changes to

double xx = 20.0;
double yy = 10000.0;
double zz;      
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
                NextIndex = IterZNext->Index;  // line (**)
                NextIndex1 = IterZNext->Index; // line (*)
        IterZNext->Index;
    }
}

and the result for assembly changes to

Address Line    Assembly    CPU Time    Instructions Retired
Line (**):
0x1666  560 mov byte ptr [ebp-0x155], 0x1   0.044s  84,000,000
0x166d  560 mov eax, dword ptr [ebp-0x44]   0.006s  2,000,000
0x1670  560 mov ecx, dword ptr [eax+0x8]    0.001s  4,000,000
0x1673  560 mov dword ptr [ebp-0x5c], ecx   1.193s  1,536,000,000

Line (*):
0x1676  561 mov eax, dword ptr [ebp-0x44]   0.052s  128,000,000
0x1679  561 mov ecx, dword ptr [eax+0x8]        
0x167c  561 mov dwor开发者_如何学Cd ptr [ebp-0x68], ecx   0.034s  112,000,000

In this case, line (*) uses most of CPU time (1.245s) while line () only uses 0.086s.

Could someone tell me: (1) Why does it take so long to make the first assignment? Notice that the line zz=xx+yy only uses 0.058s. Is this related to the cache misses? since all nodes in the list are dynamically genereated. (2) Why is there huge difference in CPU time between this two lines?

Thanks!

All modern CPUs are superscaler & out-of-order - which means that instructions are not actually executed in the order of the assembly, and there isn't really such thing as the current PC - there are many 10s of instructions in flight and executing at once.

Therefore any sampling information a CPU reports is just a rough area the CPU was executing - it was executing the instruction it indicated when the sampling interrupt went off; but it was also executing all the other in-flight ones!

However people have got used to (and expect) profiling tools to tell them exactly which individual instruction the CPU is currently running - so when the sampling interrupt triggers the CPU essentially picks one of the many active instructions to be the 'current' one.

CPU line caching is probably the reason. accessing [ebp-0x5c] also brings into the cache [ebp-0x68], which then will be fetched much faster (for the second case, vice versa for the first).

It is definitly due to the cache miss. Then bigger miss then more performance penalty will be introduced by processor. Actually in the modern world CPU performs much faster then memory. If nowadays processor can have clock frequency in about 4GHz, memory still run with frequency ~0.3GHz. It is great performance gap that is still continue to grow. Cache introduction was driven by desire to hide this gap. Without cache using modern processor will spend huge amount of time waiting data from the memory and doing nothing at that time. In addition to the performance gap, each memory access incurrs additional latencies related to posible concurrency on memory bus with other CPUs and DMA devices and times required for memory access request processing and routing on the side of the processor memory management logic (checking of the caches of all levels, virtual-to-physical address translation that can involve TLB miss with additional access to memory, request pushing to the memory bus etc.)and memory controller(request routing from the CPU-controller to the controller memory bus, possible waiting for memory bank refresh cycle complition etc.). So, to summarize, raw access to the memory have really big cost in comparision to the L1 cache hit or register access. The difference in cost comparable to the difference in cost to access data in memory and in secondary storage (HDD).

Furthermore, the cost of memory access will grow with moving from the processor to the memory. L2 access will provide bigger penalty then L1 or CPU registers access, L3 access will provide penalty bigger then L2 access and, finally, memory access will provide penalty bigger then memory access. For example, you can compare cost of data access on different levels of memory hierarchy in the next table (captured from http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/6)

Cache/Memory Latency Comparison

-----------------------------------------------------------
|                                |L1| L2| L3| Main Memory |
-----------------------------------------------------------
|AMD FX-8150 (3.6GHz)            | 4| 21| 65| 195         |
-----------------------------------------------------------
|AMD Phenom II X4 975 BE (3.6GHz)| 3| 15| 59| 182         |
-----------------------------------------------------------
|AMD Phenom II X6 1100T (3.3GHz) | 3| 14| 55| 157         |
-----------------------------------------------------------
|Intel Core i5 2500K (3.3GHz)    | 4| 11| 25| 148         |
-----------------------------------------------------------

In regard to your particular case:

0x1669  561 mov ecx, dword ptr [eax+0x8]        
0x166c  561 mov dword ptr [ebp-0x68], ecx   2.178s  1,614,000,000


0x1670  560 mov ecx, dword ptr [eax+0x8]    0.001s  4,000,000 /* confusing and looks like wrong report for me*/ 
0x1673  560 mov dword ptr [ebp-0x5c], ecx   1.193s  1,536,000,000

You have penalty on dereferencing of the Index value in the code line.

mov ecx, dword ptr [eax+0x8]

Note that it is first access to the data in each subsequent node of you list, up to this moment you manipulate only by node address, but the data of that address and due to this haven't memory access. You stated, that you use dynamical list, and this is bad from cache hit rate point of vies. In addition, I suppose that you have big enough list that mean that you will have cache fluded by the previously accessed data (list nodes accessed on previous iterations)and almost always will have cache miss or cache hit only on L3 cache during the accessing Index on each new iteration. But note, that during first access to the Index on each involving cache miss on each new iteration data returned from the memory will be stored in the L1 cache. And when you access Index on the second time during the same cycle iteration, you will have low cost L1 cache hit!

So I hope I provide you detailed answer on both your questions.

In regard to the correctness of the VTune report correctness. I want to advocate Intel VTune developers. Of course, modern processors are very complex devices with number of ILP improving technologies on the board including pipelining, superscalaring, Out-Of-Order execution, branch prediction etc, and of cource this make detailed instruction level performance analysis harder and more precious. But such tools like VTune is developed with that processor features in the mind, and I belive that they are not so stupied to develop and provide the tool or feature that haven't any sence. Furthermore it looks like the developers from the Intel like no another have access to the full understanding of the all processor features details and as no another can take this details into account during profiler designing and development.

继续阅读：assembly c cpu-usage profiling

difference in CPU time for two similar lines

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？