
When to use Test&Set or Test&Test&Set?

Parallel programming on x86 can be a hard job, especially on a multi-core CPU. Let's say we have a multi-core x86 CPU and several different multithreaded communication combinations:

  1. Single writer and single reader
  2. Single reader multiple writers
  3. Multiple readers and single writer
  4. Multiple readers and multiple writers

So which model is better (more efficient) for locking a shared memory region, Test&Set or Test&Test&Set, and when should each be used?

Here I have two simple (not time-limited) test procedures written in the Delphi IDE in x86 assembler:

procedure TestAndSet(const oldValue, newValue: cardinal; var destination);
asm
//eax = oldValue
//edx = NewLockValue
//ecx = destination = 32 bit pointer on lock variable 4 byte aligned
@RepeatSpinLoop:
        push    eax                   //Save lock oldValue (compared)
        pause                         //CPU spin-loop hint
        lock    cmpxchg dword ptr [ecx], edx
        pop     eax                   //Restore eax as oldValue
        jnz     @RepeatSpinLoop       //Repeat if cmpxchg wasn't successful
end;

procedure TestAndTestAndSet(const oldValue, newValue: cardinal; var destination);
asm
//eax = oldValue
//edx = NewLockValue
//ecx = destination = 32 bit pointer on lock variable 4 byte aligned
@RepeatSpinLoop:
        push    eax                   //Save lock oldValue (compared)
@SpinLoop:
        pause                         //CPU spin-loop hint
        cmp     dword ptr [ecx], eax  //Test before test&set
        jnz     @SpinLoop
        lock    cmpxchg dword ptr [ecx], edx
        pop     eax                   //Restore eax as oldValue
        jnz     @RepeatSpinLoop       //Repeat if cmpxchg wasn't successful
end;
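
For reference, a minimal usage sketch of these procedures as a spin lock; the 0 = free / 1 = taken convention and the AcquireLock/ReleaseLock/DoProtectedWork names are just assumptions for illustration:

var
  LockVar: cardinal = 0;              //Lock word, 0 = free, 1 = taken; keep it 4-byte aligned

procedure AcquireLock;
begin
  TestAndTestAndSet(0, 1, LockVar);   //Spin until the word goes from 0 to 1
end;

procedure ReleaseLock;
begin
  LockVar := 0;                       //A plain aligned 32-bit store is enough to release on x86
end;

procedure DoProtectedWork;
begin
  AcquireLock;
  try
    //... access the shared memory region here ...
  finally
    ReleaseLock;
  end;
end;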

EDIT:

Intel's documentation mentions two approaches, Test&Set and Test&Test&Set. I want to establish in which cases each approach is better, i.e. when to use it. Check: Intel


Surely the first (TestAndSet) is better, because the 2nd does not achieve much by repeating the test using cmp & jnz in between. While you are doing this, the destination value may change anyway, as it is not locked.


TTAS (#2) is good practice. "Lurking" and waiting for the "opportunity" before doing CAS is common practice in both Java and .NET concurrent classes. With that said, cmpxchg received quite a lot of optimizations in the last few years, so it might be possible that you'd get nearly identical results on the latest crop of processors.

What you should try in both cases, however, is to employ some exponential backoff when you spin.
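
For illustration, here is one way exponential backoff could be layered on top of a TTAS loop in Delphi; TryLock, AcquireWithBackoff, SpinPause and the delay bound are made-up names and tuning values, not something prescribed by Intel or by the question:

procedure SpinPause;
asm
        pause                         //CPU spin-loop hint, as in the question
end;

function TryLock(var destination): boolean;
asm
//eax = pointer to the lock word
        mov     ecx, eax              //ecx -> lock word
        xor     eax, eax              //Comparand: 0 = free
        mov     edx, 1                //New value: 1 = taken
        lock    cmpxchg dword ptr [ecx], edx
        setz    al                    //True only if we changed 0 -> 1
end;

procedure AcquireWithBackoff(var destination: cardinal);
var
  delay, i: integer;
begin
  delay := 1;
  while not TryLock(destination) do
  begin
    for i := 1 to delay do
    begin
      if destination = 0 then
        Break;                        //Looks free again, retry the locked cmpxchg
      SpinPause;                      //Plain read + pause, the cache line stays shared
    end;
    if delay < 1024 then
      delay := delay * 2;             //Exponential backoff: roughly double the wait each round
  end;
end;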

Update

@GJ: You should find some more up-to-date documentation on Intel's site. Note the paragraph about not locking the bus since 486 and the comparison chart of xchg and cmpxchg that shows that they are practically identical.

Spinning on a read vs on a locked instruction will still be a good idea to avoid some contention on getting the cache line in exclusive mode. (So TTAS.)

However, this will provide a useful gain only if you implement e.g. an exponential back-off, even yielding the CPU after a while.

The differences between TTAS and TAS, with or without backoff, would be smaller if you are using a single modern multi-core CPU with an L3 cache shared between the cores, and would become more visible on a multi-socket (e.g. server) machine or on a multi-core CPU that has no shared cache between the cores. They would also differ based on the amount of contention (i.e. under light load you would see a smaller difference between TTAS and TAS).


I'd use the 2nd approach, a test without lock, then a lock if the test succeeded, with some proposals:

  • use call SwitchToThread instead of pause
  • put a call SwitchToThread in the not-locked repeat cmp loop
  • put the call SwitchToThread only in case of the cmp/lock failure (see the sketch after this list)
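
A sketch of that third proposal, based on the TestAndTestAndSet routine from the question; the yield-on-failure placement and the register save/restore around the API call are my own and untested:

procedure TestAndTestAndSetYield(const oldValue, newValue: cardinal; var destination);
asm
//eax = oldValue
//edx = NewLockValue
//ecx = destination = 32 bit pointer on lock variable 4 byte aligned
@RepeatSpinLoop:
        push    eax                   //Save lock oldValue (compared)
@SpinLoop:
        cmp     dword ptr [ecx], eax  //Test before test&set
        jz      @TryLock
        push    ecx                   //SwitchToThread may clobber eax, ecx, edx
        push    edx
        call    SwitchToThread        //Yield the rest of the time slice on failure
        pop     edx
        pop     ecx
        mov     eax, [esp]            //Reload the saved oldValue
        jmp     @SpinLoop
@TryLock:
        lock    cmpxchg dword ptr [ecx], edx
        pop     eax                   //Restore eax as oldValue
        jnz     @RepeatSpinLoop       //Repeat if cmpxchg wasn't successful
end;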

In all cases, I think you'd better:

  • use Windows API for your synchronization, if you really want to handle low-level synchronization in your project, see Synchronization Functions on MSDN - Microsoft made the low-level and optimization work for you. Most of these calls are optimized asm code, running in user mode, so they are very fast (a minimal critical-section sketch follows this list).
  • use a high-level multi-thread framework, which in practice will handle all these problems for you, and will definitively scale well - see the Delphi OmniThreadLibrary
  • use a dedicated memory manager, like NexusMM, TBBMM, or ScaleMM/SynScaleMM
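
For the first point, a minimal critical-section sketch using the Windows unit (the variable and routine names are just for illustration):

var
  SharedRegionLock: TRTLCriticalSection;   //Declared in the Windows unit

procedure SetupLock;
begin
  InitializeCriticalSection(SharedRegionLock);
  //InitializeCriticalSectionAndSpinCount would add a short user-mode spin
  //before falling back to a kernel wait
end;

procedure AccessSharedRegion;
begin
  EnterCriticalSection(SharedRegionLock);
  try
    //... read/write the shared memory region here ...
  finally
    LeaveCriticalSection(SharedRegionLock);
  end;
end;

procedure TearDownLock;
begin
  DeleteCriticalSection(SharedRegionLock);
end;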