Compare and swap in machine code in C
How would you write a function in C which does an atomic compare and swap on an integer value, using embedded machine code (assuming, say, x86 architecture)? Can it be any more specific if it's written only for the i7 processor?
Does the operation act as a memory fence, or does it only enforce ordering on the memory location involved in the compare and swap? How costly is it compared to a memory fence?
Thank you.
The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, it works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue an lwsync to fence memory reordering.)
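For comparison, here is a minimal sketch of the same idea with a different compiler: it uses the GCC/Clang builtin __sync_val_compare_and_swap rather than the MSVC intrinsic named above. The GCC documentation describes the __sync builtins as full barriers. The wrapper name cas_builtin is hypothetical and only there for illustration.

```c
/* Sketch only: assumes GCC or Clang. cas_builtin is a made-up wrapper name. */
static inline int cas_builtin(volatile int *ptr, int expected, int desired)
{
    /* Stores 'desired' into *ptr only if *ptr == expected.
       Returns the value *ptr held before the operation, so the swap
       happened exactly when the return value equals 'expected'. */
    return __sync_val_compare_and_swap(ptr, expected, desired);
}
```

A caller would typically test `cas_builtin(&x, old, new) == old` to find out whether the swap took place.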
In general, on many common systems, a compare-and-swap operation usually only enforces an atomic transaction upon the one address it's touching. Other memory access can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.
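Where the ordering matters, C11 lets you state it explicitly instead of relying on what a given platform's compare-and-swap happens to imply. The following is a rough sketch, assuming a C11 compiler with <stdatomic.h>; the demo_lock type and its functions are hypothetical names, not part of any library.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical try-lock built on compare-and-swap: 0 = free, 1 = held. */
typedef struct { atomic_int state; } demo_lock;

static bool demo_trylock(demo_lock *l)
{
    int expected = 0;
    /* Acquire ordering on success: accesses after the lock is taken
       cannot be reordered before it. A failed attempt needs no ordering. */
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void demo_unlock(demo_lock *l)
{
    /* Release ordering: writes made while holding the lock become
       visible before the lock is seen as free. */
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

On x86 the compiler will generally emit a lock cmpxchg for the exchange either way; on weaker-ordered machines the requested ordering decides which barriers, if any, are added.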
You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.
E.g.
lock cmpxchg DWORD PTR [ebx], edx
or
lock cmpxchgl %edx, (%ebx)
This compares the value in the EAX register with the value at the address stored in the EBX register and stores the value in the EDX register to that location if they are the same, otherwise it loads the value at the address stored in the EBX register into EAX.
You need to have a 486 or later for this instruction to be available.
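Wrapped in a C function, the instruction could look roughly like the sketch below. It assumes GCC or Clang extended inline assembly (AT&T syntax); the function name cas_int is hypothetical. On non-x86 targets the sketch falls back to a compiler builtin so that it still compiles.

```c
/* Sketch only: cas_int is a made-up name; assumes GCC/Clang inline asm. */
static inline int cas_int(volatile int *ptr, int expected, int desired)
{
#if defined(__i386__) || defined(__x86_64__)
    int prev = expected;              /* CMPXCHG compares against EAX */
    __asm__ __volatile__(
        "lock cmpxchgl %2, %1"        /* if (*ptr == EAX) *ptr = desired; else EAX = *ptr */
        : "+a"(prev), "+m"(*ptr)      /* EAX and the memory operand are read and written */
        : "r"(desired)                /* new value in any general register */
        : "memory", "cc");            /* compiler barrier; flags are modified */
    return prev;                      /* value found at *ptr before the operation */
#else
    /* Fallback for other architectures (assumption: GCC/Clang builtin available). */
    return __sync_val_compare_and_swap(ptr, expected, desired);
#endif
}
```

The swap succeeded exactly when the returned value equals `expected`.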
If your integer value is 64 bits wide, then use cmpxchg8b, the 8-byte compare-and-exchange instruction, on IA-32 x86. The variable should be 8-byte aligned.
Example:
mov eax, OldDataA //load low 32 bits of the expected (old) value
mov edx, OldDataB //load high 32 bits of the expected (old) value
mov ebx, NewDataA //load low 32 bits of the new value
mov ecx, NewDataB //load high 32 bits of the new value
mov edi, Destination //load destination pointer
lock cmpxchg8b qword ptr [edi]
setz al //al is 1 if the exchange succeeded, otherwise 0
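If you would rather let the compiler pick the instruction, the same 64-bit exchange can be sketched in C11. This is a swapped-in technique, not the assembly above: it assumes <stdatomic.h> is available, and cas_u64 is a hypothetical name. On 32-bit x86 a compiler would typically lower this to lock cmpxchg8b, and on x86-64 to a 64-bit lock cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: cas_u64 is a made-up name. The _Atomic qualifier lets the
   compiler give the object the alignment the target requires. */
static bool cas_u64(_Atomic uint64_t *dest, uint64_t old_value, uint64_t new_value)
{
    /* Returns true if *dest equalled old_value and was replaced by new_value.
       On failure the local copy old_value is overwritten with the current
       contents of *dest; it is simply discarded here. */
    return atomic_compare_exchange_strong(dest, &old_value, new_value);
}
```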
If the LOCK prefix is omitted, the instruction is not guaranteed to be atomic in a multiprocessor environment.
"In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted." (Intel Instruction Set Reference)
Without the LOCK prefix, the operation is only guaranteed not to be interrupted by an event (such as an interrupt) on the current processor/core.
It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange" but should really be called "compare-and-store" since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provides the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with Compare-Exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.
A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load-Linked"/"Conditional Store" sequence. Likewise, in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.
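C has no direct way to spell these instructions, but the retry-loop pattern can be sketched with C11's atomic_compare_exchange_weak, which is allowed to fail spuriously even when the values match, much like the pessimistic invalidation described above. This is only an illustration, assuming <stdatomic.h>; the function name atomic_store_max is hypothetical.

```c
#include <stdatomic.h>

/* Sketch only: raises *target to at least 'candidate' and returns the
   value that was observed before any update. Made-up function name. */
static int atomic_store_max(atomic_int *target, int candidate)
{
    int seen = atomic_load(target);
    /* Retry while an update is still needed and the exchange did not go
       through. On failure (genuine or spurious) 'seen' is refreshed with
       the current value, so the condition is re-evaluated on fresh data. */
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* nothing to do here; the loop condition performs the retry */
    }
    return seen;
}
```

Because the loop body is so short, an occasional spurious failure costs one extra iteration and nothing more.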