What are the threading guarantees of nowadays C and C++ compilers?
I'm wondering what are the guarantees that compilers make to ensure that threaded writes to memory have visible effects in other threads.
I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.
More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.
I hope that compilers makes the distinction between two kinds of variable accesses:
- Accesses to variables that necessarily have an address;
- Accesses to variables that don't necessarily have an address.
For instance, if you take this snippet:
void sleepingbeauty()
{
int i = 1;
while (i) sleep(1);
}
Since i
is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.
void onedaymyprincewillcome(int* i);
void sleepingbeauty()
{
int i = 1;
onedaymyprincewillcome(&i);
while (i) sleep(1);
}
Since i
is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
int i = 1;
void sleepingbeauty()
{
while (i) sleep(1);
}
Since i
is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.
void sleepingbeauty(int* ptr)
{
*ptr = 1;
while (*ptr) sleep(1);
}
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop it开发者_StackOverflow中文版eration.
I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, the C++03 is even blind to the existence of threads, so this question wouldn't even make sense with the standard in mind. I'm not sure about C, though.
Is there some documentation out there that specifies if I'm right or wrong? I know these are muddy waters since these may not be on standards grounds, it seems like an important issue to me.
Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?
Accesses to variables that don't necessarily have an address.
All variables must have addresses (from the language's prospective -- compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect that everything must be "pointerable" that everything has an address -- even the empty class typically has size of at least a char
so that a pointer can be created to it.
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variables, and generate memory reads to it to ensure that maybe some day the prince will come.
That depends on the content of onedaymyprincewillcome
. The compiler may inline that function if it wishes and still make no memory reads.
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.
Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.
The only language feature in the standard that lets you control things like this w.r.t. threading is volatile
, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.
If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic
, which does make these kinds of requirements on variables explicit.
You assume wrong.
void onedaymyprincewillcome(int* i);
void sleepingbeauty()
{
int i = 1;
onedaymyprincewillcome(&i);
while (i) sleep(1);
}
In this code, your compiler will load i
from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep
could modify its value. It has nothing to do with whether or not i
has an address or must have an address, and everything to do with the operations that this thread performs which could modify the code.
In particular, it is not guaranteed that assigning to an int
is even atomic, although this happens to be true on all platforms we use these days.
Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,
char *str = 0;
asynch_get_string(&str);
while (!str)
sleep(1);
puts(str);
This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to ptr
could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.
So just don't, don't, don't do this kind of stuff. And no, volatile
is not a fix.
Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf
In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence-operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation to prevent speculative reads and writes across the locking and unlocking operations on the mutex by the CPU, as well as prevent any re-ordering of operations across the same read or write calls by the compiler.
Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any read or write operation in a single byte increment is atomic, as well as a read or write of a register value to any 4-byte aligned memory location on x86 and 4 or 8-byte aligned memory location on x86_64. On other platforms, that may not be the case and you would have to consult the platform's documentation.
The only control you have over optimisation is volatile
.
Compilers make NO gaurantee about concurrent threads accessing the same location at the same time. You will need to some type of locking mechanism.
I can only speak for C and since synchronization is a CPU-implemented functionality a C programmer would need to call a library function for the OS containg an access to the lock (CriticalSection functions in the Windows NT engine) or implement something simpler (such as a spinlock) and access the functionality himself.
volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.
- local (stack) variables will not be accessible from other threads and should not be.
- variables at the module level are good candidates for access by multiple threads but will require synchronizetion functions to work predictably.
Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.
I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.
I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times each performing an inc on the same variable will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in-between one and two million.
The first increment will cause the value to be read from RAM into the L1 (via first the L3 then the L2) cache of the accessing thread/core. The increment is performed and the new value written initially to L1 for propagation to lower caches. When it reaches L3 (the highest cache common to both cores) the memory location will be invalidated to the other core's caches. This may seem safe but in the meantime the other core has simultaneously performed an increment based on the same initial value in the variable. The invalidation from the write by the first value will be superseeded by the write from the second core invalidating the data in the caches of the first core.
Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.
A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded the rules won't allow the compiler to skip a read. Simple as that even though on the face of it it may appear devilishly tricky.
I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.
I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)
There are two problems to address when values written from a thread need to be visible from another thread:
- Caching (a value written from a CPU could never make it to another CPU);
- Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).
The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.
Turns out the optimization part isn't too hard either. As soon as the compiler cannot guarantee that a function call cannot change the content of a variable, even in a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep
cannot change i
, and this is why reads are issued at every operation. It doesn't need to be sleep
though, any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, when that time will come, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T>
class. I don't know for C1x.)
精彩评论