Lock-free queue algorithm, repeated reads for consistency
I'm studying the lock-free (en-,de-)queue algorithms of Michael and Scott. The problem is I can't explain/understand (nor the paper does, apart from the comments in the code itself) a couple of lines.
Enqueue:
enqueue(Q: pointer to queue_t, value: data type)
E1: node = new_node() // Allocate a new node from the free list
E2: node->value = value // Copy enqueued value into node
E3: node->next.ptr = NULL // Set next pointer of node to NULL
E4: loop // Keep trying until Enqueue is done
E5: tail = Q->Tail // Read Tail.ptr and Tail.count together
E6: next = tail.ptr->next // Read next ptr and count fields together
E7: if tail == Q->Tail // Are tail and next consistent?
// Was Tail pointing to the last node?
E8: if next.ptr == NULL
// Try to link node at the end of the linked list
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
E10: break 开发者_如何学C // Enqueue is done. Exit loop
E11: endif
E12: else // Tail was not pointing to the last node
// Try to swing Tail to the next node
E13: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
E14: endif
E15: endif
E16: endloop
// Enqueue is done. Try to swing Tail to the inserted node
E17: CAS(&Q->Tail, tail, <node, tail.count+1>)
Why is E7 needed? Does correctness depend on it? Or is it merely an optimization? This if
can fail if another thread successfully executed E17, or D10 below, (and changed Q->Tail) while the first thread has executed E5 but not yet E7. But what if E17 is executed right after the first thread executes E7?
edit: Does this last sentence prove that E7 cannot be more than an optimization? My intuition is that it does, since I give a scenario were "apparently" the if
statement would make the wrong decision, yet the algorithm would still be supposed to work correctly. But then we could replace the if
's condition with a random condition, without affecting correctness. Any hole in this argument?
Dequeue:
dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1: loop // Keep trying until Dequeue is done
D2: head = Q->Head // Read Head
D3: tail = Q->Tail // Read Tail
D4: next = head.ptr->next // Read Head.ptr->next
D5: if head == Q->Head // Are head, tail, and next consistent?
D6: if head.ptr == tail.ptr // Is queue empty or Tail falling behind?
D7: if next.ptr == NULL // Is queue empty?
D8: return FALSE // Queue is empty, couldn't dequeue
D9: endif
// Tail is falling behind. Try to advance it
D10: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11: else // No need to deal with Tail
// Read value before CAS
// Otherwise, another dequeue might free the next node
D12: *pvalue = next.ptr->value
// Try to swing Head to the next node
D13: if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14: break // Dequeue is done. Exit loop
D15: endif
D16: endif
D17: endif
D18: endloop
D19: free(head.ptr) // It is safe now to free the old node
D20: return TRUE // Queue was not empty, dequeue succeeded
Again, why D5 is needed? Correctness or optimization? I'm not sure what "consistency" these tests give, since it seems they can get inconsistent right after the if
succeeds.
This looks like a standard technique. Can someone explain the motivation behind it? To me, it seems like the intention is to avoid doing an (expensive) CAS in those few cases it can be noticed that it would definitely fail, but at the cost of always doing an extra read, which is not supposed to be so much cheaper itself (e.g. in Java, Q->Tail
would be required to be volatile, so we would know we are not merely reading a copy cached in a register but reading the real thing, which would be translated in prepending the read with a fence of some sort), so I'm not sure what's really going on here... thanks.
edit This has been ported to Java, more precisely in ConcurrentLinkedQueue, e.g. E7 is line 194, while D5 is line 212.
I was stuck on this same question, and sceptical that this could be an optimization, so I asked Maged Michael, one of the authors of this paper. This is his response:
E7 and D5 are needed for correctness.
The following case shows why E7 is needed:
Thread P reads the value
<A,num1>
from Q->Tail in line E5Other threads change the queue such that the node A is removed and maybe later reused in a different queue (or a different structure with similar node structure) or allocated by a thread to insert it in this same queue. In any case A is not in this queue and its next field has the value
<NULL, num2>
.In line E6, P reads the value
<NULL, num2>
from A->next into next.(Skipping line E7)
In line E8, P finds next.ptr == NULL
In line E9, P executes a successful CAS on A->next as it finds
A->next == <NULL, num2>
and sets it to<node,num2+1>
.Now, the new node is incorrectly inserted after A which doesn't belong to this queue. This might also corrupt another unrelated structure.
With line E7, P would have discovered that Q->Tail has changed and would have started over.
Similarly for D5.
Basically, if our read from tail.ptr->next
is going to make us believe that the next pointer is null (and thus that we may write to the node), we must double check that this null refers to the end of the current queue. If the node is still in the queue after we read the null, we may assume that it really was the end of queue, and the compare-and-swap will (given the counter) catch the case where anything happened to this node after the test in E7
(removing the node from the queue will necessarily involve mutating its next pointer).
Why is E7 needed?
Its more for optimization.
Consider two threads trying to enqueue at the same time. They all get to E5 but before thread 1 gets to E7 thread 2 successfully queues. When thread 1 gets to E7 it will observer t == tail to be false then retries. This will avoid a costly CAS. Of course its not full proof because E7 can succeed before thread 2 enqueues and eventually fails the CAS and has to retry anyway.
why D5 is needed
Similar to D5
Again, both functions without E7 and D5 would work. There was probably some benchmarking going on and found that under moderate contention the double check increases throughput (this is more of an observation and less of fact).
Edit:
I went and read the paper on this queue a bit more. The check is also there for correctness of a lock free algorithm and less of the data structure's state.
The lock-free algorithm is non-blocking because if there are non-delayed processes attempting to perform operations on the queue, an operation is guaranteed to complete within finite time. An enqueue operation loops only if the condition in line E7 fails, the condition in line E8 fails, or the compare and swap in line E9 fails. A dequeue operation loops only if the condition in line D5 fails, the condition in line D6 holds (and the queue is not empty), or the compare and swap in line D13 fails. We show that the algorithm is non-blocking by showing that a process loops beyond a finite number of times only if another process completes an operation on the queue.
http://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf
精彩评论