The RCU quiescent states used for RCU+SMR are serialized states. A serialized state is defined as being in an interval across which no memory accesses to objects managed by RCU by a user thread may pass. This may be a simple memory barrier or it may be an interval guarded by memory barriers with no access to the managed objects within the interval. An example of serialized states would be the kernel, which is entered and exited by context switches which contain memory barriers.
These serialized states are tracked two different ways. One is a count updated within the serialized state interval guarded by memory barriers. These are used as eventcounts. The eventcount value read by the polling thread is used as the reference count or "since" value in subsequent polls of the eventcount. To establish that a serialized state has occurred, the eventcount has to be polled twice, once to get the reference count, and again to establish that the event of interest has occurred. The eventcount of the last successful poll is used as the "since" value on the next poll.
The other way to track serialized states is the thread run state updated by the kernel. A thread run state of "not running" is considered a being in a serialized state. Before the thread can leave the serialized state, a subsequent store of a "running" state has to occur. These are events reset by the observed thread, not by the polling thread. These events (condition really) only have to be observed once to establish that an RCU quiescent state has been entered, however these are handled the same as eventcounts due to the extra complexity of handling this logic differently.
SMR hazard pointer scan for a freed object is delayed until serialized states have occurred for all threads.
During a hazard pointer scan, if the hazard pointer
Once an objects address is no longer seen in any hazard pointers, an additional RCU checkpoint is taken to ensure any pending accesses to the object have completed, i.e. memory release semantics.
All global pointers to a node are deleted before the node is deferred for deletion. Therefore no live node contains any pointers to deleted or deferred deletion nodes and thus any node pointed to by a deferred node are deferred after that node has been deferred. Hazard pointer scanning of deferred nodes is done FIFO, oldest to newest, and stopped after the first in use node is detected. Since any non-cyclic traversal of nodes will visit only live nodes or deferred nodes in oldest to newest order, no subsequent node reachable from a deferred node will be deleted before it has been visited.
Note that this requires that the mutex used for modification of the data structure continue be held while the call to smr_defer() is made in order to guarantee FIFO order on node deletion.
In addition, the hazard pointer is copied to a 2nd internal hazard pointer before it is loaded with a new node value to prevent the current object from being deleted before the hazard pointer value has stabilized, i.e. no reloads are required.
Since there is no memory barrier between the copying of the hazard pointer and the subsequent reload, there is a potential for a race condition when reading the hazard pointers if they aren't stored in the order they're read. This shouldn't be a problem since the RCU serialized state will force a previous stores of the hazard pointer to complete before the hazard pointer scans take place. Since stale/dead nodes can't be modified, subsequent hazard pointer loads from stale/dead nodes will not retry and the hazard pointer copy value will not matter in those cases. That also means the hazard pointer copy can't be explicitly used. Hazard pointers will not be copyable and a second hazard pointer (if and when multiple hazard pointers are supported) will have to be used and "copied" loading from the same data pointer.
Threads are mapped onto the set of available processors. Therefore, if all processors have entered a serialized state at least once, then all threads have also.
New threads or processors are checked for once per scan of RCU nodes. Since current logic requires in effect two scans of RCU nodes and the initial event state only requires one observation, we can ensure that serialized states of occurred for all possible threads.
Hazard pointer scan is roughly of the order O(n*m) where n is the number of hazard pointers and m is the number of threads. It might be possible to optimize this to 0(n) if sequence number indexing of the FIFO defer queue was done and there was a way to map from hazard pointer values to the defer queue item associated with the hazard pointer values. In this case the hazard pointers are read and the least (oldest) sequence number from objects is gotten and than all older objects are then processed. Hashing might be used to manage the mapping although the overhead of adding and deleting hash table entries might outweigh the benefits. Some sort of intrusive mechanism with rcu_defer_t imbedded in the object might be devised but it wouldn't be generalizable.
Mutliple FIFO defer queues for hazard pointer scanning might be introduced at some point to improve granularity. Probably a tag value set with the address of the data structure being managed to effect a separate defer queue per data structure.
It's an open question of how you would manage these. Due the overhead of having extra hazard pointers, this issue is deferred for now.
This is the primary api for low level hardware and platform dependent synchronization primatives. The main issues are 32/64 bit word size and orthogonality of the primatives, e.g. double wide CAS vs. load reserved/store conditional. The current api uses stdint.h typedefs to ensure correct wordsize for integers. I may go to just using compiler assertions to enforce correct wordsizes, particularly with regard to pointer (void *) sizes.
An api that's closest to this is atomic_ops part of Qprof Project. atomic_ops doesn't have double wide CAS which is necessary to atomic_ptr.
T * p; atomic_ptr<T> ap; local_ptr<T> lp; y = *temp(ap); // instead of y = **ap; z = ap->x; // ok z = temp(ap)->x; // ok, same as above foo(temp(ap).get()); // ok p = temp(ap).get(); // address from temp is a bad idea lp = ap; // use local_ptr and dereference it instead
Add support for process shared condition variables.
This will probably be changed to use atomix.h since it needs memory barriers and Linux atomic.h doesn't provide them.
No specific design issues at this time.
No specific design issues at this time.
Discussion on this project can be done in news://comp.programming.threads or in the
open discussion forum.