From 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 16 Apr 2005 15:20:36 -0700 Subject: Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! --- Documentation/atomic_ops.txt | 456 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 456 insertions(+) create mode 100644 Documentation/atomic_ops.txt (limited to 'Documentation/atomic_ops.txt') diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt new file mode 100644 index 000000000000..8eedaa24f5e2 --- /dev/null +++ b/Documentation/atomic_ops.txt @@ -0,0 +1,456 @@ + Semantics and Behavior of Atomic and + Bitmask Operations + + David S. Miller + + This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + + The atomic_t type should be defined as a signed integer. +Also, it should be made opaque such that any kind of cast to a normal +C integer type will fail. Something like the following should +suffice: + + typedef struct { volatile int counter; } atomic_t; + + The first operations to implement for atomic_t's are the +initializers and plain reads. + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as: + +static atomic_t my_counter = ATOMIC_INIT(1); + +The second interface can be used at runtime, as in: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +Next, we have: + + #define atomic_read(v) ((v)->counter) + +which simply reads the current value of the counter. + +Now, we move onto the actual atomic operation interfaces. + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these two routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner. + +Next, we have: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that explicit memory +barriers are performed before and after the operation. It must be +done such that all memory operations before and after the atomic +operation calls are strongly ordered with respect to the atomic +operation itself. + +For example, it should behave as if a smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +It requires explicit memory barrier semantics around the operation as +above. + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". It requires explicit +memory barrier semantics around the operation. + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A +boolean is return which indicates whether the resulting counter value +is negative. It requires explicit memory barrier semantics around the +operation. + +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces are +defined which accomplish this: + + void smp_mb__before_atomic_dec(void); + void smp_mb__after_atomic_dec(void); + void smp_mb__before_atomic_inc(void); + void smp_mb__after_atomic_dec(void); + +For example, smp_mb__before_atomic_dec() can be used like so: + + obj->dead = 1; + smp_mb__before_atomic_dec(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceeding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarentees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicitl smp_mb__before_atomic_dec() call, the +implementation could legally allow the atomic counter update visible +to other cpus before the "obj->dead = 1;" assignment. + +The other three interfaces listed are used to provide explicit +ordering with respect to memory operations after an atomic_dec() call +(smp_mb__after_atomic_dec()) and around atomic_inc() calls +(smp_mb__{before,after}_atomic_inc()). + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disasterous results. Here is +an example, which follows a pattern occuring frequently in the Linux +kernel. It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guarenteed that no other entity can be accessing the object: + +static void obj_list_add(struct obj *obj) +{ + obj->active = 1; + list_add(&obj->list); +} + +static void obj_list_del(struct obj *obj) +{ + list_del(&obj->list); + obj->active = 0; +} + +static void obj_destroy(struct obj *obj) +{ + BUG_ON(obj->active); + kfree(obj); +} + +struct obj *obj_list_peek(struct list_head *head) +{ + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; +} + +void obj_poke(void) +{ + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } +} + +void obj_timeout(struct obj *obj) +{ + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); +} + +(This is a simplification of the ARP queue management in the + generic neighbour discover code of the networking. Olaf Kirch + found a bug wrt. memory barriers in kfree_skb() that exposed + the atomic_t memory barrier requirements quite clearly.) + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... refcount drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24-bits of it's atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks are +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are least of that +size. The endianness of the bits within each "unsigned long" are the +native endianness of the cpu. + + void set_bit(unsigned long nr, volatils unsigned long *addr); + void clear_bit(unsigned long nr, volatils unsigned long *addr); + void change_bit(unsigned long nr, volatils unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "ADDR". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces. + + int test_and_set_bit(unsigned long nr, volatils unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatils unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + +WARNING! It is incredibly important that the value be a boolean, +ie. "0" or "1". Do not try to be fancy and save a few instructions by +declaring the above to return "long" and just returning something like +"old_val & mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32-bits then testers will never see that. + +One great example of where this problem crops up are the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +require explicit memory barrier semantics around their execution. All +memory operations before the atomic bit operation call must be made +visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarentee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible. + +Finally there is the basic operation: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around clear_bit() (which +does not return a value, and thus does not need to provide memory +barrier semantics), two interfaces are provided: + + void smp_mb__before_clear_bit(void); + void smp_mb__after_clear_bit(void); + +They are used as follows, and are akin to their atomic_t operation +brothers: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_clear_bit(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_clear_bit(); + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() need the same exact memory barriers +as the atomic and bit operations returning values. + +Spinlocks and rwlocks have memory barrier expectations as well. +The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if will drop to zero +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned. + +As an example usage, here is what an atomic counter update +might look like: + +void example_atomic_inc(long *counter) +{ + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } +} + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock(): + +int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +{ + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; +} + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarentee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +Note that this also means that for the case where the counter +is not dropping to zero, there are no memory ordering +requirements. -- cgit