summaryrefslogtreecommitdiff
path: root/Documentation/kernel-hacking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/kernel-hacking')
-rw-r--r--Documentation/kernel-hacking/false-sharing.rst206
-rw-r--r--Documentation/kernel-hacking/hacking.rst53
-rw-r--r--Documentation/kernel-hacking/index.rst1
-rw-r--r--Documentation/kernel-hacking/locking.rst228
4 files changed, 350 insertions, 138 deletions
diff --git a/Documentation/kernel-hacking/false-sharing.rst b/Documentation/kernel-hacking/false-sharing.rst
new file mode 100644
index 000000000000..122b0e124656
--- /dev/null
+++ b/Documentation/kernel-hacking/false-sharing.rst
@@ -0,0 +1,206 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+False Sharing
+=============
+
+What is False Sharing
+=====================
+False sharing is related with cache mechanism of maintaining the data
+coherence of one cache line stored in multiple CPU's caches; then
+academic definition for it is in [1]_. Consider a struct with a
+refcount and a string::
+
+ struct foo {
+ refcount_t refcount;
+ ...
+ char name[16];
+ } ____cacheline_internodealigned_in_smp;
+
+Member 'refcount'(A) and 'name'(B) _share_ one cache line like below::
+
+ +-----------+ +-----------+
+ | CPU 0 | | CPU 1 |
+ +-----------+ +-----------+
+ / |
+ / |
+ V V
+ +----------------------+ +----------------------+
+ | A B | Cache 0 | A B | Cache 1
+ +----------------------+ +----------------------+
+ | |
+ ---------------------------+------------------+-----------------------------
+ | |
+ +----------------------+
+ | |
+ +----------------------+
+ Main Memory | A B |
+ +----------------------+
+
+'refcount' is modified frequently, but 'name' is set once at object
+creation time and is never modified. When many CPUs access 'foo' at
+the same time, with 'refcount' being only bumped by one CPU frequently
+and 'name' being read by other CPUs, all those reading CPUs have to
+reload the whole cache line over and over due to the 'sharing', even
+though 'name' is never changed.
+
+There are many real-world cases of performance regressions caused by
+false sharing. One of these is a rw_semaphore 'mmap_lock' inside
+mm_struct struct, whose cache line layout change triggered a
+regression and Linus analyzed in [2]_.
+
+There are two key factors for a harmful false sharing:
+
+* A global datum accessed (shared) by many CPUs
+* In the concurrent accesses to the data, there is at least one write
+ operation: write/write or write/read cases.
+
+The sharing could be from totally unrelated kernel components, or
+different code paths of the same kernel component.
+
+
+False Sharing Pitfalls
+======================
+Back in time when one platform had only one or a few CPUs, hot data
+members could be purposely put in the same cache line to make them
+cache hot and save cacheline/TLB, like a lock and the data protected
+by it. But for recent large system with hundreds of CPUs, this may
+not work when the lock is heavily contended, as the lock owner CPU
+could write to the data, while other CPUs are busy spinning the lock.
+
+Looking at past cases, there are several frequently occurring patterns
+for false sharing:
+
+* lock (spinlock/mutex/semaphore) and data protected by it are
+ purposely put in one cache line.
+* global data being put together in one cache line. Some kernel
+ subsystems have many global parameters of small size (4 bytes),
+ which can easily be grouped together and put into one cache line.
+* data members of a big data structure randomly sitting together
+ without being noticed (cache line is usually 64 bytes or more),
+ like 'mem_cgroup' struct.
+
+Following 'mitigation' section provides real-world examples.
+
+False sharing could easily happen unless they are intentionally
+checked, and it is valuable to run specific tools for performance
+critical workloads to detect false sharing affecting performance case
+and optimize accordingly.
+
+
+How to detect and analyze False Sharing
+========================================
+perf record/report/stat are widely used for performance tuning, and
+once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
+be further used to detect and pinpoint the possible false sharing
+data structures. 'addr2line' is also good at decoding instruction
+pointer when there are multiple layers of inline functions.
+
+perf-c2c can capture the cache lines with most false sharing hits,
+decoded functions (line number of file) accessing that cache line,
+and in-line offset of the data. Simple commands are::
+
+ $ perf c2c record -ag sleep 3
+ $ perf c2c report --call-graph none -k vmlinux
+
+When running above during testing will-it-scale's tlb_flush1 case,
+perf reports something like::
+
+ Total records : 1658231
+ Locked Load/Store Operations : 89439
+ Load Operations : 623219
+ Load Local HITM : 92117
+ Load Remote HITM : 139
+
+ #----------------------------------------------------------------------
+ 4 0 2374 0 0 0 0xff1100088366d880
+ #----------------------------------------------------------------------
+ 0.00% 42.29% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81373b7b 0 231 129 5312 64 [k] __mod_lruvec_page_state [kernel.vmlinux] memcontrol.h:752 1
+ 0.00% 13.10% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff81374718 0 226 97 3551 64 [k] folio_lruvec_lock_irqsave [kernel.vmlinux] memcontrol.h:752 1
+ 0.00% 11.20% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c29bf 0 170 136 555 64 [k] lru_add_fn [kernel.vmlinux] mm_inline.h:41 1
+ 0.00% 7.62% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffff812c3ec5 0 175 108 632 64 [k] release_pages [kernel.vmlinux] mm_inline.h:41 1
+ 0.00% 23.29% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffff81372d0a 0 234 279 1051 64 [k] __mod_memcg_lruvec_state [kernel.vmlinux] memcontrol.c:736 1
+
+A nice introduction for perf-c2c is [3]_.
+
+'pahole' decodes data structure layouts delimited in cache line
+granularity. Users can match the offset in perf-c2c output with
+pahole's decoding to locate the exact data members. For global
+data, users can search the data address in System.map.
+
+
+Possible Mitigations
+====================
+False sharing does not always need to be mitigated. False sharing
+mitigations should balance performance gains with complexity and
+space consumption. Sometimes, lower performance is OK, and it's
+unnecessary to hyper-optimize every rarely used data structure or
+a cold data path.
+
+False sharing hurting performance cases are seen more frequently with
+core count increasing. Because of these detrimental effects, many
+patches have been proposed across variety of subsystems (like
+networking and memory management) and merged. Some common mitigations
+(with examples) are:
+
+* Separate hot global data in its own dedicated cache line, even if it
+ is just a 'short' type. The downside is more consumption of memory,
+ cache line and TLB entries.
+
+ - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
+
+* Reorganize the data structure, separate the interfering members to
+ different cache lines. One downside is it may introduce new false
+ sharing of other members.
+
+ - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
+
+* Replace 'write' with 'read' when possible, especially in loops.
+ Like for some global variable, use compare(read)-then-write instead
+ of unconditional write. For example, use::
+
+ if (!test_bit(XXX))
+ set_bit(XXX);
+
+ instead of directly "set_bit(XXX);", similarly for atomic_t data::
+
+ if (atomic_read(XXX) == AAA)
+ atomic_set(XXX, BBB);
+
+ - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
+ - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")
+
+* Turn hot global data to 'per-cpu data + global data' when possible,
+ or reasonably increase the threshold for syncing per-cpu data to
+ global data, to reduce or postpone the 'write' to that global data.
+
+ - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
+ - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
+
+Surely, all mitigations should be carefully verified to not cause side
+effects. To avoid introducing false sharing when coding, it's better
+to:
+
+* Be aware of cache line boundaries
+* Group mostly read-only fields together
+* Group things that are written at the same time together
+* Separate frequently read and frequently written fields on
+ different cache lines.
+
+and better add a comment stating the false sharing consideration.
+
+One note is, sometimes even after a severe false sharing is detected
+and solved, the performance may still have no obvious improvement as
+the hotspot switches to a new place.
+
+
+Miscellaneous
+=============
+One open issue is that kernel has an optional data structure
+randomization mechanism, which also randomizes the situation of cache
+line sharing of data members.
+
+
+.. [1] https://en.wikipedia.org/wiki/False_sharing
+.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
+.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/
diff --git a/Documentation/kernel-hacking/hacking.rst b/Documentation/kernel-hacking/hacking.rst
index a3ddb213a5e1..1717348a4404 100644
--- a/Documentation/kernel-hacking/hacking.rst
+++ b/Documentation/kernel-hacking/hacking.rst
@@ -76,8 +76,8 @@ handler is never re-entered: if the same interrupt arrives, it is queued
fast: frequently it simply acknowledges the interrupt, marks a 'software
interrupt' for execution and exits.
-You can tell you are in a hardware interrupt, because
-:c:func:`in_irq()` returns true.
+You can tell you are in a hardware interrupt, because in_hardirq() returns
+true.
.. warning::
@@ -112,8 +112,7 @@ time, although different tasklets can run simultaneously.
.. warning::
The name 'tasklet' is misleading: they have nothing to do with
- 'tasks', and probably more to do with some bad vodka Alexey
- Kuznetsov had at the time.
+ 'tasks'.
You can tell you are in a softirq (or tasklet) using the
:c:func:`in_softirq()` macro (``include/linux/preempt.h``).
@@ -121,7 +120,7 @@ You can tell you are in a softirq (or tasklet) using the
.. warning::
Beware that this will return a false positive if a
- :ref:`botton half lock <local_bh_disable>` is held.
+ :ref:`bottom half lock <local_bh_disable>` is held.
Some Basic Rules
================
@@ -290,8 +289,8 @@ userspace.
Unlike :c:func:`put_user()` and :c:func:`get_user()`, they
return the amount of uncopied data (ie. 0 still means success).
-[Yes, this moronic interface makes me cringe. The flamewar comes up
-every year or so. --RR.]
+[Yes, this objectionable interface makes me cringe. The flamewar comes
+up every year or so. --RR.]
The functions may sleep implicitly. This should never be called outside
user context (it makes no sense), with interrupts disabled, or a
@@ -346,8 +345,8 @@ routine.
Before inventing your own cache of often-used objects consider using a
slab cache in ``include/linux/slab.h``
-:c:func:`current()`
--------------------
+:c:macro:`current`
+------------------
Defined in ``include/asm/current.h``
@@ -601,7 +600,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
-``Documentation/kbuild/namespaces.rst``.
+Documentation/core-api/symbol-namespaces.rst
:c:func:`EXPORT_SYMBOL_NS_GPL()`
--------------------------------
@@ -610,7 +609,7 @@ Defined in ``include/linux/export.h``
This is the variant of `EXPORT_SYMBOL_GPL()` that allows specifying a symbol
namespace. Symbol Namespaces are documented in
-``Documentation/kbuild/namespaces.rst``.
+Documentation/core-api/symbol-namespaces.rst
Routines and Conventions
========================
@@ -645,8 +644,9 @@ names in development kernels; this is not done just to keep everyone on
their toes: it reflects a fundamental change (eg. can no longer be
called with interrupts on, or does extra checks, or doesn't do checks
which were caught before). Usually this is accompanied by a fairly
-complete note to the linux-kernel mailing list; search the archive.
-Simply doing a global replace on the file usually makes things **worse**.
+complete note to the appropriate kernel development mailing list; search
+the archives. Simply doing a global replace on the file usually makes
+things **worse**.
Initializing structure members
------------------------------
@@ -723,14 +723,14 @@ Putting Your Stuff in the Kernel
In order to get your stuff into shape for official inclusion, or even to
make a neat patch, there's administrative work to be done:
-- Figure out whose pond you've been pissing in. Look at the top of the
- source files, inside the ``MAINTAINERS`` file, and last of all in the
- ``CREDITS`` file. You should coordinate with this person to make sure
- you're not duplicating effort, or trying something that's already
- been rejected.
+- Figure out who are the owners of the code you've been modifying. Look
+ at the top of the source files, inside the ``MAINTAINERS`` file, and
+ last of all in the ``CREDITS`` file. You should coordinate with these
+ people to make sure you're not duplicating effort, or trying something
+ that's already been rejected.
- Make sure you put your name and EMail address at the top of any files
- you create or mangle significantly. This is the first place people
+ Make sure you put your name and email address at the top of any files
+ you create or modify significantly. This is the first place people
will look when they find a bug, or when **they** want to make a change.
- Usually you want a configuration option for your kernel hack. Edit
@@ -748,15 +748,14 @@ make a neat patch, there's administrative work to be done:
can usually just add a "obj-$(CONFIG_xxx) += xxx.o" line. The syntax
is documented in ``Documentation/kbuild/makefiles.rst``.
-- Put yourself in ``CREDITS`` if you've done something noteworthy,
- usually beyond a single file (your name should be at the top of the
- source files anyway). ``MAINTAINERS`` means you want to be consulted
- when changes are made to a subsystem, and hear about bugs; it implies
- a more-than-passing commitment to some part of the code.
+- Put yourself in ``CREDITS`` if you consider what you've done
+ noteworthy, usually beyond a single file (your name should be at the
+ top of the source files anyway). ``MAINTAINERS`` means you want to be
+ consulted when changes are made to a subsystem, and hear about bugs;
+ it implies a more-than-passing commitment to some part of the code.
- Finally, don't forget to read
- ``Documentation/process/submitting-patches.rst`` and possibly
- ``Documentation/process/submitting-drivers.rst``.
+ ``Documentation/process/submitting-patches.rst``
Kernel Cantrips
===============
diff --git a/Documentation/kernel-hacking/index.rst b/Documentation/kernel-hacking/index.rst
index f53027652290..79c03bac99a2 100644
--- a/Documentation/kernel-hacking/index.rst
+++ b/Documentation/kernel-hacking/index.rst
@@ -9,3 +9,4 @@ Kernel Hacking Guides
hacking
locking
+ false-sharing
diff --git a/Documentation/kernel-hacking/locking.rst b/Documentation/kernel-hacking/locking.rst
index a8518ac0d31d..dff0646a717b 100644
--- a/Documentation/kernel-hacking/locking.rst
+++ b/Documentation/kernel-hacking/locking.rst
@@ -94,16 +94,10 @@ primitives, but I'll pretend they don't exist.
Locking in the Linux Kernel
===========================
-If I could give you one piece of advice: never sleep with anyone crazier
-than yourself. But if I had to give you advice on locking: **keep it
-simple**.
+If I could give you one piece of advice on locking: **keep it simple**.
Be reluctant to introduce new locks.
-Strangely enough, this last one is the exact reverse of my advice when
-you **have** slept with someone crazier than yourself. And you should
-think about getting a big dog.
-
Two Main Types of Kernel Locks: Spinlocks and Mutexes
-----------------------------------------------------
@@ -118,11 +112,11 @@ spinlock, but you may block holding a mutex. If you can't lock a mutex,
your task will suspend itself, and be woken up when the mutex is
released. This means the CPU can do something else while you are
waiting. There are many cases when you simply can't sleep (see
-`What Functions Are Safe To Call From Interrupts? <#sleeping-things>`__),
+`What Functions Are Safe To Call From Interrupts?`_),
and so have to use a spinlock instead.
Neither type of lock is recursive: see
-`Deadlock: Simple and Advanced <#deadlock>`__.
+`Deadlock: Simple and Advanced`_.
Locks and Uniprocessor Kernels
------------------------------
@@ -150,17 +144,17 @@ Locking Only In User Context
If you have a data structure which is only ever accessed from user
context, then you can use a simple mutex (``include/linux/mutex.h``) to
protect it. This is the most trivial case: you initialize the mutex.
-Then you can call :c:func:`mutex_lock_interruptible()` to grab the
-mutex, and :c:func:`mutex_unlock()` to release it. There is also a
-:c:func:`mutex_lock()`, which should be avoided, because it will
+Then you can call mutex_lock_interruptible() to grab the
+mutex, and mutex_unlock() to release it. There is also a
+mutex_lock(), which should be avoided, because it will
not return if a signal is received.
Example: ``net/netfilter/nf_sockopt.c`` allows registration of new
-:c:func:`setsockopt()` and :c:func:`getsockopt()` calls, with
-:c:func:`nf_register_sockopt()`. Registration and de-registration
+setsockopt() and getsockopt() calls, with
+nf_register_sockopt(). Registration and de-registration
are only done on module load and unload (and boot time, where there is
no concurrency), and the list of registrations is only consulted for an
-unknown :c:func:`setsockopt()` or :c:func:`getsockopt()` system
+unknown setsockopt() or getsockopt() system
call. The ``nf_sockopt_mutex`` is perfect to protect this, especially
since the setsockopt and getsockopt calls may well sleep.
@@ -170,19 +164,19 @@ Locking Between User Context and Softirqs
If a softirq shares data with user context, you have two problems.
Firstly, the current user context can be interrupted by a softirq, and
secondly, the critical region could be entered from another CPU. This is
-where :c:func:`spin_lock_bh()` (``include/linux/spinlock.h``) is
+where spin_lock_bh() (``include/linux/spinlock.h``) is
used. It disables softirqs on that CPU, then grabs the lock.
-:c:func:`spin_unlock_bh()` does the reverse. (The '_bh' suffix is
+spin_unlock_bh() does the reverse. (The '_bh' suffix is
a historical reference to "Bottom Halves", the old name for software
interrupts. It should really be called spin_lock_softirq()' in a
perfect world).
-Note that you can also use :c:func:`spin_lock_irq()` or
-:c:func:`spin_lock_irqsave()` here, which stop hardware interrupts
-as well: see `Hard IRQ Context <#hard-irq-context>`__.
+Note that you can also use spin_lock_irq() or
+spin_lock_irqsave() here, which stop hardware interrupts
+as well: see `Hard IRQ Context`_.
This works perfectly for UP as well: the spin lock vanishes, and this
-macro simply becomes :c:func:`local_bh_disable()`
+macro simply becomes local_bh_disable()
(``include/linux/interrupt.h``), which protects you from the softirq
being run.
@@ -216,8 +210,8 @@ Different Tasklets/Timers
~~~~~~~~~~~~~~~~~~~~~~~~~
If another tasklet/timer wants to share data with your tasklet or timer
-, you will both need to use :c:func:`spin_lock()` and
-:c:func:`spin_unlock()` calls. :c:func:`spin_lock_bh()` is
+, you will both need to use spin_lock() and
+spin_unlock() calls. spin_lock_bh() is
unnecessary here, as you are already in a tasklet, and none will be run
on the same CPU.
@@ -230,18 +224,18 @@ The Same Softirq
~~~~~~~~~~~~~~~~
The same softirq can run on the other CPUs: you can use a per-CPU array
-(see `Per-CPU Data <#per-cpu-data>`__) for better performance. If you're
+(see `Per-CPU Data`_) for better performance. If you're
going so far as to use a softirq, you probably care about scalable
performance enough to justify the extra complexity.
-You'll need to use :c:func:`spin_lock()` and
-:c:func:`spin_unlock()` for shared data.
+You'll need to use spin_lock() and
+spin_unlock() for shared data.
Different Softirqs
~~~~~~~~~~~~~~~~~~
-You'll need to use :c:func:`spin_lock()` and
-:c:func:`spin_unlock()` for shared data, whether it be a timer,
+You'll need to use spin_lock() and
+spin_unlock() for shared data, whether it be a timer,
tasklet, different softirq or the same or another softirq: any of them
could be running on a different CPU.
@@ -259,38 +253,38 @@ If a hardware irq handler shares data with a softirq, you have two
concerns. Firstly, the softirq processing can be interrupted by a
hardware interrupt, and secondly, the critical region could be entered
by a hardware interrupt on another CPU. This is where
-:c:func:`spin_lock_irq()` is used. It is defined to disable
+spin_lock_irq() is used. It is defined to disable
interrupts on that cpu, then grab the lock.
-:c:func:`spin_unlock_irq()` does the reverse.
+spin_unlock_irq() does the reverse.
-The irq handler does not to use :c:func:`spin_lock_irq()`, because
+The irq handler does not need to use spin_lock_irq(), because
the softirq cannot run while the irq handler is running: it can use
-:c:func:`spin_lock()`, which is slightly faster. The only exception
+spin_lock(), which is slightly faster. The only exception
would be if a different hardware irq handler uses the same lock:
-:c:func:`spin_lock_irq()` will stop that from interrupting us.
+spin_lock_irq() will stop that from interrupting us.
This works perfectly for UP as well: the spin lock vanishes, and this
-macro simply becomes :c:func:`local_irq_disable()`
+macro simply becomes local_irq_disable()
(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH
being run.
-:c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) is a
+spin_lock_irqsave() (``include/linux/spinlock.h``) is a
variant which saves whether interrupts were on or off in a flags word,
-which is passed to :c:func:`spin_unlock_irqrestore()`. This means
+which is passed to spin_unlock_irqrestore(). This means
that the same code can be used inside an hard irq handler (where
interrupts are already off) and in softirqs (where the irq disabling is
required).
Note that softirqs (and hence tasklets and timers) are run on return
-from hardware interrupts, so :c:func:`spin_lock_irq()` also stops
-these. In that sense, :c:func:`spin_lock_irqsave()` is the most
+from hardware interrupts, so spin_lock_irq() also stops
+these. In that sense, spin_lock_irqsave() is the most
general and powerful locking function.
Locking Between Two Hard IRQ Handlers
-------------------------------------
It is rare to have to share data between two IRQ handlers, but if you
-do, :c:func:`spin_lock_irqsave()` should be used: it is
+do, spin_lock_irqsave() should be used: it is
architecture-specific whether all interrupts are disabled inside irq
handlers themselves.
@@ -301,14 +295,14 @@ Pete Zaitcev gives the following summary:
- If you are in a process context (any syscall) and want to lock other
process out, use a mutex. You can take a mutex and sleep
- (``copy_from_user*(`` or ``kmalloc(x,GFP_KERNEL)``).
+ (``copy_from_user()`` or ``kmalloc(x,GFP_KERNEL)``).
- Otherwise (== data can be touched in an interrupt), use
- :c:func:`spin_lock_irqsave()` and
- :c:func:`spin_unlock_irqrestore()`.
+ spin_lock_irqsave() and
+ spin_unlock_irqrestore().
- Avoid holding spinlock for more than 5 lines of code and across any
- function call (except accessors like :c:func:`readb()`).
+ function call (except accessors like readb()).
Table of Minimum Requirements
-----------------------------
@@ -320,7 +314,7 @@ particular thread can only run on one CPU at a time, but if it needs
shares data with another thread, locking is required).
Remember the advice above: you can always use
-:c:func:`spin_lock_irqsave()`, which is a superset of all other
+spin_lock_irqsave(), which is a superset of all other
spinlock primitives.
============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ==============
@@ -363,13 +357,13 @@ They can be used if you need no access to the data protected with the
lock when some other thread is holding the lock. You should acquire the
lock later if you then need access to the data protected with the lock.
-:c:func:`spin_trylock()` does not spin but returns non-zero if it
+spin_trylock() does not spin but returns non-zero if it
acquires the spinlock on the first try or 0 if not. This function can be
-used in all contexts like :c:func:`spin_lock()`: you must have
+used in all contexts like spin_lock(): you must have
disabled the contexts that might interrupt you and acquire the spin
lock.
-:c:func:`mutex_trylock()` does not suspend your task but returns
+mutex_trylock() does not suspend your task but returns
non-zero if it could lock the mutex on the first try or 0 if not. This
function cannot be safely used in hardware or software interrupt
contexts despite not sleeping.
@@ -490,14 +484,14 @@ easy, since we copy the data for the user, and never let them access the
objects directly.
There is a slight (and common) optimization here: in
-:c:func:`cache_add()` we set up the fields of the object before
+cache_add() we set up the fields of the object before
grabbing the lock. This is safe, as no-one else can access it until we
put it in cache.
Accessing From Interrupt Context
--------------------------------
-Now consider the case where :c:func:`cache_find()` can be called
+Now consider the case where cache_find() can be called
from interrupt context: either a hardware interrupt or a softirq. An
example would be a timer which deletes object from the cache.
@@ -566,16 +560,16 @@ which are taken away, and the ``+`` are lines which are added.
return ret;
}
-Note that the :c:func:`spin_lock_irqsave()` will turn off
+Note that the spin_lock_irqsave() will turn off
interrupts if they are on, otherwise does nothing (if we are already in
an interrupt handler), hence these functions are safe to call from any
context.
-Unfortunately, :c:func:`cache_add()` calls :c:func:`kmalloc()`
+Unfortunately, cache_add() calls kmalloc()
with the ``GFP_KERNEL`` flag, which is only legal in user context. I
-have assumed that :c:func:`cache_add()` is still only called in
+have assumed that cache_add() is still only called in
user context, otherwise this should become a parameter to
-:c:func:`cache_add()`.
+cache_add().
Exposing Objects Outside This File
----------------------------------
@@ -592,7 +586,7 @@ This makes locking trickier, as it is no longer all in one place.
The second problem is the lifetime problem: if another structure keeps a
pointer to an object, it presumably expects that pointer to remain
valid. Unfortunately, this is only guaranteed while you hold the lock,
-otherwise someone might call :c:func:`cache_delete()` and even
+otherwise someone might call cache_delete() and even
worse, add another object, re-using the same address.
As there is only one lock, you can't hold it forever: no-one else would
@@ -693,8 +687,8 @@ Here is the code::
We encapsulate the reference counting in the standard 'get' and 'put'
functions. Now we can return the object itself from
-:c:func:`cache_find()` which has the advantage that the user can
-now sleep holding the object (eg. to :c:func:`copy_to_user()` to
+cache_find() which has the advantage that the user can
+now sleep holding the object (eg. to copy_to_user() to
name to userspace).
The other point to note is that I said a reference should be held for
@@ -710,7 +704,7 @@ number of atomic operations defined in ``include/asm/atomic.h``: these
are guaranteed to be seen atomically from all CPUs in the system, so no
lock is required. In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer. The
-:c:func:`atomic_inc()` and :c:func:`atomic_dec_and_test()`
+atomic_inc() and atomic_dec_and_test()
are used instead of the standard increment and decrement operators, and
the lock is no longer used to protect the reference count itself.
@@ -802,7 +796,7 @@ name to change, there are three possibilities:
- You can make ``cache_lock`` non-static, and tell people to grab that
lock before changing the name in any object.
-- You can provide a :c:func:`cache_obj_rename()` which grabs this
+- You can provide a cache_obj_rename() which grabs this
lock and changes the name for the caller, and tell everyone to use
that function.
@@ -861,11 +855,11 @@ Note that I decide that the popularity count should be protected by the
``cache_lock`` rather than the per-object lock: this is because it (like
the :c:type:`struct list_head <list_head>` inside the object)
is logically part of the infrastructure. This way, I don't need to grab
-the lock of every object in :c:func:`__cache_add()` when seeking
+the lock of every object in __cache_add() when seeking
the least popular.
I also decided that the id member is unchangeable, so I don't need to
-grab each object lock in :c:func:`__cache_find()` to examine the
+grab each object lock in __cache_find() to examine the
id: the object lock is only used by a caller who wants to read or write
the name field.
@@ -887,7 +881,7 @@ trivial to diagnose: not a
stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem.
For a slightly more complex case, imagine you have a region shared by a
-softirq and user context. If you use a :c:func:`spin_lock()` call
+softirq and user context. If you use a spin_lock() call
to protect it, it is possible that the user context will be interrupted
by the softirq while it holds the lock, and the softirq will then spin
forever trying to get the same lock.
@@ -947,8 +941,7 @@ lock.
A classic problem here is when you provide callbacks or hooks: if you
call these with the lock held, you risk simple deadlock, or a deadly
-embrace (who knows what the callback will do?). Remember, the other
-programmers are out to get you, so don't do this.
+embrace (who knows what the callback will do?).
Overzealous Prevention Of Deadlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -958,8 +951,6 @@ grabs a read lock, searches a list, fails to find what it wants, drops
the read lock, grabs a write lock and inserts the object has a race
condition.
-If you don't see why, please stay the fuck away from my code.
-
Racing Timers: A Kernel Pastime
-------------------------------
@@ -976,7 +967,7 @@ you might do the following::
while (list) {
struct foo *next = list->next;
- del_timer(&list->timer);
+ timer_delete(&list->timer);
kfree(list);
list = next;
}
@@ -985,12 +976,12 @@ you might do the following::
Sooner or later, this will crash on SMP, because a timer can have just
-gone off before the :c:func:`spin_lock_bh()`, and it will only get
-the lock after we :c:func:`spin_unlock_bh()`, and then try to free
+gone off before the spin_lock_bh(), and it will only get
+the lock after we spin_unlock_bh(), and then try to free
the element (which has already been freed!).
This can be avoided by checking the result of
-:c:func:`del_timer()`: if it returns 1, the timer has been deleted.
+timer_delete(): if it returns 1, the timer has been deleted.
If 0, it means (in this case) that it is currently running, so we can
do::
@@ -999,7 +990,7 @@ do::
while (list) {
struct foo *next = list->next;
- if (!del_timer(&list->timer)) {
+ if (!timer_delete(&list->timer)) {
/* Give timer a chance to delete this */
spin_unlock_bh(&list_lock);
goto retry;
@@ -1012,11 +1003,14 @@ do::
Another common problem is deleting timers which restart themselves (by
-calling :c:func:`add_timer()` at the end of their timer function).
+calling add_timer() at the end of their timer function).
Because this is a fairly common case which is prone to races, you should
-use :c:func:`del_timer_sync()` (``include/linux/timer.h``) to
-handle this case. It returns the number of times the timer had to be
-deleted before we finally stopped it from adding itself back in.
+use timer_delete_sync() (``include/linux/timer.h``) to handle this case.
+
+Before freeing a timer, timer_shutdown() or timer_shutdown_sync() should be
+called which will keep it from being rearmed. Any subsequent attempt to
+rearm the timer will be silently ignored by the core code.
+
Locking Speed
=============
@@ -1086,7 +1080,7 @@ adding ``new`` to a single linked list called ``list``::
list->next = new;
-The :c:func:`wmb()` is a write memory barrier. It ensures that the
+The wmb() is a write memory barrier. It ensures that the
first operation (setting the new element's ``next`` pointer) is complete
and will be seen by all CPUs, before the second operation is (putting
the new element into the list). This is important, since modern
@@ -1097,7 +1091,7 @@ rest of the list.
Fortunately, there is a function to do this for standard
:c:type:`struct list_head <list_head>` lists:
-:c:func:`list_add_rcu()` (``include/linux/list.h``).
+list_add_rcu() (``include/linux/list.h``).
Removing an element from the list is even simpler: we replace the
pointer to the old element with a pointer to its successor, and readers
@@ -1108,7 +1102,7 @@ will either see it, or skip over it.
list->next = old->next;
-There is :c:func:`list_del_rcu()` (``include/linux/list.h``) which
+There is list_del_rcu() (``include/linux/list.h``) which
does this (the normal version poisons the old object, which we don't
want).
@@ -1116,9 +1110,9 @@ The reader must also be careful: some CPUs can look through the ``next``
pointer to start reading the contents of the next element early, but
don't realize that the pre-fetched contents is wrong when the ``next``
pointer changes underneath them. Once again, there is a
-:c:func:`list_for_each_entry_rcu()` (``include/linux/list.h``)
+list_for_each_entry_rcu() (``include/linux/list.h``)
to help you. Of course, writers can just use
-:c:func:`list_for_each_entry()`, since there cannot be two
+list_for_each_entry(), since there cannot be two
simultaneous writers.
Our final dilemma is this: when can we actually destroy the removed
@@ -1127,14 +1121,14 @@ the list right now: if we free this element and the ``next`` pointer
changes, the reader will jump off into garbage and crash. We need to
wait until we know that all the readers who were traversing the list
when we deleted the element are finished. We use
-:c:func:`call_rcu()` to register a callback which will actually
+call_rcu() to register a callback which will actually
destroy the object once all pre-existing readers are finished.
-Alternatively, :c:func:`synchronize_rcu()` may be used to block
+Alternatively, synchronize_rcu() may be used to block
until all pre-existing are finished.
But how does Read Copy Update know when the readers are finished? The
method is this: firstly, the readers always traverse the list inside
-:c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` pairs:
+rcu_read_lock()/rcu_read_unlock() pairs:
these simply disable preemption so the reader won't go to sleep while
reading the list.
@@ -1223,12 +1217,12 @@ this is the fundamental idea.
}
Note that the reader will alter the popularity member in
-:c:func:`__cache_find()`, and now it doesn't hold a lock. One
+__cache_find(), and now it doesn't hold a lock. One
solution would be to make it an ``atomic_t``, but for this usage, we
don't really care about races: an approximate result is good enough, so
I didn't change it.
-The result is that :c:func:`cache_find()` requires no
+The result is that cache_find() requires no
synchronization with any other functions, so is almost as fast on SMP as
it would be on UP.
@@ -1240,9 +1234,9 @@ and put the reference count.
Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
-:c:func:`cache_find()` and :c:func:`object_put()` does not
+cache_find() and object_put() does not
need to actually get and put the reference count: we could expose
-:c:func:`__cache_find()` by making it non-static, and such
+__cache_find() by making it non-static, and such
callers could simply call that.
The benefit here is that the reference count is not written to: the
@@ -1260,11 +1254,11 @@ counter. Nice and simple.
If that was too slow (it's usually not, but if you've got a really big
machine to test on and can show that it is), you could instead use a
counter for each CPU, then none of them need an exclusive lock. See
-:c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` and
-:c:func:`put_cpu_var()` (``include/linux/percpu.h``).
+DEFINE_PER_CPU(), get_cpu_var() and
+put_cpu_var() (``include/linux/percpu.h``).
Of particular use for simple per-cpu counters is the ``local_t`` type,
-and the :c:func:`cpu_local_inc()` and related functions, which are
+and the cpu_local_inc() and related functions, which are
more efficient than simple code on some architectures
(``include/asm/local.h``).
@@ -1283,16 +1277,16 @@ Manfred Spraul points out that you can still do this, even if the data
is very occasionally accessed in user context or softirqs/tasklets. The
irq handler doesn't use a lock, and all other accesses are done as so::
- spin_lock(&lock);
+ mutex_lock(&lock);
disable_irq(irq);
...
enable_irq(irq);
- spin_unlock(&lock);
+ mutex_unlock(&lock);
-The :c:func:`disable_irq()` prevents the irq handler from running
+The disable_irq() prevents the irq handler from running
(and waits for it to finish if it's currently running on other CPUs).
The spinlock prevents any other accesses happening at the same time.
-Naturally, this is slower than just a :c:func:`spin_lock_irq()`
+Naturally, this is slower than just a spin_lock_irq()
call, so it only makes sense if this type of access happens extremely
rarely.
@@ -1315,22 +1309,22 @@ from user context, and can sleep.
- Accesses to userspace:
- - :c:func:`copy_from_user()`
+ - copy_from_user()
- - :c:func:`copy_to_user()`
+ - copy_to_user()
- - :c:func:`get_user()`
+ - get_user()
- - :c:func:`put_user()`
+ - put_user()
-- :c:func:`kmalloc(GFP_KERNEL) <kmalloc>`
+- kmalloc(GP_KERNEL) <kmalloc>`
-- :c:func:`mutex_lock_interruptible()` and
- :c:func:`mutex_lock()`
+- mutex_lock_interruptible() and
+ mutex_lock()
- There is a :c:func:`mutex_trylock()` which does not sleep.
+ There is a mutex_trylock() which does not sleep.
Still, it must not be used inside interrupt context since its
- implementation is not safe for that. :c:func:`mutex_unlock()`
+ implementation is not safe for that. mutex_unlock()
will also never sleep. It cannot be used in interrupt context either
since a mutex must be released by the same task that acquired it.
@@ -1340,11 +1334,11 @@ Some Functions Which Don't Sleep
Some functions are safe to call from any context, or holding almost any
lock.
-- :c:func:`printk()`
+- printk()
-- :c:func:`kfree()`
+- kfree()
-- :c:func:`add_timer()` and :c:func:`del_timer()`
+- add_timer() and timer_delete()
Mutex API reference
===================
@@ -1358,7 +1352,19 @@ Mutex API reference
Futex API reference
===================
-.. kernel-doc:: kernel/futex.c
+.. kernel-doc:: kernel/futex/core.c
+ :internal:
+
+.. kernel-doc:: kernel/futex/futex.h
+ :internal:
+
+.. kernel-doc:: kernel/futex/pi.c
+ :internal:
+
+.. kernel-doc:: kernel/futex/requeue.c
+ :internal:
+
+.. kernel-doc:: kernel/futex/waitwake.c
:internal:
Further reading
@@ -1400,26 +1406,26 @@ preemption
bh
Bottom Half: for historical reasons, functions with '_bh' in them often
- now refer to any software interrupt, e.g. :c:func:`spin_lock_bh()`
+ now refer to any software interrupt, e.g. spin_lock_bh()
blocks any software interrupt on the current CPU. Bottom halves are
deprecated, and will eventually be replaced by tasklets. Only one bottom
half will be running at any time.
Hardware Interrupt / Hardware IRQ
- Hardware interrupt request. :c:func:`in_irq()` returns true in a
+ Hardware interrupt request. in_hardirq() returns true in a
hardware interrupt handler.
Interrupt Context
Not user context: processing a hardware irq or software irq. Indicated
- by the :c:func:`in_interrupt()` macro returning true.
+ by the in_interrupt() macro returning true.
SMP
Symmetric Multi-Processor: kernels compiled for multiple-CPU machines.
(``CONFIG_SMP=y``).
Software Interrupt / softirq
- Software interrupt handler. :c:func:`in_irq()` returns false;
- :c:func:`in_softirq()` returns true. Tasklets and softirqs both
+ Software interrupt handler. in_hardirq() returns false;
+ in_softirq() returns true. Tasklets and softirqs both
fall into the category of 'software interrupts'.
Strictly speaking a softirq is one of up to 32 enumerated software