Age | Commit message (Collapse) | Author |
|
Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.
This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.
Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.
1) The "cpuset.cpus" is non-empty and exclusive.
2) The parent cpuset is a valid partition root.
3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".
Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.
To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Since cpuset is
threaded, no-internal-process rule does not apply. So it is possible
to have tasks in a partition root with child sub-partitions even though
that should be uncommon.
A parent partition with no task can now have all its CPUs distributed out
to its child partitions. The top cpuset always have some house-keeping
tasks running and so its list of effective cpu can't be empty.
Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.
Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.
Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Previously, update_tasks_cpumask() is not supposed to be called with
top cpuset. With cpuset partition that takes CPUs away from the top
cpuset, adjusting the cpus_mask of the tasks in the top cpuset is
necessary. Percpu kthreads, however, are ignored.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
pids.peak tracks the high watermark of usage for number of pids. This
helps give a better baseline on which to set pids.max. Polling
pids.current isn't really feasible, since it would potentially miss
short-lived spikes.
This interface is analogous to memory.peak.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There's a seemingly harmless data-race around cgrp_dfl_visible detected by
kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Cc: Gabriel Ryan <gabe@cs.columbia.edu>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/
|
|
is on
When CONFIG_SECURITY_NETWORK is disabled, there will be build warnings
from resolve_btfids:
WARN: resolve_btfids: unresolved symbol bpf_lsm_socket_socketpair
......
WARN: resolve_btfids: unresolved symbol bpf_lsm_inet_conn_established
Fixing it by wrapping these BTF ID definitions by CONFIG_SECURITY_NETWORK.
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220901065126.3856297-1-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The assignment of the else and else if branches is the same, so the else
if here is redundant, so we remove it and add a comment to make the code
here readable.
./kernel/bpf/cgroup_iter.c:81:6-8: WARNING: possible condition with no effect (if == else).
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2016
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220831021618.86770-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
tools/testing/selftests/net/.gitignore
sort the net-next version and use it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption
safe and now migrate_disable() doesn't disable preemption, so the update
of prog-active is not atomic and in theory under fully preemptible kernel
recurisve prevention may do not work.
Fixing by using the preemption-safe and IRQ-safe variants.
Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Now migrate_disable() does not disable preemption and under some
architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither
preemption-safe nor IRQ-safe, so for fully preemptible kernel concurrent
lookups or updates on the same task local storage and on the same CPU
may make bpf_task_storage_busy be imbalanced, and
bpf_task_storage_trylock() on the specific cpu will always fail.
Fixing it by using this_cpu_{inc|dec|inc_return} when manipulating
bpf_task_storage_busy.
Fixes: bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD
doc.2022.08.31b: Documentation updates
fixes.2022.08.31b: Miscellaneous fixes
kvfree.2022.08.31b: kvfree_rcu() updates
nocb.2022.09.01a: NOCB CPU updates
poll.2022.08.31b: Full-oldstate RCU polling grace-period API
poll-srcu.2022.08.31b: Polled SRCU grace-period updates
tasks.2022.08.31b: Tasks RCU updates
|
|
The rcutorture_oom_notify() function unconditionally invokes
rcu_barrier(), which is OK when the rcutorture.torture_type value is
"rcu", but unhelpful otherwise. The purpose of these barrier calls is to
wait for all outstanding callback-flooding callbacks to be invoked before
cleaning up their data. Using the wrong barrier function therefore
risks arbitrary memory corruption. Thus, this commit changes these
rcu_barrier() calls into cur_ops->cb_barrier() to make things work when
torturing non-vanilla flavors of RCU.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth, bpf and wireless.
Current release - regressions:
- bpf:
- fix wrong last sg check in sk_msg_recvmsg()
- fix kernel BUG in purge_effective_progs()
- mac80211:
- fix possible leak in ieee80211_tx_control_port()
- potential NULL dereference in ieee80211_tx_control_port()
Current release - new code bugs:
- nfp: fix the access to management firmware hanging
Previous releases - regressions:
- ip: fix triggering of 'icmp redirect'
- sched: tbf: don't call qdisc_put() while holding tree lock
- bpf: fix corrupted packets for XDP_SHARED_UMEM
- bluetooth: hci_sync: fix suspend performance regression
- micrel: fix probe failure
Previous releases - always broken:
- tcp: make global challenge ack rate limitation per net-ns and
default disabled
- tg3: fix potential hang-up on system reboot
- mac802154: fix reception for no-daddr packets
Misc:
- r8152: add PID for the lenovo onelink+ dock"
* tag 'net-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
net/smc: Remove redundant refcount increase
Revert "sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb"
tcp: make global challenge ack rate limitation per net-ns and default disabled
tcp: annotate data-race around challenge_timestamp
net: dsa: hellcreek: Print warning only once
ip: fix triggering of 'icmp redirect'
sch_cake: Return __NET_XMIT_STOLEN when consuming enqueued skb
selftests: net: sort .gitignore file
Documentation: networking: correct possessive "its"
kcm: fix strp_init() order and cleanup
mlxbf_gige: compute MDIO period based on i1clk
ethernet: rocker: fix sleep in atomic context bug in neigh_timer_handler
net: lan966x: improve error handle in lan966x_fdma_rx_get_frame()
nfp: fix the access to management firmware hanging
net: phy: micrel: Make the GPIO to be non-exclusive
net: virtio_net: fix notification coalescing comments
net/sched: fix netdevice reference leaks in attach_default_qdiscs()
net: sched: tbf: don't call qdisc_put() while holding tree lock
net: Use u64_stats_fetch_begin_irq() for stats fetch.
net: dsa: xrs700x: Use irqsave variant for u64 stats update
...
|
|
Add cgroup_file_show() which allows toggling visibility of a cgroup file
using the new kernfs_show(). This will be used to hide psi interface files
on cgroups where it's disabled.
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-10-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
console_unblank() does this too (called in both places right after),
and with a lot more confidence inspiring approach to locking.
Reconstructing this story is very strange:
In b61312d353da ("oops handling: ensure that any oops is flushed to
the mtdoops console") it is claimed that a printk(" "); flushed out
the console buffer, which was removed in e3e8a75d2acf ("[PATCH]
Extract and use wake_up_klogd()"). In todays kernels this is done way
earlier in console_flush_on_panic with some really nasty tricks. I
didn't bother to fully reconstruct this all, least because the call to
bust_spinlock(0); gets moved every few years, depending upon how the
wind blows (or well, who screamed loudest about the various issue each
call site caused).
Before that commit the only calls to console_unblank() where in s390
arch code.
The other side here is the console->unblank callback, which was
introduced in 2.1.31 for the vt driver. Which predates the
console_unblank() function by a lot, which was added (without users)
in 2.4.14.3. So pretty much impossible to guess at any motivation
here. Also afaict the vt driver is the only (and always was the only)
console driver implementing the unblank callback, so no idea why a
call to console_unblank() was added for the mtdooops driver - the
action actually flushing out the console buffers is done from
console_unlock() only.
Note that as prep for the s390 users the locking was adjusted in
2.5.22 (I couldn't figure out how to properly reference the BK commit
from the historical git trees) from a normal semaphore to a trylock.
Note that a copy of the direct unblank_screen() call was added to
panic() in c7c3f05e341a ("panic: avoid deadlocks in re-entrant console
drivers"), which partially inlined the bust_spinlocks(0); call.
Long story short, I have no idea why the direct call to unblank_screen
survived for so long (the infrastructure to do it properly existed for
years), nor why it wasn't removed when the console_unblank() call was
finally added. But it makes a ton more sense to finally do that than
not - it's just better encapsulation to go through the console
functions instead of doing a direct call, so let's dare. Plus it
really does not make much sense to call the only unblank
implementation there is twice, once without, and once with appropriate
locking.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xuezhi Zhang <zhangxuezhi1@coolpad.com>
Cc: Yangxi Xiang <xyangxi5@gmail.com>
Cc: nick black <dankamongmen@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Marco Elver <elver@google.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20220830145004.430545-1-daniel.vetter@ffwll.ch
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The error message is not printed immediately because it does not end with
a newline character.
Before:
root@localhost:~# insmod vmlinux.ko
insmod: ERROR: could not insert module vmlinux.ko: Invalid parameters
After:
root@localhost:~# insmod vmlinux.ko
[ 43.982558] livepatch: vmlinux.ko: invalid module name
insmod: ERROR: could not insert module vmlinux.ko: Invalid parameters
Fixes: dcf550e52f56 ("livepatch: Disallow vmlinux.ko")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220830112855.749-1-thunder.leizhen@huawei.com
|
|
The KLP transition code depends on the TIF_PATCH_PENDING and
the task->patch_state to stay in sync. On a normal (forward)
transition, TIF_PATCH_PENDING will be set on every task in
the system, while on a reverse transition (after a failed
forward one) first TIF_PATCH_PENDING will be cleared from
every task, followed by it being set on tasks that need to
be transitioned back to the original code.
However, the fork code copies over the TIF_PATCH_PENDING flag
from the parent to the child early on, in dup_task_struct and
setup_thread_stack. Much later, klp_copy_process will set
child->patch_state to match that of the parent.
However, the parent's patch_state may have been changed by KLP loading
or unloading since it was initially copied over into the child.
This results in the KLP code occasionally hitting this warning in
klp_complete_transition:
for_each_process_thread(g, task) {
WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
task->patch_state = KLP_UNDEFINED;
}
Set, or clear, the TIF_PATCH_PENDING flag in the child task
depending on whether or not it is needed at the time
klp_copy_process is called, at a point in copy_process where the
tasklist_lock is held exclusively, preventing races with the KLP
code.
The KLP code does have a few places where the state is changed
without the tasklist_lock held, but those should not cause
problems because klp_update_patch_state(current) cannot be
called while the current task is in the middle of fork,
klp_check_and_switch_task() which is called under the pi_lock,
which prevents rescheduling, and manipulation of the patch
state of idle tasks, which do not fork.
This should prevent this warning from triggering again in the
future, and close the race for both normal and reverse transitions.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Fixes: d83a7cb375ee ("livepatch: change to a per-task consistency model")
Cc: stable@kernel.org
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220808150019.03d6a67b@imladris.surriel.com
|
|
As members in sched_dl_entity are independent with dl_bw, move
__dl_clear_params out of dl_bw lock.
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220827020911.30641-1-shangxiaojing@huawei.com
|
|
Wrap repeated code in helper function replenish_dl_new_period, which set
the deadline and runtime of input dl_se based on pi_of(dl_se). Note that
setup_new_dl_entity originally set the deadline and runtime base on
dl_se, which should equals to pi_of(dl_se) for non-boosted task.
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220826100037.12146-1-shangxiaojing@huawei.com
|
|
Wrap repeated code in helper function dl_task_is_earliest_deadline, which
return true if there is no deadline task on the rq at all, or task's
deadline earlier than the whole rq.
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220826083453.698-1-shangxiaojing@huawei.com
|
|
In __htab_map_lookup_and_delete_batch() if htab_lock_bucket() returns
-EBUSY, it will go to next bucket. Going to next bucket may not only
skip the elements in current bucket silently, but also incur
out-of-bound memory access or expose kernel memory to userspace if
current bucket_cnt is greater than bucket_size or zero.
Fixing it by stopping batch operation and returning -EBUSY when
htab_lock_bucket() fails, and the application can retry or skip the busy
batch as needed.
Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Per-cpu htab->map_locked is used to prohibit the concurrent accesses
from both NMI and non-NMI contexts. But since commit 74d862b682f5
("sched: Make migrate_disable/enable() independent of RT"),
migrate_disable() is also preemptible under CONFIG_PREEMPT case, so now
map_locked also disallows concurrent updates from normal contexts
(e.g. userspace processes) unexpectedly as shown below:
process A process B
htab_map_update_elem()
htab_lock_bucket()
migrate_disable()
/* return 1 */
__this_cpu_inc_return()
/* preempted by B */
htab_map_update_elem()
/* the same bucket as A */
htab_lock_bucket()
migrate_disable()
/* return 2, so lock fails */
__this_cpu_inc_return()
return -EBUSY
A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
only checking the value of map_locked for nmi context. But it will
re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
through non-tracing program (e.g. fentry program).
One cannot use preempt_disable() to fix this issue as htab_use_raw_lock
being false causes the bucket lock to be a spin lock which can sleep and
does not work with preempt_disable().
Therefore, use migrate_disable() when using the spinlock instead of
preempt_disable() and defer fixing concurrent updates to when the kernel
has its own BPF memory allocator.
Fixes: 74d862b682f5 ("sched: Make migrate_disable/enable() independent of RT")
Reviewed-by: Hao Luo <haoluo@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Userspace execution is a valid quiescent state for RCU Tasks Trace,
but the scheduling-clock interrupt does not currently report such
quiescent states.
Of course, the scheduling-clock interrupt is not strictly speaking
userspace execution. However, the only way that this code is not
in a quiescent state is if something invoked rcu_read_lock_trace(),
and that would be reflected in the ->trc_reader_nesting field in
the task_struct structure. Furthermore, this field is checked by
rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in
turn invoked by rcu_note_voluntary_context_switch() in kernels building
at least one of the RCU Tasks flavors. It is therefore safe to invoke
rcu_tasks_trace_qs() from the rcu_sched_clock_irq().
But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU
Tasks, which lacks the read-side markers provided by RCU Tasks Trace.
This raises the possibility that an RCU Tasks grace period could start
after the interrupt from userspace execution, but before the call to
rcu_sched_clock_irq(). However, it turns out that this is safe because
the RCU Tasks grace period waits for an RCU grace period, which will
wait for the entire scheduling-clock interrupt handler, including any
RCU Tasks read-side critical section that this handler might contain.
This commit therefore updates the rcu_sched_clock_irq() function's
check for usermode execution and its call to rcu_tasks_classic_qs()
to instead check for both usermode execution and interrupt from idle,
and to instead call rcu_note_voluntary_context_switch(). This
consolidates code and provides more faster RCU Tasks Trace
reporting of quiescent states in kernels that do scheduling-clock
interrupts for userspace execution.
[ paulmck: Consolidate checks into rcu_sched_clock_irq(). ]
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The RCU Tasks Trace grace-period kthread loops across all CPUs, and
there can be quite a few CPUs, with some commercially available systems
sporting well over a thousand of them. Some of these loops can feature
IPIs, which can take some time. This commit therefore places a call to
cond_resched_tasks_rcu_qs() in each such loop.
Link: https://docs.google.com/document/d/1V0YnG1HTWMt9WHJjroiJL9lf-hMrud4v8Fn3fhyY0cI/edit?usp=sharing
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Kernels built with CONFIG_PROVE_RCU=y and CONFIG_DEBUG_LOCK_ALLOC=y
attempt to emit a warning when the synchronize_rcu_tasks_generic()
function is called during early boot while the rcu_scheduler_active
variable is RCU_SCHEDULER_INACTIVE. However the warnings is not
actually be printed because the debug_lockdep_rcu_enabled() returns
false, exactly because the rcu_scheduler_active variable is still equal
to RCU_SCHEDULER_INACTIVE.
This commit therefore replaces RCU_LOCKDEP_WARN() with WARN_ONCE()
to force these warnings to actually be printed.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit makes Tiny SRCU use full-sized grace-period counters to
further avoid counter-wrap issues when using polled grace-period APIs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit applies the more-precise grace-period-state check used by
rcu_seq_done_exact() to poll_state_synchronize_srcu(). This is important
because Tiny SRCU uses a 16-bit counter, which can wrap quite quickly.
If counter wrap continues to be a problem, then expanding ->srcu_idx
and ->srcu_idx_max to 32 bits might be warranted.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit brings the "srcud" (dynamically allocated) SRCU test in line
with the "srcu" (statically allocated) test, so that both test the full
SRCU polled grace-period API.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
RCU's polled grace-period API is reasonably lightweight, but still
contains heavyweight memory barriers. This commit therefore limits
testing of this API from rcutorture's readers in order to avoid the
false negatives that these heavyweight operations could provoke.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit expands the rcu_torture_write_types() function's first "if"
condition and body, placing one element per line, in order to make the
compiler's error messages more helpful.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit changes the use of gp_poll_exp to gp_poll_exp1 in the first
check in rcu_torture_write_types(). No functional effect, but consistency
is a good thing.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Large systems can have hundreds of rcu_node structures, and updating
counters in each of them might slow down booting. This commit therefore
updates only the counters in those rcu_node structures corresponding
to the boot CPU, up to and including the root rcu_node structure.
The counters for the remaining rcu_node structures are updated by the
rcu_scheduler_starting() function, which executes just before the first
non-boot kthread is spawned.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Now that rcu_gp_oldstate can accurately track both normal and
expedited grace periods regardless of system state, rcutorture's
rcu_poll_need_2gp() function need only call for a second grace period
for the old single-unsigned-long grace-period polling APIs
This commit therefore adjusts rcu_poll_need_2gp() accordingly.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Because both normal and expedited grace periods increment their respective
counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate
structure no longer needs its ->rgos_polled field. This commit therefore
removes this field, shrinking this structure so that it is the same size
as an rcu_head structure.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit causes the early boot single-CPU synchronize_rcu_expedited()
fastpath to update the rcu_state structure's ->expedited_sequence
counter. This will allow the full-state polled grace-period APIs to
detect all expedited grace periods without the need to track the special
combined polling-only counter, which is another step towards removing
the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its
size by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Now that the expedited grace-period fast path can only happen during
the pre-scheduler portion of early boot, this fast path can no longer
block run-time RCU Trace grace periods. This commit therefore removes
the conditional cond_resched() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit causes the early boot single-CPU synchronize_rcu() fastpath to
update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed
counters. This will allow the full-state polled grace-period APIs to
detect all normal grace periods without the need to track the special
combined polling-only counter, which is a step towards removing the
->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size
by one third.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Now that the grace-period fast path can only happen during the
pre-scheduler portion of early boot, this fast path can no longer block
run-time RCU Tasks and RCU Tasks Trace grace periods. This commit
therefore removes the conditional cond_resched_tasks_rcu_qs() invocation.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
It would be good do reduce the size of the rcu_gp_oldstate structure
from three unsigned long instances to two, but this requires that the
boot-time optimized grace periods update the various ->gp_seq fields.
Updating these fields in the rcu_state structure and in all of the
rcu_node structures is at least semi-reasonable, but updating them in
all of the rcu_data structures is a bridge too far. This means that if
there are too many early boot-time grace periods, the ->gp_seq field in
the rcu_data structure cannot be trusted. This commit therefore sets
each rcu_data structure's ->gpwrap field to provide the necessary impetus
for a suitable level of distrust.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The run-time single-CPU grace-period optimization applies only to
kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running
on a single-CPU system. But a kernel intended for a single-CPU system
should instead be built with CONFIG_SMP=n, and in any case, single-CPU
systems running Linux no longer appear to be the common case. Plus this
optimization results in the rcu_gp_oldstate structure being half again
larger than it needs to be.
This commit therefore disables the run-time single-CPU grace-period
optimization, so that this optimization applies only during the
pre-scheduler portion of the boot sequence.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The cond_synchronize_rcu_expedited() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_exp_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The cond_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds yet another member of the full-state RCU
grace-period polling API, which is the cond_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit removes the blank line preceding the oldstate parameter to
the docbook header for the poll_state_synchronize_rcu() function and
marks uses of this parameter later in that header.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The start_poll_synchronize_rcu_expedited() API compresses the combined
expedited and normal grace-period states into a single unsigned long,
which conserves storage, but can miss grace periods in certain cases
involving overlapping normal and expedited grace periods. Missing the
occasional grace period is usually not a problem, but there are use
cases that care about each and every grace period.
This commit therefore adds yet another member of the
full-state RCU grace-period polling API, which is the
start_poll_synchronize_rcu_expedited_full() function. This uses up to
three times the storage (rcu_gp_oldstate structure instead of unsigned
long), but is guaranteed not to miss grace periods.
[ paulmck: Apply feedback from kernel test robot and Julia Lawall. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The start_poll_synchronize_rcu() API compresses the combined expedited and
normal grace-period states into a single unsigned long, which conserves
storage, but can miss grace periods in certain cases involving overlapping
normal and expedited grace periods. Missing the occasional grace period
is usually not a problem, but there are use cases that care about each
and every grace period.
This commit therefore adds the next member of the full-state RCU
grace-period polling API, namely the start_poll_synchronize_rcu_full()
function. This uses up to three times the storage (rcu_gp_oldstate
structure instead of unsigned long), but is guaranteed not to miss
grace periods.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit adds full-state polling checks to accompany the old-style
polling checks in the rcu_torture_one_read() function. If a polling
cycle within an RCU reader completes, a WARN_ONCE() is triggered.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This check does nothing because the state at this point in the code
because the rcu_torture_writer_state value is guaranteed to instead
be RTWS_REPLACE. This commit therefore removes this check.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit adds a test to rcu_torture_writer() that verifies that a
->get_gp_state_full() and ->poll_gp_state_full() polled grace-period
sequence does not claim that a grace period elapsed within the confines
of the corresponding read-side critical section.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|