summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-09-27mm: Introduce mm_struct.has_pinnedPeter Xu
(Commit message majorly collected from Jason Gunthorpe) Reduce the chance of false positive from page_maybe_dma_pinned() by keeping track if the mm_struct has ever been used with pin_user_pages(). This allows cases that might drive up the page ref_count to avoid any penalty from handling dma_pinned pages. Future work is planned, to provide a more sophisticated solution, likely to turn it into a real counter. For now, make it atomic_t but use it as a boolean for simplicity. Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-25Merge tag 'pm-5.9-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix more fallout of recent RCU-lockdep changes in CPU idle code and two devfreq issues. Specifics: - Export rcu_idle_{enter,exit} to modules to fix build issues introduced by recent RCU-lockdep fixes (Borislav Petkov) - Add missing return statement to a stub function in the ACPI processor driver to fix a build issue introduced by recent RCU-lockdep fixes (Rafael Wysocki) - Fix recently introduced suspicious RCU usage warnings in the PSCI cpuidle driver and drop stale comments regarding RCU_NONIDLE() usage from enter_s2idle_proper() (Ulf Hansson) - Fix error code path in the tegra30 devfreq driver (Dan Carpenter) - Add missing information to devfreq_summary debugfs (Chanwoo Choi)" * tag 'pm-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: processor: Fix build for ARCH_APICTIMER_STOPS_ON_C3 unset PM / devfreq: tegra30: Disable clock on error in probe PM / devfreq: Add timer type to devfreq_summary debugfs cpuidle: Drop misleading comments about RCU usage cpuidle: psci: Fix suspicious RCU usage rcu/tree: Export rcu_idle_{enter,exit} to modules
2020-09-25block: add a bdev_is_partition helperChristoph Hellwig
Add a littler helper to make the somewhat arcane bd_contains checks a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQPeter Oskolkov
This patchset is based on Google-internal RSEQ work done by Paul Turner and Andrew Hunter. When working with per-CPU RSEQ-based memory allocations, it is sometimes important to make sure that a global memory location is no longer accessed from RSEQ critical sections. For example, there can be two per-CPU lists, one is "active" and accessed per-CPU, while another one is inactive and worked on asynchronously "off CPU" (e.g. garbage collection is performed). Then at some point the two lists are swapped, and a fast RCU-like mechanism is required to make sure that the previously active list is no longer accessed. This patch introduces such a mechanism: in short, membarrier() syscall issues an IPI to a CPU, restarting a potentially active RSEQ critical section on the CPU. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
2020-09-25sched/fair: Use dst group while checking imbalance for NUMA balancerBarry Song
Barry Song noted the following Something is wrong. In find_busiest_group(), we are checking if src has higher load, however, in task_numa_find_cpu(), we are checking if dst will have higher load after balancing. It seems it is not sensible to check src. It maybe cause wrong imbalance value, for example, if dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and src_running = env->src_stats.nr_running - 1 results in 1; The current code is thinking imbalance as 0 since src_running is smaller than 2. This is inconsistent with load balancer. Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving a task "from an almost idle domain" to a "domain with spare capacity". This patch forbids movement "from a misplaced domain" to "an almost idle domain" as that is closer to what the CPU load balancer expects. This patch is not a universal win. The old behaviour was intended to allow a task from an almost idle NUMA node to migrate to its preferred node if the destination had capacity but there are corner cases. For example, a NAS compute load could be parallelised to use 1/3rd of available CPUs but not all those potential tasks are active at all times allowing this logic to trigger. An obvious example is specjbb 2005 running various numbers of warehouses on a 2 socket box with 80 cpus. specjbb 5.9.0-rc4 5.9.0-rc4 vanilla dstbalance-v1r1 Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%* Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%* Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%* Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%* Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%* Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%* Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%* Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%* Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%* Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%* Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%* Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%* Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%* Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%* Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%* Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%* Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%* Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%* Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%* Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%* Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%* Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%* Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%* Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%* Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%* Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%* Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%* Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%* Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%* Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%* Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%* Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%* Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%* Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%* Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%* Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%* Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%* Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%* Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%* Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%* Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%* Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%* Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%* Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%* Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%* Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%* Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%* Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%* Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%* Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%* Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%* Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%* Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%* Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%* Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%* Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%* Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%* Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%* Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%* Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%* Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%* Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%* Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%* Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%* Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%* Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%* Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%* Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%* Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%* Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%* Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%* Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%* Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%* Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%* Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%* Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%* Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%* Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%* Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%* Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%* The performance at the mid-point is better but not universally better. The patch is a mixed bag depending on the workload, machine and overall levels of utilisation. Sometimes it's better (sometimes much better), other times it is worse (sometimes much worse). Given that there isn't a universally good decision in this section and more people seem to prefer the patch then it may be best to keep the LB decisions consistent and revisit imbalance handling when the load balancer code changes settle down. Jirka Hladky added the following observation. Our results are mostly in line with what you see. We observe big gains (20-50%) when the system is loaded to 1/3 of the maximum capacity and mixed results at the full load - some workloads benefit from the patch at the full load, others not, but performance changes at the full load are mostly within the noise of results (+/-5%). Overall, we think this patch is helpful. [mgorman@techsingularity.net: Rewrote changelog] Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity") Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
2020-09-25sched/fair: Reduce busy load balance intervalVincent Guittot
The busy_factor, which increases load balance interval when a cpu is busy, is set to 32 by default. This value generates some huge LB interval on large system like the THX2 made of 2 node x 28 cores x 4 threads. For such system, the interval increases from 112ms to 3584ms at MC level. And from 228ms to 7168ms at NUMA level. Even on smaller system, a lower busy factor has shown improvement on the fair distribution of the running time so let reduce it for all. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
2020-09-25sched/fair: Minimize concurrent LBs between domain levelVincent Guittot
sched domains tend to trigger simultaneously the load balance loop but the larger domains often need more time to collect statistics. This slowness makes the larger domain trying to detach tasks from a rq whereas tasks already migrated somewhere else at a sub-domain level. This is not a real problem for idle LB because the period of smaller domains will increase with its CPUs being busy and this will let time for higher ones to pulled tasks. But this becomes a problem when all CPUs are already busy because all domains stay synced when they trigger their LB. A simple way to minimize simultaneous LB of all domains is to decrement the the busy interval by 1 jiffies. Because of the busy_factor, the interval of larger domain will not be a multiple of smaller ones anymore. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
2020-09-25sched/fair: Reduce minimal imbalance thresholdVincent Guittot
The 25% default imbalance threshold for DIE and NUMA domain is large enough to generate significant unfairness between threads. A typical example is the case of 11 threads running on 2x4 CPUs. The imbalance of 20% between the 2 groups of 4 cores is just low enough to not trigger the load balance between the 2 groups. We will have always the same 6 threads on one group of 4 CPUs and the other 5 threads on the other group of CPUS. With a fair time sharing in each group, we ends up with +20% running time for the group of 5 threads. Consider decreasing the imbalance threshold for overloaded case where we use the load to balance task and to ensure fair time sharing. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Hillf Danton <hdanton@sina.com> Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
2020-09-25sched/fair: Relax constraint on task's load during load balanceVincent Guittot
Some UCs like 9 always running tasks on 8 CPUs can't be balanced and the load balancer currently migrates the waiting task between the CPUs in an almost random manner. The success of a rq pulling a task depends of the value of nr_balance_failed of its domains and its ability to be faster than others to detach it. This behavior results in an unfair distribution of the running time between tasks because some CPUs will run most of the time, if not always, the same task whereas others will share their time between several tasks. Instead of using nr_balance_failed as a boolean to relax the condition for detaching task, the LB will use nr_balanced_failed to relax the threshold between the tasks'load and the imbalance. This mecanism prevents the same rq or domain to always win the load balance fight. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
2020-09-25sched/fair: Remove the force parameter of update_tg_load_avg()Xianting Tian
In the file fair.c, sometims update_tg_load_avg(cfs_rq, 0) is used, sometimes update_tg_load_avg(cfs_rq, false) is used. update_tg_load_avg() has the parameter force, but in current code, it never set 1 or true to it, so remove the force parameter. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
2020-09-25sched/fair: Fix wrong cpu selecting from isolated domainXunlei Pang
We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Steps to reproduce on my 31-CPU hyperthreads machine: 1. with boot parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17) 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" 3. some threads will be migrated to the isolated cpu16~17. Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
2020-09-25sched: Remove unused inline function uclamp_bucket_base_value()YueHaibing
There is no caller in tree, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
2020-09-25sched/rt: Disable RT_RUNTIME_SHARE by defaultDaniel Bristot de Oliveira
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime between CPUs, allowing a CPU to run a real-time task up to 100% of the time while leaving more space for non-real-time tasks to run on the CPU that lend rt_runtime. The problem is that a CPU can easily borrow enough rt_runtime to allow a spinning rt-task to run forever, starving per-cpu tasks like kworkers, which are non-real-time by design. This patch disables RT_RUNTIME_SHARE by default, avoiding this problem. The feature will still be present for users that want to enable it, though. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Wei Wang <wvw@google.com> Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
2020-09-25sched/deadline: Fix stale throttling on de-/boosted tasksLucas Stach
When a boosted task gets throttled, what normally happens is that it's immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the runtime and clears the dl_throttled flag. There is a special case however: if the throttling happened on sched-out and the task has been deboosted in the meantime, the replenish is skipped as the task will return to its normal scheduling class. This leaves the task with the dl_throttled flag set. Now if the task gets boosted up to the deadline scheduling class again while it is sleeping, it's still in the throttled state. The normal wakeup however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't actually place it on the rq. Thus we end up with a task that is runnable, but not actually on the rq and neither a immediate replenishment happens, nor is the replenishment timer set up, so the task is stuck in forever-throttled limbo. Clear the dl_throttled flag before dropping back to the normal scheduling class to fix this issue. Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
2020-09-25sched/numa: Use runnable_avg to classify nodeVincent Guittot
Use runnable_avg to classify numa node state similarly to what is done for normal load balancer. This helps to ensure that numa and normal balancers use the same view of the state of the system. Large arm64system: 2 nodes / 224 CPUs: hackbench -l (256000/#grp) -g #grp grp tip/sched/core +patchset improvement 1 14,008(+/- 4,99 %) 13,800(+/- 3.88 %) 1,48 % 4 4,340(+/- 5.35 %) 4.283(+/- 4.85 %) 1,33 % 16 3,357(+/- 0.55 %) 3.359(+/- 0.54 %) -0,06 % 32 3,050(+/- 0.94 %) 3.039(+/- 1,06 %) 0,38 % 64 2.968(+/- 1,85 %) 3.006(+/- 2.92 %) -1.27 % 128 3,290(+/-12.61 %) 3,108(+/- 5.97 %) 5.51 % 256 3.235(+/- 3.95 %) 3,188(+/- 2.83 %) 1.45 % Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
2020-09-24timers: Mask invalid flags in do_init_timer()Qianli Zhao
do_init_timer() accepts any combination of timer flags handed in by the caller without a sanity check, but only TIMER_DEFFERABLE, TIMER_PINNED and TIMER_IRQSAFE are valid. If the supplied flags have other bits set, this could result in malfunction. If bits are set in TIMER_CPUMASK the first timer usage could deference a cpu base which is outside the range of possible CPUs. If TIMER_MIGRATION is set, then the switch_timer_base() will live lock. Prevent that with a sanity check which warns when invalid flags are supplied and masks them out. [ tglx: Made it WARN_ON_ONCE() and added context to the changelog ] Signed-off-by: Qianli Zhao <zhaoqianli@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/9d79a8aa4eb56713af7379f99f062dedabcde140.1597326756.git.zhaoqianli@xiaomi.com
2020-09-24treewide: Make all debug_obj_descriptors constStephen Boyd
This should make it harder for the kernel to corrupt the debug object descriptor, used to call functions to fixup state and track debug objects, by moving the structure to read-only memory. Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20200815004027.2046113-3-swboyd@chromium.org
2020-09-24kprobes: Use module_name() macroJarkko Sakkinen
It is advised to use module_name() macro instead of dereferencing mod->name directly. This makes sense for consistencys sake and also it prevents a hard dependency to CONFIG_MODULES. Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200818050857.117998-1-jarkko.sakkinen@linux.intel.com
2020-09-23PM: mm: cleanup swsusp_swap_checkChristoph Hellwig
Use blkdev_get_by_dev instead of bdget + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23mm: split swap_type_ofChristoph Hellwig
swap_type_of is used for two entirely different purposes: (1) check what swap type a given device/offset corresponds to (2) find the first available swap device that can be written to Mixing both in a single function creates an unreadable mess. Create two separate functions instead, and switch both to pass a dev_t instead of a struct block_device to further simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23PM: rewrite is_hibernate_resume_dev to not require an inodeChristoph Hellwig
Just check the dev_t to help simplifying the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from Jakub Kicinski: - fix failure to add bond interfaces to a bridge, the offload-handling code was too defensive there and recent refactoring unearthed that. Users complained (Ido) - fix unnecessarily reflecting ECN bits within TOS values / QoS marking in TCP ACK and reset packets (Wei) - fix a deadlock with bpf iterator. Hopefully we're in the clear on this front now... (Yonghong) - BPF fix for clobbering r2 in bpf_gen_ld_abs (Daniel) - fix AQL on mt76 devices with FW rate control and add a couple of AQL issues in mac80211 code (Felix) - fix authentication issue with mwifiex (Maximilian) - WiFi connectivity fix: revert IGTK support in ti/wlcore (Mauro) - fix exception handling for multipath routes via same device (David Ahern) - revert back to a BH spin lock flavor for nsid_lock: there are paths which do require the BH context protection (Taehee) - fix interrupt / queue / NAPI handling in the lantiq driver (Hauke) - fix ife module load deadlock (Cong) - make an adjustment to netlink reply message type for code added in this release (the sole change touching uAPI here) (Michal) - a number of fixes for small NXP and Microchip switches (Vladimir) [ Pull request acked by David: "you can expect more of this in the future as I try to delegate more things to Jakub" ] * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (167 commits) net: mscc: ocelot: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries net: dsa: seville: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries net: dsa: felix: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU net: Update MAINTAINERS for MediaTek switch driver net/mlx5e: mlx5e_fec_in_caps() returns a boolean net/mlx5e: kTLS, Avoid kzalloc(GFP_KERNEL) under spinlock net/mlx5e: kTLS, Fix leak on resync error flow net/mlx5e: kTLS, Add missing dma_unmap in RX resync net/mlx5e: kTLS, Fix napi sync and possible use-after-free net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats() net/mlx5e: Fix multicast counter not up-to-date in "ip -s" net/mlx5e: Fix endianness when calculating pedit mask first bit net/mlx5e: Enable adding peer miss rules only if merged eswitch is supported net/mlx5e: CT: Fix freeing ct_label mapping net/mlx5e: Fix memory leak of tunnel info when rule under multipath not ready net/mlx5e: Use synchronize_rcu to sync with NAPI net/mlx5e: Use RCU to protect rq->xdp_prog ...
2020-09-22x86/entry: Fix typo in comments for syscall_enter_from_user_mode()Kees Cook
Just to help myself and others with finding the correct function names, fix a typo for "usermode" vs "user_mode". Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200919080936.259819-1-keescook@chromium.org
2020-09-22Merge tag 'trace-v5.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Check kprobe is enabled before unregistering from ftrace as it isn't registered when disabled. - Remove kprobes enabled via command-line that is on init text when freed. - Add missing RCU synchronization for ftrace trampoline symbols removed from kallsyms. - Free trampoline on error path if ftrace_startup() fails. - Give more space for the longer PID numbers in trace output. - Fix a possible double free in the histogram code. - A couple of fixes that were discovered by sparse. * tag 'trace-v5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: init: make xbc_namebuf static kprobes: tracing/kprobes: Fix to kill kprobes on initmem after boot tracing: fix double free ftrace: Let ftrace_enable_sysctl take a kernel pointer buffer tracing: Make the space reserved for the pid wider ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms ftrace: Free the trampoline when ftrace_startup() fails kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()
2020-09-22printk: remove dict ringJohn Ogness
Since there is no code that will ever store anything into the dict ring, remove it. If any future dictionary properties are to be added, these should be added to the struct printk_info. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200918223421.21621-4-john.ogness@linutronix.de
2020-09-22printk: move dictionary keys to dev_printk_infoJohn Ogness
Dictionaries are only used for SUBSYSTEM and DEVICE properties. The current implementation stores the property names each time they are used. This requires more space than otherwise necessary. Also, because the dictionary entries are currently considered optional, it cannot be relied upon that they are always available, even if the writer wanted to store them. These issues will increase should new dictionary properties be introduced. Rather than storing the subsystem and device properties in the dict ring, introduce a struct dev_printk_info with separate fields to store only the property values. Embed this struct within the struct printk_info to provide guaranteed availability. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/87mu1jl6ne.fsf@jogness.linutronix.de
2020-09-22printk: move printk_info into separate arrayJohn Ogness
The majority of the size of a descriptor is taken up by meta data, which is often not of interest to the ringbuffer (for example, when performing state checks). Since descriptors are often temporarily stored on the stack, keeping their size minimal will help reduce stack pressure. Rather than embedding the printk_info into the descriptor, create a separate printk_info array. The index of a descriptor in the descriptor array corresponds to the printk_info with the same index in the printk_info array. The rules for validity of a printk_info match the existing rules for the data blocks: the descriptor must be in a consistent state. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20200918223421.21621-2-john.ogness@linutronix.de
2020-09-21bpf: Fix sysfs export of empty BTF sectionTony Ambardar
If BTF data is missing or removed from the ELF section it is still exported via sysfs as a zero-length file: root@OpenWrt:/# ls -l /sys/kernel/btf/vmlinux -r--r--r-- 1 root root 0 Jul 18 02:59 /sys/kernel/btf/vmlinux Moreover, reads from this file succeed and leak kernel data: root@OpenWrt:/# hexdump -C /sys/kernel/btf/vmlinux|head -10 000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 000cc0 00 00 00 00 00 00 00 00 00 00 00 00 80 83 b0 80 |................| 000cd0 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 000ce0 00 00 00 00 00 00 00 00 00 00 00 00 57 ac 6e 9d |............W.n.| 000cf0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 002650 00 00 00 00 00 00 00 10 00 00 00 01 00 00 00 01 |................| 002660 80 82 9a c4 80 85 97 80 81 a9 51 68 00 00 00 02 |..........Qh....| 002670 80 25 44 dc 80 85 97 80 81 a9 50 24 81 ab c4 60 |.%D.......P$...`| This situation was first observed with kernel 5.4.x, cross-compiled for a MIPS target system. Fix by adding a sanity-check for export of zero-length data sections. Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs") Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/b38db205a66238f70823039a8c531535864eaac5.1600417359.git.Tony.Ambardar@gmail.com
2020-09-21Merge branch 'rcu/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This contains a single commit that fixes a bug that was introduced in the last merge window. This bug causes a compiler warning complaining about show_rcu_tasks_classic_gp_kthread() being an unused static function in !SMP kernels. The fix is straightforward, just adding an 'inline' to make this a static inline function, thus avoiding the warning. This bug was reported by Laurent Pinchart, who would like it fixed sooner rather than later" * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread()
2020-09-21rcu/tree: Export rcu_idle_{enter,exit} to modulesBorislav Petkov
Fix this link error: ERROR: modpost: "rcu_idle_enter" [drivers/acpi/processor.ko] undefined! ERROR: modpost: "rcu_idle_exit" [drivers/acpi/processor.ko] undefined! when CONFIG_ACPI_PROCESSOR is built as module. PeterZ says that in light of ARM needing those soon too, they should simply be exported. Fixes: 1fecfdbb7acc ("ACPI: processor: Take over RCU-idle for C3-BM idle") Reported-by: Sven Joachim <svenjoac@gmx.de> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paul E. McKenney <paulmckrcu@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-09-20Merge tag 'core_urgent_for_v5.9_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull syscall tracing fix from Borislav Petkov: "Fix the seccomp syscall rewriting so that trace and audit see the rewritten syscall number, from Kees Cook" * tag 'core_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: core/entry: Report syscall correctly for trace and audit
2020-09-20Merge tag 'locking_urgent_for_v5.9_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Borislav Petkov: "Two fixes from the locking/urgent pile: - Fix lockdep's detection of "USED" <- "IN-NMI" inversions (Peter Zijlstra) - Make percpu-rwsem operations on the semaphore's ->read_count IRQ-safe because it can be used in an IRQ context (Hou Tao)" * tag 'locking_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count locking/lockdep: Fix "USED" <- "IN-NMI" inversions
2020-09-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "15 patches. Subsystems affected by this patch series: mailmap, mm/hotfixes, mm/thp, mm/memory-hotplug, misc, kcsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments' fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype stackleak: let stack_erasing_sysctl take a kernel pointer buffer ftrace: let ftrace_enable_sysctl take a kernel pointer buffer mm/memory_hotplug: drain per-cpu pages again during memory offline selftests/vm: fix display of page size in map_hugetlb mm/thp: fix __split_huge_pmd_locked() for migration PMD kprobes: fix kill kprobe which has been marked as gone tmpfs: restore functionality of nr_inodes=0 mlock: fix unevictable_pgs event counts on THP mm: fix check_move_unevictable_pages() on THP mm: migration of hugetlbfs page skip memcg ksm: reinstate memcg charge on copied pages mailmap: add older email addresses for Kees Cook
2020-09-19stackleak: let stack_erasing_sysctl take a kernel pointer bufferTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the signature of stack_erasing_sysctl to match ctl_table.proc_handler which fixes the following sparse warning: kernel/stackleak.c:31:50: warning: incorrect type in argument 3 (different address spaces) kernel/stackleak.c:31:50: expected void * kernel/stackleak.c:31:50: got void [noderef] __user *buffer Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200907093253.13656-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19ftrace: let ftrace_enable_sysctl take a kernel pointer bufferTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the signature of ftrace_enable_sysctl to match ctl_table.proc_handler which fixes the following sparse warning: kernel/trace/ftrace.c:7544:43: warning: incorrect type in argument 3 (different address spaces) kernel/trace/ftrace.c:7544:43: expected void * kernel/trace/ftrace.c:7544:43: got void [noderef] __user *buffer Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200907093207.13540-1-tklauser@distanz.ch Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19kprobes: fix kill kprobe which has been marked as goneMuchun Song
If a kprobe is marked as gone, we should not kill it again. Otherwise, we can disarm the kprobe more than once. In that case, the statistics of kprobe_ftrace_enabled can unbalance which can lead to that kprobe do not work. Fixes: e8386a0cb22f ("kprobes: support probing module __exit function") Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Song Liu <songliubraving@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200822030055.32383-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-18Merge tag 's390-5.9-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix order in trace_hardirqs_off_caller() to make locking state consistent even if the IRQ tracer calls into lockdep again. Touches common code. Acked-by Peter Zijlstra. - Correctly handle secure storage violation exception to avoid kernel panic triggered by user space misbehaviour. - Switch the idle->seqcount over to using raw_write_*() to avoid "suspicious RCU usage". - Fix memory leaks on hard unplug in pci code. - Use kvmalloc instead of kmalloc for larger allocations in zcrypt. - Add few missing __init annotations to static functions to avoid section mismatch complains when functions are not inlined. * tag 's390-5.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: add 3f program exception handler lockdep: fix order in trace_hardirqs_off_caller() s390/pci: fix leak of DMA tables on hard unplug s390/init: add missing __init annotations s390/zcrypt: fix kmalloc 256k failure s390/idle: fix suspicious RCU usage
2020-09-18kprobes: tracing/kprobes: Fix to kill kprobes on initmem after bootMasami Hiramatsu
Since kprobe_event= cmdline option allows user to put kprobes on the functions in initmem, kprobe has to make such probes gone after boot. Currently the probes on the init functions in modules will be handled by module callback, but the kernel init text isn't handled. Without this, kprobes may access non-exist text area to disable or remove it. Link: https://lkml.kernel.org/r/159972810544.428528.1839307531600646955.stgit@devnote2 Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18tracing: fix double freeTom Rix
clang static analyzer reports this problem trace_events_hist.c:3824:3: warning: Attempt to free released memory kfree(hist_data->attrs->var_defs.name[i]); In parse_var_defs() if there is a problem allocating var_defs.expr, the earlier var_defs.name is freed. This free is duplicated by free_var_defs() which frees the rest of the list. Because free_var_defs() has to run anyway, remove the second free fom parse_var_defs(). Link: https://lkml.kernel.org/r/20200907135845.15804-1-trix@redhat.com Cc: stable@vger.kernel.org Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers") Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18ftrace: Let ftrace_enable_sysctl take a kernel pointer bufferTobias Klauser
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the signature of ftrace_enable_sysctl to match ctl_table.proc_handler which fixes the following sparse warning: kernel/trace/ftrace.c:7544:43: warning: incorrect type in argument 3 (different address spaces) kernel/trace/ftrace.c:7544:43: expected void * kernel/trace/ftrace.c:7544:43: got void [noderef] __user *buffer Link: https://lkml.kernel.org/r/20200907093207.13540-1-tklauser@distanz.ch Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18tracing: Make the space reserved for the pid widerSebastian Andrzej Siewior
For 64bit CONFIG_BASE_SMALL=0 systems PID_MAX_LIMIT is set by default to 4194304. During boot the kernel sets a new value based on number of CPUs but no lower than 32768. It is 1024 per CPU so with 128 CPUs the default becomes 131072 which needs six digits. This value can be increased during run time but must not exceed the initial upper limit. Systemd sometime after v241 sets it to the upper limit during boot. The result is that when the pid exceeds five digits, the trace output is a little hard to read because it is no longer properly padded (same like on big iron with 98+ CPUs). Increase the pid padding to seven digits. Link: https://lkml.kernel.org/r/20200904082331.dcdkrr3bkn3e4qlg@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18ftrace: Fix missing synchronize_rcu() removing trampoline from kallsymsAdrian Hunter
Add synchronize_rcu() after list_del_rcu() in ftrace_remove_trampoline_from_kallsyms() to protect readers of ftrace_ops_trampoline_list (in ftrace_get_trampoline_kallsym) which is used when kallsyms is read. Link: https://lkml.kernel.org/r/20200901091617.31837-1-adrian.hunter@intel.com Fixes: fc0ea795f53c8d ("ftrace: Add symbols for ftrace trampolines") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18ftrace: Free the trampoline when ftrace_startup() failsMiroslav Benes
Commit fc0ea795f53c ("ftrace: Add symbols for ftrace trampolines") missed to remove ops from new ftrace_ops_trampoline_list in ftrace_startup() if ftrace_hash_ipmodify_enable() fails there. It may lead to BUG if such ops come from a module which may be removed. Moreover, the trampoline itself is not freed in this case. Fix it by calling ftrace_trampoline_free() during the rollback. Link: https://lkml.kernel.org/r/20200831122631.28057-1-mbenes@suse.cz Fixes: fc0ea795f53c ("ftrace: Add symbols for ftrace trampolines") Fixes: f8b8be8a310a ("ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict") Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()Masami Hiramatsu
Commit 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") fixed one bug but not completely fixed yet. If we run a kprobe_module.tc of ftracetest, kernel showed a warning as below. # ./ftracetest test.d/kprobe/kprobe_module.tc === Ftrace unit tests === [1] Kprobe dynamic event - probing module ... [ 22.400215] ------------[ cut here ]------------ [ 22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2) [ 22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.403358] Modules linked in: trace_printk(-) [ 22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66 [ 22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 [ 22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7 [ 22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286 [ 22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000 [ 22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3 [ 22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001 [ 22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640 [ 22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000 [ 22.415941] FS: 00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000 [ 22.417264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0 [ 22.419309] Call Trace: [ 22.419990] kill_kprobe+0x94/0x160 [ 22.420652] kprobes_module_callback+0x64/0x230 [ 22.421470] notifier_call_chain+0x4f/0x70 [ 22.422184] blocking_notifier_call_chain+0x49/0x70 [ 22.422979] __x64_sys_delete_module+0x1ac/0x240 [ 22.423733] do_syscall_64+0x38/0x50 [ 22.424366] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 22.425176] RIP: 0033:0x4bb81d [ 22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48 [ 22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 [ 22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d [ 22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028 [ 22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320 [ 22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028 [ 22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000 [ 22.435682] irq event stamp: 1169 [ 22.436240] hardirqs last enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580 [ 22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580 [ 22.438608] softirqs last enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490 [ 22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20 [ 22.440690] ---[ end trace 1e7ce7e1e4567276 ]--- [ 22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue. This is because the kill_kprobe() calls disarm_kprobe_ftrace() even if the given probe is not enabled. In that case, ftrace_set_filter_ip() fails because the given probe point is not registered to ftrace. Fix to check the given (going) probe is enabled before invoking disarm_kprobe_ftrace(). Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2 Fixes: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-09-18stacktrace: Remove reliable argument from arch_stack_walk() callbackMark Brown
Currently the callback passed to arch_stack_walk() has an argument called reliable passed to it to indicate if the stack entry is reliable, a comment says that this is used by some printk() consumers. However in the current kernel none of the arch_stack_walk() implementations ever set this flag to true and the only callback implementation we have is in the generic stacktrace code which ignores the flag. It therefore appears that this flag is redundant so we can simplify and clarify things by removing it. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20200914153409.25097-2-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2020-09-17mm: allow a controlled amount of unfairness in the page lockLinus Torvalds
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made the page locking entirely fair, in that if a waiter came in while the lock was held, the lock would be transferred to the lockers strictly in order. That was intended to finally get rid of the long-reported watchdog failures that involved the page lock under extreme load, where a process could end up waiting essentially forever, as other page lockers stole the lock from under it. It also improved some benchmarks, but it ended up causing huge performance regressions on others, simply because fair lock behavior doesn't end up giving out the lock as aggressively, causing better worst-case latency, but potentially much worse average latencies and throughput. Instead of reverting that change entirely, this introduces a controlled amount of unfairness, with a sysctl knob to tune it if somebody needs to. But the default value should hopefully be good for any normal load, allowing a few rounds of lock stealing, but enforcing the strict ordering before the lock has been stolen too many times. There is also a hint from Matthieu Baerts that the fair page coloring may end up exposing an ABBA deadlock that is hidden by the usual optimistic lock stealing, and while the unfairness doesn't fix the fundamental issue (and I'm still looking at that), it avoids it in practice. The amount of unfairness can be modified by writing a new value to the 'sysctl_page_lock_unfairness' variable (default value of 5, exposed through /proc/sys/vm/page_lock_unfairness), but that is hopefully something we'd use mainly for debugging rather than being necessary for any deep system tuning. This whole issue has exposed just how critical the page lock can be, and how contended it gets under certain locks. And the main contention doesn't really seem to be anything related to IO (which was the origin of this lock), but for things like just verifying that the page file mapping is stable while faulting in the page into a page table. Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/ Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1 Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/ Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-17quota: simplify the quotactl compat handlingChristoph Hellwig
Fold the misaligned u64 workarounds into the main quotactl flow instead of implementing a separate compat syscall handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-17Merge remote-tracking branch 'origin/irq/gic-retrigger' into irq/irqchip-nextMarc Zyngier
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-09-16rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread()Paul E. McKenney
Commit 8344496e8b49 ("rcu-tasks: Conditionally compile show_rcu_tasks_gp_kthreads()") introduced conditional compilation of several functions, but forgot one occurrence of show_rcu_tasks_classic_gp_kthread() that causes the compiler to warn of an unused static function. This commit uses "static inline" to avoid these complaints and possibly also to avoid emitting an actual definition of this function. Fixes: 8344496e8b49 ("rcu-tasks: Conditionally compile show_rcu_tasks_gp_kthreads()") Cc: <stable@vger.kernel.org> # 5.8.x Reported-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-09-16irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()Thomas Gleixner
To support MSI irq domains which do not fit at all into the regular MSI irqdomain scheme, like the XEN MSI interrupt management for PV/HVM/DOM0, it's necessary to allow to override the alloc/free implementation. This is a preperatory step to switch X86 away from arch_*_msi_irqs() and store the irq domain pointer right in struct device. No functional change for existing MSI irq domain users. Aside of the evil XEN wrapper this is also useful for special MSI domains which need to do extra alloc/free work before/after calling the generic core function. Work like allocating/freeing MSI descriptors, MSI storage space etc. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200826112333.526797548@linutronix.de