summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-07-14Revert "tracing: Add "(fault)" name injection to kernel probes"Masami Hiramatsu (Google)
This reverts commit 2e9906f84fc7c99388bb7123ade167250d50f1c0. It was turned out that commit 2e9906f84fc7 ("tracing: Add "(fault)" name injection to kernel probes") did not work correctly and probe events still show just '(fault)' (instead of '"(fault)"'). Also, current '(fault)' is more explicit that it faulted. This also moves FAULT_STRING macro to trace.h so that synthetic event can keep using it, and uses it in trace_probe.c too. Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/ Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to update dynamic data counter if fetcharg uses itMasami Hiramatsu (Google)
Fix to update dynamic data counter ('dyndata') and max length ('maxlen') only if the fetcharg uses the dynamic data. Also get out arg->dynamic from unlikely(). This makes dynamic data address wrong if process_fetch_insn() returns error on !arg->dynamic case. Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/ Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix not to count error code to total lengthMasami Hiramatsu (Google)
Fix not to count the error code (which is minus value) to the total used length of array, because it can mess up the return code of process_fetch_insn_bottom(). Also clear the 'ret' value because it will be used for calculating next data_loc entry. Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14tracing/probes: Fix to avoid double count of the string length on the arrayMasami Hiramatsu (Google)
If an array is specified with the ustring or symstr, the length of the strings are accumlated on both of 'ret' and 'total', which means the length is double counted. Just set the length to the 'ret' value for avoiding double counting. Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-14fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is runningMasami Hiramatsu (Google)
Add a comment the reason why fprobe_kprobe_handler() exits if any other kprobe is running. Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230706120916.3c6abf15@gandalf.local.home/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13nvme-pci: fix DMA direction of unmapping integrity dataMing Lei
DMA direction should be taken in dma_unmap_page() for unmapping integrity data. Fix this DMA direction, and reported in Guangwu's test. Reported-by: Guangwu Zhang <guazhang@redhat.com> Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13nvme: don't reject probe due to duplicate IDs for single-ported PCIe devicesChristoph Hellwig
While duplicate IDs are still very harmful, including the potential to easily see changing devices in /dev/disk/by-id, it turn out they are extremely common for cheap end user NVMe devices. Relax our check for them for so that it doesn't reject the probe on single-ported PCIe devices, but prints a big warning instead. In doubt we'd still like to see quirk entries to disable the potential for changing supposed stable device identifier links, but this will at least allow users how have two (or more) of these devices to use them without having to manually add a new PCI ID entry with the quirk through sysfs or by patching the kernel. Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique") Cc: stable@vger.kernel.org # 6.0+ Co-developed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13tracing: Fix memory leak of iter->temp when reading trace_pipeZheng Yejian
kmemleak reports: unreferenced object 0xffff88814d14e200 (size 256): comm "cat", pid 336, jiffies 4294871818 (age 779.490s) hex dump (first 32 bytes): 04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00 ................ 0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff .........Z...... backtrace: [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140 [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0 [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0 [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950 [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0 [<ffffffff9bf03a43>] vfs_read+0x143/0x520 [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160 [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90 [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 when reading file 'trace_pipe', 'iter->temp' is allocated or relocated in trace_find_next_entry() but not freed before 'trace_pipe' is closed. To fix it, free 'iter->temp' in tracing_release_pipe(). Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-13sched/fair: Stabilize asym cpu capacity system idle cpu selectionVincent Guittot
select_idle_capacity() not only looks for an idle cpu that fits for the waking task but also for cpu with highest bandwidth when no cpu fits. Start the loop with target cpu so it will be selected 1st when no cpu fits but several cpus shared the same bandwidth. Starting with target cpu prevents the task to migrate between cpus with same bandwidth at every wakeup when no cpu fits. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230711081359.868862-1-vincent.guittot@linaro.org
2023-07-13sched/debug: Dump domains' sched group flagsPeter Zijlstra
There have been a case where the SD_SHARE_CPUCAPACITY sched group flag in a parent domain were not set and propagated properly when a degenerate domain is removed. Add dump of domain sched group flags of a CPU to make debug easier in the future. Usage: cat /debug/sched/domains/cpu0/domain1/groups_flags to dump cpu0 domain1's sched group flags. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13x86/sched: Enable cluster scheduling on HybridPeter Zijlstra
With the SMT vs non-SMT balancing issues sorted, also enable the cluster domain for Hybrid machines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-07-13sched/fair: Consider the idle state of the whole core for load balanceRicardo Neri
should_we_balance() traverses the group_balance_mask (AND'ed with lb_env:: cpus) starting from lower numbered CPUs looking for the first idle CPU. In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before non-SMT cores: [0, 1] [2, 3] [4, 5] 6 7 8 9 b i b i b i b i i i In the figure above, CPUs in brackets are siblings of an SMT core. The rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an idle CPU. We should let a CPU on a fully idle core get the first chance to idle load balance as it has more CPU capacity than a CPU on an idle SMT CPU with busy sibling. So for the figure above, if we are running should_we_balance() to CPU 1, we should return false to let CPU 7 on idle core to have a chance first to idle load balance. A partially busy (i.e., of type group_has_spare) local group with SMT  cores will often have only one SMT sibling busy. If the destination CPU is a non-SMT core, partially busy, lower-numbered, SMT cores should not be considered when finding the first idle CPU.  However, in should_we_balance(), when we encounter idle SMT first in partially busy core, we prematurely break the search for the first idle CPU. Higher-numbered, non-SMT cores is not given the chance to have idle balance done on their behalf. Those CPUs will only be considered for idle balancing by chance via CPU_NEWLY_IDLE. Instead, consider the idle state of the whole SMT core. Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13sched/fair: Implement prefer sibling imbalance calculation between ↵Tim C Chen
asymmetric groups In the current prefer sibling load balancing code, there is an implicit assumption that the busiest sched group and local sched group are equivalent, hence the tasks to be moved is simply the difference in number of tasks between the two groups (i.e. imbalance) divided by two. However, we may have different number of cores between the cluster groups, say when we take CPU offline or we have hybrid groups. In that case, we should balance between the two groups such that #tasks/#cores ratio is the same between the same between both groups. Hence the imbalance computed will need to reflect this. Adjust the sibling imbalance computation to take into account of the above considerations. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13sched/topology: Record number of cores in sched groupTim C Chen
When balancing sibling domains that have different number of cores, tasks in respective sibling domain should be proportional to the number of cores in each domain. In preparation of implementing such a policy, record the number of cores in a scheduling group. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13sched/fair: Determine active load balance for SMT sched groupsTim C Chen
On hybrid CPUs with scheduling cluster enabled, we will need to consider balancing between SMT CPU cluster, and Atom core cluster. Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores. Each scheduling cluster span a L2 cache. --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------ [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14] Big Big Big Big Atom Atom core core core core Module Module If the busiest group is a big core with both SMT CPUs busy, we should active load balance if destination group has idle CPU cores. Such condition is considered by asym_active_balance() in load balancing but not considered when looking for busiest group and computing load imbalance. Add this consideration in find_busiest_group() and calculate_imbalance(). In addition, update the logic determining the busier group when one group is SMT and the other group is non SMT but both groups are partially busy with idle CPU. The busier group should be the group with idle cores rather than the group with one busy SMT CPU. We do not want to make the SMT group the busiest one to pull the only task off SMT CPU and causing the whole core to go empty. Otherwise suppose in the search for the busiest group, we first encounter an SMT group with 1 task and set it as the busiest. The destination group is an atom cluster with 1 task and we next encounter an atom cluster group with 3 tasks, we will not pick this atom cluster over the SMT group, even though we should. As a result, we do not load balance the busier Atom cluster (with 3 tasks) towards the local atom cluster (with 1 task). And it doesn't make sense to pick the 1 task SMT group as the busier group as we also should not pull task off the SMT towards the 1 task atom cluster and make the SMT core completely empty. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com
2023-07-13sched/psi: make psi_cgroups_enabled staticMiaohe Lin
The static key psi_cgroups_enabled is only used inside file psi.c. Make it static. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/r/20230525103428.49712-1-linmiaohe@huawei.com
2023-07-13sched/core: introduce sched_core_idle_cpu()Cruz Zhao
As core scheduling introduced, a new state of idle is defined as force idle, running idle task but nr_running greater than zero. If a cpu is in force idle state, idle_cpu() will return zero. This result makes sense in some scenarios, e.g., load balance, showacpu when dumping, and judge the RCU boost kthread is starving. But this will cause error in other scenarios, e.g., tick_irq_exit(): When force idle, rq->curr == rq->idle but rq->nr_running > 0, results that idle_cpu() returns 0. In function tick_irq_exit(), if idle_cpu() is 0, tick_nohz_irq_exit() will not be called, and ts->idle_active will not become 1, which became 0 in tick_nohz_irq_enter(). ts->idle_sleeptime won't update in function update_ts_time_stats(), if ts->idle_active is 0, which should be 1. And this bug will result that ts->idle_sleeptime is less than the actual value, and finally will result that the idle time in /proc/stat is less than the actual value. To solve this problem, we introduce sched_core_idle_cpu(), which returns 1 when force idle. We audit all users of idle_cpu(), and change idle_cpu() into sched_core_idle_cpu() in function tick_irq_exit(). v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in function tick_irq_exit(). And modify the corresponding commit log. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
2023-07-13sched: add throttled time stat for throttled childrenJosh Don
We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each children cgroup has been throttled. This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period. Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
2023-07-13sched: don't account throttle time for empty groupsJosh Don
It is easy for a cfs_rq to become throttled even when it has no enqueued entities (for example, if we have just put_prev()'d the last runnable task of the cfs_rq, and the cfs_rq is out of quota). Avoid accounting this time towards total throttle time, since it otherwise falsely inflates the stats. Note that the dequeue path is special, since we normally disallow migrations when a task is in a throttled hierarchy (see throttled_lb_pair()). Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230620183247.737942-1-joshdon@google.com
2023-07-13sched: avoid false lockdep splat in put_task_struct()Wander Lairson Costa
In put_task_struct(), a spin_lock is indirectly acquired under the kernel stock. When running the kernel in real-time (RT) configuration, the operation is dispatched to a preemptible context call to ensure guaranteed preemption. However, if PROVE_RAW_LOCK_NESTING is enabled and __put_task_struct() is called while holding a raw_spinlock, lockdep incorrectly reports an "Invalid lock context" in the stock kernel. This false splat occurs because lockdep is unaware of the different route taken under RT. To address this issue, override the inner wait type to prevent the false lockdep splat. Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230614122323.37957-3-wander@redhat.com
2023-07-13kernel/fork: beware of __put_task_struct() calling contextWander Lairson Costa
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping locks. Therefore, it can't be called from an non-preemptible context. One practical example is splat inside inactive_task_timer(), which is called in a interrupt context: CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W --------- Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012 Call Trace: dump_stack_lvl+0x57/0x7d mark_lock_irq.cold+0x33/0xba mark_lock+0x1e7/0x400 mark_usage+0x11d/0x140 __lock_acquire+0x30d/0x930 lock_acquire.part.0+0x9c/0x210 rt_spin_lock+0x27/0xe0 refill_obj_stock+0x3d/0x3a0 kmem_cache_free+0x357/0x560 inactive_task_timer+0x1ad/0x340 __run_hrtimer+0x8a/0x1a0 __hrtimer_run_queues+0x91/0x130 hrtimer_interrupt+0x10f/0x220 __sysvec_apic_timer_interrupt+0x7b/0xd0 sysvec_apic_timer_interrupt+0x4f/0xd0 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0033:0x7fff196bf6f5 Instead of calling __put_task_struct() directly, we defer it using call_rcu(). A more natural approach would use a workqueue, but since in PREEMPT_RT, we can't allocate dynamic memory from atomic context, the code would become more complex because we would need to put the work_struct instance in the task_struct and initialize it when we allocate a new task_struct. The issue is reproducible with stress-ng: while true; do stress-ng --sched deadline --sched-period 1000000000 \ --sched-runtime 800000000 --sched-deadline \ 1000000000 --mmapfork 23 -t 20 done Reported-by: Hu Chunyu <chuhu@redhat.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230614122323.37957-2-wander@redhat.com
2023-07-13libceph: harden msgr2.1 frame segment length checksIlya Dryomov
ceph_frame_desc::fd_lens is an int array. decode_preamble() thus effectively casts u32 -> int but the checks for segment lengths are written as if on unsigned values. While reading in HELLO or one of the AUTH frames (before authentication is completed), arithmetic in head_onwire_len() can get duped by negative ctrl_len and produce head_len which is less than CEPH_PREAMBLE_LEN but still positive. This would lead to a buffer overrun in prepare_read_control() as the preamble gets copied to the newly allocated buffer of size head_len. Cc: stable@vger.kernel.org Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)") Reported-by: Thelford Williams <thelford@google.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2023-07-13Merge branch 'net-sched-fixes-for-sch_qfq'Paolo Abeni
Pedro Tammela says: ==================== net/sched: fixes for sch_qfq Patch 1 fixes a regression introduced in 6.4 where the MTU size could be bigger than 'lmax'. Patch 3 fixes an issue where the code doesn't account for qdisc_pkt_len() returning a size bigger then 'lmax'. Patches 2 and 4 are selftests for the issues above. ==================== Link: https://lore.kernel.org/r/20230711210103.597831-1-pctammela@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13selftests: tc-testing: add test for qfq with stab overheadPedro Tammela
A packet with stab overhead greater than QFQ_MAX_LMAX should be dropped by the QFQ qdisc as it can't handle such lengths. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13net/sched: sch_qfq: account for stab overhead in qfq_enqueuePedro Tammela
Lion says: ------- In the QFQ scheduler a similar issue to CVE-2023-31436 persists. Consider the following code in net/sched/sch_qfq.c: static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { unsigned int len = qdisc_pkt_len(skb), gso_segs; // ... if (unlikely(cl->agg->lmax < len)) { pr_debug("qfq: increasing maxpkt from %u to %u for class %u", cl->agg->lmax, len, cl->common.classid); err = qfq_change_agg(sch, cl, cl->agg->class_weight, len); if (err) { cl->qstats.drops++; return qdisc_drop(skb, sch, to_free); } // ... } Similarly to CVE-2023-31436, "lmax" is increased without any bounds checks according to the packet length "len". Usually this would not impose a problem because packet sizes are naturally limited. This is however not the actual packet length, rather the "qdisc_pkt_len(skb)" which might apply size transformations according to "struct qdisc_size_table" as created by "qdisc_get_stab()" in net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc. A user may choose virtually any size using such a table. As a result the same issue as in CVE-2023-31436 can occur, allowing heap out-of-bounds read / writes in the kmalloc-8192 cache. ------- We can create the issue with the following commands: tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \ overhead 999999999 linklayer ethernet qfq tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k tc filter add dev $DEV parent 1: matchall classid 1:1 ping -I $DEV 1.1.1.2 This is caused by incorrectly assuming that qdisc_pkt_len() returns a length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: Lion <nnamrec@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13selftests: tc-testing: add tests for qfq mtu sanity checkPedro Tammela
QFQ only supports a certain bound of MTU size so make sure we check for this requirement in the tests. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13net/sched: sch_qfq: reintroduce lmax bound check for MTUPedro Tammela
25369891fcef deletes a check for the case where no 'lmax' is specified which 3037933448f6 previously fixed as 'lmax' could be set to the device's MTU without any bound checking for QFQ_LMAX_MIN and QFQ_LMAX_MAX. Therefore, reintroduce the check. Fixes: 25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13sh: hd64461: Handle virq offset for offchip IRQ base and HD64461 IRQArtur Rojek
A recent change to start counting SuperH IRQ #s from 16 breaks support for the Hitachi HD64461 companion chip. Move the offchip IRQ base and HD64461 IRQ # by 16 in order to accommodate for the new virq numbering rules. Fixes: a8ac2961148e ("sh: Avoid using IRQ0 on SH3 and SH4") Signed-off-by: Artur Rojek <contact@artur-rojek.eu> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://lore.kernel.org/r/20230710233132.69734-1-contact@artur-rojek.eu Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2023-07-13sh: mach-dreamcast: Handle virq offset in cascaded IRQ demuxGeert Uytterhoeven
Take into account the virq offset when translating cascaded interrupts. Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://lore.kernel.org/r/7d0cb246c9f1cd24bb1f637ec5cb67e799a4c3b8.1688908227.git.geert+renesas@glider.be Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2023-07-13sh: mach-highlander: Handle virq offset in cascaded IRL demuxGeert Uytterhoeven
Take into account the virq offset when translating cascaded IRL interrupts. Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://lore.kernel.org/r/4fcb0d08a2b372431c41e04312742dc9e41e1be4.1688908186.git.geert+renesas@glider.be Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2023-07-13sh: mach-r2d: Handle virq offset in cascaded IRL demuxGeert Uytterhoeven
When booting rts7751r2dplus_defconfig on QEMU, the system hangs due to an interrupt storm on IRQ 20. IRQ 20 aka event 0x280 is a cascaded IRL interrupt, which maps to IRQ_VOYAGER, the interrupt used by the Silicon Motion SM501 multimedia companion chip. As rts7751r2d_irq_demux() does not take into account the new virq offset, the interrupt is no longer translated, leading to an unhandled interrupt. Fix this by taking into account the virq offset when translating cascaded IRL interrupts. Fixes: a8ac2961148e8c72 ("sh: Avoid using IRQ0 on SH3 and SH4") Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/r/fbfea3ad-d327-4ad5-ac9c-648c7ca3fe1f@roeck-us.net Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/2c99d5df41c40691f6c407b7b6a040d406bc81ac.1688901306.git.geert+renesas@glider.be Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2023-07-12smb: client: fix missed ses refcountingPaulo Alcantara
Use new cifs_smb_ses_inc_refcount() helper to get an active reference of @ses and @ses->dfs_root_ses (if set). This will prevent @ses->dfs_root_ses of being put in the next call to cifs_put_smb_ses() and thus potentially causing an use-after-free bug. Fixes: 8e3554150d6c ("cifs: fix sharing of DFS connections") Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-12smb: client: Fix -Wstringop-overflow issuesGustavo A. R. Silva
pSMB->hdr.Protocol is an array of size 4 bytes, hence when the compiler analyzes this line of code parm_data = ((char *) &pSMB->hdr.Protocol) + offset; it legitimately complains about the fact that offset points outside the bounds of the array. Notice that the compiler gives priority to the object as an array, rather than merely the address of one more byte in a structure to wich offset should be added (which seems to be the actual intention of the original implementation). Fix this by explicitly instructing the compiler to treat the code as a sequence of bytes in struct smb_com_transaction2_spi_req, and not as an array accessed through pointer notation. Notice that ((char *)pSMB) + sizeof(pSMB->hdr.smb_buf_length) points to the same address as ((char *) &pSMB->hdr.Protocol), therefore this results in no differences in binary output. Fixes the following -Wstringop-overflow warnings when built s390 architecture with defconfig (GCC 13): CC [M] fs/smb/client/cifssmb.o In function 'cifs_init_ace', inlined from 'posix_acl_to_cifs' at fs/smb/client/cifssmb.c:3046:3, inlined from 'cifs_do_set_acl' at fs/smb/client/cifssmb.c:3191:15: fs/smb/client/cifssmb.c:2987:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=] 2987 | cifs_ace->cifs_e_perm = local_ace->e_perm; | ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ In file included from fs/smb/client/cifssmb.c:27: fs/smb/client/cifspdu.h: In function 'cifs_do_set_acl': fs/smb/client/cifspdu.h:384:14: note: at offset [7, 11] into destination object 'Protocol' of size 4 384 | __u8 Protocol[4]; | ^~~~~~~~ In function 'cifs_init_ace', inlined from 'posix_acl_to_cifs' at fs/smb/client/cifssmb.c:3046:3, inlined from 'cifs_do_set_acl' at fs/smb/client/cifssmb.c:3191:15: fs/smb/client/cifssmb.c:2988:30: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=] 2988 | cifs_ace->cifs_e_tag = local_ace->e_tag; | ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ fs/smb/client/cifspdu.h: In function 'cifs_do_set_acl': fs/smb/client/cifspdu.h:384:14: note: at offset [6, 10] into destination object 'Protocol' of size 4 384 | __u8 Protocol[4]; | ^~~~~~~~ This helps with the ongoing efforts to globally enable -Wstringop-overflow. Link: https://github.com/KSPP/linux/issues/310 Fixes: dc1af4c4b472 ("cifs: implement set acl method") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2023-07-12 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 7 files changed, 93 insertions(+), 28 deletions(-). The main changes are: 1) Fix max stack depth check for async callbacks, from Kumar. 2) Fix inconsistent JIT image generation, from Björn. 3) Use trusted arguments in XDP hints kfuncs, from Larysa. 4) Fix memory leak in cpu_map_update_elem, from Pu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xdp: use trusted arguments in XDP hints kfuncs bpf: cpumap: Fix memory leak in cpu_map_update_elem riscv, bpf: Fix inconsistent JIT image generation selftests/bpf: Add selftest for check_stack_max_depth bug bpf: Fix max stack depth check for async callbacks ==================== Link: https://lore.kernel.org/r/20230712223045.40182-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12wifi: cfg80211: fix receiving mesh packets without RFC1042 headerFelix Fietkau
Fix ethernet header length field after stripping the mesh header Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/CT5GNZSK28AI.2K6M69OXM9RW5@syracuse/ Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Reported-and-tested-by: Nicolas Escande <nico.escande@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230711115052.68430-1-nbd@nbd.name Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set()Zhang Shurong
If there is a failure during rtw89_fw_h2c_raw() rtw89_debug_priv_send_h2c should return negative error code instead of a positive value count. Fix this bug by returning correct error code. Fixes: e3ec7017f6a2 ("rtw89: add Realtek 802.11ax driver") Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com> Acked-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/tencent_AD09A61BC4DA92AD1EB0790F5C850E544D07@qq.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12net: txgbe: fix eeprom calculation errorJiawen Wu
For some device types like TXGBE_ID_XAUI, *checksum computed in txgbe_calc_eeprom_checksum() is larger than TXGBE_EEPROM_SUM. Remove the limit on the size of *checksum. Fixes: 049fe5365324 ("net: txgbe: Add operations to interact with firmware") Fixes: 5e2ea7801fac ("net: txgbe: Fix unsigned comparison to zero in txgbe_calc_eeprom_checksum()") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://lore.kernel.org/r/20230711063414.3311-1-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12Merge tag 'for-linus' of https://github.com/openrisc/linuxLinus Torvalds
Pull OpenRISC fix from Stafford Horne: - During the 6.4 cycle my fpu support work broke ABI compatibility in the sigcontext struct. This was noticed by musl libc developers after the release. This fix restores the ABI. * tag 'for-linus' of https://github.com/openrisc/linux: openrisc: Union fpcsr and oldmask in sigcontext to unbreak userspace ABI
2023-07-12tracing/histograms: Add histograms to hist_vars if they have referenced ↵Mohamed Khalfella
variables Hist triggers can have referenced variables without having direct variables fields. This can be the case if referenced variables are added for trigger actions. In this case the newly added references will not have field variables. Not taking such referenced variables into consideration can result in a bug where it would be possible to remove hist trigger with variables being refenced. This will result in a bug that is easily reproducable like so $ cd /sys/kernel/tracing $ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events $ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger $ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger $ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger [ 100.263533] ================================================================== [ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180 [ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439 [ 100.266320] [ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4 [ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 100.268561] Call Trace: [ 100.268902] <TASK> [ 100.269189] dump_stack_lvl+0x4c/0x70 [ 100.269680] print_report+0xc5/0x600 [ 100.270165] ? resolve_var_refs+0xc7/0x180 [ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0 [ 100.271389] ? resolve_var_refs+0xc7/0x180 [ 100.271913] kasan_report+0xbd/0x100 [ 100.272380] ? resolve_var_refs+0xc7/0x180 [ 100.272920] __asan_load8+0x71/0xa0 [ 100.273377] resolve_var_refs+0xc7/0x180 [ 100.273888] event_hist_trigger+0x749/0x860 [ 100.274505] ? kasan_save_stack+0x2a/0x50 [ 100.275024] ? kasan_set_track+0x29/0x40 [ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10 [ 100.276138] ? ksys_write+0xd1/0x170 [ 100.276607] ? do_syscall_64+0x3c/0x90 [ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.277771] ? destroy_hist_data+0x446/0x470 [ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860 [ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10 [ 100.279627] ? __kasan_check_write+0x18/0x20 [ 100.280177] ? mutex_unlock+0x85/0xd0 [ 100.280660] ? __pfx_mutex_unlock+0x10/0x10 [ 100.281200] ? kfree+0x7b/0x120 [ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0 [ 100.282197] ? event_trigger_write+0xac/0x100 [ 100.282764] ? __kasan_slab_free+0x16/0x20 [ 100.283293] ? __kmem_cache_free+0x153/0x2f0 [ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250 [ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10 [ 100.285221] ? event_trigger_write+0xbc/0x100 [ 100.285781] ? __kasan_check_read+0x15/0x20 [ 100.286321] ? __bitmap_weight+0x66/0xa0 [ 100.286833] ? _find_next_bit+0x46/0xe0 [ 100.287334] ? task_mm_cid_work+0x37f/0x450 [ 100.287872] event_triggers_call+0x84/0x150 [ 100.288408] trace_event_buffer_commit+0x339/0x430 [ 100.289073] ? ring_buffer_event_data+0x3f/0x60 [ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0 [ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0 [ 100.298653] syscall_enter_from_user_mode+0x32/0x40 [ 100.301808] do_syscall_64+0x1a/0x90 [ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.307775] RIP: 0033:0x7f686c75c1cb [ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48 [ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021 [ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb [ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a [ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a [ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009 [ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007 [ 100.338381] </TASK> We hit the bug because when second hist trigger has was created has_hist_vars() returned false because hist trigger did not have variables. As a result of that save_hist_vars() was not called to add the trigger to trace_array->hist_vars. Later on when we attempted to remove the first histogram find_any_var_ref() failed to detect it is being used because it did not find the second trigger in hist_vars list. With this change we wait until trigger actions are created so we can take into consideration if hist trigger has variable references. Also, now we check the return value of save_hist_vars() and fail trigger creation if save_hist_vars() fails. Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com Cc: stable@vger.kernel.org Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12net/sched: make psched_mtu() RTNL-less safePedro Tammela
Eric Dumazet says[1]: ------- Speaking of psched_mtu(), I see that net/sched/sch_pie.c is using it without holding RTNL, so dev->mtu can be changed underneath. KCSAN could issue a warning. ------- Annotate dev->mtu with READ_ONCE() so KCSAN don't issue a warning. [1] https://lore.kernel.org/all/CANn89iJoJO5VtaJ-2=_d2aOQhb0Xw8iBT_Cxqp2HyuS-zj6azw@mail.gmail.com/ v1 -> v2: Fix commit message Fixes: d4b36210c2e6 ("net: pkt_sched: PIE AQM scheme") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230711021634.561598-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12net: ena: fix shift-out-of-bounds in exponential backoffKrister Johansen
The ENA adapters on our instances occasionally reset. Once recently logged a UBSAN failure to console in the process: UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117 Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017 Workqueue: ena ena_fw_reset_device [ena] Call Trace: <TASK> dump_stack_lvl+0x4a/0x63 dump_stack+0x10/0x16 ubsan_epilogue+0x9/0x36 __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e ? __const_udelay+0x43/0x50 ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena] wait_for_reset_state+0x54/0xa0 [ena] ena_com_dev_reset+0xc8/0x110 [ena] ena_down+0x3fe/0x480 [ena] ena_destroy_device+0xeb/0xf0 [ena] ena_fw_reset_device+0x30/0x50 [ena] process_one_work+0x22b/0x3d0 worker_thread+0x4d/0x3f0 ? process_one_work+0x3d0/0x3d0 kthread+0x12a/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x22/0x30 </TASK> Apparently, the reset delays are getting so large they can trigger a UBSAN panic. Looking at the code, the current timeout is capped at 5000us. Using a base value of 100us, the current code will overflow after (1<<29). Even at values before 32, this function wraps around, perhaps unintentionally. Cap the value of the exponent used for this backoff at (1<<16) which is larger than currently necessary, but large enough to support bigger values in the future. Cc: stable@vger.kernel.org Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time") Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Agroskin <shayagr@amazon.com> Link: https://lore.kernel.org/r/20230711013621.GE1926@templeofstupid.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12kallsyms: strip LTO-only suffixes from promoted global functionsYonghong Song
Commit 6eb4bd92c1ce ("kallsyms: strip LTO suffixes from static functions") stripped all function/variable suffixes started with '.' regardless of whether those suffixes are generated at LTO mode or not. In fact, as far as I know, in LTO mode, when a static function/variable is promoted to the global scope, '.llvm.<...>' suffix is added. The existing mechanism breaks live patch for a LTO kernel even if no <symbol>.llvm.<...> symbols are involved. For example, for the following kernel symbols: $ grep bpf_verifier_vlog /proc/kallsyms ffffffff81549f60 t bpf_verifier_vlog ffffffff8268b430 d bpf_verifier_vlog._entry ffffffff8282a958 d bpf_verifier_vlog._entry_ptr ffffffff82e12a1f d bpf_verifier_vlog.__already_done 'bpf_verifier_vlog' is a static function. '_entry', '_entry_ptr' and '__already_done' are static variables used inside 'bpf_verifier_vlog', so llvm promotes them to file-level static with prefix 'bpf_verifier_vlog.'. Note that the func-level to file-level static function promotion also happens without LTO. Given a symbol name 'bpf_verifier_vlog', with LTO kernel, current mechanism will return 4 symbols to live patch subsystem which current live patching subsystem cannot handle it. With non-LTO kernel, only one symbol is returned. In [1], we have a lengthy discussion, the suggestion is to separate two cases: (1). new symbols with suffix which are generated regardless of whether LTO is enabled or not, and (2). new symbols with suffix generated only when LTO is enabled. The cleanup_symbol_name() should only remove suffixes for case (2). Case (1) should not be changed so it can work uniformly with or without LTO. This patch removed LTO-only suffix '.llvm.<...>' so live patching and tracing should work the same way for non-LTO kernel. The cleanup_symbol_name() in scripts/kallsyms.c is also changed to have the same filtering pattern so both kernel and kallsyms tool have the same expectation on the order of symbols. [1] https://lore.kernel.org/live-patching/20230615170048.2382735-1-song@kernel.org/T/#u Fixes: 6eb4bd92c1ce ("kallsyms: strip LTO suffixes from static functions") Reported-by: Song Liu <song@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230628181926.4102448-1-yhs@fb.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-13Merge tag 'renesas-pinctrl-fixes-for-v6.5-tag1' of ↵Linus Walleij
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers into fixes pinctrl: renesas: Fixes for v6.5 - Fix handling of non-unique pin control configuration subnode names on the RZ/V2M and RZ/G2L SoC families. Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-07-13pinctrl: amd: Unify debounce handling into amd_pinconf_set()Mario Limonciello
Debounce handling is done in two different entry points in the driver. Unify this to make sure that it's always handled the same. Tested-by: Jan Visser <starquake@linuxeverywhere.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20230705133005.577-5-mario.limonciello@amd.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-07-13pinctrl: amd: Drop pull up select configurationMario Limonciello
pinctrl-amd currently tries to program bit 19 of all GPIOs to select either a 4kΩ or 8hΩ pull up, but this isn't what bit 19 does. Bit 19 is marked as reserved, even in the latest platforms documentation. Drop this programming functionality. Tested-by: Jan Visser <starquake@linuxeverywhere.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20230705133005.577-4-mario.limonciello@amd.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-07-13pinctrl: amd: Use amd_pinconf_set() for all config optionsMario Limonciello
On ASUS TUF A16 it is reported that the ITE5570 ACPI device connected to GPIO 7 is causing an interrupt storm. This issue doesn't happen on Windows. Comparing the GPIO register configuration between Windows and Linux bit 20 has been configured as a pull up on Windows, but not on Linux. Checking GPIO declaration from the firmware it is clear it *should* have been a pull up on Linux as well. ``` GpioInt (Level, ActiveLow, Exclusive, PullUp, 0x0000, "\\_SB.GPIO", 0x00, ResourceConsumer, ,) { // Pin list 0x0007 } ``` On Linux amd_gpio_set_config() is currently only used for programming the debounce. Actually the GPIO core calls it with all the arguments that are supported by a GPIO, pinctrl-amd just responds `-ENOTSUPP`. To solve this issue expand amd_gpio_set_config() to support the other arguments amd_pinconf_set() supports, namely `PIN_CONFIG_BIAS_PULL_DOWN`, `PIN_CONFIG_BIAS_PULL_UP`, and `PIN_CONFIG_DRIVE_STRENGTH`. Reported-by: Nik P <npliashechnikov@gmail.com> Reported-by: Nathan Schulte <nmschulte@gmail.com> Reported-by: Friedrich Vock <friedrich.vock@gmx.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217336 Reported-by: dridri85@gmail.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217493 Link: https://lore.kernel.org/linux-input/20230530154058.17594-1-friedrich.vock@gmx.de/ Tested-by: Jan Visser <starquake@linuxeverywhere.org> Fixes: 2956b5d94a76 ("pinctrl / gpio: Introduce .set_config() callback for GPIO chips") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20230705133005.577-3-mario.limonciello@amd.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-07-13pinctrl: amd: Only use special debounce behavior for GPIO 0Mario Limonciello
It's uncommon to use debounce on any other pin, but technically we should only set debounce to 0 when working off GPIO0. Cc: stable@vger.kernel.org Tested-by: Jan Visser <starquake@linuxeverywhere.org> Fixes: 968ab9261627 ("pinctrl: amd: Detect internal GPIO0 debounce handling") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20230705133005.577-2-mario.limonciello@amd.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-07-12tracing: Stop FORTIFY_SOURCE complaining about stack trace callerSteven Rostedt (Google)
The stack_trace event is an event created by the tracing subsystem to store stack traces. It originally just contained a hard coded array of 8 words to hold the stack, and a "size" to know how many entries are there. This is exported to user space as: name: kernel_stack ID: 4 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:int size; offset:8; size:4; signed:1; field:unsigned long caller[8]; offset:16; size:64; signed:0; print fmt: "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n",i (void *)REC->caller[0], (void *)REC->caller[1], (void *)REC->caller[2], (void *)REC->caller[3], (void *)REC->caller[4], (void *)REC->caller[5], (void *)REC->caller[6], (void *)REC->caller[7] Where the user space tracers could parse the stack. The library was updated for this specific event to only look at the size, and not the array. But some older users still look at the array (note, the older code still checks to make sure the array fits inside the event that it read. That is, if only 4 words were saved, the parser would not read the fifth word because it will see that it was outside of the event size). This event was changed a while ago to be more dynamic, and would save a full stack even if it was greater than 8 words. It does this by simply allocating more ring buffer to hold the extra words. Then it copies in the stack via: memcpy(&entry->caller, fstack->calls, size); As the entry is struct stack_entry, that is created by a macro to both create the structure and export this to user space, it still had the caller field of entry defined as: unsigned long caller[8]. When the stack is greater than 8, the FORTIFY_SOURCE code notices that the amount being copied is greater than the source array and complains about it. It has no idea that the source is pointing to the ring buffer with the required allocation. To hide this from the FORTIFY_SOURCE logic, pointer arithmetic is used: ptr = ring_buffer_event_data(event); entry = ptr; ptr += offsetof(typeof(*entry), caller); memcpy(ptr, fstack->calls, size); Link: https://lore.kernel.org/all/20230612160748.4082850-1-svens@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()Zheng Yejian
As comments in ftrace_process_locs(), there may be NULL pointers in mcount_loc section: > Some architecture linkers will pad between > the different mcount_loc sections of different > object files to satisfy alignments. > Skip any NULL pointers. After commit 20e5227e9f55 ("ftrace: allow NULL pointers in mcount_loc"), NULL pointers will be accounted when allocating ftrace pages but skipped before adding into ftrace pages, this may result in some pages not being used. Then after commit 706c81f87f84 ("ftrace: Remove extra helper functions"), warning may occur at: WARN_ON(pg->next); To fix it, only warn for case that no pointers skipped but pages not used up, then free those unused pages after releasing ftrace_lock. Link: https://lore.kernel.org/linux-trace-kernel/20230712060452.3175675-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: 706c81f87f84 ("ftrace: Remove extra helper functions") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-12drm/nouveau: bring back blit subchannel for pre nv50 GPUsKarol Herbst
1ba6113a90a0 removed a lot of the kernel GPU channel, but method 0x128 was important as otherwise the GPU spams us with `CACHE_ERROR` messages. We use the blit subchannel inside our vblank handling, so we should keep at least this part. v2: Only do it for NV11+ GPUs Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/201 Fixes: 4a16dd9d18a0 ("drm/nouveau/kms: switch to drm fbdev helpers") Signed-off-by: Karol Herbst <kherbst@redhat.com> Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230526091052.2169044-1-kherbst@redhat.com