summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-06-05iomap: Fix iomap_adjust_read_range for plen calculationRitesh Harjani (IBM)
If the extent spans the block that contains i_size, we need to handle both halves separately so that we properly zero data in the page cache for blocks that are entirely outside of i_size. But this is needed only when i_size is within the current folio under processing. "orig_pos + length > isize" can be true for all folios if the mapped extent length is greater than the folio size. That is making plen to break for every folio instead of only the last folio. So use orig_plen for checking if "orig_pos + orig_plen > isize". Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/a32e5f9a4fcfdb99077300c4020ed7ae61d6e0f9.1715067055.git.ritesh.list@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> cc: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-05iomap: keep on increasing i_size in iomap_write_end()Zhang Yi
Commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")' breaks xfs with realtime device on generic/561, the problem is when unaligned truncate down a xfs realtime inode with rtextsize > 1 fs block, xfs only zero out the EOF block but doesn't zero out the tail blocks that aligned to rtextsize, so if we don't increase i_size in iomap_write_end(), it could expose stale data after we do an append write beyond the aligned EOF block. xfs should zero out the tail blocks when truncate down, but before we finish that, let's fix the issue by just revert the changes in iomap_write_end(). Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation") Reported-by: Chandan Babu R <chandanbabu@kernel.org> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240603112222.2109341-1-yi.zhang@huaweicloud.com Tested-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-05Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq fixes for 6.10-rc3: - Fix a recently introduced unchecked HWP MSR access in the intel_pstate driver (Srinivas Pandruvada). - Add missing conversion from MHz to KHz to amd_pstate_set_boost() to address sysfs inteface inconsistency (Dhananjay Ugwekar). - Get rid of an excess global header file used by the amd-pstate cpufreq driver (Arnd Bergmann). * pm-cpufreq: cpufreq: intel_pstate: Fix unchecked HWP MSR access cpufreq: amd-pstate: Fix the inconsistency in max frequency units cpufreq: amd-pstate: remove global header file
2024-06-05KVM: s390x: selftests: Add shared zeropage testDavid Hildenbrand
Let's test that we can have shared zeropages in our process as long as storage keys are not getting used, that shared zeropages are properly unshared (replaced by anonymous pages) once storage keys are enabled, and that no new shared zeropages are populated after storage keys were enabled. We require the new pagemap interface to detect the shared zeropage. On an old kernel (zeropages always disabled): # ./s390x/shared_zeropage_test TAP version 13 1..3 not ok 1 Shared zeropages should be enabled ok 2 Shared zeropage should be gone ok 3 Shared zeropages should be disabled # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0 On a fixed kernel: # ./s390x/shared_zeropage_test TAP version 13 1..3 ok 1 Shared zeropages should be enabled ok 2 Shared zeropage should be gone ok 3 Shared zeropages should be disabled # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 Testing of UFFDIO_ZEROPAGE can be added later. [ agordeev: Fixed checkpatch complaint, added ucall_common.h include ] Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20240412084329.30315-1-david@redhat.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05s390/crash: Do not use VM info if os_info does not have itAlexander Gordeev
The virtual memory information stored in os_info area is required for creation of the kernel image PT_LOAD program header for kernels since commit a2ec5bec56dd ("s390/mm: uncouple physical vs virtual address spaces"). By contrast, if such information in os_info is absent the PT_LOAD program header should not be created. Currently the proper PT_LOAD program header is created for kernels that contain the virtual memory information, but for kernels without one an invalid header of zero size is created. That in turn leads to stand-alone dump failures. Use OS_INFO_KASLR_OFFSET variable to check whether os_info is present or not (same as crash and makedumpfile tools do) and based on that create or do not create the kernel image PT_LOAD program header. Fixes: f4cac27dc0d6 ("s390/crash: Use old os_info to create PT_LOAD headers") Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05Merge branches 'acpi-ec', 'acpi-apei' and 'pnp'Rafael J. Wysocki
Merge ACPI EC driver fixes, an ACPI APEI fix and PNP fixes for 6.10-rc3: - Fix error handling during EC operation region accesses in the ACPI EC driver (Armin Wolf). - Fix a memory leak in the APEI error injection driver introduced during its converion to a platform driver (Dan Williams). - Fix build failures related to the dev_is_pnp() macro by redefining it as a proper function and exporting it to modules as appropriate and unexport pnp_bus_type which need not be exported any more (Andy Shevchenko). * acpi-ec: ACPI: EC: Avoid returning AE_OK on errors in address space handler ACPI: EC: Abort address space access upon error * acpi-apei: ACPI: APEI: EINJ: Fix einj_dev release leak * pnp: PNP: Hide pnp_bus_type from the non-PNP code PNP: Make dev_is_pnp() to be a function and export it for modules
2024-06-05libbpf: Remove callback-based type/string BTF field visitor helpersAndrii Nakryiko
Now that all libbpf/bpftool code switched to btf_field_iter, remove btf_type_visit_type_ids() and btf_type_visit_str_offs() callback-based helpers as not needed anymore. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-6-andrii@kernel.org
2024-06-05bpftool: Use BTF field iterator in btfgenAndrii Nakryiko
Switch bpftool's code which is using libbpf-internal btf_type_visit_type_ids() helper to new btf_field_iter functionality. This makes bpftool code simpler, but also unblocks removing libbpf's btf_type_visit_type_ids() helper completely. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Quentin Monnet <qmo@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-5-andrii@kernel.org
2024-06-05libbpf: Make use of BTF field iterator in BTF handling codeAndrii Nakryiko
Use new BTF field iterator logic to replace all the callback-based visitor calls. There is still a .BTF.ext callback-based visitor APIs that should be converted, which will happens as a follow up. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-4-andrii@kernel.org
2024-06-05libbpf: Make use of BTF field iterator in BPF linker codeAndrii Nakryiko
Switch all BPF linker code dealing with iterating BTF type ID and string offset fields to new btf_field_iter facilities. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-3-andrii@kernel.org
2024-06-05libbpf: Add BTF field iteratorAndrii Nakryiko
Implement iterator-based type ID and string offset BTF field iterator. This is used extensively in BTF-handling code and BPF linker code for various sanity checks, rewriting IDs/offsets, etc. Currently this is implemented as visitor pattern calling custom callbacks, which makes the logic (especially in simple cases) unnecessarily obscure and harder to follow. Having equivalent functionality using iterator pattern makes for simpler to understand and maintain code. As we add more code for BTF processing logic in libbpf, it's best to switch to iterator pattern before adding more callback-based code. The idea for iterator-based implementation is to record offsets of necessary fields within fixed btf_type parts (which should be iterated just once), and, for kinds that have multiple members (based on vlen field), record where in each member necessary fields are located. Generic iteration code then just keeps track of last offset that was returned and handles N members correctly. Return type is just u32 pointer, where NULL is returned when all relevant fields were already iterated. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-2-andrii@kernel.org
2024-06-05bcachefs: Fix trans->locked assertKent Overstreet
in bch2_move_data_btree, we might start with the trans unlocked from a previous loop iteration - we need a trans_begin() before iter_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05bcachefs: Rereplicate now moves data off of durability=0 devicesKent Overstreet
This fixes an issue where setting a device to durability=0 after it's been used makes it impossible to remove. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05bcachefs: Fix GFP_KERNEL allocation in break_cycle()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05perf bpf: Fix handling of minimal vmlinux.h file when interrupting the buildNamhyung Kim
Ingo reported that he was seeing these when hitting Control+C during a perf tools build: Makefile.perf:1149: *** Missing bpftool input for generating vmlinux.h. Stop. The failure happens when you don't have vmlinux.h or vmlinux with BTF. ifeq ($(VMLINUX_H),) ifeq ($(VMLINUX_BTF),) $(error Missing bpftool input for generating vmlinux.h) endif endif VMLINUX_BTF can be empty if you didn't build a kernel or it doesn't have a BTF section and the current kernel also has no BTF. This is totally ok. But VMLINUX_H should be set to the minimal version in the source tree (unless you overwrite it manually) when you don't pass GEN_VMLINUX_H=1 (which requires VMLINUX_BTF should not be empty). The problem is that it's defined in Makefile.config which is not included for `make clean`. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Link: http://lore.kernel.org/lkml/CAM9d7ch5HTr+k+_GpbMrX0HUo5BZ11byh1xq0Two7B7RQACuNw@mail.gmail.com Link: http://lore.kernel.org/lkml/ZjssGrj+abyC6mYP@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-06-05Revert "perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES event"Arnaldo Carvalho de Melo
This reverts commit 7d1405c71df21f6c394b8a885aa8a133f749fa22. This causes segfaults in some cases, as reported by Milian: ``` sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e raw_syscalls:sys_enter ls ... [ perf record: Woken up 3 times to write data ] malloc(): invalid next size (unsorted) Aborted ``` Backtrace with GDB + debuginfod: ``` malloc(): invalid next size (unsorted) Thread 1 "perf" received signal SIGABRT, Aborted. __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0; (gdb) bt #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 #1 0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78 #2 0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/ raise.c:26 #3 0x00007ffff6e384c3 in __GI_abort () at abort.c:79 #4 0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea "%s\n") at ../sysdeps/posix/libc_fatal.c:132 #5 0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850 "malloc(): invalid next size (unsorted)") at malloc.c:5772 #6 0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0 <main_arena>, bytes=bytes@entry=368) at malloc.c:4081 #7 0x00007ffff6eb877e in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3754 #8 0x000055555569bdb6 in perf_session.do_write_header () #9 0x00005555555a373a in __cmd_record.constprop.0 () #10 0x00005555555a6846 in cmd_record () #11 0x000055555564db7f in run_builtin () #12 0x000055555558ed77 in main () ``` Valgrind memcheck: ``` ==45136== Invalid write of size 8 ==45136== at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf) ==45136== by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ==45136== Syscall param write(buf) points to unaddressable byte(s) ==45136== at 0x575953D: __libc_write (write.c:26) ==45136== by 0x575953D: write (write.c:24) ==45136== by 0x35761F: ion (in /usr/bin/perf) ==45136== by 0x357778: writen (in /usr/bin/perf) ==45136== by 0x1548F7: record__write (in /usr/bin/perf) ==45136== by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ----- Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/ Reported-by: Milian Wolff <milian.wolff@kdab.com> Tested-by: Milian Wolff <milian.wolff@kdab.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@kernel.org # 6.8+ Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-06-05locking/atomic: scripts: fix ${atomic}_sub_and_test() kerneldocCarlos Llamas
For ${atomic}_sub_and_test() the @i parameter is the value to subtract, not add. Fix the typo in the kerneldoc template and generate the headers with this update. Fixes: ad8110706f38 ("locking/atomic: scripts: generate kerneldoc comments") Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240515133844.3502360-1-cmllamas@google.com
2024-06-05perf/core: Fix missing wakeup when waiting for context referenceHaifeng Xu
In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this: [346278.191038] __schedule+0x2d8/0x890 [346278.191046] schedule+0x4e/0xb0 [346278.191049] perf_event_free_task+0x220/0x270 [346278.191056] ? init_wait_var_entry+0x50/0x50 [346278.191060] copy_process+0x663/0x18d0 [346278.191068] kernel_clone+0x9d/0x3d0 [346278.191072] __do_sys_clone+0x5d/0x80 [346278.191076] __x64_sys_clone+0x25/0x30 [346278.191079] do_syscall_64+0x5c/0xc0 [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50 [346278.191086] ? do_syscall_64+0x69/0xc0 [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20 [346278.191092] ? irqentry_exit+0x19/0x30 [346278.191095] ? exc_page_fault+0x89/0x160 [346278.191097] ? asm_exc_page_fault+0x8/0x30 [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae The task was waiting for the refcount become to 1, but from the vmcore, we found the refcount has already been 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The below scenario may cause the problem. Thread A Thread B ... ... perf_event_free_task perf_event_release_kernel ... acquire event->child_mutex ... get_ctx ... release event->child_mutex acquire ctx->mutex ... perf_free_event (acquire/release event->child_mutex) ... release ctx->mutex wait_var_event acquire ctx->mutex acquire event->child_mutex # move existing events to free_list release event->child_mutex release ctx->mutex put_ctx ... ... In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference. Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()") Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240513103948.33570-1-haifeng.xu@shopee.com
2024-06-05KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()Breno Leitao
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the loads and stores are atomic. In the extremely unlikely scenario the compiler tears the stores, it's theoretically possible for KVM to attempt to get a vCPU using an out-of-bounds index, e.g. if the write is split into multiple 8-bit stores, and is paired with a 32-bit load on a VM with 257 vCPUs: CPU0 CPU1 last_boosted_vcpu = 0xff; (last_boosted_vcpu = 0x100) last_boosted_vcpu[15:8] = 0x01; i = (last_boosted_vcpu = 0x1ff) last_boosted_vcpu[7:0] = 0x00; vcpu = kvm->vcpu_array[0x1ff]; As detected by KCSAN: BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm] write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x00000012 -> 0x00000000 Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBitsTao Su
Use the max mappable GPA via GuestPhysBits advertised by KVM to calculate max_gfn. Currently some selftests (e.g. access_tracking_perf_test, dirty_log_test...) add RAM regions close to max_gfn, so guest may access GPA beyond its mappable range and cause infinite loop. Adjust max_gfn in vm_compute_max_gfn() since x86 selftests already overrides vm_compute_max_gfn() specifically to deal with goofy edge cases. Reported-by: Yi Lai <yi1.lai@intel.com> Signed-off-by: Tao Su <tao1.su@linux.intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240513014003.104593-1-tao1.su@linux.intel.com [sean: tweak name, add comment and sanity check] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: selftests: Fix shift of 32 bit unsigned int more than 32 bitsColin Ian King
Currrentl a 32 bit 1u value is being shifted more than 32 bits causing overflow and incorrect checking of bits 32-63. Fix this by using the BIT_ULL macro for shifting bits. Detected by cppcheck: sev_init2_tests.c:108:34: error: Shifting 32-bit value by 63 bits is undefined behaviour [shiftTooManyBits] Fixes: dfc083a181ba ("selftests: kvm: add tests for KVM_SEV_INIT2") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240523154102.2236133-1-colin.i.king@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05Merge branch 'mlx5-fixes'David S. Miller
Tariq Toukan says: ==================== mlx5 core fixes 20240603 This small patchset provides two bug fixes from the team to the mlx5 core driver. Series generated against: commit 33700a0c9b56 ("net/tcp: Don't consider TCP_CLOSE in TCP_AO_ESTABLISHED") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5: Always stop health timer during driver removalShay Drory
Currently, if teardown_hca fails to execute during driver removal, mlx5 does not stop the health timer. Afterwards, mlx5 continue with driver teardown. This may lead to a UAF bug, which results in page fault Oops[1], since the health timer invokes after resources were freed. Hence, stop the health monitor even if teardown_hca fails. [1] mlx5_core 0000:18:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: cleanup mlx5_core 0000:18:00.0: wait_func:1155:(pid 1967079): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource mlx5_core 0000:18:00.0: mlx5_function_close:1288:(pid 1967079): tear_down_hca failed, skip cleanup BUG: unable to handle page fault for address: ffffa26487064230 PGD 100c00067 P4D 100c00067 PUD 100e5a067 PMD 105ed7067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------- --- 6.7.0-68.fc38.x86_64 #1 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0013.121520200651 12/15/2020 RIP: 0010:ioread32be+0x34/0x60 RSP: 0018:ffffa26480003e58 EFLAGS: 00010292 RAX: ffffa26487064200 RBX: ffff9042d08161a0 RCX: ffff904c108222c0 RDX: 000000010bbf1b80 RSI: ffffffffc055ddb0 RDI: ffffa26487064230 RBP: ffff9042d08161a0 R08: 0000000000000022 R09: ffff904c108222e8 R10: 0000000000000004 R11: 0000000000000441 R12: ffffffffc055ddb0 R13: ffffa26487064200 R14: ffffa26480003f00 R15: ffff904c108222c0 FS: 0000000000000000(0000) GS:ffff904c10800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa26487064230 CR3: 00000002c4420006 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? __die+0x23/0x70 ? page_fault_oops+0x171/0x4e0 ? exc_page_fault+0x175/0x180 ? asm_exc_page_fault+0x26/0x30 ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? ioread32be+0x34/0x60 mlx5_health_check_fatal_sensors+0x20/0x100 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] poll_health+0x42/0x230 [mlx5_core] ? __next_timer_interrupt+0xbc/0x110 ? __pfx_poll_health+0x10/0x10 [mlx5_core] call_timer_fn+0x21/0x130 ? __pfx_poll_health+0x10/0x10 [mlx5_core] __run_timers+0x222/0x2c0 run_timer_softirq+0x1d/0x40 __do_softirq+0xc9/0x2c8 __irq_exit_rcu+0xa6/0xc0 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:cpuidle_enter_state+0xcc/0x440 ? cpuidle_enter_state+0xbd/0x440 cpuidle_enter+0x2d/0x40 do_idle+0x20d/0x270 cpu_startup_entry+0x2a/0x30 rest_init+0xd0/0xd0 arch_call_rest_init+0xe/0x30 start_kernel+0x709/0xa90 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0x96/0xa0 secondary_startup_64_no_verify+0x18f/0x19b ---[ end trace 0000000000000000 ]--- Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5: Stop waiting for PCI if pci channel is offlineMoshe Shemesh
In case pci channel becomes offline the driver should not wait for PCI reads during health dump and recovery flow. The driver has timeout for each of these loops trying to read PCI, so it would fail anyway. However, in case of recovery waiting till timeout may cause the pci error_detected() callback fail to meet pci_dpc_recovered() wait timeout. Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: ethernet: mtk_eth_soc: handle dma buffer size soc specificFrank Wunderlich
The mainline MTK ethernet driver suffers long time from rarly but annoying tx queue timeouts. We think that this is caused by fixed dma sizes hardcoded for all SoCs. We suspect this problem arises from a low level of free TX DMADs, the TX Ring alomost full. The transmit timeout is caused by the Tx queue not waking up. The Tx queue stops when the free counter is less than ring->thres, and it will wake up once the free counter is greater than ring->thres. If the CPU is too late to wake up the Tx queues, it may cause a transmit timeout. Therefore, we increased the TX and RX DMADs to improve this error situation. Use the dma-size implementation from SDK in a per SoC manner. In difference to SDK we have no RSS feature yet, so all RX/TX sizes should be raised from 512 to 2048 byte except fqdma on mt7988 to avoid the tx timeout issue. Fixes: 656e705243fd ("net-next: mediatek: add support for MT7623 ethernet") Suggested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Merge branch 'tcp-rto-min-us'David S. Miller
Kevin Yang says: ==================== tcp: add sysctl_tcp_rto_min_us Adding a sysctl knob to allow user to specify a default rto_min at socket init time. After this patch series, the rto_min will has multiple sources: route option has the highest precedence, followed by the TCP_BPF_RTO_MIN socket option, followed by this new tcp_rto_min_us sysctl. v3: fix typo, simplify min/max_t to min/max v2: fit line width to 80 column. v2: https://lore.kernel.org/netdev/20240530153436.2202800-1-yyd@google.com/ v1: https://lore.kernel.org/netdev/20240528171320.1332292-1-yyd@google.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: add sysctl_tcp_rto_min_usKevin Yang
Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min. Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: derive delack_max with tcp_rto_min helperKevin Yang
Rto_min now has multiple sources, ordered by preprecedence high to low: ip route option rto_min, icsk->icsk_rto_min. When derive delack_max from rto_min, we should not only use ip route option, but should use tcp_rto_min helper to get the correct rto_min. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05arm64/io: add constant-argument checkArnd Bergmann
In some configurations __const_iowrite32_copy() does not get inlined and gcc runs into the BUILD_BUG(): In file included from <command-line>: In function '__const_memcpy_toio_aligned32', inlined from '__const_iowrite32_copy' at arch/arm64/include/asm/io.h:203:3, inlined from '__const_iowrite32_copy' at arch/arm64/include/asm/io.h:199:20: include/linux/compiler_types.h:487:45: error: call to '__compiletime_assert_538' declared with attribute error: BUILD_BUG failed 487 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^ include/linux/compiler_types.h:468:25: note: in definition of macro '__compiletime_assert' 468 | prefix ## suffix(); \ | ^~~~~~ include/linux/compiler_types.h:487:9: note: in expansion of macro '_compiletime_assert' 487 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") | ^~~~~~~~~~~~~~~~ arch/arm64/include/asm/io.h:193:17: note: in expansion of macro 'BUILD_BUG' 193 | BUILD_BUG(); | ^~~~~~~~~ Move the check for constant arguments into the inline function to ensure it is still constant if the compiler decides against inlining it, and mark them as __always_inline to override the logic that sometimes leads to the compiler not producing the simplified output. Note that either the __always_inline annotation or the check for a constant value are sufficient here, but combining the two looks cleaner as it also avoids the macro. With clang-8 and older, the macro was still needed, but all versions of gcc and clang can reliably perform constant folding here. Fixes: ead79118dae6 ("arm64/io: Provide a WC friendly __iowriteXX_copy()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240604210006.668912-1-arnd@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-06-05ASoC: mediatek: mt8183-da7219-max98357: Fix kcontrol name collisionHsin-Te Yuan
Since "Headphone Switch" kcontrol name has already been used by da7219, rename the control name from "Headphone" to "Headphones" to prevent the colision. Also, this change makes kcontrol name align with the one in mt8186-mt6366-da7219-max98357.c. Fixes: 9c7388baa2053 ("ASoC: mediatek: mt8183-da7219-max98357: Map missing jack kcontrols") Change-Id: I9ae69a4673cd04786b247cc514fdd20f878ef009 Signed-off-by: Hsin-Te Yuan <yuanhsinte@chromium.org> Reviewed-by: Chen-Yu Tsai <wenst@chromium.org> Link: https://msgid.link/r/20240531-da7219-v1-1-ac3343f3ae6a@chromium.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-06-05ASoC: rockchip: i2s-tdm: Fix trcm mode by setting clock on right mclkAlibek Omarov
When TRCM mode is enabled, I2S RX and TX clocks are synchronized through selected clock source. Without this fix BCLK and LRCK might get parented to an uninitialized MCLK and the DAI will receive data at wrong pace. However, unlike in original i2s-tdm driver, there is no need to manually synchronize mclk_rx and mclk_tx, as only one gets used anyway. Tested on a board with RK3568 SoC and Silergy SY24145S codec with enabled and disabled TRCM mode. Fixes: 9e2ab4b18ebd ("ASoC: rockchip: i2s-tdm: Fix inaccurate sampling rates") Signed-off-by: Alibek Omarov <a1ba.omarov@gmail.com> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Link: https://msgid.link/r/20240604184752.697313-1-a1ba.omarov@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2024-06-05rtnetlink: make the "split" NLM_DONE handling genericJakub Kicinski
Jaroslav reports Dell's OMSA Systems Management Data Engine expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo() and inet_dump_ifaddr(). We already added a similar fix previously in commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again") Instead of modifying all the dump handlers, and making them look different than modern for_each_netdev_dump()-based dump handlers - put the workaround in rtnetlink code. This will also help us move the custom rtnl-locking from af_netlink in the future (in net-next). Note that this change is not touching rtnl_dump_all(). rtnl_dump_all() is different kettle of fish and a potential problem. We now mix families in a single recvmsg(), but NLM_DONE is not coalesced. Tested: ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \ --dump getaddr --json '{"ifa-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \ --dump getroute --json '{"rtm-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \ --dump getlink Fixes: 3e41af90767d ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()") Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Merge branch 'tcp-mptcp-close-wait'David S. Miller
Jason Xing says: ==================== tcp/mptcp: count CLOSE-WAIT for CurrEstab Taking CLOSE-WAIT sockets into CurrEstab counters is in accordance with RFC 1213, as suggested by Eric and Neal. v5 Link: https://lore.kernel.org/all/20240531091753.75930-1-kerneljasonxing@gmail.com/ 1. add more detailed comment (Matthieu) v4 Link: https://lore.kernel.org/all/20240530131308.59737-1-kerneljasonxing@gmail.com/ 1. correct the Fixes: tag in patch [2/2]. (Eric) Previous discussion Link: https://lore.kernel.org/all/20240529033104.33882-1-kerneljasonxing@gmail.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05mptcp: count CLOSE-WAIT sockets for MPTCP_MIB_CURRESTABJason Xing
Like previous patch does in TCP, we need to adhere to RFC 1213: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." So let's consider CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) Only if we change the state to ESTABLISHED. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Fixes: d9cd27b8cd19 ("mptcp: add CurrEstab MIB counter support") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTABJason Xing
According to RFC 1213, we should also take CLOSE-WAIT sockets into consideration: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." After this, CurrEstab counter will display the total number of ESTABLISHED and CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) if we change the state to ESTABLISHED. b) if we change the state from SYN-RECEIVED to CLOSE-WAIT. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Please note: there are two chances that old state of socket can be changed to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED. So we have to take care of the former case. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stampEric Dumazet
These fields can be read and written locklessly, add annotations around these minor races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05octeontx2-af: Add debugfs support to dump NIX TM topologyAnshumali Gaur
This patch adds support to dump NIX transmit queue topology. There are multiple levels of scheduling/shaping supported by NIX and a packet traverses through multiple levels before sending the packet out. At each level, there are set of scheduling/shaping rules applied to a packet flow. Each packet traverses through multiple levels SQ->SMQ->TL4->TL3->TL2->TL1 and these levels are mapped in a parent-child relationship. This patch dumps the debug information related to all TM Levels in the following way. Example: $ echo <nixlf> > /sys/kernel/debug/octeontx2/nix/tm_tree $ cat /sys/kernel/debug/octeontx2/nix/tm_tree A more desriptive set of registers at each level can be dumped in the following way. Example: $ echo <nixlf> > /sys/kernel/debug/octeontx2/nix/tm_topo $ cat /sys/kernel/debug/octeontx2/nix/tm_topo Signed-off-by: Anshumali Gaur <agaur@marvell.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attrTao Su
Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn(). Before checking the mismatch of private vs. shared, mmu_invalidate_seq is saved to fault->mmu_seq, which can be used to detect an invalidation related to the gfn occurred, i.e. KVM will not install a mapping in page table if fault->mmu_seq != mmu_invalidate_seq. Currently there is a second snapshot of mmu_invalidate_seq, which may not be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute may be changed between the two snapshots, but the gfn may be mapped in page table without hindrance. Therefore, drop the second snapshot as it has no obvious benefits. Fixes: f6adeae81f35 ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()") Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05Merge tag 'kvmarm-fixes-6.10-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.10, take #1 - Large set of FP/SVE fixes for pKVM, addressing the fallout from the per-CPU data rework and making sure that the host is not involved in the FP/SVE switching any more - Allow FEAT_BTI to be enabled with NV now that FEAT_PAUTH is copletely supported - Fix for the respective priorities of Failed PAC, Illegal Execution state and Instruction Abort exceptions - Fix the handling of AArch32 instruction traps failing their condition code, which was broken by the introduction of ESR_EL2.ISS2 - Allow vpcus running in AArch32 state to be restored in System mode - Fix AArch32 GPR restore that would lose the 64 bit state under some conditions
2024-06-05arm64: armv8_deprecated: Fix warning in isndep cpuhp starting processWei Li
The function run_all_insn_set_hw_mode() is registered as startup callback of 'CPUHP_AP_ARM64_ISNDEP_STARTING', it invokes set_hw_mode() methods of all emulated instructions. As the STARTING callbacks are not expected to fail, if one of the set_hw_mode() fails, e.g. due to el0 mixed-endian is not supported for 'setend', it will report a warning: ``` CPU[2] cannot support the emulation of setend CPU 2 UP state arm64/isndep:starting (136) failed (-22) CPU2: Booted secondary processor 0x0000000002 [0x414fd0c1] ``` To fix it, add a check for INSN_UNAVAILABLE status and skip the process. Signed-off-by: Wei Li <liwei391@huawei.com> Tested-by: Huisong Li <lihuisong@huawei.com> Link: https://lore.kernel.org/r/20240423093501.3460764-1-liwei391@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2024-06-05selftests: hsr: add missing config for CONFIG_BRIDGEHangbin Liu
hsr_redbox.sh test need to create bridge for testing. Add the missing config CONFIG_BRIDGE in config file. Fixes: eafbf0574e05 ("test: hsr: Extend the hsr_redbox.sh to have more SAN devices connected") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05vxlan: Fix regression when dropping packets due to invalid src addressesDaniel Borkmann
Commit f58f45c1e5b9 ("vxlan: drop packets from invalid src-address") has recently been added to vxlan mainly in the context of source address snooping/learning so that when it is enabled, an entry in the FDB is not being created for an invalid address for the corresponding tunnel endpoint. Before commit f58f45c1e5b9 vxlan was similarly behaving as geneve in that it passed through whichever macs were set in the L2 header. It turns out that this change in behavior breaks setups, for example, Cilium with netkit in L3 mode for Pods as well as tunnel mode has been passing before the change in f58f45c1e5b9 for both vxlan and geneve. After mentioned change it is only passing for geneve as in case of vxlan packets are dropped due to vxlan_set_mac() returning false as source and destination macs are zero which for E/W traffic via tunnel is totally fine. Fix it by only opting into the is_valid_ether_addr() check in vxlan_set_mac() when in fact source address snooping/learning is actually enabled in vxlan. This is done by moving the check into vxlan_snoop(). With this change, the Cilium connectivity test suite passes again for both tunnel flavors. Fixes: f58f45c1e5b9 ("vxlan: drop packets from invalid src-address") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Bauer <mail@david-bauer.net> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Bauer <mail@david-bauer.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: sched: sch_multiq: fix possible OOB write in multiq_tune()Hangyu Hua
q->bands will be assigned to qopt->bands to execute subsequent code logic after kmalloc. So the old q->bands should not be used in kmalloc. Otherwise, an out-of-bounds write will occur. Fixes: c2999f7fb05b ("net: sched: multiq: don't call qdisc_put() while holding tree lock") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05ionic: fix kernel panic in XDP_TX actionTaehee Yoo
In the XDP_TX path, ionic driver sends a packet to the TX path with rx page and corresponding dma address. After tx is done, ionic_tx_clean() frees that page. But RX ring buffer isn't reset to NULL. So, it uses a freed page, which causes kernel panic. BUG: unable to handle page fault for address: ffff8881576c110c PGD 773801067 P4D 773801067 PUD 87f086067 PMD 87efca067 PTE 800ffffea893e060 Oops: Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN NOPTI CPU: 1 PID: 25 Comm: ksoftirqd/1 Not tainted 6.9.0+ #11 Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021 RIP: 0010:bpf_prog_f0b8caeac1068a55_balancer_ingress+0x3b/0x44f Code: 00 53 41 55 41 56 41 57 b8 01 00 00 00 48 8b 5f 08 4c 8b 77 00 4c 89 f7 48 83 c7 0e 48 39 d8 RSP: 0018:ffff888104e6fa28 EFLAGS: 00010283 RAX: 0000000000000002 RBX: ffff8881576c1140 RCX: 0000000000000002 RDX: ffffffffc0051f64 RSI: ffffc90002d33048 RDI: ffff8881576c110e RBP: ffff888104e6fa88 R08: 0000000000000000 R09: ffffed1027a04a23 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881b03a21a8 R13: ffff8881589f800f R14: ffff8881576c1100 R15: 00000001576c1100 FS: 0000000000000000(0000) GS:ffff88881ae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8881576c110c CR3: 0000000767a90000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x254/0x790 ? __pfx_page_fault_oops+0x10/0x10 ? __pfx_is_prefetch.constprop.0+0x10/0x10 ? search_bpf_extables+0x165/0x260 ? fixup_exception+0x4a/0x970 ? exc_page_fault+0xcb/0xe0 ? asm_exc_page_fault+0x22/0x30 ? 0xffffffffc0051f64 ? bpf_prog_f0b8caeac1068a55_balancer_ingress+0x3b/0x44f ? do_raw_spin_unlock+0x54/0x220 ionic_rx_service+0x11ab/0x3010 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ? ionic_tx_clean+0x29b/0xc60 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ? __pfx_ionic_tx_clean+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ? __pfx_ionic_rx_service+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ? ionic_tx_cq_service+0x25d/0xa00 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ? __pfx_ionic_rx_service+0x10/0x10 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ionic_cq_service+0x69/0x150 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] ionic_txrx_napi+0x11a/0x540 [ionic 9180c3001ab627d82bbc5f3ebe8a0decaf6bb864] __napi_poll.constprop.0+0xa0/0x440 net_rx_action+0x7e7/0xc30 ? __pfx_net_rx_action+0x10/0x10 Fixes: 8eeed8373e1c ("ionic: Add XDP_TX support") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Merge branch 'devlink-const'David S. Miller
Christophe JAILLET says: ==================== devlink: Constify struct devlink_dpipe_table_ops Patch 1 updates devl_dpipe_table_register() and struct devlink_dpipe_table to accept "const struct devlink_dpipe_table_ops". Then patch 2 updates the only user of this function. This is compile tested only. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05mlxsw: spectrum_router: Constify struct devlink_dpipe_table_opsChristophe JAILLET
'struct devlink_dpipe_table_ops' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 15557 712 0 16269 3f8d drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o After: ===== text data bss dec hex filename 15789 488 0 16277 3f95 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()Christophe JAILLET
"struct devlink_dpipe_table_ops" only contains some function pointers. Update "struct devlink_dpipe_table" and the 'table_ops' parameter of devl_dpipe_table_register() so that structures in drivers can be constified. Constifying these structures will move some data to a read-only section, so increase overall security. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: phy: aquantia: add support for PHY LEDsDaniel Golle
Aquantia Ethernet PHYs got 3 LED output pins which are typically used to indicate link status and activity. Add a minimal LED controller driver supporting the most common uses with the 'netdev' trigger as well as software-driven forced control of the LEDs. Signed-off-by: Daniel Golle <daniel@makrotopia.org> [ rework indentation, fix checkpatch error and improve some functions ] Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: phy: aquantia: move priv and hw stat to headerChristian Marangi
In preparation for LEDs support, move priv and hw stat to header to reference priv struct also in other .c outside aquantia.main Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: ethtool: remove unused struct 'cable_test_tdr_req_info'Dr. David Alan Gilbert
'cable_test_tdr_req_info' is unused since the original commit f2bc8ad31a7f ("net: ethtool: Allow PHY cable test TDR data to configured"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>