summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-11-30octeontx2-af: Slightly simplify rvu_npc_exact_init()Christophe JAILLET
Use kzalloc() instead of kmalloc()/memset(). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/60ea220ccf3b61963f7d5a97e3df2c76a5feb837.1669378798.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30octeontx2-af: Fix a potentially spurious error messageChristophe JAILLET
When this error message is displayed, we know that the all the bits in the bitmap are set. So, bitmap_weight() will return the number of bits of the bitmap, which is 'table->tot_ids'. It is unlikely that a bit will be cleared between mutex_unlock() and dev_err(), but, in order to simplify the code and avoid this possibility, just take 'table->tot_ids'. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/5ce01c402f86412dc57884ff0994b63f0c5b3871.1669378798.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30net: broadcom: Add PTP_1588_CLOCK_OPTIONAL dependency for BCMGENET under ↵YueHaibing
ARCH_BCM2835 commit 8d820bc9d12b ("net: broadcom: Fix BCMGENET Kconfig") fixes the build that contain 99addbe31f55 ("net: broadcom: Select BROADCOM_PHY for BCMGENET") and enable BCMGENET=y but PTP_1588_CLOCK_OPTIONAL=m, which otherwise leads to a link failure. However this may trigger a runtime failure. Fix the original issue by propagating the PTP_1588_CLOCK_OPTIONAL dependency of BROADCOM_PHY down to BCMGENET. Fixes: 8d820bc9d12b ("net: broadcom: Fix BCMGENET Kconfig") Fixes: 99addbe31f55 ("net: broadcom: Select BROADCOM_PHY for BCMGENET") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20221125115003.30308-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30bpf: Fix a compilation failure with clang lto buildYonghong Song
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the following compilation error will appear: $ make LLVM=1 LLVM_IAS=1 -j ... ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined cgroup_storage_map_btf_ids:; ^ make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1 In local_storage.c, we have BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map) Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker complains with lto build. Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times for sk, inode, task and cgrp local storages. Let us define a single global one with a different name than cgroup_storage_map_btf_ids, which also fixed the lto compilation error. Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
2022-12-01selftests/bpf: Add ingress tests for txmsg with apply_bytesPengcheng Yang
Currently, the ingress redirect is not covered in "txmsg test apply". Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1669718441-2654-5-git-send-email-yangpc@wangsu.com
2022-12-01bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirectPengcheng Yang
Use apply_bytes on ingress redirect, when apply_bytes is less than the length of msg data, some data may be skipped and lost in bpf_tcp_ingress(). If there is still data in the scatterlist that has not been consumed, we cannot move the msg iter. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1669718441-2654-4-git-send-email-yangpc@wangsu.com
2022-12-01bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytesPengcheng Yang
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS flag from the msg->flags. If apply_bytes is used and it is larger than the current data being processed, sk_psock_msg_verdict() will not be called when sendmsg() is called again. At this time, the msg->flags is 0, and we lost the BPF_F_INGRESS flag. So we need to save the BPF_F_INGRESS flag in sk_psock and use it when redirection. Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com
2022-12-01bpf, sockmap: Fix repeated calls to sock_put() when msg has more_dataPengcheng Yang
In tcp_bpf_send_verdict() redirection, the eval variable is assigned to __SK_REDIRECT after the apply_bytes data is sent, if msg has more_data, sock_put() will be called multiple times. We should reset the eval variable to __SK_NONE every time more_data starts. This causes: IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7 ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110 Modules linked in: CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1 Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014 Call Trace: <TASK> __tcp_transmit_skb+0xa1b/0xb90 ? __alloc_skb+0x8c/0x1a0 ? __kmalloc_node_track_caller+0x184/0x320 tcp_write_xmit+0x22a/0x1110 __tcp_push_pending_frames+0x32/0xf0 do_tcp_sendpages+0x62d/0x640 tcp_bpf_push+0xae/0x2c0 tcp_bpf_sendmsg_redir+0x260/0x410 ? preempt_count_add+0x70/0xa0 tcp_bpf_send_verdict+0x386/0x4b0 tcp_bpf_sendmsg+0x21b/0x3b0 sock_sendmsg+0x58/0x70 __sys_sendto+0xfa/0x170 ? xfd_validate_state+0x1d/0x80 ? switch_fpu_return+0x59/0xe0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1669718441-2654-2-git-send-email-yangpc@wangsu.com
2022-11-30Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A set of clk driver fixes that resolve issues for various SoCs. Most of these are incorrect clk data, like bad parent descriptions. When the clk tree is improperly described things don't work, like USB and UFS controllers, because clk frequencies are wonky. Here are the extra details: - Fix the parent of UFS reference clks on Qualcomm SC8280XP so that UFS works properly - Fix the clk ID for USB on AT91 RM9200 so the USB driver continues to probe - Stop using of_device_get_match_data() on the wrong device for a Samsung Exynos driver so it gets the proper clk data - Fix ExynosAutov9 binding - Fix the parent of the div4 clk on Exynos7885 - Stop calling runtime PM APIs from the Qualcomm GDSC driver directly as it leads to a lockdep splat and is just plain wrong because it violates runtime PM semantics by calling runtime PM APIs when the device has been runtime PM disabled" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: qcom: gcc-sc8280xp: add cxo as parent for three ufs ref clks ARM: at91: rm9200: fix usb device clock id clk: samsung: Revert "clk: samsung: exynos-clkout: Use of_device_get_match_data()" dt-bindings: clock: exynosautov9: fix reference to CMU_FSYS1 clk: qcom: gdsc: Remove direct runtime PM calls clk: samsung: exynos7885: Correct "div4" clock parents
2022-11-30bpf: Tighten ptr_to_btf_id checks.Alexei Starovoitov
The networking programs typically don't require CAP_PERFMON, but through kfuncs like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In such case enforce CAP_PERFMON. Also make sure that only GPL programs can access kernel data structures. All kfuncs require GPL already. Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and different name for the same check only causes confusion. Fixes: fd264ca02094 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx") Fixes: 50c6b8a9aea2 ("selftests/bpf: Add a test for btf_type_tag "percpu"") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
2022-12-01selftests/bpf: Add bench test to arm64 and s390x denylistDaniel Borkmann
BPF CI fails for arm64 and s390x each with the following result: [...] All error logs: serial_test_kprobe_multi_bench_attach:PASS:get_syms 0 nsec serial_test_kprobe_multi_bench_attach:PASS:kprobe_multi_empty__open_and_load 0 nsec libbpf: prog 'test_kprobe_empty': failed to attach: Operation not supported serial_test_kprobe_multi_bench_attach:FAIL:bpf_program__attach_kprobe_multi_opts unexpected error: -95 #92 kprobe_multi_bench_attach:FAIL [...] Add the test to the deny list. Fixes: 5b6c7e5c4434 ("selftests/bpf: Add attach bench test") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2022-11-30revert "kbuild: fix -Wimplicit-function-declaration in ↵Andrew Morton
license_is_gpl_compatible" It causes build failures with unusual CC/HOSTCC combinations. Quoting https://lkml.kernel.org/r/A222B1E6-69B8-4085-AD1B-27BDB72CA971@goldelico.com: HOSTCC scripts/mod/modpost.o - due to target missing In file included from include/linux/string.h:5, from scripts/mod/../../include/linux/license.h:5, from scripts/mod/modpost.c:24: include/linux/compiler.h:246:10: fatal error: asm/rwonce.h: No such file or directory 246 | #include <asm/rwonce.h> | ^~~~~~~~~~~~~~ compilation terminated. ... The problem is that HOSTCC is not necessarily the same compiler or even architecture as CC and pulling in <linux/compiler.h> or <asm/rwonce.h> files indirectly isn't a good idea then. My toolchain is providing HOSTCC = gcc (MacPorts) and CC = arm-linux-gnueabihf (built from gcc source) and all running on Darwin. If I change the include to <string.h> I can then "HOSTCC scripts/mod/modpost.c" but then it fails for "CC kernel/module/main.c" not finding <string.h>: CC kernel/module/main.o - due to target missing In file included from kernel/module/main.c:43:0: ./include/linux/license.h:5:20: fatal error: string.h: No such file or directory #include <string.h> ^ compilation terminated. Reported-by: "H. Nikolaus Schaller" <hns@goldelico.com> Cc: Sam James <sam@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabledLee Jones
When enabled, KASAN enlarges function's stack-frames. Pushing quite a few over the current threshold. This can mainly be seen on 32-bit architectures where the present limit (when !GCC) is a lowly 1024-Bytes. Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.org Signed-off-by: Lee Jones <lee@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Leo Li <sunpeng.li@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frameLee Jones
Patch series "Fix a bunch of allmodconfig errors", v2. Since b339ec9c229aa ("kbuild: Only default to -Werror if COMPILE_TEST") WERROR now defaults to COMPILE_TEST meaning that it's enabled for allmodconfig builds. This leads to some interesting build failures when using Clang, each resolved in this set. With this set applied, I am able to obtain a successful allmodconfig Arm build. This patch (of 2): calculate_bandwidth() is presently broken on all !(X86_64 || SPARC64 || ARM64) architectures built with Clang (all released versions), whereby the stack frame gets blown up to well over 5k. This would cause an immediate kernel panic on most architectures. We'll revert this when the following bug report has been resolved: https://github.com/llvm/llvm-project/issues/41896. Link: https://lkml.kernel.org/r/20221125120750.3537134-1-lee@kernel.org Link: https://lkml.kernel.org/r/20221125120750.3537134-2-lee@kernel.org Signed-off-by: Lee Jones <lee@kernel.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Lee Jones <lee@kernel.org> Cc: Leo Li <sunpeng.li@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: invoke MMU notifiers in shmem/file collapse pathsJann Horn
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: fix GUP-fast interaction by sending IPIJann Horn
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: take the right locks for page table retractionJann Horn
pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: migrate: fix THP's mapcount on isolationGavin Shan
The issue is reported when removing memory through virtio_mem device. The transparent huge page, experienced copy-on-write fault, is wrongly regarded as pinned. The transparent huge page is escaped from being isolated in isolate_migratepages_block(). The transparent huge page can't be migrated and the corresponding memory block can't be put into offline state. Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into offline state. Besides, The page's refcount is increased a bit earlier to avoid the page is released when the check is executed. Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations") Signed-off-by: Gavin Shan <gshan@redhat.com> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> Tested-by: Zhenyu Zhang <zhenyzha@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> [5.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: introduce arch_has_hw_nonleaf_pmd_young()Juergen Gross
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: David Hildenbrand <david@redhat.com> [core changes] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add dummy pmd_young() for architectures not having itJuergen Gross
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a fallback. This is required for the later patch "mm: introduce arch_has_hw_nonleaf_pmd_young()". Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in ↵SeongJae Park
damon_sysfs_set_schemes() Commit da87878010e5 ("mm/damon/sysfs: support online inputs update") made 'damon_sysfs_set_schemes()' to be called for running DAMON context, which could have schemes. In the case, DAMON sysfs interface is supposed to update, remove, or add schemes to reflect the sysfs files. However, the code is assuming the DAMON context wouldn't have schemes at all, and therefore creates and adds new schemes. As a result, the code doesn't work as intended for online schemes tuning and could have more than expected memory footprint. The schemes are all in the DAMON context, so it doesn't leak the memory, though. Remove the wrong asssumption (the DAMON context wouldn't have schemes) in 'damon_sysfs_set_schemes()' to fix the bug. Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"Tiezhu Yang
The latest version of grep claims the egrep is now obsolete so the build now contains warnings that look like: egrep: warning: egrep is obsolescent; using grep -E fix this up by moving the related file to use "grep -E" instead. sed -i "s/egrep/grep -E/g" `grep egrep -rwl tools/vm` Here are the steps to install the latest grep: wget http://ftp.gnu.org/gnu/grep/grep-3.8.tar.gz tar xf grep-3.8.tar.gz cd grep-3.8 && ./configure && make sudo make install export PATH=/usr/local/bin:$PATH Link: https://lkml.kernel.org/r/1668825419-30584-1-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()ZhangPeng
Syzbot reported a null-ptr-deref bug: NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 1 PID: 3603 Comm: segctord Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0 fs/nilfs2/alloc.c:608 Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7 RSP: 0018:ffffc90003dff830 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010 RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158 R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0 Call Trace: <TASK> nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline] nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193 nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236 nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940 nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline] nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline] nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018 nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline] nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045 nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline] nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> ... If DAT metadata file is corrupted on disk, there is a case where req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during a b-tree operation that cascadingly updates ancestor nodes of the b-tree, because nilfs_dat_commit_alloc() for a lower level block can initialize the blocknr on the same DAT entry between nilfs_dat_prepare_end() and nilfs_dat_commit_end(). If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free() without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and causes the NULL pointer dereference above in nilfs_palloc_commit_free_entry() function, which leads to a crash. Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free(). This also calls nilfs_error() in that case to notify that there is a fatal flaw in the filesystem metadata and prevent further operations. Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processingMike Kravetz
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races. This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG. Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas(). This is used to indicate the 'final' unmapping of a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and the vm_lock is not deleted. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30madvise: use zap_page_range_single for madvise dontneedMike Kravetz
This series addresses the issue first reported in [1], and fully described in patch 2. Patches 1 and 2 address the user visible issue and are tagged for stable backports. While exploring solutions to this issue, related problems with mmu notification calls were discovered. This is addressed in the patch "hugetlb: remove duplicate mmu notifications:". Since there are no user visible effects, this third is not tagged for stable backports. Previous discussions suggested further cleanup by removing the routine zap_page_range. This is possible because zap_page_range_single is now exported, and all callers of zap_page_range pass ranges entirely within a single vma. This work will be done in a later patch so as not to distract from this bug fix. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ This patch (of 2): Expose the routine zap_page_range_single to zap a range within a single vma. The madvise routine madvise_dontneed_single_vma can use this routine as it explicitly operates on a single vma. Also, update the mmu notification range in zap_page_range_single to take hugetlb pmd sharing into account. This is required as MADV_DONTNEED supports hugetlb vmas. Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODEYang Shi
Syzbot reported the below splat: WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Modules linked in: CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419 do_madvise mm/madvise.c:1432 [inline] __do_sys_madvise mm/madvise.c:1432 [inline] __se_sys_madvise mm/madvise.c:1430 [inline] __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f6b48a4eef9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9 RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000 RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4 R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000 </TASK> It is because khugepaged allocates pages with __GFP_THISNODE, but the preferred node is bogus. The previous patch fixed the khugepaged code to avoid allocating page from non-existing node. But it is still racy against memory hotremove. There is no synchronization with the memory hotplug so it is possible that memory gets offline during a longer taking scanning. So this warning still seems not quite helpful because: * There is no guarantee the node is online for __GFP_THISNODE context for all the callsites. * Kernel just fails the allocation regardless the warning, and it looks all callsites handle the allocation failure gracefully. Although while the warning has helped to identify a buggy code, it is not safe in general and this warning could panic the system with panic-on-warn configuration which tends to be used surprisingly often. So replace VM_WARN_ON to pr_warn(). And the warning will be triggered if __GFP_NOWARN is set since the allocator would print out warning for such case if __GFP_NOWARN is not set. [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp] Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com [akpm@linux-foundation.org: fix whitespace] [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel] Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse") Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30netfilter: conntrack: set icmpv6 redirects as RELATEDFlorian Westphal
icmp conntrack will set icmp redirects as RELATED, but icmpv6 will not do this. For icmpv6, only icmp errors (code <= 128) are examined for RELATED state. ICMPV6 Redirects are part of neighbour discovery mechanism, those are handled by marking a selected subset (e.g. neighbour solicitations) as UNTRACKED, but not REDIRECT -- they will thus be flagged as INVALID. Add minimal support for REDIRECTs. No parsing of neighbour options is added for simplicity, so this will only check that we have the embeeded original header (ND_OPT_REDIRECT_HDR), and then attempt to do a flow lookup for this tuple. Also extend the existing test case to cover redirects. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Eric Garver <eric@garver.life> Link: https://github.com/firewalld/firewalld/issues/1046 Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Garver <eric@garver.life> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30selftests/bpf: Make sure enum-less bpf_enable_stats() API works in C++ modeAndrii Nakryiko
Just a simple test to make sure we don't introduce unwanted compiler warnings and API still supports passing enums as input argument. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221130200013.2997831-2-andrii@kernel.org
2022-11-30libbpf: Avoid enum forward-declarations in public API in C++ modeAndrii Nakryiko
C++ enum forward declarations are fundamentally not compatible with pure C enum definitions, and so libbpf's use of `enum bpf_stats_type;` forward declaration in libbpf/bpf.h public API header is causing C++ compilation issues. More details can be found in [0], but it comes down to C++ supporting enum forward declaration only with explicitly specified backing type: enum bpf_stats_type: int; In C (and I believe it's a GCC extension also), such forward declaration is simply: enum bpf_stats_type; Further, in Linux UAPI this enum is defined in pure C way: enum bpf_stats_type { BPF_STATS_RUN_TIME = 0; } And even though in both cases backing type is int, which can be confirmed by looking at DWARF information, for C++ compiler actual enum definition and forward declaration are incompatible. To eliminate this problem, for C++ mode define input argument as int, which makes enum unnecessary in libbpf public header. This solves the issue and as demonstrated by next patch doesn't cause any unwanted compiler warnings, at least with default warnings setting. [0] https://stackoverflow.com/questions/42766839/c11-enum-forward-causes-underlying-type-mismatch [1] Closes: https://github.com/libbpf/libbpf/issues/249 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221130200013.2997831-1-andrii@kernel.org
2022-11-30selftests/bpf: Avoid pinning prog when attaching to tc ingress in ↵Martin KaFai Lau
btf_skc_cls_ingress This patch removes the need to pin prog when attaching to tc ingress in the btf_skc_cls_ingress test. Instead, directly use the bpf_tc_hook_create() and bpf_tc_attach(). The qdisc clsact will go away together with the netns, so no need to bpf_tc_hook_destroy(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-8-martin.lau@linux.dev
2022-11-30selftests/bpf: Remove serial from tests using {open,close}_netnsMartin KaFai Lau
After removing the mount/umount dance from {open,close}_netns() in the pervious patch, "serial_" can be removed from the tests using {open,close}_netns(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-7-martin.lau@linux.dev
2022-11-30selftests/bpf: Remove the "/sys" mount and umount dance in {open,close}_netnsMartin KaFai Lau
The previous patches have removed the need to do the mount and umount dance when switching netns. In particular: * Avoid remounting /sys/fs/bpf to have a clean start * Avoid remounting /sys to get a ifindex of a particular netns This patch can finally remove the mount and umount dance in {open,close}_netns which is unnecessarily complicated and error-prone. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-6-martin.lau@linux.dev
2022-11-30selftests/bpf: Avoid pinning bpf prog in the netns_load_bpf() callersMartin KaFai Lau
This patch removes the need to pin prog in the remaining tests in tc_redirect.c by directly using the bpf_tc_hook_create() and bpf_tc_attach(). The clsact qdisc will go away together with the test netns, so no need to do bpf_tc_hook_destroy(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-5-martin.lau@linux.dev
2022-11-30selftests/bpf: Avoid pinning bpf prog in the tc_redirect_peer_l3 testMartin KaFai Lau
This patch removes the need to pin prog in the tc_redirect_peer_l3 test by directly using the bpf_tc_hook_create() and bpf_tc_attach(). The clsact qdisc will go away together with the test netns, so no need to do bpf_tc_hook_destroy(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-4-martin.lau@linux.dev
2022-11-30selftests/bpf: Avoid pinning bpf prog in the tc_redirect_dtime testMartin KaFai Lau
This patch removes the need to pin prog in the tc_redirect_dtime test by directly using the bpf_tc_hook_create() and bpf_tc_attach(). The clsact qdisc will go away together with the test netns, so no need to do bpf_tc_hook_destroy(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-3-martin.lau@linux.dev
2022-11-30selftests/bpf: Use if_nametoindex instead of reading the ↵Martin KaFai Lau
/sys/net/class/*/ifindex When switching netns, the setns_by_fd() is doing dances in mount/umounting the /sys directories. One reason is the tc_redirect.c test is depending on the /sys/net/class/*/ifindex instead of using the if_nametoindex(). if_nametoindex() uses ioctl() to get the ifindex. This patch is to move all /sys/net/class/*/ifindex usages to if_nametoindex(). The current code checks ifindex >= 0 which is incorrect. ifindex > 0 should be checked instead. This patch also stores ifindex_veth_src and ifindex_veth_dst since the latter patch will need them. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221129070900.3142427-2-martin.lau@linux.dev
2022-11-30igb: Allocate MSI-X vector when testingAkihiko Odaki
Without this change, the interrupt test fail with MSI-X environment: $ sudo ethtool -t enp0s2 offline [ 43.921783] igb 0000:00:02.0: offline testing starting [ 44.855824] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Down [ 44.961249] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX [ 51.272202] igb 0000:00:02.0: testing shared interrupt [ 56.996975] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX The test result is FAIL The test extra info: Register test (offline) 0 Eeprom test (offline) 0 Interrupt test (offline) 4 Loopback test (offline) 0 Link test (on/offline) 0 Here, "4" means an expected interrupt was not delivered. To fix this, route IRQs correctly to the first MSI-X vector by setting IVAR_MISC. Also, set bit 0 of EIMS so that the vector will not be masked. The interrupt test now runs properly with this change: $ sudo ethtool -t enp0s2 offline [ 42.762985] igb 0000:00:02.0: offline testing starting [ 50.141967] igb 0000:00:02.0: testing shared interrupt [ 56.163957] igb 0000:00:02.0 enp0s2: igb: enp0s2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX The test result is PASS The test extra info: Register test (offline) 0 Eeprom test (offline) 0 Interrupt test (offline) 0 Loopback test (offline) 0 Link test (on/offline) 0 Fixes: 4eefa8f01314 ("igb: add single vector msi-x testing to interrupt test") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-11-30e1000e: Fix TX dispatch conditionAkihiko Odaki
e1000_xmit_frame is expected to stop the queue and dispatch frames to hardware if there is not sufficient space for the next frame in the buffer, but sometimes it failed to do so because the estimated maximum size of frame was wrong. As the consequence, the later invocation of e1000_xmit_frame failed with NETDEV_TX_BUSY, and the frame in the buffer remained forever, resulting in a watchdog failure. This change fixes the estimated size by making it match with the condition for NETDEV_TX_BUSY. Apparently, the old estimation failed to account for the following lines which determines the space requirement for not causing NETDEV_TX_BUSY: ``` /* reserve a descriptor for the offload context */ if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL)) count++; count++; count += DIV_ROUND_UP(len, adapter->tx_fifo_limit); ``` This issue was found when running http-stress02 test included in Linux Test Project 20220930 on QEMU with the following commandline: ``` qemu-system-x86_64 -M q35,accel=kvm -m 8G -smp 8 -drive if=virtio,format=raw,file=root.img,file.locking=on -device e1000e,netdev=netdev -netdev tap,script=ifup,downscript=no,id=netdev ``` Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver (currently for ICH9 devices only)") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-11-30afs: Fix server->active leak in afs_put_serverMarc Dionne
The atomic_read was accidentally replaced with atomic_inc_return, which prevents the server from getting cleaned up and causes rmmod to hang with a warning: Can't purge s=00000001 Fixes: 2757a4dc1849 ("afs: Fix access after dec in put functions") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20221130174053.2665818-1-marc.dionne@auristor.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-11-30netfilter: ipset: Add support for new bitmask parameterVishwanath Pai
Add a new parameter to complement the existing 'netmask' option. The main difference between netmask and bitmask is that bitmask takes any arbitrary ip address as input, it does not have to be a valid netmask. The name of the new parameter is 'bitmask'. This lets us mask out arbitrary bits in the ip address, for example: ipset create set1 hash:ip bitmask 255.128.255.0 ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80 Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30netfilter: conntrack: merge ipv4+ipv6 confirm functionsFlorian Westphal
No need to have distinct functions. After merge, ipv6 can avoid protooff computation if the connection neither needs sequence adjustment nor helper invocation -- this is the normal case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30netfilter: conntrack: add sctp DATA_SENT stateSriram Yagnaraman
SCTP conntrack currently assumes that the SCTP endpoints will probe secondary paths using HEARTBEAT before sending traffic. But, according to RFC 9260, SCTP endpoints can send any traffic on any of the confirmed paths after SCTP association is up. SCTP endpoints that sends INIT will confirm all peer addresses that upper layer configures, and the SCTP endpoint that receives COOKIE_ECHO will only confirm the address it sent the INIT_ACK to. So, we can have a situation where the INIT sender can start to use secondary paths without the need to send HEARTBEAT. This patch allows DATA/SACK packets to create new connection tracking entry. A new state has been added to indicate that a DATA/SACK chunk has been seen in the original direction - SCTP_CONNTRACK_DATA_SENT. State transitions mostly follows the HEARTBEAT_SENT, except on receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK in the reply direction. State transitions in original direction: - DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks, except that it remains in DATA_SENT on receving HEARTBEAT, HEARTBEAT_ACK/DATA/SACK chunks State transitions in reply direction: - DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks, except that it moves to HEARTBEAT_ACKED on receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK chunks Note: This patch still doesn't solve the problem when the SCTP endpoint decides to use primary paths for association establishment but uses a secondary path for association shutdown. We still have to depend on timeout for connections to expire in such a case. Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30drm/i915: fix TLB invalidation for Gen12 video and compute enginesAndrzej Hajda
In case of Gen12 video and compute engines, TLB_INV registers are masked - to modify one bit, corresponding bit in upper half of the register must be enabled, otherwise nothing happens. CVE: CVE-2022-4139 Suggested-by: Chris Wilson <chris.p.wilson@intel.com> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-11-30Merge tag 'asoc-fix-v6.1-rc7' of ↵Takashi Iwai
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v6.1 Some more fixes for v6.1, some of these are very old and were originally intended to get sent for v5.18 but got lost in the shuffle when there was an issue with Linus not liking my branching strategy and I rebuilt bits of my workflow. The ops changes have been validated by people looking at real hardware and are how things getting dropped got noticed.
2022-11-30gpio: amd8111: Fix PCI device reference count leakXiongfeng Wang
for_each_pci_dev() is implemented by pci_get_device(). The comment of pci_get_device() says that it will increase the reference count for the returned pci_dev and also decrease the reference count for the input pci_dev @from if it is not NULL. If we break for_each_pci_dev() loop with pdev not NULL, we need to call pci_dev_put() to decrease the reference count. Add the missing pci_dev_put() after the 'out' label. Since pci_dev_put() can handle NULL input parameter, there is no problem for the 'Device not found' branch. For the normal path, add pci_dev_put() in amd_gpio_exit(). Fixes: f942a7de047d ("gpio: add a driver for GPIO pins found on AMD-8111 south bridge chips") Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2022-11-30KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULTPaolo Bonzini
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by turning it into a vmexit), there is no need to leave vcpu_enter_guest(). Any vcpu->requests will be caught later before the actual vmentry, and in fact vcpu_enter_guest() was not initializing the "r" variable. Depending on the compiler's whims, this could cause the x86_64/triple_fault_event_test test to fail. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30ALSA: dice: fix regression for Lexicon I-ONIX FW810STakashi Sakamoto
For Lexicon I-ONIX FW810S, the call of ioctl(2) with SNDRV_PCM_IOCTL_HW_PARAMS can returns -ETIMEDOUT. This is a regression due to the commit 41319eb56e19 ("ALSA: dice: wait just for NOTIFY_CLOCK_ACCEPTED after GLOBAL_CLOCK_SELECT operation"). The device does not emit NOTIFY_CLOCK_ACCEPTED notification when accepting GLOBAL_CLOCK_SELECT operation with the same parameters as current ones. This commit fixes the regression. When receiving no notification, return -ETIMEDOUT as long as operating for any change. Fixes: 41319eb56e19 ("ALSA: dice: wait just for NOTIFY_CLOCK_ACCEPTED after GLOBAL_CLOCK_SELECT operation") Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp> Link: https://lore.kernel.org/r/20221130130604.29774-1-o-takashi@sakamocchi.jp Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-11-30nvme: fix SRCU protection of nvme_ns_head listCaleb Sander
Walking the nvme_ns_head siblings list is protected by the head's srcu in nvme_ns_head_submit_bio() but not nvme_mpath_revalidate_paths(). Removing namespaces from the list also fails to synchronize the srcu. Concurrent scan work can therefore cause use-after-frees. Hold the head's srcu lock in nvme_mpath_revalidate_paths() and synchronize with the srcu, not the global RCU, in nvme_ns_remove(). Observed the following panic when making NVMe/RDMA connections with native multipath on the Rocky Linux 8.6 kernel (it seems the upstream kernel has the same race condition). Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx; computing capacity != get_capacity(ns->disk). Address 0x50 is dereferenced because ns->disk is NULL. The NULL disk appears to be the result of concurrent scan work freeing the namespace (note the log line in the middle of the panic). [37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064 [37314.299753] PGD 0 P4D 0 [37314.299756] Oops: 0000 [#1] SMP PTI [37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G W X --------- - - 4.18.0-372.32.1.el8test86.x86_64 #1 [37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018 [37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core] [37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core] [37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3 [37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202 [37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000 [37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800 [37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff [37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000 [37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000 [37315.548286] FS: 0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000 [37315.645111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0 [37315.799267] Call Trace: [37315.828515] nvme_update_ns_info+0x1ac/0x250 [nvme_core] [37315.892075] nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core] [37315.961871] ? __blk_mq_free_request+0x6b/0x90 [37316.015021] nvme_scan_work+0x151/0x240 [nvme_core] [37316.073371] process_one_work+0x1a7/0x360 [37316.121318] ? create_worker+0x1a0/0x1a0 [37316.168227] worker_thread+0x30/0x390 [37316.212024] ? create_worker+0x1a0/0x1a0 [37316.258939] kthread+0x10a/0x120 [37316.297557] ? set_kthread_struct+0x50/0x50 [37316.347590] ret_from_fork+0x35/0x40 [37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea [37316.390419] sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core] [37317.645908] CR2: 0000000000000050 Fixes: e7d65803e2bb ("nvme-multipath: revalidate paths during rescan") Signed-off-by: Caleb Sander <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-30nvme-pci: clear the prp2 field when not usedLei Rao
If the prp2 field is not filled in nvme_setup_prp_simple(), the prp2 field is garbage data. According to nvme spec, the prp2 is reserved if the data transfer does not cross a memory page boundary, so clear it to zero if it is not used. Signed-off-by: Lei Rao <lei.rao@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-30netfilter: ctnetlink: fix compilation warning after data race fixes in ct markPablo Neira Ayuso
All warnings (new ones prefixed by >>): net/netfilter/nf_conntrack_netlink.c: In function '__ctnetlink_glue_build': >> net/netfilter/nf_conntrack_netlink.c:2674:13: warning: unused variable 'mark' [-Wunused-variable] 2674 | u32 mark; | ^~~~ Fixes: 52d1aa8b8249 ("netfilter: conntrack: Fix data-races around ct mark") Reported-by: kernel test robot <lkp@intel.com> Tested-by: Ivan Babrou <ivan@ivan.computer> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>