summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-09-08Merge tag 'pm-6.6-rc1-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix an Intel RAPL power capping driver regression introduced during the 6.5 development cycle (Srinivas Pandruvada)" * tag 'pm-6.6-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: powercap: intel_rapl: Fix invalid setting of Power Limit 4
2023-09-08Merge tag 'gpio-fixes-for-v6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fix from Bartosz Golaszewski: - fix a regression in irqchip setup in gpio-zynq * tag 'gpio-fixes-for-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: zynq: restore zynq_gpio_irq_reqres/zynq_gpio_irq_relres callbacks
2023-09-08Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"Bjorn Helgaas
This reverts commit d5af729dc2071273f14cbb94abbc60608142fd83. d5af729dc207 ("PCI: Mark NVIDIA T4 GPUs to avoid bus reset") avoided Secondary Bus Reset on the T4 because the reset seemed to not work when the T4 was directly attached to a Root Port. But NVIDIA thinks the issue is probably related to some issue with the Root Port, not with the T4. The T4 provides neither PM nor FLR reset, so masking bus reset compromises this device for assignment scenarios. Revert d5af729dc207 as requested by Wu Zongyong. This will leave SBR broken in the specific configuration Wu tested, as it was in v6.5, so Wu will debug that further. Link: https://lore.kernel.org/r/ZPqMCDWvITlOLHgJ@wuzongyong-alibaba Link: https://lore.kernel.org/r/20230908201104.GA305023@bhelgaas Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-09-08Merge tag 'sound-fix-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of fixes for 6.6-rc1. All small and easy ones. - The corrections of the previous PCM iov_iter transitions - Regression fixes in MIDI 2.0 / USB changes - Various ASoC codec fixes for Cirrus, Realtek, WCD - ASoC AMD quirks and ASoC Intel AVS driver workaround" * tag 'sound-fix-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (21 commits) ALSA: hda/realtek - ALC287 I2S speaker platform support ASoC: amd: yc: Fix a non-functional mic on Lenovo 82TL ASoC: Intel: avs: Provide support for fallback topology ALSA: seq: Fix snd_seq_expand_var_event() call to user-space ALSA: usb-audio: Fix potential memory leaks at error path for UMP open ALSA: hda/cirrus: Fix broken audio on hardware with two CS42L42 codecs. ASoC: rt5645: NULL pointer access when removing jack ASoC: amd: yc: Add DMI entries to support Victus by HP Gaming Laptop 15-fb0xxx (8A3E) MAINTAINERS: Update the MAINTAINERS enties for TEXAS INSTRUMENTS ASoC DRIVERS ALSA: sb: Fix wrong argument in commented code ALSA: pcm: Fix error checks of default read/write copy ops ASoC: Name iov_iter argument as iterator instead of buffer ASoC: dmaengine: Drop unused iov_iter for process callback ALSA: hda/tas2781: Use standard clamp() macro ASoC: cs35l56: Waiting for firmware to boot must be tolerant of I/O errors ASoC: dt-bindings: fsl_easrc: Add support for imx8mp-easrc ASoC: cs42l43: Fix missing error code in cs42l43_codec_probe() ASoC: cs35l45: Rename DACPCM1 Source control ASoC: cs35l45: Fix "Dead assigment" warning ASoC: cs35l45: Add support for Chip ID 0x35A460 ...
2023-09-08Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The main one is a fix for a broken strscpy() conversion that landed in the merge window and broke early parsing of the kernel command line. - Fix an incorrect mask in the CXL PMU driver - Fix a regression in early parsing of the kernel command line - Fix an IP checksum OoB access reported by syzbot" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: csum: Fix OoB access in IP checksum code for negative lengths arm64/sysreg: Fix broken strncpy() -> strscpy() conversion perf: CXL: fix mismatched number of counters mask
2023-09-08Merge tag 'loongarch-6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch updates from Huacai Chen: - Allow usage of LSX/LASX in the kernel, and use them for SIMD-optimized RAID5/RAID6 routines - Add Loongson Binary Translation (LBT) extension support - Add basic KGDB & KDB support - Add building with kcov coverage - Add KFENCE (Kernel Electric-Fence) support - Add KASAN (Kernel Address Sanitizer) support - Some bug fixes and other small changes - Update the default config file * tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (25 commits) LoongArch: Update Loongson-3 default config file LoongArch: Add KASAN (Kernel Address Sanitizer) support LoongArch: Simplify the processing of jumping new kernel for KASLR kasan: Add (pmd|pud)_init for LoongArch zero_(pud|p4d)_populate process kasan: Add __HAVE_ARCH_SHADOW_MAP to support arch specific mapping LoongArch: Add KFENCE (Kernel Electric-Fence) support LoongArch: Get partial stack information when providing regs parameter LoongArch: mm: Add page table mapped mode support for virt_to_page() kfence: Defer the assignment of the local variable addr LoongArch: Allow building with kcov coverage LoongArch: Provide kaslr_offset() to get kernel offset LoongArch: Add basic KGDB & KDB support LoongArch: Add Loongson Binary Translation (LBT) extension support raid6: Add LoongArch SIMD recovery implementation raid6: Add LoongArch SIMD syndrome calculation LoongArch: Add SIMD-optimized XOR routines LoongArch: Allow usage of LSX/LASX in the kernel LoongArch: Define symbol 'fault' as a local label in fpu.S LoongArch: Adjust {copy, clear}_user exception handler behavior LoongArch: Use static defined zero page rather than allocated ...
2023-09-08Merge tag 'printk-for-6.6-fixup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Revert exporting symbols needed for dumping the raw printk buffer in panic(). I pushed the export prematurely before the user was ready for merging into the mainline. * tag 'printk-for-6.6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: Revert "printk: export symbols for debug modules"
2023-09-08Merge tag 'landlock-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux Pull landlock updates from Mickaël Salaün: "One test fix and a __counted_by annotation" * tag 'landlock-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: selftests/landlock: Fix a resource leak landlock: Annotate struct landlock_rule with __counted_by
2023-09-08soc: renesas: Kconfig: For ARCH_R9A07G043 select the required configs if ↵Lad Prabhakar
dependencies are met To prevent randconfig build issues when enabling the RZ/Five SoC, consider selecting specific configurations only when their dependencies are satisfied. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308311610.ec6bm2G8-lkp@intel.com/ Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Fixes: 484861e09f3e ("soc: renesas: Kconfig: Select the required configs for RZ/Five SoC") Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20230901110936.313171-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08riscv: Kconfig.errata: Add dependency for RISCV_SBI in ERRATA_ANDES configLad Prabhakar
Andes errata uses sbi_ecalll() which is only available if RISCV_SBI is enabled. So add an dependency for RISCV_SBI in ERRATA_ANDES config to avoid any build failures. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308311610.ec6bm2G8-lkp@intel.com/ Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230901110320.312674-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08riscv: Kconfig.errata: Drop dependency for MMU in ERRATA_ANDES_CMO configLad Prabhakar
Now that RISCV_DMA_NONCOHERENT conditionally selects DMA_DIRECT_REMAP ie only if MMU is enabled, we no longer need the MMU dependency in ERRATA_ANDES_CMO config. Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20230901105858.311745-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08riscv: Kconfig: Select DMA_DIRECT_REMAP only if MMU is enabledLad Prabhakar
kernel/dma/mapping.c has its use of pgprot_dmacoherent() inside an #ifdef CONFIG_MMU block. kernel/dma/pool.c has its use of pgprot_dmacoherent() inside an #ifdef CONFIG_DMA_DIRECT_REMAP block. So select DMA_DIRECT_REMAP only if MMU is enabled for RISCV_DMA_NONCOHERENT config. This avoids users to explicitly select MMU. Suggested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Link: https://lore.kernel.org/r/20230901105111.311200-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch series "bpf, riscv: use BPF prog pack allocator in BPF JIT"Palmer Dabbelt
Puranjay Mohan <puranjay12@gmail.com> says: Here is some data to prove the V2 fixes the problem: Without this series: root@rv-selftester:~/src/kselftest/bpf# time ./test_tag test_tag: OK (40945 tests) real 7m47.562s user 0m24.145s sys 6m37.064s With this series applied: root@rv-selftester:~/src/selftest/bpf# time ./test_tag test_tag: OK (40945 tests) real 7m29.472s user 0m25.865s sys 6m18.401s BPF programs currently consume a page each on RISCV. For systems with many BPF programs, this adds significant pressure to instruction TLB. High iTLB pressure usually causes slow down for the whole system. Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. It packs multiple BPF programs into a single huge page. It is currently only enabled for the x86_64 BPF JIT. I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. ====================================================== Performance Analysis of prog pack allocator on RISCV64 ====================================================== Test setup: =========== Host machine: Debian GNU/Linux 11 (bullseye) Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) u-boot-qemu Version: 2023.07+dfsg-1 opensbi Version: 1.3-1 To test the performance of the BPF prog pack allocator on RV, a stresser tool[4] linked below was built. This tool loads 8 BPF programs on the system and triggers 5 of them in an infinite loop by doing system calls. The runner script starts 20 instances of the above which loads 8*20=160 BPF programs on the system, 5*20=100 of which are being constantly triggered. The script is passed a command which would be run in the above environment. The script was run with following perf command: ./run.sh "perf stat -a \ -e iTLB-load-misses \ -e dTLB-load-misses \ -e dTLB-store-misses \ -e instructions \ --timeout 60000" The output of the above command is discussed below before and after enabling the BPF prog pack allocator. The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs was created using Bjorn's riscv-cross-builder[5] docker container linked below. Results ======= Before enabling prog pack allocator: ------------------------------------ Performance counter stats for 'system wide': 4939048 iTLB-load-misses 5468689 dTLB-load-misses 465234 dTLB-store-misses 1441082097998 instructions 60.045791200 seconds time elapsed After enabling prog pack allocator: ----------------------------------- Performance counter stats for 'system wide': 3430035 iTLB-load-misses 5008745 dTLB-load-misses 409944 dTLB-store-misses 1441535637988 instructions 60.046296600 seconds time elapsed Improvements in metrics ======================= It was expected that the iTLB-load-misses would decrease as now a single huge page is used to keep all the BPF programs compared to a single page for each program earlier. -------------------------------------------- The improvement in iTLB-load-misses: -30.5 % -------------------------------------------- I repeated this expriment more than 100 times in different setups and the improvement was always greater than 30%. This patch series is boot tested on the Starfive VisionFive 2 board[6]. The performance analysis was not done on the board because it doesn't expose iTLB-load-misses, etc. The stresser program was run on the board to test the loading and unloading of BPF programs [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ [4] https://github.com/puranjaymohan/BPF-Allocator-Bench [5] https://github.com/bjoto/riscv-cross-builder [6] https://www.starfivetech.com/en/site/boards * b4-shazam-merge: bpf, riscv: use prog pack allocator in the BPF JIT riscv: implement a memset like function for text riscv: extend patch_text_nosync() for multiple pages bpf: make bpf_prog_pack allocator portable Link: https://lore.kernel.org/r/20230831131229.497941-1-puranjay12@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch series "riscv: Introduce KASLR"Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says: The following KASLR implementation allows to randomize the kernel mapping: - virtually: we expect the bootloader to provide a seed in the device-tree - physically: only implemented in the EFI stub, it relies on the firmware to provide a seed using EFI_RNG_PROTOCOL. arm64 has a similar implementation hence the patch 3 factorizes KASLR related functions for riscv to take advantage. The new virtual kernel location is limited by the early page table that only has one PUD and with the PMD alignment constraint, the kernel can only take < 512 positions. * b4-shazam-merge: riscv: libstub: Implement KASLR by using generic functions libstub: Fix compilation warning for rv32 arm64: libstub: Move KASLR handling functions to kaslr.c riscv: Dump out kernel offset information on panic riscv: Introduce virtual kernel mapping KASLR Link: https://lore.kernel.org/r/20230722123850.634544-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch "RISC-V: Add ptrace support for vectors"Palmer Dabbelt
This resurrects the vector ptrace() support that was removed for 6.5 due to some bugs cropping up as part of the GDB review process. * b4-shazam-merge: RISC-V: Add ptrace support for vectors Link: https://lore.kernel.org/r/20230825050248.32681-1-andy.chiu@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch series "Add non-coherent DMA support for AX45MP"Palmer Dabbelt
Prabhakar <prabhakar.csengg@gmail.com> says: From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> non-coherent DMA support for AX45MP ==================================== On the Andes AX45MP core, cache coherency is a specification option so it may not be supported. In this case DMA will fail. To get around with this issue this patch series does the below: 1] Andes alternative ports is implemented as errata which checks if the IOCP is missing and only then applies to CMO errata. One vendor specific SBI EXT (ANDES_SBI_EXT_IOCP_SW_WORKAROUND) is implemented as part of errata. Below are the configs which Andes port provides (and are selected by RZ/Five): - ERRATA_ANDES - ERRATA_ANDES_CMO OpenSBI patch supporting ANDES_SBI_EXT_IOCP_SW_WORKAROUND SBI is now part v1.3 release. 2] Andes AX45MP core has a Programmable Physical Memory Attributes (PMA) block that allows dynamic adjustment of memory attributes in the runtime. It contains a configurable amount of PMA entries implemented as CSR registers to control the attributes of memory locations in interest. OpenSBI configures the PMA regions as required and creates a reserve memory node and propagates it to the higher boot stack. Currently OpenSBI (upstream) configures the required PMA region and passes this a shared DMA pool to Linux. reserved-memory { #address-cells = <2>; #size-cells = <2>; ranges; pma_resv0@58000000 { compatible = "shared-dma-pool"; reg = <0x0 0x58000000 0x0 0x08000000>; no-map; linux,dma-default; }; }; The above shared DMA pool gets appended to Linux DTB so the DMA memory requests go through this region. 3] We provide callbacks to synchronize specific content between memory and cache. 4] RZ/Five SoC selects the below configs - AX45MP_L2_CACHE - DMA_GLOBAL_POOL - ERRATA_ANDES - ERRATA_ANDES_CMO ----------x---------------------x--------------------x---------------x---- * b4-shazam-merge: soc: renesas: Kconfig: Select the required configs for RZ/Five SoC cache: Add L2 cache management for Andes AX45MP RISC-V core dt-bindings: cache: andestech,ax45mp-cache: Add DT binding documentation for L2 cache controller riscv: mm: dma-noncoherent: nonstandard cache operations support riscv: errata: Add Andes alternative ports riscv: asm: vendorid_list: Add Andes Technology to the vendors list Link: https://lore.kernel.org/r/20230818135723.80612-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch series "riscv: dma-mapping: unify support for cache flushes"Palmer Dabbelt
Prabhakar <prabhakar.csengg@gmail.com> says: From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> This patch series is a subset from Arnd's original series [0]. Ive just picked up the bits required for RISC-V unification of cache flushing. Remaining patches from the series [0] will be taken care by Arnd soon. * b4-shazam-merge: riscv: dma-mapping: switch over to generic implementation riscv: dma-mapping: skip invalidation before bidirectional DMA riscv: dma-mapping: only invalidate after DMA, not flush Link: https://lore.kernel.org/r/20230816232336.164413-1-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08Merge patch series "RISC-V: Probe for misaligned access speed"Palmer Dabbelt
Evan Green <evan@rivosinc.com> says: The current setting for the hwprobe bit indicating misaligned access speed is controlled by a vendor-specific feature probe function. This is essentially a per-SoC table we have to maintain on behalf of each vendor going forward. Let's convert that instead to something we detect at runtime. We have two assembly routines at the heart of our probe: one that does a bunch of word-sized accesses (without aligning its input buffer), and the other that does byte accesses. If we can move a larger number of bytes using misaligned word accesses than we can with the same amount of time doing byte accesses, then we can declare misaligned accesses as "fast". The tradeoff of reducing this maintenance burden is boot time. We spend 4-6 jiffies per core doing this measurement (0-2 on jiffie edge alignment, and 4 on measurement). The timing loop was based on raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number of algorithms). By taking only the fastest iteration out of all attempts for use in the comparison, variance between runs is very low. On my THead C906, it looks like this: [ 0.047563] cpu0: Ratio of byte access time to unaligned word access is 4.34, unaligned accesses are fast Several others have chimed in with results on slow machines with the older algorithm, which took all runs into account, including noise like interrupts. Even with this variation, results indicate that in all cases (fast, slow, and emulated) the measured numbers are nowhere near each other (always multiple factors away). * b4-shazam-merge: RISC-V: alternative: Remove feature_probe_func RISC-V: Probe for unaligned access speed Link: https://lore.kernel.org/r/20230818194136.4084400-1-evan@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-08selftests: Keep symlinks, when possibleBjörn Töpel
When kselftest is built/installed with the 'gen_tar' target, rsync is used for the installation step to copy files. Extra care is needed for tests that have symlinks. Commit ae108c48b5d2 ("selftests: net: Fix cross-tree inclusion of scripts") added '-L' (transform symlink into referent file/dir) to rsync, to fix dangling links. However, that broke some tests where the symlink (being a symlink) is part of the test (e.g. exec:execveat). Use rsync's '--copy-unsafe-links' that does right thing. Fixes: ae108c48b5d2 ("selftests: net: Fix cross-tree inclusion of scripts") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-09-08selftests: fix dependency checker scriptRicardo B. Marliere
This patch fixes inconsistencies in the parsing rules of the levels 1 and 2 of the kselftest_deps.sh. It was added the levels 4 and 5 to account for a few edge cases that are present in some tests, also some minor identation styling have been fixed (s/ /\t/g). Signed-off-by: Ricardo B. Marliere <rbmarliere@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-09-08kselftest/runner.sh: Propagate SIGTERM to runner childBjörn Töpel
Timeouts in kselftest are done using the "timeout" command with the "--foreground" option. Without the "foreground" option, it is not possible for a user to cancel the runner using SIGINT, because the signal is not propagated to timeout which is running in a different process group. The "forground" options places the timeout in the same process group as its parent, but only sends the SIGTERM (on timeout) signal to the forked process. Unfortunately, this does not play nice with all kselftests, e.g. "net:fcnal-test.sh", where the child processes will linger because timeout does not send SIGTERM to the group. Some users have noted these hangs [1]. Fix this by nesting the timeout with an additional timeout without the foreground option. Link: https://lore.kernel.org/all/7650b2eb-0aee-a2b0-2e64-c9bc63210f67@alu.unizg.hr/ # [1] Fixes: 651e0d881461 ("kselftest/runner: allow to properly deliver signals to tests") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-09-08Merge branch 'bpf-task_group_seq_get_next-misc-cleanups'Alexei Starovoitov
Oleg Nesterov says: ==================== bpf: task_group_seq_get_next: misc cleanups Yonghong, I am resending 1-5 of 6 as you suggested with your acks included. The next (final) patch will change this code to use __next_thread when https://lore.kernel.org/all/20230824143142.GA31222@redhat.com/ is merged. Oleg. ==================== Link: https://lore.kernel.org/r/20230905154612.GA24872@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08Merge branch 'bpf-enable-irq-after-irq_work_raise-completes'Alexei Starovoitov
Hou Tao says: ==================== bpf: Enable IRQ after irq_work_raise() completes From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the problem that bpf_mem_alloc() may return NULL unexpectedly when multiple bpf_mem_alloc() are invoked concurrently under process context and there is still free memory available. The problem was found when doing stress test for qp-trie but the same problem also exists for bpf_obj_new() as demonstrated in patch #3. As pointed out by Alexei, the patchset can only fix ENOMEM problem for normal process context and can not fix the problem for irq-disabled context or RT-enabled kernel. Patch #1 fixes the race between unit_alloc() and unit_alloc(). Patch #2 fixes the race between unit_alloc() and unit_free(). And patch #3 adds a selftest for the problem. The major change compared with v1 is using local_irq_{save,restore)() pair to disable and enable preemption instead of preempt_{disable,enable}_notrace pair. The main reason is to prevent potential overhead from __preempt_schedule_notrace(). I also run htab_mem benchmark and hash_map_perf on a 8-CPUs KVM VM to compare the performance between local_irq_{save,restore} and preempt_{disable,enable}_notrace(), but the results are similar as shown below: (1) use preempt_{disable,enable}_notrace() [root@hello bpf]# ./map_perf_test 4 8 0:hash_map_perf kmalloc 652179 events per sec 1:hash_map_perf kmalloc 651880 events per sec 2:hash_map_perf kmalloc 651382 events per sec 3:hash_map_perf kmalloc 650791 events per sec 5:hash_map_perf kmalloc 650140 events per sec 6:hash_map_perf kmalloc 652773 events per sec 7:hash_map_perf kmalloc 652751 events per sec 4:hash_map_perf kmalloc 648199 events per sec [root@hello bpf]# ./benchs/run_bench_htab_mem.sh normal bpf ma ============= overwrite per-prod-op: 110.82 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.73MiB batch_add_batch_del per-prod-op: 89.79 ± 0.75k/s, avg mem: 1.68 ± 0.38MiB, peak mem: 2.73MiB add_del_on_diff_cpu per-prod-op: 17.83 ± 0.07k/s, avg mem: 25.68 ± 2.92MiB, peak mem: 35.10MiB (2) use local_irq_{save,restore} [root@hello bpf]# ./map_perf_test 4 8 0:hash_map_perf kmalloc 656299 events per sec 1:hash_map_perf kmalloc 656397 events per sec 2:hash_map_perf kmalloc 656046 events per sec 3:hash_map_perf kmalloc 655723 events per sec 5:hash_map_perf kmalloc 655221 events per sec 4:hash_map_perf kmalloc 654617 events per sec 6:hash_map_perf kmalloc 650269 events per sec 7:hash_map_perf kmalloc 653665 events per sec [root@hello bpf]# ./benchs/run_bench_htab_mem.sh normal bpf ma ============= overwrite per-prod-op: 116.10 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.74MiB batch_add_batch_del per-prod-op: 88.76 ± 0.61k/s, avg mem: 1.94 ± 0.33MiB, peak mem: 2.74MiB add_del_on_diff_cpu per-prod-op: 18.12 ± 0.08k/s, avg mem: 25.10 ± 2.70MiB, peak mem: 34.78MiB As ususal comments are always welcome. Change Log: v2: * Use local_irq_save to disable preemption instead of using preempt_{disable,enable}_notrace pair to prevent potential overhead v1: https://lore.kernel.org/bpf/20230822133807.3198625-1-houtao@huaweicloud.com/ ==================== Link: https://lore.kernel.org/r/20230901111954.1804721-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: simplify the "next tid" logicOleg Nesterov
Kill saved_tid. It looks ugly to update *tid and then restore the previous value if __task_pid_nr_ns() returns 0. Change this code to update *tid and common->pid_visiting once before return. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Test preemption between bpf_obj_new() and bpf_obj_drop()Hou Tao
The test case creates 4 threads and then pins these 4 threads in CPU 0. These 4 threads will run different bpf program through bpf_prog_test_run_opts() and these bpf program will use bpf_obj_new() and bpf_obj_drop() to allocate and free local kptrs concurrently. Under preemptible kernel, bpf_obj_new() and bpf_obj_drop() may preempt each other, bpf_obj_new() may return NULL and the test will fail before applying these fixes as shown below: test_preempted_bpf_ma_op:PASS:open_and_load 0 nsec test_preempted_bpf_ma_op:PASS:attach 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:FAIL:ENOMEM unexpected ENOMEM: got TRUE #168 preempted_bpf_ma_op:FAIL Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: kill next_taskOleg Nesterov
It only adds the unnecessary confusion and compicates the "retry" code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}()Hou Tao
Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free freed objects back to slab and the invocation may also be preempted by unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in the following case: task A task B unit_free() // high_watermark = 48 // free_cnt = 49 after free irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 48 after alloc // does unit_alloc() 32-times ...... // free_cnt = 16 unit_alloc() // free_cnt = 15 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 15-times ...... // free_cnt = 0 unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: fix the skip_if_dup_files checkOleg Nesterov
Unless I am notally confused it is wrong. We are going to return or skip next_task so we need to check next_task-files, not task->files. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of get/put_task_structOleg Nesterov
get_pid_task() makes no sense, the code does put_task_struct() soon after. Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill put_task_struct(), this allows to do get_task_struct() only once before return. While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)" block, this matches the next usage of find_pid_ns() + get_pid_task() in this function. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of next_thread()Oleg Nesterov
1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we can safely iterate the task->thread_group list. Even if this task exits right after get_pid_task() (or goto retry) and pid_alive() returns 0. Kill the unnecessary pid_alive() check. 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)" check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08Merge branch 'bpf-add-support-for-local-percpu-kptr'Alexei Starovoitov
Yonghong Song says: ==================== bpf: Add support for local percpu kptr Patch set [1] implemented cgroup local storage BPF_MAP_TYPE_CGRP_STORAGE similar to sk/task/inode local storage and old BPF_MAP_TYPE_CGROUP_STORAGE map is marked as deprecated since old BPF_MAP_TYPE_CGROUP_STORAGE map can only work with current cgroup. Similarly, the existing BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map is a percpu version of BPF_MAP_TYPE_CGROUP_STORAGE and only works with current cgroup. But there is no replacement which can work with arbitrary cgroup. This patch set solved this problem but adding support for local percpu kptr. The map value can have a percpu kptr field which holds a bpf prog allocated percpu data. The below is an example, struct percpu_val_t { ... fields ... } struct map_value_t { struct percpu_val_t __percpu_kptr *percpu_data_ptr; } In the above, 'map_value_t' is the map value type for a BPF_MAP_TYPE_CGRP_STORAGE map. User can access 'percpu_data_ptr' and then read/write percpu data. This covers BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE and more. So BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map type is marked as deprecated. In additional, local percpu kptr supports the same map type as other kptrs including hash, lru_hash, array, sk/inode/task/cgrp local storage. Currently, percpu data structure does not support non-scalars or special fields (e.g., bpf_spin_lock, bpf_rb_root, etc.). They can be supported in the future if there exist use cases. Please for individual patches for details. [1] https://lore.kernel.org/all/20221026042835.672317-1-yhs@fb.com/ Changelog: v2 -> v3: - fix libbpf_str test failure. v1 -> v2: - does not support special fields in percpu data structure. - rename __percpu attr to __percpu_kptr attr. - rename BPF_KPTR_PERCPU_REF to BPF_KPTR_PERCPU. - better code to handle bpf_{this,per}_cpu_ptr() helpers. - add more negative tests. - fix a bpftool related test failure. ==================== Link: https://lore.kernel.org/r/20230827152729.1995219-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()Hou Tao
When doing stress test for qp-trie, bpf_mem_alloc() returned NULL unexpectedly because all qp-trie operations were initiated from bpf syscalls and there was still available free memory. bpf_obj_new() has the same problem as shown by the following selftest. The failure is due to the preemption. irq_work_raise() will invoke irq_work_claim() first to mark the irq work as pending and then inovke __irq_work_queue_local() to raise an IPI. So when the current task which is invoking irq_work_raise() is preempted by other task, unit_alloc() may return NULL for preemption task as shown below: task A task B unit_alloc() // low_watermark = 32 // free_cnt = 31 after alloc irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 30 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 30-times ...... unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. An alternative fix is using preempt_{disable|enable}_notrace() pair, but it may have extra overhead. Another feasible fix is to only disable preemption or IRQ before invoking irq_work_queue() and enable preemption or IRQ after the invocation completes, but it can't handle the case when c->low_watermark is 1. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecatedYonghong Song
Now 'BPF_MAP_TYPE_CGRP_STORAGE + local percpu ptr' can cover all BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE functionality and more. So mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated. Also make changes in selftests/bpf/test_bpftool_synctypes.py and selftest libbpf_str to fix otherwise test errors. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152837.2003563-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add some negative testsYonghong Song
Add a few negative tests for common mistakes with using percpu kptr including: - store to percpu kptr. - type mistach in bpf_kptr_xchg arguments. - sleepable prog with untrusted arg for bpf_this_cpu_ptr(). - bpf_percpu_obj_new && bpf_obj_drop, and bpf_obj_new && bpf_percpu_obj_drop - struct with ptr for bpf_percpu_obj_new - struct with special field (e.g., bpf_spin_lock) for bpf_percpu_obj_new Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152832.2002421-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add tests for cgrp_local_storage with local percpu kptrYonghong Song
Add a non-sleepable cgrp_local_storage test with percpu kptr. The test does allocation of percpu data, assigning values to percpu data and retrieval of percpu data. The de-allocation of percpu data is done when the map is freed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152827.2001784-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Remove unnecessary direct read of local percpu kptrYonghong Song
For the second argument of bpf_kptr_xchg(), if the reg type contains MEM_ALLOC and MEM_PERCPU, which means a percpu allocation, after bpf_kptr_xchg(), the argument is marked as MEM_RCU and MEM_PERCPU if in rcu critical section. This way, re-reading from the map value is not needed. Remove it from the percpu_alloc_array.c selftest. Without previous kernel change, the test will fail like below: 0: R1=ctx(off=0,imm=0) R10=fp0 ; int BPF_PROG(test_array_map_10, int a) 0: (b4) w1 = 0 ; R1_w=0 ; int i, index = 0; 1: (63) *(u32 *)(r10 -4) = r1 ; R1_w=0 R10=fp0 fp-8=0000???? 2: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 ; 3: (07) r2 += -4 ; R2_w=fp-4 ; e = bpf_map_lookup_elem(&array, &index); 4: (18) r1 = 0xffff88810e771800 ; R1_w=map_ptr(off=0,ks=4,vs=16,imm=0) 6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) 7: (bf) r6 = r0 ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) R6_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) ; if (!e) 8: (15) if r6 == 0x0 goto pc+81 ; R6_w=map_value(off=0,ks=4,vs=16,imm=0) ; bpf_rcu_read_lock(); 9: (85) call bpf_rcu_read_lock#87892 ; ; p = e->pc; 10: (bf) r7 = r6 ; R6=map_value(off=0,ks=4,vs=16,imm=0) R7_w=map_value(off=0,ks=4,vs=16,imm=0) 11: (07) r7 += 8 ; R7_w=map_value(off=8,ks=4,vs=16,imm=0) 12: (79) r6 = *(u64 *)(r6 +8) ; R6_w=percpu_rcu_ptr_or_null_val_t(id=2,off=0,imm=0) ; if (!p) { 13: (55) if r6 != 0x0 goto pc+13 ; R6_w=0 ; p = bpf_percpu_obj_new(struct val_t); 14: (18) r1 = 0x12 ; R1_w=18 16: (b7) r2 = 0 ; R2_w=0 17: (85) call bpf_percpu_obj_new_impl#87883 ; R0_w=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4 18: (bf) r6 = r0 ; R0=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4 ; if (!p) 19: (15) if r6 == 0x0 goto pc+69 ; R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4 ; p1 = bpf_kptr_xchg(&e->pc, p); 20: (bf) r1 = r7 ; R1_w=map_value(off=8,ks=4,vs=16,imm=0) R7=map_value(off=8,ks=4,vs=16,imm=0) refs=4 21: (bf) r2 = r6 ; R2_w=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4 22: (85) call bpf_kptr_xchg#194 ; R0_w=percpu_ptr_or_null_val_t(id=6,ref_obj_id=6,off=0,imm=0) refs=6 ; if (p1) { 23: (15) if r0 == 0x0 goto pc+3 ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6 ; bpf_percpu_obj_drop(p1); 24: (bf) r1 = r0 ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) R1_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6 25: (b7) r2 = 0 ; R2_w=0 refs=6 26: (85) call bpf_percpu_obj_drop_impl#87882 ; ; v = bpf_this_cpu_ptr(p); 27: (bf) r1 = r6 ; R1_w=scalar(id=7) R6=scalar(id=7) 28: (85) call bpf_this_cpu_ptr#154 R1 type=scalar expected=percpu_ptr_, percpu_rcu_ptr_, percpu_trusted_ptr_ The R1 which gets its value from R6 is a scalar. But before insn 22, R6 is R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) Its type is changed to a scalar at insn 22 without previous patch. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152821.2001129-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Mark OBJ_RELEASE argument as MEM_RCU when possibleYonghong Song
In previous selftests/bpf patch, we have p = bpf_percpu_obj_new(struct val_t); if (!p) goto out; p1 = bpf_kptr_xchg(&e->pc, p); if (p1) { /* race condition */ bpf_percpu_obj_drop(p1); } p = e->pc; if (!p) goto out; After bpf_kptr_xchg(), we need to re-read e->pc into 'p'. This is due to that the second argument of bpf_kptr_xchg() is marked OBJ_RELEASE and it will be marked as invalid after the call. So after bpf_kptr_xchg(), 'p' is an unknown scalar, and the bpf program needs to reread from the map value. This patch checks if the 'p' has type MEM_ALLOC and MEM_PERCPU, and if 'p' is RCU protected. If this is the case, 'p' can be marked as MEM_RCU. MEM_ALLOC needs to be removed since 'p' is not an owning reference any more. Such a change makes re-read from the map value unnecessary. Note that re-reading 'e->pc' after bpf_kptr_xchg() might get a different value from 'p' if immediately before 'p = e->pc', another cpu may do another bpf_kptr_xchg() and swap in another value into 'e->pc'. If this is the case, then 'p = e->pc' may get either 'p' or another value, and race condition already exists. So removing direct re-reading seems fine too. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152816.2000760-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add tests for array map with local percpu kptrYonghong Song
Add non-sleepable and sleepable tests with percpu kptr. For non-sleepable test, four programs are executed in the order of: 1. allocate percpu data. 2. assign values to percpu data. 3. retrieve percpu data. 4. de-allocate percpu data. The sleepable prog tried to exercise all above 4 steps in a single prog. Also for sleepable prog, rcu_read_lock is needed to protect direct percpu ptr access (from map value) and following bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152811.2000125-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add bpf_percpu_obj_{new,drop}() macro in bpf_experimental.hYonghong Song
The new macro bpf_percpu_obj_{new/drop}() is very similar to bpf_obj_{new,drop}() as they both take a type as the argument. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152805.1999417-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08libbpf: Add __percpu_kptr macro definitionYonghong Song
Add __percpu_kptr macro definition in bpf_helpers.h. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152800.1998492-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08libbpf: Add basic BTF sanity validationAndrii Nakryiko
Implement a simple and straightforward BTF sanity check when parsing BTF data. Right now it's very basic and just validates that all the string offsets and type IDs are within valid range. For FUNC we also check that it points to FUNC_PROTO kinds. Even with such simple checks it fixes a bunch of crashes found by OSS fuzzer ([0]-[5]) and will allow fuzzer to make further progress. Some other invariants will be checked in follow up patches (like ensuring there is no infinite type loops), but this seems like a good start already. Adding FUNC -> FUNC_PROTO check revealed that one of selftests has a problem with FUNC pointing to VAR instead, so fix it up in the same commit. [0] https://github.com/libbpf/libbpf/issues/482 [1] https://github.com/libbpf/libbpf/issues/483 [2] https://github.com/libbpf/libbpf/issues/485 [3] https://github.com/libbpf/libbpf/issues/613 [4] https://github.com/libbpf/libbpf/issues/618 [5] https://github.com/libbpf/libbpf/issues/619 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Song Liu <song@kernel.org> Closes: https://github.com/libbpf/libbpf/issues/617 Link: https://lore.kernel.org/bpf/20230825202152.1813394-1-andrii@kernel.org
2023-09-08selftests/bpf: Update error message in negative linked_list testYonghong Song
Some error messages are changed due to the addition of percpu kptr support. Fix linked_list test with changed error messages. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152754.1997769-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu objYonghong Song
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed for allocated percpu objects. For an allocated percpu obj, the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'. The return type for these two re-purposed helpera is 'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'. The MEM_ALLOC allows that the per-cpu data can be read and written. Since the memory allocator bpf_mem_alloc() returns a ptr to a percpu ptr for percpu data, the first argument of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched with a dereference before passing to the helper func. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add alloc/xchg/direct_access support for local percpu kptrYonghong Song
Add two new kfunc's, bpf_percpu_obj_new_impl() and bpf_percpu_obj_drop_impl(), to allocate a percpu obj. Two functions are very similar to bpf_obj_new_impl() and bpf_obj_drop_impl(). The major difference is related to percpu handling. bpf_rcu_read_lock() struct val_t __percpu_kptr *v = map_val->percpu_data; ... bpf_rcu_read_unlock() For a percpu data map_val like above 'v', the reg->type is set as PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU if inside rcu critical section. MEM_RCU marking here is similar to NON_OWN_REF as 'v' is not a owning reference. But NON_OWN_REF is trusted and typically inside the spinlock while MEM_RCU is under rcu read lock. RCU is preferred here since percpu data structures mean potential concurrent access into its contents. Also, bpf_percpu_obj_new_impl() is restricted such that no pointers or special fields are allowed. Therefore, the bpf_list_head and bpf_rb_root will not be supported in this patch set to avoid potential memory leak issue due to racing between bpf_obj_free_fields() and another bpf_kptr_xchg() moving an allocated object to bpf_list_head and bpf_rb_root. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add BPF_KPTR_PERCPU as a field typeYonghong Song
BPF_KPTR_PERCPU represents a percpu field type like below struct val_t { ... fields ... }; struct t { ... struct val_t __percpu_kptr *percpu_data_ptr; ... }; where #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr"))) While BPF_KPTR_REF points to a trusted kernel object or a trusted local object, BPF_KPTR_PERCPU points to a trusted local percpu object. This patch added basic support for BPF_KPTR_PERCPU related to percpu_kptr field parsing, recording and free operations. BPF_KPTR_PERCPU also supports the same map types as BPF_KPTR_REF does. Note that unlike a local kptr, it is possible that a BPF_KTPR_PERCPU struct may not contain any special fields like other kptr, bpf_spin_lock, bpf_list_head, etc. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add support for non-fix-size percpu mem allocationYonghong Song
This is needed for later percpu mem allocation when the allocation is done by bpf program. For such cases, a global bpf_global_percpu_ma is added where a flexible allocation size is needed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.orgBhaskar Chowdhury
The wiki has been archived and is not updated anymore. Remove or replace the links in files that contain it (MAINTAINERS, Kconfig, docs). Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: assert delayed node locked when removing delayed itemFilipe Manana
When removing a delayed item, or releasing which will remove it as well, we will modify one of the delayed node's rbtrees and item counter if the delayed item is in one of the rbtrees. This require having the delayed node's mutex locked, otherwise we will race with other tasks modifying the rbtrees and the counter. This is motivated by a previous version of another patch actually calling btrfs_release_delayed_item() after unlocking the delayed node's mutex and against a delayed item that is in a rbtree. So assert at __btrfs_remove_delayed_item() that the delayed node's mutex is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: remove BUG() after failure to insert delayed dir index itemFilipe Manana
Instead of calling BUG() when we fail to insert a delayed dir index item into the delayed node's tree, we can just release all the resources we have allocated/acquired before and return the error to the caller. This is fine because all existing call chains undo anything they have done before calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending snapshots in the transaction commit path). So remove the BUG() call and do proper error handling. This relates to a syzbot report linked below, but does not fix it because it only prevents hitting a BUG(), it does not fix the issue where somehow we attempt to use twice the same index number for different index items. Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: improve error message after failure to add delayed dir index itemFilipe Manana
If we fail to add a delayed dir index item because there's already another item with the same index number, we print an error message (and then BUG). However that message isn't very helpful to debug anything because we don't know what's the index number and what are the values of index counters in the inode and its delayed inode (index_cnt fields of struct btrfs_inode and struct btrfs_delayed_node). So update the error message to include the index number and counters. We actually had a recent case where this issue was hit by a syzbot report (see the link below). Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>