summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-16riscv: stack: Fixup independent irq stack for CONFIG_FRAME_POINTER=nGuo Ren
The independent irq stack uses s0 to save & restore sp, but s0 would be corrupted when CONFIG_FRAME_POINTER=n. So add s0 in the clobber list to fix the problem. Fixes: 163e76cc6ef4 ("riscv: stack: Support HAVE_IRQ_EXIT_ON_IRQ_STACK") Cc: stable@vger.kernel.org Reported-by: Zhangjin Wu <falcon@tinylab.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Drew Fustini <dfustini@baylibre.com> Link: https://lore.kernel.org/r/20230716001506.3506041-2-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-16kselftest/arm64: add jscvt feature to hwcap testZeng Heng
Add the jscvt feature check in the set of hwcap tests. Due to the requirement of jscvt feature, a compiler configuration of v8.3 or above is needed to support assembly. Therefore, hand encode is used here instead. Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230815040915.3966955-5-zengheng4@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16kselftest/arm64: add pmull feature to hwcap testZeng Heng
Add the pmull feature check in the set of hwcap tests. Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230815040915.3966955-4-zengheng4@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16kselftest/arm64: add AES feature check to hwcap testZeng Heng
Add the AES feature check in the set of hwcap tests. Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230815040915.3966955-3-zengheng4@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16kselftest/arm64: add SHA1 and related features to hwcap testZeng Heng
Add the SHA1 and related features check in the set of hwcap tests. Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230815040915.3966955-2-zengheng4@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16riscv: correct riscv_insn_is_c_jr() and riscv_insn_is_c_jalr()Nam Cao
The instructions c.jr and c.jalr must have rs1 != 0, but riscv_insn_is_c_jr() and riscv_insn_is_c_jalr() do not check for this. So, riscv_insn_is_c_jr() can match a reserved encoding, while riscv_insn_is_c_jalr() can match the c.ebreak instruction. Rewrite them with check for rs1 != 0. Signed-off-by: Nam Cao <namcaov@gmail.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Fixes: ec5f90877516 ("RISC-V: Move riscv_insn_is_* macros into a common header") Link: https://lore.kernel.org/r/20230731183925.152145-1-namcaov@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-16riscv: entry: set a0 = -ENOSYS only when syscall != -1Celeste Liu
When we test seccomp with 6.4 kernel, we found errno has wrong value. If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv: entry: Save a0 prior syscall_enter_from_user_mode()"). After analysing code, we think that regs->a0 = -ENOSYS should only be executed when syscall != -1. In __seccomp_filter, when seccomp rejected this syscall with specified errno, they will set a0 to return number as syscall ABI, and then return -1. This return number is finally pass as return number of syscall_enter_from_user_mode, and then is compared with NR_syscalls after converted to ulong (so it will be ULONG_MAX). The condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS is always executed. It covered a0 set by seccomp, so we always get ENOSYS when match seccomp RET_ERRNO rule. Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry") Reported-by: Felix Yan <felixonmars@archlinux.org> Co-developed-by: Ruizhe Pan <c141028@gmail.com> Signed-off-by: Ruizhe Pan <c141028@gmail.com> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com> Tested-by: Felix Yan <felixonmars@archlinux.org> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20230801141607.435192-1-CoelacanthusHex@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-16regulator: raa215300: Add const definitionBiju Das
Add const definition to the initialized local variable name to avoid overriding. Also the second parameter in strscpy is const char * instead of char *. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230816135550.146657-3-biju.das.jz@bp.renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-16regulator: raa215300: Fix resource leak in case of errorBiju Das
The clk_register_clkdev() allocates memory by calling vclkdev_alloc() and this memory is not freed in the error path. Similarly, resources allocated by clk_register_fixed_rate() are not freed in the error path. Fix these issues by using devm_clk_hw_register_fixed_rate() and devm_clk_hw_register_clkdev(). After this, the static variable clk is not needed. Replace it with  local variable hw in probe() and drop calling clk_unregister_fixed_rate() from raa215300_rtc_unregister_device(). Fixes: 7bce16630837 ("regulator: Add Renesas PMIC RAA215300 driver") Cc: stable@kernel.org Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230816135550.146657-2-biju.das.jz@bp.renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-16regulator: Get Synquacer testing workingMark Brown
Merge up v6.5-rc6 which has a fix that gets Synquacer netbooting so we can include it in our testing.
2023-08-16arm64: sysreg: Generate C compiler warnings on {read,write}_sysreg_s argumentsJames Clark
Evaluate the register before the asm section so that the C compiler generates warnings when there is an issue with the register argument. This will prevent possible future issues such as the one seen here [1] where a missing bracket caused the shift and addition operators to be evaluated in the wrong order, but no warning was emitted. The GNU assembler has no warning for when expressions evaluate differently to C due to different operator precedence, but the C compiler has some warnings that may suggest something is wrong. For example in this case the following warning would have been emitted: error: operator '>>' has lower precedence than '+'; '+' will be evaluated first [-Werror,-Wshift-op-parentheses] There are currently no existing warnings that need to be fixed. [1]: https://lore.kernel.org/linux-perf-users/20230728162011.GA22050@willie-the-truck/ Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20230815140639.614769-1-james.clark@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16ASoC: SOF: ipc4-pcm: fix possible null pointer deferenceChao Song
The call to snd_sof_find_spcm_dai() could return NULL, add nullable check for the return value to avoid null pointer defenrece. Fixes: 7cb19007baba ("ASoC: SOF: ipc4-pcm: add hw_params") Signed-off-by: Chao Song <chao.song@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Link: https://lore.kernel.org/r/20230816133311.7523-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-16kselftest/arm64: build BTI tests in output directoryAndre Przywara
The arm64 BTI selftests are currently built in the source directory, then the generated binaries are copied to the output directory. This leaves the object files around in a potentially otherwise pristine source tree, tainting it for out-of-tree kernel builds. Prepend $(OUTPUT) to every reference to an object file in the Makefile, and remove the extra handling and copying. This puts all generated files under the output directory. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230815145931.2522557-1-andre.przywara@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16perf/imx_ddr: don't enable counter0 if none of 4 counters are usedXu Yang
In current driver, counter0 will be enabled after ddr_perf_pmu_enable() is called even though none of the 4 counters are used. This will cause counter0 continue to count until ddr_perf_pmu_disabled() is called. If pmu is not disabled all the time, the pmu interrupt will be asserted from time to time due to counter0 will overflow and irq handler will clear it. It's not an expected behavior. This patch will not enable counter0 if none of 4 counters are used. Fixes: 9a66d36cc7ac ("drivers/perf: imx_ddr: Add DDR performance counter support to perf") Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20230811015438.1999307-2-xu.yang_2@nxp.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16perf/imx_ddr: speed up overflow frequency of cycleXu Yang
For i.MX8MP, we cannot ensure that cycle counter overflow occurs at least 4 times as often as other events. Due to byte counters will count for any event configured, it will overflow more often. And if byte counters overflow that related counters would stop since they share the COUNTER_CNTL. We can speed up cycle counter overflow frequency by setting counter parameter (CP) field of cycle counter. In this way, we can avoid stop counting byte counters when interrupt didn't come and the byte counters can be fetched or updated from each cycle counter overflow interrupt. Because we initialize CP filed to shorten counter0 overflow time, the cycle counter will start couting from a fixed/base value each time. We need to remove the base from the result too. Therefore, we could get precise result from cycle counter. Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20230811015438.1999307-1-xu.yang_2@nxp.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16drivers/perf: hisi: Schedule perf session according to localityYicong Yang
The PCIe PMUs locate on different NUMA node but currently we don't consider it and likely stack all the sessions on the same CPU: [root@localhost tmp]# cat /sys/devices/hisi_pcie*/cpumask 0 0 0 0 0 0 This can be optimize a bit to use a local CPU for the PMU. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20230815131010.2147-1-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16kselftest/arm64: fix a memleak in zt_regs_run()Ding Xiang
If memcmp() does not return 0, "zeros" need to be freed to prevent memleak Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Link: https://lore.kernel.org/r/20230815074915.245528-1-dingxiang@cmss.chinamobile.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16perf/arm-dmc620: Fix dmc620_pmu_irqs_lock/cpu_hotplug_lock circular lock ↵Waiman Long
dependency The following circular locking dependency was reported when running cpus online/offline test on an arm64 system. [ 84.195923] Chain exists of: dmc620_pmu_irqs_lock --> cpu_hotplug_lock --> cpuhp_state-down [ 84.207305] Possible unsafe locking scenario: [ 84.213212] CPU0 CPU1 [ 84.217729] ---- ---- [ 84.222247] lock(cpuhp_state-down); [ 84.225899] lock(cpu_hotplug_lock); [ 84.232068] lock(cpuhp_state-down); [ 84.238237] lock(dmc620_pmu_irqs_lock); [ 84.242236] *** DEADLOCK *** The following locking order happens when dmc620_pmu_get_irq() calls cpuhp_state_add_instance_nocalls(). lock(dmc620_pmu_irqs_lock) --> lock(cpu_hotplug_lock) On the other hand, the calling sequence cpuhp_thread_fun() => cpuhp_invoke_callback() => dmc620_pmu_cpu_teardown() leads to the locking sequence lock(cpuhp_state-down) => lock(dmc620_pmu_irqs_lock) Here dmc620_pmu_irqs_lock protects both the dmc620_pmu_irqs and the pmus_node lists in various dmc620_pmu instances. dmc620_pmu_get_irq() requires protected access to dmc620_pmu_irqs whereas dmc620_pmu_cpu_teardown() needs protection to the pmus_node lists. Break this circular locking dependency by using two separate locks to protect dmc620_pmu_irqs list and the pmus_node lists respectively. Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20230812235549.494174-1-longman@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-16s390/ipl: add common ipl parameter attribute groupSven Schnelle
All ipl types have 'secure','has_secure' and type parameters. Move these to a common ipl parameter group so that they don't need to be present in each ipl parameter group. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/ipl: add missing secure/has_secure file to ipl type 'unknown'Sven Schnelle
OS installers are relying on /sys/firmware/ipl/has_secure to be present on machines supporting secure boot. This file is present for all IPL types, but not the unknown type, which prevents a secure installation when an LPAR is booted in HMC via FTP(s), because this is an unknown IPL type in linux. While at it, also add the secure file. Fixes: c9896acc7851 ("s390/ipl: Provide has_secure sysfs attribute") Cc: stable@vger.kernel.org Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/dcssblk: fix kernel crash with list_add corruptionGerald Schaefer
Commit fb08a1908cb1 ("dax: simplify the dax_device <-> gendisk association") introduced new logic for gendisk association, requiring drivers to explicitly call dax_add_host() and dax_remove_host(). For dcssblk driver, some dax_remove_host() calls were missing, e.g. in device remove path. The commit also broke error handling for out_dax case in device add path, resulting in an extra put_device() w/o the previous get_device() in that case. This lead to stale xarray entries after device add / remove cycles. In the case when a previously used struct gendisk pointer (xarray index) would be used again, because blk_alloc_disk() happened to return such a pointer, the xa_insert() in dax_add_host() would fail and go to out_dax, doing the extra put_device() in the error path. In combination with an already flawed error handling in dcssblk (device_register() cleanup), which needs to be addressed in a separate patch, this resulted in a missing device_del() / klist_del(), and eventually in the kernel crash with list_add corruption on a subsequent device_add() / klist_add(). Fix this by adding the missing dax_remove_host() calls, and also move the put_device() in the error path to restore the previous logic. Fixes: fb08a1908cb1 ("dax: simplify the dax_device <-> gendisk association") Cc: <stable@vger.kernel.org> # 5.17+ Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/mm: make virt_to_pfn() a static inlineLinus Walleij
Making virt_to_pfn() a static inline taking a strongly typed (const void *) makes the contract of a passing a pointer of that type to the function explicit and exposes any misuse of the macro virt_to_pfn() acting polymorphic and accepting many types such as (void *), (unitptr_t) or (unsigned long) as arguments without warnings. For symmetry do the same with pfn_to_virt() reflecting the current layout in asm-generic/page.h. Doing this reveals a number of offenders in the arch code and the S390-specific drivers, so just bite the bullet and fix up all of those as well. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20230812-virt-to-phys-s390-v2-1-6c40f31fe36f@linaro.org Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/boot: fix multi-line comments styleAlexander Gordeev
Make multi-line comment style consistent across the source. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/boot: account Real Memory Copy and Lowcore areasAlexander Gordeev
Real Memory Copy and (absolute) Lowcore areas are not accounted when virtual memory layout is set up. Fixes: 4df29d2b9024 ("s390/smp: rework absolute lowcore access") Fixes: 2f0e8aae26a2 ("s390/mm: rework memcpy_real() to avoid DAT-off mode") Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/mm: define Real Memory Copy size and mask macrosAlexander Gordeev
Make Real Memory Copy area size and mask explicit. This does not bring any functional change and only needed for clarity. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16s390/boot: cleanup number of page table levels setupAlexander Gordeev
The separate vmalloc area size check against _REGION2_SIZE is needed in case user provided insanely large value using vmalloc= kernel command line parameter. That could lead to overflow and selecting 3 page table levels instead of 4. Use size_add() for the overflow check and get rid of the extra vmalloc area check. With the current values of CONFIG_MAX_PHYSMEM_BITS and PAGES_PER_SECTION the sum of maximal possible size of identity mapping and vmemmap area (derived from these macros) plus modules area size MODULES_LEN can not overflow. Thus, that sum is used as first addend while vmalloc area size is second addend for size_add(). Suggested-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2023-08-16ALSA: hda/realtek - Remodified 3k pull low procedureKailang Yang
Set spec->en_3kpull_low default to true. Then fillback ALC236 and ALC257 to false. Additional note: this addresses a regression caused by the previous fix 69ea4c9d02b7 ("ALSA: hda/realtek - remove 3k pull low procedure"). The previous workaround was applied too widely without necessity, which resulted in the pop noise at PM again. This patch corrects the condition and restores the old behavior for the devices that don't suffer from the original problem. Fixes: 69ea4c9d02b7 ("ALSA: hda/realtek - remove 3k pull low procedure") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217732 Link: https://lore.kernel.org/r/01e212a538fc407ca6edd10b81ff7b05@realtek.com Signed-off-by: Kailang Yang <kailang@realtek.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-08-16spi: rpc-if: switch to use devm_spi_alloc_host()Yang Yingliang
Switch to use modern name function devm_spi_alloc_host(). No functional changed. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20230816094013.1275068-15-yangyingliang@huawei.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-16spi: dw-mmio: keep old name same as documentationYang Yingliang
The documentation has not use the new name(host/target), so keep the comment words same as documentation used. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20230816093938.1274806-1-yangyingliang@huawei.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-16gpiolib: fix reference leaks when removing GPIO chips still in useBartosz Golaszewski
After we remove a GPIO chip that still has some requested descriptors, gpiod_free_commit() will fail and we will never put the references to the GPIO device and the owning module in gpiod_free(). Rework this function to: - not warn on desc == NULL as this is a use-case on which most free functions silently return - put the references to desc->gdev and desc->gdev->owner unconditionally so that the release callback actually gets called when the remaining references are dropped by external GPIO users Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
2023-08-16gpiolib: sysfs: Do unexport GPIO when user asks for itAndy Shevchenko
It seems that sysfs interface implicitly relied on the gpiod_free() to unexport the line. This is logically incorrect as core gpiolib should not deal with sysfs so instead of restoring it, let's call gpiod_unexport() from sysfs code. Fixes: b0ce9ce408b6 ("gpiolib: Do not unexport GPIO on freeing") Reported-by: Marek Behún <kabel@kernel.org> Closes: https://lore.kernel.org/r/20230808102828.4a9eac09@dellmb Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Marek Behún <kabel@kernel.org> [Bartosz: tweaked the commit message] Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2023-08-16Merge branch 'ipv6-expired-routes'David S. Miller
Kui-Feng Lee says: ==================== Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Background ========== The size of a Linux IPv6 routing table can become a big problem if not managed appropriately. Now, Linux has a garbage collector to remove expired routes periodically. However, this may lead to a situation in which the routing path is blocked for a long period due to an excessive number of routes. For example, years ago, there is a commit c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages"). The root cause is that malicious ICMPv6 packets were sent back for every small packet sent to them. These packets add routes with an expiration time that prompts the GC to periodically check all routes in the tables, including permanent ones. Why Route Expires ================= Users can add IPv6 routes with an expiration time manually. However, the Neighbor Discovery protocol may also generate routes that can expire. For example, Router Advertisement (RA) messages may create a default route with an expiration time. [RFC 4861] For IPv4, it is not possible to set an expiration time for a route, and there is no RA, so there is no need to worry about such issues. Create Routes with Expires ========================== You can create routes with expires with the command. For example, ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \ dev enp0s3 expires 30 The route that has been generated will be deleted automatically in 30 seconds. GC of FIB6 ========== The function called fib6_run_gc() is responsible for performing garbage collection (GC) for the Linux IPv6 stack. It checks for the expiration of every route by traversing the trees of routing tables. The time taken to traverse a routing table increases with its size. Holding the routing table lock during traversal is particularly undesirable. Therefore, it is preferable to keep the lock for the shortest possible duration. Solution ======== The cause of the issue is keeping the routing table locked during the traversal of large trees. To solve this problem, we can create a separate list of routes that have expiration. This will prevent GC from checking permanent routes. Result ====== We conducted a test to measure the execution times of fib6_gc_timer_cb() and observed that it enhances the GC of FIB6. During the test, we added permanent routes with the following numbers: 1000, 3000, 6000, and 9000. Additionally, we added a route with an expiration time. Here are the average execution times for the kernel without the patch. - 120020 ns with 1000 permanent routes - 308920 ns with 3000 ... - 581470 ns with 6000 ... - 855310 ns with 9000 ... The kernel with the patch consistently takes around 14000 ns to execute, regardless of the number of permanent routes that are installed. Major changes from v7: - Fix warings raised by the patchwork. Major changes from v6: - Remove unnecessary check of tb6 in fib6_clean_expires_locked(). - Use ib6_clean_expires_locked() instead in fib6_purge_rt(). Major changes from v5: - Change the order of adding new routes to the GC list and starting GC timer. - Remove time measurements from the test case. - Stop forcing GC flush. Major changes from v4: - Detect existence of 'strace' in the test case. Major changes from v3: - Fix the type of arg according to feedback. - Add 1k temporary routes and 5K permanent routes in the test case. Measure time spending on GC with strace. Major changes from v2: - Remove unnecessary and incorrect sysctl restoring in the test case. Major changes from v1: - Moved gc_link to avoid creating a hole in fib6_info. - Moved fib6_set_expires*() and fib6_clean_expires*() to the header file and inlined. And removed duplicated lines. - Added a test case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16selftests: fib_tests: Add a test case for IPv6 garbage collectionKui-Feng Lee
Add 1000 IPv6 routes with expiration time (w/ and w/o additional 5000 permanet routes in the background.) Wait for a few seconds to make sure they are removed correctly. The expected output of the test looks like the following example. > Fib6 garbage collection test > TEST: ipv6 route garbage collection [ OK ] Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net/ipv6: Remove expired routes with a separated list of routes.Kui-Feng Lee
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16e1000e: Use PME poll to circumvent unreliable ACPI wakeKai-Heng Feng
On some I219 devices, ethernet cable plugging detection only works once from PCI D3 state. Subsequent cable plugging does set PME bit correctly, but device still doesn't get woken up. Since I219 connects to the root complex directly, it relies on platform firmware (ACPI) to wake it up. In this case, the GPE from _PRW only works for first cable plugging but fails to notify the driver for subsequent plugging events. The issue was originally found on CNP, but the same issue can be found on ADL too. So workaround the issue by continuing use PME poll after first ACPI wake. As PME poll is always used, the runtime suspend restriction for CNP can also be removed. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net-memcg: Fix scope of sockmem pressure indicatorsAbel Wu
Now there are two indicators of socket memory pressure sit inside struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating memory reclaim pressure in memcg->memory and ->tcpmem respectively. When in legacy mode (cgroupv1), the socket memory is charged into ->tcpmem which is independent of ->memory, so socket_pressure has nothing to do with socket's pressure at all. Things could be worse by taking socket_pressure into consideration in legacy mode, as a pressure in ->memory can lead to premature reclamation/throttling in socket. While for the default mode (cgroupv2), the socket memory is charged into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used. So {socket,tcpmem}_pressure are only used in default/legacy mode respectively for indicating socket memory pressure. This patch fixes the pieces of code that make mixed use of both. Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16nfp: update maintainerLouis Peens
Take over maintainership of the nfp driver from Simon as he is moving away from Corigine. Signed-off-by: Louis Peens <louis.peens@corigine.com> Acked-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel modeGrygorii Strashko
This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows not only setting up pri:tc mapping, but also configuring TX shapers on external port FIFOs. The K3 CPSW MQPRIO Qdisc offload is expected to work with VLAN/priority tagged packets. Non-tagged packets have to be mapped only to TC0. - TX traffic classes must be rated starting from TC that has highest priority and with no gaps - Traffic classes are used starting from 0, that has highest priority - min_rate defines Committed Information Rate (guaranteed) - max_rate defines Excess Information Rate (non guaranteed) and offloaded as (max_rate[i] - tcX_min_rate[i]) - VLAN/priority tagged packets mapped to TC0 will exit switch with VLAN tag priority 0 The configuration example: ethtool -L eth1 tx 5 ethtool --set-priv-flags eth1 p0-rx-ptype-rrobin off tc qdisc add dev eth1 parent root handle 100: mqprio num_tc 3 \ map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 1@1 1@2 hw 1 mode channel \ shaper bw_rlimit min_rate 0 100mbit 200mbit max_rate 0 101mbit 202mbit tc qdisc replace dev eth2 handle 100: parent root mqprio num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 hw 1 ip link add link eth1 name eth1.100 type vlan id 100 ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 In the above example two ports share the same TX CPPI queue 0 for low priority traffic. 3 traffic classes are defined for eth1 and mapped to: TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge tag 'nf-23-08-16' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florisn Westphal says: ==================== These are netfilter fixes for the *net* tree. First patch resolves a false-positive lockdep splat: rcu_dereference is used outside of rcu read lock. Let lockdep validate that the transaction mutex is locked. Second patch fixes a kdoc warning added in previous PR. Third patch fixes a memory leak: The catchall element isn't disabled correctly, this allows userspace to deactivate the element again. This results in refcount underflow which in turn prevents memory release. This was always broken since the feature was added in 5.13. Patch 4 fixes an incorrect change in the previous pull request: Adding a duplicate key to a set should work if the duplicate key has expired, restore this behaviour. All from myself. Patch #5 resolves an old historic artifact in sctp conntrack: a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long. Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the sysctl value, from Sishuai Gong. This is a day-0 bug that predates git history. Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups for the previous GC rework in nf_tables: The netlink notifier and the netns exit path must both increment the gc worker seqcount, else worker may encounter stale (free'd) pointers. ================ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'inet-data-races'David S. Miller
Eric Dumazet says: ==================== inet: socket lock and data-races avoidance In this series, I converted 20 bits in "struct inet_sock" and made them truly atomic. This allows to implement many IP_ socket options in a lockless fashion (no need to acquire socket lock), and fixes data-races that were showing up in various KCSAN reports. I also took care of IP_TTL/IP_MINTTL, but left few other options for another series. v4: Rebased after recent mptcp changes. Added Reviewed-by: tags from Simon (thanks !) v3: fixed patch 7, feedback from build bot about ipvs set_mcast_loop() v2: addressed a feedback from a build bot in patch 9 by removing unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() Added Acked-by: tags from Soheil (thanks !) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_MINTTLEric Dumazet
inet->min_ttl is already read with READ_ONCE(). Implementing IP_MINTTL socket option set/read without holding the socket lock is easy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_TTLEric Dumazet
ip_select_ttl() is racy, because it reads inet->uc_ttl without proper locking. Add READ_ONCE()/WRITE_ONCE() annotations while allowing IP_TTL socket option to be set/read without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->defer_connect to inet->inet_flagsEric Dumazet
Make room in struct inet_sock by removing this bit field, using one available bit in inet_flags instead. Also move local_port_range to fill the resulting hole, saving 8 bytes on 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->bind_address_no_port to inet->inet_flagsEric Dumazet
IP_BIND_ADDRESS_NO_PORT socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->nodefrag to inet->inet_flagsEric Dumazet
IP_NODEFRAG socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->is_icsk to inet->inet_flagsEric Dumazet
We move single bit fields to inet->inet_flags to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->transparent to inet->inet_flagsEric Dumazet
IP_TRANSPARENT socket option can now be set/read without locking the socket. v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_all to inet->inet_fragsEric Dumazet
IP_MULTICAST_ALL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_loop to inet->inet_fragsEric Dumazet
IP_MULTICAST_LOOP socket option can now be set/read without locking the socket. v3: fix build bot error reported in ipvs set_mcast_loop() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->hdrincl to inet->inet_flagsEric Dumazet
IP_HDRINCL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>