summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-06-29LoongArch: Add support to clone a time namespaceTiezhu Yang
We can see that "Time namespaces are not supported" on LoongArch: (1) clone3 test # cd tools/testing/selftests/clone3 && make && ./clone3 ... # Time namespaces are not supported ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME # Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0 (2) timens test # cd tools/testing/selftests/timens && make && ./timens ... 1..0 # SKIP Time namespaces are not supported On LoongArch the current kernel does not support CONFIG_TIME_NS which depends on GENERIC_VDSO_TIME_NS, select GENERIC_VDSO_TIME_NS to enable CONFIG_TIME_NS to build kernel/time/namespace.c. Additionally, it needs to define some arch-dependent functions for the timens, such as __arch_get_timens_vdso_data(), arch_get_vdso_data() and vdso_join_timens(). At the same time, modify the layout of vvar to use one page size for generic vdso data, expand another page size for timens vdso data and assign LOONGARCH_VDSO_DATA_SIZE (maybe exceeds a page size if expand in the future) for loongarch vdso data, at last add the callback function vvar_fault() and modify stack_top(). With this patch under CONFIG_TIME_NS: (1) clone3 test # cd tools/testing/selftests/clone3 && make && ./clone3 ... ok 18 [739] Result (0) matches expectation (0) # Totals: pass:18 fail:0 xfail:0 xpass:0 skip:0 error:0 (2) timens test # cd tools/testing/selftests/timens && make && ./timens ... # Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29Makefile: Add loongarch target flag for Clang compilationWANG Xuerui
The LoongArch kernel is 64-bit and built with the soft-float ABI, hence the loongarch64-linux-gnusf target. (The "libc" part can affect the codegen of libcalls: other arches do not use a bare-metal target, and currently the only fully supported libc on LoongArch is glibc anyway.) See: https://lore.kernel.org/loongarch/CAKwvOdnimxv8oJ4mVY74zqtt1x7KTMrWvn2_T9x22SFDbU6rHQ@mail.gmail.com/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Mark Clang LTO as workingWANG Xuerui
Confirmed working with QEMU system emulation. Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Include KBUILD_CPPFLAGS in CHECKFLAGS invocationWANG Xuerui
This is a port of commit 08f6554ff90e ("mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation") to arch/loongarch, for fixing cross-compilation of Linux/LoongArch with Clang, where previously the `--target` flag would no longer be present for the CHECKFLAGS cc invocation leading to build failure. Reported-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1787#issuecomment-1608306002 Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: vDSO: Use CLANG_FLAGS instead of filtering out '--target='WANG Xuerui
This is a port of commit 76d7fff22be3e ("MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target='") to arch/loongarch, for fixing cross-compilation with Clang. Reported-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1787#issuecomment-1608306002 Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Tweak CFLAGS for Clang compatibilityWANG Xuerui
Now the arch code is mostly ready for LLVM/Clang consumption, it is time to re-organize the CFLAGS a little to actually enable the LLVM build. Namely, all -G0 switches from CFLAGS are removed, and -mexplicit-relocs and -mdirect-extern-access are now wrapped with cc-option (with the related asm/percpu.h definition guarded against toolchain combos that are known to not work). A build with !RELOCATABLE && !MODULE is confirmed working within a QEMU environment; support for the two features are currently blocked on LLVM/Clang, and will come later. Why -G0 can be removed: In GCC, -G stands for "small data threshold", that instructs the compiler to put data smaller than the specified threshold in a dedicated "small data" section (called .sdata on LoongArch and several other arches). However, benefiting from this would require ABI cooperation, which is not the case for LoongArch; and current GCC behave the same whether -G0 (equal to disabling this optimization) is given or not. So, remove -G0 from CFLAGS altogether for one less thing to care about. This also benefits LLVM/Clang compatibility where the -G switch is not supported. Why -mexplicit-relocs can now be conditionally applied without regressions: Originally -mexplicit-relocs is unconditionally added to CFLAGS in case of CONFIG_AS_HAS_EXPLICIT_RELOCS, because not having it (i.e. old GCC + new binutils) would not work: modules will have R_LARCH_ABS_* relocs inside, but given the rarity of such toolchain combo in the wild, it may not be worthwhile to support it, so support for such relocs in modules were not added back when explicit relocs support was upstreamed, and -mexplicit-relocs is unconditionally added to fail the build early. Now that Clang compatibility is desired, given Clang is behaving like -mexplicit-relocs from day one but without support for the CLI flag, we must ensure the flag is not passed in case of Clang. However, explicit compiler flavor checks can be more brittle than feature detection: in this case what actually matters is support for __attribute__((model)) when building modules. Given neither older GCC nor current Clang support this attribute, probing for the attribute support and #error'ing out would allow proper UX without checking for Clang, and also automatically work when Clang support for the attribute is to be added in the future. Why -mdirect-extern-access is now conditionally applied: This is actually a nice-to-have optimization that can reduce GOT accesses, but not having it is harmless either. Because Clang does not support the option currently, but might do so in the future, conditional application via cc-option ensures compatibility with both current and future Clang versions. Suggested-by: Xi Ruoyao <xry111@xry111.site> # cc-option changes Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Simplify the invtlb wrappersWANG Xuerui
The invtlb instruction has been supported by upstream LoongArch toolchains from day one, so ditch the raw opcode trickery and just use plain inline asm for it. While at it, also make the invtlb asm statements barriers, for proper modeling of the side effects. The functions are also marked as __always_inline instead of just "inline", because they cannot work at all if not inlined: the op argument will not be compile-time const in that case, thus failing to satisfy the "i" constraint. The signature of the other more specific invtlb wrappers contain unused arguments right now, but these are not removed right away in order for the patch to be focused. In the meantime, assertions are added to ensure no accidental misuse happens before the refactor. (The more specific wrappers cannot re-use the generic invtlb wrapper, because the ISA manual says $zero shall be used in case a particular op does not take the respective argument: re-using the generic wrapper would mean losing control over the register usage.) Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Make the CPUCFG&CSR ops simple aliases of compiler built-insWANG Xuerui
In addition to less visual clutter, this also makes Clang happy regarding the const-ness of arguments. In the original approach, all Clang gets to see is the incoming arguments whose const-ness cannot be proven without first being inlined; so Clang errors out here while GCC is fine. While at it, tweak several printk format strings because the return type of csr_read64 becomes effectively unsigned long, instead of unsigned long long. Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Prepare for assemblers with proper FCSR class supportWANG Xuerui
The GNU assembler (as of 2.40) mis-treats FCSR operands as GPRs, but the LLVM IAS does not. Probe for this and refer to FCSRs as "$fcsrNN" if support is present. Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: extable: Also recognize ABI names of registersWANG Rui
When the kernel is compiled with LLVM, the register names being handled during exception fixup building are ABI names instead of bare $rNN style. Add mapping for the ABI names for LLVM compatibility. Signed-off-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Calculate various sizes in the linker scriptWANG Rui
Taking the address delta between symbols in different sections is not supported by the LLVM IAS. Instead, do this in the linker script, so the same data can be properly referenced in assembly. Signed-off-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: WANG Xuerui <git@xen0n.name> [chenhuacai: Fix build with !CONFIG_EFI_STUB] Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Add guard for the larch_insn_gen_xxx functionsWANG Rui
Add guard for the larch_insn_gen_xxx functions to verify whether the immediate operand is within the acceptable range. Signed-off-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Delete unnecessary debugfs checkingDan Carpenter
Debugfs functions are not supposed to be checked for errors. This is sort of unusual but it is described in the comments for the debugfs_create_dir() function. Also debugfs_create_dir() can never return NULL. Reviewed-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29LoongArch: Set CPU#0 as the io master for FDTHuacai Chen
ACPI systems set io masters by parsing ACPI MADT, FDT systems have no MADT so we explicitly set CPU#0 as the io master. Otherwise CPU#0 will be considered as hotpluggable. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-06-29Merge branch 'fix-ptp-received-on-wrong-port-with-bridged-sja1105-dsa'Paolo Abeni
Vladimir Oltean says: ==================== Fix PTP received on wrong port with bridged SJA1105 DSA Since the changes were made to tag_8021q to support imprecise RX for bridged ports, the tag_sja1105 driver still prefers the source port information deduced from the VLAN headers for link-local traffic, even though the switch can theoretically do better and report the precise source port. The problem is that the tagger doesn't know when to trust one source of information over another, because the INCL_SRCPT option (to "tag" link local frames) is sometimes enabled and sometimes it isn't. The first patch makes the switch provide the hardware tag for link local traffic under all circumstances, and the second patch makes the tagger always use that hardware tag as primary source of information for link local packets. ==================== Link: https://lore.kernel.org/r/20230627094207.3385231-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: dsa: tag_sja1105: always prefer source port information from INCL_SRCPTVladimir Oltean
Currently the sja1105 tagging protocol prefers using the source port information from the VLAN header if that is available, falling back to the INCL_SRCPT option if it isn't. The VLAN header is available for all frames except for META frames initiated by the switch (containing RX timestamps), and thus, the "if (is_link_local)" branch is practically dead. The tag_8021q source port identification has become more loose ("imprecise") and will report a plausible rather than exact bridge port, when under a bridge (be it VLAN-aware or VLAN-unaware). But link-local traffic always needs to know the precise source port. With incorrect source port reporting, for example PTP traffic over 2 bridged ports will all be seen on sockets opened on the first such port, which is incorrect. Now that the tagging protocol has been changed to make link-local frames always contain source port information, we can reverse the order of the checks so that we always give precedence to that information (which is always precise) in lieu of the tag_8021q VID which is only precise for a standalone port. Fixes: d7f9787a763f ("net: dsa: tag_8021q: add support for imprecise RX based on the VBID") Fixes: 91495f21fcec ("net: dsa: tag_8021q: replace the SVL bridging with VLAN-unaware IVL bridging") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: dsa: sja1105: always enable the INCL_SRCPT optionVladimir Oltean
Link-local traffic on bridged SJA1105 ports is sometimes tagged by the hardware with source port information (when the port is under a VLAN aware bridge). The tag_8021q source port identification has become more loose ("imprecise") and will report a plausible rather than exact bridge port, when under a bridge (be it VLAN-aware or VLAN-unaware). But link-local traffic always needs to know the precise source port. Modify the driver logic (and therefore: the tagging protocol itself) to always include the source port information with link-local packets, regardless of whether the port is standalone, under a VLAN-aware or VLAN-unaware bridge. This makes it possible for the tagging driver to give priority to that information over the tag_8021q VLAN header. The big drawback with INCL_SRCPT is that it makes it impossible to distinguish between an original MAC DA of 01:80:C2:XX:YY:ZZ and 01:80:C2:AA:BB:ZZ, because the tagger just patches MAC DA bytes 3 and 4 with zeroes. Only if PTP RX timestamping is enabled, the switch will generate a META follow-up frame containing the RX timestamp and the original bytes 3 and 4 of the MAC DA. Those will be used to patch up the original packet. Nonetheless, in the absence of PTP RX timestamping, we have to live with this limitation, since it is more important to have the more precise source port information for link-local traffic. Fixes: d7f9787a763f ("net: dsa: tag_8021q: add support for imprecise RX based on the VBID") Fixes: 91495f21fcec ("net: dsa: tag_8021q: replace the SVL bridging with VLAN-unaware IVL bridging") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29Merge branch 'fix-ptp-packet-drops-with-ocelot-8021q-dsa-tag-protocol'Paolo Abeni
Vladimir Oltean says: ==================== Fix PTP packet drops with ocelot-8021q DSA tag protocol Changes in v2: - Distinguish between L2 and L4 PTP packets v1 at: https://lore.kernel.org/netdev/20230626154003.3153076-1-vladimir.oltean@nxp.com/ Patch 3/3 fixes an issue with the ocelot/felix driver, where it would drop PTP traffic on RX unless hardware timestamping for that packet type was enabled. Fixing that requires the driver to know whether it had previously configured the hardware to timestamp PTP packets on that port. But it cannot correctly determine that today using the existing code structure, so patches 1/3 and 2/3 fix the control path of the code such that ocelot->ports[port]->trap_proto faithfully reflects whether that configuration took place. ==================== Link: https://lore.kernel.org/r/20230627163114.3561597-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: dsa: felix: don't drop PTP frames with tag_8021q when RX timestamping ↵Vladimir Oltean
is disabled The driver implements a workaround for the fact that it doesn't have an IRQ source to tell it whether PTP frames are available through the extraction registers, for those frames to be processed and passed towards the network stack. That workaround is to configure the switch, through felix_hwtstamp_set() -> felix_update_trapping_destinations(), to create two copies of PTP packets: one sent over Ethernet to the DSA master, and one to be consumed through the aforementioned CPU extraction queue registers. The reason why we want PTP packets to be consumed through the CPU extraction registers in the first place is because we want to see their hardware RX timestamp. With tag_8021q, that is only visible that way, and it isn't visible with the copy of the packet that's transmitted over Ethernet. The problem with the workaround implementation is that it drops the packet received over Ethernet, in expectation of its copy being present in the CPU extraction registers. However, if felix_hwtstamp_set() hasn't run (aka PTP RX timestamping is disabled), the driver will drop the original PTP frame and there will be no copy of it in the CPU extraction registers. So, the network stack will simply not see any PTP frame. Look at the port's trapping configuration to see whether the driver has previously enabled the CPU extraction registers. If it hasn't, just don't RX timestamp the frame and let it be passed up the stack by DSA, which is perfectly fine. Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: mscc: ocelot: don't keep PTP configuration of all ports in single structureVladimir Oltean
In a future change, the driver will need to determine whether PTP RX timestamping is enabled on a port (including whether traps were set up on that port in particular) and that is currently not possible. The driver supports different RX filters (L2, L4) and kinds of TX timestamping (one-step, two-step) on its ports, but it saves all configuration in a single struct hwtstamp_config that is global to the switch. So, the latest timestamping configuration on one port (including a request to disable timestamping) affects what gets reported for all ports, even though the configuration itself is still individual to each port. The port timestamping configurations are only coupled because of the common structure, so replace the hwtstamp_config with a mask of trapped protocols saved per port. We also have the ptp_cmd to distinguish between one-step and two-step PTP timestamping, so with those 2 bits of information we can fully reconstruct a descriptive struct hwtstamp_config for each port, during the SIOCGHWTSTAMP ioctl. Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support") Fixes: 96ca08c05838 ("net: mscc: ocelot: set up traps for PTP packets") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: mscc: ocelot: don't report that RX timestamping is enabled by defaultVladimir Oltean
PTP RX timestamping should be enabled when the user requests it, not by default. If it is enabled by default, it can be problematic when the ocelot driver is a DSA master, and it sidesteps what DSA tries to avoid through __dsa_master_hwtstamp_validate(). Additionally, after the change which made ocelot trap PTP packets only to the CPU at ocelot_hwtstamp_set() time, it is no longer even true that RX timestamping is enabled by default, because until ocelot_hwtstamp_set() is called, the PTP traps are actually not set up. So the rx_filter field of ocelot->hwtstamp_config reflects an incorrect reality. Fixes: 96ca08c05838 ("net: mscc: ocelot: set up traps for PTP packets") Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29arm64: sme: Use STR P to clear FFR context field in streaming SVE modeWill Deacon
The FFR is a predicate register which can vary between 16 and 256 bits in size depending upon the configured vector length. When saving the SVE state in streaming SVE mode, the FFR register is inaccessible and so commit 9f5848665788 ("arm64/sve: Make access to FFR optional") simply clears the FFR field of the in-memory context structure. Unfortunately, it achieves this using an unconditional 8-byte store and so if the SME vector length is anything other than 64 bytes in size we will either fail to clear the entire field or, worse, we will corrupt memory immediately following the structure. This has led to intermittent kfence splats in CI [1] and can trigger kmalloc Redzone corruption messages when running the 'fp-stress' kselftest: | ============================================================================= | BUG kmalloc-1k (Not tainted): kmalloc Redzone overwritten | ----------------------------------------------------------------------------- | | 0xffff000809bf1e22-0xffff000809bf1e27 @offset=7714. First byte 0x0 instead of 0xcc | Allocated in do_sme_acc+0x9c/0x220 age=2613 cpu=1 pid=531 | __kmalloc+0x8c/0xcc | do_sme_acc+0x9c/0x220 | ... Replace the 8-byte store with a store of a predicate register which has been zero-initialised with PFALSE, ensuring that the entire field is cleared in memory. [1] https://lore.kernel.org/r/CA+G9fYtU7HsV0R0dp4XEH5xXHSJFw8KyDf5VQrLLfMxWfxQkag@mail.gmail.com Cc: Mark Brown <broonie@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: 9f5848665788 ("arm64/sve: Make access to FFR optional") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Anders Roxell <anders.roxell@linaro.org> Link: https://lore.kernel.org/r/20230628155605.22296-1-will@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-06-29Merge branch 'net-sched-act_ipt-bug-fixes'Paolo Abeni
Florian Westphal says: ==================== net/sched: act_ipt bug fixes v3: prefer skb_header() helper in patch 2. No other changes. I've retained Acks and RvB-Tags of v2. While checking if netfilter could be updated to replace selected instances of NF_DROP with kfree_skb_reason+NF_STOLEN to improve debugging info via drop monitor I found that act_ipt is incompatible with such an approach. Moreover, it lacks multiple sanity checks to avoid certain code paths that make assumptions that the tc layer doesn't meet, such as header sanity checks, availability of skb_dst, skb_nfct() and so on. act_ipt test in the tc selftest still pass with this applied. I think that we should consider removal of this module, while this should take care of all problems, its ipv4 only and I don't think there are any netfilter targets that lack a native tc equivalent, even when ignoring bpf. ==================== Link: https://lore.kernel.org/r/20230627123813.3036-1-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net/sched: act_ipt: zero skb->cb before calling targetFlorian Westphal
xtables relies on skb being owned by ip stack, i.e. with ipv4 check in place skb->cb is supposed to be IPCB. I don't see an immediate problem (REJECT target cannot be used anymore now that PRE/POSTROUTING hook validation has been fixed), but better be safe than sorry. A much better patch would be to either mark act_ipt as "depends on BROKEN" or remove it altogether. I plan to do this for -next in the near future. This tc extension is broken in the sense that tc lacks an equivalent of NF_STOLEN verdict. With NF_STOLEN, target function takes complete ownership of skb, caller cannot dereference it anymore. ACT_STOLEN cannot be used for this: it has a different meaning, caller is allowed to dereference the skb. At this time NF_STOLEN won't be returned by any targets as far as I can see, but this may change in the future. It might be possible to work around this via list of allowed target extensions known to only return DROP or ACCEPT verdicts, but this is error prone/fragile. Existing selftest only validates xt_LOG and act_ipt is restricted to ipv4 so I don't think this action is used widely. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net/sched: act_ipt: add sanity checks on skb before calling targetFlorian Westphal
Netfilter targets make assumptions on the skb state, for example iphdr is supposed to be in the linear area. This is normally done by IP stack, but in act_ipt case no such checks are made. Some targets can even assume that skb_dst will be valid. Make a minimum effort to check for this: - Don't call the targets eval function for non-ipv4 skbs. - Don't call the targets eval function for POSTROUTING emulation when the skb has no dst set. v3: use skb_protocol helper (Davide Caratti) Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net/sched: act_ipt: add sanity checks on table name and hook locationsFlorian Westphal
Looks like "tc" hard-codes "mangle" as the only supported table name, but on kernel side there are no checks. This is wrong. Not all xtables targets are safe to call from tc. E.g. "nat" targets assume skb has a conntrack object assigned to it. Normally those get called from netfilter nat core which consults the nat table to obtain the address mapping. "tc" userspace either sets PRE or POSTROUTING as hook number, but there is no validation of this on kernel side, so update netlink policy to reject bogus numbers. Some targets may assume skb_dst is set for input/forward hooks, so prevent those from being used. act_ipt uses the hook number in two places: 1. the state hook number, this is fine as-is 2. to set par.hook_mask The latter is a bit mask, so update the assignment to make xt_check_target() to the right thing. Followup patch adds required checks for the skb/packet headers before calling the targets evaluation function. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29sctp: fix potential deadlock on &net->sctp.addr_wq_lockChengfeng Ye
As &net->sctp.addr_wq_lock is also acquired by the timer sctp_addr_wq_timeout_handler() in protocal.c, the same lock acquisition at sctp_auto_asconf_init() seems should disable irq since it is called from sctp_accept() under process context. Possible deadlock scenario: sctp_accept() -> sctp_sock_migrate() -> sctp_auto_asconf_init() -> spin_lock(&net->sctp.addr_wq_lock) <timer interrupt> -> sctp_addr_wq_timeout_handler() -> spin_lock_bh(&net->sctp.addr_wq_lock); (deadlock here) This flaw was found using an experimental static analysis tool we are developing for irq-related deadlock. The tentative patch fix the potential deadlock by spin_lock_bh(). Signed-off-by: Chengfeng Ye <dg573847474@gmail.com> Fixes: 34e5b0118685 ("sctp: delay auto_asconf init until binding the first addr") Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20230627120340.19432-1-dg573847474@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29net: lan743x: Don't sleep in atomic contextMoritz Fischer
dev_set_rx_mode() grabs a spin_lock, and the lan743x implementation proceeds subsequently to go to sleep using readx_poll_timeout(). Introduce a helper wrapping the readx_poll_timeout_atomic() function and use it to replace the calls to readx_polL_timeout(). Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Cc: stable@vger.kernel.org Cc: Bryan Whitehead <bryan.whitehead@microchip.com> Cc: UNGLinuxDriver@microchip.com Signed-off-by: Moritz Fischer <moritzf@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230627035000.1295254-1-moritzf@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-29media: wl128x: fix a clang warningMauro Carvalho Chehab
Clang-16 produces this warning, which is fatal with CONFIG_WERROR: ../drivers/media/radio/wl128x/fmdrv_common.c:1237:19: error: variable 'cmd_cnt' set but not used [-Werror,-Wunused-but-set-variable] int ret, fw_len, cmd_cnt; ^ 1 error generated. What happens is that cmd_cnt tracks the amount of firmware data packets were transfered, which is printed only when debug is used. Switch to use the firmware count, as the message is all about reporting a partial firmware transfer. Link: https://lore.kernel.org/linux-media/6badd27ebfa718d5737f517f18b29a3e0f6e43f8.1687981726.git.mchehab@kernel.org Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-06-28ksmbd: avoid field overflow warningArnd Bergmann
clang warns about a possible field overflow in a memcpy: In file included from fs/smb/server/smb_common.c:7: include/linux/fortify-string.h:583:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning] __write_overflow_field(p_size_field, size); It appears to interpret the "&out[baselen + 4]" as referring to a single byte of the character array, while the equivalen "out + baselen + 4" is seen as an offset into the array. I don't see that kind of warning elsewhere, so just go with the simple rework. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28Merge branch 'expand-stack'Linus Torvalds
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-28Merge tag 'v6.5-rc1-sysctl-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "The changes for sysctl are in line with prior efforts to stop usage of deprecated routines which incur recursion and also make it hard to remove the empty array element in each sysctl array declaration. The most difficult user to modify was parport which required a bit of re-thinking of how to declare shared sysctls there, Joel Granados has stepped up to the plate to do most of this work and eventual removal of register_sysctl_table(). That work ended up saving us about 1465 bytes according to bloat-o-meter. Since we gained a few bloat-o-meter karma points I moved two rather small sysctl arrays from kernel/sysctl.c leaving us only two more sysctl arrays to move left. Most changes have been tested on linux-next for about a month. The last straggler patches are a minor parport fix, changes to the sysctl kernel selftest so to verify correctness and prevent regressions for the future change he made to provide an alternative solution for the special sysctl mount point target which was using the now deprecated sysctl child element. This is all prep work to now finally be able to remove the empty array element in all sysctl declarations / registrations which is expected to save us a bit of bytes all over the kernel. That work will be tested early after v6.5-rc1 is out" * tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: replace child with an enumeration sysctl: Remove debugging dump_stack test_sysclt: Test for registering a mount point test_sysctl: Add an option to prevent test skip test_sysctl: Add an unregister sysctl test test_sysctl: Group node sysctl test under one func test_sysctl: Fix test metadata getters parport: plug a sysctl register leak sysctl: move security keys sysctl registration to its own file sysctl: move umh sysctl registration to its own file signal: move show_unhandled_signals sysctl to its own file sysctl: remove empty dev table sysctl: Remove register_sysctl_table sysctl: Refactor base paths registrations sysctl: stop exporting register_sysctl_table parport: Removed sysctl related defines parport: Remove register_sysctl_table from parport_default_proc_register parport: Remove register_sysctl_table from parport_device_proc_register parport: Remove register_sysctl_table from parport_proc_register parport: Move magic number "15" to a define
2023-06-28Merge tag 'v6.5-rc1-modules-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull module updates from Luis Chamberlain: "The changes queued up for modules are pretty tame, mostly code removal of moving of code. Only two minor functional changes are made, the only one which stands out is Sebastian Andrzej Siewior's simplification of module reference counting by removing preempt_disable() and that has been tested on linux-next for well over a month without no regressions. I'm now, I guess, also a kitchen sink for some kallsyms changes" [ There was a mis-communication about the concurrent module load changes that I had expected to come through Luis despite me authoring the patch. So some of the module updates were left hanging in the email ether, and I just committed them separately. It's my bad - I should have made it more clear that I expected my own patches to come through the module tree too. Now they missed linux-next, but hopefully that won't cause any issues - Linus ] * tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: kallsyms: make kallsyms_show_value() as generic function kallsyms: move kallsyms_show_value() out of kallsyms.c kallsyms: remove unsed API lookup_symbol_attrs kallsyms: remove unused arch_get_kallsym() helper module: Remove preempt_disable() from module reference counting.
2023-06-28modules: catch concurrent module loads, treat them as idempotentLinus Torvalds
This is the new-and-improved attempt at avoiding huge memory load spikes when the user space boot sequence tries to load hundreds (or even thousands) of redundant duplicate modules in parallel. See commit 9828ed3f695a ("module: error out early on concurrent load of the same module file") for background and an earlier failed attempt that was reverted. That earlier attempt just said "concurrently loading the same module is silly, just open the module file exclusively and return -ETXTBSY if somebody else is already loading it". While it is true that concurrent module loads of the same module is silly, the reason that earlier attempt then failed was that the concurrently loaded module would often be a prerequisite for another module. Thus failing to load the prerequisite would then cause cascading failures of the other modules, rather than just short-circuiting that one unnecessary module load. At the same time, we still really don't want to load the contents of the same module file hundreds of times, only to then wait for an eventually successful load, and have everybody else return -EEXIST. As a result, this takes another approach, and treats concurrent module loads from the same file as "idempotent" in the inode. So if one module load is ongoing, we don't start a new one, but instead just wait for the first one to complete and return the same return value as it did. So unlike the first attempt, this does not return early: the intent is not to speed up the boot, but to avoid a thundering herd problem in allocating memory (both physical and virtual) for a module more than once. Also note that this does change behavior: it used to be that when you had concurrent loads, you'd have one "winner" that would return success, and everybody else would return -EEXIST. In contrast, this idempotent logic goes all Oprah on the problem, and says "You are a winner! And you are a winner! We are ALL winners". But since there's no possible actual real semantic difference between "you loaded the module" and "somebody else already loaded the module", this is more of a feel-good change than an actual honest-to-goodness semantic change. Of course, any true Johnny-come-latelies that don't get caught in the concurrency filter will still return -EEXIST. It's no different from not even getting a seat at an Oprah taping. That's life. See the long thread on the kernel mailing list about this all, which includes some numbers for memory use before and after the patch. Link: https://lore.kernel.org/lkml/20230524213620.3509138-1-mcgrof@kernel.org/ Reviewed-by: Johan Hovold <johan@kernel.org> Tested-by: Johan Hovold <johan@kernel.org> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Rudi Heitbaum <rudi@heitbaum..com> Tested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28module: split up 'finit_module()' into init_module_from_file() helperLinus Torvalds
This will simplify the next step, where we can then key off the inode to do one idempotent module load. Let's do the obvious re-organization in one step, and then the new code in another. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28nvme: improved uring pollingKeith Busch
Drivers can poll requests directly, so use that. We just need to ensure the driver's request was allocated from a polled hctx, so a special driver flag is added to struct io_uring_cmd. The allows unshared and multipath namespaces to use the same polling callback, and multipath is guaranteed to get the same queue as the command was submitted on. Previously multipath polling might check a different path and poll the wrong info. The other bonus is we don't need a bio payload in order to poll, allowing commands like 'flush' and 'write zeroes' to be submitted on the same high priority queue as read and write commands. Finally, using the request based polling skips the unnecessary bio overhead. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28block: add request polling helperKeith Busch
Provide a direct request polling will for drivers. The interface does not require a bio, and can skip the overhead associated with polling those. The biggest gain from skipping the relatively expensive xarray lookup unnecessary when you already have the request. With this, the simple rq/qc conversion functions have only one caller each, so open code this and remove the helpers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28Merge branch 'for-6.5/block-late' into block-6.5Jens Axboe
* for-6.5/block-late: blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock md: use mddev->external to select holder in export_rdev() md/raid1-10: fix casting from randomized structure in raid1_submit_write() md/raid10: fix the condition to call bio_end_io_acct()
2023-06-28Merge tag 'mmc-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds
Pull MMC updates from Ulf Hansson: "MMC core: - Allow synchronous detection of (e)MMC/SD/SDIO cards - Fixup error check for ioctls for SPI hosts - Disable broken SD-Cache support for Kingston Canvas Go Plus from 2019 - Disable broken eMMC-Trim support for Kingston EMMC04G-M627 - Disable broken eMMC-Trim support for Micron MTFC4GACAJCN-1M MMC host: - bcm2835: Convert DT bindings to YAML - mmci: - Enable asynchronous probe - Transform the ux500 HW-busy detection into a proper state machine - Add support for SW busy-end timeouts for the ux500 variants - mmci_stm32: - Add support for sdm32 variant revision v3.0 used on STM32MP25 - Improve the tuning sequence - mtk-sd: Tune polling-period to improve performance - sdhci: Fixup DMA configuration for 64-bit DMA mode - sdhci-bcm-kona: Convert DT bindings to YAML - sdhci-msm: - Switch to use the new ICE API - Add support for the SC8280XP/IPQ6018/QDU1000/QRU1000 variants - sdhci-pci-gli: - Add support SD Express cards for GL9767 - Add support for the Genesys Logic GL9767 variant" * tag 'mmc-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (42 commits) dt-bindings: mmc: fsl-imx-esdhc: Add imx6ul support mmc: mmci: Add support for SW busy-end timeouts mmc: Add MMC_QUIRK_BROKEN_SD_CACHE for Kingston Canvas Go Plus from 11/2019 mmc: core: disable TRIM on Kingston EMMC04G-M627 mmc: mmci: stm32: add delay block support for STM32MP25 mmc: mmci: stm32: prepare other delay block support mmc: mmci: stm32: manage block gap hardware flow control mmc: mmci: Add support for sdmmc variant revision v3.0 mmc: mmci: add stm32_idmabsize_align parameter dt-bindings: mmc: mmci: Add st,stm32mp25-sdmmc2 compatible mmc: core: disable TRIM on Micron MTFC4GACAJCN-1M mmc: mmci: Break out a helper function mmc: mmci: Use a switch statement machine mmc: mmci: Use state machine state as exit condition mmc: mmci: Retry the busy start condition mmc: mmci: Make busy complete state machine explicit mmc: mmci: Break out error check in busy detect mmc: mmci: Stash status while waiting for busy mmc: mmci: Unwind big if() clause mmc: mmci: Clear busy_status when starting command ...
2023-06-28Merge tag 'mtd/for-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull mtd updates from "Core MTD changes: - otp: - Put factory OTP/NVRAM into the entropy pool - Clean up on error in mtd_otp_nvmem_add() MTD devices changes: - sm_ftl: Fix typos in comments - Use SPDX license headers - pismo: Switch back to use i2c_driver's .probe() - mtdpart: Drop useless LIST_HEAD - st_spi_fsm: Use the devm_clk_get_enabled() helper function DT binding changes: - partitions: - Include TP-Link SafeLoader in allowed list - Add missing type for "linux,rootfs" - Extend the nand node names filter - Create a file for raw NAND chip properties - Mark nand-ecc-placement deprecated - Describe nand-ecc-mode - Prevent NAND chip unevaluated properties in all NAND bindings with a NAND chip reference. - Qcom: Fix a property position - Marvell: Convert to YAML DT schema Raw NAND chip drivers changes: - Macronix: OTP access for MX30LFxG18AC - Add basic Sandisk manufacturer ops - Add support for Sandisk SDTNQGAMA Raw NAND controller driver changes: - Meson: - Replace integer consts with proper defines - Allow waiting w/o wired ready/busy pin - Check buffer length validity - Fix unaligned DMA buffers handling - dt-bindings: Fix 'nand-rb' property - Arasan: Revert "mtd: rawnand: arasan: Prevent an unsupported configuration" as this limitation is no longer true thanks to the recent efforts in improving the clocks support in this driver SPI-NAND changes: - Gigadevice: add support for GD5F2GQ5xExxH - Macronix: Add support for serial NAND flashes" * tag 'mtd/for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (38 commits) dt-bindings: mtd: marvell-nand: Convert to YAML DT scheme dt-bindings: mtd: ti,am654: Prevent unevaluated properties dt-bindings: mtd: mediatek: Prevent NAND chip unevaluated properties dt-bindings: mtd: mediatek: Reference raw-nand-chip.yaml dt-bindings: mtd: stm32: Prevent NAND chip unevaluated properties dt-bindings: mtd: rockchip: Prevent NAND chip unevaluated properties dt-bindings: mtd: intel: Prevent NAND chip unevaluated properties dt-bindings: mtd: denali: Prevent NAND chip unevaluated properties dt-bindings: mtd: brcmnand: Prevent NAND chip unevaluated properties dt-bindings: mtd: meson: Prevent NAND chip unevaluated properties dt-bindings: mtd: sunxi: Prevent NAND chip unevaluated properties dt-bindings: mtd: ingenic: Prevent NAND chip unevaluated properties dt-bindings: mtd: qcom: Prevent NAND chip unevaluated properties dt-bindings: mtd: qcom: Fix a property position dt-bindings: mtd: Describe nand-ecc-mode dt-bindings: mtd: Mark nand-ecc-placement deprecated dt-bindings: mtd: Create a file for raw NAND chip properties dt-bindings: mtd: Accept nand related node names mtd: sm_ftl: Fix typos in comments mtd: otp: clean up on error in mtd_otp_nvmem_add() ...
2023-06-28smb: client: improve DFS mount checkPaulo Alcantara
Some servers may return error codes from REQ_GET_DFS_REFERRAL requests that are unexpected by the client, so to make it easier, assume non-DFS mounts when the client can't get the initial DFS referral of @ctx->UNC in dfs_mount_share(). Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28smb: client: fix shared DFS root mounts with different prefixesPaulo Alcantara
When having two DFS root mounts that are connected to same namespace, same mount options but different prefix paths, we can't really use the shared @server->origin_fullpath when chasing DFS links in them. Move the origin_fullpath field to cifs_tcon structure so when having shared DFS root mounts with different prefix paths, and we need to chase any DFS links, dfs_get_automount_devname() will pick up the correct full path out of the @tcon that will be used for the new mount. Before patch mount.cifs //dom/dfs/dir /mnt/1 -o ... mount.cifs //dom/dfs /mnt/2 -o ... # shared server, ses, tcon # server: origin_fullpath=//dom/dfs/dir # @server->origin_fullpath + '/dir/link1' $ ls /mnt/2/dir/link1 ls: cannot open directory '/mnt/2/dir/link1': No such file or directory After patch mount.cifs //dom/dfs/dir /mnt/1 -o ... mount.cifs //dom/dfs /mnt/2 -o ... # shared server & ses # tcon_1: origin_fullpath=//dom/dfs/dir # tcon_2: origin_fullpath=//dom/dfs # @tcon_2->origin_fullpath + '/dir/link1' $ ls /mnt/2/dir/link1 dir0 dir1 dir10 dir3 dir5 dir6 dir7 dir9 target2_file.txt tsub Fixes: 8e3554150d6c ("cifs: fix sharing of DFS connections") Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-28Merge tag 'spi-v6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi updates from Mark Brown: "One small core feature this time around but mostly driver improvements and additions for SPI: - Add support for controlling the idle state of MOSI, some systems can support this and depending on the system integration may need it to avoid glitching in some situations - Support for polling mode in the S3C64xx driver and DMA on the Qualcomm QSPI driver - Support for several Allwinner SoCs, AMD Pensando Elba, Intel Mount Evans, Renesas RZ/V2M, and ST STM32H7" * tag 'spi-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (66 commits) spi: dt-bindings: atmel,at91rm9200-spi: fix broken sam9x7 compatible spi: dt-bindings: atmel,at91rm9200-spi: add sam9x7 compatible spi: Add support for Renesas CSI spi: dt-bindings: Add bindings for RZ/V2M CSI spi: sun6i: Use the new helper to derive the xfer timeout value spi: atmel: Prevent false timeouts on long transfers spi: dt-bindings: stm32: do not disable spi-slave property for stm32f4-f7 spi: Create a helper to derive adaptive timeouts spi: spi-geni-qcom: correctly handle -EPROBE_DEFER from dma_request_chan() spi: stm32: disable spi-slave property for stm32f4-f7 spi: stm32: introduction of stm32h7 SPI device mode support spi: stm32: use dmaengine_terminate_{a}sync instead of _all spi: stm32: renaming of spi_master into spi_controller spi: dw: Remove misleading comment for Mount Evans SoC spi: dt-bindings: snps,dw-apb-ssi: Add compatible for Intel Mount Evans SoC spi: dw: Add compatible for Intel Mount Evans SoC spi: s3c64xx: Use dev_err_probe() spi: s3c64xx: Use the managed spi master allocation function spi: spl022: Probe defer is no error spi: spi-imx: fix mixing of native and gpio chipselects for imx51/imx53/imx6 variants ...
2023-06-28Merge tag 'regulator-v6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator updates from Mark Brown: "This release is almost all drivers, there's some small improvements in the core but otherwise everything is updates to drivers, mostly the addition of new ones. There's also a bunch of changes pulled in from the MFD subsystem as dependencies, Rockchip and TI core MFD code that the regulator drivers depend on. I've also yet again managed to put a SPI commit in the regulator tree, I don't know what it is about those two trees (this for spi-geni-qcom). Summary: - Support for Renesas RAA215300, Rockchip RK808, Texas Instruments TPS6594 and TPS6287x, and X-Powers AXP15060 and AXP313a" * tag 'regulator-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (43 commits) regulator: Add Renesas PMIC RAA215300 driver regulator: dt-bindings: Add Renesas RAA215300 PMIC bindings regulator: ltc3676: Use maple tree register cache regulator: ltc3589: Use maple tree register cache regulator: helper: Document ramp_delay parameter of regulator_set_ramp_delay_regmap() regulator: mt6358: Use linear voltage helpers for single range regulators regulator: mt6358: Const-ify mt6358_regulator_info data structures regulator: mt6358: Drop *_SSHUB regulators regulator: mt6358: Merge VCN33_* regulators regulator: dt-bindings: mt6358: Drop *_sshub regulators regulator: dt-bindings: mt6358: Merge ldo_vcn33_* regulators regulator: dt-bindings: pwm-regulator: Add missing type for "pwm-dutycycle-unit" regulator: Switch two more i2c drivers back to use .probe() spi: spi-geni-qcom: Do not do DMA map/unmap inside driver, use framework instead soc: qcom: geni-se: Add interfaces geni_se_tx_init_dma() and geni_se_rx_init_dma() regulator: tps6594-regulator: Add driver for TI TPS6594 regulators regulator: axp20x: Add AXP15060 support regulator: axp20x: Add support for AXP313a variant dt-bindings: pfuze100.yaml: Add an entry for interrupts regulator: stm32-pwr: Fix regulator disabling ...
2023-06-28Merge tag 'regmap-v6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap updates from Mark Brown: "Another busy release for regmap with the second half of the maple tree register cache implementation, there's some smaller optimisations that could be done but this should now be able to replace the rbtree cache for most devices. We also had a followup from Aidan MacDonald's refactoring of some of the regmap-irq interfaces, the conversion is complete so the old interfaces are removed. This means that even with the new features for the maple tree cache we'd have a nice negative diffstat were it not for the addition of a bunch more KUnit coverage. There's one GPIO patch in here, it was a dependency for a cleanup of an API in the regmap-irq code for which the gpio-104-dio-48e driver was the only user. Highlights: - The maple tree cache can now load in default values more efficiently, and is capabale of syncing multiple registers in a single write during cache sync - More KUnit coverage, including some coverage for raw I/O and a dummy RAM backed cache to support it - Removal of several old interfaces in regmap-irq now all users have been modernised" * tag 'regmap-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (23 commits) regmap: Allow reads from write only registers with the flat cache regmap: Drop early readability check regmap: Check for register readability before checking cache during read regmap: Add test to make sure we don't sync to read only registers regmap: Add a test case for write only registers regmap: Add test that writes to write only registers are prevented regmap: Add debugfs file for forcing field writes regmap: Don't check for changes in regcache_set_val() regmap: maple: Implement block sync for the maple tree cache regmap: Provide basic KUnit coverage for the raw register I/O regmap: Provide a ram backed regmap with raw support regmap: Add missing cache_only checks regmap: regmap-irq: Move handle_post_irq to before pm_runtime_put regmap: Load register defaults in blocks rather than register by register regmap: mmio: Allow passing an empty config->reg_stride regmap-irq: Drop backward compatibility for inverted mask/unmask regmap-irq: Minor adjustments to .handle_mask_sync() regmap-irq: Remove support for not_fixed_stride regmap-irq: Remove type registers regmap-irq: Remove virtual registers ...
2023-06-29Merge tag 'drm-intel-next-fixes-2023-06-21' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-intel into drm-next One fix for incorrect error handling in the frame buffer mmap callback, HuC init error handling fix, missing wakeref during GSC init and a build fix when !CONFIG_PROC_FS. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ZJLI8ON96ApPTl8H@tursulin-desk
2023-06-28x86/mem_encrypt: Remove stale mem_encrypt_init() declarationLinus Torvalds
The memory encryption initialization logic was moved from init/main.c into arch_cpu_finalize_init() in commit 439e17576eb4 ("init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()"), but a stale declaration for the init function was left in <linux/init.h>. And didn't cause any problems if you had X86_MEM_ENCRYPT enabled, which apparently everybody involved did have. See also commit 0a9567ac5e6a ("x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build") in this whole sad saga of conflicting declarations for different situations. Reported-by: Matthew Wilcox <willy@infradead.org> Fixes: 439e17576eb4 init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init() Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28mm: fix __access_remote_vm() GUP failure caseLinus Torvalds
Commit ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") removed the vma argument from GUP handling, and instead added a helper function (get_user_page_vma_remote()) that looks it up separately using 'vma_lookup()'. And then converted existing users that needed a vma to use the helper instead. However, the helper function intentionally acts exactly like the old get_user_pages_remote() did, and only fills in 'vma' on successful page lookup. Fine so far. However, __access_remote_vm() wants the vma even for the unsuccessful case, and used to do a vma = vma_lookup(mm, addr); explicitly to look it up when the get_user_page() failed. However, that conversion commit incorrectly removed that vma lookup, thinking that get_user_page_vma_remote() would have done it. Not so. So add the vma_lookup() back in. Fixes: ca5e863233e8 ("mm/gup: remove vmas parameter from get_user_pages_remote()") Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: - Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level directories - Douglas Anderson has added a new "buddy" mode to the hardlockup detector. It permits the detector to work on architectures which cannot provide the required interrupts, by having CPUs periodically perform checks on other CPUs - Zhen Lei has enhanced kexec's ability to support two crash regions - Petr Mladek has done a lot of cleanup on the hard lockup detector's Kconfig entries - And the usual bunch of singleton patches in various places * tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) kernel/time/posix-stubs.c: remove duplicated include ocfs2: remove redundant assignment to variable bit_off watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h devres: show which resource was invalid in __devm_ioremap_resource() watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64 watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h watchdog/hardlockup: make the config checks more straightforward watchdog/hardlockup: sort hardlockup detector related config values a logical way watchdog/hardlockup: move SMP barriers from common code to buddy code watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY watchdog/buddy: don't copy the cpumask in watchdog_next_cpu() watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog() watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy() watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick() watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe() watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails ...