summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-04-19net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer supportTariq Toukan
In preparation for supporting XDP multi-buffer in striding RQ, use xdp_buff struct to describe the packet. Make its skb_shared_info collide the one of the allocated SKB, then add the fragments using the xdp_buff API. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: RX, Generalize mlx5e_fill_mxbuf()Tariq Toukan
Make the function more generic. Let it get an additional frame_sz parameter instead of deriving it from the RQ struct. No functional change here, just a preparation for a downstream patch. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: RX, Take shared info fragment addition into a functionTariq Toukan
Introduce mlx5e_add_skb_shared_info_frag(), a function dedicated for adding a fragment into a struct skb_shared_info object. Use it in the Legacy RQ flow. Similar usage will be added in a downstream patch by the corresponding Striding RQ flow. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Allow non-linear single-segment frames in XDP TX MPWQETariq Toukan
Under a few restrictions, TX MPWQE feature can serve multiple TX packets in a single TX descriptor. It requires each of the packets to have a single scatter entry / segment. Today we allow only linear frames to use this feature, although there's no real problem with non-linear ones where the whole packet reside in the first fragment. Expand the XDP TX MPWQE feature support to include such frames. This is in preparation for the downstream patch, in which we will generate such non-linear frames. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Remove un-established assumptions on XDP bufferTariq Toukan
Remove the assumption of non-zero linear length in the XDP xmit function, used to serve both internal XDP_TX operations as well as redirected-in requests. Do not apply the MLX5E_XDP_MIN_INLINE check unless necessary. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Consider large muti-buffer packets in Striding RQ params ↵Tariq Toukan
calculations Function mlx5e_rx_get_linear_stride_sz() returns PAGE_SIZE immediately in case an XDP program is attached. The more accurate formula is ALIGN(sz, PAGE_SIZE), to prevent two packets from residing on the same page. The assumption behind the current code is that sz <= PAGE_SIZE holds for all cases with XDP program set. This is true because it is being called from: - 3 times from Striding RQ flows, in which XDP is not supported for such large packets. - 1 time from Legacy RQ flow, under the condition mlx5e_rx_is_linear_skb(). No functional change here, just removing the implied assumption in preparation for supporting XDP multi-buffer in Striding RQ. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Let XDP checker function get the params as inputTariq Toukan
Change mlx5e_xdp_allowed() so it gets the params structure with the xdp_prog applied, rather than creating a local copy based on the current params in priv. This reduces the amount of memory on the stack, and acts on the exact params instance that's about to be applied. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Improve Striding RQ check with XDPTariq Toukan
Non-linear mem scheme of Striding RQ does not yet support XDP at this point. Take the check where it belongs, inside the params validation function mlx5e_params_validate_xdp(). Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Add support for multi-buffer XDP redirect-inTariq Toukan
Handle multi-buffer XDP redirect-in requests coming through mlx5e_xdp_xmit. Extend struct mlx5e_xmit_data_frags with an additional dma_arr field, to point to the fragments dma mapping, as they cannot be retrieved via the page_pool_get_dma_addr() function. Push a dma_addr xdpi instance per each fragment, and use them in the completion flow to dma_unmap the frags. Finally, remove the restriction in mlx5e_open_xdpsq, and set the flag in xdp_features. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifoTariq Toukan
Here we fix the current wi->num_pkts abuse, as it was used to indicate multiple xdpi entries in the xdpi_fifo. Instead, reduce mlx5e_xdp_info to the size of a single field, making it a union of unions. Per packet, use as many instances as needed to provide the information needed at the time of completion. The sequence of xdpi instances pushed is well defined, derived by the xmit_mode. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: XDP, Remove doubtful unlikely callsTariq Toukan
It is not likely nor unlikely that the xdp buff has fragments, it depends on the program loaded and size of the packet received. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: Introduce extended version for mlx5e_xmit_dataTariq Toukan
Introduce struct mlx5e_xmit_data_frags to be used for non-linear xmit buffers. Let it include sinfo pointer. Take one bit from the len field to indicate if the descriptor has fragments and can be casted-up into the extended version. Zero-init to make sure has_frags, and potentially future fields, are zero when not explicitly assigned. Another field will be added in a downstream patch to indicate and point to dma addresses of the different frags, for redirect-in requests. This simplifies the mlx5e_xmit_xdp_frame/mlx5e_xmit_xdp_frame_mpwqe functions params. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: Move struct mlx5e_xmit_data to datapath headerTariq Toukan
Move TX datapath struct from the generic en.h to the datapath txrx.h header, where it belongs. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: Move XDP struct and enum to XDP headerTariq Toukan
Move struct mlx5e_xdp_info and enum mlx5e_xdp_xmit_mode from the generic en.h to the XDP header, where they belong. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19bonding: Fix memory leak when changing bond type to EthernetIdo Schimmel
When a net device is put administratively up, its 'IFF_UP' flag is set (if not set already) and a 'NETDEV_UP' notification is emitted, which causes the 8021q driver to add VLAN ID 0 on the device. The reverse happens when a net device is put administratively down. When changing the type of a bond to Ethernet, its 'IFF_UP' flag is incorrectly cleared, resulting in the kernel skipping the above process and VLAN ID 0 being leaked [1]. Fix by restoring the flag when changing the type to Ethernet, in a similar fashion to the restoration of the 'IFF_SLAVE' flag. The issue can be reproduced using the script in [2], with example out before and after the fix in [3]. [1] unreferenced object 0xffff888103479900 (size 256): comm "ip", pid 329, jiffies 4294775225 (age 28.561s) hex dump (first 32 bytes): 00 a0 0c 15 81 88 ff ff 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff8406426c>] vlan_vid_add+0x30c/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 unreferenced object 0xffff88810f6a83e0 (size 32): comm "ip", pid 329, jiffies 4294775225 (age 28.561s) hex dump (first 32 bytes): a0 99 47 03 81 88 ff ff a0 99 47 03 81 88 ff ff ..G.......G..... 81 00 00 00 01 00 00 00 cc cc cc cc cc cc cc cc ................ backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff84064369>] vlan_vid_add+0x409/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 [2] ip link add name t-nlmon type nlmon ip link add name t-dummy type dummy ip link add name t-bond type bond mode active-backup ip link set dev t-bond up ip link set dev t-nlmon master t-bond ip link set dev t-nlmon nomaster ip link show dev t-bond ip link set dev t-dummy master t-bond ip link show dev t-bond ip link del dev t-bond ip link del dev t-dummy ip link del dev t-nlmon [3] Before: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 46:57:39:a4:46:a2 brd ff:ff:ff:ff:ff:ff After: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 66:48:7b:74:b6:8a brd ff:ff:ff:ff:ff:ff Fixes: e36b9d16c6a6 ("bonding: clean muticast addresses when device changes type") Fixes: 75c78500ddad ("bonding: remap muticast addresses without using dev_close() and dev_open()") Fixes: 9ec7eb60dcbc ("bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change") Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Link: https://lore.kernel.org/netdev/78a8a03b-6070-3e6b-5042-f848dab16fb8@alu.unizg.hr/ Tested-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19wifi: iwlwifi: dvm: Fix memcpy: detected field-spanning write backtraceHans de Goede
A received TKIP key may be up to 32 bytes because it may contain MIC rx/tx keys too. These are not used by iwl and copying these over overflows the iwl_keyinfo.key field. Add a check to not copy more data to iwl_keyinfo.key then will fit. This fixes backtraces like this one: memcpy: detected field-spanning write (size 32) of single field "sta_cmd.key.key" at drivers/net/wireless/intel/iwlwifi/dvm/sta.c:1103 (size 16) WARNING: CPU: 1 PID: 946 at drivers/net/wireless/intel/iwlwifi/dvm/sta.c:1103 iwlagn_send_sta_key+0x375/0x390 [iwldvm] <snip> Hardware name: Dell Inc. Latitude E6430/0H3MT5, BIOS A21 05/08/2017 RIP: 0010:iwlagn_send_sta_key+0x375/0x390 [iwldvm] <snip> Call Trace: <TASK> iwl_set_dynamic_key+0x1f0/0x220 [iwldvm] iwlagn_mac_set_key+0x1e4/0x280 [iwldvm] drv_set_key+0xa4/0x1b0 [mac80211] ieee80211_key_enable_hw_accel+0xa8/0x2d0 [mac80211] ieee80211_key_replace+0x22d/0x8e0 [mac80211] <snip> Link: https://www.alionet.org/index.php?topic=1469.0 Link: https://lore.kernel.org/linux-wireless/20230218191056.never.374-kees@kernel.org/ Link: https://lore.kernel.org/linux-wireless/68760035-7f75-1b23-e355-bfb758a87d83@redhat.com/ Cc: Kees Cook <keescook@chromium.org> Suggested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-04-19ALSA: Use of_property_read_bool() for boolean propertiesRob Herring
It is preferred to use typed property access functions (i.e. of_property_read_<type> functions) rather than low-level of_get_property/of_find_property functions for reading properties. Convert reading boolean properties to to of_property_read_bool(). Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20230310144734.1546587-1-robh@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-04-19ALSA: ppc/tumbler: Use of_property_present() for testing DT property presenceRob Herring
It is preferred to use typed property access functions (i.e. of_property_read_<type> functions) rather than low-level of_get_property/of_find_property functions for reading properties. As part of this, convert of_get_property/of_find_property calls to the recently added of_property_present() helper when we just want to test for presence of a property and nothing more. Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20230310144733.1546500-1-robh@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-04-18net: ethernet: stmmac: dwmac-sti: remove stih415/stih416/stid127Alain Volmat
Remove no more supported platforms (stih415/stih416 and stid127) Signed-off-by: Alain Volmat <avolmat@me.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230416195523.61075-1-avolmat@me.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18net: mscc: ocelot: remove incompatible prototypesArnd Bergmann
The types for the register argument changed recently, but there are still incompatible prototypes that got left behind, and gcc-13 warns about these: In file included from drivers/net/ethernet/mscc/ocelot.c:13: drivers/net/ethernet/mscc/ocelot.h:97:5: error: conflicting types for 'ocelot_port_readl' due to enum/integer mismatch; have 'u32(struct ocelot_port *, u32)' {aka 'unsigned int(struct ocelot_port *, unsigned int)'} [-Werror=enum-int-mismatch] 97 | u32 ocelot_port_readl(struct ocelot_port *port, u32 reg); | ^~~~~~~~~~~~~~~~~ Just remove the two prototypes, and rely on the copy in the global header. Fixes: 9ecd05794b8d ("net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230417205531.1880657-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18net: stmmac: propagate feature flags to vlanCorinna Vinschen
stmmac_dev_probe doesn't propagate feature flags to VLANs. So features like offloading don't correspond with the general features and it's not possible to manipulate features via ethtool -K to affect VLANs. Propagate feature flags to vlan features. Drop TSO feature because it does not work on VLANs yet. Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Link: https://lore.kernel.org/r/20230417192845.590034-1-vinschen@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19tools/loongarch: Use __SIZEOF_LONG__ to define __BITS_PER_LONGTiezhu Yang
Although __SIZEOF_POINTER__ is equal to _SIZEOF_LONG__ on LoongArch, it is better to use __SIZEOF_LONG__ to define __BITS_PER_LONG to keep consistent between arch/loongarch/include/uapi/asm/bitsperlong.h and tools/arch/loongarch/include/uapi/asm/bitsperlong.h. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-19LoongArch: Replace hard-coded values in comments with VALENEnze Li
According to LoongArch documentation [1], CSR.PGDL and CSR.PGDH are concerned with the VA's MSB which is VALEN-1 instead of always being 47. Fix comments to avoid misleading others. [1] https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#page-global-directory-base-address-for-lower-half-address-space Reviewed-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Enze Li <lienze@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-19LoongArch: Clean up plat_swiotlb_setup() related codeTiezhu Yang
After commit c78c43fe7d42 ("LoongArch: Use acpi_arch_dma_setup() and remove ARCH_HAS_PHYS_TO_DMA"), plat_swiotlb_setup() has been deleted, so clean up the related code. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-19LoongArch: Check unwind_error() in arch_stack_walk()Tiezhu Yang
We can see the following messages with CONFIG_PROVE_LOCKING=y on LoongArch: BUG: MAX_STACK_TRACE_ENTRIES too low! turning off the locking correctness validator. This is because stack_trace_save() returns a big value after call arch_stack_walk(), here is the call trace: save_trace() stack_trace_save() arch_stack_walk() stack_trace_consume_entry() arch_stack_walk() should return immediately if unwind_next_frame() failed, no need to do the useless loops to increase the value of c->len in stack_trace_consume_entry(), then we can fix the above problem. Cc: stable@vger.kernel.org Reported-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/all/8a44ad71-68d2-4926-892f-72bfc7a67e2a@roeck-us.net/ Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-19LoongArch: Adjust user_regset_copyin parameter to the correct offsetQing Zhang
Ensure that user_watch_state can be set correctly by the user. Signed-off-by: Qing Zhang <zhangqing@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-19LoongArch: Adjust user_watch_state for explicit alignmentQing Zhang
This is done in order to easily calculate the number of breakpoints in hw_break_get()/hw_break_set(). Signed-off-by: Qing Zhang <zhangqing@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-04-18bonding: add software tx timestamping supportHangbin Liu
Currently, bonding only obtain the timestamp (ts) information of the active slave, which is available only for modes 1, 5, and 6. For other modes, bonding only has software rx timestamping support. However, some users who use modes such as LACP also want tx timestamp support. To address this issue, let's check the ts information of each slave. If all slaves support tx timestamping, we can enable tx timestamping support for the bond. Add a note that the get_ts_info may be called with RCU, or rtnl or reference on the device in ethtool.h> Suggested-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20230418034841.2566262-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Unbreak br_netfilter physdev match support, from Florian Westphal. 2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian. 3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal. 4) Fix validation of catch-all set element. 5) Tighten requirements for catch-all set elements. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements netfilter: nf_tables: validate catch-all set elements netfilter: nf_tables: fix ifdef to also consider nf_tables=m netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT netfilter: br_netfilter: fix recent physdev match breakage ==================== Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18Merge patch series "riscv: Use PUD/P4D/PGD pages for the linear mapping"Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says: This patchset intends to improve tlb utilization by using hugepages for the linear mapping. As reported by Anup in v6, when STRICT_KERNEL_RWX is enabled, we must take care of isolating the kernel text and rodata so that they are not mapped with a PUD mapping which would then assign wrong permissions to the whole region: it is achieved the same way as arm64 by using the memblock nomap API which isolates those regions and re-merge them afterwards thus avoiding any issue with the system resources tree creation. arch/riscv/include/asm/page.h | 19 ++++++- arch/riscv/mm/init.c | 102 ++++++++++++++++++++++++++-------- arch/riscv/mm/physaddr.c | 16 ++++++ drivers/of/fdt.c | 11 ++-- 4 files changed, 118 insertions(+), 30 deletions(-) * b4-shazam-merge: riscv: Use PUD/P4D/PGD pages for the linear mapping riscv: Move the linear mapping creation in its own function riscv: Get rid of riscv_pfn_base variable Link: https://lore.kernel.org/r/20230324155421.271544-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Use PUD/P4D/PGD pages for the linear mappingAlexandre Ghiti
During the early page table creation, we used to set the mapping for PAGE_OFFSET to the kernel load address: but the kernel load address is always offseted by PMD_SIZE which makes it impossible to use PUD/P4D/PGD pages as this physical address is not aligned on PUD/P4D/PGD size (whereas PAGE_OFFSET is). But actually we don't have to establish this mapping (ie set va_pa_offset) that early in the boot process because: - first, setup_vm installs a temporary kernel mapping and among other things, discovers the system memory, - then, setup_vm_final creates the final kernel mapping and takes advantage of the discovered system memory to create the linear mapping. During the first phase, we don't know the start of the system memory and then until the second phase is finished, we can't use the linear mapping at all and phys_to_virt/virt_to_phys translations must not be used because it would result in a different translation from the 'real' one once the final mapping is installed. So here we simply delay the initialization of va_pa_offset to after the system memory discovery. But to make sure noone uses the linear mapping before, we add some guard in the DEBUG_VIRTUAL config. Finally we can use PUD/P4D/PGD hugepages when possible, which will result in a better TLB utilization. Note that: - this does not apply to rv32 as the kernel mapping lies in the linear mapping. - we rely on the firmware to protect itself using PMP. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Rob Herring <robh@kernel.org> # DT bits Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-4-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Move the linear mapping creation in its own functionAlexandre Ghiti
No change intended, it just splits the linear mapping creation from setup_vm_final: this prepares for upcoming additions to the linear mapping creation. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-3-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Get rid of riscv_pfn_base variableAlexandre Ghiti
Use directly phys_ram_base instead, riscv_pfn_base is just the pfn of the address contained in phys_ram_base. Even if there is no functional change intended in this patch, actually setting phys_ram_base that early changes the behaviour of kernel_mapping_pa_to_va during the early boot: phys_ram_base used to be zero before this patch and now it is set to the physical start address of the kernel. But it does not break the conversion of a kernel physical address into a virtual address since kernel_mapping_pa_to_va should only be used on kernel physical addresses, i.e. addresses greater than the physical start address of the kernel. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18RISC-V: align ISA extension Kconfig help text with each otherConor Dooley
Other extensions only capitalise the first letter in the text visible in Kconfig menus, and provide a short comment about the extension's meaning. Do the same for Svnapot & Svpbmt. The precedent for capitalisation in the Kconfig text was set by Zicbom & sorta followed for Zicboz. The RVI styling used for multi-letter extensions only capitalises the first letter, so do the same here. If nothing else, my OCD likes it when the extensions follow a consistent pattern. While editing one of the lines, reformat the "spelling" of 64-bit. Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230405-pucker-cogwheel-3a999a94a2f2@wendy Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Kconfig: enable SCHED_MC kconfigSong Shuai
RISC-V now builds the sched domain based on the simple possible map. Enable SCHED_MC to make the building based on cpu_coregroup_mask() which also takes care of the NUMA and cores with LLC. Signed-off-by: Song Shuai <suagrfillet@gmail.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230310110336.970985-1-suagrfillet@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: export cpu/freq invariant to schedulerSong Shuai
RISC-V now manages CPU topology using arch_topology which provides CPU capacity and frequency related interfaces to access the cpu/freq invariant in possible heterogeneous or DVFS-enabled platforms. Here adds topology.h file to export the arch_topology interfaces for replacing the scheduler's constant-based cpu/freq invariant accounting. Signed-off-by: Song Shuai <suagrfillet@gmail.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Ley Foon Tan <lftan@kernel.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230323123924.3032174-1-suagrfillet@gmail.com [Palmer: Fix the whitespace issues.] Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18scsi: ipr: Remove SATA supportBrian King
Linux SATA support in ipr has always been limited to SATA DVDs. The last systems that had the option of including a SATA DVD was Power 8, which have been withdrawn for some time now, so this support can be removed. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20230412174015.114764-1-brking@linux.vnet.ibm.com Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-18scsi: scsi_debug: Abort commands from scsi_debug_device_reset()John Garry
Currently scsi_debug_device_reset() does not do much apart from setting the SDEBUG_UA_POR ("Power on, reset, or bus device reset") flag, which is eventually passed back to the SCSI midlayer later for a "unit attention" command. There is a report that blktest scsi/007 test fails due to commit 1107c7b24ee3 ("scsi: scsi_debug: Dynamically allocate sdebug_queued_cmd"). The problem there is that there are dangling scsi_debug queued commands when we attempt to remove the driver. scsi/007 test triggers SCSI EH and attempts to abort a timed-out command. Function scsi_debug_device_reset() is called as part of the EH, but does not deal with outstanding erroneous command. Prior to the named commit, removing the driver caused all dangling queued commands to be stopped - this should have not been necessary. Fix by aborting outstanding commands on a scsi_device basis from scsi_debug_device_reset(). Fixes: 1107c7b24ee3 ("scsi: scsi_debug: Dynamically allocate sdebug_queued_cmd") Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/oe-lkp/202304071111.e762fcbd-yujie.liu@intel.com Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20230416175654.159163-1-john.g.garry@oracle.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-04-18Merge patch series "RISC-V Hardware Probing User Interface"Palmer Dabbelt
Evan Green <evan@rivosinc.com> says: There's been a bunch of off-list discussions about this, including at Plumbers. The original plan was to do something involving providing an ISA string to userspace, but ISA strings just aren't sufficient for a stable ABI any more: in order to parse an ISA string users need the version of the specifications that the string is written to, the version of each extension (sometimes at a finer granularity than the RISC-V releases/versions encode), and the expected use case for the ISA string (ie, is it a U-mode or M-mode string). That's a lot of complexity to try and keep ABI compatible and it's probably going to continue to grow, as even if there's no more complexity in the specifications we'll have to deal with the various ISA string parsing oddities that end up all over userspace. Instead this patch set takes a very different approach and provides a set of key/value pairs that encode various bits about the system. The big advantage here is that we can clearly define what these mean so we can ensure ABI stability, but it also allows us to encode information that's unlikely to ever appear in an ISA string (see the misaligned access performance, for example). The resulting interface looks a lot like what arm64 and x86 do, and will hopefully fit well into something like ACPI in the future. The actual user interface is a syscall, with a vDSO function in front of it. The vDSO function can answer some queries without a syscall at all, and falls back to the syscall for cases it doesn't have answers to. Currently we prepopulate it with an array of answers for all keys and a CPU set of "all CPUs". This can be adjusted as necessary to provide fast answers to the most common queries. An example series in glibc exposing this syscall and using it in an ifunc selector for memcpy can be found at [1]. I was asked about the performance delta between this and something like sysfs. I created a small test program and ran it on a Nezha D1 Allwinner board. Doing each operation 100000 times and dividing, these operations take the following amount of time: - open()+read()+close() of /sys/kernel/cpu_byteorder: 3.8us - access("/sys/kernel/cpu_byteorder", R_OK): 1.3us - riscv_hwprobe() vDSO and syscall: .0094us - riscv_hwprobe() vDSO with no syscall: 0.0091us These numbers get farther apart if we query multiple keys, as sysfs will scale linearly with the number of keys, where the dedicated syscall stays the same. To frame these numbers, I also did a tight fork/exec/wait loop, which I measured as 4.8ms. So doing 4 open/read/close operations is a delta of about 0.3%, versus a single vDSO call is a delta of essentially zero. [1] https://patchwork.ozlabs.org/project/glibc/list/?series=343050 * b4-shazam-merge: RISC-V: Add hwprobe vDSO function and data selftests: Test the new RISC-V hwprobe interface RISC-V: hwprobe: Support probing of misaligned access performance RISC-V: hwprobe: Add support for RISCV_HWPROBE_BASE_BEHAVIOR_IMA RISC-V: Add a syscall for HW probing RISC-V: Move struct riscv_cpuinfo to new header Link: https://lore.kernel.org/r/20230407231103.2622178-1-evan@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18cifs: Reapply lost fix from commit 30b2b2196d6eDavid Howells
Reapply the fix from: 30b2b2196d6e ("cifs: do not include page data when checking signature") that got lost in the iteratorisation of the cifs driver. Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Reported-by: Paulo Alcantara <pc@manguebit.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Paulo Alcantara <pc@cjr.nz> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Bharath S M <bharathsm@microsoft.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-18cifs: Fix unbuffered readDavid Howells
If read() is done in an unbuffered manner, such that, say, cifs_strict_readv() goes through cifs_user_readv() and thence __cifs_readv(), it doesn't recognise the EOF and keeps indicating to userspace that it returning full buffers of data. This is due to ctx->iter being advanced in cifs_send_async_read() as the buffer is split up amongst a number of rdata objects. The iterator count is then used in collect_uncached_read_data() in the non-DIO case to set the total length read - and thus the return value of sys_read(). But since the iterator normally gets used up completely during splitting, ctx->total_len gets overridden to the full amount. However, prior to that in collect_uncached_read_data(), we've gone through the list of rdatas and added up the amount of data we actually received (which we then throw away). Fix this by removing the bit that overrides the amount read in the non-DIO case and just going with the total added up in the aforementioned loop. This was observed by mounting a cifs share with multiple channels, e.g.: mount //192.168.6.1/test /test/ -o user=shares,pass=...,max_channels=6 and then reading a 1MiB file on the share: strace cat /xfstest.test/1M >/dev/null Through strace, the same data can be seen being read again and again. Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> cc: Jérôme Glisse <jglisse@redhat.com> cc: Long Li <longli@microsoft.com> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-18null_blk: Always check queue mode setting from configfsChaitanya Kulkarni
Make sure to check device queue mode in the null_validate_conf() and return error for NULL_Q_RQ as we don't allow legacy I/O path, without this patch we get OOPs when queue mode is set to 1 from configfs, following are repro steps :- modprobe null_blk nr_devices=0 mkdir config/nullb/nullb0 echo 1 > config/nullb/nullb0/memory_backed echo 4096 > config/nullb/nullb0/blocksize echo 20480 > config/nullb/nullb0/size echo 1 > config/nullb/nullb0/queue_mode echo 1 > config/nullb/nullb0/power Entering kdb (current=0xffff88810acdd080, pid 2372) on processor 42 Oops: (null) due to oops @ 0xffffffffc041c329 CPU: 42 PID: 2372 Comm: sh Tainted: G O N 6.3.0-rc5lblk+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:null_add_dev.part.0+0xd9/0x720 [null_blk] Code: 01 00 00 85 d2 0f 85 a1 03 00 00 48 83 bb 08 01 00 00 00 0f 85 f7 03 00 00 80 bb 62 01 00 00 00 48 8b 75 20 0f 85 6d 02 00 00 <48> 89 6e 60 48 8b 75 20 bf 06 00 00 00 e8 f5 37 2c c1 48 8b 75 20 RSP: 0018:ffffc900052cbde0 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff88811084d800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888100042e00 RBP: ffff8881053d8200 R08: ffffc900052cbd68 R09: ffff888105db2000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000002 R13: ffff888104765200 R14: ffff88810eec1748 R15: ffff88810eec1740 FS: 00007fd445fd1740(0000) GS:ffff8897dfc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 0000000166a00000 CR4: 0000000000350ee0 DR0: ffffffff8437a488 DR1: ffffffff8437a489 DR2: ffffffff8437a48a DR3: ffffffff8437a48b DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> nullb_device_power_store+0xd1/0x120 [null_blk] configfs_write_iter+0xb4/0x120 vfs_write+0x2ba/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd4460c57a7 Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd3792a4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd4460c57a7 RDX: 0000000000000002 RSI: 000055b43c02e4c0 RDI: 0000000000000001 RBP: 000055b43c02e4c0 R08: 000000000000000a R09: 00007fd44615b4e0 R10: 00007fd44615b3e0 R11: 0000000000000246 R12: 0000000000000002 R13: 00007fd446198520 R14: 0000000000000002 R15: 00007fd446198700 </TASK> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20230416220339.43845-1-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18block: ublk: switch to ioctl command encodingMing Lei
All ublk commands(control, IO) should have taken ioctl command encoding from the beginning, because ioctl command encoding defines each code uniquely, so driver can figure out wrong command sent from userspace easily; 2) it might help security subsystem for audit uring cmd[1]. Unfortunately we didn't do that way, and it could be one lesson for ublk driver. So switch to ioctl command encoding now, we still support commands encoded in old way, but they become legacy definition. Any new command should take ioctl encoding. See ublksrv code for switching to ioctl command encoding in [2]. [1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/ [2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding Cc: Christoph Hellwig <hch@infradead.org> Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18remoteproc: st: Use of_property_present() for testing DT property presenceRob Herring
It is preferred to use typed property access functions (i.e. of_property_read_<type> functions) rather than low-level of_get_property/of_find_property functions for reading properties. As part of this, convert of_get_property/of_find_property calls to the recently added of_property_present() helper when we just want to test for presence of a property and nothing more. Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230310144736.1546972-1-robh@kernel.org
2023-04-18io_uring: add support for multishot timeoutsDavid Wei
A multishot timeout submission will repeatedly generate completions with the IORING_CQE_F_MORE cflag set. Depending on the value of the `off' field in the submission, these timeouts can either repeat indefinitely until cancelled (`off' = 0) or for a fixed number of times (`off' > 0). Only noseq timeouts (i.e. not dependent on the number of I/O completions) are supported. An indefinite timer will be cancelled if the CQ ever overflows. Signed-off-by: David Wei <davidhwei@meta.com> Link: https://lore.kernel.org/r/20230418225817.1905027-1-davidhwei@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: disassociate nodes and rsrc_dataPavel Begunkov
Make rsrc nodes independent from rsrd_data, for that we keep ctx and rsrc type in nodes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f259abe9cd4eea6a3b4ed83508635218acd3c3f.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: devirtualise rsrc put callbacksPavel Begunkov
We only have two rsrc types, buffers and files, replace virtual callbacks for putting resources down with a switch..case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/02ca727bf8e5f7f820c2f404e95ae88c8f472930.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: pass node to io_rsrc_put_work()Pavel Begunkov
Instead of passing rsrc_data and a resource to io_rsrc_put_work() just forward node, that's all the function needs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/791e8edd28d78797240b74d34e99facbaad62f3b.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: inline io_rsrc_put_work()Pavel Begunkov
io_rsrc_put_work() is simple enough to be open coded into its only caller. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b36dd46766ced39a9b160767babfa2fce07b8f8.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-18io_uring/rsrc: add empty flag in rsrc_nodePavel Begunkov
Unless a node was flushed by io_rsrc_ref_quiesce(), it'll carry a resource. Replace ->inline_items with an empty flag, which is initialised to false and only raised in io_rsrc_ref_quiesce(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/75d384c9d2252e12af73b9cf8a44e1699106aeb1.1681822823.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>