summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-10-29bnxt_en: extract coredump command line from current taskEdwin Peer
Tools other than 'ethtool -w' may be used to produce a coredump. For devlink health, such dumps could even be driver initiated in response to a health event. In these cases, the kernel thread information will be placed in the coredump record instead. v2: use min_t() instead of min() to fix the mismatched type warning Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: Retrieve coredump and crashdump size via FW commandVasundhara Volam
Recent firmware provides coredump and crashdump size info via DBG_QCFG command. Read the dump sizes from firmware, instead of computing in the driver. This patch reduces the time taken to collect the dump via ethtool. Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: Add compression flags information in coredump segment headerVasundhara Volam
Firmware sets compression flags for each segment, add this information while filling segment header. Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: move coredump functions into dedicated fileEdwin Peer
Change bnxt_get_coredump() and bnxt_get_coredump_length() to non-static functions. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: Refactor coredump functionsEdwin Peer
The coredump functionality will be used by devlink health. Refactor these functions that get coredump and coredump length. There is no functional change, but the following checkpatch warnings were addressed: - strscpy is preferred over strlcpy. - sscanf results should be checked, with an additional warning. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: improve fw diagnose devlink health messagesEdwin Peer
Add firmware event counters as well as health state severity. In the unhealthy state, recommend a remedy and inform the user as to its impact. Readability of the devlink tool's output is negatively impacted by adding these fields to the diagnosis. The single line of text, as rendered by devlink health diagnose, benefits from more terse descriptions, which can be substituted without loss of clarity, even in pretty printed JSON mode. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: consolidate fw devlink health reportersEdwin Peer
Merge 'fw' and 'fw_fatal' health reporters. There is no longer a need to distinguish between firmware reporters. Only bonafide errors are reported now and no reports were being generated for the 'fw' reporter. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: remove fw_reset devlink health reporterEdwin Peer
Firmware resets initiated by the user are not errors and should not be reported via devlink. Once only unsolicited resets remain, it is no longer sensible to maintain a separate fw_reset reporter. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: improve error recovery information messagesEdwin Peer
The recovery election messages are often mistaken for errors. Improve the wording to clarify the meaning of these frequent and expected events. Also, take the first step towards more inclusive language. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: add enable_remote_dev_reset devlink parameterEdwin Peer
The reported parameter value should not take into account the state of remote drivers. Firmware will reject remote resets as appropriate, thus it is not strictly necessary to check HOT_RESET_ALLOWED before attempting to initiate a reset. But we add the check so that we can provide more intuitive messages when reset is not permitted. This firmware setting needs to be restored from all functions after a firmware reset. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: implement devlink dev reload fw_activateEdwin Peer
Similar to reload driver_reinit, the RTNL lock is held across reload down and up to prevent interleaving state changes. But we need to subsequently release the RTNL lock while waiting for firmware reset to complete. Also keep a statistic on fw_activate resets initiated remotely from other functions. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: implement devlink dev reload driver_reinitEdwin Peer
The RTNL lock must be held between down and up to prevent interleaving state changes, especially since external state changes might release and allocate different driver resource subsets that would otherwise need to be tracked and carefully handled. If the down function fails, then devlink will not call the corresponding up function, thus the lock is released in the down error paths. v2: Don't use devlink_reload_disable() and devlink_reload_enable(). Instead, check that the netdev is not in unregistered state before proceeding with reload. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-Off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: refactor cancellation of resource reservationsEdwin Peer
Resource reservations will also need to be reset after FUNC_DRV_UNRGTR in the following devlink driver_reinit patch. Extract this logic into a reusable function. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29bnxt_en: refactor printing of device infoEdwin Peer
The device info logged during probe will be reused by the devlink driver_reinit code in a following patch. Extract this logic into the new bnxt_print_device_info() function. The board index needs to be saved in the driver context so that the board information can be retrieved at a later time, outside of the probe function. Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zlib"David Sterba
This reverts commit 696ab562e6df9fbafd6052d8ce4aafcb2ed16069. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zstd"David Sterba
This reverts commit bbaf9715f3f5b5ff0de71da91fcc34ee9c198ed8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Example stacktrace with ZSTD on a 32bit ARM machine: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c4159ed3 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12 Hardware name: Allwinner sun4i/sun5i Families Workqueue: btrfs-delalloc btrfs_work_helper PC is at mmiocpy+0x48/0x330 LR is at ZSTD_compressStream_generic+0x15c/0x28c (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c) (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0) (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488) (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c) (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834) (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28) (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230) (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418) (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524) (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0) (kthread) from [<c0100150>] Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from check_item_in_log()Filipe Manana
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from add_link()Filipe Manana
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from btrfs_unlink_inode()Filipe Manana
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from drop_one_dir_item()Filipe Manana
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: clear MISSING device status bit in btrfs_close_one_deviceLi Zhang
Reported bug: https://github.com/kdave/btrfs-progs/issues/389 There's a problem with scrub reporting aborted status but returning error code 0, on a filesystem with missing and readded device. Roughly these steps: - mkfs -d raid1 dev1 dev2 - fill with data - unmount - make dev1 disappear - mount -o degraded - copy more data - make dev1 appear again Running scrub afterwards reports that the command was aborted, but the system log message says the exit code was 0. It seems that the cause of the error is decrementing fs_devices->missing_devices but not clearing device->dev_state. Every time we umount filesystem, it would call close_ctree, And it would eventually involve btrfs_close_one_device to close the device, but it only decrements fs_devices->missing_devices but does not clear the device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer Overflow, because every time umount, fs_devices->missing_devices will decrease. If fs_devices->missing_devices value hit 0, it would overflow. With added debugging: loop1: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311) loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615 If fs_devices->missing_devices is 0, next time it would be 18446744073709551615 After apply this patch, the fs_devices->missing_devices seems to be right: $ truncate -s 10g test1 $ truncate -s 10g test2 $ losetup /dev/loop1 test1 $ losetup /dev/loop2 test2 $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f $ losetup -d /dev/loop2 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ dmesg loop1: detected capacity change from 0 to 20971520 loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863) BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): checking UUID tree BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: call btrfs_check_rw_degradable only if there is a missing deviceAnand Jain
In open_ctree() in btrfs_check_rw_degradable() [1], we check each block group individually if at least the minimum number of devices is available for that profile. If all the devices are available, then we don't have to check degradable. [1] open_ctree() :: 3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) { Also before calling btrfs_check_rw_degradable() in open_ctee() at the line number shown below [2] we call btrfs_read_chunk_tree() and down to add_missing_dev() to record number of missing devices. [2] open_ctree() :: 3454 ret = btrfs_read_chunk_tree(fs_info); btrfs_read_chunk_tree() read_one_chunk() / read_one_dev() add_missing_dev() So, check if there is any missing device before btrfs_check_rw_degradable() in open_ctree(). Also, with this the mount command could save ~16ms.[3] in the most common case, that is no device is missing. [3] 1) * 16934.96 us | btrfs_check_rw_degradable [btrfs](); CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: send: prepare for v2 protocolDavid Sterba
This is preparatory work for send protocol update to version 2 and higher. We have many pending protocol update requests but still don't have the basic protocol rev in place, the first thing that must happen is to do the actual versioning support. The protocol version is u32 and is a new member in the send ioctl struct. Validity of the version field is backed by a new flag bit. Old kernels would fail when a higher version is requested. Version protocol 0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION, that's also exported in sysfs. The version is still unchanged and will be increased once we have new incompatible commands or stream updates. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Merge tag 'irqchip-5.16' into irq/coreBorislav Petkov
Merge irqchip updates for Linux 5.16 from Marc Zyngier: - A large cross-arch rework to move irq_enter()/irq_exit() into the arch code, and removing it from the generic irq code. Thanks to Mark Rutland for the huge effort! - A few irqchip drivers are made modular (broadcom, meson), because that's apparently a thing... - A new driver for the Microchip External Interrupt Controller - The irq_cpu_offline()/irq_cpu_online() API is now deprecated and can only be selected on the Cavium Octeon platform. Once this platform is removed, the API will be removed at the same time. - A sprinkle of devm_* helper, as people seem to love that. - The usual spattering of small fixes and minor improvements. * tag 'irqchip-5.16': (912 commits) h8300: Fix linux/irqchip.h include mess dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings MIPS: irq: Avoid an unused-variable error genirq: Hide irq_cpu_{on,off}line() behind a deprecated option irqchip/mips-gic: Get rid of the reliance on irq_cpu_online() MIPS: loongson64: Drop call to irq_cpu_offline() irq: remove handle_domain_{irq,nmi}() irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY irq: riscv: perform irqentry in entry code irq: openrisc: perform irqentry in entry code irq: csky: perform irqentry in entry code irq: arm64: perform irqentry in entry code irq: arm: perform irqentry in entry code irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ irq: add generic_handle_arch_irq() irq: unexport handle_irq_desc() irq: simplify handle_domain_{irq,nmi}() irq: mips: simplify do_domain_IRQ() ... Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211029083332.3680101-1-maz@kernel.org
2021-10-29x86/apic: Reduce cache line misses in __x2apic_send_IPI_mask()Eric Dumazet
Using per-cpu storage for @x86_cpu_to_logical_apicid is not optimal. Broadcast IPI will need at least one cache line per cpu to access this field. __x2apic_send_IPI_mask() is using standard bitmask operators. By converting x86_cpu_to_logical_apicid to an array, we divide by 16x number of needed cache lines, because we find 16 values per cache line. CPU prefetcher can kick nicely. Also move @cluster_masks to READ_MOSTLY section to avoid false sharing. Tested on a dual socket host with 256 cpus, cost for a full broadcast is now 11 usec instead of 33 usec. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211007143556.574911-1-eric.dumazet@gmail.com
2021-10-28hwmon: (nct7802) Add of_node_put() before returnWan Jiabing
Fix following coccicheck warning: ./drivers/hwmon/nct7802.c:1152:2-24: WARNING: Function for_each_child_of_node should have of_node_put() before return. Early exits from for_each_child_of_node should decrement the node reference counter. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Link: https://lore.kernel.org/r/20211029024918.5161-1-wanjiabing@vivo.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-10-28Merge branch 'code-movement-to-br_switchdev-c'Jakub Kicinski
Vladimir Oltean says: ==================== Code movement to br_switchdev.c This is one more refactoring patch set for the Linux bridge, where more logic that is specific to switchdev is moved into br_switchdev.c, which is compiled out when CONFIG_NET_SWITCHDEV is disabled. ==================== Link: https://lore.kernel.org/r/20211027162119.2496321-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: switchdev: consistent function namingVladimir Oltean
Rename all recently imported functions in br_switchdev.c to start with a br_switchdev_* prefix. br_fdb_replay_one() -> br_switchdev_fdb_replay_one() br_fdb_replay() -> br_switchdev_fdb_replay() br_vlan_replay_one() -> br_switchdev_vlan_replay_one() br_vlan_replay() -> br_switchdev_vlan_replay() struct br_mdb_complete_info -> struct br_switchdev_mdb_complete_info br_mdb_complete() -> br_switchdev_mdb_complete() br_mdb_switchdev_host_port() -> br_switchdev_host_mdb_one() br_mdb_switchdev_host() -> br_switchdev_host_mdb() br_mdb_replay_one() -> br_switchdev_mdb_replay_one() br_mdb_replay() -> br_switchdev_mdb_replay() br_mdb_queue_one() -> br_switchdev_mdb_queue_one() Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: mdb: move all switchdev logic to br_switchdev.cVladimir Oltean
The following functions: br_mdb_complete br_switchdev_mdb_populate br_mdb_replay_one br_mdb_queue_one br_mdb_replay br_mdb_switchdev_host_port br_mdb_switchdev_host br_switchdev_mdb_notify are only accessible from code paths where CONFIG_NET_SWITCHDEV is enabled. So move them to br_switchdev.c, in order for that code to be compiled out if that config option is disabled. Note that br_switchdev.c gets build regardless of whether CONFIG_BRIDGE_IGMP_SNOOPING is enabled or not, whereas br_mdb.c only got built when CONFIG_BRIDGE_IGMP_SNOOPING was enabled. So to preserve correct compilation with CONFIG_BRIDGE_IGMP_SNOOPING being disabled, we must now place an #ifdef around these functions in br_switchdev.c. The offending bridge data structures that need this are br->multicast_lock and br->mdb_list, these are also compiled out of struct net_bridge when CONFIG_BRIDGE_IGMP_SNOOPING is turned off. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: split out the switchdev portion of br_mdb_notifyVladimir Oltean
Similar to fdb_notify() and br_switchdev_fdb_notify(), split the switchdev specific logic from br_mdb_notify() into a different function. This will be moved later in br_switchdev.c. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: move br_vlan_replay to br_switchdev.cVladimir Oltean
br_vlan_replay() is relevant only if CONFIG_NET_SWITCHDEV is enabled, so move it to br_switchdev.c. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28net: bridge: provide shim definition for br_vlan_flagsVladimir Oltean
br_vlan_replay() needs this, and we're preparing to move it to br_switchdev.c, which will be compiled regardless of whether or not CONFIG_BRIDGE_VLAN_FILTERING is enabled. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28Merge branch 'mlxsw-offload-root-tbf-as-port-shaper'Jakub Kicinski
Ido Schimmel says: ==================== mlxsw: Offload root TBF as port shaper Petr says: Egress configuration in an mlxsw deployment would generally have an ETS qdisc at root, with a number of bands and a priority dispatch between them. Some of those bands could then have a RED and/or TBF qdiscs attached. When TBF is used like this, mlxsw configures shaper on a subgroup, which is the pair of traffic classes (UC + BUM) corresponding to the band where TBF is installed. This way it is possible to limit traffic on several bands (subgroups) independently by configuring several TBF qdiscs, each on a different band. It is however not possible to limit traffic flowing through the port as such. The ASIC supports this through port shapers (as opposed to the abovementioned subgroup shapers). An obvious way to express this as a user would be to configure a root TBF qdisc, and then add the whole ETS hierarchy as its child. TBF (and RED) can currently be used as a root qdisc. This usage has always been accepted as a special case, when only one subgroup is configured, and that is the subgroup that root TBF and RED configure. However it was never possible to install ETS under that TBF. In this patchset, this limitation is relaxed. TBF qdisc in root position is now always offloaded as a port shaper. Such TBF qdisc does not limit offload of further children. It is thus possible to configure the usual priority classification through ETS, with RED and/or TBF on individual bands, all that below a port-level TBF. For example: (1) # tc qdisc replace dev swp1 root handle 1: tbf rate 800mbit burst 16kb limit 1M (2) # tc qdisc replace dev swp1 parent 1:1 handle 11: ets strict 8 priomap 7 6 5 4 3 2 1 0 (3) # tc qdisc replace dev swp1 parent 11:1 handle 111: tbf rate 600mbit burst 16kb limit 1M (4) # tc qdisc replace dev swp1 parent 11:2 handle 112: tbf rate 600mbit burst 16kb limit 1M Here, (1) configures a 800-Mbps port shaper, (2) adds an ETS element with 8 strictly-prioritized bands, and (3) and (4) configure two more shapers, each 600 Mbps, one under 11:1 (band 0, TCs 7 and 15), one under 11:2 (band 1, TCs 6 and 14). This way, traffic on bands 0 and 1 are each independently capped at 600 Mbps, and at the same time, traffic through the port as a whole is capped at 800 Mbps. In patch #1, TBF is permitted as root qdisc, under which the usual qdisc tree can be installed. In patch #2, the qdisc offloadability selftest is extended to cover the root TBF as well. Patch #3 then tests that the offloaded TBF shapes as expected. ==================== Link: https://lore.kernel.org/r/20211027152001.1320496-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28selftests: mlxsw: Test port shaperPetr Machata
TBF can be used as a root qdisc, in which case it is supposed to configure port shaper. Add a test that verifies that this is so by installing a root TBF with a ETS or PRIO below it, and then expecting individual bands to all be shaped according to the root TBF configuration. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28selftests: mlxsw: Test offloadability of root TBFPetr Machata
TBF can be used as a root qdisc, with the usual ETS/RED/TBF hierarchy below it. This use should now be offloaded. Add a test that verifies that it is. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28mlxsw: spectrum_qdisc: Offload root TBF as port shaperPetr Machata
The Spectrum ASIC allows configuration of maximum shaper on all levels of the scheduling hierarchy: TCs, subgroups, groups and also ports. Currently, TBF always configures a subgroup. But a user could reasonably express the intent to configure port shaper by putting TBF to a root position, around ETS / PRIO. Accept this usage and offload appropriately. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28tracing/histogram: Fix documentation inline emphasis warningKalesh Singh
This fixes the warning: Documentation/trace/histogram.rst:1766: WARNING: Inline emphasis start-string without end-string The issue was caused by an unescaped '*' character. Link: https://lore.kernel.org/all/20211028170548.2597449-1-kaleshsingh@google.com/T/#m77da47432f5cc6521d4294ffdb9621949cc35d04 Link: https://lkml.kernel.org/r/20211028170548.2597449-1-kaleshsingh@google.com Fixes: 2d2f6d4b8ce7 ("tracing/histogram: Document expression arithmetic and constants") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-28tools/testing/selftests/vm/split_huge_page_test.c: fix application of sizeof ↵David Yang
to pointer The coccinelle check report: ./tools/testing/selftests/vm/split_huge_page_test.c:344:36-42: ERROR: application of sizeof to pointer Use "strlen" to fix it. Link: https://lkml.kernel.org/r/20211012030116.184027-1-davidcomponentone@gmail.com Signed-off-by: David Yang <davidcomponentone@gmail.com> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm/damon/core-test: fix wrong expectations for 'damon_split_regions_of()'SeongJae Park
Kunit test cases for 'damon_split_regions_of()' expects the number of regions after calling the function will be same to their request ('nr_sub'). However, the requested number is just an upper-limit, because the function randomly decides the size of each sub-region. This fixes the wrong expectation. Link: https://lkml.kernel.org/r/20211028090628.14948-1-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm: khugepaged: skip huge page collapse for special filesYang Shi
The read-only THP for filesystems will collapse THP for files opened readonly and mapped with VM_EXEC. The intended usecase is to avoid TLB misses for large text segments. But it doesn't restrict the file types so a THP could be collapsed for a non-regular file, for example, block device, if it is opened readonly and mapped with EXEC permission. This may cause bugs, like [1] and [2]. This is definitely not the intended usecase, so just collapse THP for regular files in order to close the attack surface. [shy828301@gmail.com: fix vm_file check [3]] Link: https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/000000000000c6a82505ce284e4c@google.com/ [2] Link: https://lkml.kernel.org/r/CAHbLzkqTW9U3VvTu1Ki5v_cLRC9gHW+znBukg_ycergE0JWj-A@mail.gmail.com [3] Link: https://lkml.kernel.org/r/20211027195221.3825-1-shy828301@gmail.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Reported-by: syzbot+aae069be1de40fb11825@syzkaller.appspotmail.com Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Andrea Righi <andrea.righi@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm, thp: bail out early in collapse_file for writeback pageRongwei Wang
Currently collapse_file does not explicitly check PG_writeback, instead, page_has_private and try_to_release_page are used to filter writeback pages. This does not work for xfs with blocksize equal to or larger than pagesize, because in such case xfs has no page->private. This makes collapse_file bail out early for writeback page. Otherwise, xfs end_page_writeback will panic as follows. page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32 aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so" flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback) raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8 raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000 page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u)) page->mem_cgroup:ffff0000c3e9a000 ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1212! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: BUG: Bad page state in process khugepaged pfn:84ef32 xfs(E) page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32 libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ... CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ... pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) Call trace: end_page_writeback+0x1c0/0x214 iomap_finish_page_writeback+0x13c/0x204 iomap_finish_ioend+0xe8/0x19c iomap_writepage_end_bio+0x38/0x50 bio_endio+0x168/0x1ec blk_update_request+0x278/0x3f0 blk_mq_end_request+0x34/0x15c virtblk_request_done+0x38/0x74 [virtio_blk] blk_done_softirq+0xc4/0x110 __do_softirq+0x128/0x38c __irq_exit_rcu+0x118/0x150 irq_exit+0x1c/0x30 __handle_domain_irq+0x8c/0xf0 gic_handle_irq+0x84/0x108 el1_irq+0xcc/0x180 arch_cpu_idle+0x18/0x40 default_idle_call+0x4c/0x1a0 cpuidle_idle_call+0x168/0x1e0 do_idle+0xb4/0x104 cpu_startup_entry+0x30/0x9c secondary_start_kernel+0x104/0x180 Code: d4210000 b0006161 910c8021 94013f4d (d4210000) ---[ end trace 4a88c6a074082f8c ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt Link: https://lkml.kernel.org/r/20211022023052.33114-1-rongwei.wang@linux.alibaba.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Song Liu <song@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm/vmalloc: fix numa spreading for large hash tablesChen Wandun
Eric Dumazet reported a strange numa spreading info in [1], and found commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced this issue [2]. Dig into the difference before and after this patch, page allocation has some difference: before: alloc_large_system_hash __vmalloc __vmalloc_node(..., NUMA_NO_NODE, ...) __vmalloc_node_range __vmalloc_area_node alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */ alloc_pages_current alloc_page_interleave /* can be proved by print policy mode */ after: alloc_large_system_hash __vmalloc __vmalloc_node(..., NUMA_NO_NODE, ...) __vmalloc_node_range __vmalloc_area_node alloc_pages_node /* choose nid by nuam_mem_id() */ __alloc_pages_node(nid, ....) So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"), it will allocate memory in current node instead of interleaving allocate memory. Link: https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/ [2] Link: https://lkml.kernel.org/r/20211021080744.874701-2-chenwandun@huawei.com Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reported-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm/secretmem: avoid letting secretmem_users drop to zeroKees Cook
Quoting Dmitry: "refcount_inc() needs to be done before fd_install(). After fd_install() finishes, the fd can be used by userspace and we can have secret data in memory before the refcount_inc(). A straightforward misuse where a user will predict the returned fd in another thread before the syscall returns and will use it to store secret data is somewhat dubious because such a user just shoots themself in the foot. But a more interesting misuse would be to close the predicted fd and decrement the refcount before the corresponding refcount_inc, this way one can briefly drop the refcount to zero while there are other users of secretmem." Move fd_install() after refcount_inc(). Link: https://lkml.kernel.org/r/20211021154046.880251-1-keescook@chromium.org Link: https://lore.kernel.org/lkml/CACT4Y+b1sW6-Hkn8HQYw_SsT7X3tp-CJNh2ci0wG3ZnQz9jjig@mail.gmail.com Fixes: 9a436f8ff631 ("PM: hibernate: disable when there are active secretmem users") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jordy Zomer <jordy@pwning.systems> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28ocfs2: fix race between searching chunks and release journal_head from ↵Gautham Ananthakrishna
buffer_head Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore. PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3" Call trace: panic oops_end no_context __bad_area_nosemaphore bad_area_nosemaphore __do_page_fault do_page_fault page_fault [exception RIP: ocfs2_block_group_find_clear_bits+316] ocfs2_block_group_find_clear_bits [ocfs2] ocfs2_cluster_group_search [ocfs2] ocfs2_search_chain [ocfs2] ocfs2_claim_suballoc_bits [ocfs2] __ocfs2_claim_clusters [ocfs2] ocfs2_claim_clusters [ocfs2] ocfs2_local_alloc_slide_window [ocfs2] ocfs2_reserve_local_alloc_bits [ocfs2] ocfs2_reserve_clusters_with_limit [ocfs2] ocfs2_reserve_clusters [ocfs2] ocfs2_lock_refcount_allocators [ocfs2] ocfs2_make_clusters_writable [ocfs2] ocfs2_replace_cow [ocfs2] ocfs2_refcount_cow [ocfs2] ocfs2_file_write_iter [ocfs2] lo_rw_aio loop_queue_work kthread_worker_fn kthread ret_from_fork When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and released the jounal head from the buffer head. Needed to take bit lock for the bit 'BH_JournalHead' to fix this race. Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <rajesh.sivaramasubramaniom@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm/oom_kill.c: prevent a race between process_mrelease and exit_mmapSuren Baghdasaryan
Race between process_mrelease and exit_mmap, where free_pgtables is called while __oom_reap_task_mm is in progress, leads to kernel crash during pte_offset_map_lock call. oom-reaper avoids this race by setting MMF_OOM_VICTIM flag and causing exit_mmap to take and release mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock. Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to fix this race, however that would be considered a hack. Fix this race by elevating mm->mm_users and preventing exit_mmap from executing until process_mrelease is finished. Patch slightly refactors the code to adapt for a possible mmget_not_zero failure. This fix has considerable negative impact on process_mrelease performance and will likely need later optimization. Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm: filemap: check if THP has hwpoisoned subpage for PMD page faultYang Shi
When handling shmem page fault the THP with corrupted subpage could be PMD mapped if certain conditions are satisfied. But kernel is supposed to send SIGBUS when trying to map hwpoisoned page. There are two paths which may do PMD map: fault around and regular fault. Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") the thing was even worse in fault around path. The THP could be PMD mapped as long as the VMA fits regardless what subpage is accessed and corrupted. After this commit as long as head page is not corrupted the THP could be PMD mapped. In the regular fault path the THP could be PMD mapped as long as the corrupted page is not accessed and the VMA fits. This loophole could be fixed by iterating every subpage to check if any of them is hwpoisoned or not, but it is somewhat costly in page fault path. So introduce a new page flag called HasHWPoisoned on the first tail page. It indicates the THP has hwpoisoned subpage(s). It is set if any subpage of THP is found hwpoisoned by memory failure and after the refcount is bumped successfully, then cleared when the THP is freed or split. The soft offline path doesn't need this since soft offline handler just marks a subpage hwpoisoned when the subpage is migrated successfully. But shmem THP didn't get split then migrated at all. Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28mm: hwpoison: remove the unnecessary THP checkYang Shi
When handling THP hwpoison checked if the THP is in allocation or free stage since hwpoison may mistreat it as hugetlb page. After commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling") the problem has been fixed, so this check is no longer needed. Remove it. The side effect of the removal is hwpoison may report unsplit THP instead of unknown error for shmem THP. It seems not like a big deal. The following patch "mm: filemap: check if THP has hwpoisoned subpage for PMD page fault" depends on this, which fixes shmem THP with hwpoisoned subpage(s) are mapped PMD wrongly. So this patch needs to be backported to -stable as well. Link: https://lkml.kernel.org/r/20211020210755.23964-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNTShakeel Butt
Commit 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()") switched to bulk page allocator for order 0 allocation backing vmalloc. However bulk page allocator does not support __GFP_ACCOUNT allocations and there are several users of kvmalloc(__GFP_ACCOUNT). For now make __GFP_ACCOUNT allocations bypass bulk page allocator. In future if there is workload that can be significantly improved with the bulk page allocator with __GFP_ACCCOUNT support, we can revisit the decision. Link: https://lkml.kernel.org/r/20211014151607.2171970-1-shakeelb@google.com Fixes: 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Vasily Averin <vvs@virtuozzo.com> Tested-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28Merge tag 'libnvdimm-fixes-5.15-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fix from Dan Williams: - Fix a regression introduced in v5.15-rc6 that caused nvdimm namespace shutdown to hang due to reworks in the block layer q_usage_count. * tag 'libnvdimm-fixes-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: nvdimm/pmem: stop using q_usage_count as external pgmap refcount
2021-10-28Merge branch 'Typeless/weak ksym for gen_loader + misc fixups'Alexei Starovoitov
Kumar Kartikeya says: ==================== Patches (1,2,3,6) add typeless and weak ksym support to gen_loader. It is follow up for the recent kfunc from modules series. The later patches (7,8) are misc fixes for selftests, and patch 4 for libbpf where we try to be careful to not end up with fds == 0, as libbpf assumes in various places that they are greater than 0. Patch 5 fixes up missing O_CLOEXEC in libbpf. Changelog: ---------- v4 -> v5 v4: https://lore.kernel.org/bpf/20211020191526.2306852-1-memxor@gmail.com * Address feedback from Andrii * Drop use of ensure_good_fd in unneeded call sites * Add sys_bpf_fd * Add _lskel suffix to all light skeletons and change all current selftests * Drop early break in close loop for sk_lookup * Fix other nits v3 -> v4 v3: https://lore.kernel.org/bpf/20211014205644.1837280-1-memxor@gmail.com * Remove gpl_only = true from bpf_kallsyms_lookup_name (Alexei) * Add bpf_dump_raw_ok check to ensure kptr_restrict isn't bypassed (Alexei) v2 -> v3 v2: https://lore.kernel.org/bpf/20211013073348.1611155-1-memxor@gmail.com * Address feedback from Song * Move ksym logging to separate helper to avoid code duplication * Move src_reg mask stuff to separate helper * Fix various other nits, add acks * __builtin_expect is used instead of likely to as skel_internal.h is included in isolation. v1 -> v2 v1: https://lore.kernel.org/bpf/20211006002853.308945-1-memxor@gmail.com * Remove redundant OOM checks in emit_bpf_kallsyms_lookup_name * Use designated initializer for sk_lookup fd array (Jakub) * Do fd check for all fd returning low level APIs (Andrii, Alexei) * Make Fixes: tag quote commit message, use selftests/bpf prefix (Song, Andrii) * Split typeless and weak ksym support into separate patches, expand commit message (Song) * Fix duplication in selftests stemming from use of LSKELS_EXTRA (Song) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>