Age        Commit message        Author
2022-05-16  btrfs: scrub: rename members related to scrub_block::pagev (Qu Wenruo)
The following will be renamed in this patch:
- scrub_block::pagev -> sectors
- scrub_block::page_count -> sector_count
- SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK
- page_num -> sector_num to iterate scrub_block::sectors
For now scrub_page is not yet renamed to keep the patch reasonable and it will be updated in a followup. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: remove trivial wrapper btrfs_read_buffer() (Filipe Manana)
The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: update outdated comment for read_block_for_search() (Filipe Manana)
The comment at the top of read_block_for_search() is very outdated, as it refers to the blocking versus spinning path locking modes. We no longer have these two locking modes after we switched the btree locks from custom code to rw semaphores. So update the comment to stop referring to the blocking mode and put it more up to date. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: release upper nodes when reading stale btree node from disk (Filipe Manana)
When reading a btree node (or leaf), at read_block_for_search(), if we can't find its extent buffer in the cache (the fs_info->buffer_radix radix tree), then we unlock all upper level nodes before reading the btree node/leaf from disk, to prevent blocking other tasks for too long. However if we find that the extent buffer is in the cache but it is not up to date, we don't unlock upper level nodes before reading it from disk, potentially blocking other tasks on upper level nodes for too long. Fix this inconsistent behaviour by unlocking upper level nodes if we need to read a node/leaf from disk because its in-memory extent buffer is not up to date. If we unlocked upper level nodes then we must return -EAGAIN to the caller, just like the case where the extent buffer is not cached in memory. And like that case, we determine if upper level nodes are locked by checking only if the parent node is locked - if it isn't, then no other upper level nodes are locked. This is actually a rare case, as if we have an extent buffer in memory, it typically has the uptodate flag set and passes all the checks done by btrfs_buffer_uptodate(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: avoid unnecessary btree search restarts when reading node (Filipe Manana)
When reading a btree node, at read_block_for_search(), if we don't find the node's (or leaf's) extent buffer in the cache, we will read it from disk. Since that requires waiting on IO, we release all upper level nodes from our path before reading the target node/leaf, and then return -EAGAIN to the caller, which will make the caller restart the whole btree search. However we are causing the restart of the btree search even for cases where it is not necessary:
1) We have a path with ->skip_locking set to true, typically when doing a search on a commit root, so we are never holding locks on any node;
2) We are doing a read search (the "ins_len" argument passed to btrfs_search_slot() is 0), or we are doing a search to modify an existing key (the "cow" argument passed to btrfs_search_slot() has a value of 1 and "ins_len" is 0), in which case we never hold locks for upper level nodes;
3) We are doing a search to insert or delete a key, in which case we may or may not have upper level nodes locked. That depends on the current minimum write lock levels at btrfs_search_slot(), if we had to split or merge parent nodes, if we had to COW upper level nodes and if we ever visited slot 0 of an upper level node. It's still common to not have upper level nodes locked, but our current node must be at least at level 1, for insertions, or at least at level 2 for deletions. In these cases when we have locks on upper level nodes, they are always write locks.
These cases where we are not holding locks on upper level nodes far outweigh the cases where we are holding locks, so it's completely wasteful to retry the whole search when we have no upper nodes locked. So change the logic to not return -EAGAIN, and make the caller retry the search, when we don't have the parent node locked - when it's not locked it means no other upper level nodes are locked as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: set inode flags earlier in btrfs_new_inode() (Omar Sandoval)
btrfs_new_inode() inherits the inode flags from the parent directory and the mount options _after_ we fill the inode item. This works because all of the callers of btrfs_new_inode() make further changes to the inode and then call btrfs_update_inode(). It'd be better to fully initialize the inode once to avoid the extra update, so as a first step, set the inode flags _before_ filling the inode item. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: move btrfs_get_free_objectid() call into btrfs_new_inode() (Omar Sandoval)
Every call of btrfs_new_inode() is immediately preceded by a call to btrfs_get_free_objectid(). Since getting an inode number is part of creating a new inode, this is better off being moved into btrfs_new_inode(). While we're here, get rid of the comment about reclaiming inode numbers, since we only did that when using the ino cache, which was removed by commit 5297199a8bca ("btrfs: remove inode number cache feature"). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: don't pass parent objectid to btrfs_new_inode() explicitly (Omar Sandoval)
For everything other than a subvolume root inode, we get the parent objectid from the parent directory. For the subvolume root inode, the parent objectid is the same as the inode's objectid. We can find this within btrfs_new_inode() instead of passing it. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: remove redundant name and name_len parameters to create_subvol (Omar Sandoval)
The passed dentry already contains the name. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: remove unused mnt_userns parameter from __btrfs_set_acl (Omar Sandoval)
Commit 4a8b34afa9c9 ("btrfs: handle ACLs on idmapped mounts") added this parameter but didn't use it. __btrfs_set_acl() is the low-level helper that writes an ACL to disk. The higher-level btrfs_set_acl() is the one that translates the ACL based on the user namespace. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: remove unnecessary set_nlink() in btrfs_create_subvol_root() (Omar Sandoval)
btrfs_new_inode() already returns an inode with nlink set to 1 (via inode_init_always()). Get rid of the unnecessary set. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
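(Illustrative sketch only, not the btrfs patch itself: inode_init_always() already leaves i_nlink at 1 for a freshly allocated inode, so an explicit set_nlink(inode, 1) on the result of btrfs_new_inode() is a no-op. The function name below is hypothetical.)
#include <linux/fs.h>

/* Hedged sketch: the link count of a fresh inode already defaults to 1. */
static void example_after_new_inode(struct inode *inode)
{
	/* set_nlink(inode, 1);  -- redundant, inode_init_always() did this */
	WARN_ON_ONCE(inode->i_nlink != 1);
}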
2022-05-16  btrfs: remove unnecessary inode_set_bytes(0) call (Omar Sandoval)
new_inode() always returns an inode with i_blocks and i_bytes set to 0 (via inode_init_always()). Remove the unnecessary call to inode_set_bytes() in btrfs_new_inode(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: remove unnecessary btrfs_i_size_write(0) calls (Omar Sandoval)
btrfs_new_inode() always returns an inode with i_size and disk_i_size set to 0 (via inode_init_always() and btrfs_alloc_inode(), respectively). Remove the unnecessary calls to btrfs_i_size_write() in btrfs_mkdir() and btrfs_create_subvol_root(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: get rid of btrfs_add_nondir() (Omar Sandoval)
This is a trivial wrapper around btrfs_add_link(). The only thing it does other than moving arguments around is translating a > 0 return value to -EEXIST. As far as I can tell, btrfs_add_link() won't return > 0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would be broken). The check itself dates back to commit 2c90e5d65842 ("Btrfs: still corruption hunting"), so it's probably left over from debugging. Let's just get rid of btrfs_add_nondir(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: fix anon_dev leak in create_subvol() (Omar Sandoval)
When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or btrfs_insert_root() fail in create_subvol(), we return without freeing anon_dev. Reorganize the error handling in create_subvol() to fix this. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: reserve correct number of items for rename (Omar Sandoval)
btrfs_rename() and btrfs_rename_exchange() don't account for enough items. Replace the incorrect explanations with a specific breakdown of the number of items and account them accurately. Note that this glosses over RENAME_WHITEOUT because the next commit is going to rework that, too. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  btrfs: reserve correct number of items for unlink and rmdir (Omar Sandoval)
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent directory in order to update its size and sequence number. Make sure we account for it. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16  mmc: sdhci-of-arasan: Add NULL check for data field (Sai Krishna Potthuri)
Add NULL check for data field retrieved from of_device_get_match_data() before dereferencing the data. Addresses-coverity: CID 305057:Dereference null return value (NULL_RETURNS) Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/1652339993-27280-1-git-send-email-lakshmi.sai.krishna.potthuri@xilinx.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
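(A minimal sketch of the pattern, assuming a hypothetical driver and match-data type; this is not the sdhci-of-arasan code. of_device_get_match_data() can return NULL when no match data is attached, so the result must be checked before being dereferenced.)
#include <linux/of_device.h>
#include <linux/platform_device.h>

struct example_soc_data {			/* hypothetical match data */
	unsigned int quirks;
};

static int example_probe(struct platform_device *pdev)
{
	const struct example_soc_data *data;

	data = of_device_get_match_data(&pdev->dev);
	if (!data)				/* the added NULL check */
		return -EINVAL;

	dev_info(&pdev->dev, "quirks: %u\n", data->quirks);	/* safe now */
	return 0;
}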
2022-05-16  nbd: Fix hung on disconnect request if socket is closed before (Xie Yongji)
When userspace closes the socket before sending a disconnect request, the following I/O requests will be blocked in wait_for_reconnect() until dead timeout. This will cause the following disconnect request also hung on blk_mq_quiesce_queue(). That means we have no way to disconnect a nbd device if there are some I/O requests waiting for reconnecting until dead timeout. It's not expected. So let's wake up the thread waiting for reconnecting directly when a disconnect request is sent. Reported-by: Xu Jianhai <zero.xu@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220322080639.142-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16  evm: Clean up some variables (Stefan Berger)
Make hmac_tfm static since it's not used anywhere else besides the file it is in. Remove declaration of hash_tfm since it doesn't exist. Signed-off-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-05-16  evm: Return INTEGRITY_PASS for enum integrity_status value '0' (Stefan Berger)
Return INTEGRITY_PASS for the enum integrity_status rather than 0. Signed-off-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-05-16  mtd: spi-nor: aspeed: set the decoding size to at least 2MB for AST2600 (Potin Lai)
In the AST2600, the unit of the SPI CEx decoding range register is 1MB, and the end address offset is set to the actual offset - 1MB. If the flash only has 1MB, the end address will have the same value as the start address, which will cause unexpected errors. This patch sets the decoding size to at least 2MB to avoid decoding errors.
Tested:
root@bletchley:~# dmesg | grep "aspeed-smc 1e631000.spi: CE0 window"
[ 59.328134] aspeed-smc 1e631000.spi: CE0 window resized to 2MB (AST2600 Decoding)
[ 59.343001] aspeed-smc 1e631000.spi: CE0 window [ 0x50000000 - 0x50200000 ] 2MB
root@bletchley:~# devmem 0x1e631030
0x00100000
Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Potin Lai <potin.lai@quantatw.com> [ clg : Ported on new spi-mem driver ] Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-12-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
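(A hedged sketch of the size clamp only, with illustrative names; this is not the actual aspeed driver change.)
#include <linux/types.h>
#include <linux/sizes.h>
#include <linux/minmax.h>

/* Keep the decoding window at least 2MB so that "end = start + size - 1MB"
 * can never collapse onto the start address for a 1MB-only flash.
 */
static u32 example_decoding_window_size(u32 flash_size)
{
	return max_t(u32, flash_size, SZ_2M);
}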
2022-05-16  spi: aspeed: Calibrate read timings (Cédric Le Goater)
To accommodate the different response time of SPI transfers on different boards and different SPI NOR devices, the Aspeed controllers provide a set of Read Timing Compensation registers to tune the timing delays depending on the frequency being used. The AST2600 SoC has one of these registers per device. On the AST2500 and AST2400 SoCs, the timing register is shared by all devices which is problematic to get good results other than for one device. The algorithm first reads a golden buffer at low speed and then performs reads with different clocks and delay cycle settings to find a breaking point. This selects a default good frequency for the CEx control register. The current settings are a bit optimistic as we pick the first delay giving good results. A safer approach would be to determine an interval and choose the middle value. Calibration is performed when the direct mapping for reads is created. Since the underlying spi-nor object needs to be initialized to create the spi_mem operation for direct mapping, we should be fine. Having a specific API would clarify the requirements though. Cc: Pratyush Yadav <p.yadav@ti.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-9-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: aspeed: Add support for the AST2400 SPI controller (Cédric Le Goater)
Extend the driver for the AST2400 SPI Flash Controller (SPI). This controller has a slightly different interface which requires adaptation of the 4B handling. Summary of features:
. host Firmware
. 1 chip select pin (CE0)
. slightly different register set, between AST2500 and the legacy controller
. no segment registers
. single, dual mode.
Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-8-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: aspeed: Workaround AST2500 limitations (Cédric Le Goater)
It is not possible to configure a full 128MB window for a chip of the same size on the AST2500 SPI controller. For this case, the maximum window size is restricted to 120MB for CE0. Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-7-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: aspeed: Adjust direct mapping to device size (Cédric Le Goater)
The segment registers of the FMC/SPI controllers provide a way to configure the mapping window of the flash device contents on the AHB bus. Adjust this window to the size of the spi-mem mapping. Things get more complex with multiple devices. The driver needs to also adjust the window of the next device to make sure that there is no overlap, even if there is no available device. The proposal below is not perfect but it is covering all the cases we have seen on different boards with one and two devices on the same bus. Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-6-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: aspeed: Add support for direct mapping (Cédric Le Goater)
Use direct mapping to read the flash device contents. This operation mode is called "Command mode" on Aspeed SoC SMC controllers. It uses a Control Register for the settings to apply when a memory operation is performed on the flash device mapping window. If the window is not big enough, fall back to the "User mode" to perform the read. Direct mapping for writes will come later when validated. Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-5-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: spi-mem: Convert Aspeed SMC driver to spi-mem (Cédric Le Goater)
This SPI driver adds support for the Aspeed static memory controllers of the AST2600, AST2500 and AST2400 SoCs using the spi-mem interface.
* AST2600 Firmware SPI Memory Controller (FMC)
. BMC firmware
. 3 chip select pins (CE0 ~ CE2)
. Only supports SPI type flash memory
. different segment register interface
. single, dual and quad mode.
* AST2600 SPI Flash Controller (SPI1 and SPI2)
. host firmware
. 2 chip select pins (CE0 ~ CE1)
. different segment register interface
. single, dual and quad mode.
* AST2500 Firmware SPI Memory Controller (FMC)
. BMC firmware
. 3 chip select pins (CE0 ~ CE2)
. supports SPI type flash memory (CE0-CE1)
. CE2 can be of NOR type flash but this is not supported by the driver
. single, dual mode.
* AST2500 SPI Flash Controller (SPI1 and SPI2)
. host firmware
. 2 chip select pins (CE0 ~ CE1)
. single, dual mode.
* AST2400 New Static Memory Controller (also referred to as FMC)
. BMC firmware
. New register set
. 5 chip select pins (CE0 ~ CE4)
. supports NOR flash, NAND flash and SPI flash memory.
. single, dual and quad mode.
Each controller has a memory range on which flash device contents are mapped. Each device is assigned a window that can be changed at boot time with the Segment Address Registers. Each SPI flash device can then be accessed in two modes: Command and User. When in User mode, SPI transfers are initiated with accesses to the memory segment of a device. When in Command mode, memory operations on the memory segment of a device generate SPI commands automatically using a Control Register for the settings. This initial patch adds support for User mode. Command mode needs a little more work to check that the memory window on the AHB bus fits the device size. It will come later when support for direct mapping is added. Single and dual mode RX transfers are supported. Other types than SPI are not supported. Reviewed-by: Joel Stanley <joel@jms.id.au> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Signed-off-by: Chin-Ting Kuo <chin-ting_kuo@aspeedtech.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-4-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  spi: Convert the Aspeed SMC controllers device tree binding (Cédric Le Goater)
The "interrupt" property is optional because it is only necessary for controllers supporting DMAs (Not implemented yet in the new driver). Cc: Chin-Ting Kuo <chin-ting_kuo@aspeedtech.com> Tested-by: Joel Stanley <joel@jms.id.au> Tested-by: Tao Ren <rentao.bupt@gmail.com> Tested-by: Jae Hyun Yoo <quic_jaehyoo@quicinc.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Link: https://lore.kernel.org/r/20220509175616.1089346-3-clg@kaod.org Signed-off-by: Mark Brown <broonie@kernel.org>
2022-05-16  ata: pata_ftide010: Remove unneeded ERROR check before clk_disable_unprepare (Wan Jiabing)
The error check is already done in clk_disable() and clk_unprepare(), which use IS_ERR_OR_NULL. Remove the unneeded error check for ftide->pclk here. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
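(Illustrative before/after, with a hypothetical private struct standing in for the driver's state; the clk core itself bails out on IS_ERR_OR_NULL clocks, so the guard can simply go away.)
#include <linux/clk.h>
#include <linux/err.h>

struct example_priv {			/* hypothetical driver context */
	struct clk *pclk;
};

static void example_remove(struct example_priv *priv)
{
	/* was: if (!IS_ERR(priv->pclk)) clk_disable_unprepare(priv->pclk); */
	clk_disable_unprepare(priv->pclk);	/* safe for error/NULL clocks */
}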
2022-05-16  netfilter: conntrack: remove pr_debug callsites from tcp tracker (Florian Westphal)
They are either obsolete or useless. Those in the normal processing path cannot be enabled on a production system; they generate too much noise. One pr_debug call resides in an error path and does provide useful info, merge it with the existing nf_log_invalid(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  netfilter: nf_conncount: reduce unnecessary GC (William Tu)
Currently nf_conncount can trigger garbage collection (GC) at multiple places. Each GC process takes a spin_lock_bh to traverse the nf_conncount_list. We found that when testing port scanning with two parallel nmap runs, because the number of connections increases fast, nf_conncount_count and its subsequent call to __nf_conncount_add take too much time, causing several CPU lockups. This happens when the user sets the conntrack limit to +20,000, because the larger the limit, the longer the list that GC has to traverse. The patch mitigates the performance issue by avoiding unnecessary GC with a timestamp. Whenever nf_conncount has done a GC, a timestamp is updated, and before the next time GC is triggered, we make sure more than a jiffy has passed. By doing this we can greatly reduce the CPU cycles and avoid the softirq lockup.
To reproduce it in OVS:
$ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000
$ ovs-appctl dpctl/ct-get-limits
On another machine, run two nmap scans:
$ nmap -p1- <IP>
$ nmap -p1- <IP>
Signed-off-by: William Tu <u9012063@gmail.com> Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com> Reported-by: Greg Rose <gvrose8192@gmail.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
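(A hedged sketch of the timestamp idea only; field and function names are illustrative and not the nf_conncount implementation.)
#include <linux/jiffies.h>

struct example_count_list {
	unsigned long last_gc;		/* jiffies when GC last ran */
	/* ... list head, lock, counter ... */
};

static bool example_gc_worth_doing(struct example_count_list *list)
{
	/* Skip GC if the previous run was less than a jiffy ago. */
	if (time_before(jiffies, list->last_gc + 1))
		return false;

	list->last_gc = jiffies;
	return true;
}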
2022-05-16  netfilter: Use l3mdev flow key when re-routing mangled packets (Martin Willi)
Commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") introduces a flow key specific for layer 3 domains, such as a VRF master device. This allows for explicit VRF domain selection instead of abusing the oif flow key. Update ip[6]_route_me_harder() to make use of that new key when re-routing mangled packets within VRFs instead of setting the flow oif, making it consistent with other users. Signed-off-by: Martin Willi <martin@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  netfilter: nft_flow_offload: fix offload with pppoe + vlan (Felix Fietkau)
When running a combination of PPPoE on top of a VLAN, we need to set info->outdev to the PPPoE device, otherwise PPPoE encap is skipped during software offload. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  net: fix dev_fill_forward_path with pppoe + bridge (Felix Fietkau)
When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices (Felix Fietkau)
The dst entry does not contain a valid hardware address, so skip the lookup in order to avoid running into errors here. The proper hardware address is filled in from nft_dev_path_info Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  netfilter: flowtable: fix excessive hw offload attempts after failure (Felix Fietkau)
If a flow cannot be offloaded, the code currently repeatedly tries again as quickly as possible, which can significantly increase system load. Fix this by limiting flow timeout update and hardware offload retry to once per second. Fixes: c07531c01d82 ("netfilter: flowtable: Remove redundant hw refresh bit") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16  net/sched: act_pedit: sanitize shift argument before usage (Paolo Abeni)
syzbot was able to trigger an out-of-bounds shift in the pedit action:
UBSAN: shift-out-of-bounds in net/sched/act_pedit.c:238:43
shift exponent 1400735974 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 3606 Comm: syz-executor151 Not tainted 5.18.0-rc5-syzkaller-00165-g810c2f0a3f86 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
ubsan_epilogue+0xb/0x50 lib/ubsan.c:151
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x187 lib/ubsan.c:322
tcf_pedit_init.cold+0x1a/0x1f net/sched/act_pedit.c:238
tcf_action_init_1+0x414/0x690 net/sched/act_api.c:1367
tcf_action_init+0x530/0x8d0 net/sched/act_api.c:1432
tcf_action_add+0xf9/0x480 net/sched/act_api.c:1956
tc_ctl_action+0x346/0x470 net/sched/act_api.c:2015
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5993
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:725
____sys_sendmsg+0x6e2/0x800 net/socket.c:2413
___sys_sendmsg+0xf3/0x170 net/socket.c:2467
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fe36e9e1b59
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffef796fe88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe36e9e1b59
RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003
RBP: 00007fe36e9a5d00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe36e9a5d90
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
The 'shift' field is not validated, and any value above 31 will trigger an out-of-bounds shift. The issue predates the git history, but syzbot was able to trigger it only after the commit mentioned in the fixes tag, and this change only applies on top of such commit. Address the issue by bounding the 'shift' value to the maximum allowed by the relevant operator. Reported-and-tested-by: syzbot+8ed8fc4c57e9dcf23ca6@syzkaller.appspotmail.com Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
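(Illustrative bound check only, not the exact act_pedit fix: shifting a 32-bit operand by 32 or more bits is undefined behaviour, so a user-supplied shift has to be rejected or clamped before it is ever used in "val << shift".)
#include <linux/errno.h>
#include <linux/types.h>

static int example_validate_shift(u32 shift)
{
	if (shift > 31)			/* anything larger overflows a u32 shift */
		return -EINVAL;
	return 0;
}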
2022-05-16  octeontx2-pf: Remove unnecessary synchronize_irq() before free_irq() (Minghao Chi)
Calling synchronize_irq() right before free_irq() is quite useless. On one hand the IRQ can easily fire again before free_irq() is entered, on the other hand free_irq() itself calls synchronize_irq() internally (in a race condition free way), before any state associated with the IRQ is freed. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
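(Illustrative before/after with hypothetical names: free_irq() waits for any in-flight handler itself, so the preceding synchronize_irq() adds nothing and closes no race.)
#include <linux/interrupt.h>

static void example_teardown_irq(unsigned int irq, void *dev_id)
{
	/* was:
	 *	synchronize_irq(irq);
	 *	free_irq(irq, dev_id);
	 * free_irq() already synchronizes, so the first call goes away.
	 */
	free_irq(irq, dev_id);
}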
2022-05-16  net: wwan: t7xx: Fix return type of t7xx_dl_add_timedout() (YueHaibing)
t7xx_dl_add_timedout() now returns the int 'ret', but its return type is bool. Change the return type to int so the error code can be propagated further upstream. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  ALSA: usb-audio: Restore Rane SL-1 quirk (Takashi Iwai)
When cleaning up and moving the device rename from the quirk table to its own table, we removed the entry for Rane SL-1 as we thought it was only for renaming. It turned out, however, that the quirk is required for matching with the device that declares itself not as standard audio but only as vendor-specific. Restore the quirk entry for Rane SL-1 to fix the regression. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215887 Fixes: 5436f59bc5bc ("ALSA: usb-audio: Move device rename and profile quirks to an internal table") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220516103112.12950-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-05-16  octeon_ep: delete unnecessary NULL check (Ziyang Xuan)
vfree(NULL) is safe. NULL check before vfree() is not needed. Delete them to simplify the code. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
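(Illustrative simplification with a hypothetical context struct: vfree(NULL) is a no-op, so the guard is unnecessary.)
#include <linux/vmalloc.h>

struct example_ctx {
	void *buf;
};

static void example_free_buf(struct example_ctx *ctx)
{
	/* was: if (ctx->buf) vfree(ctx->buf); */
	vfree(ctx->buf);
	ctx->buf = NULL;
}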
2022-05-16  octeon_ep: add missing destroy_workqueue in octep_init_module (Zheng Bin)
octep_init_module() misses a destroy_workqueue() call in its error path; this patch fixes that. Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
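(A hedged sketch of the error-path shape, with hypothetical names and a stubbed follow-on step; the point is that once the workqueue is allocated, every later failure has to unwind it with destroy_workqueue().)
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int example_register_driver(void)
{
	return 0;	/* stand-in for the next init step, e.g. PCI registration */
}

static int __init example_init(void)
{
	int ret;

	example_wq = alloc_workqueue("example_wq", 0, 0);
	if (!example_wq)
		return -ENOMEM;

	ret = example_register_driver();
	if (ret) {
		destroy_workqueue(example_wq);	/* the previously missing cleanup */
		return ret;
	}
	return 0;
}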
2022-05-16  Merge branch 'net-skb-defer-freeing-polish' (David S. Miller)
Eric Dumazet says:
====================
net: polish skb defer freeing
While testing this recently added feature on a variety of platforms/configurations, I found the following issues:
1) A race leading to concurrent calls to smp_call_function_single_async()
2) Missed opportunity to use napi_consume_skb()
3) Need to limit the max length of the per-cpu lists.
4) Process the per-cpu list more frequently, for the (unusual) case where net_rx_action() has multiple napi_poll() to process per round.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  net: call skb_defer_free_flush() before each napi_poll() (Eric Dumazet)
skb_defer_free_flush() can consume cpu cycles, it seems better to call it in the inner loop:
- Potentially frees page/skb that will be reallocated while hot.
- Account for the cpu cycles in the @time_limit determination.
- Keep softnet_data.defer_count small to reduce chances for skb_attempt_defer_free() to send an IPI.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  net: add skb_defer_max sysctl (Eric Dumazet)
commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") added another per-cpu cache of skbs. It was expected to be small, and an IPI was forced whenever the list reached 128 skbs. We might need to be able to control more precisely queue capacity and added latency. An IPI is generated whenever queue reaches half capacity. Default value of the new limit is 64. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
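(Hedged sketch of the behaviour described above, not the net/core code; all names except sk_buff are hypothetical. The point is simply that the flush IPI is requested once the per-cpu list reaches half of the configured maximum.)
#include <linux/skbuff.h>

struct example_percpu_defer {
	struct sk_buff *defer_list;	/* singly linked list of deferred skbs */
	unsigned int defer_count;
};

static void example_kick_remote_cpu(struct example_percpu_defer *sd)
{
	/* placeholder for the IPI that asks the owning cpu to flush its list */
}

static void example_defer_free(struct example_percpu_defer *sd,
			       struct sk_buff *skb,
			       unsigned int defer_max)
{
	skb->next = sd->defer_list;
	sd->defer_list = skb;
	sd->defer_count++;

	/* Signal the owning cpu once the queue reaches half capacity. */
	if (sd->defer_count >= defer_max / 2)
		example_kick_remote_cpu(sd);
}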
2022-05-16  net: use napi_consume_skb() in skb_defer_free_flush() (Eric Dumazet)
skb_defer_free_flush() runs from softirq context, we have the opportunity to refill the napi_alloc_cache, and/or use kmem_cache_free_bulk() when this cache is full. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  net: fix possible race in skb_attempt_defer_free() (Eric Dumazet)
A cpu can observe sd->defer_count reaching 128 and call smp_call_function_single_async(). Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered. This is a common issue with smp_call_function_single_async(). Callers must ensure correct synchronization and serialization. I triggered this issue while experimenting with a smaller threshold. Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Test of CSD_FLAG_LOCK presence is racy anyway. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  net: tulip: convert to devres (Rolf Eike Beer)
Works fine on my HP C3600:
[ 274.452394] tulip0: no phy info, aborting mtable build
[ 274.499041] tulip0: MII transceiver #1 config 1000 status 782d advertising 01e1
[ 274.750691] net eth0: Digital DS21142/43 Tulip rev 65 at MMIO 0xf4008000, 00:30:6e:08:7d:21, IRQ 17
[ 283.104520] net eth0: Setting full-duplex based on MII#1 link partner capability of c1e1
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16  Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (Kalle Valo)
ath.git patches for v5.19. Major changes:
ath11k
* enable keepalive during WoWLAN suspend
* implement remain-on-channel support