summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-06-21vsock: notify server to shutdown when client has pending signalLongpeng(Mike)
The client's sk_state will be set to TCP_ESTABLISHED if the server replay the client's connect request. However, if the client has pending signal, its sk_state will be set to TCP_CLOSE without notify the server, so the server will hold the corrupt connection. client server 1. sk_state=TCP_SYN_SENT | 2. call ->connect() | 3. wait reply | | 4. sk_state=TCP_ESTABLISHED | 5. insert to connected list | 6. reply to the client 7. sk_state=TCP_ESTABLISHED | 8. insert to connected list | 9. *signal pending* <--------------------- the user kill client 10. sk_state=TCP_CLOSE | client is exiting... | 11. call ->release() | virtio_transport_close if (!(sk->sk_state == TCP_ESTABLISHED || sk->sk_state == TCP_CLOSING)) return true; *return at here, the server cannot notice the connection is corrupt* So the client should notify the peer in this case. Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jorgen Hansen <jhansen@vmware.com> Cc: Norbert Slusarek <nslusarek@gmx.net> Cc: Andra Paraschiv <andraprs@amazon.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: David Brazdil <dbrazdil@google.com> Cc: Alexander Popov <alex.popov@linux.com> Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lkml.org/lkml/2021/5/17/418 Signed-off-by: lixianming <lixianming5@huawei.com> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: add pf_family_names[] for protocol familyYejune Deng
Modify the pr_info content from int to char * in sock_register() and sock_unregister(), this looks more readable. Fixed build error in ARCH=sparc64. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: mana: Fix a memory leak in an error handling path in 'mana_create_txq()'Christophe JAILLET
If this test fails we must free some resources as in all the other error handling paths of this function. Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21Merge branch 'ingenic-fixes'David S. Miller
Zhou Yanjie says: ==================== Fix for Ingenic MAC support. 1.Remove the unexpected "snps,dwmac" item in the example. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21dt-bindings: dwmac: Remove unexpected item.周琰杰 (Zhou Yanjie)
Remove the unexpected "snps,dwmac" item in the example. Fixes: 3b8401066e5a ("dt-bindings: dwmac: Add bindings for new Ingenic SoCs.") Signed-off-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: hns3: Fix a memory leak in an error handling path in ↵Christophe JAILLET
'hclge_handle_error_info_log()' If this 'kzalloc()' fails we must free some resources as in all the other error handling paths of this function. Fixes: 2e2deee7618b ("net: hns3: add the RAS compatibility adaptation solution") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jiaran Zhang <zhangjiaran@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21Merge branch 'nnicstar-fixes'David S. Miller
Zheyu Ma says: ==================== atm: nicstar: fix two bugs about error handling Zheyu Ma (2): atm: nicstar: use 'dma_free_coherent' instead of 'kfree' atm: nicstar: register the interrupt handler in the right place ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21atm: nicstar: register the interrupt handler in the right placeZheyu Ma
Because the error handling is sequential, the application of resources should be carried out in the order of error handling, so the operation of registering the interrupt handler should be put in front, so as not to free the unregistered interrupt handler during error handling. This log reveals it: [ 3.438724] Trying to free already-free IRQ 23 [ 3.439060] WARNING: CPU: 5 PID: 1 at kernel/irq/manage.c:1825 free_irq+0xfb/0x480 [ 3.440039] Modules linked in: [ 3.440257] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.12.4-g70e7f0549188-dirty #142 [ 3.440793] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 3.441561] RIP: 0010:free_irq+0xfb/0x480 [ 3.441845] Code: 6e 08 74 6f 4d 89 f4 e8 c3 78 09 00 4d 8b 74 24 18 4d 85 f6 75 e3 e8 b4 78 09 00 8b 75 c8 48 c7 c7 a0 ac d5 85 e8 95 d7 f5 ff <0f> 0b 48 8b 75 c0 4c 89 ff e8 87 c5 90 03 48 8b 43 40 4c 8b a0 80 [ 3.443121] RSP: 0000:ffffc90000017b50 EFLAGS: 00010086 [ 3.443483] RAX: 0000000000000000 RBX: ffff888107c6f000 RCX: 0000000000000000 [ 3.443972] RDX: 0000000000000000 RSI: ffffffff8123f301 RDI: 00000000ffffffff [ 3.444462] RBP: ffffc90000017b90 R08: 0000000000000001 R09: 0000000000000003 [ 3.444950] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [ 3.444994] R13: ffff888107dc0000 R14: ffff888104f6bf00 R15: ffff888107c6f0a8 [ 3.444994] FS: 0000000000000000(0000) GS:ffff88817bd40000(0000) knlGS:0000000000000000 [ 3.444994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.444994] CR2: 0000000000000000 CR3: 000000000642e000 CR4: 00000000000006e0 [ 3.444994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3.444994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3.444994] Call Trace: [ 3.444994] ns_init_card_error+0x18e/0x250 [ 3.444994] nicstar_init_one+0x10d2/0x1130 [ 3.444994] local_pci_probe+0x4a/0xb0 [ 3.444994] pci_device_probe+0x126/0x1d0 [ 3.444994] ? pci_device_remove+0x100/0x100 [ 3.444994] really_probe+0x27e/0x650 [ 3.444994] driver_probe_device+0x84/0x1d0 [ 3.444994] ? mutex_lock_nested+0x16/0x20 [ 3.444994] device_driver_attach+0x63/0x70 [ 3.444994] __driver_attach+0x117/0x1a0 [ 3.444994] ? device_driver_attach+0x70/0x70 [ 3.444994] bus_for_each_dev+0xb6/0x110 [ 3.444994] ? rdinit_setup+0x40/0x40 [ 3.444994] driver_attach+0x22/0x30 [ 3.444994] bus_add_driver+0x1e6/0x2a0 [ 3.444994] driver_register+0xa4/0x180 [ 3.444994] __pci_register_driver+0x77/0x80 [ 3.444994] ? uPD98402_module_init+0xd/0xd [ 3.444994] nicstar_init+0x1f/0x75 [ 3.444994] do_one_initcall+0x7a/0x3d0 [ 3.444994] ? rdinit_setup+0x40/0x40 [ 3.444994] ? rcu_read_lock_sched_held+0x4a/0x70 [ 3.444994] kernel_init_freeable+0x2a7/0x2f9 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] kernel_init+0x13/0x180 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] ret_from_fork+0x1f/0x30 [ 3.444994] Kernel panic - not syncing: panic_on_warn set ... [ 3.444994] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.12.4-g70e7f0549188-dirty #142 [ 3.444994] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 3.444994] Call Trace: [ 3.444994] dump_stack+0xba/0xf5 [ 3.444994] ? free_irq+0xfb/0x480 [ 3.444994] panic+0x155/0x3ed [ 3.444994] ? __warn+0xed/0x150 [ 3.444994] ? free_irq+0xfb/0x480 [ 3.444994] __warn+0x103/0x150 [ 3.444994] ? free_irq+0xfb/0x480 [ 3.444994] report_bug+0x119/0x1c0 [ 3.444994] handle_bug+0x3b/0x80 [ 3.444994] exc_invalid_op+0x18/0x70 [ 3.444994] asm_exc_invalid_op+0x12/0x20 [ 3.444994] RIP: 0010:free_irq+0xfb/0x480 [ 3.444994] Code: 6e 08 74 6f 4d 89 f4 e8 c3 78 09 00 4d 8b 74 24 18 4d 85 f6 75 e3 e8 b4 78 09 00 8b 75 c8 48 c7 c7 a0 ac d5 85 e8 95 d7 f5 ff <0f> 0b 48 8b 75 c0 4c 89 ff e8 87 c5 90 03 48 8b 43 40 4c 8b a0 80 [ 3.444994] RSP: 0000:ffffc90000017b50 EFLAGS: 00010086 [ 3.444994] RAX: 0000000000000000 RBX: ffff888107c6f000 RCX: 0000000000000000 [ 3.444994] RDX: 0000000000000000 RSI: ffffffff8123f301 RDI: 00000000ffffffff [ 3.444994] RBP: ffffc90000017b90 R08: 0000000000000001 R09: 0000000000000003 [ 3.444994] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [ 3.444994] R13: ffff888107dc0000 R14: ffff888104f6bf00 R15: ffff888107c6f0a8 [ 3.444994] ? vprintk_func+0x71/0x110 [ 3.444994] ns_init_card_error+0x18e/0x250 [ 3.444994] nicstar_init_one+0x10d2/0x1130 [ 3.444994] local_pci_probe+0x4a/0xb0 [ 3.444994] pci_device_probe+0x126/0x1d0 [ 3.444994] ? pci_device_remove+0x100/0x100 [ 3.444994] really_probe+0x27e/0x650 [ 3.444994] driver_probe_device+0x84/0x1d0 [ 3.444994] ? mutex_lock_nested+0x16/0x20 [ 3.444994] device_driver_attach+0x63/0x70 [ 3.444994] __driver_attach+0x117/0x1a0 [ 3.444994] ? device_driver_attach+0x70/0x70 [ 3.444994] bus_for_each_dev+0xb6/0x110 [ 3.444994] ? rdinit_setup+0x40/0x40 [ 3.444994] driver_attach+0x22/0x30 [ 3.444994] bus_add_driver+0x1e6/0x2a0 [ 3.444994] driver_register+0xa4/0x180 [ 3.444994] __pci_register_driver+0x77/0x80 [ 3.444994] ? uPD98402_module_init+0xd/0xd [ 3.444994] nicstar_init+0x1f/0x75 [ 3.444994] do_one_initcall+0x7a/0x3d0 [ 3.444994] ? rdinit_setup+0x40/0x40 [ 3.444994] ? rcu_read_lock_sched_held+0x4a/0x70 [ 3.444994] kernel_init_freeable+0x2a7/0x2f9 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] kernel_init+0x13/0x180 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] ? rest_init+0x2c0/0x2c0 [ 3.444994] ret_from_fork+0x1f/0x30 [ 3.444994] Dumping ftrace buffer: [ 3.444994] (ftrace buffer empty) [ 3.444994] Kernel Offset: disabled [ 3.444994] Rebooting in 1 seconds.. Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21atm: nicstar: use 'dma_free_coherent' instead of 'kfree'Zheyu Ma
When 'nicstar_init_one' fails, 'ns_init_card_error' will be executed for error handling, but the correct memory free function should be used, otherwise it will cause an error. Since 'card->rsq.org' and 'card->tsq.org' are allocated using 'dma_alloc_coherent' function, they should be freed using 'dma_free_coherent'. Fix this by using 'dma_free_coherent' instead of 'kfree' This log reveals it: [ 3.440294] kernel BUG at mm/slub.c:4206! [ 3.441059] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 3.441430] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.12.4-g70e7f0549188-dirty #141 [ 3.441986] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 3.442780] RIP: 0010:kfree+0x26a/0x300 [ 3.443065] Code: e8 3a c3 b9 ff e9 d6 fd ff ff 49 8b 45 00 31 db a9 00 00 01 00 75 4d 49 8b 45 00 a9 00 00 01 00 75 0a 49 8b 45 08 a8 01 75 02 <0f> 0b 89 d9 b8 00 10 00 00 be 06 00 00 00 48 d3 e0 f7 d8 48 63 d0 [ 3.443396] RSP: 0000:ffffc90000017b70 EFLAGS: 00010246 [ 3.443396] RAX: dead000000000100 RBX: 0000000000000000 RCX: 0000000000000000 [ 3.443396] RDX: 0000000000000000 RSI: ffffffff85d3df94 RDI: ffffffff85df38e6 [ 3.443396] RBP: ffffc90000017b90 R08: 0000000000000001 R09: 0000000000000001 [ 3.443396] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888107dc0000 [ 3.443396] R13: ffffea00001f0100 R14: ffff888101a8bf00 R15: ffff888107dc0160 [ 3.443396] FS: 0000000000000000(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000 [ 3.443396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.443396] CR2: 0000000000000000 CR3: 000000000642e000 CR4: 00000000000006e0 [ 3.443396] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3.443396] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3.443396] Call Trace: [ 3.443396] ns_init_card_error+0x12c/0x220 [ 3.443396] nicstar_init_one+0x10d2/0x1130 [ 3.443396] local_pci_probe+0x4a/0xb0 [ 3.443396] pci_device_probe+0x126/0x1d0 [ 3.443396] ? pci_device_remove+0x100/0x100 [ 3.443396] really_probe+0x27e/0x650 [ 3.443396] driver_probe_device+0x84/0x1d0 [ 3.443396] ? mutex_lock_nested+0x16/0x20 [ 3.443396] device_driver_attach+0x63/0x70 [ 3.443396] __driver_attach+0x117/0x1a0 [ 3.443396] ? device_driver_attach+0x70/0x70 [ 3.443396] bus_for_each_dev+0xb6/0x110 [ 3.443396] ? rdinit_setup+0x40/0x40 [ 3.443396] driver_attach+0x22/0x30 [ 3.443396] bus_add_driver+0x1e6/0x2a0 [ 3.443396] driver_register+0xa4/0x180 [ 3.443396] __pci_register_driver+0x77/0x80 [ 3.443396] ? uPD98402_module_init+0xd/0xd [ 3.443396] nicstar_init+0x1f/0x75 [ 3.443396] do_one_initcall+0x7a/0x3d0 [ 3.443396] ? rdinit_setup+0x40/0x40 [ 3.443396] ? rcu_read_lock_sched_held+0x4a/0x70 [ 3.443396] kernel_init_freeable+0x2a7/0x2f9 [ 3.443396] ? rest_init+0x2c0/0x2c0 [ 3.443396] kernel_init+0x13/0x180 [ 3.443396] ? rest_init+0x2c0/0x2c0 [ 3.443396] ? rest_init+0x2c0/0x2c0 [ 3.443396] ret_from_fork+0x1f/0x30 [ 3.443396] Modules linked in: [ 3.443396] Dumping ftrace buffer: [ 3.443396] (ftrace buffer empty) [ 3.458593] ---[ end trace 3c6f8f0d8ef59bcd ]--- [ 3.458922] RIP: 0010:kfree+0x26a/0x300 [ 3.459198] Code: e8 3a c3 b9 ff e9 d6 fd ff ff 49 8b 45 00 31 db a9 00 00 01 00 75 4d 49 8b 45 00 a9 00 00 01 00 75 0a 49 8b 45 08 a8 01 75 02 <0f> 0b 89 d9 b8 00 10 00 00 be 06 00 00 00 48 d3 e0 f7 d8 48 63 d0 [ 3.460499] RSP: 0000:ffffc90000017b70 EFLAGS: 00010246 [ 3.460870] RAX: dead000000000100 RBX: 0000000000000000 RCX: 0000000000000000 [ 3.461371] RDX: 0000000000000000 RSI: ffffffff85d3df94 RDI: ffffffff85df38e6 [ 3.461873] RBP: ffffc90000017b90 R08: 0000000000000001 R09: 0000000000000001 [ 3.462372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888107dc0000 [ 3.462871] R13: ffffea00001f0100 R14: ffff888101a8bf00 R15: ffff888107dc0160 [ 3.463368] FS: 0000000000000000(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000 [ 3.463949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.464356] CR2: 0000000000000000 CR3: 000000000642e000 CR4: 00000000000006e0 [ 3.464856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3.465356] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3.465860] Kernel panic - not syncing: Fatal exception [ 3.466370] Dumping ftrace buffer: [ 3.466616] (ftrace buffer empty) [ 3.466871] Kernel Offset: disabled [ 3.467122] Rebooting in 1 seconds.. Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21Merge branch 'fec-tx'David S. Miller
Joakim Zhang says: ==================== net: fec: fix TX bandwidth fluctuations This patch set intends to fix TX bandwidth fluctuations, any feedback would be appreciated. --- ChangeLogs: V1: remove RFC tag, RFC discussions please turn to below: https://lore.kernel.org/lkml/YK0Ce5YxR2WYbrAo@lunn.ch/T/ V2: change functions to be static in this patch set. And add the t-b tag. V3: fix sparse warining: ntohs()->htons() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: fec: add ndo_select_queue to fix TX bandwidth fluctuationsFugang Duan
As we know that AVB is enabled by default, and the ENET IP design is queue 0 for best effort, queue 1&2 for AVB Class A&B. Bandwidth of each queue 1&2 set in driver is 50%, TX bandwidth fluctuated when selecting tx queues randomly with FEC_QUIRK_HAS_AVB quirk available. This patch adds ndo_select_queue callback to select queues for transmitting to fix this issue. It will always return queue 0 if this is not a vlan packet, and return queue 1 or 2 based on priority of vlan packet. You may complain that in fact we only use single queue for trasmitting if we are not targeted to VLAN. Yes, but seems we have no choice, since AVB is enabled when the driver probed, we can't switch this feature dynamicly. After compare multiple queues to single queue, TX throughput almost no improvement. One way we can implemet is to configure the driver to multiple queues with Round-robin scheme by default. Then add ndo_setup_tc callback to enable/disable AVB feature for users. Unfortunately, ENET AVB IP seems not follow the standard 802.1Qav spec. We only can program DMAnCFG[IDLE_SLOPE] field to calculate bandwidth fraction. And idle slope is restricted to certain valus (a total of 19). It's far away from CBS QDisc implemented in Linux TC framework. If you strongly suggest to do this, I think we only can support limited numbers of bandwidth and reject others, but it's really urgly and wried. With this patch, VLAN tagged packets route to queue 0/1/2 based on vlan priority; VLAN untagged packets route to queue 0. Tested-by: Frieder Schrempf <frieder.schrempf@kontron.de> Reported-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: fec: add FEC_QUIRK_HAS_MULTI_QUEUES represents i.MX6SX ENET IPJoakim Zhang
Frieder Schrempf reported a TX throuthput issue [1], it happens quite often that the measured bandwidth in TX direction drops from its expected/nominal value to something like ~50% (for 100M) or ~67% (for 1G) connections. [1] https://lore.kernel.org/linux-arm-kernel/421cc86c-b66f-b372-32f7-21e59f9a98bc@kontron.de/ The issue becomes clear after digging into it, Net core would select queues when transmitting packets. Since FEC have not impletemented ndo_select_queue callback yet, so it will call netdev_pick_tx to select queues randomly. For i.MX6SX ENET IP with AVB support, driver default enables this feature. According to the setting of QOS/RCMRn/DMAnCFG registers, AVB configured to Credit-based scheme, 50% bandwidth of each queue 1&2. With below tests let me think more: 1) With FEC_QUIRK_HAS_AVB quirk, can reproduce TX bandwidth fluctuations issue. 2) Without FEC_QUIRK_HAS_AVB quirk, can't reproduce TX bandwidth fluctuations issue. The related difference with or w/o FEC_QUIRK_HAS_AVB quirk is that, whether we program FTYPE field of TxBD or not. As I describe above, AVB feature is enabled by default. With FEC_QUIRK_HAS_AVB quirk, frames in queue 0 marked as non-AVB, and frames in queue 1&2 marked as AVB Class A&B. It's unreasonable if frames in queue 1&2 are not required to be time-sensitive. So when Net core select tx queues ramdomly, Credit-based scheme would work and lead to TX bandwidth fluctuated. On the other hand, w/o FEC_QUIRK_HAS_AVB quirk, frames in queue 1&2 are all marked as non-AVB, so Credit-based scheme would not work. Till now, how can we fix this TX throughput issue? Yes, please remove FEC_QUIRK_HAS_AVB quirk if you suffer it from time-nonsensitive networking. However, this quirk is used to indicate i.MX6SX, other setting depends on it. So this patch adds a new quirk FEC_QUIRK_HAS_MULTI_QUEUES to represent i.MX6SX, it is safe for us remove FEC_QUIRK_HAS_AVB quirk now. FEC_QUIRK_HAS_AVB quirk is set by default in the driver, and users may not know much about driver details, they would waste effort to find the root cause, that is not we want. The following patch is a implementation to fix it and users don't need to modify the driver. Tested-by: Frieder Schrempf <frieder.schrempf@kontron.de> Reported-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."Yifan Zhang
This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20. Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may cause some APUs fail to enter gfxoff in certain user cases. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-06-21Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full ↵Yifan Zhang
doorbell." This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3. Reason for revert: Side effect of enlarging CP_MEC_DOORBELL_RANGE may cause some APUs fail to enter gfxoff in certain user cases. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2021-06-21drm/amdgpu: Call drm_framebuffer_init last for framebuffer initMichel Dänzer
Once drm_framebuffer_init has returned 0, the framebuffer is hooked up to the reference counting machinery and can no longer be destroyed with a simple kfree. Therefore, it must be called last. If drm_framebuffer_init returns 0 but its caller then returns non-0, there will likely be memory corruption fireworks down the road. The following lead me to this fix: [ 12.891228] kernel BUG at lib/list_debug.c:25! [...] [ 12.891263] RIP: 0010:__list_add_valid+0x4b/0x70 [...] [ 12.891324] Call Trace: [ 12.891330] drm_framebuffer_init+0xb5/0x100 [drm] [ 12.891378] amdgpu_display_gem_fb_verify_and_init+0x47/0x120 [amdgpu] [ 12.891592] ? amdgpu_display_user_framebuffer_create+0x10d/0x1f0 [amdgpu] [ 12.891794] amdgpu_display_user_framebuffer_create+0x126/0x1f0 [amdgpu] [ 12.891995] drm_internal_framebuffer_create+0x378/0x3f0 [drm] [ 12.892036] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm] [ 12.892075] drm_mode_addfb2+0x34/0xd0 [drm] [ 12.892115] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm] [ 12.892153] drm_ioctl_kernel+0xe2/0x150 [drm] [ 12.892193] drm_ioctl+0x3da/0x460 [drm] [ 12.892232] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm] [ 12.892274] amdgpu_drm_ioctl+0x43/0x80 [amdgpu] [ 12.892475] __se_sys_ioctl+0x72/0xc0 [ 12.892483] do_syscall_64+0x33/0x40 [ 12.892491] entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: f258907fdd835e "drm/amdgpu: Verify bo size can fit framebuffer size on init." Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-21Merge branch 'mptcp-sdeq-fixes'David S. Miller
Mat Martineau says: ==================== mptcp: 32-bit sequence number improvements MPTCP-level sequence numbers are 64 bits, but RFC 8684 allows use of 32-bit sequence numbers in the DSS option to save header space. Those 32-bit numbers are the least significant bits of the full 64-bit sequence number, so the receiver must infer the correct upper 32 bits. These two patches improve the logic for determining the full 64-bit sequence numbers when the 32-bit truncated version has wrapped around. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21mptcp: fix 32 bit DSN expansionPaolo Abeni
The current implementation of 32 bit DSN expansion is buggy. After the previous patch, we can simply reuse the newly introduced helper to do the expansion safely. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/120 Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21mptcp: fix bad handling of 32 bit ack wrap-aroundPaolo Abeni
When receiving 32 bits DSS ack from the peer, the MPTCP need to expand them to 64 bits value. The current code is buggy WRT detecting 32 bits ack wrap-around: when the wrap-around happens the current unsigned 32 bit ack value is lower than the previous one. Additionally check for possible reverse wrap and make the helper visible, so that we could re-use it for the next patch. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/204 Fixes: cc9d25669866 ("mptcp: update per unacked sequence on pkt reception") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21block/partitions/msdos: Fix typo inidicator -> indicatorThomas Bracht Laumann Jespersen
Just a fix for a small typo in msdos_partition(). Signed-off-by: Thomas Bracht Laumann Jespersen <t@laumann.xyz> Link: https://lore.kernel.org/r/20210619195130.19348-1-t@laumann.xyz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: reset waker pointer with shared queuesPaolo Valente
Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") leaves shared bfq_queues out of the waker-detection mechanism. It attains this goal by not updating the pointer last_completed_rq_bfqq, if the last request completed belongs to a shared bfq_queue (so that the pointer will not point to the shared bfq_queue). Yet this has a side effect: the pointer last_completed_rq_bfqq keeps pointing, deceptively, to a bfq_queue that actually is not the last one to have had a request completed. As a consequence, such a bfq_queue may deceptively be considered as a waker of some bfq_queue, even of some shared bfq_queue. To address this issue, reset last_completed_rq_bfqq if the last request completed belongs to a shared queue. Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: check waker only for queues with no in-flight I/OPaolo Valente
Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of Q1 gets completed shortly before a new request arrives for Q2, then BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new request may have a different cause, in the following case. If also Q2 has requests in flight while waiting for the arrival of a new request, then the completion of its own requests may be the actual cause of the awakening of the process that sends I/O to Q2. So Q1 may be flagged wrongly as a candidate waker. This commit avoids this deceptive flagging, by disabling candidate-waker flagging for Q2, if Q2 has in-flight I/O. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: avoid delayed merge of async queuesPaolo Valente
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq of the bic associated with Q2. So, when the time for the possible merge arrives, BFQ knows which bfq_queue to merge Q2 with. In particular, BFQ checks for possible merges on request arrivals. Yet the same bic may also be associated with an async bfq_queue, say Q3. So, if a request for Q3 arrives, then the above check may happen to be executed while the bfq_queue at hand is Q3, instead of Q2. In this case, Q1 happens to be merged with an async bfq_queue. This is not only a conceptual mistake, because async queues are to be kept out of queue merging, but also a bug that leads to inconsistent states. This commits simply filters async queues out of delayed merges. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: boost throughput by extending queue-merging timesPietro Pedroni
One of the methods with which bfq boosts throughput is by merging queues. One of the merging variants in bfq is the stable merge. This mechanism is activated between two queues only if they are created within a certain maximum time T1 from each other. Merging can happen soon or be delayed. In the second case, before merging, bfq needs to evaluate a throughput-boost parameter that indicates whether the queue generates a high throughput is served alone. Merging occurs when this throughput-boost is not high enough. In particular, this parameter is evaluated and late merging may occur only after at least a time T2 from the creation of the queue. Currently T1 and T2 are set to 180ms and 200ms, respectively. In this way the merging mechanism rarely occurs because time is not enough. This results in a noticeable lowering of the overall throughput with some workloads (see the example below). This commit introduces two constants bfq_activation_stable_merging and bfq_late_stable_merging in order to increase the duration of T1 and T2. Both the stable merging activation time and the late merging time are set to 600ms. This value has been experimentally evaluated using sqlite benchmark in the Phoronix Test Suite on a HDD. The duration of the benchmark before this fix was 111.02s, while now it has reached 97.02s, a better result than that of all the other schedulers. Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: consider also creation time in delayed stable mergePaolo Valente
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue and the last sync bfq_queue created. Such a merging is not performed immediately, because BFQ needs first to find out whether the newly created queue actually reaches a higher throughput if not merged at all (and in that case BFQ will not perform any stable merging). To check that, a little time must be waited after the creation of the new queue, so that some I/O can flow in the queue, and statistics on such I/O can be computed. Yet, to evaluate the above waiting time, the last split time is considered as start time, instead of the creation time of the queue. This is a mistake, because considering the split time is correct only in the following scenario. The queue undergoes a non-stable merges on the arrival of its very first I/O request, due to close I/O with some other queue. While the queue is merged for close I/O, stable merging is not considered. Yet the queue may then happen to be split, if the close I/O finishes (or happens to be a false positive). From this time on, the queue can again be considered for stable merging. But, again, a little time must elapse, to let some new I/O flow in the queue and to get updated statistics. To wait for this time, the split time is to be taken into account. Yet, if the queue does not undergo a non-stable merge on the arrival of its very first request, then BFQ immediately checks whether the stable merge is to be performed. It happens because the split time for a queue is initialized to minus infinity when the queue is created. This commit fixes this mistake by adding the missing condition. Now the check for delayed stable-merge is performed after a little time is elapsed not only from the last queue split time, but also from the creation time of the queue. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: fix delayed stable merge checkLuca Mariotti
When attempting to schedule a merge of a given bfq_queue with the currently in-service bfq_queue or with a cooperating bfq_queue among the scheduled bfq_queues, delayed stable merge is checked for rotational or non-queueing devs. For this stable merge to be performed, some conditions must be met. If the current bfq_queue underwent some split from some merged bfq_queue, one of these conditions is that two hundred milliseconds must elapse from split, otherwise this condition is always met. Unfortunately, by mistake, time_is_after_jiffies() was written instead of time_is_before_jiffies() for this check, verifying that less than two hundred milliseconds have elapsed instead of verifying that at least two hundred milliseconds have elapsed. Fix this issue by replacing time_is_after_jiffies() with time_is_before_jiffies(). Signed-off-by: Luca Mariotti <mariottiluca1@hotmail.it> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: let also stably merged queues enjoy weight raisingPaolo Valente
Merged bfq_queues are kept out of weight-raising (low-latency) mechanisms. The reason is that these queues are usually created for non-interactive and non-soft-real-time tasks. Yet this is not the case for stably-merged queues. These queues are merged just because they are created shortly after each other. So they may easily serve the I/O of an interactive or soft-real time application, if the application happens to spawn multiple processes. To address this issue, this commits lets also stably-merged queued enjoy weight raising. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21blk-wbt: make sure throttle is enabled properlyZhang Yi
After commit a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt"), if throttle was disabled by wbt_disable_default(), we could not enable again, fix this by set enable_state back to WBT_STATE_ON_DEFAULT. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21blk-wbt: introduce a new disable state to prevent false positive by ↵Zhang Yi
rwb_enabled() Now that we disable wbt by simply zero out rwb->wb_normal in wbt_disable_default() when switch elevator to bfq, but it's not safe because it will become false positive if we change queue depth. If it become false positive between wbt_wait() and wbt_track() when submit write request, it will lead to drop rqw->inflight to -1 in wbt_done(), which will end up trigger IO hung. Fix this issue by introduce a new state which mean the wbt was disabled. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Prioritize high-priority requestsBart Van Assche
While one or more requests with a certain I/O priority are pending, do not dispatch lower priority requests. Dispatch lower priority requests anyway after the "aging" time has expired. This patch has been tested as follows: modprobe scsi_debug ndelay=1000000 max_queue=16 && sd='' && while [ -z "$sd" ]; do sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done && echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire && cd /sys/fs/cgroup/blkio/ && echo $$ >cgroup.procs && echo restrict-to-be >blkio.prio.class && mkdir -p hipri && cd hipri && echo none-to-rt >blkio.prio.class && { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } && echo $$ >cgroup.procs && max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Add cgroup supportBart Van Assche
Maintain statistics per cgroup and export these to user space. These statistics are essential for verifying whether the proper I/O priorities have been assigned to requests. An example of the statistics data with this patch applied: $ cat /sys/fs/cgroup/io.stat 11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Track I/O statisticsBart Van Assche
Track I/O statistics per I/O priority and export these statistics to debugfs. These statistics help developers of the deadline scheduler. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Add I/O priority supportBart Van Assche
Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE and IDLE. Maintain statistics for each priority level. Split the debugfs attributes per priority level as follows: $ ls /sys/kernel/debug/block/.../sched/ async_depth dispatch2 read_next_rq write2_fifo_list batching read0_fifo_list starved write_next_rq dispatch0 read1_fifo_list write0_fifo_list dispatch1 read2_fifo_list write1_fifo_list Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Micro-optimize the batching algorithmBart Van Assche
When dispatching the first request of a batch, the deadline_move_request() call clears .next_rq[] for the opposite data direction. .next_rq[] is not restored when changing data direction. Fix this by not clearing .next_rq[] and by keeping track of the data direction of a batch in a variable instead. This patch is a micro-optimization because: - The number of deadline_next_request() calls for the read direction is halved. - The number of times that deadline_next_request() returns NULL is reduced. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-13-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Reserve 25% of scheduler tags for synchronous requestsBart Van Assche
For interactive workloads it is important that synchronous requests are not delayed. Hence reserve 25% of scheduler tags for synchronous requests. This patch still allows asynchronous requests to fill the hardware queues since blk_mq_init_sched() makes sure that the number of scheduler requests is the double of the hardware queue depth. From blk_mq_init_sched(): q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth, BLKDEV_MAX_RQ); Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-12-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Improve the sysfs show and store macrosBart Van Assche
Define separate macros for integers and jiffies to improve readability. Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol(). The former functions are the recommended functions. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Improve compile-time argument checkingBart Van Assche
Modern compilers complain if an out-of-range value is passed to a function argument that has an enumeration type. Let the compiler detect out-of-range data direction arguments instead of verifying the data_dir argument at runtime. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()Bart Van Assche
Change "queue" into "sched" to make the function names reflect better the purpose of these functions. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Remove two local variablesBart Van Assche
Make __dd_dispatch_request() easier to read by removing two local variables. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Add two lockdep_assert_held() statementsBart Van Assche
Document the locking strategy by adding two lockdep_assert_held() statements. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/mq-deadline: Add several commentsBart Van Assche
Make the code easier to read by adding more comments. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block: Introduce the ioprio rq-qos policyBart Van Assche
Introduce an rq-qos policy that assigns an I/O priority to requests based on blk-cgroup configuration settings. This policy has the following advantages over the ioprio_set() system call: - This policy is cgroup based so it has all the advantages of cgroups. - While ioprio_set() does not affect page cache writeback I/O, this rq-qos controller affects page cache writeback I/O for filesystems that support assiociating a cgroup with writeback I/O. See also Documentation/admin-guide/cgroup-v2.rst. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/blk-rq-qos: Move a function from a header file into a C fileBart Van Assche
rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function into in blk-mq-debugfs.c. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210618004456.7280-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() callsBart Van Assche
Before adding more calls in this function, simplify the error path. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210618004456.7280-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutiveBart Van Assche
These entries were consecutive at the time of their introduction but are no longer consecutive. Make these again consecutive. Additionally, modify the help text since it refers to blk-mq and since the legacy block layer has been removed. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210618004456.7280-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21netfilter: nf_tables_offload: check FLOW_DISSECTOR_KEY_BASIC in VLAN ↵Pablo Neira Ayuso
transfer logic The VLAN transfer logic should actually check for FLOW_DISSECTOR_KEY_BASIC, not FLOW_DISSECTOR_KEY_CONTROL. Moreover, do not fallback to case 2) .n_proto is set to 802.1q or 802.1ad, if FLOW_DISSECTOR_KEY_BASIC is unset. Fixes: 783003f3bb8a ("netfilter: nftables_offload: special ethertype handling for VLAN") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-21netfs: fix test for whether we can skip read when writing beyond EOFJeff Layton
It's not sufficient to skip reading when the pos is beyond the EOF. There may be data at the head of the page that we need to fill in before the write. Add a new helper function that corrects and clarifies the logic of when we can skip reads, and have it only zero out the part of the page that won't have data copied in for the write. Finally, don't set the page Uptodate after zeroing. It's not up to date since the write data won't have been copied in yet. [DH made the following changes: - Prefixed the new function with "netfs_". - Don't call zero_user_segments() for a full-page write. - Altered the beyond-last-page check to avoid a DIV instruction and got rid of then-redundant zero-length file check. ] Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper") Reported-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: ceph-devel@vger.kernel.org Link: https://lore.kernel.org/r/20210613233345.113565-1-jlayton@kernel.org/ Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2
2021-06-21netfilter: nf_tables: memleak in hw offload abort pathPablo Neira Ayuso
Release flow from the abort path, this is easy to reproduce since b72920f6e4a9 ("netfilter: nftables: counter hardware offload support"). If the preparation phase fails, then the abort path is exercised without releasing the flow rule object. unreferenced object 0xffff8881f0fa7700 (size 128): comm "nft", pid 1335, jiffies 4294931120 (age 4163.740s) hex dump (first 32 bytes): 08 e4 de 13 82 88 ff ff 98 e4 de 13 82 88 ff ff ................ 48 e4 de 13 82 88 ff ff 01 00 00 00 00 00 00 00 H............... backtrace: [<00000000634547e7>] flow_rule_alloc+0x26/0x80 [<00000000c8426156>] nft_flow_rule_create+0xc9/0x3f0 [nf_tables] [<0000000075ff8e46>] nf_tables_newrule+0xc79/0x10a0 [nf_tables] [<00000000ba65e40e>] nfnetlink_rcv_batch+0xaac/0xf90 [nfnetlink] [<00000000505c614a>] nfnetlink_rcv+0x1bb/0x1f0 [nfnetlink] [<00000000eb78e1fe>] netlink_unicast+0x34b/0x480 [<00000000a8f72c94>] netlink_sendmsg+0x3af/0x690 [<000000009cb1ddf4>] sock_sendmsg+0x96/0xa0 [<0000000039d06e44>] ____sys_sendmsg+0x3fe/0x440 [<00000000137e82ca>] ___sys_sendmsg+0xd8/0x140 [<000000000c6bf6a6>] __sys_sendmsg+0xb3/0x130 [<0000000043bd6268>] do_syscall_64+0x40/0xb0 [<00000000afdebc2d>] entry_SYSCALL_64_after_hwframe+0x44/0xae Remove flow rule release from the offload commit path, otherwise error from the offload commit phase might trigger a double-free due to the execution of the abort_offload -> abort. After this patch, the abort path takes care of releasing the flow rule. This fix also needs to move the nft_flow_rule_create() call before the transaction object is added otherwise the abort path might find a NULL pointer to the flow rule object for the NFT_CHAIN_HW_OFFLOAD case. While at it, rename BASIC-like goto tags to slightly more meaningful names rather than adding a new "err3" tag. Fixes: 63b48c73ff56 ("netfilter: nf_tables_offload: undo updates if transaction fails") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-21afs: Fix afs_write_end() to handle short writesDavid Howells
Fix afs_write_end() to correctly handle a short copy into the intended write region of the page. Two things are necessary: (1) If the page is not up to date, then we should just return 0 (ie. indicating a zero-length copy). The loop in generic_perform_write() will go around again, possibly breaking up the iterator into discrete chunks[1]. This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a for ceph. (2) The page should not have been set uptodate if it wasn't completely set up by netfs_write_begin() (this will be fixed in the next patch), so we need to set uptodate here in such a case. Also remove the assertion that was checking that the page was set uptodate since it's now set uptodate if it wasn't already a few lines above. The assertion was from when uptodate was set elsewhere. Changes: v3: Remove the handling of len exceeding the end of the page. Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper") Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1] Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
2021-06-21netfilter: nfnetlink_hook: fix check for snprintf() overflowDan Carpenter
The kernel version of snprintf() can't return negatives. The "ret > (int)sizeof(sym)" check is off by one because and it should be >=. Finally, we need to set a negative error code. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-21Merge branch 'dsa-cross-chip'David S. Miller
Vladimir Oltean says: ==================== Improvement for DSA cross-chip setups This series improves some aspects in multi-switch DSA tree topologies: - better device tree validation - better handling of MTU changes - better handling of multicast addresses - removal of some unused code ==================== Signed-off-by: David S. Miller <davem@davemloft.net>