summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-10-29net: fec_ptp: Use platform_get_irq_xxx_optional() to avoid error messageAnson Huang
Use platform_get_irq_byname_optional() and platform_get_irq_optional() instead of platform_get_irq_byname() and platform_get_irq() for optional IRQs to avoid below error message during probe: [ 0.795803] fec 30be0000.ethernet: IRQ pps not found [ 0.800787] fec 30be0000.ethernet: IRQ index 3 not found Signed-off-by: Anson Huang <Anson.Huang@nxp.com> Acked-by: Fugang Duan <fugang.duan@nxp.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net: fec_main: Use platform_get_irq_byname_optional() to avoid error messageAnson Huang
Failed to get irq using name is NOT fatal as driver will use index to get irq instead, use platform_get_irq_byname_optional() instead of platform_get_irq_byname() to avoid below error message during probe: [ 0.819312] fec 30be0000.ethernet: IRQ int0 not found [ 0.824433] fec 30be0000.ethernet: IRQ int1 not found [ 0.829539] fec 30be0000.ethernet: IRQ int2 not found Signed-off-by: Anson Huang <Anson.Huang@nxp.com> Acked-by: Fugang Duan <fugang.duan@nxp.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29MAINTAINERS: remove Dave Watson as TLS maintainerJakub Kicinski
Dave's Facebook email address is not working, and my attempts to contact him are failing. Let's remove it to trim down the list of TLS maintainers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29vxlan: check tun_info options_len properlyXin Long
This patch is to improve the tun_info options_len by dropping the skb when TUNNEL_VXLAN_OPT is set but options_len is less than vxlan_metadata. This can void a potential out-of-bounds access on ip_tun_info. Fixes: ee122c79d422 ("vxlan: Flow based tunneling") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29erspan: fix the tun_info options_len check for erspanXin Long
The check for !md doens't really work for ip_tunnel_info_opts(info) which only does info + 1. Also to avoid out-of-bounds access on info, it should ensure options_len is not less than erspan_metadata in both erspan_xmit() and ip6erspan_tunnel_xmit(). Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net: hisilicon: Fix ping latency when deal with high throughputJiangfeng Xiao
This is due to error in over budget processing. When dealing with high throughput, the used buffers that exceeds the budget is not cleaned up. In addition, it takes a lot of cycles to clean up the used buffer, and then the buffer where the valid data is located can take effect. Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net/mlx4_core: Dynamically set guaranteed amount of counters per VFEran Ben Elisha
Prior to this patch, the amount of counters guaranteed per VF in the resource tracker was MLX4_VF_COUNTERS_PER_PORT * MLX4_MAX_PORTS. It was set regardless if the VF was single or dual port. This caused several VFs to have no guaranteed counters although the system could satisfy their request. The fix is to dynamically guarantee counters, based on each VF specification. Fixes: 9de92c60beaa ("net/mlx4_core: Adjust counter grant policy in the resource tracker") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net/mlx5e: Initialize on stack link modes bitmapAya Levin
Initialize link modes bitmap on stack before using it, otherwise the outcome of ethtool set link ksettings might have unexpected values. Fixes: 4b95840a6ced ("net/mlx5e: Fix matching of speed to PRM link modes") Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Fix ethtool self test: link speedAya Levin
Ethtool self test contains a test for link speed. This test reads the PTYS register and determines whether the current speed is valid or not. Change current implementation to use the function mlx5e_port_linkspeed() that does the same check and fails when speed is invalid. This code redundancy lead to a bug when mlx5e_port_linkspeed() was updated with expended speeds and the self test was not. Fixes: 2c81bfd5ae56 ("net/mlx5e: Move port speed code from en_ethtool.c to en/port.c") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budgetMaxim Mikityanskiy
When CQE compression is enabled, compressed CQEs use the following structure: a title is followed by one or many blocks, each containing 8 mini CQEs (except the last, which may contain fewer mini CQEs). Due to NAPI budget restriction, a complete structure is not always parsed in one NAPI run, and some blocks with mini CQEs may be deferred to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call in the beginning of mlx5e_poll_rx_cq. However, if the budget is extremely low, some blocks may be left even after that, but the code that follows the mlx5e_decompress_cqes_cont call doesn't check it and assumes that a new CQE begins, which may not be the case. In such cases, random memory corruptions occur. An extremely low NAPI budget of 8 is used when busy_poll or busy_read is active. This commit adds a check to make sure that the previous compressed CQE has been completely parsed after mlx5e_decompress_cqes_cont, otherwise it prevents a new CQE from being fetched in the middle of a compressed CQE. This commit fixes random crashes in __build_skb, __page_pool_put_page and other not-related-directly places, that used to happen when both CQE compression and busy_poll/busy_read were enabled. Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Don't store direct pointer to action's tunnel infoVlad Buslov
Geneve implementation changed mlx5 tc to user direct pointer to tunnel_key action's internal struct ip_tunnel_info instance. However, this leads to use-after-free error when initial filter that caused creation of new encap entry is deleted or when tunnel_key action is manually overwritten through action API. Moreover, with recent TC offloads API unlocking change struct flow_action_entry->tunnel point to temporal copy of tunnel info that is deallocated after filter is offloaded to hardware which causes bug to reproduce every time new filter is attached to existing encap entry with following KASAN bug: [ 314.885555] ================================================================== [ 314.886641] BUG: KASAN: use-after-free in memcmp+0x2c/0x60 [ 314.886864] Read of size 1 at addr ffff88886c746280 by task tc/2682 [ 314.887179] CPU: 22 PID: 2682 Comm: tc Not tainted 5.3.0-rc7+ #703 [ 314.887188] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 314.887195] Call Trace: [ 314.887215] dump_stack+0x9a/0xf0 [ 314.887236] print_address_description+0x67/0x323 [ 314.887248] ? memcmp+0x2c/0x60 [ 314.887257] ? memcmp+0x2c/0x60 [ 314.887272] __kasan_report.cold+0x1a/0x3d [ 314.887474] ? __mlx5e_tc_del_fdb_peer_flow+0x100/0x1b0 [mlx5_core] [ 314.887484] ? memcmp+0x2c/0x60 [ 314.887509] kasan_report+0xe/0x12 [ 314.887521] memcmp+0x2c/0x60 [ 314.887662] mlx5e_tc_add_fdb_flow+0x51b/0xbe0 [mlx5_core] [ 314.887838] ? mlx5e_encap_take+0x110/0x110 [mlx5_core] [ 314.887902] ? lockdep_init_map+0x87/0x2c0 [ 314.887924] ? __init_waitqueue_head+0x4f/0x60 [ 314.888062] ? mlx5e_alloc_flow.isra.0+0x18c/0x1c0 [mlx5_core] [ 314.888207] __mlx5e_add_fdb_flow+0x2d7/0x440 [mlx5_core] [ 314.888359] ? mlx5e_tc_update_neigh_used_value+0x6f0/0x6f0 [mlx5_core] [ 314.888374] ? match_held_lock+0x2e/0x240 [ 314.888537] mlx5e_configure_flower+0x830/0x16a0 [mlx5_core] [ 314.888702] ? __mlx5e_add_fdb_flow+0x440/0x440 [mlx5_core] [ 314.888713] ? down_read+0x118/0x2c0 [ 314.888728] ? down_read_killable+0x300/0x300 [ 314.888882] ? mlx5e_rep_get_ethtool_stats+0x180/0x180 [mlx5_core] [ 314.888899] tc_setup_cb_add+0x127/0x270 [ 314.888937] fl_hw_replace_filter+0x2ac/0x380 [cls_flower] [ 314.888976] ? fl_hw_destroy_filter+0x1b0/0x1b0 [cls_flower] [ 314.888990] ? fl_change+0xbcf/0x27ef [cls_flower] [ 314.889030] ? fl_change+0xa57/0x27ef [cls_flower] [ 314.889069] fl_change+0x16bd/0x27ef [cls_flower] [ 314.889135] ? __rhashtable_insert_fast.constprop.0+0xa00/0xa00 [cls_flower] [ 314.889167] ? __radix_tree_lookup+0xa4/0x130 [ 314.889200] ? fl_get+0x169/0x240 [cls_flower] [ 314.889218] ? fl_walk+0x230/0x230 [cls_flower] [ 314.889249] tc_new_tfilter+0x5e1/0xd40 [ 314.889281] ? __rhashtable_insert_fast.constprop.0+0xa00/0xa00 [cls_flower] [ 314.889309] ? tc_del_tfilter+0xa30/0xa30 [ 314.889335] ? __lock_acquire+0x5b5/0x2460 [ 314.889378] ? find_held_lock+0x85/0xa0 [ 314.889442] ? tc_del_tfilter+0xa30/0xa30 [ 314.889465] rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.889488] ? rtnl_dellink+0x490/0x490 [ 314.889518] ? lockdep_hardirqs_on+0x260/0x260 [ 314.889538] ? netlink_deliver_tap+0xab/0x5a0 [ 314.889550] ? match_held_lock+0x1b/0x240 [ 314.889575] netlink_rcv_skb+0xd0/0x200 [ 314.889588] ? rtnl_dellink+0x490/0x490 [ 314.889605] ? netlink_ack+0x440/0x440 [ 314.889635] ? netlink_deliver_tap+0x161/0x5a0 [ 314.889648] ? lock_downgrade+0x360/0x360 [ 314.889657] ? lock_acquire+0xe5/0x210 [ 314.889686] netlink_unicast+0x296/0x350 [ 314.889707] ? netlink_attachskb+0x390/0x390 [ 314.889726] ? _copy_from_iter_full+0xe0/0x3a0 [ 314.889738] ? __virt_addr_valid+0xbb/0x130 [ 314.889771] netlink_sendmsg+0x394/0x600 [ 314.889800] ? netlink_unicast+0x350/0x350 [ 314.889817] ? move_addr_to_kernel.part.0+0x90/0x90 [ 314.889852] ? netlink_unicast+0x350/0x350 [ 314.889872] sock_sendmsg+0x96/0xa0 [ 314.889891] ___sys_sendmsg+0x482/0x520 [ 314.889919] ? copy_msghdr_from_user+0x250/0x250 [ 314.889930] ? __fput+0x1fa/0x390 [ 314.889941] ? task_work_run+0xb7/0xf0 [ 314.889957] ? exit_to_usermode_loop+0x117/0x120 [ 314.889972] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.889982] ? do_syscall_64+0x74/0xe0 [ 314.889992] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890012] ? mark_lock+0xac/0x9a0 [ 314.890028] ? __lock_acquire+0x5b5/0x2460 [ 314.890053] ? mark_lock+0xac/0x9a0 [ 314.890083] ? __lock_acquire+0x5b5/0x2460 [ 314.890112] ? match_held_lock+0x1b/0x240 [ 314.890144] ? __fget_light+0xa1/0xf0 [ 314.890166] ? sockfd_lookup_light+0x91/0xb0 [ 314.890187] __sys_sendmsg+0xba/0x130 [ 314.890201] ? __sys_sendmsg_sock+0xb0/0xb0 [ 314.890225] ? __blkcg_punt_bio_submit+0xd0/0xd0 [ 314.890264] ? lockdep_hardirqs_off+0xbe/0x100 [ 314.890274] ? mark_held_locks+0x24/0x90 [ 314.890286] ? do_syscall_64+0x1e/0xe0 [ 314.890308] do_syscall_64+0x74/0xe0 [ 314.890325] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890336] RIP: 0033:0x7f00ca33d7b8 [ 314.890348] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5 4 [ 314.890356] RSP: 002b:00007ffea2983928 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 314.890369] RAX: ffffffffffffffda RBX: 000000005d777d5b RCX: 00007f00ca33d7b8 [ 314.890377] RDX: 0000000000000000 RSI: 00007ffea2983990 RDI: 0000000000000003 [ 314.890384] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 314.890392] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001 [ 314.890400] R13: 000000000047f640 R14: 00007ffea2987b58 R15: 0000000000000021 [ 314.890529] Allocated by task 2687: [ 314.890684] save_stack+0x1b/0x80 [ 314.890694] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 314.890705] __kmalloc_track_caller+0x102/0x340 [ 314.890721] kmemdup+0x1d/0x40 [ 314.890730] tc_setup_flow_action+0x731/0x2c27 [ 314.890743] fl_hw_replace_filter+0x23b/0x380 [cls_flower] [ 314.890756] fl_change+0x16bd/0x27ef [cls_flower] [ 314.890765] tc_new_tfilter+0x5e1/0xd40 [ 314.890776] rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.890786] netlink_rcv_skb+0xd0/0x200 [ 314.890796] netlink_unicast+0x296/0x350 [ 314.890805] netlink_sendmsg+0x394/0x600 [ 314.890815] sock_sendmsg+0x96/0xa0 [ 314.890825] ___sys_sendmsg+0x482/0x520 [ 314.890834] __sys_sendmsg+0xba/0x130 [ 314.890844] do_syscall_64+0x74/0xe0 [ 314.890854] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.890937] Freed by task 2687: [ 314.891076] save_stack+0x1b/0x80 [ 314.891086] __kasan_slab_free+0x12c/0x170 [ 314.891095] kfree+0xeb/0x2f0 [ 314.891106] tc_cleanup_flow_action+0x69/0xa0 [ 314.891119] fl_hw_replace_filter+0x2c5/0x380 [cls_flower] [ 314.891132] fl_change+0x16bd/0x27ef [cls_flower] [ 314.891140] tc_new_tfilter+0x5e1/0xd40 [ 314.891151] rtnetlink_rcv_msg+0x4ab/0x5f0 [ 314.891161] netlink_rcv_skb+0xd0/0x200 [ 314.891170] netlink_unicast+0x296/0x350 [ 314.891180] netlink_sendmsg+0x394/0x600 [ 314.891190] sock_sendmsg+0x96/0xa0 [ 314.891200] ___sys_sendmsg+0x482/0x520 [ 314.891208] __sys_sendmsg+0xba/0x130 [ 314.891218] do_syscall_64+0x74/0xe0 [ 314.891228] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 314.891315] The buggy address belongs to the object at ffff88886c746280 which belongs to the cache kmalloc-96 of size 96 [ 314.891762] The buggy address is located 0 bytes inside of 96-byte region [ffff88886c746280, ffff88886c7462e0) [ 314.892196] The buggy address belongs to the page: [ 314.892387] page:ffffea0021b1d180 refcount:1 mapcount:0 mapping:ffff88835d00ef80 index:0x0 [ 314.892398] flags: 0x57ffffc0000200(slab) [ 314.892413] raw: 0057ffffc0000200 ffffea00219e0340 0000000800000008 ffff88835d00ef80 [ 314.892423] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 [ 314.892430] page dumped because: kasan: bad access detected [ 314.892515] Memory state around the buggy address: [ 314.892707] ffff88886c746180: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.892976] ffff88886c746200: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893251] >ffff88886c746280: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893522] ^ [ 314.893657] ffff88886c746300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 314.893924] ffff88886c746380: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc [ 314.894189] ================================================================== Fix the issue by duplicating tunnel info into per-encap copy that is deallocated with encap structure. Also, duplicate tunnel info in flow parse attribute to support cases when flow might be attached asynchronously. Fixes: 1f6da30697d0 ("net/mlx5e: Geneve, Keep tunnel info as pointer to the original struct") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5: Fix NULL pointer dereference in extended destinationEli Britstein
The cited commit refactored the encap id into a struct pointed from the destination. Bug fix for the case there is no encap for one of the destinations. Fixes: 2b688ea5efde ("net/mlx5: Add flow steering actions to fs_cmd shim layer") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5: Fix rtable reference leakParav Pandit
If the rt entry gateway family is not AF_INET for multipath device, rtable reference is leaked. Hence, fix it by releasing the reference. Fixes: 5fb091e8130b ("net/mlx5e: Use hint to resolve route when in HW multipath mode") Fixes: e32ee6c78efa ("net/mlx5e: Support tunnel encap over tagged Ethernet") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Only skip encap flows update when encap init failedVlad Buslov
When encap entry initialization completes successfully e->compl_result is set to positive value and not zero, like mlx5e_rep_update_flows() assumes at the moment. Fix the conditional to only skip encap flows update when e->compl_result < 0. Fixes: 2a1f1768fa17 ("net/mlx5e: Refactor neigh update for concurrent execution") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Replace kfree with kvfree when free vhca statsMaor Gottlieb
Memory allocated by kvzalloc should be freed by kvfree. Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent") Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Remove incorrect match criteria assignment lineDmytro Linkin
Driver have function, which enable match criteria for misc parameters in dependence of eswitch capabilities. Fixes: 4f5d1beadc10 ("Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5e: Determine source port properly for vlan push actionDmytro Linkin
Termination tables are used for vlan push actions on uplink ports. To support RoCE dual port the source port value was placed in a register. Fix the code to use an API method returning the source port according to the FW capabilities. Fixes: 10caabdaad5a ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29net/mlx5: Fix flow counter list auto bits structRoi Dayan
The union should contain the extended dest and counter list. Remove the resevered 0x40 bits which is redundant. This change doesn't break any functionally. Everything works today because the code in fs_cmd.c is using the correct structs if extended dest or the basic dest. Fixes: 1b115498598f ("net/mlx5: Introduce extended destination fields") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29Merge branch 'VLAN-fixes-for-Ocelot-switch'David S. Miller
Vladimir Oltean says: ==================== VLAN fixes for Ocelot switch This series addresses 2 issues with vlan_filtering=1: - Untagged traffic gets dropped unless commands are run in a very specific order. - Untagged traffic starts being transmitted as tagged after adding another untagged VID on the port. Tested on NXP LS1028A-RDB board. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net: mscc: ocelot: refuse to overwrite the port's native vlanVladimir Oltean
The switch driver keeps a "vid" variable per port, which signifies _the_ VLAN ID that is stripped on that port's egress (aka the native VLAN on a trunk port). That is the way the hardware is designed (mostly). The port->vid is programmed into REW:PORT:PORT_VLAN_CFG:PORT_VID and the rewriter is told to send all traffic as tagged except the one having port->vid. There exists a possibility of finer-grained egress untagging decisions: using the VCAP IS1 engine, one rule can be added to match every VLAN-tagged frame whose VLAN should be untagged, and set POP_CNT=1 as action. However, the IS1 can hold at most 512 entries, and the VLANs are in the order of 6 * 4096. So the code is fine for now. But this sequence of commands: $ bridge vlan add dev swp0 vid 1 pvid untagged $ bridge vlan add dev swp0 vid 2 untagged makes untagged and pvid-tagged traffic be sent out of swp0 as tagged with VID 1, despite user's request. Prevent that from happening. The user should temporarily remove the existing untagged VLAN (1 in this case), add it back as tagged, and then add the new untagged VLAN (2 in this case). Cc: Antoine Tenart <antoine.tenart@bootlin.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Fixes: 7142529f1688 ("net: mscc: ocelot: add VLAN filtering") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is upVladimir Oltean
Background information: the driver operates the hardware in a mode where a single VLAN can be transmitted as untagged on a particular egress port. That is the "native VLAN on trunk port" use case. Its value is held in port->vid. Consider the following command sequence (no network manager, all interfaces are down, debugging prints added by me): $ ip link add dev br0 type bridge vlan_filtering 1 $ ip link set dev swp0 master br0 Kernel code path during last command: br_add_slave -> ocelot_netdevice_port_event (NETDEV_CHANGEUPPER): [ 21.401901] ocelot_vlan_port_apply: port 0 vlan aware 0 pvid 0 vid 0 br_add_slave -> nbp_vlan_init -> switchdev_port_attr_set -> ocelot_port_attr_set (SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING): [ 21.413335] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 0 vid 0 br_add_slave -> nbp_vlan_init -> nbp_vlan_add -> br_switchdev_port_vlan_add -> switchdev_port_obj_add -> ocelot_port_obj_add -> ocelot_vlan_vid_add [ 21.667421] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 1 vid 1 So far so good. The bridge has replaced the driver's default pvid used in standalone mode (0) with its own default_pvid (1). The port's vid (native VLAN) has also changed from 0 to 1. $ ip link set dev swp0 up [ 31.722956] 8021q: adding VLAN 0 to HW filter on device swp0 do_setlink -> dev_change_flags -> vlan_vid_add -> ocelot_vlan_rx_add_vid -> ocelot_vlan_vid_add: [ 31.728700] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 1 vid 0 The 8021q module uses the .ndo_vlan_rx_add_vid API on .ndo_open to make ports be able to transmit and receive 802.1p-tagged traffic by default. This API is supposed to offload a VLAN sub-interface, which for a switch port means to add a VLAN that is not a pvid, and tagged on egress. But the driver implementation of .ndo_vlan_rx_add_vid is wrong: it adds back vid 0 as "egress untagged". Now back to the initial paragraph: there is a single untagged VID that the driver keeps track of, and that has just changed from 1 (the pvid) to 0. So this breaks the bridge core's expectation, because it has changed vid 1 from untagged to tagged, when what the user sees is. $ bridge vlan port vlan ids swp0 1 PVID Egress Untagged br0 1 PVID Egress Untagged But curiously, instead of manifesting itself as "untagged and pvid-tagged traffic gets sent as tagged on egress", the bug: - is hidden when vlan_filtering=0 - manifests as dropped traffic when vlan_filtering=1, due to this setting: if (port->vlan_aware && !port->vid) /* If port is vlan-aware and tagged, drop untagged and priority * tagged frames. */ val |= ANA_PORT_DROP_CFG_DROP_UNTAGGED_ENA | ANA_PORT_DROP_CFG_DROP_PRIO_S_TAGGED_ENA | ANA_PORT_DROP_CFG_DROP_PRIO_C_TAGGED_ENA; which would have made sense if it weren't for this bug. The setting's intention was "this is a trunk port with no native VLAN, so don't accept untagged traffic". So the driver was never expecting to set VLAN 0 as the value of the native VLAN, 0 was just encoding for "invalid". So the fix is to not send 802.1p traffic as untagged, because that would change the port's native vlan to 0, unbeknownst to the bridge, and trigger unexpected code paths in the driver. Cc: Antoine Tenart <antoine.tenart@bootlin.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Fixes: 7142529f1688 ("net: mscc: ocelot: add VLAN filtering") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29wimax: i2400: Fix memory leak in i2400m_op_rfkill_sw_toggleNavid Emamdoost
In the implementation of i2400m_op_rfkill_sw_toggle() the allocated buffer for cmd should be released before returning. The documentation for i2400m_msg_to_dev() says when it returns the buffer can be reused. Meaning cmd should be released in either case. Move kfree(cmd) before return to be reached by all execution paths. Fixes: 2507e6ab7a9a ("wimax: i2400: fix memory leak") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29drm/i915/tgl: Fix doc not corresponding to codeAnna Karas
Replace PLLs names used in documentation to that used in the code. Cc: Vandita Kulkarni <vandita.kulkarni@intel.com> Fixes: 68ff39c3f8c0 ("drm/i915/tgl: Add new pll ids") Signed-off-by: Anna Karas <anna.karas@intel.com> Reviewed-by: Vandita Kulkarni <vandita.kulkarni@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190926123559.15717-1-anna.karas@intel.com (cherry picked from commit d328bd4f905834c7d87a49962ebc96e397aab7b9) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-29io_uring: fix race with canceling timeoutsJens Axboe
If we get -1 from hrtimer_try_to_cancel(), we know that the timer is running. Hence leave all completion to the timeout handler. If we don't, we can corrupt the list and miss a completion. Fixes: 11365043e527 ("io_uring: add support for canceling timeout requests") Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29ceph: add missing check in d_revalidate snapdir handlingAl Viro
We should not play with dcache without parent locked... Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29ceph: fix RCU case handling in ceph_d_revalidate()Al Viro
For RCU case ->d_revalidate() is called with rcu_read_lock() and without pinning the dentry passed to it. Which means that it can't rely upon ->d_inode remaining stable; that's the reason for d_inode_rcu(), actually. Make sure we don't reload ->d_inode there. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29ceph: fix use-after-free in __ceph_remove_cap()Luis Henriques
KASAN reports a use-after-free when running xfstest generic/531, with the following trace: [ 293.903362] kasan_report+0xe/0x20 [ 293.903365] rb_erase+0x1f/0x790 [ 293.903370] __ceph_remove_cap+0x201/0x370 [ 293.903375] __ceph_remove_caps+0x4b/0x70 [ 293.903380] ceph_evict_inode+0x4e/0x360 [ 293.903386] evict+0x169/0x290 [ 293.903390] __dentry_kill+0x16f/0x250 [ 293.903394] dput+0x1c6/0x440 [ 293.903398] __fput+0x184/0x330 [ 293.903404] task_work_run+0xb9/0xe0 [ 293.903410] exit_to_usermode_loop+0xd3/0xe0 [ 293.903413] do_syscall_64+0x1a0/0x1c0 [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens because __ceph_remove_cap() may queue a cap release (__ceph_queue_cap_release) which can be scheduled before that cap is removed from the inode list with rb_erase(&cap->ci_node, &ci->i_caps); And, when this finally happens, the use-after-free will occur. This can be fixed by removing the cap from the inode list before being removed from the session list, and thus eliminating the risk of an UAF. Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29io_uring: support for larger fixed file setsJens Axboe
There's been a few requests for supporting more fixed files than 1024. This isn't really tricky to do, we just need to split up the file table into multiple tables and index appropriately. As we do so, reduce the max single file table to 512. This enables us to do single page allocs always for the tables, which is an improvement over the situation prior. This patch adds support for up to 64K files, which should be enough for everyone. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: protect fixed file indexing with array_index_nospec()Jens Axboe
We index the file tables with a user given value. After we check it's within our limits, use array_index_nospec() to prevent any spectre attacks here. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for IORING_OP_ACCEPTJens Axboe
This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29net: add __sys_accept4_file() helperJens Axboe
This is identical to __sys_accept4(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags. No functional changes in this patch. Cc: netdev@vger.kernel.org Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: io_uring: add support for async work inheriting filesJens Axboe
This is in preparation for adding opcodes that need to add new files in a process file table, system calls like open(2) or accept4(2). If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work item. If work that needs to get punted to async context have this set, the async worker will assume the original task file table before executing the work. Note that opcodes that need access to the current files of an application cannot be done through IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: replace workqueue usage with io-wqJens Axboe
Drop various work-arounds we have for workqueues: - We no longer need the async_list for tracking sequential IO. - We don't have to maintain our own mm tracking/setting. - We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files. - We can properly cancel pending interruptible work. This fixes deadlocks with particularly socket IO, where we cannot cancel them when the io_uring is closed. Hence the ring will wait forever for these requests to complete, which may never happen. This is different from disk IO where we know requests will complete in a finite amount of time. - Due to being able to cancel work interruptible work that is already running, we can implement file table support for work. We need that for supporting system calls that add to a process file table. - It gets us one step closer to adding async support for any system call. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io-wq: small threadpool implementation for io_uringJens Axboe
This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are: - We can assign memory context smarter and more persistently if we manage the life time of threads. - We can drop various work-arounds we have in io_uring, like the async_list. - We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers. - We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets. - We need the above cancel for being able to assign and use file tables from a process. - We can implement a more thorough cancel operation in general. - We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads. This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node. io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29drm/panfrost: Don't dereference bogus MMU pointersRobin Murphy
It seems that killing an application while faults are occurring (particularly with a GPU in FPGA at a whopping 40MHz) can lead to handling a lingering page fault after all the address space contexts have already been freed. In this situation, the LRU list is empty so addr_to_drm_mm_node() ends up dereferencing the list head as if it were a struct panfrost_mmu entry; this leaves "mmu->as" actually pointing at the pfdev->alloc_mask bitmap, which is also empty, and given that the fault has a high likelihood of being in AS0, hilarity ensues. Sadly, the cleanest solution seems to involve another goto. Oh well, at least it's robust... Fixes: 65e51e30d862 ("drm/panfrost: Prevent race when handling page fault") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/9a0b09e6b5851f0d4428b72dd6b8b4c0d0ef4206.1572293305.git.robin.murphy@arm.com
2019-10-29drm/panfrost: fix -Wmissing-prototypes warningsYi Wang
We get these warnings when build kernel W=1: drivers/gpu/drm/panfrost/panfrost_perfcnt.c:35:6: warning: no previous prototype for ‘panfrost_perfcnt_clean_cache_done’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:40:6: warning: no previous prototype for ‘panfrost_perfcnt_sample_done’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:190:5: warning: no previous prototype for ‘panfrost_ioctl_perfcnt_enable’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:218:5: warning: no previous prototype for ‘panfrost_ioctl_perfcnt_dump’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:250:6: warning: no previous prototype for ‘panfrost_perfcnt_close’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:264:5: warning: no previous prototype for ‘panfrost_perfcnt_init’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_perfcnt.c:320:6: warning: no previous prototype for ‘panfrost_perfcnt_fini’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_mmu.c:227:6: warning: no previous prototype for ‘panfrost_mmu_flush_range’ [-Wmissing-prototypes] drivers/gpu/drm/panfrost/panfrost_mmu.c:435:5: warning: no previous prototype for ‘panfrost_mmu_map_fault_addr’ [-Wmissing-prototypes] For file panfrost_mmu.c, make functions static to fix this. For file panfrost_perfcnt.c, include header file can fix this. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Reviewed-by: Steven Price <steven.price@arm.com> Cc: stable@vger.kernel.org [robh: fixup function parameter alignment] Signed-off-by: Rob Herring <robh@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/1571967015-42854-1-git-send-email-wang.yi59@zte.com.cn
2019-10-29net: hisilicon: Fix "Trying to free already-free IRQ"Jiangfeng Xiao
When rmmod hip04_eth.ko, we can get the following warning: Task track: rmmod(1623)>bash(1591)>login(1581)>init(1) ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1623 at kernel/irq/manage.c:1557 __free_irq+0xa4/0x2ac() Trying to free already-free IRQ 200 Modules linked in: ping(O) pramdisk(O) cpuinfo(O) rtos_snapshot(O) interrupt_ctrl(O) mtdblock mtd_blkdevrtfs nfs_acl nfs lockd grace sunrpc xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables nf_reject_ipv CPU: 0 PID: 1623 Comm: rmmod Tainted: G O 4.4.193 #1 Hardware name: Hisilicon A15 [<c020b408>] (rtos_unwind_backtrace) from [<c0206624>] (show_stack+0x10/0x14) [<c0206624>] (show_stack) from [<c03f2be4>] (dump_stack+0xa0/0xd8) [<c03f2be4>] (dump_stack) from [<c021a780>] (warn_slowpath_common+0x84/0xb0) [<c021a780>] (warn_slowpath_common) from [<c021a7e8>] (warn_slowpath_fmt+0x3c/0x68) [<c021a7e8>] (warn_slowpath_fmt) from [<c026876c>] (__free_irq+0xa4/0x2ac) [<c026876c>] (__free_irq) from [<c0268a14>] (free_irq+0x60/0x7c) [<c0268a14>] (free_irq) from [<c0469e80>] (release_nodes+0x1c4/0x1ec) [<c0469e80>] (release_nodes) from [<c0466924>] (__device_release_driver+0xa8/0x104) [<c0466924>] (__device_release_driver) from [<c0466a80>] (driver_detach+0xd0/0xf8) [<c0466a80>] (driver_detach) from [<c0465e18>] (bus_remove_driver+0x64/0x8c) [<c0465e18>] (bus_remove_driver) from [<c02935b0>] (SyS_delete_module+0x198/0x1e0) [<c02935b0>] (SyS_delete_module) from [<c0202ed0>] (__sys_trace_return+0x0/0x10) ---[ end trace bb25d6123d849b44 ]--- Currently "rmmod hip04_eth.ko" call free_irq more than once as devres_release_all and hip04_remove both call free_irq. This results in a 'Trying to free already-free IRQ' warning. To solve the problem free_irq has been moved out of hip04_remove. Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29fjes: Handle workqueue allocation failureWill Deacon
In the highly unlikely event that we fail to allocate either of the "/txrx" or "/control" workqueues, we should bail cleanly rather than blindly march on with NULL queue pointer(s) installed in the 'fjes_adapter' instance. Cc: "David S. Miller" <davem@davemloft.net> Reported-by: Nicolas Waisman <nico@semmle.com> Link: https://lore.kernel.org/lkml/CADJ_3a8WFrs5NouXNqS5WYe7rebFP+_A5CheeqAyD_p7DFJJcg@mail.gmail.com/ Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29arm64: cpufeature: Enable Qualcomm Falkor/Kryo errata 1003Bjorn Andersson
With the introduction of 'cce360b54ce6 ("arm64: capabilities: Filter the entries based on a given mask")' the Qualcomm Falkor/Kryo errata 1003 is no long applied. The result of not applying errata 1003 is that MSM8996 runs into various RCU stalls and fails to boot most of the times. Give 1003 a "type" to ensure they are not filtered out in update_cpu_capabilities(). Fixes: cce360b54ce6 ("arm64: capabilities: Filter the entries based on a given mask") Cc: stable@vger.kernel.org Reported-by: Mark Brown <broonie@kernel.org> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
2019-10-29drm/etnaviv: fix dumping of iommuv2Christian Gmeiner
etnaviv_iommuv2_dump_size(..) returns the number of PTE * SZ_4K but etnaviv_iommuv2_dump(..) increments buf pointer even if there is no PTE. This results in a bad buf pointer which gets used for memcpy(..), when copying the MMU state in the coredump buffer. Fixes: afb7b3b1deb4 ("drm/etnaviv: implement IOMMUv2 translation") Cc: stable@vger.kernel.org Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
2019-10-29drm/etnaviv: reinstate MMUv1 command buffer window checkLucas Stach
The switch to per-process address spaces erroneously dropped the check which validated that the command buffer is mapped through the linear apperture as required by the hardware. This turned a system misconfiguration with a helpful error message into a very hard to debug issue. Reinstate the check at the appropriate location. Fixes: 17e4660ae3d7 (drm/etnaviv: implement per-process address spaces on MMUv2) Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Guido Günther <agx@sigxcpu.org>
2019-10-29drm/etnaviv: fix deadlock in GPU coredumpLucas Stach
The GPU coredump function violates the locking order by holding the MMU context lock while trying to acquire the etnaviv_gem_object lock. This results in a possible ABBA deadlock with other codepaths which follow the established locking order. Fortunately this is easy to fix by dropping the MMU context lock earlier, as the BO dumping doesn't need the MMU context to be stable. The only thing the BO dumping cares about are the BO mappings, which are stable across the lifetime of the job. Fixes: 27b67278e007 (drm/etnaviv: rework MMU handling) [ Not really the first bad commit, but the one where this fix applies cleanly. Stable kernels need a manual backport. ] Reported-by: Christian Gmeiner <christian.gmeiner@gmail.com> Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Tested-by: Christian Gmeiner <christian.gmeiner@gmail.com>
2019-10-29Merge tag 'fuse-fixes-5.4-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Mostly virtiofs fixes, but also fixes a regression and couple of longstanding data/metadata writeback ordering issues" * tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: redundant get_fuse_inode() calls in fuse_writepages_fill() fuse: Add changelog entries for protocols 7.1 - 7.8 fuse: truncate pending writes on O_TRUNC fuse: flush dirty data/metadata before non-truncate setattr virtiofs: Remove set but not used variable 'fc' virtiofs: Retry request submission from worker context virtiofs: Count pending forgets as in_flight forgets virtiofs: Set FR_SENT flag only after request has been sent virtiofs: No need to check fpq->connected state virtiofs: Do not end request in submission context fuse: don't advise readdirplus for negative lookup fuse: don't dereference req->args on finished request virtio-fs: don't show mount options virtio-fs: Change module name to virtiofs.ko
2019-10-29io_uring: Fix mm_fault with READ/WRITE_FIXEDPavel Begunkov
Commit fb5ccc98782f ("io_uring: Fix broken links with offloading") introduced a potential performance regression with unconditionally taking mm even for READ/WRITE_FIXED operations. Return the logic handling it back. mm-faulted requests will go through the generic submission path, so honoring links and drains, but will fail further on req->has_user check. Fixes: fb5ccc98782f ("io_uring: Fix broken links with offloading") Cc: stable@vger.kernel.org # v5.4 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: remove index from sqe_submitPavel Begunkov
submit->index is used only for inbound check in submission path (i.e. head < ctx->sq_entries). However, it always will be true, as 1. it's already validated by io_get_sqring() 2. ctx->sq_entries can't be changedd in between, because of held ctx->uring_lock and ctx->refs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add set of tracing eventsDmitrii Dolgov
To trace io_uring activity one can get an information from workqueue and io trace events, but looks like some parts could be hard to identify via this approach. Making what happens inside io_uring more transparent is important to be able to reason about many aspects of it, hence introduce the set of tracing events. All such events could be roughly divided into two categories: * those, that are helping to understand correctness (from both kernel and an application point of view). E.g. a ring creation, file registration, or waiting for available CQE. Proposed approach is to get a pointer to an original structure of interest (ring context, or request), and then find relevant events. io_uring_queue_async_work also exposes a pointer to work_struct, to be able to track down corresponding workqueue events. * those, that provide performance related information. Mostly it's about events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized. Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for canceling timeout requestsJens Axboe
We might have cases where the need for a specific timeout is gone, add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: add support for absolute timeoutsJens Axboe
This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: replace s->needs_lock with s->in_asyncJackie Liu
There is no function change, just to clean up the code, use s->in_async to make the code know where it is. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29io_uring: allow application controlled CQ ring sizeJens Axboe
We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE life time is different than that of the IO request itself, the SQE is consumed as soon as the kernel has seen the entry. Certain application don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently. If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two. Signed-off-by: Jens Axboe <axboe@kernel.dk>