summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-10-02net: add WARN_ONCE in kernel_sendpage() for improper zero-copy sendColy Li
If a page sent into kernel_sendpage() is a slab page or it doesn't have ref_count, this page is improper to send by the zero copy sendpage() method. Otherwise such page might be unexpected released in network code path and causes impredictable panic due to kernel memory management data structure corruption. This path adds a WARN_ON() on the sending page before sends it into the concrete zero-copy sendpage() method, if the page is improper for the zero-copy sendpage() method, a warning message can be observed before the consequential unpredictable kernel panic. This patch does not change existing kernel_sendpage() behavior for the improper page zero-copy send, it just provides hint warning message for following potential panic due the kernel memory heap corruption. Signed-off-by: Coly Li <colyli@suse.de> Cc: Cong Wang <amwang@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David S. Miller <davem@davemloft.net> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: introduce helper sendpage_ok() in include/linux/net.hColy Li
The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages. This patch introduces a helper sendpage_ok(), it returns true if the checking page, - is not slab page: PageSlab(page) is false. - has page refcount: page_count(page) is not zero All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page. Signed-off-by: Coly Li <colyli@suse.de> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Vlastimil Babka <vbabka@suse.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net/smscx5xx: change to of_get_mac_address() eth_platform_get_mac_address()Łukasz Stelmach
Use more generic eth_platform_get_mac_address() which can get a MAC address from other than DT platform specific sources too. Check if the obtained address is valid. Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: usb: pegasus: Proper error handing when setting pegasus' MAC addressPetko Manolov
v2: If reading the MAC address from eeprom fail don't throw an error, use randomly generated MAC instead. Either way the adapter will soldier on and the return type of set_ethernet_addr() can be reverted to void. v1: Fix a bug in set_ethernet_addr() which does not take into account possible errors (or partial reads) returned by its helpers. This can potentially lead to writing random data into device's MAC address registers. Signed-off-by: Petko Manolov <petko.manolov@konsulko.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02Merge branch 'Add skb_adjust_room() for SK_SKB'Alexei Starovoitov
John Fastabend says: ==================== This implements the helper skb_adjust_room() for BPF_SKS_SK_STREAM_VERDICT programs so we can push/pop headers from the data on recieve. One use case is to pop TLS headers off kTLS packets. The first patch implements the helper and the second updates test_sockmap to use it removing some case handling we had to do earlier to account for the TLS headers in the kTLS tests. v1->v2: Fix error path for TLS case (Daniel) check mode input is 0 because we don't use it now (Daniel) Remove incorrect/misleading comment (Lorenz) Thanks, John Acked-by: Martin KaFai Lau <kafai@fb.com> --- ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-10-02bpf, sockmap: Update selftests to use skb_adjust_roomJohn Fastabend
Instead of working around TLS headers in sockmap selftests use the new skb_adjust_room helper. This allows us to avoid special casing the receive side to skip headers. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160160100932.7052.3646935243867660528.stgit@john-Precision-5820-Tower
2020-10-02bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payloadJohn Fastabend
This implements a new helper skb_adjust_room() so users can push/pop extra bytes from a BPF_SK_SKB_STREAM_VERDICT program. Some protocols may include headers and other information that we may not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT program. One use case is to redirect TLS packets into a receive socket that doesn't expect TLS data. In TLS case the first 13B or so contain the protocol header. With KTLS the payload is decrypted so we should be able to redirect this to a receiving socket, but the receiving socket may not be expecting to receive a TLS header and discard the data. Using the above helper we can pop the header off and put an appropriate header on the payload. This allows for creating a proxy between protocols without extra hops through the stack or userspace. So in order to fix this case add skb_adjust_room() so users can strip the header. After this the user can strip the header and an unmodified receiver thread will work correctly when data is redirected into the ingress path of a sock. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160160099197.7052.8443193973242831692.stgit@john-Precision-5820-Tower
2020-10-02dt-bindings: net: dsa: b53: Add missing reg property to exampleKurt Kanzenbach
The switch has a certain MDIO address and this needs to be specified using the reg property. Add it to the example. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: core: document two new elements of struct net_deviceMauro Carvalho Chehab
As warned by "make htmldocs", there are two new struct elements that aren't documented: ../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device' ../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device' Fixes: 1fc70edb7d7b ("net: core: add nested_level variable in net_device") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02Merge branch 'bpf: BTF support for ksyms'Alexei Starovoitov
Hao Luo says: ==================== v3 -> v4: - Rebasing - Cast bpf_[per|this]_cpu_ptr's parameter to void __percpu * before passing into per_cpu_ptr. v2 -> v3: - Rename functions and variables in verifier for better readability. - Stick to logging message convention in libbpf. - Move bpf_per_cpu_ptr and bpf_this_cpu_ptr from trace-specific helper set to base helper set. - More specific test in ksyms_btf. - Fix return type cast in bpf_*_cpu_ptr. - Fix btf leak in ksyms_btf selftest. - Fix return error code for kallsyms_find(). v1 -> v2: - Move check_pseudo_btf_id from check_ld_imm() to replace_map_fd_with_map_ptr() and rename the latter. - Add bpf_this_cpu_ptr(). - Use bpf_core_types_are_compat() in libbpf.c for checking type compatibility. - Rewrite typed ksym extern type in BTF with int to save space. - Minor revision of bpf_per_cpu_ptr()'s comments. - Avoid using long in tests that use skeleton. - Refactored test_ksyms.c by moving kallsyms_find() to trace_helpers.c - Fold the patches that sync include/linux/uapi and tools/include/linux/uapi. rfc -> v1: - Encode VAR's btf_id for PSEUDO_BTF_ID. - More checks in verifier. Checking the btf_id passed as PSEUDO_BTF_ID is valid VAR, its name and type. - Checks in libbpf on type compatibility of ksyms. - Add bpf_per_cpu_ptr() to access kernel percpu vars. Introduced new ARG and RET types for this helper. This patch series extends the previously added __ksym externs with btf support. Right now the __ksym externs are treated as pure 64-bit scalar value. Libbpf replaces ld_imm64 insn of __ksym by its kernel address at load time. This patch series extend those externs with their btf info. Note that btf support for __ksym must come with the kernel btf that has VARs encoded to work properly. The corresponding chagnes in pahole is available at [1] (with a fix at [2] for gcc 4.9+). The first 3 patches in this series add support for general kernel global variables, which include verifier checking (01/06), libpf support (02/06) and selftests for getting typed ksym extern's kernel address (03/06). The next 3 patches extends that capability further by introducing helpers bpf_per_cpu_ptr() and bpf_this_cpu_ptr(), which allows accessing kernel percpu variables correctly (04/06 and 05/06). The tests of this feature were performed against pahole that is extended with [1] and [2]. For kernel BTF that does not have VARs encoded, the selftests will be skipped. [1] https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=f3d9054ba8ff1df0fc44e507e3a01c0964cabd42 [2] https://www.spinics.net/lists/dwarves/msg00451.html ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-10-02bpf/selftests: Test for bpf_per_cpu_ptr() and bpf_this_cpu_ptr()Hao Luo
Test bpf_per_cpu_ptr() and bpf_this_cpu_ptr(). Test two paths in the kernel. If the base pointer points to a struct, the returned reg is of type PTR_TO_BTF_ID. Direct pointer dereference can be applied on the returned variable. If the base pointer isn't a struct, the returned reg is of type PTR_TO_MEM, which also supports direct pointer dereference. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-7-haoluo@google.com
2020-10-02bpf: Introducte bpf_this_cpu_ptr()Hao Luo
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This helper always returns a valid pointer, therefore no need to check returned value for NULL. Also note that all programs run with preemption disabled, which means that the returned pointer is stable during all the execution of the program. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com
2020-10-02bpf: Introduce bpf_per_cpu_ptr()Hao Luo
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars. bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel except that it may return NULL. This happens when the cpu parameter is out of range. So the caller must check the returned value. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com
2020-10-02selftests/bpf: Ksyms_btf to test typed ksymsHao Luo
Selftests for typed ksyms. Tests two types of ksyms: one is a struct, the other is a plain int. This tests two paths in the kernel. Struct ksyms will be converted into PTR_TO_BTF_ID by the verifier while int typed ksyms will be converted into PTR_TO_MEM. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-4-haoluo@google.com
2020-10-02bpf/libbpf: BTF support for typed ksymsHao Luo
If a ksym is defined with a type, libbpf will try to find the ksym's btf information from kernel btf. If a valid btf entry for the ksym is found, libbpf can pass in the found btf id to the verifier, which validates the ksym's type and value. Typeless ksyms (i.e. those defined as 'void') will not have such btf_id, but it has the symbol's address (read from kallsyms) and its value is treated as a raw pointer. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-3-haoluo@google.com
2020-10-02bpf: Introduce pseudo_btf_idHao Luo
Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a ksym so that further dereferences on the ksym can use the BTF info to validate accesses. Internally, when seeing a pseudo_btf_id ld insn, the verifier reads the btf_id stored in the insn[0]'s imm field and marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND, which is encoded in btf_vminux by pahole. If the VAR is not of a struct type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID and the mem_size is resolved to the size of the VAR's type. >From the VAR btf_id, the verifier can also read the address of the ksym's corresponding kernel var from kallsyms and use that to fill dst_reg. Therefore, the proper functionality of pseudo_btf_id depends on (1) kallsyms and (2) the encoding of kernel global VARs in pahole, which should be available since pahole v1.18. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com
2020-10-02Merge tag 'pinctrl-v5.9-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "Some pin control fixes here. All of them are driver fixes, the Intel Cherryview being the most interesting one. - Fix a mux problem for I2C in the MVEBU driver. - Fix a really hairy inversion problem in the Intel Cherryview driver. - Fix the register for the sdc2_clk in the Qualcomm SM8250 driver. - Check the virtual GPIO boot failur in the Mediatek driver" * tag 'pinctrl-v5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: mediatek: check mtk_is_virt_gpio input parameter pinctrl: qcom: sm8250: correct sdc2_clk pinctrl: cherryview: Preserve CHV_PADCTRL1_INVRXTX_TXDATA flag on GPIOs pinctrl: mvebu: Fix i2c sda definition for 98DX3236
2020-10-02Merge tag 'pci-v5.9-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Fix rockchip regression in rockchip_pcie_valid_device() (Lorenzo Pieralisi) - Add Pali Rohár as aardvark PCI maintainer (Pali Rohár) * tag 'pci-v5.9-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: MAINTAINERS: Add Pali Rohár as aardvark PCI maintainer PCI: rockchip: Fix bus checks in rockchip_pcie_valid_device()
2020-10-02Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two patches in driver frameworks. The iscsi one corrects a bug induced by a BPF change to network locking and the other is a regression we introduced" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: iscsi: iscsi_tcp: Avoid holding spinlock while calling getpeername() scsi: target: Fix lun lookup for TARGET_SCF_LOOKUP_LUN_FROM_TAG case
2020-10-02Merge tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - fix for async buffered reads if read-ahead is fully disabled (Hao) - double poll match fix - ->show_fdinfo() potential ABBA deadlock complaint fix * tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block: io_uring: fix async buffered reads when readahead is disabled io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: always delete double poll wait entry on match
2020-10-02Merge tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fix from Jens Axboe: "Single fix for a ->commit_rqs failure case" * tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-block: blk-mq: call commit_rqs while list empty but error happen
2020-10-02Merge branch 'net-dsa-Improve-dsa_untag_bridge_pvid'David S. Miller
Florian Fainelli says: ==================== net: dsa: Improve dsa_untag_bridge_pvid() This patch series is based on the recent discussions with Vladimir: https://lore.kernel.org/netdev/20201001030623.343535-1-f.fainelli@gmail.com/ the simplest way forward was to call dsa_untag_bridge_pvid() after eth_type_trans() has been set which guarantees that skb->protocol is set to a correct value and this allows us to utilize __vlan_find_dev_deep_rcu() properly without playing or using the bridge master as a net_device reference. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: dsa: Utilize __vlan_find_dev_deep_rcu()Florian Fainelli
Now that we are guaranteed that dsa_untag_bridge_pvid() is called after eth_type_trans() we can utilize __vlan_find_dev_deep_rcu() which will take care of finding an 802.1Q upper on top of a bridge master. A common use case, prior to 12a1526d067 ("net: dsa: untag the bridge pvid from rx skbs") was to configure a bridge 802.1Q upper like this: ip link add name br0 type bridge vlan_filtering 0 ip link add link br0 name br0.1 type vlan id 1 in order to pop the default_pvid VLAN tag. With this change we restore that behavior while still allowing the DSA receive path to automatically pop the VLAN tag. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: dsa: Obtain VLAN protocol from skb->protocolFlorian Fainelli
Now that dsa_untag_bridge_pvid() is called after eth_type_trans() we are guaranteed that skb->protocol will be set to a correct value, thus allowing us to avoid calling vlan_eth_hdr(). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: dsa: b53: Set untag_bridge_pvidFlorian Fainelli
Indicate to the DSA receive path that we need to untage the bridge PVID, this allows us to remove the dsa_untag_bridge_pvid() calls from net/dsa/tag_brcm.c. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02net: dsa: Call dsa_untag_bridge_pvid() from dsa_switch_rcv()Florian Fainelli
When a DSA switch driver needs to call dsa_untag_bridge_pvid(), it can set dsa_switch::untag_brige_pvid to indicate this is necessary. This is a pre-requisite to making sure that we are always calling dsa_untag_bridge_pvid() after eth_type_trans() has been called. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-10-02 1) Add a full xfrm compatible layer for 32-bit applications on 64-bit kernels. From Dmitry Safonov. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02netlink: fix policy dump leakJohannes Berg
[ Upstream commit a95bc734e60449e7b073ff7ff70c35083b290ae9 ] If userspace doesn't complete the policy dump, we leak the allocated state. Fix this. Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02netlink: fix policy dump leakJohannes Berg
If userspace doesn't complete the policy dump, we leak the allocated state. Fix this. Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02Merge branch 'Do not limit cb_flags when creating child sk'Alexei Starovoitov
Martin KaFai says: ==================== This set fixes an issue that the bpf_skops_init_child() unnecessarily limited the child sk from inheriting all bpf_sock_ops_cb_flags of the listen sk. It also adds a test to check that. ==================== Tested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-10-02bpf: selftest: Ensure the child sk inherited all bpf_sock_ops_cb_flagsMartin KaFai Lau
This patch adds a test to ensure the child sk inherited everything from the bpf_sock_ops_cb_flags of the listen sk: 1. Sets one more cb_flags (BPF_SOCK_OPS_STATE_CB_FLAG) to the listen sk in test_tcp_hdr_options.c 2. Saves the skops->bpf_sock_ops_cb_flags when handling the newly established passive connection 3. CHECK() it is the same as the listen sk This also covers the fastopen case as the existing test_tcp_hdr_options.c does. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201002013454.2542367-1-kafai@fb.com
2020-10-02bpf: tcp: Do not limit cb_flags when creating child sk from listen skMartin KaFai Lau
The commit 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option") unnecessarily introduced bpf_skops_init_child() which limited the child sk from inheriting all bpf_sock_ops_cb_flags of the listen sk. That breaks existing user expectation. This patch removes the bpf_skops_init_child() and just allows sock_copy() to do its job to copy everything from listen sk to the child sk. Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option") Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201002013448.2542025-1-kafai@fb.com
2020-10-02net/mlx5e: Fix race condition on nhe->n pointer in neigh updateVlad Buslov
Current neigh update event handler implementation takes reference to neighbour structure, assigns it to nhe->n, tries to schedule workqueue task and releases the reference if task was already enqueued. This results potentially overwriting existing nhe->n pointer with another neighbour instance, which causes double release of the instance (once in neigh update handler that failed to enqueue to workqueue and another one in neigh update workqueue task that processes updated nhe->n pointer instead of original one): [ 3376.512806] ------------[ cut here ]------------ [ 3376.513534] refcount_t: underflow; use-after-free. [ 3376.521213] Modules linked in: act_skbedit act_mirred act_tunnel_key vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_umad ib_ipoib ib_iser rdma_cm ib_cm iw_cm rfkill ib_uverbs ib_core sunrpc kvm_intel kvm iTCO_wdt iTCO_vendor_support virtio_net irqbypass net_failover crc32_pclmul lpc_ich i2c_i801 failover pcspkr i2c_smbus mfd_core ghash_clmulni_intel sch_fq_codel drm i2c _core ip_tables crc32c_intel serio_raw [last unloaded: mlxfw] [ 3376.529468] CPU: 8 PID: 22756 Comm: kworker/u20:5 Not tainted 5.9.0-rc5+ #6 [ 3376.530399] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 3376.531975] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core] [ 3376.532820] RIP: 0010:refcount_warn_saturate+0xd8/0xe0 [ 3376.533589] Code: ff 48 c7 c7 e0 b8 27 82 c6 05 0b b6 09 01 01 e8 94 93 c1 ff 0f 0b c3 48 c7 c7 88 b8 27 82 c6 05 f7 b5 09 01 01 e8 7e 93 c1 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13 [ 3376.536017] RSP: 0018:ffffc90002a97e30 EFLAGS: 00010286 [ 3376.536793] RAX: 0000000000000000 RBX: ffff8882de30d648 RCX: 0000000000000000 [ 3376.537718] RDX: ffff8882f5c28f20 RSI: ffff8882f5c18e40 RDI: ffff8882f5c18e40 [ 3376.538654] RBP: ffff8882cdf56c00 R08: 000000000000c580 R09: 0000000000001a4d [ 3376.539582] R10: 0000000000000731 R11: ffffc90002a97ccd R12: 0000000000000000 [ 3376.540519] R13: ffff8882de30d600 R14: ffff8882de30d640 R15: ffff88821e000900 [ 3376.541444] FS: 0000000000000000(0000) GS:ffff8882f5c00000(0000) knlGS:0000000000000000 [ 3376.542732] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3376.543545] CR2: 0000556e5504b248 CR3: 00000002c6f10005 CR4: 0000000000770ee0 [ 3376.544483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3376.545419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3376.546344] PKRU: 55555554 [ 3376.546911] Call Trace: [ 3376.547479] mlx5e_rep_neigh_update.cold+0x33/0xe2 [mlx5_core] [ 3376.548299] process_one_work+0x1d8/0x390 [ 3376.548977] worker_thread+0x4d/0x3e0 [ 3376.549631] ? rescuer_thread+0x3e0/0x3e0 [ 3376.550295] kthread+0x118/0x130 [ 3376.550914] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3376.551675] ret_from_fork+0x1f/0x30 [ 3376.552312] ---[ end trace d84e8f46d2a77eec ]--- Fix the bug by moving work_struct to dedicated dynamically-allocated structure. This enabled every event handler to work on its own private neighbour pointer and removes the need for handling the case when task is already enqueued. Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Fix VLAN create flowAya Levin
When interface is attached while in promiscuous mode and with VLAN filtering turned off, both configurations are not respected and VLAN filtering is performed. There are 2 flows which add the any-vid rules during interface attach: VLAN creation table and set rx mode. Each is relaying on the other to add any-vid rules, eventually non of them does. Fix this by adding any-vid rules on VLAN creation regardless of promiscuous mode. Fixes: 9df30601c843 ("net/mlx5e: Restore vlan filter after seamless reset") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Fix VLAN cleanup flowAya Levin
Prior to this patch unloading an interface in promiscuous mode with RX VLAN filtering feature turned off - resulted in a warning. This is due to a wrong condition in the VLAN rules cleanup flow, which left the any-vid rules in the VLAN steering table. These rules prevented destroying the flow group and the flow table. The any-vid rules are removed in 2 flows, but none of them remove it in case both promiscuous is set and VLAN filtering is off. Fix the issue by changing the condition of the VLAN table cleanup flow to clean also in case of promiscuous mode. mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 20 wasn't destroyed, refcount > 1 mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 19 wasn't destroyed, refcount > 1 mlx5_core 0000:00:08.0: mlx5_destroy_flow_table:2112:(pid 28729): Flow table 262149 wasn't destroyed, refcount > 1 ... ... ------------[ cut here ]------------ FW pages counter is 11560 after reclaiming all pages WARNING: CPU: 1 PID: 28729 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:660 mlx5_reclaim_startup_pages+0x178/0x230 [mlx5_core] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: mlx5_function_teardown+0x2f/0x90 [mlx5_core] mlx5_unload_one+0x71/0x110 [mlx5_core] remove_one+0x44/0x80 [mlx5_core] pci_device_remove+0x3e/0xc0 device_release_driver_internal+0xfb/0x1c0 device_release_driver+0x12/0x20 pci_stop_bus_device+0x68/0x90 pci_stop_and_remove_bus_device+0x12/0x20 hv_eject_device_work+0x6f/0x170 [pci_hyperv] ? __schedule+0x349/0x790 process_one_work+0x206/0x400 worker_thread+0x34/0x3f0 ? process_one_work+0x400/0x400 kthread+0x126/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x22/0x30 ---[ end trace 6283bde8d26170dc ]--- Fixes: 9df30601c843 ("net/mlx5e: Restore vlan filter after seamless reset") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Fix return status when setting unsupported FEC modeAya Levin
Verify the configured FEC mode is supported by at least a single link mode before applying the command. Otherwise fail the command and return "Operation not supported". Prior to this patch, the command was successful, yet it falsely set all link modes to FEC auto mode - like configuring FEC mode to auto. Auto mode is the default configuration if a link mode doesn't support the configured FEC mode. Fixes: b5ede32d3329 ("net/mlx5e: Add support for FEC modes based on 50G per lane links") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Fix driver's declaration to support GRE offloadAya Levin
Declare GRE offload support with respect to the inner protocol. Add a list of supported inner protocols on which the driver can offload checksum and GSO. For other protocols, inform the stack to do the needed operations. There is no noticeable impact on GRE performance. Fixes: 2729984149e6 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: CT, Fix coverity issueMaor Dickman
The cited commit introduced the following coverity issue at function mlx5_tc_ct_rule_to_tuple_nat: - Memory - corruptions (OVERRUN) Overrunning array "tuple->ip.src_v6.in6_u.u6_addr32" of 4 4-byte elements at element index 7 (byte offset 31) using index "ip6_offset" (which evaluates to 7). In case of IPv6 destination address rewrite, ip6_offset values are between 4 to 7, which will cause memory overrun of array "tuple->ip.src_v6.in6_u.u6_addr32" to array "tuple->ip.dst_v6.in6_u.u6_addr32". Fixed by writing the value directly to array "tuple->ip.dst_v6.in6_u.u6_addr32" in case ip6_offset values are between 4 to 7. Fixes: bc562be9674b ("net/mlx5e: CT: Save ct entries tuples in hashtables") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTUAya Levin
Prior to this fix, in Striding RQ mode the driver was vulnerable when receiving packets in the range (stride size - headroom, stride size]. Where stride size is calculated by mtu+headroom+tailroom aligned to the closest power of 2. Usually, this filtering is performed by the HW, except for a few cases: - Between 2 VFs over the same PF with different MTUs - On bluefield, when the host physical function sets a larger MTU than the ARM has configured on its representor and uplink representor. When the HW filtering is not present, packets that are larger than MTU might be harmful for the RQ's integrity, in the following impacts: 1) Overflow from one WQE to the next, causing a memory corruption that in most cases is unharmful: as the write happens to the headroom of next packet, which will be overwritten by build_skb(). In very rare cases, high stress/load, this is harmful. When the next WQE is not yet reposted and points to existing SKB head. 2) Each oversize packet overflows to the headroom of the next WQE. On the last WQE of the WQ, where addresses wrap-around, the address of the remainder headroom does not belong to the next WQE, but it is out of the memory region range. This results in a HW CQE error that moves the RQ into an error state. Solution: Add a page buffer at the end of each WQE to absorb the leak. Actually the maximal overflow size is headroom but since all memory units must be of the same size, we use page size to comply with UMR WQEs. The increase in memory consumption is of a single page per RQ. Initialize the mkey with all MTTs pointing to a default page. When the channels are activated, UMR WQEs will redirect the RX WQEs to the actual memory from the RQ's pool, while the overflow MTTs remain mapped to the default page. Fixes: 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5e: Fix error path for RQ allocAya Levin
Increase granularity of the error path to avoid unneeded free/release. Fix the cleanup to be symmetric to the order of creation. Fixes: 0ddf543226ac ("xdp/mlx5: setup xdp_rxq_info") Fixes: 422d4c401edd ("net/mlx5e: RX, Split WQ objects for different RQ types") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: Fix request_irqs error flowMaor Gottlieb
Fix error flow handling in request_irqs which try to free irq that we failed to request. It fixes the below trace. WARNING: CPU: 1 PID: 7587 at kernel/irq/manage.c:1684 free_irq+0x4d/0x60 CPU: 1 PID: 7587 Comm: bash Tainted: G W OE 4.15.15-1.el7MELLANOXsmp-x86_64 #1 Hardware name: Advantech SKY-6200/SKY-6200, BIOS F2.00 08/06/2020 RIP: 0010:free_irq+0x4d/0x60 RSP: 0018:ffffc9000ef47af0 EFLAGS: 00010282 RAX: ffff88001476ae00 RBX: 0000000000000655 RCX: 0000000000000000 RDX: ffff88001476ae00 RSI: ffffc9000ef47ab8 RDI: ffff8800398bb478 RBP: ffff88001476a838 R08: ffff88001476ae00 R09: 000000000000156d R10: 0000000000000000 R11: 0000000000000004 R12: ffff88001476a838 R13: 0000000000000006 R14: ffff88001476a888 R15: 00000000ffffffe4 FS: 00007efeadd32740(0000) GS:ffff88047fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc9cc010008 CR3: 00000001a2380004 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: mlx5_irq_table_create+0x38d/0x400 [mlx5_core] ? atomic_notifier_chain_register+0x50/0x60 mlx5_load_one+0x7ee/0x1130 [mlx5_core] init_one+0x4c9/0x650 [mlx5_core] pci_device_probe+0xb8/0x120 driver_probe_device+0x2a1/0x470 ? driver_allows_async_probing+0x30/0x30 bus_for_each_drv+0x54/0x80 __device_attach+0xa3/0x100 pci_bus_add_device+0x4a/0x90 pci_iov_add_virtfn+0x2dc/0x2f0 pci_enable_sriov+0x32e/0x420 mlx5_core_sriov_configure+0x61/0x1b0 [mlx5_core] ? kstrtoll+0x22/0x70 num_vf_store+0x4b/0x70 [mlx5_core] kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x140 ? rcu_all_qs+0x5/0x80 ? _cond_resched+0x15/0x30 ? __sb_start_write+0x41/0x80 vfs_write+0xad/0x1a0 SyS_write+0x42/0x90 do_syscall_64+0x60/0x110 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 24163189da48 ("net/mlx5: Separate IRQ request/free from EQ life cycle") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessibleSaeed Mahameed
In case of pci is offline reclaim_pages_cmd() will still try to call the FW to release FW pages, cmd_exec() in this case will return a silent success without actually calling the FW. This is wrong and will cause page leaks, what we should do is to detect pci offline or command interface un-available before tying to access the FW and manually release the FW pages in the driver. In this patch we share the code to check for FW command interface availability and we call it in sensitive places e.g. reclaim_pages_cmd(). Alternative fix: 1. Remove MLX5_CMD_OP_MANAGE_PAGES form mlx5_internal_err_ret_value, command success simulation list. 2. Always Release FW pages even if cmd_exec fails in reclaim_pages_cmd(). Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: Add retry mechanism to the command entry index allocationEran Ben Elisha
It is possible that new command entry index allocation will temporarily fail. The new command holds the semaphore, so it means that a free entry should be ready soon. Add one second retry mechanism before returning an error. Patch "net/mlx5: Avoid possible free of command entry while timeout comp handler" increase the possibility to bump into this temporarily failure as it delays the entry index release for non-callback commands. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: poll cmd EQ in case of command timeoutEran Ben Elisha
Once driver detects a command interface command timeout, it warns the user and returns timeout error to the caller. In such case, the entry of the command is not evacuated (because only real event interrupt is allowed to clear command interface entry). If the HW event interrupt of this entry will never arrive, this entry will be left unused forever. Command interface entries are limited and eventually we can end up without the ability to post a new command. In addition, if driver will not consume the EQE of the lost interrupt and rearm the EQ, no new interrupts will arrive for other commands. Add a resiliency mechanism for manually polling the command EQ in case of a command timeout. In case resiliency mechanism will find non-handled EQE, it will consume it, and the command interface will be fully functional again. Once the resiliency flow finished, wait another 5 seconds for the command interface to complete for this command entry. Define mlx5_cmd_eq_recover() to manage the cmd EQ polling resiliency flow. Add an async EQ spinlock to avoid races between resiliency flows and real interrupts that might run simultaneously. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: Avoid possible free of command entry while timeout comp handlerEran Ben Elisha
Upon command completion timeout, driver simulates a forced command completion. In a rare case where real interrupt for that command arrives simultaneously, it might release the command entry while the forced handler might still access it. Fix that by adding an entry refcount, to track current amount of allowed handlers. Command entry to be released only when this refcount is decremented to zero. Command refcount is always initialized to one. For callback commands, command completion handler is the symmetric flow to decrement it. For non-callback commands, it is wait_func(). Before ringing the doorbell, increment the refcount for the real completion handler. Once the real completion handler is called, it will decrement it. For callback commands, once the delayed work is scheduled, increment the refcount. Upon callback command completion handler, we will try to cancel the timeout callback. In case of success, we need to decrement the callback refcount as it will never run. In addition, gather the entry index free and the entry free into a one flow for all command types release. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02net/mlx5: Fix a race when moving command interface to polling modeEran Ben Elisha
As part of driver unload, it destroys the commands EQ (via FW command). As the commands EQ is destroyed, FW will not generate EQEs for any command that driver sends afterwards. Driver should poll for later commands status. Driver commands mode metadata is updated before the commands EQ is actually destroyed. This can lead for double completion handle by the driver (polling and interrupt), if a command is executed and completed by FW after the mode was changed, but before the EQ was destroyed. Fix that by using the mlx5_cmd_allowed_opcode mechanism to guarantee that only DESTROY_EQ command can be executed during this time period. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-02Merge branch 'work.epoll' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull epoll fixes from Al Viro: "Several race fixes in epoll" * 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ep_create_wakeup_source(): dentry name can change under you... epoll: EPOLL_CTL_ADD: close the race in decision to take fast path epoll: replace ->visited/visited_list with generation count epoll: do not insert into poll queues until all sanity checks are done
2020-10-02Merge tag 'riscv-for-linus-5.9-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "Two fixes for this week: - The addition of a symbol export for clint_time_val, which has been inlined into some timex functions and can be used by drivers. - A fix to avoid calling get_cycles() before the timers have been probed. These both only effect !MMU systems" * tag 'riscv-for-linus-5.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: Check clint_time_val before use clocksource: clint: Export clint_time_val for modules
2020-10-02Merge tag 'for-5.9-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two more fixes. One is for a lockdep warning/lockup (also caught by syzbot), that one has been seen in practice. Regarding the other syzbot reports mentioned last time, they don't seem to be urgent and reliably reproducible so they'll be fixed later. The second fix is for a potential corruption when device replace finishes and the in-memory state of trim is not copied to the new device" * tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix filesystem corruption after a device replace btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
2020-10-02Merge tag 'pm-5.9-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix one more issue related to the recent RCU-lockdep changes, a typo in documentation and add a missing return statement to intel_pstate. Specifics: - Fix up RCU usage for cpuidle on the ARM imx6q platform (Ulf Hansson) - Fix typo in the PM documentation (Yoann Congal) - Add return statement that is missing after recent changes in the intel_pstate driver (Zhang Rui)" * tag 'pm-5.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ARM: imx6q: Fixup RCU usage for cpuidle Documentation: PM: Fix a reStructuredText syntax error cpufreq: intel_pstate: Fix missing return statement