summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2018-09-29tls: Remove redundant vars from tls record structureVakul Garg
Structure 'tls_rec' contains sg_aead_in and sg_aead_out which point to a aad_space and then chain scatterlists sg_plaintext_data, sg_encrypted_data respectively. Rather than using chained scatterlists for plaintext and encrypted data in aead_req, it is efficient to store aad_space in sg_encrypted_data and sg_plaintext_data itself in the first index and get rid of sg_aead_in, sg_aead_in and further chaining. This requires increasing size of sg_encrypted_data & sg_plaintext_data arrarys by 1 to accommodate entry for aad_space. The code which uses sg_encrypted_data and sg_plaintext_data has been modified to skip first index as it points to aad_space. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28Bluetooth: Fix debugfs NULL pointer dereferenceMatias Karhumaa
Fix crash caused by NULL pointer dereference when debugfs functions le_max_key_read, le_max_key_size_write, le_min_key_size_read or le_min_key_size_write and Bluetooth adapter was powered off. Fix is to move max_key_size and min_key_size from smp_dev to hci_dev. At the same time they were renamed to le_max_key_size and le_min_key_size. BUG: unable to handle kernel NULL pointer dereference at 00000000000002e8 PGD 0 P4D 0 Oops: 0000 [#24] SMP PTI CPU: 2 PID: 6255 Comm: cat Tainted: G D OE 4.18.9-200.fc28.x86_64 #1 Hardware name: LENOVO 4286CTO/4286CTO, BIOS 8DET76WW (1.46 ) 06/21/2018 RIP: 0010:le_max_key_size_read+0x45/0xb0 [bluetooth] Code: 00 00 00 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 87 c8 00 00 00 48 8d 7c 24 04 48 8b 80 48 0a 00 00 <48> 8b 80 e8 02 00 00 0f b6 48 52 e8 fb b6 b3 ed be 04 00 00 00 48 RSP: 0018:ffffab23c3ff3df0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00007f0b4ca2e000 RCX: ffffab23c3ff3f08 RDX: ffffffffc0ddb033 RSI: 0000000000000004 RDI: ffffab23c3ff3df4 RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000000 R10: ffffab23c3ff3ed8 R11: 0000000000000000 R12: ffffab23c3ff3f08 R13: 00007f0b4ca2e000 R14: 0000000000020000 R15: ffffab23c3ff3f08 FS: 00007f0b4ca0f540(0000) GS:ffff91bd5e280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002e8 CR3: 00000000629fa006 CR4: 00000000000606e0 Call Trace: full_proxy_read+0x53/0x80 __vfs_read+0x36/0x180 vfs_read+0x8a/0x140 ksys_read+0x4f/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Matias Karhumaa <matias.karhumaa@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-28Merge tag 'mac80211-for-davem-2018-09-27' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== More patches than I'd like perhaps, but each seems reasonable: * two new spectre-v1 mitigations in nl80211 * TX status fix in general, and mesh in particular * powersave vs. offchannel fix * regulatory initialization fix * fix for a queue hang due to a bad return value * allocate TXQs for active monitor interfaces, fixing my earlier patch to avoid unnecessary allocations where I missed this case needed them * fix TDLS data frames priority assignment * fix scan results processing to take into account duplicate channel numbers (over different operating classes, but we don't necessarily know the operating class) * various hwsim fixes for radio destruction and new radio announcement messages * remove an extraneous kernel-doc line ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28netlink: add nested array policy validationJohannes Berg
Sometimes nested netlink attributes are just used as arrays, with the nla_type() of each not being used; we have this in nl80211 and e.g. NFTA_SET_ELEM_LIST_ELEMENTS. Add the ability to validate this type of message directly in the policy, by adding the type NLA_NESTED_ARRAY which does exactly this: require a first level of nesting but ignore the attribute type, and then inside each require a second level of nested and validate those attributes against a given policy (if present). Note that some nested array types actually require that all of the entries have the same index, this is possible to express in a nested policy already, apart from the validation that only the one allowed type is used. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28netlink: allow NLA_NESTED to specify nested policy to validateJohannes Berg
Now that we have a validation_data pointer, and the len field in the policy is unused for NLA_NESTED, we can allow using them both to have nested validation. This can be nice in code, although we still have to use nla_parse_nested() or similar which would also take a policy; however, it also serves as documentation in the policy without requiring a look at the code. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28netlink: make validation_data constJohannes Berg
The validation data is only used within the policy that should usually already be const, and isn't changed in any code that uses it. Therefore, make the validation_data pointer const. While at it, remove the duplicate variable in the bitfield validation that I'd otherwise have to change to const. Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28netlink: remove NLA_NESTED_COMPATJohannes Berg
This isn't used anywhere, so we might as well get rid of it. Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28netfilter: nf_tables: add SECMARK supportChristian Göttsche
Add the ability to set the security context of packets within the nf_tables framework. Add a nft_object for holding security contexts in the kernel and manipulating packets on the wire. Convert the security context strings at rule addition time to security identifiers. This is the same behavior like in xt_SECMARK and offers better performance than computing it per packet. Set the maximum security context length to 256. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-27Bluetooth: L2CAP: Derive rx credits from MTU and MPSLuiz Augusto von Dentz
Give enough rx credits for a full packet instead of using an arbitrary number which may not be enough depending on the MTU and MPS which can cause interruptions while waiting for more credits, also remove debugfs entry for l2cap_le_max_credits. With these changes the credits are restored after each SDU is received instead of using fixed threshold, this way it is garanteed that there will always be enough credits to send a packet without waiting more credits to arrive. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-27Bluetooth: L2CAP: Derive MPS from connection MTULuiz Augusto von Dentz
This ensures the MPS can fit in a single HCI fragment so each segment don't have to be reassembled at HCI level, in addition to that also remove the debugfs entry to configure the MPS. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-27Bluetooth: Add definitions and track LE resolve list modificationAnkit Navik
Add the definitions for adding entries to the LE resolve list and removing entries from the LE resolve list. When the LE resolve list gets changed via HCI commands make sure that the internal storage of the resolve list entries gets updated. Signed-off-by: Ankit Navik <ankit.p.navik@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-26net-ipv4: remove 2 always zero parameters from ipv4_redirect()Maciej Żenczykowski
(the parameters in question are mark and flow_flags) Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()Maciej Żenczykowski
(the parameters in question are mark and flow_flags) Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26bonding: avoid possible dead-lockMahesh Bandewar
Syzkaller reported this on a slightly older kernel but it's still applicable to the current kernel - ====================================================== WARNING: possible circular locking dependency detected 4.18.0-next-20180823+ #46 Not tainted ------------------------------------------------------ syz-executor4/26841 is trying to acquire lock: 00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652 but task is already holding lock: 00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline] 00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (rtnl_mutex){+.+.}: __mutex_lock_common kernel/locking/mutex.c:925 [inline] __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77 bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline] bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320 process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x35a/0x420 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415 -> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}: process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x35a/0x420 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415 -> #0 ((wq_completion)bond_dev->name){+.+.}: lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901 flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655 drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820 destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155 __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138 bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734 register_netdevice+0x337/0x1100 net/core/dev.c:8410 bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453 rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099 rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115 __sys_sendmsg+0x11d/0x290 net/socket.c:2153 __do_sys_sendmsg net/socket.c:2162 [inline] __se_sys_sendmsg net/socket.c:2160 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock((work_completion)(&(&nnw->work)->work)); lock(rtnl_mutex); lock((wq_completion)bond_dev->name); *** DEADLOCK *** 1 lock held by syz-executor4/26841: stack backtrace: CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ #46 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222 check_prev_add kernel/locking/lockdep.c:1862 [inline] check_prevs_add kernel/locking/lockdep.c:1975 [inline] validate_chain kernel/locking/lockdep.c:2416 [inline] __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412 lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901 flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655 drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820 destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155 __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138 bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734 register_netdevice+0x337/0x1100 net/core/dev.c:8410 bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453 rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099 rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115 __sys_sendmsg+0x11d/0x290 net/socket.c:2153 __do_sys_sendmsg net/socket.c:2162 [inline] __se_sys_sendmsg net/socket.c:2160 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457089 Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089 RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003 RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001 Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net/af_iucv: locate IUCV header via skb_network_header()Julian Wiedmann
This patch attempts to untangle the TX and RX code in qeth from af_iucv's respective HiperTransport path: On the TX side, pointing skb_network_header() at the IUCV header means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset to access the header. On the RX side, qeth pulls the (fake) L2 header off the skb like any normal ethernet driver would. This makes working with the IUCV header in af_iucv easier, since we no longer have to assume a fixed skb layout. While at it, replace the open-coded length checks in af_iucv's RX path with pskb_may_pull(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26cfg80211: fix reg_query_regdb_wmm kernel-docRandy Dunlap
Drop @ptr from kernel-doc for function reg_query_regdb_wmm(). This function parameter was recently removed so update the kernel-doc to match that and remove the kernel-doc warnings. Removes 109 occurrences of this warning message: ../include/net/cfg80211.h:4869: warning: Excess function parameter 'ptr' description in 'reg_query_regdb_wmm' Fixes: 38cb87ee47fb ("cfg80211: make wmm_rule part of the reg_rule structure") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Johannes Berg <johannes.berg@intel.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: linux-wireless@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-09-25 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Allow for RX stack hardening by implementing the kernel's flow dissector in BPF. Idea was originally presented at netconf 2017 [0]. Quote from merge commit: [...] Because of the rigorous checks of the BPF verifier, this provides significant security guarantees. In particular, the BPF flow dissector cannot get inside of an infinite loop, as with CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot read outside of packet bounds, because all memory accesses are checked. Also, with BPF the administrator can decide which protocols to support, reducing potential attack surface. Rarely encountered protocols can be excluded from dissection and the program can be updated without kernel recompile or reboot if a bug is discovered. [...] Also, a sample flow dissector has been implemented in BPF as part of this work, from Petar and Willem. [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf 2) Add support for bpftool to list currently active attachment points of BPF networking programs providing a quick overview similar to bpftool's perf subcommand, from Yonghong. 3) Fix a verifier pruning instability bug where a union member from the register state was not cleared properly leading to branches not being pruned despite them being valid candidates, from Alexei. 4) Various smaller fast-path optimizations in XDP's map redirect code, from Jesper. 5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps in bpftool, from Roman. 6) Remove a duplicate check in libbpf that probes for function storage, from Taeung. 7) Fix an issue in test_progs by avoid checking for errno since on success its value should not be checked, from Mauricio. 8) Fix unused variable warning in bpf_getsockopt() helper when CONFIG_INET is not configured, from Anders. 9) Fix a compilation failure in the BPF sample code's use of bpf_flow_keys, from Prashant. 10) Minor cleanups in BPF code, from Yue and Zhong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2018-09-25 This series contains updates to i40e and xsk. Mariusz fixes an issue where the VF link state was not being updated properly when the PF is down or up. Also cleaned up the promiscuous configuration during a VF reset. Patryk simplifies the code a bit to use the variables for PF and HW that are declared, rather than using the VSI pointers. Cleaned up the message length parameter to several virtchnl functions, since it was not being used (or needed). Harshitha fixes two potential race conditions when trying to change VF settings by creating a helper function to validate that the VF is enabled and that the VSI is set up. Sergey corrects a double "link down" message by putting in a check for whether or not the link is up or going down. Björn addresses an AF_XDP zero-copy issue that buffers passed from userspace to the kernel was leaked when the hardware descriptor ring was torn down. A zero-copy capable driver picks buffers off the fill ring and places them on the hardware receive ring to be completed at a later point when DMA is complete. Similar on the transmit side; The driver picks buffers off the transmit ring and places them on the transmit hardware ring. In the typical flow, the receive buffer will be placed onto an receive ring (completed to the user), and the transmit buffer will be placed on the completion ring to notify the user that the transfer is done. However, if the driver needs to tear down the hardware rings for some reason (interface goes down, reconfiguration and such), the userspace buffers cannot be leaked. They have to be reused or completed back to userspace. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: sched: implement tcf_block_refcnt_{get|put}()Vlad Buslov
Implement get/put function for blocks that only take/release the reference and perform deallocation. These functions are intended to be used by unlocked rules update path to always hold reference to block while working with it. They use on new fine-grained locking mechanisms introduced in previous patches in this set, instead of relying on global protection provided by rtnl lock. Extract code that is common with tcf_block_detach_ext() into common function __tcf_block_put(). Extend tcf_block with rcu to allow safe deallocation when it is accessed concurrently. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: sched: change tcf block reference counter type to refcount_tVlad Buslov
As a preparation for removing rtnl lock dependency from rules update path, change tcf block reference counter type to refcount_t to allow modification by concurrent users. In block put function perform decrement and check reference counter once to accommodate concurrent modification by unlocked users. After this change tcf_chain_put at the end of block put function is called with block->refcnt==0 and will deallocate block after the last chain is released, so there is no need to manually deallocate block in this case. However, if block reference counter reached 0 and there are no chains to release, block must still be deallocated manually. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: sched: add helper function to take reference to QdiscVlad Buslov
Implement function to take reference to Qdisc that relies on rcu read lock instead of rtnl mutex. Function only takes reference to Qdisc if reference counter isn't zero. Intended to be used by unlocked cls API. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: sched: extend Qdisc with rcuVlad Buslov
Currently, Qdisc API functions assume that users have rtnl lock taken. To implement rtnl unlocked classifiers update interface, Qdisc API must be extended with functions that do not require rtnl lock. Extend Qdisc structure with rcu. Implement special version of put function qdisc_put_unlocked() that is called without rtnl lock taken. This function only takes rtnl lock if Qdisc reference counter reached zero and is intended to be used as optimization. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: sched: rename qdisc_destroy() to qdisc_put()Vlad Buslov
Current implementation of qdisc_destroy() decrements Qdisc reference counter and only actually destroy Qdisc if reference counter value reached zero. Rename qdisc_destroy() to qdisc_put() in order for it to better describe the way in which this function currently implemented and used. Extract code that deallocates Qdisc into new private qdisc_destroy() function. It is intended to be shared between regular qdisc_put() and its unlocked version that is introduced in next patch in this series. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25net: xsk: add a simple buffer reuse queueJakub Kicinski
XSK UMEM is strongly single producer single consumer so reuse of frames is challenging. Add a simple "stash" of FILL packets to reuse for drivers to optionally make use of. This is useful when driver has to free (ndo_stop) or resize a ring with an active AF_XDP ZC socket. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-09-25Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Version bump conflict in batman-adv, take what's in net-next. iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24net/tls: Fixed race condition in async encryptionVakul Garg
On processors with multi-engine crypto accelerators, it is possible that multiple records get encrypted in parallel and their encryption completion is notified to different cpus in multicore processor. This leads to the situation where tls_encrypt_done() starts executing in parallel on different cores. In current implementation, encrypted records are queued to tx_ready_list in tls_encrypt_done(). This requires addition to linked list 'tx_ready_list' to be protected. As tls_decrypt_done() could be executing in irq content, it is not possible to protect linked list addition operation using a lock. To fix the problem, we remove linked list addition operation from the irq context. We do tx_ready_list addition/removal operation from application context only and get rid of possible multiple access to the linked list. Before starting encryption on the record, we add it to the tail of tx_ready_list. To prevent tls_tx_records() from transmitting it, we mark the record with a new flag 'tx_ready' in 'struct tls_rec'. When record encryption gets completed, tls_encrypt_done() has to only update the 'tx_ready' flag to true & linked list add operation is not required. The changed logic brings some other side benefits. Since the records are always submitted in tls sequence number order for encryption, the tx_ready_list always remains sorted and addition of new records to it does not have to traverse the linked list. Lastly, we renamed tx_ready_list in 'struct tls_sw_context_tx' to 'tx_list'. This is because now, the some of the records at the tail are not ready to transmit. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption") Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24neighbour: send netlink notification if NTF_ROUTER changesRoopa Prabhu
send netlink notification if neigh_update results in NTF_ROUTER change and if NEIGH_UPDATE_F_ISROUTER is on. Also move the NTF_ROUTER change function into a helper. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24net/sched: Add hardware specific counters to TC actionsEelco Chaudron
Add additional counters that will store the bytes/packets processed by hardware. These will be exported through the netlink interface for displaying by the iproute2 tc tool Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24net/core: Add new basic hardware counterEelco Chaudron
Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can be used to count packets/bytes processed by hardware offload. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: provide earliest departure time in skb->tstampEric Dumazet
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns, from usec units to nsec units. Do not clear skb->tstamp before entering IP stacks in TX, so that qdisc or devices can implement pacing based on the earliest departure time instead of socket sk->sk_pacing_rate Packets are fed with tcp_wstamp_ns, and following patch will update tcp_wstamp_ns when both TCP and sch_fq switch to the earliest departure time mechanism. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: add tcp_wstamp_ns socket fieldEric Dumazet
TCP will soon provide earliest departure time on TX skbs. It needs to track this in a new variable. tcp_mstamp_refresh() needs to update this variable, and became too big to stay an inline. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: introduce tcp_skb_timestamp_us() helperEric Dumazet
There are few places where TCP reads skb->skb_mstamp expecting a value in usec unit. skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value. Add tcp_skb_timestamp_us() to provide proper conversion when needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21tcp: switch tcp_clock_ns() to CLOCK_TAI baseEric Dumazet
TCP pacing is either implemented in sch_fq or internally. We have the goal of being able to offload pacing on the NICS. TCP will soon provide per skb skb->tstamp as early departure time. Like ETF in commit 25db26a91364 ("net/sched: Introduce the ETF Qdisc") we chose CLOCK_T as the clock base, so that TCP and pacers can share a common clock, to get better RTT samples (without pacing artificially inflating these samples). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21net/tls: Add support for async encryption of records for performanceVakul Garg
In current implementation, tls records are encrypted & transmitted serially. Till the time the previously submitted user data is encrypted, the implementation waits and on finish starts transmitting the record. This approach of encrypt-one record at a time is inefficient when asynchronous crypto accelerators are used. For each record, there are overheads of interrupts, driver softIRQ scheduling etc. Also the crypto accelerator sits idle most of time while an encrypted record's pages are handed over to tcp stack for transmission. This patch enables encryption of multiple records in parallel when an async capable crypto accelerator is present in system. This is achieved by allowing the user space application to send more data using sendmsg() even while previously issued data is being processed by crypto accelerator. This requires returning the control back to user space application after submitting encryption request to accelerator. This also means that zero-copy mode of encryption cannot be used with async accelerator as we must be done with user space application buffer before returning from sendmsg(). There can be multiple records in flight to/from the accelerator. Each of the record is represented by 'struct tls_rec'. This is used to store the memory pages for the record. After the records are encrypted, they are added in a linked list called tx_ready_list which contains encrypted tls records sorted as per tls sequence number. The records from tx_ready_list are transmitted using a newly introduced function called tls_tx_records(). The tx_ready_list is polled for any record ready to be transmitted in sendmsg(), sendpage() after initiating encryption of new tls records. This achieves parallel encryption and transmission of records when async accelerator is present. There could be situation when crypto accelerator completes encryption later than polling of tx_ready_list by sendmsg()/sendpage(). Therefore we need a deferred work context to be able to transmit records from tx_ready_list. The deferred work context gets scheduled if applications are not sending much data through the socket. If the applications issue sendmsg()/sendpage() in quick succession, then the scheduling of tx_work_handler gets cancelled as the tx_ready_list would be polled from application's context itself. This saves scheduling overhead of deferred work. The patch also brings some side benefit. We are able to get rid of the concept of CLOSED record. This is because the records once closed are either encrypted and then placed into tx_ready_list or if encryption fails, the socket error is set. This simplifies the kernel tls sendpath. However since tls_device.c is still using macros, accessory functions for CLOSED records have been retained. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20net/ipv4: Move device validation to helperDavid Ahern
Move the device matching check in __fib_validate_source to a helper and export it for use by netfilter modules. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20netfilter: conntrack: clamp l4proto array size at largers supported protocolFlorian Westphal
All higher l4proto numbers are handled by the generic tracker; the l4proto lookup function already returns generic one in case the l4proto number exceeds max size. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove l3->l4 mapping informationFlorian Westphal
l4 protocols are demuxed by l3num, l4num pair. However, almost all l4 trackers are l3 agnostic. Only exceptions are: - gre, icmp (ipv4 only) - icmpv6 (ipv6 only) This commit gets rid of the l3 mapping, l4 trackers can now be looked up by their IPPROTO_XXX value alone, which gets rid of the additional l3 indirection. For icmp, ipcmp6 and gre, add a check on state->pf and return -NF_ACCEPT in case we're asked to track e.g. icmpv6-in-ipv4, this seems more fitting than using the generic tracker. Additionally we can kill the 2nd l4proto definitions that were needed for v4/v6 split -- they are now the same so we can use single l4proto struct for each protocol, rather than two. The EXPORT_SYMBOLs can be removed as all these object files are part of nf_conntrack with no external references. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove unused proto arg from netns init functionsFlorian Westphal
Its unused, next patch will remove l4proto->l3proto number to simplify l4 protocol demuxer lookup. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove error callback and handle icmp from coreFlorian Westphal
icmp(v6) are the only two layer four protocols that need the error() callback (to handle icmp errors that are related to an established connections, e.g. packet too big, port unreachable and the like). Remove the error callback and handle these two special cases from the core. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: deconstify packet callback skb pointerFlorian Westphal
Only two protocols need the ->error() function: icmp and icmpv6. This is because icmp error mssages might be RELATED to an existing connection (e.g. PMTUD, port unreachable and the like), and their ->error() handlers do this. The error callback is already optional, so remove it for udp and call them from ->packet() instead. As the error() callback can call checksum functions that write to skb->csum*, the const qualifier has to be removed as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove the l4proto->new() functionFlorian Westphal
->new() gets invoked after ->error() and before ->packet() if a conntrack lookup has found no result for the tuple. We can fold it into ->packet() -- the packet() implementations can check if the conntrack is confirmed (new) or not (already in hash). If its unconfirmed, the conntrack isn't in the hash yet so current skb created a new conntrack entry. Only relevant side effect -- if packet() doesn't return NF_ACCEPT but -NF_ACCEPT (or drop), while the conntrack was just created, then the newly allocated conntrack is freed right away, rather than not created in the first place. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: pass nf_hook_state to packet and error handlersFlorian Westphal
nf_hook_state contains all the hook meta-information: netns, protocol family, hook location, and so on. Instead of only passing selected information, pass a pointer to entire structure. This will allow to merge the error and the packet handlers and remove the ->new() function in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-18NFC: Fix the number of pipesSuren Baghdasaryan
According to ETSI TS 102 622 specification chapter 4.4 pipe identifier is 7 bits long which allows for 128 unique pipe IDs. Because NFC_HCI_MAX_PIPES is used as the number of pipes supported and not as the max pipe ID, its value should be 128 instead of 127. nfc_hci_recv_from_llc extracts pipe ID from packet header using NFC_HCI_FRAGMENT(0x7F) mask which allows for pipe ID value of 127. Same happens when NCI_HCP_MSG_GET_PIPE() is being used. With pipes array having only 127 elements and pipe ID of 127 the OOB memory access will result. Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Allen Pais <allen.pais@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18netlink: add ethernet address policy typesJohannes Berg
Commonly, ethernet addresses are just using a policy of { .len = ETH_ALEN } which leaves userspace free to send more data than it should, which may hide bugs. Introduce NLA_EXACT_LEN which checks for exact size, rejecting the attribute if it's not exactly that length. Also add NLA_EXACT_LEN_WARN which requires the minimum length and will warn on longer attributes, for backward compatibility. Use these to define NLA_POLICY_ETH_ADDR (new strict policy) and NLA_POLICY_ETH_ADDR_COMPAT (compatible policy with warning); these are used like this: static const struct nla_policy <name>[...] = { [NL_ATTR_NAME] = NLA_POLICY_ETH_ADDR, ... }; Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18netlink: add NLA_REJECT policy typeJohannes Berg
In some situations some netlink attributes may be used for output only (kernel->userspace) or may be reserved for future use. It's then helpful to be able to prevent userspace from using them in messages sent to the kernel, since they'd otherwise be ignored and any future will become impossible if this happens. Add NLA_REJECT to the policy which does nothing but reject (with EINVAL) validation of any messages containing this attribute. Allow for returning a specific extended ACK error message in the validation_data pointer. While at it clear up the documentation a bit - the NLA_BITFIELD32 documentation was added to the list of len field descriptions. Also, use NL_SET_BAD_ATTR() in one place where it's open-coded. The specific case I have in mind now is a shared nested attribute containing request/response data, and it would be pointless and potentially confusing to have userspace include response data in the messages that actually contain a request. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two new tls tests added in parallel in both net and net-next. Used Stephen Rothwell's linux-next resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17tls: async support causes out-of-bounds access in crypto APIsJohn Fastabend
When async support was added it needed to access the sk from the async callback to report errors up the stack. The patch tried to use space after the aead request struct by directly setting the reqsize field in aead_request. This is an internal field that should not be used outside the crypto APIs. It is used by the crypto code to define extra space for private structures used in the crypto context. Users of the API then use crypto_aead_reqsize() and add the returned amount of bytes to the end of the request memory allocation before posting the request to encrypt/decrypt APIs. So this breaks (with general protection fault and KASAN error, if enabled) because the request sent to decrypt is shorter than required causing the crypto API out-of-bounds errors. Also it seems unlikely the sk is even valid by the time it gets to the callback because of memset in crypto layer. Anyways, fix this by holding the sk in the skb->sk field when the callback is set up and because the skb is already passed through to the callback handler via void* we can access it in the handler. Then in the handler we need to be careful to NULL the pointer again before kfree_skb. I added comments on both the setup (in tls_do_decryption) and when we clear it from the crypto callback handler tls_decrypt_done(). After this selftests pass again and fixes KASAN errors/warnings. Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Vakul Garg <Vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17netfilter: nf_tables: asynchronous releaseFlorian Westphal
Release the committed transaction log from a work queue, moving expensive synchronize_rcu out of the locked section and providing opportunity to batch this. On my test machine this cuts runtime of nft-test.py in half. Based on earlier patch from Pablo Neira Ayuso. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17netfilter: nf_tables: split set destruction in deactivate and destroy phaseFlorian Westphal
Splits unbind_set into destroy_set and unbinding operation. Unbinding removes set from lists (so new transaction would not find it anymore) but keeps memory allocated (so packet path continues to work). Rebind function is added to allow unrolling in case transaction that wants to remove set is aborted. Destroy function is added to free the memory, but this could occur outside of transaction in the future. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-14flow_dissector: implements flow dissector BPF hookPetar Penkov
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector path. The BPF program is per-network namespace. Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>