summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-02-06cfg80211: pmsr: fix abort lockingJohannes Berg
When we destroy the interface we already hold the wdev->mtx while calling cfg80211_pmsr_wdev_down(), which assumes this isn't true and flushes the worker that takes the lock, thus leading to a deadlock. Fix this by refactoring the worker and calling its code in cfg80211_pmsr_wdev_down() directly. We still need to flush the work later to make sure it's not still running and will crash, but it will not do anything. Fixes: 9bb7e0f24e7e ("cfg80211: add peer measurement with FTM initiator API") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-06cfg80211: pmsr: fix MAC address settingJohannes Berg
When we *don't* have a MAC address attribute, we shouldn't try to use this - this was intended to copy the local MAC address instead, so fix it. Fixes: 9bb7e0f24e7e ("cfg80211: add peer measurement with FTM initiator API") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Use CONFIG_NF_TABLES_INET from seltests, not NF_TABLES_INET. From Naresh Kamboju. 2) Add a test to cover masquerading and redirect case, from Florian Westphal. 3) Two packets coming from the same socket may race to set up NAT, ending up with different tuples and the packet losing race being dropped. Update nf_conntrack_tuple_taken() to exercise clash resolution for this case. From Martynas Pumputis and Florian Westphal. 4) Unbind anonymous sets from the commit and abort path, this fixes a splat due to double set list removal/release in case that the transaction needs to be aborted. 5) Do not preserve original output interface for packets that are redirected in the output chain when ip6_route_me_harder() is called. Otherwise packets end up going not going to the loopback device. From Eli Cooper. 6) Fix bogus splat in nft_compat with CONFIG_REFCOUNT_FULL=y, this also simplifies the existing logic to deal with the list insertions of the xtables extensions. From Florian Westphal. Diffstat look rather larger than usual because of the new selftest, but Florian and I consider that having tests soon into the tree is good to improve coverage. If there's a different policy in this regard, please, let me know. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-05netfilter: nft_compat: don't use refcount_inc on newly allocated entryFlorian Westphal
When I moved the refcount to refcount_t type I missed the fact that refcount_inc() will result in use-after-free warning with CONFIG_REFCOUNT_FULL=y builds. The correct fix would be to init the reference count to 1 at allocation time, but, unfortunately we cannot do this, as we can't undo that in case something else fails later in the batch. So only solution I see is to special-case the 'new entry' condition and replace refcount_inc() with a "delayed" refcount_set(1) in this case, as done here. The .activate callback can be removed to simplify things, we only need to make sure that deactivate() decrements/unlinks the entry from the list at end of transaction phase (commit or abort). Fixes: 12c44aba6618 ("netfilter: nft_compat: use refcnt_t type for nft_xt reference count") Reported-by: Jordan Glover <Golden_Miller83@protonmail.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05netfilter: ipv6: Don't preserve original oif for loopback addressEli Cooper
Commit 508b09046c0f ("netfilter: ipv6: Preserve link scope traffic original oif") made ip6_route_me_harder() keep the original oif for link-local and multicast packets. However, it also affected packets for the loopback address because it used rt6_need_strict(). REDIRECT rules in the OUTPUT chain rewrite the destination to loopback address; thus its oif should not be preserved. This commit fixes the bug that redirected local packets are being dropped. Actually the packet was not exactly dropped; Instead it was sent out to the original oif rather than lo. When a packet with daddr ::1 is sent to the router, it is effectively dropped. Fixes: 508b09046c0f ("netfilter: ipv6: Preserve link scope traffic original oif") Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05xfrm: destroy xfrm_state synchronously on net exit pathCong Wang
xfrm_state_put() moves struct xfrm_state to the GC list and schedules the GC work to clean it up. On net exit call path, xfrm_state_flush() is called to clean up and xfrm_flush_gc() is called to wait for the GC work to complete before exit. However, this doesn't work because one of the ->destructor(), ipcomp_destroy(), schedules the same GC work again inside the GC work. It is hard to wait for such a nested async callback. This is also why syzbot still reports the following warning: WARNING: CPU: 1 PID: 33 at net/ipv6/xfrm6_tunnel.c:351 xfrm6_tunnel_net_exit+0x2cb/0x500 net/ipv6/xfrm6_tunnel.c:351 ... ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153 cleanup_net+0x51d/0xb10 net/core/net_namespace.c:551 process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153 worker_thread+0x143/0x14a0 kernel/workqueue.c:2296 kthread+0x357/0x430 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 In fact, it is perfectly fine to bypass GC and destroy xfrm_state synchronously on net exit call path, because it is in process context and doesn't need a work struct to do any blocking work. This patch introduces xfrm_state_put_sync() which simply bypasses GC, and lets its callers to decide whether to use this synchronous version. On net exit path, xfrm_state_fini() and xfrm6_tunnel_net_exit() use it. And, as ipcomp_destroy() itself is blocking, it can use xfrm_state_put_sync() directly too. Also rename xfrm_state_gc_destroy() to ___xfrm_state_destroy() to reflect this change. Fixes: b48c05ab5d32 ("xfrm: Fix warning in xfrm6_tunnel_net_exit.") Reported-and-tested-by: syzbot+e9aebef558e3ed673934@syzkaller.appspotmail.com Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-02-04net: dsa: Fix lockdep false positive splatMarc Zyngier
Creating a macvtap on a DSA-backed interface results in the following splat when lockdep is enabled: [ 19.638080] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready [ 23.041198] device lan0 entered promiscuous mode [ 23.043445] device eth0 entered promiscuous mode [ 23.049255] [ 23.049557] ============================================ [ 23.055021] WARNING: possible recursive locking detected [ 23.060490] 5.0.0-rc3-00013-g56c857a1b8d3 #118 Not tainted [ 23.066132] -------------------------------------------- [ 23.071598] ip/2861 is trying to acquire lock: [ 23.076171] 00000000f61990cb (_xmit_ETHER){+...}, at: dev_set_rx_mode+0x1c/0x38 [ 23.083693] [ 23.083693] but task is already holding lock: [ 23.089696] 00000000ecf0c3b4 (_xmit_ETHER){+...}, at: dev_uc_add+0x24/0x70 [ 23.096774] [ 23.096774] other info that might help us debug this: [ 23.103494] Possible unsafe locking scenario: [ 23.103494] [ 23.109584] CPU0 [ 23.112093] ---- [ 23.114601] lock(_xmit_ETHER); [ 23.117917] lock(_xmit_ETHER); [ 23.121233] [ 23.121233] *** DEADLOCK *** [ 23.121233] [ 23.127325] May be due to missing lock nesting notation [ 23.127325] [ 23.134315] 2 locks held by ip/2861: [ 23.137987] #0: 000000003b766c72 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x338/0x4e0 [ 23.146231] #1: 00000000ecf0c3b4 (_xmit_ETHER){+...}, at: dev_uc_add+0x24/0x70 [ 23.153757] [ 23.153757] stack backtrace: [ 23.158243] CPU: 0 PID: 2861 Comm: ip Not tainted 5.0.0-rc3-00013-g56c857a1b8d3 #118 [ 23.166212] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT) [ 23.172843] Call trace: [ 23.175358] dump_backtrace+0x0/0x188 [ 23.179116] show_stack+0x14/0x20 [ 23.182524] dump_stack+0xb4/0xec [ 23.185928] __lock_acquire+0x123c/0x1860 [ 23.190048] lock_acquire+0xc8/0x248 [ 23.193724] _raw_spin_lock_bh+0x40/0x58 [ 23.197755] dev_set_rx_mode+0x1c/0x38 [ 23.201607] dev_set_promiscuity+0x3c/0x50 [ 23.205820] dsa_slave_change_rx_flags+0x5c/0x70 [ 23.210567] __dev_set_promiscuity+0x148/0x1e0 [ 23.215136] __dev_set_rx_mode+0x74/0x98 [ 23.219167] dev_uc_add+0x54/0x70 [ 23.222575] macvlan_open+0x170/0x1d0 [ 23.226336] __dev_open+0xe0/0x160 [ 23.229830] __dev_change_flags+0x16c/0x1b8 [ 23.234132] dev_change_flags+0x20/0x60 [ 23.238074] do_setlink+0x2d0/0xc50 [ 23.241658] __rtnl_newlink+0x5f8/0x6e8 [ 23.245601] rtnl_newlink+0x50/0x78 [ 23.249184] rtnetlink_rcv_msg+0x360/0x4e0 [ 23.253397] netlink_rcv_skb+0xe8/0x130 [ 23.257338] rtnetlink_rcv+0x14/0x20 [ 23.261012] netlink_unicast+0x190/0x210 [ 23.265043] netlink_sendmsg+0x288/0x350 [ 23.269075] sock_sendmsg+0x18/0x30 [ 23.272659] ___sys_sendmsg+0x29c/0x2c8 [ 23.276602] __sys_sendmsg+0x60/0xb8 [ 23.280276] __arm64_sys_sendmsg+0x1c/0x28 [ 23.284488] el0_svc_common+0xd8/0x138 [ 23.288340] el0_svc_handler+0x24/0x80 [ 23.292192] el0_svc+0x8/0xc This looks fairly harmless (no actual deadlock occurs), and is fixed in a similar way to c6894dec8ea9 ("bridge: fix lockdep addr_list_lock false positive splat") by putting the addr_list_lock in its own lockdep class. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04net: dsa: slave: Don't propagate flag changes on down slave interfacesRundong Ge
The unbalance of master's promiscuity or allmulti will happen after ifdown and ifup a slave interface which is in a bridge. When we ifdown a slave interface , both the 'dsa_slave_close' and 'dsa_slave_change_rx_flags' will clear the master's flags. The flags of master will be decrease twice. In the other hand, if we ifup the slave interface again, since the slave's flags were cleared the 'dsa_slave_open' won't set the master's flag, only 'dsa_slave_change_rx_flags' that triggered by 'br_add_if' will set the master's flags. The flags of master is increase once. Only propagating flag changes when a slave interface is up makes sure this does not happen. The 'vlan_dev_change_rx_flags' had the same problem and was fixed, and changes here follows that fix. Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support") Signed-off-by: Rundong Ge <rdong.ge@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04rds: rdma: update rdma transport for tosSantosh Shilimkar
For RDMA transports, RDS TOS is an extension of IB QoS(Annex A13) to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. The feature has been added from RDSv4.1 onwards and supports rolling upgrades. RDMA connection metadata also carries the tos information to set up SL on end to end context. The original code was developed by Bang Nguyen in downstream kernel back in 2.6.32 kernel days and it has evolved over period of time. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: add transport specific tos_map hookSantosh Shilimkar
RDMA transport maps user tos to underline virtual lanes(VL) for IB or DSCP values. RDMA CM transport abstract thats for RDS. TCP transport makes use of default priority 0 and maps all user tos values to it. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: add type of service(tos) infrastructureSantosh Shilimkar
RDS Service type (TOS) is user-defined and needs to be configured via RDS IOCTL interface. It must be set before initiating any traffic and once set the TOS can not be changed. All out-going traffic from the socket will be associated with its TOS. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: rdma: add consumer rejectSantosh Shilimkar
For legacy protocol version incompatibility with non linux RDS, consumer reject reason being used to convey it to peer. But the choice of reject reason value as '1' was really poor. Anyway for interoperability reasons with shipping products, it needs to be supported. For any future versions, properly encoded reject reason should to be used. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: make v3.1 as compat versionSantosh Shilimkar
Mark RDSv3.1 as compat version and add v4.1 version macro's. Subsequent patches enable TOS(Type of Service) feature which is tied with v4.1 for RDMA transport. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04Merge tag 'v5.0-rc5' into rdma.git for-nextJason Gunthorpe
Linux 5.0-rc5 Needed to merge the include/uapi changes so we have an up to date single-tree for these files. Patches already posted are also expected to need this for dependencies.
2019-02-04IB/core: Remove ib_sg_dma_address() and ib_sg_dma_len()Bart Van Assche
Keeping single line wrapper functions is not useful. Hence remove the ib_sg_dma_address() and ib_sg_dma_len() functions. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04netfilter: ipv6: avoid indirect calls for IPV6=y caseFlorian Westphal
indirect calls are only needed if ipv6 is a module. Add helpers to abstract the v6ops indirections and use them instead. fragment, reroute and route_input are kept as indirect calls. The first two are not not used in hot path and route_input is only used by bridge netfilter. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04netfilter: nat: remove module dependency on ipv6 coreFlorian Westphal
nf_nat_ipv6 calls two ipv6 core functions, so add those to v6ops to avoid the module dependency. This is a prerequisite for merging ipv4 and ipv6 nat implementations. Add wrappers to avoid the indirection if ipv6 is builtin. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04net: cls_flower: Remove filter from mask before freeing itPetr Machata
In fl_change(), when adding a new rule (i.e. fold == NULL), a driver may reject the new rule, for example due to resource exhaustion. By that point, the new rule was already assigned a mask, and it was added to that mask's hash table. The clean-up path that's invoked as a result of the rejection however neglects to undo the hash table addition, and proceeds to free the new rule, thus leaving a dangling pointer in the hash table. Fix by removing fnew from the mask's hash table before it is freed. Fixes: 35cc3cefc4de ("net/sched: cls_flower: Reject duplicated rules also under skip_sw") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04net/smc: correct state change for peer closingUrsula Braun
If some kind of closing is received from the peer while still in state SMC_INIT, it means the peer has had an active connection and closed the socket quickly before listen_work finished. This should not result in a shortcut from state SMC_INIT to state SMC_CLOSED. This patch adds the socket to the accept queue in state SMC_APPCLOSEWAIT1. The socket reaches state SMC_CLOSED once being accepted and closed with smc_release(). Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04net/smc: delete rkey first before switching to unusedUrsula Braun
Once RMBs are flagged as unused they are candidates for reuse. Thus the LLC DELETE RKEY operaton should be made before flagging the RMB as unused. Fixes: c7674c001b11 ("net/smc: unregister rkeys of unused buffer") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04net/smc: fix sender_free computationUrsula Braun
In some scenarios a separate consumer cursor update is necessary. The decision is made in smc_tx_consumer_cursor_update(). The sender_free computation could be wrong: The rx confirmed cursor is always smaller than or equal to the rx producer cursor. The parameters in the smc_curs_diff() call have to be exchanged, otherwise sender_free might even be negative. And if more data arrives local_rx_ctrl.prod might be updated, enabling a cursor difference between local_rx_ctrl.prod and rx confirmed cursor larger than the RMB size. This case is not covered by smc_curs_diff(). Thus function smc_curs_diff_large() is introduced here. If a recvmsg() is processed in parallel, local_tx_ctrl.cons might change during smc_cdc_msg_send. Make sure rx_curs_confirmed is updated with the actually sent local_tx_ctrl.cons value. Fixes: e82f2e31f559 ("net/smc: optimize consumer cursor updates") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04net/smc: preallocated memory for rdma work requestsUrsula Braun
The work requests for rdma writes are built in local variables within function smc_tx_rdma_write(). This violates the rule that the work request storage has to stay till the work request is confirmed by a completion queue response. This patch introduces preallocated memory for these work requests. The storage is allocated, once a link (and thus a queue pair) is established. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04netfilter: nf_tables: unbind set in rule from commit pathPablo Neira Ayuso
Anonymous sets that are bound to rules from the same transaction trigger a kernel splat from the abort path due to double set list removal and double free. This patch updates the logic to search for the transaction that is responsible for creating the set and disable the set list removal and release, given the rule is now responsible for this. Lookup is reverse since the transaction that adds the set is likely to be at the tail of the list. Moreover, this patch adds the unbind step to deliver the event from the commit path. This should not be done from the worker thread, since we have no guarantees of in-order delivery to the listener. This patch removes the assumption that both activate and deactivate callbacks need to be provided. Fixes: cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase") Reported-by: Mikhail Morfikov <mmorfikov@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04Bluetooth: Fix decrementing reference count twice in releasing socketMyungho Jung
When releasing socket, it is possible to enter hci_sock_release() and hci_sock_dev_event(HCI_DEV_UNREG) at the same time in different thread. The reference count of hdev should be decremented only once from one of them but if storing hdev to local variable in hci_sock_release() before detached from socket and setting to NULL in hci_sock_dev_event(), hci_dev_put(hdev) is unexpectedly called twice. This is resolved by referencing hdev from socket after bt_sock_unlink() in hci_sock_release(). Reported-by: syzbot+fdc00003f4efff43bc5b@syzkaller.appspotmail.com Signed-off-by: Myungho Jung <mhjungk@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-02-04netfilter: nft_tunnel: Add NFTA_TUNNEL_MODE optionswenxu
nft "tunnel" expr match both the tun_info of RX and TX. This patch provide the NFTA_TUNNEL_MODE to individually match the tun_info of RX or TX. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04netfilter: nf_nat: skip nat clash resolution for same-origin entriesMartynas Pumputis
It is possible that two concurrent packets originating from the same socket of a connection-less protocol (e.g. UDP) can end up having different IP_CT_DIR_REPLY tuples which results in one of the packets being dropped. To illustrate this, consider the following simplified scenario: 1. Packet A and B are sent at the same time from two different threads by same UDP socket. No matching conntrack entry exists yet. Both packets cause allocation of a new conntrack entry. 2. get_unique_tuple gets called for A. No clashing entry found. conntrack entry for A is added to main conntrack table. 3. get_unique_tuple is called for B and will find that the reply tuple of B is already taken by A. It will allocate a new UDP source port for B to resolve the clash. 4. conntrack entry for B cannot be added to main conntrack table because its ORIGINAL direction is clashing with A and the REPLY directions of A and B are not the same anymore due to UDP source port reallocation done in step 3. This patch modifies nf_conntrack_tuple_taken so it doesn't consider colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal. [ Florian: simplify patch to not use .allow_clash setting and always ignore identical flows ] Signed-off-by: Martynas Pumputis <martynas@weave.works> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-03net: Fix fall through warning in y2038 tstamp changes.David S. Miller
net/core/sock.c: In function 'sock_setsockopt': net/core/sock.c:914:3: warning: this statement may fall through [-Wimplicit-fallthrough=] sock_set_flag(sk, SOCK_TSTAMP_NEW); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/core/sock.c:915:2: note: here case SO_TIMESTAMPING_OLD: ^~~~ Fixes: 9718475e6908 ("socket: Add SO_TIMESTAMPING_NEW") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03bpfilter: remove extra header search paths for bpfilter_umhMasahiro Yamada
Currently, the header search paths -Itools/include and -Itools/include/uapi are not used. Let's drop the unused code. We can remove -I. too by fixing up one C file. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03sctp: check and update stream->out_curr when allocating stream_outXin Long
Now when using stream reconfig to add out streams, stream->out will get re-allocated, and all old streams' information will be copied to the new ones and the old ones will be freed. So without stream->out_curr updated, next time when trying to send from stream->out_curr stream, a panic would be caused. This patch is to check and update stream->out_curr when allocating stream_out. v1->v2: - define fa_index() to get elem index from stream->out_curr. v2->v3: - repost with no change. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Ying Xu <yinxu@redhat.com> Reported-by: syzbot+e33a3a138267ca119c7d@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03net: Fix ip_mc_{dec,inc}_group allocation contextFlorian Fainelli
After 4effd28c1245 ("bridge: join all-snoopers multicast address"), I started seeing the following sleep in atomic warnings: [ 26.763893] BUG: sleeping function called from invalid context at mm/slab.h:421 [ 26.771425] in_atomic(): 1, irqs_disabled(): 0, pid: 1658, name: sh [ 26.777855] INFO: lockdep is turned off. [ 26.781916] CPU: 0 PID: 1658 Comm: sh Not tainted 5.0.0-rc4 #20 [ 26.787943] Hardware name: BCM97278SV (DT) [ 26.792118] Call trace: [ 26.794645] dump_backtrace+0x0/0x170 [ 26.798391] show_stack+0x24/0x30 [ 26.801787] dump_stack+0xa4/0xe4 [ 26.805182] ___might_sleep+0x208/0x218 [ 26.809102] __might_sleep+0x78/0x88 [ 26.812762] kmem_cache_alloc_trace+0x64/0x28c [ 26.817301] igmp_group_dropped+0x150/0x230 [ 26.821573] ip_mc_dec_group+0x1b0/0x1f8 [ 26.825585] br_ip4_multicast_leave_snoopers.isra.11+0x174/0x190 [ 26.831704] br_multicast_toggle+0x78/0xcc [ 26.835887] store_bridge_parm+0xc4/0xfc [ 26.839894] multicast_snooping_store+0x3c/0x4c [ 26.844517] dev_attr_store+0x44/0x5c [ 26.848262] sysfs_kf_write+0x50/0x68 [ 26.852006] kernfs_fop_write+0x14c/0x1b4 [ 26.856102] __vfs_write+0x60/0x190 [ 26.859668] vfs_write+0xc8/0x168 [ 26.863059] ksys_write+0x70/0xc8 [ 26.866449] __arm64_sys_write+0x24/0x30 [ 26.870458] el0_svc_common+0xa0/0x11c [ 26.874291] el0_svc_handler+0x38/0x70 [ 26.878120] el0_svc+0x8/0xc while toggling the bridge's multicast_snooping attribute dynamically. Pass a gfp_t down to igmpv3_add_delrec(), introduce __igmp_group_dropped() and introduce __ip_mc_dec_group() to take a gfp_t argument. Similarly introduce ____ip_mc_inc_group() and __ip_mc_inc_group() to allow caller to specify gfp_t. IPv6 part of the patch appears fine. Fixes: 4effd28c1245 ("bridge: join all-snoopers multicast address") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03net: devlink: report cell size of shared buffersJakub Kicinski
Shared buffer allocation is usually done in cell increments. Drivers will either round up the allocation or refuse the configuration if it's not an exact multiple of cell size. Drivers know exactly the cell size of shared buffer, so help out users by providing this information in dumps. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEWDeepa Dinamani
Add new socket timeout options that are y2038 safe. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: ccaulfie@redhat.com Cc: davem@davemloft.net Cc: deller@gmx.de Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: cluster-devel@redhat.com Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixesDeepa Dinamani
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval as the time format. struct timeval is not y2038 safe. The subsequent patches in the series add support for new socket timeout options with _NEW suffix that will use y2038 safe data structures. Although the existing struct timeval layout is sufficiently wide to represent timeouts, because of the way libc will interpret time_t based on user defined flag, these new flags provide a way of having a structure that is the same for all architectures consistently. Rename the existing options with _OLD suffix forms so that the right option is enabled for userspace applications according to the architecture and time_t definition of libc. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: ccaulfie@redhat.com Cc: deller@gmx.de Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: cluster-devel@redhat.com Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: Add SO_TIMESTAMPING_NEWDeepa Dinamani
Add SO_TIMESTAMPING_NEW variant of socket timestamp options. This is the y2038 safe versions of the SO_TIMESTAMPING_OLD for all architectures. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: chris@zankel.net Cc: fenghua.yu@intel.com Cc: rth@twiddle.net Cc: tglx@linutronix.de Cc: ubraun@linux.ibm.com Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: Add SO_TIMESTAMP[NS]_NEWDeepa Dinamani
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of socket timestamp options. These are the y2038 safe versions of the SO_TIMESTAMP_OLD and SO_TIMESTAMPNS_OLD for all architectures. Note that the format of scm_timestamping.ts[0] is not changed in this patch. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: Use old_timeval types for socket timestampsDeepa Dinamani
As part of y2038 solution, all internal uses of struct timeval are replaced by struct __kernel_old_timeval and struct compat_timeval by struct old_timeval32. Make socket timestamps use these new types. This is mainly to be able to verify that the kernel build is y2038 safe when such non y2038 safe types are not supported anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: isdn@linux-pingi.de Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03sockopt: Rename SO_TIMESTAMP* to SO_TIMESTAMP*_OLDDeepa Dinamani
SO_TIMESTAMP, SO_TIMESTAMPNS and SO_TIMESTAMPING options, the way they are currently defined, are not y2038 safe. Subsequent patches in the series add new y2038 safe versions of these options which provide 64 bit timestamps on all architectures uniformly. Hence, rename existing options with OLD tag suffixes. Also note that kernel will not use the untagged SO_TIMESTAMP* and SCM_TIMESTAMP* options internally anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: deller@gmx.de Cc: dhowells@redhat.com Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-afs@lists.infradead.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: move compat timeout handling into sock.cArnd Bergmann
This is a cleanup to prepare for the addition of 64-bit time_t in O_SNDTIMEO/O_RCVTIMEO. The existing compat handler seems unnecessarily complex and error-prone, moving it all into the main setsockopt()/getsockopt() implementation requires half as much code and is easier to extend. 32-bit user space can now use old_timeval32 on both 32-bit and 64-bit machines, while 64-bit code can use __old_kernel_timeval. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03vsock/virtio: reset connected sockets on device removalStefano Garzarella
When the virtio transport device disappear, we should reset all connected sockets in order to inform the users. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03vsock/virtio: fix kernel panic after device hot-unplugStefano Garzarella
virtio_vsock_remove() invokes the vsock_core_exit() also if there are opened sockets for the AF_VSOCK protocol family. In this way the vsock "transport" pointer is set to NULL, triggering the kernel panic at the first socket activity. This patch move the vsock_core_init()/vsock_core_exit() in the virtio_vsock respectively in module_init and module_exit functions, that cannot be invoked until there are open sockets. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1609699 Reported-by: Yan Fu <yafu@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03ipv4/igmp: Don't drop IGMP pkt with zeros src addrEdward Chron
Don't drop IGMP packets with a source address of all zeros which are IGMP proxy reports. This is documented in Section 2.1.1 IGMP Forwarding Rules of RFC 4541 IGMP and MLD Snooping Switches Considerations. Signed-off-by: Edward Chron <echron@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-02-01 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) introduce bpf_spin_lock, from Alexei. 2) convert xdp samples to libbpf, from Maciej. 3) skip verifier tests for unsupported program/map types, from Stanislav. 4) powerpc64 JIT support for BTF line info, from Sandipan. 5) assorted fixed, from Valdis, Jesper, Jiong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ethtool: add compat for devlink infoJakub Kicinski
If driver did not fill the fw_version field, try to call into the new devlink get_info op and collect the versions that way. We assume ethtool was always reporting running versions. v4: - use IS_REACHABLE() to avoid problems with DEVLINK=m (kbuildbot). v3 (Jiri): - do a dump and then parse it instead of special handling; - concatenate all versions (well, all that fit :)). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add version reporting to devlink info APIJakub Kicinski
ethtool -i has a few fixed-size fields which can be used to report firmware version and expansion ROM version. Unfortunately, modern hardware has more firmware components. There is usually some datapath microcode, management controller, PXE drivers, and a CPLD load. Running ethtool -i on modern controllers reveals the fact that vendors cram multiple values into firmware version field. Here are some examples from systems I could lay my hands on quickly: tg3: "FFV20.2.17 bc 5720-v1.39" i40e: "6.01 0x800034a4 1.1747.0" nfp: "0.0.3.5 0.25 sriov-2.1.16 nic" Add a new devlink API to allow retrieving multiple versions, and provide user-readable name for those versions. While at it break down the versions into three categories: - fixed - this is the board/fixed component version, usually vendors report information like the board version in the PCI VPD, but it will benefit from naming and common API as well; - running - this is the running firmware version; - stored - this is firmware in the flash, after firmware update this value will reflect the flashed version, while the running version may only be updated after reboot. v3: - add per-type helpers instead of using the special argument (Jiri). RFCv2: - remove the nesting in attr DEVLINK_ATTR_INFO_VERSIONS (now versions are mixed with other info attrs)l - have the driver report versions from the same callback as other info. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01devlink: add device information APIJakub Kicinski
ethtool -i has served us well for a long time, but its showing its limitations more and more. The device information should also be reported per device not per-netdev. Lay foundation for a simple devlink-based way of reading device info. Add driver name and device serial number as initial pieces of information exposed via this new API. v3: - rename helpers (Jiri); - rename driver name attr (Jiri); - remove double spacing in commit message (Jiri). RFC v2: - wrap the skb into an opaque structure (Jiri); - allow the serial number of be any length (Jiri & Andrew); - add driver name (Jonathan). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-01-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) disable preemption in sender side of socket filters, from Alexei. 2) fix two potential deadlocks in syscall bpf lookup and prog_register, from Martin and Alexei. 3) fix BTF to allow typedef on func_proto, from Yonghong. 4) two bpftool fixes, from Jiri and Paolo. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipconfig: add carrier_timeout kernel parameterMartin Kepplinger
commit 3fb72f1e6e61 ("ipconfig wait for carrier") added a "wait for carrier" policy, with a fixed worst case maximum wait of two minutes. Now make the wait for carrier timeout configurable on the kernel commandline and use the 120s as the default. The timeout messages introduced with commit 5e404cd65860 ("ipconfig: add informative timeout messages while waiting for carrier") are done in a fixed interval of 20 seconds, just like they were before (240/12). Signed-off-by: Martin Kepplinger <martin.kepplinger@ginzinger.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01ipv4: fib: use struct_size() in kzalloc()Gustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Set async_capable for tls zerocopy only if we see EINPROGRESSDave Watson
Currently we don't zerocopy if the crypto framework async bit is set. However some crypto algorithms (such as x86 AESNI) support async, but in the context of sendmsg, will never run asynchronously. Instead, check for actual EINPROGRESS return code before assuming algorithm is async. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01net: tls: Add tls 1.3 supportDave Watson
TLS 1.3 has minor changes from TLS 1.2 at the record layer. * Header now hardcodes the same version and application content type in the header. * The real content type is appended after the data, before encryption (or after decryption). * The IV is xored with the sequence number, instead of concatinating four bytes of IV with the explicit IV. * Zero-padding: No exlicit length is given, we search backwards from the end of the decrypted data for the first non-zero byte, which is the content type. Currently recv supports reading zero-padding, but there is no way for send to add zero padding. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>