summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya Leoshkevich. 2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for SMC from Jason Baron. 3) ipv6_mc_may_pull() should return 0 for malformed packets, not -EINVAL. From Stefano Brivio. 4) Don't forget to unpin umem xdp pages in error path of xdp_umem_reg(). From Ivan Khoronzhuk. 5) Fix sta object leak in mac80211, from Johannes Berg. 6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2 switches. From Florian Fainelli. 7) Revert DMA sync removal from r8169 which was causing regressions on some MIPS Loongson platforms. From Heiner Kallweit. 8) Use after free in flow dissector, from Jakub Sitnicki. 9) Fix NULL derefs of net devices during ICMP processing across collect_md tunnels, from Hangbin Liu. 10) proto_register() memory leaks, from Zhang Lin. 11) Set NLM_F_MULTI flag in multipart netlink messages consistently, from John Fastabend. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits) r8152: Set memory to all 0xFFs on failed reg reads openvswitch: Fix conntrack cache with timeout ipv4: mpls: fix mpls_xmit for iptunnel nexthop: Fix nexthop_num_path for blackhole nexthops net: rds: add service level support in rds-info net: route dump netlink NLM_F_MULTI flag missing s390/qeth: reject oversized SNMP requests sock: fix potential memory leak in proto_register() MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode ipv4/icmp: fix rt dst dev null pointer dereference openvswitch: Fix log message in ovs conntrack bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0 bpf: fix use after free in prog symbol exposure bpf: fix precision tracking in presence of bpf2bpf calls flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH Revert "r8169: remove not needed call to dma_sync_single_for_device" ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev net/ncsi: Fix the payload copying for the request coming from Netlink qed: Add cleanup in qed_slowpath_start() ...
2019-08-27netfilter: not mark a spinlock as __read_mostlyLi RongQing
when spinlock is locked/unlocked, its elements will be changed, so marking it as __read_mostly is not suitable. and remove a duplicate definition of nf_conntrack_locks_all_lock strange that compiler does not complain. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27netfilter: conntrack: make sysctls per-namespace againFlorian Westphal
When I merged the extension sysctl tables with the main one I forgot to reset them on netns creation. They currently read/write init_net settings. Fixes: d912dec12428 ("netfilter: conntrack: merge acct and helper sysctl table with main one") Fixes: cb2833ed0044 ("netfilter: conntrack: merge ecache and timestamp sysctl tables with main one") Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27netfilter: nft_dynset: support for element deletionAnder Juaristi
This patch implements the delete operation from the ruleset. It implements a new delete() function in nft_set_rhash. It is simpler to use than the already existing remove(), because it only takes the set and the key as arguments, whereas remove() expects a full nft_set_elem structure. Signed-off-by: Ander Juaristi <a@juaristi.eus> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27netfilter: nf_conntrack_ftp: Fix debug outputThomas Jarosch
The find_pattern() debug output was printing the 'skip' character. This can be a NULL-byte and messes up further pr_debug() output. Output without the fix: kernel: nf_conntrack_ftp: Pattern matches! kernel: nf_conntrack_ftp: Skipped up to `<7>nf_conntrack_ftp: find_pattern `PORT': dlen = 8 kernel: nf_conntrack_ftp: find_pattern `EPRT': dlen = 8 Output with the fix: kernel: nf_conntrack_ftp: Pattern matches! kernel: nf_conntrack_ftp: Skipped up to 0x0 delimiter! kernel: nf_conntrack_ftp: Match succeeded! kernel: nf_conntrack_ftp: conntrack_ftp: match `172,17,0,100,200,207' (20 bytes at 4150681645) kernel: nf_conntrack_ftp: find_pattern `PORT': dlen = 8 Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27netfilter: xt_physdev: Fix spurious error message in physdev_mt_checkTodd Seidelmann
Simplify the check in physdev_mt_check() to emit an error message only when passed an invalid chain (ie, NF_INET_LOCAL_OUT). This avoids cluttering up the log with errors against valid rules. For large/heavily modified rulesets, current behavior can quickly overwhelm the ring buffer, because this function gets called on every change, regardless of the rule that was changed. Signed-off-by: Todd Seidelmann <tseidelmann@linode.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27rxrpc: Use skb_unshare() rather than skb_cow_data()David Howells
The in-place decryption routines in AF_RXRPC's rxkad security module currently call skb_cow_data() to make sure the data isn't shared and that the skb can be written over. This has a problem, however, as the softirq handler may be still holding a ref or the Rx ring may be holding multiple refs when skb_cow_data() is called in rxkad_verify_packet() - and so skb_shared() returns true and __pskb_pull_tail() dislikes that. If this occurs, something like the following report will be generated. kernel BUG at net/core/skbuff.c:1463! ... RIP: 0010:pskb_expand_head+0x253/0x2b0 ... Call Trace: __pskb_pull_tail+0x49/0x460 skb_cow_data+0x6f/0x300 rxkad_verify_packet+0x18b/0xb10 [rxrpc] rxrpc_recvmsg_data.isra.11+0x4a8/0xa10 [rxrpc] rxrpc_kernel_recv_data+0x126/0x240 [rxrpc] afs_extract_data+0x51/0x2d0 [kafs] afs_deliver_fs_fetch_data+0x188/0x400 [kafs] afs_deliver_to_call+0xac/0x430 [kafs] afs_wait_for_call_to_complete+0x22f/0x3d0 [kafs] afs_make_call+0x282/0x3f0 [kafs] afs_fs_fetch_data+0x164/0x300 [kafs] afs_fetch_data+0x54/0x130 [kafs] afs_readpages+0x20d/0x340 [kafs] read_pages+0x66/0x180 __do_page_cache_readahead+0x188/0x1a0 ondemand_readahead+0x17d/0x2e0 generic_file_read_iter+0x740/0xc10 __vfs_read+0x145/0x1a0 vfs_read+0x8c/0x140 ksys_read+0x4a/0xb0 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by using skb_unshare() instead in the input path for DATA packets that have a security index != 0. Non-DATA packets don't need in-place encryption and neither do unencrypted DATA packets. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Reported-by: Julian Wollrath <jwollrath@web.de> Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Use the tx-phase skb flag to simplify tracingDavid Howells
Use the previously-added transmit-phase skbuff private flag to simplify the socket buffer tracing a bit. Which phase the skbuff comes from can now be divined from the skb rather than having to be guessed from the call state. We can also reduce the number of rxrpc_skb_trace values by eliminating the difference between Tx and Rx in the symbols. Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Add a private skb flag to indicate transmission-phase skbsDavid Howells
Add a flag in the private data on an skbuff to indicate that this is a transmission-phase buffer rather than a receive-phase buffer. Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Abstract out rxtx ring cleanupDavid Howells
Abstract out rxtx ring cleanup into its own function from its two callers. This makes it easier to apply the same changes to both. Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Pass the input handler's data skb reference to the Rx ringDavid Howells
Pass the reference held on a DATA skb in the rxrpc input handler into the Rx ring rather than getting an additional ref for this and then dropping the original ref at the end. Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Use info in skbuff instead of reparsing a jumbo packetDavid Howells
Use the information now cached in the skbuff private data to avoid the need to reparse a jumbo packet. We can find all the subpackets by dead reckoning, so it's only necessary to note how many there are, whether the last one is flagged as LAST_PACKET and whether any have the REQUEST_ACK flag set. This is necessary as once recvmsg() can see the packet, it can start modifying it, such as doing in-place decryption. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-27rxrpc: Improve jumbo packet countingDavid Howells
Improve the information stored about jumbo packets so that we don't need to reparse them so much later. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
2019-08-26net: sched: flower: don't take rtnl lock for cls hw offloads APIVlad Buslov
Don't manually take rtnl lock in flower classifier before calling cls hardware offloads API. Instead, pass rtnl lock status via 'rtnl_held' parameter. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: copy tunnel info when setting flow_action entry->tunnelVlad Buslov
In order to remove dependency on rtnl lock, modify tc_setup_flow_action() to copy tunnel info, instead of just saving pointer to tunnel_key action tunnel info. This is necessary to prevent concurrent action overwrite from releasing tunnel info while it is being used by rtnl-unlocked driver. Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info with all its options to dynamically allocated memory block. Modify tc_cleanup_flow_action() to free dynamically allocated tunnel info. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: take reference to action dev before calling offloadsVlad Buslov
In order to remove dependency on rtnl lock when calling hardware offload API, take reference to action mirred dev when initializing flow_action structure in tc_setup_flow_action(). Implement function tc_cleanup_flow_action(), use it to release the device after hardware offload API is done using it. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: take rtnl lock in tc_setup_flow_action()Vlad Buslov
In order to allow using new flow_action infrastructure from unlocked classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held' argument. Take rtnl lock before accessing tc_action data. This is necessary to protect from concurrent action replace. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: conditionally obtain rtnl lock in cls hw offloads APIVlad Buslov
In order to remove dependency on rtnl lock from offloads code of classifiers, take rtnl lock conditionally before executing driver callbacks. Only obtain rtnl lock if block is bound to devices that require it. Block bind/unbind code is rtnl-locked and obtains block->cb_lock while holding rtnl lock. Obtain locks in same order in tc_setup_cb_*() functions to prevent deadlock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: add API for registering unlocked offload block callbacksVlad Buslov
Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow registering and unregistering block hardware offload callbacks that do not require caller to hold rtnl lock. Extend tcf_block with additional lockeddevcnt counter that is incremented for each non-unlocked driver callback attached to device. This counter is necessary to conditionally obtain rtnl lock before calling hardware callbacks in following patches. Register mlx5 tc block offload callbacks as "unlocked". Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: notify classifier on successful offload add/deleteVlad Buslov
To remove dependency on rtnl lock, extend classifier ops with new ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while holding cb_lock every time filter if successfully added to or deleted from hardware. Implement the new API in flower classifier. Use it to manage hw_filters list under cb_lock protection, instead of relying on rtnl lock to synchronize with concurrent fl_reoffload() call. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: refactor block offloads counter usageVlad Buslov
Without rtnl lock protection filters can no longer safely manage block offloads counter themselves. Refactor cls API to protect block offloadcnt with tcf_block->cb_lock that is already used to protect driver callback list and nooffloaddevcnt counter. The counter can be modified by concurrent tasks by new functions that execute block callbacks (which is safe with previous patch that changed its type to atomic_t), however, block bind/unbind code that checks the counter value takes cb_lock in write mode to exclude any concurrent modifications. This approach prevents race conditions between bind/unbind and callback execution code but allows for concurrency for tc rule update path. Move block offload counter, filter in hardware counter and filter flags management from classifiers into cls hardware offloads API. Make functions tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API private. Implement following new cls API to be used instead: tc_setup_cb_add() - non-destructive filter add. If filter that wasn't already in hardware is successfully offloaded, increment block offloads counter, set filter in hardware counter and flag. On failure, previously offloaded filter is considered to be intact and offloads counter is not decremented. tc_setup_cb_replace() - destructive filter replace. Release existing filter block offload counter and reset its in hardware counter and flag. Set new filter in hardware counter and flag. On failure, previously offloaded filter is considered to be destroyed and offload counter is decremented. tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block offloads counter. tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and call tc_cls_offload_cnt_update() if cb() didn't return an error. Refactor all offload-capable classifiers to atomically offload filters to hardware, change block offload counter, and set filter in hardware counter and flag by means of the new cls API functions. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: change tcf block offload counter type to atomic_tVlad Buslov
As a preparation for running proto ops functions without rtnl lock, change offload counter type to atomic. This is necessary to allow updating the counter by multiple concurrent users when offloading filters to hardware from unlocked classifiers. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26net: sched: protect block offload-related fields with rw_semaphoreVlad Buslov
In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock' rwsem and use it to protect flow_block->cb_list and related counters from concurrent modification. The lock is taken in read mode for read-only traversal of cb_list in tc_setup_cb_call() and write mode in all other cases. This approach ensures that: - cb_list is not changed concurrently while filters is being offloaded on block. - block->nooffloaddevcnt is checked while holding the lock in read mode, but is only changed by bind/unbind code when holding the cb_lock in write mode to prevent concurrent modification. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26xprtrdma: Send Queue size grows after a reconnectChuck Lever
Eli Dorfman reports that after a series of idle disconnects, an RPC/RDMA transport becomes unusable (rdma_create_qp returns -ENOMEM). Problem was tracked down to increasing Send Queue size after each reconnect. The rdma_create_qp() API does not promise to leave its @qp_init_attr parameter unaltered. In fact, some drivers do modify one or more of its fields. Thus our calls to rdma_create_qp must use a fresh copy of ib_qp_init_attr each time. This fix is appropriate for kernels dating back to late 2007, though it will have to be adapted, as the connect code has changed over the years. Reported-by: Eli Dorfman <eli@vastdata.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26xprtrdma: Clear xprt->reestablish_timeout on closeChuck Lever
Ensure that the re-establishment delay does not grow exponentially on each good reconnect. This probably should have been part of commit 675dd90ad093 ("xprtrdma: Modernize ops->connect"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26SUNRPC: Handle connection breakages correctly in call_status()Trond Myklebust
If the connection breaks while we're waiting for a reply from the server, then we want to immediately try to reconnect. Fixes: ec6017d90359 ("SUNRPC fix regression in umount of a secure mount") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"Trond Myklebust
This reverts commit a79f194aa4879e9baad118c3f8bb2ca24dbef765. The mechanism for aborting I/O is racy, since we are not guaranteed that the request is asleep while we're changing both task->tk_status and task->tk_action. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v5.1
2019-08-26SUNRPC: Handle EADDRINUSE and ENOBUFS correctlyTrond Myklebust
If a connect or bind attempt returns EADDRINUSE, that means we want to retry with a different port. It is not a fatal connection error. Similarly, ENOBUFS is not fatal, but just indicates a memory allocation issue. Retry after a short delay. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-26SUNRPC: Don't handle errors if the bind/connect succeededTrond Myklebust
Don't handle errors in call_bind_status()/call_connect_status() if it turns out that a previous call caused it to succeed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v5.1+
2019-08-26xprtrdma: Recycle MRs after disconnectChuck Lever
The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a bit too optimistic. MRs left over after a reconnect still need to be recycled, not added back to the free list, since they could be in flight or actually fully registered. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26netfilter: nfnetlink_log: add support for VLAN informationMichael Braun
Currently, there is no vlan information (e.g. when used with a vlan aware bridge) passed to userspache, HWHEADER will contain an 08 00 (ip) suffix even for tagged ip packets. Therefore, add an extra netlink attribute that passes the vlan information to userspace similarly to 15824ab29f for nfqueue. Signed-off-by: Michael Braun <michael-dev@fami-braun.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-26netfilter: nft_meta: support for time matchingAnder Juaristi
This patch introduces meta matches in the kernel for time (a UNIX timestamp), day (a day of week, represented as an integer between 0-6), and hour (an hour in the current day, or: number of seconds since midnight). All values are taken as unsigned 64-bit integers. The 'time' keyword is internally converted to nanoseconds by nft in userspace, and hence the timestamp is taken in nanoseconds as well. Signed-off-by: Ander Juaristi <a@juaristi.eus> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-26netfilter: nf_tables: Introduce new 64-bit helper register functionsAnder Juaristi
Introduce new helper functions to load/store 64-bit values onto/from registers: - nft_reg_store64 - nft_reg_load64 This commit also re-orders all these helpers from smallest to largest target bit size. Signed-off-by: Ander Juaristi <a@juaristi.eus> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-25openvswitch: Fix conntrack cache with timeoutYi-Hung Wei
This patch addresses a conntrack cache issue with timeout policy. Currently, we do not check if the timeout extension is set properly in the cached conntrack entry. Thus, after packet recirculate from conntrack action, the timeout policy is not applied properly. This patch fixes the aforementioned issue. Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-25ipv4: mpls: fix mpls_xmit for iptunnelAlexey Kodanev
When using mpls over gre/gre6 setup, rt->rt_gw4 address is not set, the same for rt->rt_gw_family. Therefore, when rt->rt_gw_family is checked in mpls_xmit(), neigh_xmit() call is skipped. As a result, such setup doesn't work anymore. This issue was found with LTP mpls03 tests. Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net: rds: add service level support in rds-infoZhu Yanjun
>From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL) is used to identify different flows within an IBA subnet. It is carried in the local route header of the packet. Before this commit, run "rds-info -I". The outputs are as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 192.2.95.3 192.2.95.1 2 0 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 1 0 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9 " After this commit, the output is as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 192.2.95.3 192.2.95.1 2 2 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 1 1 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9 " The commit fe3475af3bdf ("net: rds: add per rds connection cache statistics") adds cache_allocs in struct rds_info_rdma_connection as below: struct rds_info_rdma_connection { ... __u32 rdma_mr_max; __u32 rdma_mr_size; __u8 tos; __u32 cache_allocs; }; The peer struct in rds-tools of struct rds_info_rdma_connection is as below: struct rds_info_rdma_connection { ... uint32_t rdma_mr_max; uint32_t rdma_mr_size; uint8_t tos; uint8_t sl; uint32_t cache_allocs; }; The difference between userspace and kernel is the member variable sl. In the kernel struct, the member variable sl is missing. This will introduce risks. So it is necessary to use this commit to avoid this risk. Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics") CC: Joe Jin <joe.jin@oracle.com> CC: JUNXIAO_BI <junxiao.bi@oracle.com> Suggested-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net: route dump netlink NLM_F_MULTI flag missingJohn Fastabend
An excerpt from netlink(7) man page, In multipart messages (multiple nlmsghdr headers with associated payload in one byte stream) the first and all following headers have the NLM_F_MULTI flag set, except for the last header which has the type NLMSG_DONE. but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a FIB dump. The result is user space applications following above man page excerpt may get confused and may stop parsing msg believing something went wrong. In the golang netlink lib [0] the library logic stops parsing believing the message is not a multipart message. Found this running Cilium[1] against net-next while adding a feature to auto-detect routes. I noticed with multiple route tables we no longer could detect the default routes on net tree kernels because the library logic was not returning them. Fix this by handling the fib_dump_info_fnhe() case the same way the fib_dump_info() handles it by passing the flags argument through the call chain and adding a flags argument to rt_fill_info(). Tested with Cilium stack and auto-detection of routes works again. Also annotated libs to dump netlink msgs and inspected NLM_F_MULTI and NLMSG_DONE flags look correct after this. Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same as is done for fib_dump_info() so this looks correct to me. [0] https://github.com/vishvananda/netlink/ [1] https://github.com/cilium/ Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24sock: fix potential memory leak in proto_register()zhanglin
If protocols registered exceeded PROTO_INUSE_NR, prot will be added to proto_list, but no available bit left for prot in proto_inuse_idx. Changes since v2: * Propagate the error code properly Signed-off-by: zhanglin <zhang.lin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net/core/skmsg: Delete an unnecessary check before the function call ↵Markus Elfring
“consume_skb” The consume_skb() function performs also input parameter validation. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md modeHangbin Liu
In decode_session{4,6} there is a possibility that the skb dst dev is NULL, e,g, with tunnel collect_md mode, which will cause kernel crash. Here is what the code path looks like, for GRE: - ip6gre_tunnel_xmit - ip6gre_xmit_ipv6 - __gre6_xmit - ip6_tnl_xmit - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE - icmpv6_send - icmpv6_route_lookup - xfrm_decode_session_reverse - decode_session4 - oif = skb_dst(skb)->dev->ifindex; <-- here - decode_session6 - oif = skb_dst(skb)->dev->ifindex; <-- here The reason is __metadata_dst_init() init dst->dev to NULL by default. We could not fix it in __metadata_dst_init() as there is no dev supplied. On the other hand, the skb_dst(skb)->dev is actually not needed as we called decode_session{4,6} via xfrm_decode_session_reverse(), so oif is not used by: fl4->flowi4_oif = reverse ? skb->skb_iif : oif; So make a dst dev check here should be clean and safe. v4: No changes. v3: No changes. v2: fix the issue in decode_session{4,6} instead of updating shared dst dev in {ip_md, ip6}_tunnel_xmit. Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24ipv4/icmp: fix rt dst dev null pointer dereferenceHangbin Liu
In __icmp_send() there is a possibility that the rt->dst.dev is NULL, e,g, with tunnel collect_md mode, which will cause kernel crash. Here is what the code path looks like, for GRE: - ip6gre_tunnel_xmit - ip6gre_xmit_ipv4 - __gre6_xmit - ip6_tnl_xmit - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE - icmp_send - net = dev_net(rt->dst.dev); <-- here The reason is __metadata_dst_init() init dst->dev to NULL by default. We could not fix it in __metadata_dst_init() as there is no dev supplied. On the other hand, the reason we need rt->dst.dev is to get the net. So we can just try get it from skb->dev when rt->dst.dev is NULL. v4: Julian Anastasov remind skb->dev also could be NULL. We'd better still use dst.dev and do a check to avoid crash. v3: No changes. v2: fix the issue in __icmp_send() instead of updating shared dst dev in {ip_md, ip6}_tunnel_xmit. Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24openvswitch: Fix log message in ovs conntrackYi-Hung Wei
Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action") Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24Merge branch 'ieee802154-for-davem-2019-08-24' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2019-08-24 An update from ieee802154 for your *net* tree. Yue Haibing fixed two bugs discovered by KASAN in the hwsim driver for ieee802154 and Colin Ian King cleaned up a redundant variable assignment. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-08-24 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei. 2) Fix a use-after-free in prog symbol exposure, from Daniel. 3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya. 4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan. 5) Fix a potential use-after-free on flow dissector detach, from Jakub. 6) Fix bpftool to close prog fd after showing metadata, from Quentin. 7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0Ilya Leoshkevich
test_select_reuseport fails on s390 due to verifier rejecting test_select_reuseport_kern.o with the following message: ; data_check.eth_protocol = reuse_md->eth_protocol; 18: (69) r1 = *(u16 *)(r6 +22) invalid bpf_context access off=22 size=2 This is because on big-endian machines casts from __u32 to __u16 are generated by referencing the respective variable as __u16 with an offset of 2 (as opposed to 0 on little-endian machines). The verifier already has all the infrastructure in place to allow such accesses, it's just that they are not explicitly enabled for eth_protocol field. Enable them for eth_protocol field by using bpf_ctx_range instead of offsetof. Ditto for ip_protocol, bind_inany and len, since they already allow narrowing, and the same problem can arise when working with them. Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-24flow_dissector: Fix potential use-after-free on BPF_PROG_DETACHJakub Sitnicki
Call to bpf_prog_put(), with help of call_rcu(), queues an RCU-callback to free the program once a grace period has elapsed. The callback can run together with new RCU readers that started after the last grace period. New RCU readers can potentially see the "old" to-be-freed or already-freed pointer to the program object before the RCU update-side NULLs it. Reorder the operations so that the RCU update-side resets the protected pointer before the end of the grace period after which the program will be freed. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-23drop_monitor: Make timestamps y2038 safeIdo Schimmel
Timestamps are currently communicated to user space as 'struct timespec', which is not considered y2038 safe since it uses a 32-bit signed value for seconds. Fix this while the API is still not part of any official kernel release by using 64-bit nanoseconds timestamps instead. Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode") Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23net/rds: Whitelist rdma_cookie and rx_tstamp for usercopyDag Moxnes
Add the RDMA cookie and RX timestamp to the usercopy whitelist. After the introduction of hardened usercopy whitelisting (https://lwn.net/Articles/727322/), a warning is displayed when the RDMA cookie or RX timestamp is copied to userspace: kernel: WARNING: CPU: 3 PID: 5750 at mm/usercopy.c:81 usercopy_warn+0x8e/0xa6 [...] kernel: Call Trace: kernel: __check_heap_object+0xb8/0x11b kernel: __check_object_size+0xe3/0x1bc kernel: put_cmsg+0x95/0x115 kernel: rds_recvmsg+0x43d/0x620 [rds] kernel: sock_recvmsg+0x43/0x4a kernel: ___sys_recvmsg+0xda/0x1e6 kernel: ? __handle_mm_fault+0xcae/0xf79 kernel: __sys_recvmsg+0x51/0x8a kernel: SyS_recvmsg+0x12/0x1c kernel: do_syscall_64+0x79/0x1ae When the whitelisting feature was introduced, the memory for the RDMA cookie and RX timestamp in RDS was not added to the whitelist, causing the warning above. Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com> Tested-by: Jenny <jenny.x.xu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idevSabrina Dubroca
Currently, ipv6_find_idev returns NULL when ipv6_add_dev fails, ignoring the specific error value. This results in addrconf_add_dev returning ENOBUFS in all cases, which is unfortunate in cases such as: # ip link add dummyX type dummy # ip link set dummyX mtu 1200 up # ip addr add 2000::/64 dev dummyX RTNETLINK answers: No buffer space available Commit a317a2f19da7 ("ipv6: fail early when creating netdev named all or default") introduced error returns in ipv6_add_dev. Before that, that function would simply return NULL for all failures. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23net: ipv6: fix listify ip6_rcv_finish in case of forwardingXin Long
We need a similar fix for ipv6 as Commit 0761680d5215 ("net: ipv4: fix listify ip_rcv_finish in case of forwarding") does for ipv4. This issue can be reprocuded by syzbot since Commit 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs") on net-next. The call trace was: kernel BUG at include/linux/skbuff.h:2225! RIP: 0010:__skb_pull include/linux/skbuff.h:2225 [inline] RIP: 0010:skb_pull+0xea/0x110 net/core/skbuff.c:1902 Call Trace: sctp_inq_pop+0x2f1/0xd80 net/sctp/inqueue.c:202 sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385 sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80 sctp_rcv+0x2807/0x3590 net/sctp/input.c:256 sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049 ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397 ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:442 [inline] ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84 ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline] ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282 ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316 __netif_receive_skb_list_ptype net/core/dev.c:5049 [inline] __netif_receive_skb_list_core+0x5fc/0x9d0 net/core/dev.c:5097 __netif_receive_skb_list net/core/dev.c:5149 [inline] netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244 gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757 gro_normal_list net/core/dev.c:5755 [inline] gro_normal_one net/core/dev.c:5769 [inline] napi_frags_finish net/core/dev.c:5782 [inline] napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855 tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020 Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()") Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs") Reported-by: syzbot+eb349eeee854e389c36d@syzkaller.appspotmail.com Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>