summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-12-20llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)Chan Shu Tak, Alex
When a frame with NULL DSAP is received, llc_station_rcv is called. In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1 when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a NULL XID response is returned unexpectedly, e.g. when the incoming frame is a NULL TEST command. To fix the error, simply remove the conditional operator. A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed. Signed-off-by: Chan Shu Tak, Alex <alexchan@task.com.hk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20tipc: make legacy address flag readable over netlinkJohn Rutherford
To enable iproute2/tipc to generate backwards compatible printouts and validate command parameters for nodes using a <z.c.n> node address, it needs to be able to read the legacy address flag from the kernel. The legacy address flag records the way in which the node identity was originally specified. The legacy address flag is requested by the netlink message TIPC_NL_ADDR_LEGACY_GET. If the flag is set the attribute TIPC_NLA_NET_ADDR_LEGACY is set in the return message. Signed-off-by: John Rutherford <john.rutherford@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: dsa: ksz: use common define for tag lenMichael Grzeschik
Remove special taglen define KSZ8795_INGRESS_TAG_LEN and use generic KSZ_INGRESS_TAG_LEN instead. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: dsa: add support for Atheros AR9331 TAG formatOleksij Rempel
Add support for tag format used in Atheros AR9331 built-in switch. Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20xsk: Use struct_size() helperMagnus Karlsson
Improve readability and maintainability by using the struct_size() helper when allocating the AF_XDP rings. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-13-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Add function naming comments and reorder functionsMagnus Karlsson
Add comments on how the ring access functions are named and how they are supposed to be used for producers and consumers. The functions are also reordered so that the consumer functions are in the beginning and the producer functions in the end, for easier reference. Put this in a separate patch as the diff might look a little odd, but no functionality has changed in this patch. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-12-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Remove unnecessary READ_ONCE of dataMagnus Karlsson
There are two unnecessary READ_ONCE of descriptor data. These are not needed since the data is written by the producer before it signals that the data is available by incrementing the producer pointer. As the access to this producer pointer is serialized and the consumer always reads the descriptor after it has read and synchronized with the producer counter, the write of the descriptor will have fully completed and it does not matter if the consumer has any read tearing. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-11-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: ixgbe: i40e: ice: mlx5: Xsk_umem_discard_addr to xsk_umem_release_addrMagnus Karlsson
Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to better reflect the new naming of the AF_XDP queue manipulation functions. As this functions is used by drivers implementing support for AF_XDP zero-copy, it requires a name change to these drivers. The function xsk_umem_release_addr_rq has also changed name in the same fashion. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Change names of validation functionsMagnus Karlsson
Change the names of the validation functions to better reflect what they are doing. The uppermost ones are reading entries from the rings and only the bottom ones validate entries. So xskq_cons_read_ is a better prefix name. Also change the xskq_cons_read_ functions to return a bool as the the descriptor or address is already returned by reference in the parameters. Everyone is using the return value as a bool anyway. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Simplify the consumer ring access functionsMagnus Karlsson
Simplify and refactor consumer ring functions. The consumer first "peeks" to find descriptors or addresses that are available to read from the ring, then reads them and finally "releases" these descriptors once it is done. The two local variables cons_tail and cons_head are turned into one single variable called cached_cons. cached_tail referred to the cached value of the global consumer pointer and will be stored in cached_cons. For cached_head, we just use cached_prod instead as it was not used for a consumer queue before. It also better reflects what it really is now: a cached copy of the producer pointer. The names of the functions are also renamed in the same manner as the producer functions. The new functions are called xskq_cons_ followed by what it does. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-8-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Simplify xskq_nb_avail and xskq_nb_freeMagnus Karlsson
At this point, there are no users of the functions xskq_nb_avail and xskq_nb_free that take any other number of entries argument than 1, so let us get rid of the second argument that takes the number of entries. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-7-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Eliminate the RX batch sizeMagnus Karlsson
In the xsk consumer ring code there is a variable called RX_BATCH_SIZE that dictates the minimum number of entries that we try to grab from the fill and Tx rings. In fact, the code always try to grab the maximum amount of entries from these rings. The only thing this variable does is to throw an error if there is less than 16 (as it is defined) entries on the ring. There is no reason to do this and it will just lead to weird behavior from user space's point of view. So eliminate this variable. With this change, we will be able to simplify the xskq_nb_free and xskq_nb_avail code in the next commit. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-6-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Standardize naming of producer ring access functionsMagnus Karlsson
Adopt the naming of the producer ring access functions to have a similar naming convention as the functions in libbpf, but adapted to the kernel. You first reserve a number of entries that you later submit to the global state of the ring. This is much clearer, IMO, than the one that was in the kernel part. Once renamed, we also discover that two functions are actually the same, so remove one of them. Some of the primitive ring submission operations are also the same so break these out into __xskq_prod_submit that the upper level ring access functions can use. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-5-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Consolidate to one single cached producer pointerMagnus Karlsson
Currently, the xsk ring code has two cached producer pointers: prod_head and prod_tail. This patch consolidates these two into a single one called cached_prod to make the code simpler and easier to maintain. This will be in line with the user space part of the the code found in libbpf, that only uses a single cached pointer. The Rx path only uses the two top level functions xskq_produce_batch_desc and xskq_produce_flush_desc and they both use prod_head and never prod_tail. So just move them over to cached_prod. The Tx XDP_DRV path uses xskq_produce_addr_lazy and xskq_produce_flush_addr_n and unnecessarily operates on both prod_tail and prod_head, so move them over to just use cached_prod by skipping the intermediate step of updating prod_tail. The Tx path in XDP_SKB mode uses xskq_reserve_addr and xskq_produce_addr. They currently use both cached pointers, but we can operate on the global producer pointer in xskq_produce_addr since it has to be updated anyway, thus eliminating the use of both cached pointers. We can also remove the xskq_nb_free in xskq_produce_addr since it is already called in xskq_reserve_addr. No need to do it twice. When there is only one cached producer pointer, we can also simplify xskq_nb_free by removing one argument. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-4-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Simplify detection of empty and full ringsMagnus Karlsson
In order to set the correct return flags for poll, the xsk code has to check if the Rx queue is empty and if the Tx queue is full. This code was unnecessarily large and complex as it used the functions that are used to update the local state from the global state (xskq_nb_free and xskq_nb_avail). Since we are not doing this nor updating any data dependent on this state, we can simplify the functions. Another benefit from this is that we can also simplify the xskq_nb_free and xskq_nb_avail functions in a later commit. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-3-git-send-email-magnus.karlsson@intel.com
2019-12-20xsk: Eliminate the lazy update thresholdMagnus Karlsson
The lazy update threshold was introduced to keep the producer and consumer some distance apart in the completion ring. This was important in the beginning of the development of AF_XDP as the ring format as that point in time was very sensitive to the producer and consumer being on the same cache line. This is not the case anymore as the current ring format does not degrade in any noticeable way when this happens. Moreover, this threshold makes it impossible to run the system with rings that have less than 128 entries. So let us remove this threshold and just get one entry from the ring as in all other functions. This will enable us to remove this function in a later commit. Note that xskq_produce_addr_lazy followed by xskq_produce_flush_addr_n are still not the same function as xskq_produce_addr() as it operates on another cached pointer. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1576759171-28550-2-git-send-email-magnus.karlsson@intel.com
2019-12-20rxrpc: Fix missing security check on incoming callsDavid Howells
Fix rxrpc_new_incoming_call() to check that we have a suitable service key available for the combination of service ID and security class of a new incoming call - and to reject calls for which we don't. This causes an assertion like the following to appear: rxrpc: Assertion failed - 6(0x6) == 12(0xc) is false kernel BUG at net/rxrpc/call_object.c:456! Where call->state is RXRPC_CALL_SERVER_SECURING (6) rather than RXRPC_CALL_COMPLETE (12). Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-12-20rxrpc: Don't take call->user_mutex in rxrpc_new_incoming_call()David Howells
Standard kernel mutexes cannot be used in any way from interrupt or softirq context, so the user_mutex which manages access to a call cannot be a mutex since on a new call the mutex must start off locked and be unlocked within the softirq handler to prevent userspace interfering with a call we're setting up. Commit a0855d24fc22d49cdc25664fb224caee16998683 ("locking/mutex: Complain upon mutex API misuse in IRQ contexts") causes big warnings to be splashed in dmesg for each a new call that comes in from the server. Whilst it *seems* like it should be okay, since the accept path uses trylock, there are issues with PI boosting and marking the wrong task as the owner. Fix this by not taking the mutex in the softirq path at all. It's not obvious that there should be any need for it as the state is set before the first notification is generated for the new call. There's also no particular reason why the link-assessing ping should be triggered inside the mutex. It's not actually transmitted there anyway, but rather it has to be deferred to a workqueue. Further, I don't think that there's any particular reason that the socket notification needs to be done from within rx->incoming_lock, so the amount of time that lock is held can be shortened too and the ping prepared before the new call notification is sent. Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Signed-off-by: David Howells <dhowells@redhat.com> cc: Peter Zijlstra (Intel) <peterz@infradead.org> cc: Ingo Molnar <mingo@redhat.com> cc: Will Deacon <will@kernel.org> cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-20rxrpc: Unlock new call in rxrpc_new_incoming_call() rather than the callerDavid Howells
Move the unlock and the ping transmission for a new incoming call into rxrpc_new_incoming_call() rather than doing it in the caller. This makes it clearer to see what's going on. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> cc: Ingo Molnar <mingo@redhat.com> cc: Will Deacon <will@kernel.org> cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-19xdp: Simplify __bpf_tx_xdp_map()Björn Töpel
The explicit error checking is not needed. Simply return the error instead. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-9-bjorn.topel@gmail.com
2019-12-19xdp: Remove map_to_flush and map swap detectionBjörn Töpel
Now that all XDP maps that can be used with bpf_redirect_map() tracks entries to be flushed in a global fashion, there is not need to track that the map has changed and flush from xdp_do_generic_map() anymore. All entries will be flushed in xdp_do_flush_map(). This means that the map_to_flush can be removed, and the corresponding checks. Moving the flush logic to one place, xdp_do_flush_map(), give a bulking behavior and performance boost. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-8-bjorn.topel@gmail.com
2019-12-19xdp: Make cpumap flush_list common for all map instancesBjörn Töpel
The cpumap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all devmaps, which simplifies __cpu_map_flush() and cpu_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-7-bjorn.topel@gmail.com
2019-12-19xdp: Make devmap flush_list common for all map instancesBjörn Töpel
The devmap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all devmaps, which simplifies __dev_map_flush() and dev_map_init_map(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-6-bjorn.topel@gmail.com
2019-12-19xsk: Make xskmap flush_list common for all map instancesBjörn Töpel
The xskmap flush list is used to track entries that need to flushed from via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all xskmaps, which simplifies __xsk_map_flush() and xsk_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com
2019-12-19net/sched: cls_u32: fix refcount leak in the error path of u32_change()Davide Caratti
when users replace cls_u32 filters with new ones having wrong parameters, so that u32_change() fails to validate them, the kernel doesn't roll-back correctly, and leaves semi-configured rules. Fix this in u32_walk(), avoiding a call to the walker function on filters that don't have a match rule connected. The side effect is, these "empty" filters are not even dumped when present; but that shouldn't be a problem as long as we are restoring the original behaviour, where semi-configured filters were not even added in the error path of u32_change(). Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-19net/tls: add helper for testing if socket is RX offloadedJakub Kicinski
There is currently no way for driver to reliably check that the socket it has looked up is in fact RX offloaded. Add a helper. This allows drivers to catch misbehaving firmware. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20netfilter: nft_tproxy: Fix port selector on Big EndianPhil Sutter
On Big Endian architectures, u16 port value was extracted from the wrong parts of u32 sreg_port, just like commit 10596608c4d62 ("netfilter: nf_tables: fix mismatch in big-endian system") describes. Fixes: 4ed8eb6570a49 ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-20netfilter: ebtables: compat: reject all padding in matches/watchersFlorian Westphal
syzbot reported following splat: BUG: KASAN: vmalloc-out-of-bounds in size_entry_mwt net/bridge/netfilter/ebtables.c:2063 [inline] BUG: KASAN: vmalloc-out-of-bounds in compat_copy_entries+0x128b/0x1380 net/bridge/netfilter/ebtables.c:2155 Read of size 4 at addr ffffc900004461f4 by task syz-executor267/7937 CPU: 1 PID: 7937 Comm: syz-executor267 Not tainted 5.5.0-rc1-syzkaller #0 size_entry_mwt net/bridge/netfilter/ebtables.c:2063 [inline] compat_copy_entries+0x128b/0x1380 net/bridge/netfilter/ebtables.c:2155 compat_do_replace+0x344/0x720 net/bridge/netfilter/ebtables.c:2249 compat_do_ebt_set_ctl+0x22f/0x27e net/bridge/netfilter/ebtables.c:2333 [..] Because padding isn't considered during computation of ->buf_user_offset, "total" is decremented by fewer bytes than it should. Therefore, the first part of if (*total < sizeof(*entry) || entry->next_offset < sizeof(*entry)) will pass, -- it should not have. This causes oob access: entry->next_offset is past the vmalloced size. Reject padding and check that computed user offset (sum of ebt_entry structure plus all individual matches/watchers/targets) is same value that userspace gave us as the offset of the next entry. Reported-by: syzbot+f68108fed972453a0ad4@syzkaller.appspotmail.com Fixes: 81e675c227ec ("netfilter: ebtables: add CONFIG_COMPAT support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-20netfilter: nf_flow_table: fix big-endian integer overflowArnd Bergmann
In some configurations, gcc reports an integer overflow: net/netfilter/nf_flow_table_offload.c: In function 'nf_flow_rule_match': net/netfilter/nf_flow_table_offload.c:80:21: error: unsigned conversion from 'int' to '__be16' {aka 'short unsigned int'} changes value from '327680' to '0' [-Werror=overflow] mask->tcp.flags = TCP_FLAG_RST | TCP_FLAG_FIN; ^~~~~~~~~~~~ From what I can tell, we want the upper 16 bits of these constants, so they need to be shifted in cpu-endian mode. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-12-19 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 269 insertions(+), 108 deletions(-). The main changes are: 1) Fix lack of synchronization between xsk wakeup and destroying resources used by xsk wakeup, from Maxim Mikityanskiy. 2) Fix pruning with tail call patching, untrack programs in case of verifier error and fix a cgroup local storage tracking bug, from Daniel Borkmann. 3) Fix clearing skb->tstamp in bpf_redirect() when going from ingress to egress which otherwise cause issues e.g. on fq qdisc, from Lorenz Bauer. 4) Fix compile warning of unused proc_dointvec_minmax_bpf_restricted() when only cBPF is present, from Alexander Lobakin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-19net, sysctl: Fix compiler warning when only cBPF is presentAlexander Lobakin
proc_dointvec_minmax_bpf_restricted() has been firstly introduced in commit 2e4a30983b0f ("bpf: restrict access to core bpf sysctls") under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations"), because a new sysctl, bpf_jit_limit, made use of it. Finally, this parameter has become long instead of integer with fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") and thus, a new proc_dolongvec_minmax_bpf_restricted() has been added. With this last change, we got back to that proc_dointvec_minmax_bpf_restricted() is used only under CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been brought back. So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n since v4.20 we have: CC net/core/sysctl_net_core.o net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function] 292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again. Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Signed-off-by: Alexander Lobakin <alobakin@dlink.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
2019-12-19xsk: Add rcu_read_lock around the XSK wakeupMaxim Mikityanskiy
The XSK wakeup callback in drivers makes some sanity checks before triggering NAPI. However, some configuration changes may occur during this function that affect the result of those checks. For example, the interface can go down, and all the resources will be destroyed after the checks in the wakeup function, but before it attempts to use these resources. Wrap this callback in rcu_read_lock to allow driver to synchronize_rcu before actually destroying the resources. xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup wrapped into the RCU lock. After this commit, xsk_poll starts using xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide ndo_xsk_wakeup should be called. It also fixes a bug introduced with the need_wakeup feature: a non-zero-copy socket may be used with a driver supporting zero-copy, and in this case ndo_xsk_wakeup should not be called, so the xs->zc check is the correct one. Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com
2019-12-18bpf: Allow to change skb mark in test_runNikita V. Shirokov
allow to pass skb's mark field into bpf_prog_test_run ctx for BPF_PROG_TYPE_SCHED_CLS prog type. that would allow to test bpf programs which are doing decision based on this field Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-12-18net: sch_ets: Make the ETS qdisc offloadablePetr Machata
Add hooks at appropriate points to make it possible to offload the ETS Qdisc. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18net: sch_ets: Add a new QdiscPetr Machata
Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is PRIO-like in how it is configured, meaning one needs to specify how many bands there are, how many are strict and how many are dwrr, quanta for the latter, and priomap. The new Qdisc operates like the PRIO / DRR combo would when configured as per the standard. The strict classes, if any, are tried for traffic first. When there's no traffic in any of the strict queues, the ETS ones (if any) are treated in the same way as in DRR. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18sch_cake: drop unused variable tin_quantum_prioKevin 'ldir' Darbyshire-Bryant
Turns out tin_quantum_prio isn't used anymore and is a leftover from a previous implementation of diffserv tins. Since the variable isn't used in any calculations it can be eliminated. Drop variable and places where it was set. Rename remaining variable and consolidate naming of intermediate variables that set it. Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18net: nfc: nci: fix a possible sleep-in-atomic-context bug in ↵Jia-Ju Bai
nci_uart_tty_receive() The kernel may sleep while holding a spinlock. The function call path (from bottom to top) in Linux 4.19 is: net/nfc/nci/uart.c, 349: nci_skb_alloc in nci_uart_default_recv_buf net/nfc/nci/uart.c, 255: (FUNC_PTR)nci_uart_default_recv_buf in nci_uart_tty_receive net/nfc/nci/uart.c, 254: spin_lock in nci_uart_tty_receive nci_skb_alloc(GFP_KERNEL) can sleep at runtime. (FUNC_PTR) means a function pointer is called. To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC for nci_skb_alloc(). This bug is found by a static analysis tool STCheck written by myself. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18nfs: use time64_t internallyArnd Bergmann
The timestamps for the cache are all in boottime seconds, so they don't overflow 32-bit values, but the use of time_t is deprecated because it generally does overflow when used with wall-clock time. There are multiple possible ways of avoiding it: - leave time_t, which is safe here, but forces others to look into this code to determine that it is over and over. - use a more generic type, like 'int' or 'long', which is known to be sufficient here but loses the documentation of referring to timestamps - use ktime_t everywhere, and convert into seconds in the few places where we want realtime-seconds. The conversion is sometimes expensive, but not more so than the conversion we do today. - use time64_t to clarify that this code is safe. Nothing would change for 64-bit architectures, but it is slightly less efficient on 32-bit architectures. Without a clear winner of the three approaches above, this picks the last one, favouring readability over a small performance loss on 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18sunrpc: convert to time64_t for expiryArnd Bergmann
Using signed 32-bit types for UTC time leads to the y2038 overflow, which is what happens in the sunrpc code at the moment. This changes the sunrpc code over to use time64_t where possible. The one exception is the gss_import_v{1,2}_context() function for kerberos5, which uses 32-bit timestamps in the protocol. Here, we can at least treat the numbers as 'unsigned', which extends the range from 2038 to 2106. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18packet: clarify timestamp overflowArnd Bergmann
The memory mapped packet socket data structure in version 1 through 3 all contain 32-bit second values for the packet time stamps, which makes them suffer from the overflow of time_t in y2038 or y2106 (depending on whether user space interprets the value as signed or unsigned). The implementation uses the deprecated getnstimeofday() function. In order to get rid of that, this changes the code to use ktime_get_real_ts64() as a replacement, documenting the nature of the overflow. As long as the user applications treat the timestamps as unsigned, or only use the difference between timestamps, they are fine, and changing the timestamps to 64-bit wouldn't require a more invasive user space API change. Note: a lot of other APIs suffer from incompatible structures when time_t gets redefined to 64-bit in 32-bit user space, but this one does not. Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/lkml/CAF=yD-Jomr-gWSR-EBNKnSpFL46UeG564FLfqTCMNEm-prEaXA@mail.gmail.com/T/#u Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-17net-sysfs: Call dev_hold always in rx_queue_add_kobjectJouni Hogander
Dev_hold has to be called always in rx_queue_add_kobject. Otherwise usage count drops below 0 in case of failure in kobject_init_and_add. Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject") Reported-by: syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Miller <davem@davemloft.net> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17net: dsa: make unexported dsa_link_touch() staticBen Dooks (Codethink)
dsa_link_touch() is not exported, or defined outside of the file it is in so make it static to avoid the following warning: net/dsa/dsa2.c:127:17: warning: symbol 'dsa_link_touch' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17net: annotate lockless accesses to sk->sk_pacing_shiftEric Dumazet
sk->sk_pacing_shift can be read and written without lock synchronization. This patch adds annotations to document this fact and avoid future syzbot complains. This might also avoid unexpected false sharing in sk_pacing_shift_update(), as the compiler could remove the conditional check and always write over sk->sk_pacing_shift : if (sk->sk_pacing_shift != val) sk->sk_pacing_shift = val; Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17sctp: fix memleak on err handling of stream initializationMarcelo Ricardo Leitner
syzbot reported a memory leak when an allocation fails within genradix_prealloc() for output streams. That's because genradix_prealloc() leaves initialized members initialized when the issue happens and SCTP stack will abort the current initialization but without cleaning up such members. The fix here is to always call genradix_free() when genradix_prealloc() fails, for output and also input streams, as it suffers from the same issue. Reported-by: syzbot+772d9e36c490b18d51d1@syzkaller.appspotmail.com Fixes: 2075e50caf5e ("sctp: convert to genradix") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17net: dsa: Make PHYLINK related function static againFlorian Fainelli
Commit 77373d49de22 ("net: dsa: Move the phylink driver calls into port.c") moved and exported a bunch of symbols, but they are not used outside of net/dsa/port.c at the moment, so no reason to export them. Reported-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Acked-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17tipc: don't send gap blocks in ACK messagesJon Maloy
In the commit referred to below we eliminated sending of the 'gap' indicator in regular ACK messages, reserving this to explicit NACK ditto. Unfortunately we missed to also eliminate building of the 'gap block' area in ACK messages. This area is meant to report gaps in the received packet sequence following the initial gap, so that lost packets can be retransmitted earlier and received out-of-sequence packets can be released earlier. However, the interpretation of those blocks is dependent on a complete and correct sequence of gaps and acks. Hence, when the initial gap indicator is missing a single gap block will be interpreted as an acknowledgment of all preceding packets. This may lead to packets being released prematurely from the sender's transmit queue, with easily predicatble consequences. We now fix this by not building any gap block area if there is no initial gap to report. Fixes: commit 02288248b051 ("tipc: eliminate gap indicator from ACK messages") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17netfilter: conntrack: remove two export symbolsFlorian Westphal
Not used anywhere, remove them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-17netfilter: nft_tunnel: add the missing nla_nest_cancel()Xin Long
When nla_put_xxx() fails under nla_nest_start_noflag(), nla_nest_cancel() should be called, so that the skb can be trimmed properly. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-17netfilter: nft_tunnel: also dump OPTS_ERSPAN/VXLANXin Long
This patch is to add the nest attr OPTS_ERSPAN/VXLAN when dumping KEY_OPTS, and it would be helpful when parsing in userpace. Also, this is needed for supporting multiple geneve opts in the future patches. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-17netfilter: nft_tunnel: also dump ERSPAN_VERSIONXin Long
This is not necessary, but it'll be easier to parse in userspace, also given that other places like act_tunnel_key, cls_flower and ip_tunnel_core are also doing so. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>