summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-11-06net/tls: don't pay attention to sk_write_pending when pushing partial recordsJakub Kicinski
sk_write_pending being not zero does not guarantee that partial record will be pushed. If the thread waiting for memory times out the pending record may get stuck. In case of tls_device there is no path where parial record is set and writer present in the first place. Partial record is set only in tls_push_sg() and tls_push_sg() will return an error immediately. All tls_device callers of tls_push_sg() will return (and not wait for memory) if it failed. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: annotate lockless accesses to sk->sk_max_ack_backlogEric Dumazet
sk->sk_max_ack_backlog can be read without any lock being held at least in TCP/DCCP cases. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: annotate lockless accesses to sk->sk_ack_backlogEric Dumazet
sk->sk_ack_backlog can be read without any lock being held. We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing and/or potential KCSAN warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: use helpers to change sk_ack_backlogEric Dumazet
Writers are holding a lock, but many readers do not. Following patch will add appropriate barriers in sk_acceptq_removed() and sk_acceptq_added(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06inet_diag: use jiffies_delta_to_msecs()Eric Dumazet
Use jiffies_delta_to_msecs() to avoid reporting 'infinite' timeouts and to cleanup code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: neigh: use long type to store jiffies deltaEric Dumazet
A difference of two unsigned long needs long storage. Fixes: c7fb64db001f ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06SUNRPC: Avoid RPC delays when exiting suspendTrond Myklebust
Jon Hunter: "I have been tracking down another suspend/NFS related issue where again I am seeing random delays exiting suspend. The delays can be up to a couple minutes in the worst case and this is causing a suspend test we have to fail." Change the use of a deferrable work to a standard delayed one. Reported-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Fixes: 7e0a0e38fcfea ("SUNRPC: Replace the queue timer with a delayed work function") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-05Merge tag 'batadv-next-for-davem-20191105' of ↵David S. Miller
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Simplify batadv_v_ogm_aggr_list_free using skb_queue_purge, by Christophe Jaillet - Replace aggr_list_lock with lock free skb handlers, by Christophe Jaillet - explicitly mark fallthrough cases, by Sven Eckelmann - Drop lockdep.h include from soft-interface.c, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net_sched: add TCA_STATS_PKT64 attributeEric Dumazet
Now the kernel uses 64bit packet counters in scheduler layer, we want to export these counters to user space. Instead risking breaking user space by adding fields to struct gnet_stats_basic, add a new TCA_STATS_PKT64. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net_sched: extend packet counter to 64bitEric Dumazet
After this change, qdisc packet counter is no longer a 32bit quantity. We still export 32bit values to user. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net: dsa: Add support for devlink resourcesAndrew Lunn
Add wrappers around the devlink resource API, so that DSA drivers can register and unregister devlink resources. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net/tls: fix sk_msg trim on fallback to copy modeJakub Kicinski
sk_msg_trim() tries to only update curr pointer if it falls into the trimmed region. The logic, however, does not take into the account pointer wrapping that sk_msg_iter_var_prev() does nor (as John points out) the fact that msg->sg is a ring buffer. This means that when the message was trimmed completely, the new curr pointer would have the value of MAX_MSG_FRAGS - 1, which is neither smaller than any other value, nor would it actually be correct. Special case the trimming to 0 length a little bit and rework the comparison between curr and end to take into account wrapping. This bug caused the TLS code to not copy all of the message, if zero copy filled in fewer sg entries than memcopy would need. Big thanks to Alexander Potapenko for the non-KMSAN reproducer. v2: - take into account that msg->sg is a ring buffer (John). Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1) Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface") Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com Co-developed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05Change in Openvswitch to support MPLS label depth of 3 in ingress directionMartin Varghese
The openvswitch was supporting a MPLS label depth of 1 in the ingress direction though the userspace OVS supports a max depth of 3 labels. This change enables openvswitch module to support a max depth of 3 labels in the ingress. Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05vsock: Simplify '__vsock_release()'Christophe JAILLET
Use 'skb_queue_purge()' instead of re-implementing it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net: dsa: Fix use after free in dsa_switch_remove()Florian Fainelli
The order in which the ports are deleted from the list and freed and the call to dsa_switch_remove() is done is reversed, which leads to an use after free condition. Reverse the two: first tear down the ports and switch from the fabric, then free the ports associated with that switch fabric. Fixes: 05f294a85235 ("net: dsa: allocate ports on touch") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05net: sched: prevent duplicate flower rules from tcf_proto destroy raceJohn Hurley
When a new filter is added to cls_api, the function tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to determine if the tcf_proto is duplicated in the chain's hashtable. It then creates a new entry or continues with an existing one. In cls_flower, this allows the function fl_ht_insert_unque to determine if a filter is a duplicate and reject appropriately, meaning that the duplicate will not be passed to drivers via the offload hooks. However, when a tcf_proto is destroyed it is removed from its chain before a hardware remove hook is hit. This can lead to a race whereby the driver has not received the remove message but duplicate flows can be accepted. This, in turn, can lead to the offload driver receiving incorrect duplicate flows and out of order add/delete messages. Prevent duplicates by utilising an approach suggested by Vlad Buslov. A hash table per block stores each unique chain/protocol/prio being destroyed. This entry is only removed when the full destroy (and hardware offload) has completed. If a new flow is being added with the same identiers as a tc_proto being detroyed, then the add request is replayed until the destroy is complete. Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution") Signed-off-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reported-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05icmp: remove duplicate codeMatteo Croce
The same code which recognizes ICMP error packets is duplicated several times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which do the same thing. ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types, assume that they should instead. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05taprio: fix panic while hw offload sched list swapIvan Khoronzhuk
Don't swap oper and admin schedules too early, it's not correct and causes crash. Steps to reproduce: 1) tc qdisc replace dev eth0 parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 1@2 \ base-time $SOME_BASE_TIME \ sched-entry S 01 80000 \ sched-entry S 02 15000 \ sched-entry S 04 40000 \ flags 2 2) tc qdisc replace dev eth0 parent root handle 100 taprio \ base-time $SOME_BASE_TIME \ sched-entry S 01 90000 \ sched-entry S 02 20000 \ sched-entry S 04 40000 \ flags 2 3) tc qdisc replace dev eth0 parent root handle 100 taprio \ base-time $SOME_BASE_TIME \ sched-entry S 01 150000 \ sched-entry S 02 200000 \ sched-entry S 04 40000 \ flags 2 Do 2 3 2 .. steps more times if not happens and observe: [ 305.832319] Unable to handle kernel write to read-only memory at virtual address ffff0000087ce7f0 [ 305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted [ 305.919306] Hardware name: Texas Instruments AM654 Base Board (DT) [...] [ 306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80 [ 306.022422] Call trace: [ 306.024866] taprio_free_sched_cb+0x4c/0x98 [ 306.029040] rcu_process_callbacks+0x25c/0x410 [ 306.033476] __do_softirq+0x10c/0x208 [ 306.037132] irq_exit+0xb8/0xc8 [ 306.040267] __handle_domain_irq+0x64/0xb8 [ 306.044352] gic_handle_irq+0x7c/0x178 [ 306.048092] el1_irq+0xb0/0x128 [ 306.051227] arch_cpu_idle+0x10/0x18 [ 306.054795] do_idle+0x120/0x138 [ 306.058015] cpu_startup_entry+0x20/0x28 [ 306.061931] rest_init+0xcc/0xd8 [ 306.065154] start_kernel+0x3bc/0x3e4 [ 306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662) [ 306.074900] ---[ end trace 96c8e2284a9d9d6e ]--- [ 306.079507] Kernel panic - not syncing: Fatal exception in interrupt [ 306.085847] SMP: stopping secondary CPUs [ 306.089765] Kernel Offset: disabled Try to explain one of the possible crash cases: The "real" admin list is assigned when admin_sched is set to new_admin, it happens after "swap", that assigns to oper_sched NULL. Thus if call qdisc show it can crash. Farther, next second time, when sched list is updated, the admin_sched is not NULL and becomes the oper_sched, previous oper_sched was NULL so just skipped. But then admin_sched is assigned new_admin, but schedules to free previous assigned admin_sched (that already became oper_sched). Farther, next third time, when sched list is updated, while one more swap, oper_sched is not null, but it was happy to be freed already (while prev. admin update), so while try to free oper_sched the kernel panic happens at taprio_free_sched_cb(). So, move the "swap emulation" where it should be according to function comment from code. Fixes: 9c66d15646760e ("taprio: Add support for hardware offloading") Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05rfkill: allocate static minorMarcel Holtmann
udev has a feature of creating /dev/<node> device-nodes if it finds a devnode:<node> modalias. This allows for auto-loading of modules that provide the node. This requires to use a statically allocated minor number for misc character devices. However, rfkill uses dynamic minor numbers and prevents auto-loading of the module. So allocate the next static misc minor number and use it for rfkill. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Link: https://lore.kernel.org/r/20191024174042.19851-1-marcel@holtmann.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-04can: j1939: transport: j1939_xtp_rx_eoma_one(): Add sanity check for correct ↵Oleksij Rempel
total message size We were sending malformed EOMA with total message size set to 0. This issue has been fixed in the previous patch. In this patch a sanity check is added to the RX path and a error message is displayed. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: transport: j1939_session_fresh_new(): make sure EOMA is send ↵Oleksij Rempel
with the total message size set We were sending malformed EOMA messageswith total message size set to 0. This patch fixes the bug. Reported-by: https://github.com/linux-can/can-utils/issues/159 Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: fix memory leak if filters was setOleksij Rempel
Filters array is coped from user space and linked to the j1939 socket. On socket release this memory was not freed. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: fix resource leak of skb on error return pathsColin Ian King
Currently the error return paths do not free skb and this results in a memory leak. Fix this by freeing them before the return. Addresses-Coverity: ("Resource leak") Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04Merge branch 'master' of git://blackhole.kfki.hu/nfPablo Neira Ayuso
Jozsef Kadlecsik says: ==================== ipset patches for nf - Fix the error code in ip_set_sockfn_get() when copy_to_user() is used, from Dan Carpenter. - The IPv6 part was missed when fixing copying the right MAC address in the patch "netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets", it is completed now by Stefano Brivio. - ipset nla_policies are fixed to fully support NL_VALIDATE_STRICT and the code is converted from deprecated parsings to verified ones. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables_offload: skip EBUSY on chain updatePablo Neira Ayuso
Do not try to bind a chain again if it exists, otherwise the driver returns EBUSY. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables: bogus EOPNOTSUPP on basechain updatePablo Neira Ayuso
Userspace never includes the NFT_BASE_CHAIN flag, this flag is inferred from the NFTA_CHAIN_HOOK atribute. The chain update path does not allow to update flags at this stage, the existing sanity check bogusly hits EOPNOTSUPP in the basechain case if the offload flag is set on. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04bridge: ebtables: don't crash when using dnat target in output chainsFlorian Westphal
xt_in() returns NULL in the output hook, skip the pkt_type change for that case, redirection only makes sense in broute/prerouting hooks. Reported-by: Tom Yan <tom.ty89@gmail.com> Cc: Linus Lüssing <linus.luessing@c0d3.blue> Fixes: cf3cb246e277d ("bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables: fix unexpected EOPNOTSUPP errorFernando Fernandez Mancera
If the object type doesn't implement an update operation and the user tries to update it will silently ignore the update operation. Fixes: aa4095a156b5 ("netfilter: nf_tables: fix possible null-pointer dereference in object update") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: ipset: Fix nla_policies to fully support NL_VALIDATE_STRICTJozsef Kadlecsik
Since v5.2 (commit "netlink: re-add parse/validate functions in strict mode") NL_VALIDATE_STRICT is enabled. Fix the ipset nla_policies which did not support strict mode and convert from deprecated parsings to verified ones. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 setsStefano Brivio
Same as commit 1b4a75108d5b ("netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets"), another copy and paste went wrong in commit 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets"). When I fixed this for IPv4 in 1b4a75108d5b, I didn't realise that hash:ip,mac sets also support IPv6 as family, and this is covered by a separate function, hash_ipmac6_kadt(). In hash:ip,mac sets, the first dimension is the IP address, and the second dimension is the MAC address: check the IPSET_DIM_TWO_SRC flag in flags while deciding which MAC address to copy, destination or source. This way, mixing source and destination matches for the two dimensions of ip,mac hash type works as expected, also for IPv6. With this setup: ip netns add A ip link add veth1 type veth peer name veth2 netns A ip addr add 2001:db8::1/64 dev veth1 ip -net A addr add 2001:db8::2/64 dev veth2 ip link set veth1 up ip -net A link set veth2 up dst=$(ip netns exec A cat /sys/class/net/veth2/address) ip netns exec A ipset create test_hash hash:ip,mac family inet6 ip netns exec A ipset add test_hash 2001:db8::1,${dst} ip netns exec A ip6tables -A INPUT -p icmpv6 --icmpv6-type 135 -j ACCEPT ip netns exec A ip6tables -A INPUT -m set ! --match-set test_hash src,dst -j DROP ipset now correctly matches a test packet: # ping -c1 2001:db8::2 >/dev/null # echo $? 0 Reported-by: Chen, Yi <yiche@redhat.com> Fixes: 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04netfilter: ipset: Fix an error code in ip_set_sockfn_get()Dan Carpenter
The copy_to_user() function returns the number of bytes remaining to be copied. In this code, that positive return is checked at the end of the function and we return zero/success. What we should do instead is return -EFAULT. Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04netfilter: ipset: Add wildcard support to net,ifaceKristian Evensen
The net,iface equal functions currently compares the full interface names. In several cases, wildcard (or prefix) matching is useful. For example, when converting a large iptables rule-set to make use of ipset, I was able to significantly reduce the number of set elements by making use of wildcard matching. Wildcard matching is enabled by adding "wildcard" when adding an element to a set. Internally, this causes the IPSET_FLAG_IFACE_WILDCARD-flag to be set. When this flag is set, only the initial part of the interface name is used for comparison. Wildcard matching is done per element and not per set, as there are many cases where mixing wildcard and non-wildcard elements are useful. This means that is up to the user to handle (avoid) overlapping interface names. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04dccp: do not leak jiffies on the wireEric Dumazet
For some reason I missed the case of DCCP passive flows in my previous patch. Fixes: a904a0693c18 ("inet: stop leaking jiffies on the wire") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thiemo Nagel <tnagel@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04ipv6: use jhash2() in rt6_exception_hash()Eric Dumazet
Faster jhash2() can be used instead of jhash(), since IPv6 addresses have the needed alignment requirement. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04net: of_get_phy_mode: Change API to solve int/unit warningsAndrew Lunn
Before this change of_get_phy_mode() returned an enum, phy_interface_t. On error, -ENODEV etc, is returned. If the result of the function is stored in a variable of type phy_interface_t, and the compiler has decided to represent this as an unsigned int, comparision with -ENODEV etc, is a signed vs unsigned comparision. Fix this problem by changing the API. Make the function return an error, or 0 on success, and pass a pointer, of type phy_interface_t, where the phy mode should be stored. v2: Return with *interface set to PHY_INTERFACE_MODE_NA on error. Add error checks to all users of of_get_phy_mode() Fixup a few reverse christmas tree errors Fixup a few slightly malformed reverse christmas trees v3: Fix 0-day reported errors. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04net: bridge: fdb: eliminate extra port state tests from fast-pathNikolay Aleksandrov
When commit df1c0b8468b3 ("[BRIDGE]: Packets leaking out of disabled/blocked ports.") introduced the port state tests in br_fdb_update() it was to avoid learning/refreshing from STP BPDUs, it was also used to avoid learning/refreshing from user-space with NTF_USE. Those two tests are done for every packet entering the bridge if it's learning, but for the fast-path we already have them checked in br_handle_frame() and is unnecessary to do it again. Thus push the checks to the unlikely cases and drop them from br_fdb_update(), the new nbp_state_should_learn() helper is used to determine if the port state allows br_fdb_update() to be called. The two places which need to do it manually are: - user-space add call with NTF_USE set - link-local packet learning done in __br_handle_local_finish() Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04netfilter: nf_tables_offload: check for register data length mismatchesPablo Neira Ayuso
Make sure register data length does not mismatch immediate data length, otherwise hit EOPNOTSUPP. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04Bluetooth: Fix advertising duplicated flagsLuiz Augusto von Dentz
Instances may have flags set as part of its data in which case the code should not attempt to add it again otherwise it can cause duplication: < HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 35 Handle: 0x00 Operation: Complete extended advertising data (0x03) Fragment preference: Minimize fragmentation (0x01) Data length: 0x06 Flags: 0x04 BR/EDR Not Supported Flags: 0x06 LE General Discoverable Mode BR/EDR Not Supported Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2019-11-03NFSv4.1: Don't rebind to the same source port when reconnecting to the serverTrond Myklebust
NFSv2, v3 and NFSv4 servers often have duplicate replay caches that look at the source port when deciding whether or not an RPC call is a replay of a previous call. This requires clients to perform strange TCP gymnastics in order to ensure that when they reconnect to the server, they bind to the same source port. NFSv4.1 and NFSv4.2 have sessions that provide proper replay semantics, that do not look at the source port of the connection. This patch therefore ensures they can ignore the rebind requirement. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-03net: icmp6: provide input address for traceroute6Francesco Ruggeri
traceroute6 output can be confusing, in that it shows the address that a router would use to reach the sender, rather than the address the packet used to reach the router. Consider this case: ------------------------ N2 | | ------ ------ N3 ---- | R1 | | R2 |------|H2| ------ ------ ---- | | ------------------------ N1 | ---- |H1| ---- where H1's default route is through R1, and R1's default route is through R2 over N2. traceroute6 from H1 to H2 shows R2's address on N1 rather than on N2. The script below can be used to reproduce this scenario. traceroute6 output without this patch: traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets 1 2000:101::1 (2000:101::1) 0.036 ms 0.008 ms 0.006 ms 2 2000:101::2 (2000:101::2) 0.011 ms 0.008 ms 0.007 ms 3 2000:103::4 (2000:103::4) 0.013 ms 0.010 ms 0.009 ms traceroute6 output with this patch: traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets 1 2000:101::1 (2000:101::1) 0.056 ms 0.019 ms 0.006 ms 2 2000:102::2 (2000:102::2) 0.013 ms 0.008 ms 0.008 ms 3 2000:103::4 (2000:103::4) 0.013 ms 0.009 ms 0.009 ms #!/bin/bash # # ------------------------ N2 # | | # ------ ------ N3 ---- # | R1 | | R2 |------|H2| # ------ ------ ---- # | | # ------------------------ N1 # | # ---- # |H1| # ---- # # N1: 2000:101::/64 # N2: 2000:102::/64 # N3: 2000:103::/64 # # R1's host part of address: 1 # R2's host part of address: 2 # H1's host part of address: 3 # H2's host part of address: 4 # # For example: # the IPv6 address of R1's interface on N2 is 2000:102::1/64 # # Nets are implemented by macvlan interfaces (bridge mode) over # dummy interfaces. # # Create net namespaces ip netns add host1 ip netns add host2 ip netns add rtr1 ip netns add rtr2 # Create nets ip link add net1 type dummy; ip link set net1 up ip link add net2 type dummy; ip link set net2 up ip link add net3 type dummy; ip link set net3 up # Add interfaces to net1, move them to their nemaspaces ip link add link net1 dev host1net1 type macvlan mode bridge ip link set host1net1 netns host1 ip link add link net1 dev rtr1net1 type macvlan mode bridge ip link set rtr1net1 netns rtr1 ip link add link net1 dev rtr2net1 type macvlan mode bridge ip link set rtr2net1 netns rtr2 # Add interfaces to net2, move them to their nemaspaces ip link add link net2 dev rtr1net2 type macvlan mode bridge ip link set rtr1net2 netns rtr1 ip link add link net2 dev rtr2net2 type macvlan mode bridge ip link set rtr2net2 netns rtr2 # Add interfaces to net3, move them to their nemaspaces ip link add link net3 dev rtr2net3 type macvlan mode bridge ip link set rtr2net3 netns rtr2 ip link add link net3 dev host2net3 type macvlan mode bridge ip link set host2net3 netns host2 # Configure interfaces and routes in host1 ip netns exec host1 ip link set lo up ip netns exec host1 ip link set host1net1 up ip netns exec host1 ip -6 addr add 2000:101::3/64 dev host1net1 ip netns exec host1 ip -6 route add default via 2000:101::1 # Configure interfaces and routes in rtr1 ip netns exec rtr1 ip link set lo up ip netns exec rtr1 ip link set rtr1net1 up ip netns exec rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1 ip netns exec rtr1 ip link set rtr1net2 up ip netns exec rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2 ip netns exec rtr1 ip -6 route add default via 2000:102::2 ip netns exec rtr1 sysctl net.ipv6.conf.all.forwarding=1 # Configure interfaces and routes in rtr2 ip netns exec rtr2 ip link set lo up ip netns exec rtr2 ip link set rtr2net1 up ip netns exec rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1 ip netns exec rtr2 ip link set rtr2net2 up ip netns exec rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2 ip netns exec rtr2 ip link set rtr2net3 up ip netns exec rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3 ip netns exec rtr2 sysctl net.ipv6.conf.all.forwarding=1 # Configure interfaces and routes in host2 ip netns exec host2 ip link set lo up ip netns exec host2 ip link set host2net3 up ip netns exec host2 ip -6 addr add 2000:103::4/64 dev host2net3 ip netns exec host2 ip -6 route add default via 2000:103::2 # Ping host2 from host1 ip netns exec host1 ping6 -c5 2000:103::4 # Traceroute host2 from host1 ip netns exec host1 traceroute6 2000:103::4 # Delete nets ip link del net3 ip link del net2 ip link del net1 # Delete namespaces ip netns del rtr2 ip netns del rtr1 ip netns del host2 ip netns del host1 Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Original-patch-by: Honggang Xu <hxu@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03tipc: improve message bundling algorithmTuong Lien
As mentioned in commit e95584a889e1 ("tipc: fix unlimited bundling of small messages"), the current message bundling algorithm is inefficient that can generate bundles of only one payload message, that causes unnecessary overheads for both the sender and receiver. This commit re-designs the 'tipc_msg_make_bundle()' function (now named as 'tipc_msg_try_bundle()'), so that when a message comes at the first place, we will just check & keep a reference to it if the message is suitable for bundling. The message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes we will make a bundle with the first message if possible and so on... This way, a bundle if really needed will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need of bundling, so reduce the overheads to zero. Moreover, since now we have both the messages in hand, we can even optimize the 'tipc_msg_bundle()' function, make bundle of a very large (size ~ MSS) and small messages which is not with the current algorithm e.g. [1400-byte message] + [10-byte message] (MTU = 1500). Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: icmp: use input address in tracerouteFrancesco Ruggeri
Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the primary address of the interface the packet was received on, even if the path goes through a secondary address. In the example: 1.0.3.1/24 ---- 1.0.1.3/24 1.0.1.1/24 ---- 1.0.2.1/24 1.0.2.4/24 ---- |H1|--------------------------|R1|--------------------------|H2| ---- N1 ---- N2 ---- where 1.0.3.1/24 is R1's primary address on N1, traceroute from H1 to H2 returns: traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets 1 1.0.3.1 (1.0.3.1) 0.018 ms 0.006 ms 0.006 ms 2 1.0.2.4 (1.0.2.4) 0.021 ms 0.007 ms 0.007 ms After applying this patch, it returns: traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets 1 1.0.1.1 (1.0.1.1) 0.033 ms 0.007 ms 0.006 ms 2 1.0.2.4 (1.0.2.4) 0.011 ms 0.007 ms 0.007 ms Original-patch-by: Bill Fenner <fenner@arista.com> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: simplify the ovs_dp_cmd_newTonghao Zhang
use the specified functions to init resource. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: don't unlock mutex when changing the user_features failsTonghao Zhang
Unlocking of a not locked mutex is not allowed. Other kernel thread may be in critical section while we unlock it because of setting user_feature fail. Fixes: 95a7233c4 ("net: openvswitch: Set OvS recirc_id from tc chain index") Cc: Paul Blakey <paulb@mellanox.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: fix possible memleak on destroy flow-tableTonghao Zhang
When we destroy the flow tables which may contain the flow_mask, so release the flow mask struct. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: add likely in flow_lookupTonghao Zhang
The most case *index < ma->max, and flow-mask is not NULL. We add un/likely for performance. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: simplify the flow_hashTonghao Zhang
Simplify the code and remove the unnecessary BUILD_BUG_ON. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: optimize flow-mask looking upTonghao Zhang
The full looking up on flow table traverses all mask array. If mask-array is too large, the number of invalid flow-mask increase, performance will be drop. One bad case, for example: M means flow-mask is valid and NULL of flow-mask means deleted. +-------------------------------------------+ | M | NULL | ... | NULL | M| +-------------------------------------------+ In that case, without this patch, openvswitch will traverses all mask array, because there will be one flow-mask in the tail. This patch changes the way of flow-mask inserting and deleting, and the mask array will be keep as below: there is not a NULL hole. In the fast path, we can "break" "for" (not "continue") in flow_lookup when we get a NULL flow-mask. "break" v +-------------------------------------------+ | M | M | NULL |... | NULL | NULL| +-------------------------------------------+ This patch don't optimize slow or control path, still using ma->max to traverse. Slow path: * tbl_mask_array_realloc * ovs_flow_tbl_lookup_exact * flow_mask_find Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: optimize flow mask cache hash collisionTonghao Zhang
Port the codes to linux upstream and with little changes. Pravin B Shelar, says: | In case hash collision on mask cache, OVS does extra flow | lookup. Following patch avoid it. Link: https://github.com/openvswitch/ovs/commit/0e6efbe2712da03522532dc5e84806a96f6a0dd1 Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03net: openvswitch: shrink the mask array if necessaryTonghao Zhang
When creating and inserting flow-mask, if there is no available flow-mask, we realloc the mask array. When removing flow-mask, if necessary, we shrink mask array. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>