Age | Commit message (Collapse) | Author |
|
smc_netinfo_by_tcpsk() looks up the routing cache. Such a lookup requires
protection by an RCU read lock.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SMC receive function currently lacks a timeout check under the
condition that no data were received and no data are available. This
patch adds such a check.
Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the infiniband part, SMC currently uses get_netdev which calls
dev_hold on the returned net device. However, the SMC code never calls
dev_put on that net device resulting in a wrong reference count.
This patch adds a dev_put after the usage of the net device to fix the
issue.
Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dumping a DSA port's FDB entries is not specific to a DSA slave, so add
a dsa_port_fdb_dump function, similarly to dsa_port_fdb_add and
dsa_port_fdb_del.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A few DSA slave functions take a dsa_slave_priv pointer as first
argument, whereas the scope of the slave.c functions is the slave
net_device structure. Fix this and rename dsa_netpoll_send_skb to
dsa_slave_netpoll_send_skb.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 3f1ac7a700d0 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
deprecated the ethtool_cmd::transceiver field, which was fine in
premise, except that the PHY library was actually using it to report the
type of transceiver: internal or external.
Use the first word of the reserved field to put this __u8 transceiver
field back in. It is made read-only, and we don't expect the
ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
is mostly for the legacy path where we do:
ethtool_get_settings()
-> dev->ethtool_ops->get_link_ksettings()
-> convert_link_ksettings_to_legacy_settings()
to have no information loss compared to the legacy get_settings API.
Fixes: 3f1ac7a700d0 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit 1dced6a85482 ("ipv4: Restore accept_local behaviour
in fib_validate_source()") a full fib lookup is needed even if
the rp_filter is disabled, if accept_local is false - which is
the default.
What we really need in the above scenario is just checking
that the source IP address is not local, and in most case we
can do that is a cheaper way looking up the ifaddr hash table.
This commit adds a helper for such lookup, and uses it to
validate the src address when rp_filter is disabled and no
'local' routes are created by the user space in the relevant
namespace.
A new ipv4 netns flag is added to account for such routes.
We need that to preserve the same behavior we had before this
patch.
It also drops the checks to bail early from __fib_validate_source,
added by the commit 1dced6a85482 ("ipv4: Restore accept_local
behaviour in fib_validate_source()") they do not give any
measurable performance improvement: if we do the lookup with are
on a slower path.
This improves UDP performances for unconnected sockets
when rp_filter is disabled by 5% and also gives small but
measurable performance improvement for TCP flood scenarios.
v1 -> v2:
- use the ifaddr lookup helper in __ip_dev_find(), as suggested
by Eric
- fall-back to full lookup if custom local routes are present
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixes: c15ab236d69d ("net/sched: Change cls_flower to use IDR")
Cc: Chris Mi <chrism@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If real-time or fair-share curves are enabled in hfsc_change_class()
class isn't inserted into rb-trees yet. Thus init_ed() and init_vf()
must be called in place of update_ed() and update_vf().
Remove isn't required because for now curves cannot be disabled.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SKB stored in qdisc->gso_skb also counted into backlog.
Some qdiscs don't reset backlog to zero in ->reset(),
for example sfq just dequeue and free all queued skb.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
According to IEEE Std 802.11-2016 (16.2.3.4 Long PHY SIGNAL field) all of
the following rates are mandatory for a HR/DSSS PHY: 1 Mb/s, 2 Mb/s,
5.5 Mb/s and 11 Mb/s. Set IEEE80211_RATE_MANDATORY_B flag for all of these
instead of just 1 Mb/s to correctly reflect this.
Signed-off-by: Richard Schütz <rschuetz@uni-koblenz.de>
[johannes: use switch statement]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Sometimes a station is added already in ASSOC state. For example,
in AP mode, when a client station didn't get assoc resp and sends
an assoc req again. If a station is inserted when its state is ASSOC
or higher, the min chandef and allow_p2p_go_ps should be recalculated
again after the insertion.
Before this patch the recalculation happened only in sta_info_move_state
which occurs before the insertion of the sta and thus even though
it calls ieee80211_recalc_min_chandef/_p2p_go_ps_allowed functions,
since sdata->local->sta_list is still empty at this point, it doesn't do
anything.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The user space can now allow the kernel to associate to an AP that
requires MFP or that doesn't have MFP enabled in the same
NL80211_CMD_CONNECT command, by using a new NL80211_MFP_OPTIONAL flag.
The driver / firmware will decide whether to use it or not.
Include a feature bit to advertise support for NL80211_MFP_OPTIONAL.
This allows new user space to run on old kernels and know that it
cannot use the new attribute if it isn't supported.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This function hasn't been used since the removal of iwmc3200wifi
in 2012. It also appears to have a bug when qos=True, since then
it'll copy uninitialized stack memory to the SKB.
Just remove the function entirely.
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This was created using the following spatch:
@find@
type S;
expression M, M2;
position p;
@@
offsetof(S, M) + sizeof(M2)@p
@script:python@
m << find.M;
m2 << find.M2;
@@
if not m2.endswith('-> ' + m):
cocci.include_match(False)
@change@
type find.S;
expression find.M, find.M2;
position find.p;
@@
-offsetof(S, M) + sizeof(M2)@p
+offsetofend(S, M)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Simplify the locking in ieee80211_sta_tear_down_BA_sessions() and
lock sta->ampdu_mlme.mtx over the entire function instead of
locking/unlocking it for each TID etc.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add documentation to ieee80211_rx_ba_offl() function and, while at it,
rename the bit argument to tid, for consistency.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
One of OCE's optimizations is acception of broadcast probe responses.
Accept broadcast probe responses but don't set
NL80211_EXT_FEATURE_ACCEPT_BCAST_PROBE_RESP. Because a device's firmware
may filter out the broadcast probe resp - drivers should set this flag.
Signed-off-by: Roee Zamir <roee.zamir@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[johannes: make accepting broadcast conditional on the nl80211 scan
flag that was added for that specific purpose]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add Optimized Connectivity Experience (OCE) scan and capability flags.
Some of them unique to OCE and some are stand alone.
And add scan flags to enable/disable them.
Signed-off-by: Roee Zamir <roee.zamir@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When NL80211_ATTR_WIPHY_CHANNEL_TYPE is given, nl80211 would parse the
channel definition the old way, discarding NL80211_ATTR_CENTER_FREQ1,
NL80211_ATTR_CENTER_FREQ2 etc. However, it is possible that user space
added both NL80211_ATTR_WIPHY_CHANNEL_TYPE and NL80211_ATTR_CENTER_FREQ1
or NL80211_ATTR_CENTER_FREQ2 assuming that all settings would be honored.
In such a case, validate that NL80211_ATTR_CENTER_FREQ1 and
NL80211_ATTR_CENTER_FREQ2 values match the channel configuration,
as otherwise user space would assume that the desired configuration was
applied.
For example, when trying to start ap with
NL80211_ATTR_WIPHY_CHANNEL_TYPE = NL80211_CHAN_HT40MINUS,
NL80211_ATTR_WIPHY_FREQ = 5180 and NL80211_ATTR_CENTER_FREQ1 = 5250
without this fix, the ap will start on channel 36 (center_freq1 will be
corrected to 5180). With this fix, we will throw an error instead.
Signed-off-by: Tova Mussai <tova.mussai@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
There's no need to split off IEs from the ones obtained
from userspace, if they were already split off, so for
example IEs that went before HT don't have to be listed
again to go before VHT. Simplify the code here so it's
clearer.
While at it, also clarify the comments regarding the DMG
(60 GHz) elements.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Current ieee80211_ie_split() implementation doesn't
account for elements that are sub-elements of the
EXTENSION IE. To extend support to these IEs as well,
treat the WLAN_EID_EXTENSION ids in the %ids array
as indicating that the next id in the array is a
sub-element of the EXTENSION IE.
Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
For AP_VLAN and monitor interfaces we'll never use the TXQs
we allocated, so avoid doing so.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two Netfilter fixes for your net tree,
they are:
1) Fix NAt compilation with UP, from Geert Uytterhoeven.
2) Fix incorrect number of entries when dumping a set, from
Vishwanath Pai.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of open coding the check.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since XDP's view of the packet includes the MAC header, moving the start-
of-packet with bpf_xdp_adjust_head needs to also update the offset of the
MAC header (which is relative to skb->head, not to the skb->data that was
changed).
Without this, tcpdump sees packets starting from the old MAC header rather
than the new one, at least in my tests on the loopback device.
Fixes: b5cdae3291f7 ("net: Generic XDP")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 00ba4cb36da682c68dc87d1703a8aaffe2b4e9c5.
Discussion with David Ahern determined that this change is
actually not needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The actual length of cmsg fetched in during the second loop
(i.e., kcmsg - kcmsg_base) could be different from what we
get from the first loop (i.e., kcmlen).
The main reason is that the two get_user() calls in the two
loops (i.e., get_user(ucmlen, &ucmsg->cmsg_len) and
__get_user(ucmlen, &ucmsg->cmsg_len)) could cause ucmlen
to have different values even they fetch from the same userspace
address, as user can race to change the memory content in
&ucmsg->cmsg_len across fetches.
Although in the second loop, the sanity check
if ((char *)kcmsg_base + kcmlen - (char *)kcmsg < CMSG_ALIGN(tmp))
is inplace, it only ensures that the cmsg fetched in during the
second loop does not exceed the length of kcmlen, but not
necessarily equal to kcmlen. But indicated by the assignment
kmsg->msg_controllen = kcmlen, we should enforce that.
This patch adds this additional sanity check and ensures that
what is recorded in kmsg->msg_controllen is the actual cmsg length.
Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
reduced greatly the cacheline contention between the BH and the US
reader batching the rmem updates in most scenarios.
Such optimization is explicitly avoided if the US reader is faster
then BH processing.
My fault, I initially suggested this kind of behavior due to concerns
of possible regressions with small sk_rcvbuf values. Tests showed
such concerns are misplaced, so this commit relaxes the condition
for rmem bulk updates, obtaining small but measurable performance
gain in the scenario described above.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, when an interface is released from a bridge via
ioctl(), we get a RTM_DELLINK event through netlink:
Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 6e:23:c2:54:3a:b3
Userspace has to interpret that as a removal from the bridge, not as a
complete removal of the interface. When an bridged interface is
completely removed, we get two events:
Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 master bridge0 state DOWN
link/ether 6e:23:c2:54:3a:b3
Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 6e:23:c2:54:3a:b3 brd ff:ff:ff:ff:ff:ff
In constrast, when an interface is released from a bond, we get a
RTM_NEWLINK with only the new characteristics (no master):
3: dummy1: <BROADCAST,NOARP,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond0 state UNKNOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
link/ether ca:c8:7b:66:f8:25 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
Userland may be confused by the fact we say a link is deleted while
its characteristics are only modified. A first solution would have
been to turn the RTM_DELLINK event in del_nbp() into a RTM_NEWLINK
event. However, maybe some piece of userland is relying on this
RTM_DELLINK to detect when a bridged interface is released. Instead,
we also emit a RTM_NEWLINK event once the interface is
released (without master info).
Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 8a:bb:e7:94:b1:f8
2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 8a:bb:e7:94:b1:f8 brd ff:ff:ff:ff:ff:ff
This is done only when using ioctl(). When using Netlink, such an
event is already automatically emitted in do_setlink().
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Packet socket bind operations must hold the po->bind_lock. This keeps
po->running consistent with whether the socket is actually on a ptype
list to receive packets.
fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
binds the fanout object to receive through packet_rcv_fanout.
Make it hold the po->bind_lock when testing po->running and rebinding.
Else, it can race with other rebind operations, such as that in
packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
can result in a socket being added to a fanout group twice, causing
use-after-free KASAN bug reports, among others.
Reported independently by both trinity and syzkaller.
Verified that the syzkaller reproducer passes after this patch.
Fixes: dc99f600698d ("packet: Add fanout support.")
Reported-by: nixioaming <nixiaoming@huawei.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In ipv6_skip_exthdr, the lengh of AH header is computed manually
as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
named ipv6_authlen is already defined for exactly the same job. This
commit replaces the manual computation code with the macro.
Signed-off-by: Xiang Gao <qasdfgtyuiop@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
According to 802.15.4-2003/2006/2015 specifications the MAC frame is
composed of MHR, MAC payload and MFR and just the outgoing MAC payload
must be encrypted.
If communication is secure,sender build Auxiliary Security Header(ASH),
insert it next to the standard MHR header with security enabled bit ON,
and secure frames before transmitting them. According to the information
carried within the ASH, recipient retrieves the right cryptographic key
and correctly un-secure MAC frames.
The error scenario occurs on Linux using IEEE802154_SCF_SECLEVEL_ENC(4)
security level when llsec_do_encrypt_unauth() function builds theses MAC
frames incorrectly. On recipients these MAC frames are discarded,logging
"got invalid frame" messages.
Signed-off-by: Diogenes Pereira <dvnp@cesar.org.br>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
|
|
Use IEEE802154_SCF_SECLEVEL_NONE macro defined at ieee802154.h file.
Signed-off-by: Diogenes Pereira <dvnp@cesar.org.br>
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
|
|
Currently, writing into
net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
Fix handling of these flags by:
- using the maximum of global and per-interface values for the
accept_dad flag. That is, if at least one of the two values is
non-zero, enable DAD on the interface. If at least one value is
set to 2, enable DAD and disable IPv6 operation on the interface if
MAC-based link-local address was found
- using the logical OR of global and per-interface values for the
optimistic_dad flag. If at least one of them is set to one, optimistic
duplicate address detection (RFC 4429) is enabled on the interface
- using the logical OR of global and per-interface values for the
use_optimistic flag. If at least one of them is set to one,
optimistic addresses won't be marked as deprecated during source address
selection on the interface.
While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
drop inline, and let the compiler decide.
Fixes: 7fd2561e4ebd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit f784ad3d79e5 ("ipv6: do not send RTM_DELADDR for tentative
addresses") incorrectly assumes that no RTM_NEWADDR are sent for
addresses in tentative state, as this does happen for the standard
IPv6 use-case of DAD failure, see the call to ipv6_ifa_notify() in
addconf_dad_stop(). So as a result of this change, no RTM_DELADDR is
sent after DAD failure for a link-local when strict DAD (accept_dad=2)
is configured, or on the next admin down in other cases. The absence
of this notification breaks backwards compatibility and causes problems
after DAD failure if this notification was being relied on. The
solution is to allow RTM_DELADDR to still be sent after DAD failure.
Fixes: f784ad3d79e5 ("ipv6: do not send RTM_DELADDR for tentative addresses")
Signed-off-by: Mike Manning <mmanning@brocade.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 109980b894e9 ("bpf: don't select potentially stale
ri->map from buggy xdp progs") passed the pointer to the prog
itself to be loaded into r4 prior on bpf_redirect_map() helper
call, so that we can store the owner into ri->map_owner out of
the helper.
Issue with that is that the actual address of the prog is still
subject to change when subsequent rewrites occur that require
slow path in bpf_prog_realloc() to alloc more memory, e.g. from
patching inlining helper functions or constant blinding. Thus,
we really need to take prog->aux as the address we're holding,
which also works with prog clones as they share the same aux
object.
Instead of then fetching aux->prog during runtime, which could
potentially incur cache misses due to false sharing, we are
going to just use aux for comparison on the map owner. This
will also keep the patchlet of the same size, and later check
in xdp_map_invalid() only accesses read-only aux pointer from
the prog, it's also in the same cacheline already from prior
access when calling bpf_func.
Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement exit_batch() method to dismantle more devices
per round.
(rtnl_lock() ...
unregister_netdevice_many() ...
rtnl_unlock())
Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
(for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo
Before patch :
$ time ./add_del_unshare.sh
net_namespace 126 282 5504 1 2 : tunables 8 4 0 : slabdata 126 282 0
real 1m38.965s
user 0m0.688s
sys 0m37.017s
After patch:
$ time ./add_del_unshare.sh
net_namespace 135 291 5504 1 2 : tunables 8 4 0 : slabdata 135 291 0
real 0m22.117s
user 0m0.728s
sys 0m35.328s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement exit_batch() method to dismantle more devices
per round.
(rtnl_lock() ...
unregister_netdevice_many() ...
rtnl_unlock())
Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
(for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo
Before patch :
$ time ./add_del_unshare.sh
net_namespace 110 267 5504 1 2 : tunables 8 4 0 : slabdata 110 267 0
real 3m25.292s
user 0m0.644s
sys 0m40.153s
After patch:
$ time ./add_del_unshare.sh
net_namespace 126 282 5504 1 2 : tunables 8 4 0 : slabdata 126 282 0
real 1m38.965s
user 0m0.688s
sys 0m37.017s
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When dealing with a list of dismantling netns, we can scan
tcp_metrics once, saving cpu cycles.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Having a global list of labels do not scale to thousands of
netns in the cloud era. This causes quadratic behavior on
netns creation and deletion.
This is time having a per netns list of ~10 labels.
Tested:
$ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]
real 0m20.837s # instead of 0m24.227s
user 0m0.328s
sys 0m20.338s # instead of 0m23.753s
16.17% ip [kernel.kallsyms] [k] netlink_broadcast_filtered
12.30% ip [kernel.kallsyms] [k] netlink_has_listeners
6.76% ip [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.78% ip [kernel.kallsyms] [k] memset_erms
5.77% ip [kernel.kallsyms] [k] kobject_uevent_env
5.18% ip [kernel.kallsyms] [k] refcount_sub_and_test
4.96% ip [kernel.kallsyms] [k] _raw_read_lock
3.82% ip [kernel.kallsyms] [k] refcount_inc_not_zero
3.33% ip [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
2.11% ip [kernel.kallsyms] [k] unmap_page_range
1.77% ip [kernel.kallsyms] [k] __wake_up
1.69% ip [kernel.kallsyms] [k] strlen
1.17% ip [kernel.kallsyms] [k] __wake_up_common
1.09% ip [kernel.kallsyms] [k] insert_header
1.04% ip [kernel.kallsyms] [k] page_remove_rmap
1.01% ip [kernel.kallsyms] [k] consume_skb
0.98% ip [kernel.kallsyms] [k] netlink_trim
0.51% ip [kernel.kallsyms] [k] kernfs_link_sibling
0.51% ip [kernel.kallsyms] [k] filemap_map_pages
0.46% ip [kernel.kallsyms] [k] memcpy_erms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
gen estimator has been rewritten in commit 1c0d32fde5bd
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller no longer needs to wait for a grace period. So this
patch gets rid of it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Our recent change exposed a bug in TCP Fastopen Client that syzkaller
found right away [1]
When we prepare skb with SYN+DATA, we attempt to transmit it,
and we update socket state as if the transmit was a success.
In socket RTX queue we have two skbs, one with the SYN alone,
and a second one containing the DATA.
When (malicious) ACK comes in, we now complain that second one had no
skb_mstamp.
The proper fix is to make sure that if the transmit failed, we do not
pretend we sent the DATA skb, and make it our send_head.
When 3WHS completes, we can now send the DATA right away, without having
to wait for a timeout.
[1]
WARNING: CPU: 0 PID: 100189 at net/ipv4/tcp_input.c:3117 tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117()
WARN_ON_ONCE(last_ackt == 0);
Modules linked in:
CPU: 0 PID: 100189 Comm: syz-executor1 Not tainted
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
0000000000000000 ffff8800b35cb1d8 ffffffff81cad00d 0000000000000000
ffffffff828a4347 ffff88009f86c080 ffffffff8316eb20 0000000000000d7f
ffff8800b35cb220 ffffffff812c33c2 ffff8800baad2440 00000009d46575c0
Call Trace:
[<ffffffff81cad00d>] __dump_stack
[<ffffffff81cad00d>] dump_stack+0xc1/0x124
[<ffffffff812c33c2>] warn_slowpath_common+0xe2/0x150
[<ffffffff812c361e>] warn_slowpath_null+0x2e/0x40
[<ffffffff828a4347>] tcp_clean_rtx_queue+0x2057/0x2ab0 n
[<ffffffff828ae6fd>] tcp_ack+0x151d/0x3930
[<ffffffff828baa09>] tcp_rcv_state_process+0x1c69/0x4fd0
[<ffffffff828efb7f>] tcp_v4_do_rcv+0x54f/0x7c0
[<ffffffff8258aacb>] sk_backlog_rcv
[<ffffffff8258aacb>] __release_sock+0x12b/0x3a0
[<ffffffff8258ad9e>] release_sock+0x5e/0x1c0
[<ffffffff8294a785>] inet_wait_for_connect
[<ffffffff8294a785>] __inet_stream_connect+0x545/0xc50
[<ffffffff82886f08>] tcp_sendmsg_fastopen
[<ffffffff82886f08>] tcp_sendmsg+0x2298/0x35a0
[<ffffffff82952515>] inet_sendmsg+0xe5/0x520
[<ffffffff8257152f>] sock_sendmsg_nosec
[<ffffffff8257152f>] sock_sendmsg+0xcf/0x110
Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
DSA overrides the master device ethtool ops, so that it can inject stats
from its dedicated switch CPU port as well.
The related code is currently split in dsa.c and slave.c, but it only
scopes the master net device. Move it to a new master.c DSA core file.
This file will be later extented with master net device specific code.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
DSA overrides the master's ethtool ops so that we can inject its CPU
port's statistics. Because of that, we need to setup the ethtool ops
after the master's dsa_ptr pointer has been assigned, not before.
This patch setups the ethtool ops after dsa_ptr is assigned, and
restores them before it gets cleared.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a DSA switch tree is meant to be applied, it already has a CPU
port. Thus remove the condition of dst->cpu_dp.
Moreover, the next lines access dst->cpu_dp unconditionally.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no need to store a copy of the master ethtool ops, storing the
original pointer in DSA and the new one in the master netdev itself is
enough.
In the meantime, set orig_ethtool_ops to NULL when restoring the master
ethtool ops and check the presence of the master original ethtool ops as
well as its needed functions before calling them.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).
Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.
This saves some overhead in both TCP and netem.
v2: removes the swtstamp field from struct tcp_skb_cb
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net/vmw_vsock/vmci_transport.c does not use any miscdevice so this patch
remove this unnecessary inclusion.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts most of commit f53b7665c8ce ("libceph: upmap semantic
changes").
We need to prevent duplicates in the final result. For example, we
can currently take
[1,2,3] and apply [(1,2)] and get [2,2,3]
or
[1,2,3] and apply [(3,2)] and get [1,2,2]
The rest of the system is not prepared to handle duplicates in the
result set like this.
The reverted piece was intended to allow
[1,2,3] and [(1,2),(2,1)] to get [2,1,3]
to reorder primaries. First, this bidirectional swap is hard to
implement in a way that also prevents dups. For example, [1,2,3] and
[(1,4),(2,3),(3,4)] would give [4,3,4] but would we just drop the last
step we'd have [4,3,3] which is also invalid, etc. Simpler to just not
handle bidirectional swaps. In practice, they are not needed: if you
just want to choose a different primary then use primary_affinity, or
pg_upmap (not pg_upmap_items).
Cc: stable@vger.kernel.org # 4.13
Link: http://tracker.ceph.com/issues/21410
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
|