path: root/net
Age | Commit message | Author
2018-02-14 | tipc: Refactor __tipc_nl_compat_doit | Ying Xue
In preparation for adding RTNL so that (*cmd->transcode)() and (*cmd->doit)() are always protected by the RTNL lock, move as many as possible of the memory allocations that sit between them out of the way, so that the time RTNL is held in __tipc_nl_compat_doit() is minimized. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | netfilter: drop outermost socket lock in getsockopt() | Paolo Abeni
Syzbot reported a possible deadlock in the netfilter area, caused by the rtnl lock, xt lock and socket lock being acquired in a different order on different code paths, leading to the following backtrace:
======================================================
WARNING: possible circular locking dependency detected
4.15.0+ #301 Not tainted
------------------------------------------------------
syzkaller233489/4179 is trying to acquire lock:
 (rtnl_mutex){+.+.}, at: [<0000000048e996fd>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74

but task is already holding lock:
 (&xt[i].mutex){+.+.}, at: [<00000000328553a2>] xt_find_table_lock+0x3e/0x3e0 net/netfilter/x_tables.c:1041

which lock already depends on the new lock.
===
Since commit 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock only in the required scope"), we already acquire the socket lock in the innermost scope, where needed. In that commit I forgot to remove the outermost socket lock from the getsockopt() path; this commit addresses the issue by dropping it now. v1 -> v2: fix bad subj, added relevant 'fixes' tag Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers") Fixes: 202f59afd441 ("netfilter: ipt_CLUSTERIP: do not hold dev") Fixes: 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock only in the required scope") Reported-by: syzbot+ddde1c7b7ff7442d7f2d@syzkaller.appspotmail.com Suggested-by: Florian Westphal <fw@strlen.de> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-14 | net: Move ipv4 set_lwt_redirect helper to lwtunnel | David Ahern
IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change IPv6 to also use it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | net: dsa: forward timestamping callbacks to switch drivers | Brandon Streiff
Forward the rx/tx timestamp machinery from the dsa infrastructure to the switch driver. On the rx side, defer delivery of skbs until we have an rx timestamp. This mimics the behavior of skb_defer_rx_timestamp. On the tx side, identify PTP packets, clone them, and pass them to the underlying switch driver before we transmit. This mimics the behavior of skb_tx_timestamp. Adjusted txstamp API to keep the allocation and freeing of the clone in the same central function by Richard Cochran. Signed-off-by: Brandon Streiff <brandon.streiff@ni.com> Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | net: dsa: forward hardware timestamping ioctls to switch driver | Brandon Streiff
This patch adds support to the dsa slave network device so that switch drivers can implement the SIOC[GS]HWTSTAMP ioctls and the ethtool timestamp-info interface. Signed-off-by: Brandon Streiff <brandon.streiff@ni.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | tcp: try to keep packet if SYN_RCV race is lost | Eric Dumazet
배석진 reported that in some situations, packets for a given 5-tuple end up being processed by different CPUs. This involves RPS and fragmentation. 배석진 is seeing packet drops when a SYN_RECV request socket is moved into the ESTABLISHED state. Other states are protected by the socket lock. This is caused by a CPU losing the race, and simply not caring enough. Since this seems to occur frequently, we can do better and perform a second lookup. Note that all needed memory barriers are already in the existing code, thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert() and reqsk_put(). The second lookup must find the new socket, unless it has already been accepted and closed by another CPU. Note that the fragmentation could be avoided in the first place by use of a correct TCP MSS option in the SYN{ACK} packet, but this does not mean we cannot be more robust. Many thanks to 배석진 for a very detailed analysis. Reported-by: 배석진 <soukjin.bae@samsung.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | sctp: fix some copy-paste errors for file comments | Xin Long
This patch fixes the file comments in stream.c and stream_interleave.c. v1->v2: rephrase the comment for stream.c according to Neil's suggestion. Fixes: a83863174a61 ("sctp: prepare asoc stream for stream reconf") Fixes: 0c3f6f655487 ("sctp: implement make_datafrag for sctp_stream_interleave") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | net: fix race on decreasing number of TX queues | Jakub Kicinski
netif_set_real_num_tx_queues() can be called when netdev is up. That usually happens when user requests change of number of channels/rings with ethtool -L. The procedure for changing the number of queues involves resetting the qdiscs and setting dev->num_tx_queues to the new value. When the new value is lower than the old one, extra care has to be taken to ensure ordering of accesses to the number of queues vs qdisc reset. Currently the queues are reset before new dev->num_tx_queues is assigned, leaving a window of time where packets can be enqueued onto the queues going down, leading to a likely crash in the drivers, since most drivers don't check if TX skbs are assigned to an active queue. Fixes: e6484930d7c7 ("net: allocate tx queues in register_netdevice") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
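The ordering problem described above can be sketched with a small single-threaded user-space model. Everything here is hypothetical (struct toy_dev, the toy_* helpers, the fixed-size backlog array); none of it is kernel code. The "concurrent" sender is simulated by calling it in the window between the two steps, which is enough to show why the queue count should be published before the queues going away are reset.

```c
#include <assert.h>

#define TOY_MAX_TXQ 8

/* Hypothetical stand-in for a net device; not the kernel's struct net_device. */
struct toy_dev {
	unsigned int real_num_tx_queues;     /* queues a sender may pick from */
	unsigned int backlog[TOY_MAX_TXQ];   /* packets sitting on each queue */
};

/* Models a sender that picks the highest queue it believes is active. */
static void toy_enqueue_on_last_queue(struct toy_dev *dev)
{
	assert(dev->real_num_tx_queues > 0 &&
	       dev->real_num_tx_queues <= TOY_MAX_TXQ);
	dev->backlog[dev->real_num_tx_queues - 1]++;
}

/* Drops everything queued on queues [first, TOY_MAX_TXQ). */
static void toy_reset_queues_from(struct toy_dev *dev, unsigned int first)
{
	for (unsigned int q = first; q < TOY_MAX_TXQ; q++)
		dev->backlog[q] = 0;
}

/* Problematic order: reset first, publish the new count afterwards. */
static void toy_shrink_reset_first(struct toy_dev *dev, unsigned int new_txq)
{
	toy_reset_queues_from(dev, new_txq);
	toy_enqueue_on_last_queue(dev);      /* sender still sees the old count */
	dev->real_num_tx_queues = new_txq;
}

/* Safer order: publish the new count, then reset the queues going down. */
static void toy_shrink_count_first(struct toy_dev *dev, unsigned int new_txq)
{
	dev->real_num_tx_queues = new_txq;
	toy_enqueue_on_last_queue(dev);      /* sender lands on an active queue */
	toy_reset_queues_from(dev, new_txq);
}

/* Counts packets left behind on queues that are no longer active. */
static unsigned int toy_stale_packets(const struct toy_dev *dev)
{
	unsigned int stale = 0;

	for (unsigned int q = dev->real_num_tx_queues; q < TOY_MAX_TXQ; q++)
		stale += dev->backlog[q];
	return stale;
}
```

Shrinking a 4-queue toy_dev to 2 queues with toy_shrink_reset_first() leaves one packet on an inactive queue, while toy_shrink_count_first() leaves none. A real implementation also has to make sure in-flight senders have observed the new count before resetting; this model does not attempt to show that part.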
2018-02-14 | netlabel: If PF_INET6, check sk_buff ip header version | Richard Haines
When resolving a fallback label, check the sk_buff version as it is possible (e.g. SCTP) to have family = PF_INET6 while receiving ip_hdr(skb)->version = 4. Signed-off-by: Richard Haines <richard_c_haines@btinternet.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
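A minimal user-space sketch of the check described above, assuming only that the IP version sits in the upper four bits of the first header byte (true for both IPv4 and IPv6). The function name toy_effective_family() is hypothetical and this is not the netlabel code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>   /* PF_INET, PF_INET6 */

/*
 * Pick the family to use for a fallback lookup. A PF_INET6 socket may
 * still receive an IPv4 packet (the commit message cites SCTP), so the
 * header version wins over the socket family in that one case.
 */
static int toy_effective_family(int sk_family, const uint8_t *ip_hdr)
{
	unsigned int version = ip_hdr[0] >> 4;   /* upper nibble of byte 0 */

	if (sk_family == PF_INET6 && version == 4)
		return PF_INET;
	return sk_family;
}
```

For a header starting with 0x45 (IPv4, 20-byte header) on a PF_INET6 socket this returns PF_INET; for a header starting with 0x60 it stays PF_INET6.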
2018-02-14 | net: Remove atalk header from socket.c | David Ahern
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | net: Make dn_ptr depend on CONFIG_DECNET | David Ahern
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 | ieee802154: 6lowpan: set IFF_NO_QUEUE | Alexander Aring
This patch sets IFF_NO_QUEUE for IEEE 802.15.4 6lowpan interfaces. Commit 24dcbf662205 ("6lowpan: Don't set IFF_NO_QUEUE") removed it, for "reasons", from the Bluetooth 6lowpan subsystem. In IEEE 802.15.4 the lower interface deals with one qdisc for the real hardware; 6LoWPAN does the protocol adaptation only, with no second queuing on top. Signed-off-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
2018-02-13 | net/ipv4: Unexport fib_multipath_hash and fib_select_path | David Ahern
Do not export fib_multipath_hash or fib_select_path; both are only used by core ipv4 code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net/ipv4: Simplify fib_select_path | David Ahern
If flow oif is set and it is not an l3mdev, then fib_select_path can jump to the source address check. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | sctp: add file comments in diag.c | Xin Long
This patch adds the missing file comments for the sctp diag file. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | sctp: rename sctp_diag.c as diag.c | Xin Long
Remove the 'sctp_' prefix from the diag file, to stay consistent with the other files' names. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | rds: do not call ->conn_alloc with GFP_KERNEL | Sowmini Varadhan
Commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") adds an RCU read-side critical section to __rds_conn_create. The memory allocations in that critical section need to use GFP_ATOMIC to avoid sleeping. This patch was verified with the syzkaller reproducer. Reported-by: syzbot+a0564419941aaae3fe3c@syzkaller.appspotmail.com Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: sched: fix tc_u_common lookup | Jiri Pirko
The offending commit wrongly assumes 1:1 mapping between block and q. However, there are multiple blocks for a single q for classful qdiscs. Since the obscure tc_u_common sharing mechanism expects it to be shared among a qdisc, fix it by storing q pointer in case the block is not shared. Reported-by: Paweł Staszewski <pstaszewski@itcare.pl> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Fixes: 7fa9d974f3c2 ("net: sched: cls_u32: use block instead of q in tc_u_common") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: sched: don't set q pointer for shared blocks | Jiri Pirko
It is pointless to set block->q for blocks that are shared among multiple qdiscs. So remove the assignment in that case. Reshuffle the code a bit so that block->index is initialized at that point and we can use the tcf_block_shared() helper. Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks infrastructure") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: af_unix: fix typo in UNIX_SKB_FRAGS_SZ comment | Tobias Klauser
Change "minimun" to "minimum". Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert netlink_tap_net_ops | Kirill Tkhai
These pernet_operations only initialize just-allocated net memory, so they can obviously be executed in parallel with any others. v3: New Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert diag_net_ops | Kirill Tkhai
These pernet operations just create and destroy a netlink socket. The socket is pernet, and other operations don't touch it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert default_device_ops | Kirill Tkhai
These pernet operations consist of exit() and exit_batch() methods. default_device_exit() moves not-local and virtual devices to init_net. There is nothing exciting here, because this may happen at any time on a working system, and rtnl_lock() and synchronize_net() protect us from all cases of external dereference. The same goes for default_device_exit_batch(): a similar unregistration may happen at any time on a system. There are several lists here (like todo_list), which are accessed under rtnl_lock(). After rtnl_unlock() and netdev_run_todo(), all the devices are flushed. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert addrconf_ops | Kirill Tkhai
These pernet_operations (un)register sysctls, which are not touched by anybody else. So, it's safe to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert ipv4_sysctl_ops | Kirill Tkhai
These pernet_operations create and destroy sysctls, which are not touched by anybody else. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert packet_net_ops | Kirill Tkhai
These pernet_operations just create and destroy a /proc entry, and other operations do not touch it. Also, nobody else is interested in a foreign net::packet::sklist. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert unix_net_ops | Kirill Tkhai
These pernet_operations just create and destroy /proc and sysctl entries, which are not touched by foreign pernet_operations. So, we are able to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert pernet_subsys, registered from inet_init() | Kirill Tkhai
arp_net_ops just adds/removes a /proc entry.
devinet_ops allocates and frees a duplicate of the init_net tables and (un)registers sysctl entries.
fib_net_ops allocates and frees pernet tables, creates/destroys a netlink socket and (un)initializes /proc entries. Foreign pernet_operations do not touch them.
ip_rt_proc_ops only modifies pernet /proc entries.
xfrm_net_ops creates/destroys /proc entries, allocates/frees pernet statistics, hashes and tables, and (un)initializes sysctl files. These are not touched by foreign pernet_operations.
xfrm4_net_ops allocates/frees private pernet memory, and configures sysctls.
sysctl_route_ops creates/destroys sysctls.
rt_genid_ops only initializes fields of the just-allocated net.
ipv4_inetpeer_ops allocates/frees net private memory.
igmp_net_ops just creates/destroys /proc files and a socket that no one else is interested in.
tcp_sk_ops seems to be safe, because tcp_sk_init() does not depend on any other pernet_operations modifications. Iteration over the hash table in inet_twsk_purge() is done under the RCU lock, and it's safe to iterate the table this way. Removal from the table happens in inet_twsk_deschedule_put(), but this function is safe without any external locks, as it synchronizes internally. There are many examples of it being used in different contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.
tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
udplite4_net_ops only creates/destroys a pernet /proc file.
icmp_sk_ops creates percpu sockets, not touched by foreign pernet_operations.
ipmr_net_ops creates/destroys pernet fib tables, (un)registers fib rules and /proc files. This seems to be safe to execute in parallel with foreign pernet_operations.
af_inet_ops just sets up default parameters of the newly created net.
ipv4_mib_ops creates and destroys pernet percpu statistics.
raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops and ip_proc_ops only create/destroy pernet /proc files.
ip4_frags_ops creates and destroys a sysctl file.
So, it's safe to make the pernet_operations async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert sysctl_core_ops | Kirill Tkhai
These pernet_operations register and destroy a sysctl directory, which is of no interest to foreign pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert wext_pernet_ops | Kirill Tkhai
These pernet_operations initialize and purge net::wext_nlevents queue, and are not touched by foreign pernet_operations. Mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert genl_pernet_ops | Kirill Tkhai
These pernet_operations create and destroy net::genl_sock. Foreign pernet_operations don't touch it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert subsys_initcall() registered pernet_operations from net/sched | Kirill Tkhai
psched_net_ops only creates and destroys a /proc entry, and is safe to execute in parallel with any foreign pernet_operations. tcf_action_net_ops initializes and destroys tcf_action_net::egdev_ht, which is not touched by foreign pernet_operations. So, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert fib_* pernet_operations, registered via subsys_initcall | Kirill Tkhai
Both of them create and initialize lists, which are not touched by other, foreign pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert pernet_subsys ops, registered via net_dev_init() | Kirill Tkhai
There are: 1) dev_proc_ops and dev_mc_net_ops, which create and destroy a pernet proc file that is of no interest to other net namespaces; 2) netdev_net_ops, which creates pernet hashes that are not touched by other pernet_operations. So, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert proto_net_ops | Kirill Tkhai
This patch starts to convert pernet_subsys, registered from subsys initcalls. It seems safe to execute in parallel with others, as it only creates/destroys a proc entry that nobody else is interested in. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert rtnetlink_net_ops | Kirill Tkhai
rtnetlink_net_init() and rtnetlink_net_exit() create and destroy the netlink socket net::rtnl. The socket is used to send rtnl notifications via rtnl_net_notifyid(). There is no problem creating and destroying it in parallel with other pernet operations, as we link the net in setup_net() after the socket is created, and destroy it in cleanup_net() after the net is unhashed from all the lists and there are no RCU references to it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert netlink_net_ops | Kirill Tkhai
The methods of netlink_net_ops create and destroy the "netlink" file, which is of no interest to foreign pernet_operations. So, netlink_net_ops may safely be made async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert net_defaults_ops | Kirill Tkhai
net_defaults_ops introduces only the net_defaults_init_net method, which acts on net::core::sysctl_somaxconn; that field is of no interest to the rest of the pernet_subsys and pernet_device lists. So, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert net_inuse_ops | Kirill Tkhai
net_inuse_ops methods expose statistics in /proc. Nothing else in the pernet_subsys or pernet_device lists touches net::core::inuse. So, it's safe to make net_inuse_ops async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert nf_log_net_ops | Kirill Tkhai
The pernet_operations would have had a problem with parallel execution alongside others if init_net could be released. But it can't, and the rest is safe. There is a memory allocation, which nobody else is interested in, and a sysctl registration. So, we make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert netfilter_net_ops | Kirill Tkhai
The methods netfilter_net_init() and netfilter_net_exit() initialize net::nf::hooks and change the net-related proc directory of a net. Other pernet_operations are not interested in a foreign net::nf::hooks or proc entries, so it's safe to execute them in parallel with the methods of other pernet operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert sysctl_pernet_ops | Kirill Tkhai
This patch starts to convert pernet_subsys, registered from core initcalls. Methods sysctl_net_init() and sysctl_net_exit() initialize net::sysctls table of a namespace. pernet_operations::init()/exit() methods from the rest of the list do not touch net::sysctls of strangers, so it's safe to execute sysctl_pernet_ops's methods in parallel with any other pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Convert net_ns_ops methods | Kirill Tkhai
This patch starts to convert pernet_subsys, registered from pure initcalls. The net_ns_ops::net_ns_net_init/net_ns_net_exit methods use only ida_simple_* functions, which do not need external synchronization: they are synchronized by the idr subsystem. So, net_ns_ops methods can be executed in parallel with the methods of other pernet operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Allow pernet_operations to be executed in parallel | Kirill Tkhai
This adds a new pernet_operations::async flag to mark operations whose ->init(), ->exit() and ->exit_batch() methods are allowed to be executed in parallel with the methods of any other pernet_operations. When there are only asynchronous pernet_operations in the system, net_mutex won't be taken for net construction and destruction. Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic() without replacing it with the equivalent net_sem check, as there is one more lockdep assert below. v3: Add comment near net_mutex. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
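The rule "take the global mutex only while at least one non-async pernet_operations is registered" can be sketched in user space as a simple counter. The names below (struct toy_pernet_ops, struct toy_registry, nr_sync_ops, the toy_* functions) are hypothetical and chosen for illustration; they are an assumption about one way to track this, not a copy of the kernel's bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_OPS 16

/* Hypothetical, cut-down stand-in for struct pernet_operations. */
struct toy_pernet_ops {
	const char *name;
	bool async;   /* methods may run in parallel with any other ops */
};

struct toy_registry {
	const struct toy_pernet_ops *ops[TOY_MAX_OPS];
	size_t nr_ops;
	size_t nr_sync_ops;   /* how many registered ops are NOT async */
};

/* Registers ops and keeps the count of synchronous ones up to date. */
static int toy_register(struct toy_registry *r,
			const struct toy_pernet_ops *ops)
{
	if (r->nr_ops == TOY_MAX_OPS)
		return -1;
	r->ops[r->nr_ops++] = ops;
	if (!ops->async)
		r->nr_sync_ops++;
	return 0;
}

/*
 * Namespace setup/teardown only needs the global mutex while at least
 * one registered ops has not been marked async.
 */
static bool toy_setup_needs_global_mutex(const struct toy_registry *r)
{
	return r->nr_sync_ops != 0;
}
```

With only async ops registered, toy_setup_needs_global_mutex() returns false; registering a single non-async ops flips it to true, which is why the series converts the ops one by one.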
2018-02-13 | net: Move mutex_unlock() in cleanup_net() up | Kirill Tkhai
net_sem protects pernet_list from changing, while ops_free_list() does a simple kfree() and can't race with other pernet_operations callbacks. So we may release net_mutex earlier than before. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Introduce net_sem for protection of pernet_list | Kirill Tkhai
Currently, the mutex is mostly used to protect the pernet operations list. It orders setup_net() and cleanup_net() with parallel {un,}register_pernet_operations() calls, so that the ->exit{,_batch} methods executed for a dying net belong to the same pernet operations whose ->init methods were called for it, even after the net namespace is unlinked from net_namespace_list in cleanup_net().
But there are several scalability problems. The first is that more than one net can't be created or destroyed at the same moment on the node. For big machines with many CPUs running many containers, this matters a lot. The second is that synchronize_rcu() is needed after the net is removed from net_namespace_list:

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()                                  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors and/or when preemptible RCU is enabled in the config. So, for the whole time cleanup_net() is waiting for the RCU grace period, no new net namespaces can be created; the tasks trying to do so sleep on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-CPU laptop with preemptible RCU enabled after a round of CRIU tests finished.
The solution is to convert net_mutex to an rw_semaphore and add fine-grained locks to the really small number of pernet_operations that actually need them. Then pernet_operations::init/::exit methods that modify net-related data will require only down_read() locking, while down_write() will be used for changing pernet_list (i.e., when modules are being loaded and unloaded).
This gives a significant performance increase once the whole patch set is applied, as you can see here:

%for i in {1..10000}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779s

(5.8 times faster)

This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore, describes the variables it protects, and uses it where appropriate. net_mutex is still present, and the next patches will kick it out step by step. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
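The admission rules that make the reader/writer split attractive can be sketched with a tiny single-threaded bookkeeping model. This is not a real lock and not the kernel's rw_semaphore; struct toy_rwsem and the toy_* helpers are hypothetical names that only track who would be let in. Namespace setup and cleanup play the reader role, while changes to the ops list play the writer role.

```c
#include <assert.h>
#include <stdbool.h>

/* Bookkeeping model of a reader/writer semaphore (no blocking, no atomics). */
struct toy_rwsem {
	unsigned int readers;   /* namespaces being set up or torn down */
	bool writer;            /* someone is changing the ops list */
};

/* Readers are admitted as long as no writer holds the semaphore. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer is admitted only when nobody else holds the semaphore. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers != 0)
		return false;
	sem->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}
```

Two "namespace setups" can hold the semaphore for reading at once, whereas a plain mutex would serialize them; only list changes (the rare module load/unload case in the commit message) need exclusive access.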
2018-02-13 | net: Cleanup in copy_net_ns() | Kirill Tkhai
Line up the destructor actions in the reverse order of the constructors. Next patches will add more actions, and having this order in place will make that more convenient. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | net: Assign net to net_namespace_list in setup_net() | Kirill Tkhai
This patch merges two repeated pieces of code into one, which now lives in setup_net(). The only change is that the assignment init_net_initialized = true; is reordered with list_add_tail_rcu(&net->list, &net_namespace_list);. The order has no visible effect, so this is a simple cleanup: init_net_initialized is used in the !CONFIG_NET_NS case to order proc_net_ns_ops registration occurring at boot time, start_kernel()->proc_root_init()->proc_net_init(), with net_ns_init()->setup_net(&init_net, &init_user_ns), also occurring at boot time from the same init_task. When there are no other tasks to race with them, it does not matter for the single task in which order the two sequential, independent loads are made. So we reorder them. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 | esp4: remove redundant initialization of pointer esph | Colin Ian King
Pointer esph is being assigned a value that is never read, esph is re-assigned and only read inside an if statement, hence the initialization is redundant and can be removed. Cleans up clang warning: net/ipv4/esp4.c:657:21: warning: Value stored to 'esph' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-02-13 | xfrm_user: uncoditionally validate esn replay attribute struct | Florian Westphal
The sanity test added in ecd7918745234 can be bypassed: validation only occurs if the XFRM_STATE_ESN flag is set, but the rest of the code doesn't care and just checks whether the attribute itself is present. So always validate. The alternative is to reject if we have the attribute without the flag, but that would change the ABI. Reported-by: syzbot+0ab777c27d2bb7588f73@syzkaller.appspotmail.com Cc: Mathias Krause <minipli@googlemail.com> Fixes: ecd7918745234 ("xfrm_user: ensure user supplied esn replay window is valid") Fixes: d8647b79c3b7e ("xfrm: Add user interface for esn and big anti-replay windows") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
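The bypass described above can be illustrated with a toy validator. The types and names here (struct toy_esn_attr, claimed_len, actual_len, the toy_verify_* functions, the -1 error value) are hypothetical placeholders rather than the xfrm_user structures; the sketch only contrasts "validate when the flag is set" with "validate whenever the attribute is present".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-supplied attribute: a length it claims vs. what was sent. */
struct toy_esn_attr {
	size_t claimed_len;   /* size the attribute says it needs */
	size_t actual_len;    /* bytes actually present in the message */
};

/* The attribute is sane only if its claim fits in the supplied payload. */
static bool toy_esn_attr_ok(const struct toy_esn_attr *attr)
{
	return attr->claimed_len <= attr->actual_len;
}

/* Flag-gated check: a bad attribute slips through when the flag is clear. */
static int toy_verify_flag_gated(bool esn_flag, const struct toy_esn_attr *attr)
{
	if (!esn_flag || !attr)
		return 0;
	return toy_esn_attr_ok(attr) ? 0 : -1;
}

/* Unconditional check: validate whenever the attribute is present. */
static int toy_verify_always(const struct toy_esn_attr *attr)
{
	if (!attr)
		return 0;
	return toy_esn_attr_ok(attr) ? 0 : -1;
}
```

An attribute claiming 64 bytes with only 8 supplied passes toy_verify_flag_gated() when the flag is clear but is rejected by toy_verify_always(), which mirrors why later code that only tests for the attribute's presence needs the unconditional form.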