summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-11-08batman-adv: fix rare race conditions on interface removalLinus Lüssing
In rare cases during shutdown the following general protection fault can happen: general protection fault: 0000 [#1] SMP Modules linked in: batman_adv(O-) [...] CPU: 3 PID: 1714 Comm: rmmod Tainted: G O 4.6.0-rc6+ #1 [...] Call Trace: [<ffffffffa0363294>] batadv_hardif_disable_interface+0x29a/0x3a6 [batman_adv] [<ffffffffa0373db4>] batadv_softif_destroy_netlink+0x4b/0xa4 [batman_adv] [<ffffffff813b52f3>] __rtnl_link_unregister+0x48/0x92 [<ffffffff813b9240>] rtnl_link_unregister+0xc1/0xdb [<ffffffff8108547c>] ? bit_waitqueue+0x87/0x87 [<ffffffffa03850d2>] batadv_exit+0x1a/0xf48 [batman_adv] [<ffffffff810c26f9>] SyS_delete_module+0x136/0x1b0 [<ffffffff8144dc65>] entry_SYSCALL_64_fastpath+0x18/0xa8 [<ffffffff8108aaca>] ? trace_hardirqs_off_caller+0x37/0xa6 Code: 89 f7 e8 21 bd 0d e1 4d 85 e4 75 0e 31 f6 48 c7 c7 50 d7 3b a0 e8 50 16 f2 e0 49 8b 9c 24 28 01 00 00 48 85 db 0f 84 b2 00 00 00 <48> 8b 03 4d 85 ed 48 89 45 c8 74 09 4c 39 ab f8 00 00 00 75 1c RIP [<ffffffffa0371852>] batadv_purge_outstanding_packets+0x1c8/0x291 [batman_adv] RSP <ffff88001da5fd78> ---[ end trace 803b9bdc6a4a952b ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt It does not happen often, but may potentially happen when frequently shutting down and reinitializing an interface. With some carefully placed msleep()s/mdelay()s it can be reproduced easily. The issue is, that on interface removal, any still running worker thread of a forwarding packet will race with the interface purging routine to free a forwarding packet. Temporarily giving up a spin-lock to be able to sleep in the purging routine is not safe. Furthermore, there is a potential general protection fault not just for the purging side shown above, but also on the worker side: Temporarily removing a forw_packet from the according forw_{bcast,bat}_list will make it impossible for the purging routine to catch and cancel it. # How this patch tries to fix it: With this patch we split the queue purging into three steps: Step 1), removing forward packets from the queue of an interface and by that claim it as our responsibility to free. Step 2), we are either lucky to cancel a pending worker before it starts to run. Or if it is already running, we wait and let it do its thing, except two things: Through the claiming in step 1) we prevent workers from a) re-arming themselves. And b) prevent workers from freeing packets which we still hold in the interface purging routine. Finally, step 3, we are sure that no forwarding packets are pending or even running anymore on the interface to remove. We can then safely free the claimed forwarding packets. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Add module alias for batadv netlink familySven Eckelmann
The batman-adv module has to be loaded to fulfill genl request by the userspace. When it is not loaded then requests will fail. It is therefore useful to get the module automatically loaded when such a request is made. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Update wifi flags on upper link changeSven Eckelmann
Things like VLANs don't have their link set when they are created. Thus the wifi flags have to be evaluated later to fix their contents for the link interface. Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: retrieve B.A.T.M.A.N. V WiFi neighbor stats from real interfaceMarek Lindner
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> [sven.eckelmann@open-mesh.com: re-add batadv_get_real_netdev to take rtnl semaphore for batadv_get_real_netdevice] Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: additional checks for virtual interfaces on top of WiFiMarek Lindner
In a few situations batman-adv tries to determine whether a given interface is a WiFi interface to enable specific WiFi optimizations. If the interface batman-adv has been configured with is a virtual interface (e.g. VLAN) it would not be properly detected as WiFi interface and thus not benefit from the special WiFi treatment. This patch changes that by peeking under the hood whenever a virtual interface is in play. Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> [sven.eckelmann@open-mesh.com: integrate in wifi_flags caching, retrieve namespace of link interface] Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Cache the type of wifi device for each hardifSven Eckelmann
batman-adv is requiring the type of wifi device in different contexts. Some of them can take the rtnl semaphore and some of them already have the semaphore taken. But even others don't allow that the semaphore will be taken. The data has to be retrieved when the hardif is added to batman-adv because some of the wifi information for an hardif will only be available with rtnl lock. It can then be cached in the batadv_hard_iface and the functions is_wifi_netdev and is_cfg80211_netdev can just compare the correct bits without imposing extra locking requirements. Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: refactor wifi interface detectionMarek Lindner
The ELP protocol requires cfg80211 to auto-detect the WiFi througput to a given neighbor. Use batadv_is_cfg80211_netdev() to determine whether or not an interface is eligible. Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Reject unicast packet with zero/mcast dst addressSven Eckelmann
An unicast batman-adv packet cannot be transmitted to a multicast or zero mac address. So reject incoming packets which still have these classes of addresses as destination mac address in the outer ethernet header. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Return non-const ptr in batadv_getlink_netSven Eckelmann
The returned net_namespace of batadv_getlink_net may be used with functions that potentially modify the struct. Thus it must return the pointer as non-const like rtnl_link_ops::get_link_net does. Signed-off-by: Sven Eckelmann <sven.eckelmann@open-mesh.com> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Disallow zero and mcast src address for mgmt framesSven Eckelmann
The routing check for management frames is validating the source mac address in the outer ethernet header. It rejects every source mac address which is a broadcast address. But it also has to reject the zero-mac address and multicast mac addresses. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Disallow mcast src address for data framesSven Eckelmann
The routing checks are validating the source mac address of the outer ethernet header. They reject every source mac address which is a broadcast address. But they also have to reject any multicast mac addresses. Signed-off-by: Sven Eckelmann <sven@narfation.org> [sw@simonwunderlich.de: fix commit message typo] Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Remove dev_queue_xmit return code exceptionSven Eckelmann
No caller of batadv_send_skb_to_orig is expecting the results to be -1 (-EPERM) anymore when the skbuff was not consumed. They will instead expect that the skbuff is always consumed. Having such return code filter is therefore not needed anymore. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-08batman-adv: Consume skb in receive handlersSven Eckelmann
Receiving functions in Linux consume the supplied skbuff. Doing the same in the batadv_rx_handler functions makes the behavior more similar to the rest of the Linux network code. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-07fib_trie: Correct /proc/net/route off by one errorAlexander Duyck
The display of /proc/net/route has had a couple issues due to the fact that when I originally rewrote most of fib_trie I made it so that the iterator was tracking the next value to use instead of the current. In addition it had an off by 1 error where I was tracking the first piece of data as position 0, even though in reality that belonged to the SEQ_START_TOKEN. This patch updates the code so the iterator tracks the last reported position and key instead of the next expected position and key. In addition it shifts things so that all of the leaves start at 1 instead of trying to report leaves starting with offset 0 as being valid. With these two issues addressed this should resolve any off by one errors that were present in the display of /proc/net/route. Fixes: 25b97c016b26 ("ipv4: off-by-one in continuation handling in /proc/net/route") Cc: Andy Whitcroft <apw@canonical.com> Reported-by: Jason Baron <jbaron@akamai.com> Tested-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net: icmp6_send should use dst dev to determine L3 domainDavid Ahern
icmp6_send is called in response to some event. The skb may not have the device set (skb->dev is NULL), but it is expected to have a dst set. Update icmp6_send to use the dst on the skb to determine L3 domain. Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07sock: do not set sk_err in sock_dequeue_err_skbSoheil Hassas Yeganeh
Do not set sk_err when dequeuing errors from the error queue. Doing so results in: a) Bugs: By overwriting existing sk_err values, it possibly hides legitimate errors. It is also incorrect when local errors are queued with ip_local_error. That happens in the context of a system call, which already returns the error code. b) Inconsistent behavior: When there are pending errors on the error queue, sk_err is sometimes 0 (e.g., for the first timestamp on the error queue) and sometimes set to an error code (after dequeuing the first timestamp). c) Suboptimality: Setting sk_err to ENOMSG on simple TX timestamps can abort parallel reads and writes. Removing this line doesn't break userspace. This is because userspace code cannot rely on sk_err for detecting whether there is something on the error queue. Except for ICMP messages received for UDP and RAW, sk_err is not set at enqueue time, and as a result sk_err can be 0 while there are plenty of errors on the error queue. For ICMP packets in UDP and RAW, sk_err is set when they are enqueued on the error queue, but that does not result in aborting reads and writes. For such cases, sk_err is only readable via getsockopt(SO_ERROR) which will reset the value of sk_err on its own. More importantly, prior to this patch, recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e., sk_err is set by sock_dequeue_err_skb without atomic ops or locks) which can store 0 in sk_err even when we have ICMP messages pending. Removing this line from sock_dequeue_err_skb eliminates that race. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07qdisc: catch misconfig of attaching qdisc to tx_queue_len zero deviceJesper Dangaard Brouer
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] https://github.com/docker/libcontainer/pull/193 [2] https://github.com/openshift/origin/pull/11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue lenJesper Dangaard Brouer
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a default qdisc attached, given they will be backed by physical device, that already have a qdisc attached for pushback. It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as this can be useful for difference policy reasons (e.g. bandwidth limiting containers). For this to work, the tx_queue_len need to have a sane value, because some qdiscs inherit/copy the tx_queue_len (namely, pfifo, bfifo, gred, htb, plug and sfb). Commit a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") caught situations where some drivers didn't initialize tx_queue_len. The problem with the commit was choosing 1 as the fallback value. A qdisc queue length of 1 causes more harm than good, because it creates hard to debug situations for userspace. It gives userspace a false sense of a working config after attaching a qdisc. As low volume traffic (that doesn't activate the qdisc policy) works, like ping, while traffic that e.g. needs shaping cannot reach the configured policy levels, given the queue length is too small. This patch change the value to DEFAULT_TX_QUEUE_LEN, given other IFF_NO_QUEUE devices (that call ether_setup()) also use this value. Fixes: a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net: make default TX queue length a defined constantJesper Dangaard Brouer
The default TX queue length of Ethernet devices have been a magic constant of 1000, ever since the initial git import. Looking back in historical trees[1][2] the value used to be 100, with the same comment "Ethernet wants good queues". The commit[3] that changed this from 100 to 1000 didn't describe why, but from conversations with Robert Olsson it seems that it was changed when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the link speed increased x10 the queue size were also adjusted. This value later caused much heartache for the bufferbloat community. This patch merely moves the value into a defined constant. [1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/ [2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/ [3] https://git.kernel.org/tglx/history/c/98921832c232 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07SUNRPC: Fix suspicious RCU usageAnna Schumaker
We need to hold the rcu_read_lock() when calling rcu_dereference(), otherwise we can't guarantee that the object being dereferenced still exists. Fixes: 39e5d2df ("SUNRPC search xprt switch for sockaddr") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-07udp: do fwd memory scheduling on dequeuePaolo Abeni
A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net/sock: add an explicit sk argument for ip_cmsg_recv_offset()Paolo Abeni
So that we can use it even after orphaining the skbuff. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07sctp: assign assoc_id earlier in __sctp_connectMarcelo Ricardo Leitner
sctp_wait_for_connect() currently already holds the asoc to keep it alive during the sleep, in case another thread release it. But Andrey Konovalov and Dmitry Vyukov reported an use-after-free in such situation. Problem is that __sctp_connect() doesn't get a ref on the asoc and will do a read on the asoc after calling sctp_wait_for_connect(), but by then another thread may have closed it and the _put on sctp_wait_for_connect will actually release it, causing the use-after-free. Fix is, instead of doing the read after waiting for the connect, do it before so, and avoid this issue as the socket is still locked by then. There should be no issue on returning the asoc id in case of failure as the application shouldn't trust on that number in such situations anyway. This issue doesn't exist in sctp_sendmsg() path. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net: Update raw socket bind to consider l3 domainDavid Ahern
Binding a raw socket to a local address fails if the socket is bound to an L3 domain: $ vrf-test -s -l 10.100.1.2 -R -I red error binding socket: 99: Cannot assign requested address Update raw_bind to look consider if sk_bound_dev_if is bound to an L3 domain and use inet_addr_type_table to lookup the address. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd bugfixes from Bruce Fields: "Fixes for some recent regressions including fallout from the vmalloc'd stack change (after which we can no longer encrypt stuff on the stack)" * tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux: nfsd: Fix general protection fault in release_lock_stateid() svcrdma: backchannel cannot share a page for send and rcv buffers sunrpc: fix some missing rq_rbuffer assignments sunrpc: don't pass on-stack memory to sg_set_buf nfsd: move blocked lock handling under a dedicated spinlock
2016-11-04net: inet: Support UID-based routing in IP protocols.Lorenzo Colitti
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: add UID to flows, rules, and routesLorenzo Colitti
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: Add a UID field to struct sock.Lorenzo Colitti
Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04batman-adv: Detect missing primaryif during tp_send as errorSven Eckelmann
The throughput meter detects different situations as problems for the current test. It stops the test after these and reports it to userspace. This also has to be done when the primary interface disappeared during the test. Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-04batman-adv: Revert "fix splat on disabling an interface"Sven Eckelmann
The commit 9799c50372b2 ("batman-adv: fix splat on disabling an interface") fixed a warning but at the same time broke the rtnl function add_slave for devices which were temporarily removed. batadv_softif_slave_add requires soft_iface of and hard_iface to be NULL before it is allowed to be enslaved. But this resetting of soft_iface to NULL in batadv_hardif_disable_interface was removed with the aforementioned commit. Reported-by: Julian Labus <julian@freifunk-rtk.de> Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2016-11-03genetlink: fix a memory leak on error pathWANG Cong
In __genl_register_family(), when genl_validate_assign_mc_groups() fails, we forget to free the memory we possibly allocate for family->attrbuf. Note, some callers call genl_unregister_family() to clean up on error path, it doesn't work because the family is inserted to the global list in the nearly last step. Cc: Jakub Kicinski <kubakici@wp.pl> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv6: dccp: add missing bind_conflict to dccp_ipv6_mappedEric Dumazet
While fuzzing kernel with syzkaller, Andrey reported a nasty crash in inet6_bind() caused by DCCP lacking a required method. Fixes: ab1e0a13d7029 ("[SOCK] proto: Add hashinfo member to struct proto") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03net/sched: cls_flower: Support matching on SCTP portsSimon Horman
Support matching on SCTP ports in the same way that matching on TCP and UDP ports is already supported. Example usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 ip_proto sctp dst_port 80 \ action drop Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv6: dccp: fix out of bound access in dccp_v6_err()Eric Dumazet
dccp_v6_err() does not use pskb_may_pull() and might access garbage. We only need 4 bytes at the beginning of the DCCP header, like TCP, so the 8 bytes pulled in icmpv6_notify() are more than enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03netlink: netlink_diag_dump() runs without locksEric Dumazet
A recent commit removed locking from netlink_diag_dump() but forgot one error case. ===================================== [ BUG: bad unlock balance detected! ] 4.9.0-rc3+ #336 Not tainted ------------------------------------- syz-executor/4018 is trying to release lock ([ 36.220068] nl_table_lock ) at: [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182 but there are no more locks to release! other info that might help us debug this: 3 locks held by syz-executor/4018: #0: [ 36.220068] ( sock_diag_mutex[ 36.220068] ){+.+.+.} , at: [ 36.220068] [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40 #1: [ 36.220068] ( sock_diag_table_mutex[ 36.220068] ){+.+.+.} , at: [ 36.220068] [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0 #2: [ 36.220068] ( nlk->cb_mutex[ 36.220068] ){+.+.+.} , at: [ 36.220068] [<ffffffff82db6600>] netlink_dump+0x50/0xac0 stack backtrace: CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0 kernel/locking/lockdep.c:3388 [< inline >] __lock_release kernel/locking/lockdep.c:3512 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765 [< inline >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110 Fixes: ad202074320c ("netlink: Use rhashtable walk interface in diag dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03dccp: fix out of bound access in dccp_v4_err()Eric Dumazet
dccp_v4_err() does not use pskb_may_pull() and might access garbage. We only need 4 bytes at the beginning of the DCCP header, like TCP, so the 8 bytes pulled in icmp_socket_deliver() are more than enough. This patch might allow to process more ICMP messages, as some routers are still limiting the size of reflected bytes to 28 (RFC 792), instead of extended lengths (RFC 1812 4.3.2.3) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03dccp: do not send reset to already closed socketsEric Dumazet
Andrey reported following warning while fuzzing with syzkaller WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000 ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a 0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179 [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542 [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585 [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83 [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016 [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415 [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570 [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017 [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208 [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244 [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116 [< inline >] exit_task_work include/linux/task_work.h:21 [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828 [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931 [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307 [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807 [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130 arch/x86/entry/common.c:156 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190 [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0 arch/x86/entry/common.c:259 [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Fix this the same way we did for TCP in commit 565b7b2d2e63 ("tcp: do not send reset to already closed sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03dccp: do not release listeners too soonEric Dumazet
Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03tcp: fix return value for partial writesEric Dumazet
After my commit, tcp_sendmsg() might restart its loop after processing socket backlog. If sk_err is set, we blindly return an error, even though we copied data to user space before. We should instead return number of bytes that could be copied, otherwise user space might resend data and corrupt the stream. This might happen if another thread is using recvmsg(MSG_ERRQUEUE) to process timestamps. Issue was diagnosed by Soheil and Willem, big kudos to them ! Fixes: d41a69f1d390f ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv4: allow local fragmentation in ip_finish_output_gso()Lance Richardson
Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack") Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv6: on reassembly, record frag_max_sizeWillem de Bruijn
IP6CB and IPCB have a frag_max_size field. In IPv6 this field is filled in when packets are reassembled by the connection tracking code. Also fill in when reassembling in the input path, to expose it through cmsg IPV6_RECVFRAGSIZE in all cases. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv6: add IPV6_RECVFRAGSIZE cmsgWillem de Bruijn
When reading a datagram or raw packet that arrived fragmented, expose the maximum fragment size if recorded to allow applications to estimate receive path MTU. At this point, the field is only recorded when ipv6 connection tracking is enabled. A follow-up patch will record this field also in the ipv6 input path. Tested using the test for IP_RECVFRAGSIZE plus ip netns exec to ip addr add dev veth1 fc07::1/64 ip netns exec from ip addr add dev veth0 fc07::2/64 ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 & ip netns exec from nc -q 1 -u fc07::1 6000 < payload Both with and without enabling connection tracking ip6tables -A INPUT -m state --state NEW -p udp -j LOG Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv4: add IP_RECVFRAGSIZE cmsgWillem de Bruijn
The IP stack records the largest fragment of a reassembled packet in IPCB(skb)->frag_max_size. When reading a datagram or raw packet that arrived fragmented, expose the value to allow applications to estimate receive path MTU. Tested: Sent data over a veth pair of which the source has a small mtu. Sent data using netcat, received using a dedicated process. Verified that the cmsg IP_RECVFRAGSIZE is returned only when data arrives fragmented, and in that cases matches the veth mtu. ip link add veth0 type veth peer name veth1 ip netns add from ip netns add to ip link set dev veth1 netns to ip netns exec to ip addr add dev veth1 192.168.10.1/24 ip netns exec to ip link set dev veth1 up ip link set dev veth0 netns from ip netns exec from ip addr add dev veth0 192.168.10.2/24 ip netns exec from ip link set dev veth0 up ip netns exec from ip link set dev veth0 mtu 1300 ip netns exec from ethtool -K veth0 ufo off dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 & ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03tcp: fix potential memory corruptionEric Dumazet
Imagine initial value of max_skb_frags is 17, and last skb in write queue has 15 frags. Then max_skb_frags is lowered to 14 or smaller value. tcp_sendmsg() will then be allowed to add additional page frags and eventually go past MAX_SKB_FRAGS, overflowing struct skb_shared_info. Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03net: ip, raw_diag -- Use jump for exiting from nested loopCyrill Gorcunov
I managed to miss that sk_for_each is called under "for" cycle so need to use goto here to return matching socket. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03net: ip, raw_diag -- Fix socket leaking for destroy requestCyrill Gorcunov
In raw_diag_destroy the helper raw_sock_get returns with sock_hold call, so we have to put it then. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03inet: fix sleeping inside inet_wait_for_connect()WANG Cong
Andrey reported this kernel warning: WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724 __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719 do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210 kernel/sched/wait.c:178 Modules linked in: CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000 ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237 ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51 [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550 [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565 [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719 [< inline >] slab_pre_alloc_hook mm/slab.h:393 [< inline >] slab_alloc_node mm/slub.c:2634 [< inline >] slab_alloc mm/slub.c:2716 [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240 [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113 [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374 [< inline >] dccp_feat_clone_sp_val net/dccp/feat.c:1141 [< inline >] dccp_feat_change_recv net/dccp/feat.c:1141 [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411 [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128 [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644 [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681 [< inline >] sk_backlog_rcv ./include/net/sock.h:872 [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044 [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502 [< inline >] inet_wait_for_connect net/ipv4/af_inet.c:547 [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617 [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656 [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533 [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514 [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:209 Unlike commit 26cabd31259ba43f68026ce3f62b78094124333f ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the sleeping function is called before schedule_timeout(), this is indeed a bug. Fix this by moving the wait logic to the new API, it is similar to commit ff960a731788a7408b6f66ec4fd772ff18833211 ("netdev, sched/wait: Fix sleeping inside wait event"). Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03netfilter: handle NF_REPEAT from nf_conntrack_in()Pablo Neira Ayuso
NF_REPEAT is only needed from nf_conntrack_in() under a very specific case required by the TCP protocol tracker, we can handle this case without returning to the core hook path. Handling of NF_REPEAT from the nf_reinject() is left untouched. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03netfilter: merge nf_iterate() into nf_hook_slow()Pablo Neira Ayuso
nf_iterate() has become rather simple, we can integrate this code into nf_hook_slow() to reduce the amount of LOC in the core path. However, we still need nf_iterate() around for nf_queue packet handling, so move this function there where we only need it. I think it should be possible to refactor nf_queue code to get rid of it definitely, but given this is slow path anyway, let's have a look this later. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03netfilter: remove hook_entries field from nf_hook_statePablo Neira Ayuso
This field is only useful for nf_queue, so store it in the nf_queue_entry structure instead, away from the core path. Pass hook_head to nf_hook_slow(). Since we always have a valid entry on the first iteration in nf_iterate(), we can use 'do { ... } while (entry)' loop instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>