summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter fixes for net: 1) Fix deadlock in nfnetlink due to missing mutex release in error path, from Ziyang Xuan. 2) Clean up pending autoload module list from nf_tables_exit_net() path, from Shigeru Yoshida. 3) Fixes for the netfilter's reverse path selftest, from Phil Sutter. All of these bugs have been around for several releases. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-09driver core: class: make namespace and get_ownership take const *Greg Kroah-Hartman
The callbacks in struct class namespace() and get_ownership() do not modify the struct device passed to them, so mark the pointer as constant and fix up all callbacks in the kernel to have the correct function signature. This helps make it more obvious what calls and callbacks do, and do not, modify structures passed to them. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20221001165426.2690912-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-09Merge tag 'rxrpc-next-20221108' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs rxrpc changes David Howells says: ==================== rxrpc: Increasing SACK size and moving away from softirq, part 1 AF_RXRPC has some issues that need addressing: (1) The SACK table has a maximum capacity of 255, but for modern networks that isn't sufficient. This is hard to increase in the upstream code because of the way the application thread is coupled to the softirq and retransmission side through a ring buffer. Adjustments to the rx protocol allows a capacity of up to 8192, and having a ring sufficiently large to accommodate that would use an excessive amount of memory as this is per-call. (2) Processing ACKs in softirq mode causes the ACKs get conflated, with only the most recent being considered. Whilst this has the upside that the retransmission algorithm only needs to deal with the most recent ACK, it causes DATA transmission for a call to be very bursty because DATA packets cannot be transmitted in softirq mode. Rather transmission must be delegated to either the application thread or a workqueue, so there tend to be sudden bursts of traffic for any particular call due to scheduling delays. (3) All crypto in a single call is done in series; however, each DATA packet is individually encrypted so encryption and decryption of large calls could be parallelised if spare CPU resources are available. This is the first of a number of sets of patches that try and address them. The overall aims of these changes include: (1) To get rid of the TxRx ring and instead pass the packets round in queues (eg. sk_buff_head). On the Tx side, each ACK packet comes with a SACK table that can be parsed as-is, so there's no particular need to maintain our own; we just have to refer to the ACK. On the Rx side, we do need to maintain a SACK table with one bit per entry - but only if packets go missing - and we don't want to have to perform a complex transformation to get the information into an ACK packet. (2) To try and move almost all processing of received packets out of the softirq handler and into a high-priority kernel I/O thread. Only the transferral of packets would be left there. I would still use the encap_rcv hook to receive packets as there's a noticeable performance drop from letting the UDP socket put the packets into its own queue and then getting them out of there. (3) To make the I/O thread also do all the transmission. The app thread would be responsible for packaging the data into packets and then buffering them for the I/O thread to transmit. This would make it easier for the app thread to run ahead of the I/O thread, and would mean the I/O thread is less likely to have to wait around for a new packet to come available for transmission. (4) To logically partition the socket/UAPI/KAPI side of things from the I/O side of things. The local endpoint, connection, peer and call objects would belong to the I/O side. The socket side would not then touch the private internals of calls and suchlike and would not change their states. It would only look at the send queue, receive queue and a way to pass a message to cause an abort. (5) To remove as much locking, synchronisation, barriering and atomic ops as possible from the I/O side. Exclusion would be achieved by limiting modification of state to the I/O thread only. Locks would still need to be used in communication with the UDP socket and the AF_RXRPC socket API. (6) To provide crypto offload kernel threads that, when there's slack in the system, can see packets that need crypting and provide parallelisation in dealing with them. (7) To remove the use of system timers. Since each timer would then send a poke to the I/O thread, which would then deal with it when it had the opportunity, there seems no point in using system timers if, instead, a list of timeouts can be sensibly consulted. An I/O thread only then needs to schedule with a timeout when it is idle. (8) To use zero-copy sendmsg to send packets. This would make use of the I/O thread being the sole transmitter on the socket to manage the dead-reckoning sequencing of the completion notifications. There is a problem with zero-copy, though: the UDP socket doesn't handle running out of option memory very gracefully. With regard to this first patchset, the changes made include: (1) Some fixes, including a fallback for proc_create_net_single_write(), setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc congestion management, which shouldn't be saving the cwnd value between calls. (2) Improvements in rxrpc tracepoints, including splitting the timer tracepoint into a set-timer and a timer-expired trace. (3) Addition of a new proc file to display some stats. (4) Some code cleanups, including removing some unused bits and unnecessary header inclusions. (5) A change to the recently added UDP encap_err_rcv hook so that it has the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc point its UDP socket's hook directly at those. (6) Definition of a new struct, rxrpc_txbuf, that is used to hold transmissible packets of DATA and ACK type in a single 2KiB block rather than using an sk_buff. This allows the buffer to be on a number of queues simultaneously more easily, and also guarantees that the entire block is in a single unit for zerocopy purposes and that the data payload is aligned for in-place crypto purposes. (7) ACK txbufs are allocated at proposal and queued for later transmission rather than being stored in a single place in the rxrpc_call struct, which means only a single ACK can be pending transmission at a time. The queue is then drained at various points. This allows the ACK generation code to be simplified. (8) The Rx ring buffer is removed. When a jumbo packet is received (which comprises a number of ordinary DATA packets glued together), it used to be pointed to by the ring multiple times, with an annotation in a side ring indicating which subpacket was in that slot - but this is no longer possible. Instead, the packet is cloned once for each subpacket, barring the last, and the range of data is set in the skb private area. This makes it easier for the subpackets in a jumbo packet to be decrypted in parallel. (9) The Tx ring buffer is removed. The side annotation ring that held the SACK information is also removed. Instead, in the event of packet loss, the SACK data attached an ACK packet is parsed. (10) Allocate an skcipher request when needed in the rxkad security class rather than caching one in the rxrpc_call struct. This deals with a race between externally-driven call disconnection getting rid of the skcipher request and sendmsg/recvmsg trying to use it because they haven't seen the completion yet. This is also needed to support parallelisation as the skcipher request cannot be used by two or more threads simultaneously. (11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going through kernel_sendmsg() so that we can provide our own iterator (zerocopy explicitly doesn't work with a KVEC iterator). This also lets us avoid the overhead of the security hook. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-09net/core: Allow live renaming when an interface is upAndy Ren
Allow a network interface to be renamed when the interface is up. As described in the netconsole documentation [1], when netconsole is used as a built-in, it will bring up the specified interface as soon as possible. As a result, user space will not be able to rename the interface since the kernel disallows renaming of interfaces that are administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set by the kernel. The original solution [2] to this problem was to add a new parameter to the netconsole configuration parameters that allows renaming of the interface used by netconsole while it is administratively up. However, during the discussion that followed, it became apparent that we have no reason to keep the current restriction and instead we should allow user space to rename interfaces regardless of their administrative state: 1. The restriction was put in place over 20 years ago when renaming was only possible via IOCTL and before rtnetlink started notifying user space about such changes like it does today. 2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version 5.2 and no regressions were reported. 3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about the administrative state of interface. Therefore, allow user space to rename running interfaces by removing the restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in possible triage by emitting a message to the kernel log that an interface was renamed while UP. [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/ Signed-off-by: Andy Ren <andy.ren@getcruise.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-08Merge tag 'linux-can-fixes-for-6.1-20221107' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== can 2022-11-07 The first patch is by Chen Zhongjin and adds a missing dev_remove_pack() to the AF_CAN protocol. Zhengchao Shao's patch fixes a potential NULL pointer deref in AF_CAN's can_rx_register(). The next patch is by Oliver Hartkopp and targets the CAN ISO-TP protocol, and fixes the state handling for echo TX processing. Oliver Hartkopp's patch for the j1939 protocol adds a missing initialization of the CAN headers inside outgoing skbs. Another patch by Oliver Hartkopp fixes an out of bounds read in the check for invalid CAN frames in the xmit callback of virtual CAN devices. This touches all non virtual device drivers as we decided to rename the function requiring that netdev_priv points to a struct can_priv. (Note: This patch will create a merge conflict with net-next where the pch_can driver has removed.) The last patch is by Geert Uytterhoeven and adds the missing ECC error checks for the channels 2-7 in the rcar_canfd driver. * tag 'linux-can-fixes-for-6.1-20221107' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: rcar_canfd: Add missing ECC error checks for channels 2-7 can: dev: fix skb drop check can: j1939: j1939_send_one(): fix missing CAN header initialization can: isotp: fix tx state handling for echo tx processing can: af_can: fix NULL pointer dereference in can_rx_register() can: af_can: can_exit(): add missing dev_remove_pack() of canxl_packet ==================== Link: https://lore.kernel.org/r/20221107133217.59861-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-08netfilter: Cleanup nft_net->module_list from nf_tables_exit_net()Shigeru Yoshida
syzbot reported a warning like below [1]: WARNING: CPU: 3 PID: 9 at net/netfilter/nf_tables_api.c:10096 nf_tables_exit_net+0x71c/0x840 Modules linked in: CPU: 2 PID: 9 Comm: kworker/u8:0 Tainted: G W 6.1.0-rc3-00072-g8e5423e991e8 #47 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:nf_tables_exit_net+0x71c/0x840 ... Call Trace: <TASK> ? __nft_release_table+0xfc0/0xfc0 ops_exit_list+0xb5/0x180 cleanup_net+0x506/0xb10 ? unregister_pernet_device+0x80/0x80 process_one_work+0xa38/0x1730 ? pwq_dec_nr_in_flight+0x2b0/0x2b0 ? rwlock_bug.part.0+0x90/0x90 ? _raw_spin_lock_irq+0x46/0x50 worker_thread+0x67e/0x10e0 ? process_one_work+0x1730/0x1730 kthread+0x2e5/0x3a0 ? kthread_complete_and_exit+0x40/0x40 ret_from_fork+0x1f/0x30 </TASK> In nf_tables_exit_net(), there is a case where nft_net->commit_list is empty but nft_net->module_list is not empty. Such a case occurs with the following scenario: 1. nfnetlink_rcv_batch() is called 2. nf_tables_newset() returns -EAGAIN and NFNL_BATCH_FAILURE bit is set to status 3. nf_tables_abort() is called with NFNL_ABORT_AUTOLOAD (nft_net->commit_list is released, but nft_net->module_list is not because of NFNL_ABORT_AUTOLOAD flag) 4. Jump to replay label 5. netlink_skb_clone() fails and returns from the function (this is caused by fault injection in the reproducer of syzbot) This patch fixes this issue by calling __nf_tables_abort() when nft_net->module_list is not empty in nf_tables_exit_net(). Fixes: eb014de4fd41 ("netfilter: nf_tables: autoload modules from the abort path") Link: https://syzkaller.appspot.com/bug?id=802aba2422de4218ad0c01b46c9525cc9d4e4aa3 [1] Reported-by: syzbot+178efee9e2d7f87f5103@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-11-08netfilter: nfnetlink: fix potential dead lock in nfnetlink_rcv_msg()Ziyang Xuan
When type is NFNL_CB_MUTEX and -EAGAIN error occur in nfnetlink_rcv_msg(), it does not execute nfnl_unlock(). That would trigger potential dead lock. Fixes: 50f2db9e368f ("netfilter: nfnetlink: consolidate callback types") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-11-08rxrpc: Allocate an skcipher each time needed rather than reusingDavid Howells
In the rxkad security class, allocate the skcipher used to do packet encryption and decription rather than allocating one up front and reusing it for each packet. Reusing the skcipher precludes doing crypto in parallel. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Fix congestion managementDavid Howells
rxrpc has a problem in its congestion management in that it saves the congestion window size (cwnd) from one call to another, but if this is 0 at the time is saved, then the next call may not actually manage to ever transmit anything. To this end: (1) Don't save cwnd between calls, but rather reset back down to the initial cwnd and re-enter slow-start if data transmission is idle for more than an RTT. (2) Preserve ssthresh instead, as that is a handy estimate of pipe capacity. Knowing roughly when to stop slow start and enter congestion avoidance can reduce the tendency to overshoot and drop larger amounts of packets when probing. In future, cwind growth also needs to be constrained when the window isn't being filled due to being application limited. Reported-by: Simon Wilkinson <sxw@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Remove the rxtx ringDavid Howells
The Rx/Tx ring is no longer used, so remove it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Save last ACK's SACK table rather than marking txbufsDavid Howells
Improve the tracking of which packets need to be transmitted by saving the last ACK packet that we receive that has a populated soft-ACK table rather than marking packets. Then we can step through the soft-ACK table and look at the packets we've transmitted beyond that to determine which packets we might want to retransmit. We also look at the highest serial number that has been acked to try and guess which packets we've transmitted the peer is likely to have seen. If necessary, we send a ping to retrieve that number. One downside that might be a problem is that we can't then compare the previous acked/unacked state so easily in rxrpc_input_soft_acks() - which is a potential problem for the slow-start algorithm. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Remove call->lockDavid Howells
call->lock is no longer necessary, so remove it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Don't use a ring buffer for call Tx queueDavid Howells
Change the way the Tx queueing works to make the following ends easier to achieve: (1) The filling of packets, the encryption of packets and the transmission of packets can be handled in parallel by separate threads, rather than rxrpc_sendmsg() allocating, filling, encrypting and transmitting each packet before moving onto the next one. (2) Get rid of the fixed-size ring which sets a hard limit on the number of packets that can be retained in the ring. This allows the number of packets to increase without having to allocate a very large ring or having variable-sized rings. [Note: the downside of this is that it's then less efficient to locate a packet for retransmission as we then have to step through a list and examine each buffer in the list.] (3) Allow the filler/encrypter to run ahead of the transmission window. (4) Make it easier to do zero copy UDP from the packet buffers. (5) Make it easier to do zero copy from userspace to the packet buffers - and thence to UDP (only if for unauthenticated connections). To that end, the following changes are made: (1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets to be transmitted in. This allows them to be placed on multiple queues simultaneously. An sk_buff isn't really necessary as it's never passed on to lower-level networking code. (2) Keep the transmissable packets in a linked list on the call struct rather than in a ring. As a consequence, the annotation buffer isn't used either; rather a flag is set on the packet to indicate ackedness. (3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be transmitted has been queued. Add RXRPC_CALL_TX_ALL_ACKED to indicate that all packets up to and including the last got hard acked. (4) Wire headers are now stored in the txbuf rather than being concocted on the stack and they're stored immediately before the data, thereby allowing zerocopy of a single span. (5) Don't bother with instant-resend on transmission failure; rather, leave it for a timer or an ACK packet to trigger. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Get rid of the Rx ringDavid Howells
Get rid of the Rx ring and replace it with a pair of queues instead. One queue gets the packets that are in-sequence and are ready for processing by recvmsg(); the other queue gets the out-of-sequence packets for addition to the first queue as the holes get filled. The annotation ring is removed and replaced with a SACK table. The SACK table has the bits set that correspond exactly to the sequence number of the packet being acked. The SACK ring is copied when an ACK packet is being assembled and rotated so that the first ACK is in byte 0. Flow control handling is altered so that packets that are moved to the in-sequence queue are hard-ACK'd even before they're consumed - and then the Rx window size in the ACK packet (rsize) is shrunk down to compensate (even going to 0 if the window is full). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Clone received jumbo subpackets and queue separatelyDavid Howells
Split up received jumbo packets into separate skbuffs by cloning the original skbuff for each subpacket and setting the offset and length of the data in that subpacket in the skbuff's private data. The subpackets are then placed on the recvmsg queue separately. The security class then gets to revise the offset and length to remove its metadata. If we fail to clone a packet, we just drop it and let the peer resend it. The original packet gets used for the final subpacket. This should make it easier to handle parallel decryption of the subpackets. It also simplifies the handling of lost or misordered packets in the queuing/buffering loop as the possibility of overlapping jumbo packets no longer needs to be considered. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Split the rxrpc_recvmsg tracepointDavid Howells
Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about data packet processing (and which have extra pieces of information) are separate from the tracepoint that shows the general flow of recvmsg(). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Clean up ACK handlingDavid Howells
Clean up the rxrpc_propose_ACK() function. If deferred PING ACK proposal is split out, it's only really needed for deferred DELAY ACKs. All other ACKs, bar terminal IDLE ACK are sent immediately. The deferred IDLE ACK submission can be handled by conversion of a DELAY ACK into an IDLE ACK if there's nothing to be SACK'd. Also, because there's a delay between an ACK being generated and being transmitted, it's possible that other ACKs of the same type will be generated during that interval. Apart from the ACK time and the serial number responded to, most of the ACK body, including window and SACK parameters, are not filled out till the point of transmission - so we can avoid generating a new ACK if there's one pending that will cover the SACK data we need to convey. Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one already pending. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Allocate ACK records at proposal and queue for transmissionDavid Howells
Allocate rxrpc_txbuf records for ACKs and put onto a queue for the transmitter thread to dispatch. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Define rxrpc_txbuf struct to carry data to be transmittedDavid Howells
Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a socket buffer so that it can be placed onto multiple queues at once. This also allows the data buffer to be in the same allocation as the internal data. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Remove call->tx_phaseDavid Howells
Remove call->tx_phase as it's only ever set. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Remove the flags from the rxrpc_skb tracepointDavid Howells
Remove the flags from the rxrpc_skb tracepoint as we're no longer going to be using this for the transmission buffers and so marking which are transmission buffers isn't going to be necessary. Note that this also remove the rxrpc skb flag that indicates if this is a transmission buffer and so the count is not updated for the moment. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Remove unnecessary header inclusionsDavid Howells
Remove a bunch of unnecessary header inclusions. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Call udp_sendmsg() directlyDavid Howells
Call udp_sendmsg() and udpv6_sendmsg() directly rather than calling kernel_sendmsg() as the latter assumes we want a kvec-class iterator. However, zerocopy explicitly doesn't work with such an iterator. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Use the core ICMP/ICMP6 parsersDavid Howells
Make rxrpc_encap_rcv_err() pass the ICMP/ICMP6 skbuff to ip_icmp_error() or ipv6_icmp_error() as appropriate to do the parsing rather than trying to do it in rxrpc. This pushes an error report onto the UDP socket's error queue and calls ->sk_error_report() from which point rxrpc can pick it up. It would be preferable to steal the packet directly from ip*_icmp_error() rather than letting it get queued, but this is probably good enough. Also note that __udp4_lib_err() calls sk_error_report() twice in some cases. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error()David Howells
Change the udp encap_err_rcv signature to match ip_icmp_error() and ipv6_icmp_error() so that those can be used from the called function and export them. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org
2022-11-08rxrpc: Fix ack.bufferSize to be 0 when generating an ackDavid Howells
ack.bufferSize should be set to 0 when generating an ack. Fixes: 8d94aa381dab ("rxrpc: Calls shouldn't hold socket refs") Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Record stats for why the REQUEST-ACK flag is being setDavid Howells
Record stats for why the REQUEST-ACK flag is being set. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Record statistics about ACK typesDavid Howells
Record statistics about the different types of ACKs that have been transmitted and received and the number of ACKs that have been filled out and transmitted or that have been skipped. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Add stats procfile and DATA packet statsDavid Howells
Add a procfile, /proc/net/rxrpc/stats, to display some statistics about what rxrpc has been doing. Writing a blank line to the stats file will clear the increment-only counters. Allocated resource counters don't get cleared. Add some counters to count various things about DATA packets, including the number created, transmitted and retransmitted and the number received, the number of ACK-requests markings and the number of jumbo packets received. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Track highest acked serialDavid Howells
Keep track of the highest DATA serial number that has been acked by the peer for future purposes. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Split call timer-expiration from call timer-set tracepointDavid Howells
Split the tracepoint for call timer-set to separate out the call timer-expiration event Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08rxrpc: Trace setting of the request-ack flagDavid Howells
Add a tracepoint to log why the request-ack flag is set on an outgoing DATA packet, allowing debugging as to why. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2022-11-08net: sched: add helper support in act_ctXin Long
This patch is to add helper support in act_ct for OVS actions=ct(alg=xxx) offloading, which is corresponding to Commit cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") in OVS kernel part. The difference is when adding TC actions family and proto cannot be got from the filter/match, other than helper name in tb[TCA_CT_HELPER_NAME], we also need to send the family in tb[TCA_CT_HELPER_FAMILY] and the proto in tb[TCA_CT_HELPER_PROTO] to kernel. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08net: sched: call tcf_ct_params_free to free params in tcf_ct_initXin Long
This patch is to make the err path simple by calling tcf_ct_params_free(), so that it won't cause problems when more members are added into param and need freeing on the err path. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08net: move add ct helper function to nf_conntrack_helper for ovs and tcXin Long
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and rename as nf_ct_add_helper, so that it can be used in TC act_ct in the next patch. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08net: move the ct helper function to nf_conntrack_helper for ovs and tcXin Long
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename as nf_ct_helper so that it can be used in TC act_ct in the next patch. Note that it also adds the checks for the family and proto, as in TC act_ct, the packets with correct family and proto are not guaranteed. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08ethtool: Fail number of channels change when it conflicts with rxnfcGal Pressman
Similar to what we do with the hash indirection table [1], when network flow classification rules are forwarding traffic to channels greater than the requested number of channels, fail the operation. Without this, traffic could be directed to channels which no longer exist (dropped) after changing number of channels. [1] commit d4ab4286276f ("ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH}") Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20221106123127.522985-1-gal@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08ethtool: linkstate: add a statistic for PHY down eventsJakub Kicinski
The previous attempt to augment carrier_down (see Link) was not met with much enthusiasm so let's do the simple thing of exposing what some devices already maintain. Add a common ethtool statistic for link going down. Currently users have to maintain per-driver mapping to extract the right stat from the vendor-specific ethtool -S stats. carrier_down does not fit the bill because it counts a lot of software related false positives. Add the statistic to the extended link state API to steer vendors towards implementing all of it. Implement for bnxt and all Linux-controlled PHYs. mlx5 and (possibly) enic also have a counter for this but I leave the implementation to their maintainers. Link: https://lore.kernel.org/r/20220520004500.2250674-1-kuba@kernel.org Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221104190125.684910-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-07sctp: clear out_curr if all frag chunks of current msg are prunedXin Long
A crash was reported by Zhen Chen: list_del corruption, ffffa035ddf01c18->next is NULL WARNING: CPU: 1 PID: 250682 at lib/list_debug.c:49 __list_del_entry_valid+0x59/0xe0 RIP: 0010:__list_del_entry_valid+0x59/0xe0 Call Trace: sctp_sched_dequeue_common+0x17/0x70 [sctp] sctp_sched_fcfs_dequeue+0x37/0x50 [sctp] sctp_outq_flush_data+0x85/0x360 [sctp] sctp_outq_uncork+0x77/0xa0 [sctp] sctp_cmd_interpreter.constprop.0+0x164/0x1450 [sctp] sctp_side_effects+0x37/0xe0 [sctp] sctp_do_sm+0xd0/0x230 [sctp] sctp_primitive_SEND+0x2f/0x40 [sctp] sctp_sendmsg_to_asoc+0x3fa/0x5c0 [sctp] sctp_sendmsg+0x3d5/0x440 [sctp] sock_sendmsg+0x5b/0x70 and in sctp_sched_fcfs_dequeue() it dequeued a chunk from stream out_curr outq while this outq was empty. Normally stream->out_curr must be set to NULL once all frag chunks of current msg are dequeued, as we can see in sctp_sched_dequeue_done(). However, in sctp_prsctp_prune_unsent() as it is not a proper dequeue, sctp_sched_dequeue_done() is not called to do this. This patch is to fix it by simply setting out_curr to NULL when the last frag chunk of current msg is dequeued from out_curr stream in sctp_prsctp_prune_unsent(). Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Caowangbao <caowangbao@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07sctp: remove the unnecessary sinfo_stream check in sctp_prsctp_prune_unsentXin Long
Since commit 5bbbbe32a431 ("sctp: introduce stream scheduler foundations"), sctp_stream_outq_migrate() has been called in sctp_stream_init/update to removes those chunks to streams higher than the new max. There is no longer need to do such check in sctp_prsctp_prune_unsent(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07tipc: fix the msg->req tlv len check in tipc_nl_compat_name_table_dump_headerXin Long
This is a follow-up for commit 974cb0e3e7c9 ("tipc: fix uninit-value in tipc_nl_compat_name_table_dump") where it should have type casted sizeof(..) to int to work when TLV_GET_DATA_LEN() returns a negative value. syzbot reported a call trace because of it: BUG: KMSAN: uninit-value in ... tipc_nl_compat_name_table_dump+0x841/0xea0 net/tipc/netlink_compat.c:934 __tipc_nl_compat_dumpit+0xab2/0x1320 net/tipc/netlink_compat.c:238 tipc_nl_compat_dumpit+0x991/0xb50 net/tipc/netlink_compat.c:321 tipc_nl_compat_recv+0xb6e/0x1640 net/tipc/netlink_compat.c:1324 genl_family_rcv_msg_doit net/netlink/genetlink.c:731 [inline] genl_family_rcv_msg net/netlink/genetlink.c:775 [inline] genl_rcv_msg+0x103f/0x1260 net/netlink/genetlink.c:792 netlink_rcv_skb+0x3a5/0x6c0 net/netlink/af_netlink.c:2501 genl_rcv+0x3c/0x50 net/netlink/genetlink.c:803 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] Reported-by: syzbot+e5dbaaa238680ce206ea@syzkaller.appspotmail.com Fixes: 974cb0e3e7c9 ("tipc: fix uninit-value in tipc_nl_compat_name_table_dump") Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/ccd6a7ea801b15aec092c3b532a883b4c5708695.1667594933.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07netlink: Fix potential skb memleak in netlink_ackTao Chen
Fix coverity issue 'Resource leak'. We should clean the skb resource if nlmsg_put/append failed. Fixes: 738136a0e375 ("netlink: split up copies in the ack construction") Signed-off-by: Tao Chen <chentao.kernel@linux.alibaba.com> Link: https://lore.kernel.org/r/bff442d62c87de6299817fe1897cc5a5694ba9cc.1667638204.git.chentao.kernel@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-07can: j1939: j1939_send_one(): fix missing CAN header initializationOliver Hartkopp
The read access to struct canxl_frame::len inside of a j1939 created skbuff revealed a missing initialization of reserved and later filled elements in struct can_frame. This patch initializes the 8 byte CAN header with zero. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Cc: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/linux-can/20221104052235.GA6474@pengutronix.de Reported-by: syzbot+d168ec0caca4697e03b1@syzkaller.appspotmail.com Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20221104075000.105414-1-socketcan@hartkopp.net Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-11-07can: isotp: fix tx state handling for echo tx processingOliver Hartkopp
In commit 4b7fe92c0690 ("can: isotp: add local echo tx processing for consecutive frames") the data flow for consecutive frames (CF) has been reworked to improve the reliability of long data transfers. This rework did not touch the transmission and the tx state changes of single frame (SF) transfers which likely led to the WARN in the isotp_tx_timer_handler() catching a wrong tx state. This patch makes use of the improved frame processing for SF frames and sets the ISOTP_SENDING state in isotp_sendmsg() within the cmpxchg() condition handling. A review of the state machine and the timer handling additionally revealed a missing echo timeout handling in the case of the burst mode in isotp_rcv_echo() and removes a potential timer configuration uncertainty in isotp_rcv_fc() when the receiver requests consecutive frames. Fixes: 4b7fe92c0690 ("can: isotp: add local echo tx processing for consecutive frames") Link: https://lore.kernel.org/linux-can/CAO4mrfe3dG7cMP1V5FLUkw7s+50c9vichigUMQwsxX4M=45QEw@mail.gmail.com/T/#u Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: stable@vger.kernel.org # v6.0 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20221104142551.16924-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-11-07can: af_can: fix NULL pointer dereference in can_rx_register()Zhengchao Shao
It causes NULL pointer dereference when testing as following: (a) use syscall(__NR_socket, 0x10ul, 3ul, 0) to create netlink socket. (b) use syscall(__NR_sendmsg, ...) to create bond link device and vxcan link device, and bind vxcan device to bond device (can also use ifenslave command to bind vxcan device to bond device). (c) use syscall(__NR_socket, 0x1dul, 3ul, 1) to create CAN socket. (d) use syscall(__NR_bind, ...) to bind the bond device to CAN socket. The bond device invokes the can-raw protocol registration interface to receive CAN packets. However, ml_priv is not allocated to the dev, dev_rcv_lists is assigned to NULL in can_rx_register(). In this case, it will occur the NULL pointer dereference issue. The following is the stack information: BUG: kernel NULL pointer dereference, address: 0000000000000008 PGD 122a4067 P4D 122a4067 PUD 1223c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:can_rx_register+0x12d/0x1e0 Call Trace: <TASK> raw_enable_filters+0x8d/0x120 raw_enable_allfilters+0x3b/0x130 raw_bind+0x118/0x4f0 __sys_bind+0x163/0x1a0 __x64_sys_bind+0x1e/0x30 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fixes: 4e096a18867a ("net: introduce CAN specific pointer in the struct net_device") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/all/20221028085650.170470-1-shaozhengchao@huawei.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-11-07can: af_can: can_exit(): add missing dev_remove_pack() of canxl_packetChen Zhongjin
In can_init(), dev_add_pack(&canxl_packet) is added but not removed in can_exit(). It breaks the packet handler list and can make kernel panic when can_init() is called for the second time. | > modprobe can && rmmod can | > rmmod xxx && modprobe can | | BUG: unable to handle page fault for address: fffffbfff807d7f4 | RIP: 0010:dev_add_pack+0x133/0x1f0 | Call Trace: | <TASK> | can_init+0xaa/0x1000 [can] | do_one_initcall+0xd3/0x4e0 | ... Fixes: fb08cba12b52 ("can: canxl: update CAN infrastructure for CAN XL frames") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20221031033053.37849-1-chenzhongjin@huawei.com [mkl: adjust subject and commit message] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-11-07genetlink: convert control family to split opsJakub Kicinski
Prove that the split ops work. Sadly we need to keep bug-wards compatibility and specify the same policy for dump as do, even tho we don't parse inputs for the dump. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07genetlink: allow families to use split ops directlyJakub Kicinski
Let families to hook in the new split ops. They are more flexible and should not be much larger than full ops. Each split op is 40B while full op is 48B. Devlink for example has 54 dos and 19 dumps, 2 of the dumps do not have a do -> 56 full commands = 2688B. Split ops would have taken 2920B, so 9% more space while allowing individual per/post doit and per-type policies. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07genetlink: inline old iteration helpersJakub Kicinski
All dumpers use the iterators now, inline the cmd by index stuff into iterator code. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-07genetlink: use iterator in the op to policy map dumpingJakub Kicinski
We can't put the full iterator in the struct ctrl_dump_policy_ctx because dump context is statically sized by netlink core. Allocate it dynamically. Rename policy to dump_map to make the logic a little easier to follow. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>