|
call->lock is no longer necessary, so remove it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Change the way the Tx queueing works to make the following ends easier to
achieve:
(1) The filling of packets, the encryption of packets and the transmission
of packets can be handled in parallel by separate threads, rather than
rxrpc_sendmsg() allocating, filling, encrypting and transmitting each
packet before moving onto the next one.
(2) Get rid of the fixed-size ring which sets a hard limit on the number
of packets that can be retained in the ring. This allows the number
of packets to increase without having to allocate a very large ring or
having variable-sized rings.
[Note: the downside of this is that it's then less efficient to locate
a packet for retransmission as we then have to step through a list and
examine each buffer in the list.]
(3) Allow the filler/encrypter to run ahead of the transmission window.
(4) Make it easier to do zero copy UDP from the packet buffers.
(5) Make it easier to do zero copy from userspace to the packet buffers -
and thence to UDP (though only for unauthenticated connections).
To that end, the following changes are made:
(1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets
to be transmitted in. This allows them to be placed on multiple
queues simultaneously. An sk_buff isn't really necessary as it's
never passed on to lower-level networking code.
(2) Keep the transmittable packets in a linked list on the call struct
rather than in a ring. As a consequence, the annotation buffer isn't
used either; rather, a flag is set on the packet to indicate that it has
been acked (see the sketch after this list).
(3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be
transmitted has been queued. Add RXRPC_CALL_TX_ALL_ACKED to indicate
that all packets up to and including the last got hard acked.
(4) Wire headers are now stored in the txbuf rather than being concocted
on the stack and they're stored immediately before the data, thereby
allowing zerocopy of a single span.
(5) Don't bother with instant-resend on transmission failure; rather,
leave it for a timer or an ACK packet to trigger.
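As an illustration of point (2) above and the note about retransmission cost, here is a minimal sketch of a transmit buffer kept on a list with an "acked" flag; all type, field and flag names are hypothetical, not the actual rxrpc ones:

#include <linux/list.h>
#include <linux/bitops.h>

struct tx_entry {
    struct list_head    call_link;  /* link in the call's Tx list */
    unsigned int        seq;        /* DATA sequence number */
    unsigned long       flags;
};
#define TX_ENTRY_ACKED  0           /* hard-acked; never retransmit */

/* O(n) list walk that replaces the old O(1) ring lookup on retransmission. */
static struct tx_entry *find_for_retransmit(struct list_head *tx_list,
                                            unsigned int seq)
{
    struct tx_entry *txe;

    list_for_each_entry(txe, tx_list, call_link) {
        if (txe->seq != seq)
            continue;
        return test_bit(TX_ENTRY_ACKED, &txe->flags) ? NULL : txe;
    }
    return NULL;
}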
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Get rid of the Rx ring and replace it with a pair of queues instead. One
queue gets the packets that are in-sequence and are ready for processing by
recvmsg(); the other queue gets the out-of-sequence packets for addition to
the first queue as the holes get filled.
The annotation ring is removed and replaced with a SACK table. The SACK
table has the bits set that correspond exactly to the sequence number of
the packet being acked. The SACK ring is copied when an ACK packet is
being assembled and rotated so that the first ACK is in byte 0.
Flow control handling is altered so that packets that are moved to the
in-sequence queue are hard-ACK'd even before they're consumed - and then
the Rx window size in the ACK packet (rsize) is shrunk down to compensate
(even going to 0 if the window is full).
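The rotation of the SACK table can be pictured with a short sketch (the table size and names are illustrative, not the rxrpc ones): the circular per-call table is copied into the ACK being assembled so that the entry for the first packet in the receive window lands in byte 0:

#include <linux/types.h>

#define SACK_TABLE_SIZE 256 /* power of two so the index can simply be masked */

static void copy_rotated_sacks(u8 *ack_sacks, const u8 *sack_table,
                               unsigned int window_base, unsigned int nr_to_report)
{
    unsigned int i;

    for (i = 0; i < nr_to_report; i++)
        ack_sacks[i] = sack_table[(window_base + i) & (SACK_TABLE_SIZE - 1)];
}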
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Split up received jumbo packets into separate skbuffs by cloning the
original skbuff for each subpacket and setting the offset and length of the
data in that subpacket in the skbuff's private data. The subpackets are
then placed on the recvmsg queue separately. The security class then gets
to revise the offset and length to remove its metadata.
If we fail to clone a packet, we just drop it and let the peer resend it.
The original packet gets used for the final subpacket.
This should make it easier to handle parallel decryption of the subpackets.
It also simplifies the handling of lost or misordered packets in the
queuing/buffering loop as the possibility of overlapping jumbo packets no
longer needs to be considered.
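A rough sketch of the splitting scheme (the cb layout and helper are hypothetical, not the real rxrpc_skb_priv or rxrpc code):

#include <linux/skbuff.h>

struct subpacket_priv {
    unsigned int offset;    /* where this subpacket's data starts in the skb */
    unsigned int len;       /* length of this subpacket's data */
};

static int queue_subpacket(struct sk_buff *jumbo, struct sk_buff_head *recvq,
                           unsigned int offset, unsigned int len, bool last)
{
    /* Earlier subpackets are clones; the final one reuses the original skb. */
    struct sk_buff *part = last ? jumbo : skb_clone(jumbo, GFP_ATOMIC);
    struct subpacket_priv *priv;

    if (!part)
        return -ENOMEM; /* drop it; the peer will retransmit */

    priv = (struct subpacket_priv *)part->cb;
    priv->offset = offset;  /* the security class may revise these later */
    priv->len = len;
    skb_queue_tail(recvq, part);
    return 0;
}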
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about
data packet processing (and which have extra pieces of information) are
separate from the tracepoint that shows the general flow of recvmsg().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Clean up the rxrpc_propose_ACK() function. If deferred PING ACK proposal
is split out, it's only really needed for deferred DELAY ACKs. All other
ACKs, bar the terminal IDLE ACK, are sent immediately. The deferred IDLE ACK
submission can be handled by conversion of a DELAY ACK into an IDLE ACK if
there's nothing to be SACK'd.
Also, because there's a delay between an ACK being generated and being
transmitted, it's possible that other ACKs of the same type will be
generated during that interval. Apart from the ACK time and the serial
number responded to, most of the ACK body, including the window and SACK
parameters, is not filled out until the point of transmission - so we can
avoid generating a new ACK if there's one pending that will cover the SACK
data we need to convey.
Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one
already pending.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Allocate rxrpc_txbuf records for ACKs and put them onto a queue for the
transmitter thread to dispatch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a
socket buffer so that it can be placed onto multiple queues at once. This
also allows the data buffer to be in the same allocation as the internal
data.
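Roughly, the idea is a refcounted buffer that carries its own queue linkage and data; the field names below are illustrative, not the exact struct rxrpc_txbuf:

#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct example_txbuf {
    struct list_head    call_link;  /* on the call's Tx queue */
    struct list_head    tx_link;    /* on the transmitter's queue */
    refcount_t          ref;
    unsigned int        len;        /* amount of data in the buffer */
    unsigned long       flags;
    u8                  data[];     /* wire header + payload in the same allocation */
};

static struct example_txbuf *example_txbuf_alloc(size_t data_size, gfp_t gfp)
{
    struct example_txbuf *txb = kzalloc(sizeof(*txb) + data_size, gfp);

    if (!txb)
        return NULL;
    INIT_LIST_HEAD(&txb->call_link);
    INIT_LIST_HEAD(&txb->tx_link);
    refcount_set(&txb->ref, 1);
    return txb;
}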
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Remove call->tx_phase as it's only ever set and never read.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Remove the flags from the rxrpc_skb tracepoint as we're no longer going to
be using this for the transmission buffers and so marking which are
transmission buffers isn't going to be necessary.
Note that this also removes the rxrpc skb flag that indicates whether this is a
transmission buffer, so the count is not updated for the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Remove a bunch of unnecessary header inclusions.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Call udp_sendmsg() and udpv6_sendmsg() directly rather than calling
kernel_sendmsg() as the latter assumes we want a kvec-class iterator.
However, zerocopy explicitly doesn't work with such an iterator.
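A sketch of the difference (not the actual rxrpc code; the WRITE direction constant and the helper are assumptions for this illustration): a caller that has built a bvec-based msg_iter keeps it intact by calling the UDP entry point directly, whereas kernel_sendmsg() would rebuild msg_iter as a kvec iterator:

#include <linux/uio.h>
#include <net/udp.h>

static int send_bvec_direct(struct sock *sk, struct msghdr *msg,
                            struct bio_vec *bv, unsigned int nr_segs, size_t len)
{
    /* Keep a zerocopy-capable bvec iterator... */
    iov_iter_bvec(&msg->msg_iter, WRITE, bv, nr_segs, len);

    /* ...kernel_sendmsg() would have overwritten msg_iter with a kvec
     * iterator, which cannot be used for zerocopy.  (An IPv6 socket would
     * call udpv6_sendmsg() here instead.)
     */
    return udp_sendmsg(sk, msg, len);
}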
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Make rxrpc_encap_rcv_err() pass the ICMP/ICMP6 skbuff to ip_icmp_error() or
ipv6_icmp_error() as appropriate to do the parsing rather than trying to do
it in rxrpc.
This pushes an error report onto the UDP socket's error queue and calls
->sk_error_report() from which point rxrpc can pick it up.
It would be preferable to steal the packet directly from ip*_icmp_error()
rather than letting it get queued, but this is probably good enough.
Also note that __udp4_lib_err() calls sk_error_report() twice in some
cases.
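A simplified stand-in for the dispatch described above (not the actual rxrpc_encap_rcv_err(); the handler signature follows the encap_err_rcv change elsewhere in this series, which matches ip_icmp_error() and ipv6_icmp_error()):

#include <net/ip.h>
#include <net/ipv6.h>

static void example_encap_err_rcv(struct sock *sk, struct sk_buff *skb, int err,
                                  __be16 port, u32 info, u8 *payload)
{
    /* Let the IP layer parse the ICMP/ICMP6 packet; it queues an error
     * report on the UDP socket's error queue and calls ->sk_error_report(),
     * from which point the tunnel user can pick it up.
     */
    if (sk->sk_family == AF_INET6)
        ipv6_icmp_error(sk, skb, err, port, info, payload);
    else
        ip_icmp_error(sk, skb, err, port, info, payload);
}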
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Change the udp encap_err_rcv signature to match ip_icmp_error() and
ipv6_icmp_error() so that those can be called from the handler, and
export them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
|
|
ack.bufferSize should be set to 0 when generating an ack.
Fixes: 8d94aa381dab ("rxrpc: Calls shouldn't hold socket refs")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Record stats for why the REQUEST-ACK flag is being set.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Record statistics about the different types of ACKs that have been
transmitted and received and the number of ACKs that have been filled out
and transmitted or that have been skipped.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Add a procfile, /proc/net/rxrpc/stats, to display some statistics about
what rxrpc has been doing. Writing a blank line to the stats file will
clear the increment-only counters. Allocated resource counters don't get
cleared.
Add some counters to count various things about DATA packets, including the
number created, transmitted and retransmitted and the number received, the
number of ACK-request markings and the number of jumbo packets received.
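A sketch of the counter scheme (simplified; names are made up, not the actual rxrpc stats code): increment-only atomics are printed through the procfile's seq_file and reset when a blank line is written:

#include <linux/atomic.h>
#include <linux/seq_file.h>

static atomic_t stat_tx_data, stat_tx_data_retrans, stat_rx_data, stat_rx_jumbo;

static int example_stats_show(struct seq_file *seq, void *v)
{
    seq_printf(seq, "Data: sent=%d retrans=%d recv=%d jumbo=%d\n",
               atomic_read(&stat_tx_data),
               atomic_read(&stat_tx_data_retrans),
               atomic_read(&stat_rx_data),
               atomic_read(&stat_rx_jumbo));
    return 0;
}

static void example_stats_clear(void)
{
    /* Only increment-only counters are reset; counters tracking currently
     * allocated resources would be left alone.
     */
    atomic_set(&stat_tx_data, 0);
    atomic_set(&stat_tx_data_retrans, 0);
    atomic_set(&stat_rx_data, 0);
    atomic_set(&stat_rx_jumbo, 0);
}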
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Keep track of the highest DATA serial number that has been acked by the
peer for future purposes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Split the tracepoint for call timer-set to separate out the call
timer-expiration event.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
Add a tracepoint to log why the request-ack flag is set on an outgoing DATA
packet, to aid debugging.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
|
|
This patch adds helper support in act_ct for offloading OVS
actions=ct(alg=xxx), corresponding to commit cae3a2627520 ("openvswitch:
Allow attaching helpers to ct action") in the OVS kernel part.
The difference is that, when adding TC actions, the family and proto cannot
be obtained from the filter/match, so besides the helper name in
tb[TCA_CT_HELPER_NAME], userspace also needs to send the family in
tb[TCA_CT_HELPER_FAMILY] and the proto in tb[TCA_CT_HELPER_PROTO] to the
kernel.
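A sketch of what the netlink policy entries for these attributes could look like (the types and the length limit are assumptions here; the authoritative policy lives in net/sched/act_ct.c):

#include <net/netlink.h>
#include <net/netfilter/nf_conntrack_helper.h>  /* NF_CT_HELPER_NAME_LEN */
#include <linux/tc_act/tc_ct.h>                 /* TCA_CT_* attribute IDs */

static const struct nla_policy example_ct_helper_policy[TCA_CT_MAX + 1] = {
    [TCA_CT_HELPER_NAME]   = { .type = NLA_STRING, .len = NF_CT_HELPER_NAME_LEN },
    [TCA_CT_HELPER_FAMILY] = { .type = NLA_U8 },  /* e.g. NFPROTO_IPV4 / NFPROTO_IPV6 */
    [TCA_CT_HELPER_PROTO]  = { .type = NLA_U8 },  /* e.g. IPPROTO_TCP / IPPROTO_UDP */
};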
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This patch simplifies the error path by calling tcf_ct_params_free(), so
that it won't cause problems when more members that need freeing on the
error path are added into param.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and
rename as nf_ct_add_helper, so that it can be used in TC act_ct in
the next patch.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename
as nf_ct_helper so that it can be used in TC act_ct in the next patch.
Note that it also adds checks for the family and proto, since in TC act_ct
the packets are not guaranteed to have the correct family and proto.
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Similar to what we do with the hash indirection table [1], when network
flow classification rules are forwarding traffic to channels greater
than the requested number of channels, fail the operation.
Without this, traffic could be directed to channels which no longer
exist (and be dropped) after changing the number of channels.
[1] commit d4ab4286276f ("ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH}")
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20221106123127.522985-1-gal@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The previous attempt to augment carrier_down (see Link)
was not met with much enthusiasm so let's do the simple
thing of exposing what some devices already maintain.
Add a common ethtool statistic for link going down.
Currently users have to maintain a per-driver mapping
to extract the right stat from the vendor-specific ethtool -S
stats. carrier_down does not fit the bill because it counts
a lot of software related false positives.
Add the statistic to the extended link state API to steer
vendors towards implementing all of it.
Implement for bnxt and all Linux-controlled PHYs. mlx5 and (possibly)
enic also have a counter for this but I leave the implementation
to their maintainers.
Link: https://lore.kernel.org/r/20220520004500.2250674-1-kuba@kernel.org
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221104190125.684910-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
A crash was reported by Zhen Chen:
list_del corruption, ffffa035ddf01c18->next is NULL
WARNING: CPU: 1 PID: 250682 at lib/list_debug.c:49 __list_del_entry_valid+0x59/0xe0
RIP: 0010:__list_del_entry_valid+0x59/0xe0
Call Trace:
sctp_sched_dequeue_common+0x17/0x70 [sctp]
sctp_sched_fcfs_dequeue+0x37/0x50 [sctp]
sctp_outq_flush_data+0x85/0x360 [sctp]
sctp_outq_uncork+0x77/0xa0 [sctp]
sctp_cmd_interpreter.constprop.0+0x164/0x1450 [sctp]
sctp_side_effects+0x37/0xe0 [sctp]
sctp_do_sm+0xd0/0x230 [sctp]
sctp_primitive_SEND+0x2f/0x40 [sctp]
sctp_sendmsg_to_asoc+0x3fa/0x5c0 [sctp]
sctp_sendmsg+0x3d5/0x440 [sctp]
sock_sendmsg+0x5b/0x70
and in sctp_sched_fcfs_dequeue() a chunk was dequeued from the out_curr
stream's outq while that outq was empty.
Normally stream->out_curr must be set to NULL once all frag chunks of
current msg are dequeued, as we can see in sctp_sched_dequeue_done().
However, since sctp_prsctp_prune_unsent() is not a proper dequeue,
sctp_sched_dequeue_done() is not called to do this.
This patch fixes it by simply setting out_curr to NULL when the last frag
chunk of the current msg is removed from the out_curr stream in
sctp_prsctp_prune_unsent().
Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Zhen Chen <chenzhen126@huawei.com>
Tested-by: Caowangbao <caowangbao@huawei.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit 5bbbbe32a431 ("sctp: introduce stream scheduler foundations"),
sctp_stream_outq_migrate() has been called in sctp_stream_init/update to
remove the chunks queued to streams higher than the new max. There is no
longer a need to do such a check in sctp_prsctp_prune_unsent().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a follow-up for commit 974cb0e3e7c9 ("tipc: fix uninit-value
in tipc_nl_compat_name_table_dump"), where sizeof(..) should have been
cast to int so that the check works when TLV_GET_DATA_LEN() returns a
negative value.
syzbot reported a call trace because of it:
BUG: KMSAN: uninit-value in ...
tipc_nl_compat_name_table_dump+0x841/0xea0 net/tipc/netlink_compat.c:934
__tipc_nl_compat_dumpit+0xab2/0x1320 net/tipc/netlink_compat.c:238
tipc_nl_compat_dumpit+0x991/0xb50 net/tipc/netlink_compat.c:321
tipc_nl_compat_recv+0xb6e/0x1640 net/tipc/netlink_compat.c:1324
genl_family_rcv_msg_doit net/netlink/genetlink.c:731 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x103f/0x1260 net/netlink/genetlink.c:792
netlink_rcv_skb+0x3a5/0x6c0 net/netlink/af_netlink.c:2501
genl_rcv+0x3c/0x50 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
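A small self-contained demonstration (plain userspace C, not tipc code) of why the cast matters:

#include <stdio.h>

int main(void)
{
    int len = -4;   /* what a truncated TLV can yield */

    if (len < sizeof(int))      /* len is promoted to unsigned, so the check fails */
        puts("rejected without cast");
    if (len < (int)sizeof(int)) /* signed comparison, so the check works */
        puts("rejected with cast");
    return 0;
}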
Reported-by: syzbot+e5dbaaa238680ce206ea@syzkaller.appspotmail.com
Fixes: 974cb0e3e7c9 ("tipc: fix uninit-value in tipc_nl_compat_name_table_dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/ccd6a7ea801b15aec092c3b532a883b4c5708695.1667594933.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix a Coverity 'Resource leak' issue: the skb should be freed if
nlmsg_put/append fails.
Fixes: 738136a0e375 ("netlink: split up copies in the ack construction")
Signed-off-by: Tao Chen <chentao.kernel@linux.alibaba.com>
Link: https://lore.kernel.org/r/bff442d62c87de6299817fe1897cc5a5694ba9cc.1667638204.git.chentao.kernel@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A read access to struct canxl_frame::len inside a j1939-created skbuff
revealed a missing initialization of the reserved and later-filled
elements of struct can_frame.
This patch initializes the 8 byte CAN header with zero.
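A simplified sketch of the fix (not the verbatim patch): zero the whole 8-byte header before filling in the fields that are actually used:

#include <linux/can.h>
#include <linux/skbuff.h>
#include <linux/stddef.h>
#include <linux/string.h>

static void fill_can_header(struct sk_buff *skb, canid_t id, u8 len)
{
    struct can_frame *cf = skb_push(skb, offsetof(struct can_frame, data));

    /* can_id, len, padding and the reserved bytes: 8 bytes in total */
    memset(cf, 0, offsetof(struct can_frame, data));
    cf->can_id = id;
    cf->len = len;
}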
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/linux-can/20221104052235.GA6474@pengutronix.de
Reported-by: syzbot+d168ec0caca4697e03b1@syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20221104075000.105414-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
In commit 4b7fe92c0690 ("can: isotp: add local echo tx processing for
consecutive frames") the data flow for consecutive frames (CF) has been
reworked to improve the reliability of long data transfers.
This rework did not touch the transmission and the tx state changes of
single frame (SF) transfers which likely led to the WARN in the
isotp_tx_timer_handler() catching a wrong tx state. This patch makes use
of the improved frame processing for SF frames and sets the ISOTP_SENDING
state in isotp_sendmsg() within the cmpxchg() condition handling.
A review of the state machine and the timer handling additionally revealed
a missing echo timeout handling in the burst mode case of isotp_rcv_echo();
this patch adds it and removes a potential timer configuration uncertainty
in isotp_rcv_fc() when the receiver requests consecutive frames.
Fixes: 4b7fe92c0690 ("can: isotp: add local echo tx processing for consecutive frames")
Link: https://lore.kernel.org/linux-can/CAO4mrfe3dG7cMP1V5FLUkw7s+50c9vichigUMQwsxX4M=45QEw@mail.gmail.com/T/#u
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: stable@vger.kernel.org # v6.0
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20221104142551.16924-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
A NULL pointer dereference occurs when testing as follows:
(a) use syscall(__NR_socket, 0x10ul, 3ul, 0) to create netlink socket.
(b) use syscall(__NR_sendmsg, ...) to create bond link device and vxcan
link device, and bind vxcan device to bond device (can also use
ifenslave command to bind vxcan device to bond device).
(c) use syscall(__NR_socket, 0x1dul, 3ul, 1) to create CAN socket.
(d) use syscall(__NR_bind, ...) to bind the bond device to CAN socket.
The bond device invokes the can-raw protocol registration interface to
receive CAN packets. However, ml_priv is not allocated for the device, so
dev_rcv_lists is NULL in can_rx_register() and a NULL pointer dereference
occurs.
The following is the stack information:
BUG: kernel NULL pointer dereference, address: 0000000000000008
PGD 122a4067 P4D 122a4067 PUD 1223c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
RIP: 0010:can_rx_register+0x12d/0x1e0
Call Trace:
<TASK>
raw_enable_filters+0x8d/0x120
raw_enable_allfilters+0x3b/0x130
raw_bind+0x118/0x4f0
__sys_bind+0x163/0x1a0
__x64_sys_bind+0x1e/0x30
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
Fixes: 4e096a18867a ("net: introduce CAN specific pointer in the struct net_device")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/all/20221028085650.170470-1-shaozhengchao@huawei.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
In can_init(), dev_add_pack(&canxl_packet) is added but not removed in
can_exit(). This corrupts the packet handler list and can cause a kernel
panic when can_init() is called a second time.
| > modprobe can && rmmod can
| > rmmod xxx && modprobe can
|
| BUG: unable to handle page fault for address: fffffbfff807d7f4
| RIP: 0010:dev_add_pack+0x133/0x1f0
| Call Trace:
| <TASK>
| can_init+0xaa/0x1000 [can]
| do_one_initcall+0xd3/0x4e0
| ...
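The shape of the fix (a sketch, not the verbatim patch): every dev_add_pack() done in can_init() needs a matching dev_remove_pack() in can_exit(), otherwise a stale entry is left on the packet handler list once the module is unloaded:

static void __exit can_exit(void)
{
    /* ... protocol and notifier teardown elided ... */
    dev_remove_pack(&canxl_packet);  /* the missing counterpart added by this fix */
    dev_remove_pack(&canfd_packet);
    dev_remove_pack(&can_packet);
    sock_unregister(PF_CAN);
}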
Fixes: fb08cba12b52 ("can: canxl: update CAN infrastructure for CAN XL frames")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20221031033053.37849-1-chenzhongjin@huawei.com
[mkl: adjust subject and commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Prove that the split ops work.
Sadly we need to keep bug-wards compatibility and specify
the same policy for dump as for do, even though we don't parse
inputs for the dump.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Let families hook in the new split ops.
They are more flexible and should not be much larger than
full ops. Each split op is 40B while full op is 48B.
Devlink for example has 54 dos and 19 dumps, 2 of the dumps
do not have a do -> 56 full commands = 2688B.
Split ops would have taken 2920B, so 9% more space while
allowing individual pre/post doit and per-type policies.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All dumpers use the iterators now, so inline the cmd-by-index
stuff into the iterator code.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We can't put the full iterator in the struct ctrl_dump_policy_ctx
because dump context is statically sized by netlink core.
Allocate it dynamically.
Rename policy to dump_map to make the logic a little easier to follow.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Subsequent changes will expose split op structures to users,
so walking the family ops with just an index will get harder.
Add a structured iterator and convert the simple cases.
Policy dumping needs more careful conversion.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All callers go via genl_get_cmd_split() now, so rename it
to genl_get_cmd() and remove the original.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pass do and dump versions of the op to ctrl_dumppolicy_put_op()
so that it can provide a different policy index for the two.
Since we now look at policies, and those are set appropriately,
there's no need to look at the GENL_DONT_VALIDATE_DUMP flag.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Separate adding doit and dumpit policies for CTRL_CMD_GETPOLICY.
This has no effect until we actually allow do and dump to come
from different sources as netlink_policy_dump_add_policy()
does deduplication.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that genl_get_cmd_split() is informed what type of callback
the user is trying to access (do or dump), we can check that this
callback is indeed available and return an error early.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Set the policy and maxattr pointers based on validation flags.
genl_family_rcv_msg_attrs_parse() will do nothing and return NULL
if maxattrs is zero, so no behavior change is expected.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We currently have two forms of operations - small ops and "full" ops
(or just ops). The former does not have pointers for some of the less
commonly used features (namely dump start/done and policy).
The "full" ops, however, still don't contain all the necessary
information. In particular the policy is per command ID, while
do and dump often accept different attributes. It's also not
possible to define different pre_doit and post_doit callbacks
for different commands within the family.
At the same time a lot of commands do not support dumping and
therefore all the dump-related information is wasted space.
Create a new command representation which can hold info about
a do implementation or a dump implementation, but not both at
the same time.
Use this new representation on the command execution path
(genl_family_rcv_msg) as we either run a do or a dump and
don't have to create a "full" op there.
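Conceptually, the split representation pairs the command ID and validation data with either the do callbacks or the dump callbacks, never both; a rough sketch (field names are illustrative; see struct genl_split_ops in include/net/genetlink.h for the authoritative layout):

#include <net/genetlink.h>

struct example_split_op {                       /* ~40B on 64-bit vs ~48B for a full op */
    union {
        struct {                                /* used when the op implements "do" */
            int  (*pre_doit)(const struct example_split_op *ops,
                             struct sk_buff *skb, struct genl_info *info);
            int  (*doit)(struct sk_buff *skb, struct genl_info *info);
            void (*post_doit)(const struct example_split_op *ops,
                              struct sk_buff *skb, struct genl_info *info);
        };
        struct {                                /* used when the op implements "dump" */
            int (*start)(struct netlink_callback *cb);
            int (*dumpit)(struct sk_buff *skb, struct netlink_callback *cb);
            int (*done)(struct netlink_callback *cb);
        };
    };
    const struct nla_policy *policy;            /* per do/dump, not per command ID */
    unsigned int            maxattr;
    u8                      cmd;
    u8                      internal_flags;
    u8                      flags;
    u8                      validate;
};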
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The code at the top of ctrl_dumppolicy() dumps mappings between
ops and policies. It supports dumping both the entire family and
single op if dump is filtered. But both of those cases are handled
inside a loop, which makes the logic harder to follow and change.
Refactor to split the two cases more clearly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When copying a `struct ifaddrlblmsg` to the network, __ifal_reserved
remained uninitialized, resulting in a 1-byte infoleak:
BUG: KMSAN: kernel-network-infoleak in __netdev_start_xmit ./include/linux/netdevice.h:4841
__netdev_start_xmit ./include/linux/netdevice.h:4841
netdev_start_xmit ./include/linux/netdevice.h:4857
xmit_one net/core/dev.c:3590
dev_hard_start_xmit+0x1dc/0x800 net/core/dev.c:3606
__dev_queue_xmit+0x17e8/0x4350 net/core/dev.c:4256
dev_queue_xmit ./include/linux/netdevice.h:3009
__netlink_deliver_tap_skb net/netlink/af_netlink.c:307
__netlink_deliver_tap+0x728/0xad0 net/netlink/af_netlink.c:325
netlink_deliver_tap net/netlink/af_netlink.c:338
__netlink_sendskb net/netlink/af_netlink.c:1263
netlink_sendskb+0x1d9/0x200 net/netlink/af_netlink.c:1272
netlink_unicast+0x56d/0xf50 net/netlink/af_netlink.c:1360
nlmsg_unicast ./include/net/netlink.h:1061
rtnl_unicast+0x5a/0x80 net/core/rtnetlink.c:758
ip6addrlbl_get+0xfad/0x10f0 net/ipv6/addrlabel.c:628
rtnetlink_rcv_msg+0xb33/0x1570 net/core/rtnetlink.c:6082
...
Uninit was created at:
slab_post_alloc_hook+0x118/0xb00 mm/slab.h:742
slab_alloc_node mm/slub.c:3398
__kmem_cache_alloc_node+0x4f2/0x930 mm/slub.c:3437
__do_kmalloc_node mm/slab_common.c:954
__kmalloc_node_track_caller+0x117/0x3d0 mm/slab_common.c:975
kmalloc_reserve net/core/skbuff.c:437
__alloc_skb+0x27a/0xab0 net/core/skbuff.c:509
alloc_skb ./include/linux/skbuff.h:1267
nlmsg_new ./include/net/netlink.h:964
ip6addrlbl_get+0x490/0x10f0 net/ipv6/addrlabel.c:608
rtnetlink_rcv_msg+0xb33/0x1570 net/core/rtnetlink.c:6082
netlink_rcv_skb+0x299/0x550 net/netlink/af_netlink.c:2540
rtnetlink_rcv+0x26/0x30 net/core/rtnetlink.c:6109
netlink_unicast_kernel net/netlink/af_netlink.c:1319
netlink_unicast+0x9ab/0xf50 net/netlink/af_netlink.c:1345
netlink_sendmsg+0xebc/0x10f0 net/netlink/af_netlink.c:1921
...
This patch ensures that the reserved field is always initialized.
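The essence of the fix (a sketch, not the verbatim change to ip6addrlbl_fill()): write every byte of the fixed header, including the reserved one, before the message goes out:

#include <linux/if_addrlabel.h>
#include <linux/socket.h>

static void example_fill_ifal(struct ifaddrlblmsg *ifal, __u8 prefixlen,
                              int ifindex, __u32 seq)
{
    ifal->ifal_family = AF_INET6;
    ifal->__ifal_reserved = 0;  /* the byte that previously leaked */
    ifal->ifal_prefixlen = prefixlen;
    ifal->ifal_flags = 0;
    ifal->ifal_index = ifindex;
    ifal->ifal_seq = seq;
}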
Reported-by: syzbot+3553517af6020c4f2813f1003fe76ef3cbffe98d@syzkaller.appspotmail.com
Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If setsockopt() with option name TCP_REPAIR_OPTIONS and opt_code
TCPOPT_SACK_PERM is called to enable SACK after data is sent
and dupacks are received, it triggers a warning in
tcp_verify_left_out() as follows:
============================================
WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
tcp_timeout_mark_lost+0x154/0x160
tcp_enter_loss+0x2b/0x290
tcp_retransmit_timer+0x50b/0x640
tcp_write_timer_handler+0x1c8/0x340
tcp_write_timer+0xe5/0x140
call_timer_fn+0x3a/0x1b0
__run_timers.part.0+0x1bf/0x2d0
run_timer_softirq+0x43/0xb0
__do_softirq+0xfd/0x373
__irq_exit_rcu+0xf6/0x140
The warning is caused in the following steps:
1. a socket named socketA is created
2. socketA enters repair mode without building a connection
3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
directly
4. socketA leaves repair mode
5. socketA calls sendmsg() to send data; packets_out and sack_outs (dup
acks received) increase
6. socketA enters repair mode again
7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
increases
9. sack_outs + lost_out > packets_out triggers since lost_out and
sack_outs increase repeatedly
In tcp_timeout_mark_lost(), tp->sacked_out would have been cleared if
step 7 had not happened, and the warning would not be triggered. As
suggested by Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if
data was already sent.
The socket-tcp tests in CRIU have been run as follows:
$ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
--ignore-taint
where socket-tcp* stands for all socket-tcp tests in test/zdtm/static/.
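The shape of the check (a hypothetical helper, not the literal patch; the real test sits in do_tcp_setsockopt()'s TCP_REPAIR_OPTIONS handling):

#include <linux/tcp.h>
#include <net/tcp.h>

static bool tcp_repair_options_allowed(const struct tcp_sock *tp,
                                       const struct sock *sk)
{
    /* Renegotiating options such as SACK permission is only safe while
     * nothing has been transmitted on the connection yet.
     */
    return tp->repair && sk->sk_state == TCP_ESTABLISHED && !tp->bytes_sent;
}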
Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 4ff09db1b79b ("bpf: net: Change sk_getsockopt() to take the
sockptr_t argument") made it possible to call sk_getsockopt()
with both user and kernel address space buffers through the use of
the sockptr_t type. Unfortunately at the time of conversion the
security_socket_getpeersec_stream() LSM hook was written to only
accept userspace buffers, and in a desire to avoid having to change
the LSM hook the commit author simply passed the sockptr_t's
userspace buffer pointer. Since the only sk_getsockopt() callers
at the time of conversion which used kernel sockptr_t buffers did
not allow SO_PEERSEC, and hence the
security_socket_getpeersec_stream() hook, this was acceptable but
also very fragile as future changes presented the possibility of
silently passing kernel space pointers to the LSM hook.
There are several ways to protect against this, including careful
code review of future commits, but since relying on code review to
catch bugs is a recipe for disaster and the upstream eBPF maintainer
is "strongly against defensive programming", this patch updates the
LSM hook, and all of the implementations to support sockptr_t and
safely handle both user and kernel space buffers.
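As a sketch of what an updated hook implementation can do with sockptr_t (names are hypothetical; the real implementations live in the individual LSMs):

#include <linux/errno.h>
#include <linux/sockptr.h>

static int example_getpeersec_stream(sockptr_t optval, sockptr_t optlen,
                                     unsigned int len,
                                     const char *ctx, unsigned int ctx_len)
{
    if (ctx_len > len)
        return -ERANGE;
    /* copy_to_sockptr() writes to a user or kernel buffer as appropriate */
    if (copy_to_sockptr(optval, ctx, ctx_len))
        return -EFAULT;
    if (copy_to_sockptr(optlen, &ctx_len, sizeof(ctx_len)))
        return -EFAULT;
    return 0;
}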
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|