Age | Commit message (Collapse) | Author |
|
Syzkaller managed to trigger lock dependency in xsk_notify via
register_netdevice. As discussed in [0], using register_netdevice
in the notifiers is problematic so skip adding hamradio for ops-locked
devices.
xsk_notifier+0x89/0x230 net/xdp/xsk.c:1664
notifier_call_chain+0x1b6/0x3e0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2267 [inline]
call_netdevice_notifiers net/core/dev.c:2281 [inline]
unregister_netdevice_many_notify+0x14d7/0x1ff0 net/core/dev.c:12156
unregister_netdevice_many net/core/dev.c:12219 [inline]
unregister_netdevice_queue+0x33c/0x380 net/core/dev.c:12063
register_netdevice+0x1689/0x1ae0 net/core/dev.c:11241
bpq_new_device drivers/net/hamradio/bpqether.c:481 [inline]
bpq_device_event+0x491/0x600 drivers/net/hamradio/bpqether.c:523
notifier_call_chain+0x1b6/0x3e0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2267 [inline]
call_netdevice_notifiers net/core/dev.c:2281 [inline]
__dev_notify_flags+0x18d/0x2e0 net/core/dev.c:-1
netif_change_flags+0xe8/0x1a0 net/core/dev.c:9608
dev_change_flags+0x130/0x260 net/core/dev_api.c:68
devinet_ioctl+0xbb4/0x1b50 net/ipv4/devinet.c:1200
inet_ioctl+0x3c0/0x4c0 net/ipv4/af_inet.c:1001
0: https://lore.kernel.org/netdev/20250625140357.6203d0af@kernel.org/
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+e6300f66a999a6612477@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e6300f66a999a6612477
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250806213726.1383379-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Syzkaller managed to trigger lock dependency in xsk_notify via
register_netdevice. As discussed in [0], using register_netdevice
in the notifiers is problematic so skip adding lapbeth for ops-locked
devices.
xsk_notifier+0xa4/0x280 net/xdp/xsk.c:1645
notifier_call_chain+0xbc/0x410 kernel/notifier.c:85
call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:2230
call_netdevice_notifiers_extack net/core/dev.c:2268 [inline]
call_netdevice_notifiers net/core/dev.c:2282 [inline]
unregister_netdevice_many_notify+0xf9d/0x2700 net/core/dev.c:12077
unregister_netdevice_many net/core/dev.c:12140 [inline]
unregister_netdevice_queue+0x305/0x3f0 net/core/dev.c:11984
register_netdevice+0x18f1/0x2270 net/core/dev.c:11149
lapbeth_new_device drivers/net/wan/lapbether.c:420 [inline]
lapbeth_device_event+0x5b1/0xbe0 drivers/net/wan/lapbether.c:462
notifier_call_chain+0xbc/0x410 kernel/notifier.c:85
call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:2230
call_netdevice_notifiers_extack net/core/dev.c:2268 [inline]
call_netdevice_notifiers net/core/dev.c:2282 [inline]
__dev_notify_flags+0x12c/0x2e0 net/core/dev.c:9497
netif_change_flags+0x108/0x160 net/core/dev.c:9526
dev_change_flags+0xba/0x250 net/core/dev_api.c:68
devinet_ioctl+0x11d5/0x1f50 net/ipv4/devinet.c:1200
inet_ioctl+0x3a7/0x3f0 net/ipv4/af_inet.c:1001
0: https://lore.kernel.org/netdev/20250625140357.6203d0af@kernel.org/
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+e67ea9c235b13b4f0020@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=e67ea9c235b13b4f0020
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250806213726.1383379-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ksz8873_valid_regs[] was added for register access for KSZ8863/KSZ8873
switches, but the reset register is not in the list so
ksz8_reset_switch() does not take any effect.
Replace regmap_update_bits() using ksz_regmap_8 with ksz_rmw8() so that
an error message will be given if the register is not defined.
A side effect of not resetting the switch is the static MAC table is not
cleared. Further additions to the table will show write error as there
are only 8 entries in the table.
Fixes: d0dec3333040 ("net: dsa: microchip: Add register access control for KSZ8873 chip")
Signed-off-by: Tristram Ha <tristram.ha@microchip.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250807005453.8306-1-Tristram.Ha@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A cloned head skb still shares these frag skbs in fraglist with the
original head skb. It's not safe to access these frag skbs.
syzbot reported two use-of-uninitialized-memory bugs caused by this:
BUG: KMSAN: uninit-value in sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
sctp_inq_pop+0x15b7/0x1920 net/sctp/inqueue.c:211
sctp_assoc_bh_rcv+0x1a7/0xc50 net/sctp/associola.c:998
sctp_inq_push+0x2ef/0x380 net/sctp/inqueue.c:88
sctp_backlog_rcv+0x397/0xdb0 net/sctp/input.c:331
sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1122
__release_sock+0x1da/0x330 net/core/sock.c:3106
release_sock+0x6b/0x250 net/core/sock.c:3660
sctp_wait_for_connect+0x487/0x820 net/sctp/socket.c:9360
sctp_sendmsg_to_asoc+0x1ec1/0x1f00 net/sctp/socket.c:1885
sctp_sendmsg+0x32b9/0x4a80 net/sctp/socket.c:2031
inet_sendmsg+0x25a/0x280 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:718 [inline]
and
BUG: KMSAN: uninit-value in sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
sctp_assoc_bh_rcv+0x34e/0xbc0 net/sctp/associola.c:987
sctp_inq_push+0x2a3/0x350 net/sctp/inqueue.c:88
sctp_backlog_rcv+0x3c7/0xda0 net/sctp/input.c:331
sk_backlog_rcv+0x142/0x420 include/net/sock.h:1148
__release_sock+0x1d3/0x330 net/core/sock.c:3213
release_sock+0x6b/0x270 net/core/sock.c:3767
sctp_wait_for_connect+0x458/0x820 net/sctp/socket.c:9367
sctp_sendmsg_to_asoc+0x223a/0x2260 net/sctp/socket.c:1886
sctp_sendmsg+0x3910/0x49f0 net/sctp/socket.c:2032
inet_sendmsg+0x269/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:712 [inline]
This patch fixes it by linearizing cloned gso packets in sctp_rcv().
Fixes: 90017accff61 ("sctp: Add GSO support")
Reported-by: syzbot+773e51afe420baaf0e2b@syzkaller.appspotmail.com
Reported-by: syzbot+70a42f45e76bede082be@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/dd7dc337b99876d4132d0961f776913719f7d225.1754595611.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similar to delta_cpu(), delta_platform() is called in turbostat main
loop. This ensures accurate SysWatt readings in periodic monitoring mode
$ sudo turbostat -S -q --show power -i 1
CoreTmp PkgTmp PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_% SysWatt
60 61 6.21 1.13 0.16 0.00 0.00 0.00 13.07
58 61 6.00 1.07 0.18 0.00 0.00 0.00 12.75
58 61 5.74 1.05 0.17 0.00 0.00 0.00 12.22
58 60 6.27 1.11 0.24 0.00 0.00 0.00 13.55
However, delta_platform() is missing for forked program and causes bogus
SysWatt reporting,
$ sudo turbostat -S -q --show power sleep 1
1.004736 sec
CoreTmp PkgTmp PkgWatt CorWatt GFXWatt RAMWatt PKG_% RAM_% SysWatt
57 58 6.05 1.02 0.16 0.00 0.00 0.00 0.03
Add missing delta_platform() for forked program.
Fixes: e5f687b89bc2 ("tools/power turbostat: Add RAPL psys as a built-in counter")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Kernels configured with CONFIG_MULTIUSER=n have no cap_get_proc().
Check for ENOSYS to recognize this case, and continue on to
attempt to access the requested MSRs (such as temperature).
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
turbostat.c: In function 'parse_int_file':
turbostat.c:5567:19: error: 'PATH_MAX' undeclared (first use in this function)
5567 | char path[PATH_MAX];
| ^~~~~~~~
turbostat.c: In function 'probe_graphics':
turbostat.c:6787:19: error: 'PATH_MAX' undeclared (first use in this function)
6787 | char path[PATH_MAX];
| ^~~~~~~~
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
$ sudo turbostat --quiet --show junk
turbostat: Counter 'junk' can not be added.
Previously, invalid arguments to --show and --hide were silently ignored
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
It is possible for a vsock to autobind to VMADDR_PORT_ANY. This can
cause a use-after-free when a connection is made to the bound socket.
The socket returned by accept() also has port VMADDR_PORT_ANY but is not
on the list of unbound sockets. Binding it will result in an extra
refcount decrement similar to the one fixed in fcdd2242c023 (vsock: Keep
the binding until socket destruction).
Modify the check in __vsock_bind_connectible() to also prevent binding
to VMADDR_PORT_ANY.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250807041811.678-1-markovicbudimir@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The variable ret in icss_iep_extts_enable() was incorrectly declared
as u32, while the function returns int and may return negative error
codes. This will cause sign extension issues and incorrect error
propagation. Update ret to be int to fix error handling.
This change corrects the declaration to avoid potential type mismatch.
Fixes: c1e0230eeaab ("net: ti: icss-iep: Add IEP driver")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250805142323.1949406-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Page pool can have pages "directly" (locklessly) recycled to it,
if the NAPI that owns the page pool is scheduled to run on the same CPU.
To make this safe we check that the NAPI is disabled while we destroy
the page pool. In most cases NAPI and page pool lifetimes are tied
together so this happens naturally.
The queue API expects the following order of calls:
-> mem_alloc
alloc new pp
-> stop
napi_disable
-> start
napi_enable
-> mem_free
free old pp
Here we allocate the page pool in ->mem_alloc and free in ->mem_free.
But the NAPIs are only stopped between ->stop and ->start. We created
page_pool_disable_direct_recycling() to safely shut down the recycling
in ->stop. This way the page_pool_destroy() call in ->mem_free doesn't
have to worry about recycling any more.
Unfortunately, the page_pool_disable_direct_recycling() is not enough
to deal with failures which necessitate freeing the _new_ page pool.
If we hit a failure in ->mem_alloc or ->stop the new page pool has
to be freed while the NAPI is active (assuming driver attaches the
page pool to an existing NAPI instance and doesn't reallocate NAPIs).
Freeing the new page pool is technically safe because it hasn't been
used for any packets, yet, so there can be no recycling. But the check
in napi_assert_will_not_race() has no way of knowing that. We could
check if page pool is empty but that'd make the check much less likely
to trigger during development.
Add page_pool_enable_direct_recycling(), pairing with
page_pool_disable_direct_recycling(). It will allow us to create the new
page pools in "disabled" state and only enable recycling when we know
the reconfig operation will not fail.
Coincidentally it will also let us re-enable the recycling for the old
pool, if the reconfig failed:
-> mem_alloc (new)
-> stop (old)
# disables direct recycling for old
-> start (new)
# fail!!
-> start (old)
# go back to old pp but direct recycling is lost :(
-> mem_free (new)
The new helper is idempotent to make the life easier for drivers,
which can operate in HDS mode and support zero-copy Rx.
The driver can call the helper twice whether there are two pools
or it has multiple references to a single pool.
Fixes: 40eca00ae605 ("bnxt_en: unlink page pool when stopping Rx queue")
Tested-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250805003654.2944974-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When link settings are changed emac->speed is populated by
emac_adjust_link(). The link speed and other settings are then written into
the DRAM. However if both ports are brought down after this and brought up
again or if the operating mode is changed and a firmware reload is needed,
the DRAM is cleared by icssg_config(). As a result the link settings are
lost.
Fix this by calling emac_adjust_link() after icssg_config(). This re
populates the settings in the DRAM after a new firmware load.
Fixes: 9facce84f406 ("net: ti: icssg-prueth: Fix firmware load sequence.")
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Message-ID: <20250805173812.2183161-1-danishanwar@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jijie Shao says:
====================
There are some bugfix for hibmcge ethernet driver
This patch set is intended to fix several issues for hibmcge driver:
1. Holding the rtnl_lock in pci_error_handlers->reset_prepare()
may lead to a deadlock issue.
2. A division by zero issue caused by debugfs when the port is down.
3. A probabilistic false positive issue with np_link_fail.
v2: https://lore.kernel.org/20250805181446.3deaceb9@kernel.org
v1: https://lore.kernel.org/20250731134749.4090041-1-shaojijie@huawei.com
====================
Link: https://patch.msgid.link/20250806102758.3632674-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, after modifying device port mode, the np_link_ok state
is immediately checked. At this point, the device may not yet ready,
leading to the querying of an intermediate state.
This patch will poll to check if np_link is ok after
modifying device port mode, and only report np_link_fail upon timeout.
Fixes: e0306637e85d ("net: hibmcge: Add support for mac link exception handling feature")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the network port is down, the queue is released, and ring->len is 0.
In debugfs, hbg_get_queue_used_num() will be called,
which may lead to a division by zero issue.
This patch adds a check, if ring->len is 0,
hbg_get_queue_used_num() directly returns 0.
Fixes: 40735e7543f9 ("net: hibmcge: Implement .ndo_start_xmit function")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, the hibmcge netdev acquires the rtnl_lock in
pci_error_handlers.reset_prepare() and releases it in
pci_error_handlers.reset_done().
However, in the PCI framework:
pci_reset_bus - __pci_reset_slot - pci_slot_save_and_disable_locked -
pci_dev_save_and_disable - err_handler->reset_prepare(dev);
In pci_slot_save_and_disable_locked():
list_for_each_entry(dev, &slot->bus->devices, bus_list) {
if (!dev->slot || dev->slot!= slot)
continue;
pci_dev_save_and_disable(dev);
if (dev->subordinate)
pci_bus_save_and_disable_locked(dev->subordinate);
}
This will iterate through all devices under the current bus and execute
err_handler->reset_prepare(), causing two devices of the hibmcge driver
to sequentially request the rtnl_lock, leading to a deadlock.
Since the driver now executes netif_device_detach()
before the reset process, it will not concurrently with
other netdev APIs, so there is no need to hold the rtnl_lock now.
Therefore, this patch removes the rtnl_lock during the reset process and
adjusts the position of HBG_NIC_STATE_RESETTING to ensure
that multiple resets are not executed concurrently.
Fixes: 3f5a61f6d504f ("net: hibmcge: Add reset supported in this module")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Reinstantiate Florian Westphal as a Netfilter maintainer.
2) Depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY,
from Arnd Bergmann.
3) Use id to annotate last conntrack/expectation visited to resume
netlink dump, patches from Florian Westphal.
4) Fix bogus element in nft_pipapo avx2 lookup, introduced in
the last nf-next batch of updates, also from Florian.
5) Return 0 instead of recycling ret variable in
nf_conntrack_log_invalid_sysctl(), introduced in the last
nf-next batch of updates, from Dan Carpenter.
6) Fix WARN_ON_ONCE triggered by syzbot with larger cgroup level
in nft_socket.
* tag 'nf-25-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_socket: remove WARN_ON_ONCE with huge level value
netfilter: conntrack: clean up returns in nf_conntrack_log_invalid_sysctl()
netfilter: nft_set_pipapo: don't return bogus extension pointer
netfilter: ctnetlink: remove refcounting in expectation dumpers
netfilter: ctnetlink: fix refcount leak on table dump
netfilter: add back NETFILTER_XTABLES dependencies
MAINTAINERS: resurrect my netfilter maintainer entry
====================
Link: https://patch.msgid.link/20250807112948.1400523-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the allocated size exceeds UINT_MAX, then it's necessary to cast
the mr->nr_pages value to size_t to prevent it from overflowing. In
practice this isn't much of a concern as the required memory size will
have been validated upfront, and accounted to the user. And > 4GB sizes
will be necessary to make the lack of a cast a problem, which greatly
exceeds normal user locked_vm settings that are generally in the kb to
mb range. However, if root is used, then accounting isn't done, and
then it's possible to hit this issue.
Link: https://lore.kernel.org/all/6895b298.050a0220.7f033.0059.GAE@google.com/
Cc: stable@vger.kernel.org
Reported-by: syzbot+23727438116feb13df15@syzkaller.appspotmail.com
Fixes: 087f997870a9 ("io_uring/memmap: implement mmap for regions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sabrina Dubroca says:
====================
This series fixes a few issues with GSO. Some recent patches made the
incorrect assumption that GSO is only used by offload. The first two
patches in this series restore the old behavior.
The final patch is in the UDP GSO code, but fixes an issue with IPsec
that is currently masked by the lack of GSO for SW crypto. With GSO,
VXLAN over IPsec doesn't get checksummed.
====================
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Define a new, optional, callback that allows the driver to
specify how the return data buffer is allocated. If that callback
is set, mailbox/pcc.c is now responsible for reading from and
writing to the PCC shared buffer.
This also allows for proper checks of the Commnand complete flag
between the PCC sender and receiver.
For Type 4 channels, initialize the command complete flag prior
to accepting messages.
Since the mailbox does not know what memory allocation scheme
to use for response messages, the client now has an optional
callback that allows it to allocate the buffer for a response
message.
When an outbound message is written to the buffer, the mailbox
checks for the flag indicating the client wants an tx complete
notification via IRQ. Upon receipt of the interrupt It will
pair it with the outgoing message. The expected use is to
free the kernel memory buffer for the previous outgoing message.
Signed-off-by: Adam Young <admiyo@os.amperecomputing.com>
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
Previous releases - regressions:
- netlink: avoid infinite retry looping in netlink_unicast()
Previous releases - always broken:
- packet: fix a race in packet_set_ring() and packet_notifier()
- ipv6: reject malicious packets in ipv6_gso_segment()
- sched: mqprio: fix stack out-of-bounds write in tc entry parsing
- net: drop UFO packets (injected via virtio) in udp_rcv_segment()
- eth: mlx5: correctly set gso_segs when LRO is used, avoid false
positive checksum validation errors
- netpoll: prevent hanging NAPI when netcons gets enabled
- phy: mscc: fix parsing of unicast frames for PTP timestamping
- a number of device tree / OF reference leak fixes"
* tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
pptp: fix pptp_xmit() error path
net: ti: icssg-prueth: Fix skb handling for XDP_PASS
net: Update threaded state in napi config in netif_set_threaded
selftests: netdevsim: Xfail nexthop test on slow machines
eth: fbnic: Lock the tx_dropped update
eth: fbnic: Fix tx_dropped reporting
eth: fbnic: remove the debugging trick of super high page bias
net: ftgmac100: fix potential NULL pointer access in ftgmac100_phy_disconnect
dt-bindings: net: Replace bouncing Alexandru Tachici emails
dpll: zl3073x: ZL3073X_I2C and ZL3073X_SPI should depend on NET
net/sched: mqprio: fix stack out-of-bounds write in tc entry parsing
Revert "net: mdio_bus: Use devm for getting reset GPIO"
selftests: net: packetdrill: xfail all problems on slow machines
net/packet: fix a race in packet_set_ring() and packet_notifier()
benet: fix BUG when creating VFs
net: airoha: npu: Add missing MODULE_FIRMWARE macros
net: devmem: fix DMA direction on unmapping
ipa: fix compile-testing with qcom-mdt=m
eth: fbnic: unlink NAPIs from queues on error to open
net: Add locking to protect skb->dev access in ip_output
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Support MMIO read/write tracing
- Enable THP swapping and THP migration
- Unmask SLCF bit ("stateless command filtering") introduced with CEX8
cards, so that user space applications like lszcrypt could evaluate
and list this feature
- Fix the value of high_memory variable, so it considers possible
tailing offline memory blocks
- Make vmem_pte_alloc() consistent and always allocate memory of
PAGE_SIZE for page tables. This ensures a page table occupies the
whole page, as the rest of the code assumes
- Fix kernel image end address in the decompressor debug output
- Fix a typo in debug_sprintf_format_fn() comment
* tag 's390-6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/debug: Fix typo in debug_sprintf_format_fn() comment
s390/boot: Fix startup debugging log
s390/mm: Allocate page table with PAGE_SIZE granularity
s390/mm: Enable THP_SWAP and THP_MIGRATION
s390: Support CONFIG_TRACE_MMIO_ACCESS
s390/mm: Set high_memory at the end of the identity mapping
s390/ap: Unmask SLCF bit in card and queue ap functions sysfs
|
|
Pull vhost fix from Michael Tsirkin:
"A single fix for a regression in vhost"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: initialize vq->nheads properly
|
|
Pull drm fixes from Dave Airlie:
"This is the fixes that built up in the merge window, mostly amdgpu and
xe with one i915 display fix, seems like things are pretty good for
rc1.
i915:
- DP LPFS fixes
xe:
- SRIOV: PF fixes and removal of need of module param
- Fix driver unbind around Devcoredump
- Mark xe driver as BROKEN if kernel page size is not 4kB
amdgpu:
- GC 9.5.0 fixes
- SMU fix
- DCE 6 DC fixes
- mmhub client ID fixes
- VRR fix
- Backlight fix
- UserQ fix
- Legacy reset fix
- Misc fixes
amdkfd:
- CRIU fix
- Debugfs fix"
* tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
drm/amdgpu: add missing vram lost check for LEGACY RESET
drm/amdgpu/discovery: fix fw based ip discovery
drm/amdkfd: Destroy KFD debugfs after destroy KFD wq
amdgpu/amdgpu_discovery: increase timeout limit for IFWI init
drm/amdgpu: Update SDMA firmware version check for user queue support
drm/amdgpu: Add NULL check for asic_funcs
drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value"
drm/amd/display: fix a Null pointer dereference vulnerability
drm/amd/display: Add primary plane to commits for correct VRR handling
drm/amdgpu: update mmhub 3.3 client id mappings
drm/amdgpu: update mmhub 3.0.1 client id mappings
drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming.
drm/amd/display: Don't overwrite dce60_clk_mgr
drm/amdkfd: Fix checkpoint-restore on multi-xcc
drm/amd: Restore cached manual clock settings during resume
drm/amd: Restore cached power limit during resume
drm/amdgpu: Update external revid for GC v9.5.0
drm/amdgpu: Update supported modes for GC v9.5.0
Mark xe driver as BROKEN if kernel page size is not 4kB
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes for 6.17-rc1:
- Revert a patch which broke VGA console
- Fix an out-of-bounds access bug which may happen during console
resizing when a console is mapped to a frame buffer
* tag 'fbdev-for-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
fbdev: Fix vmalloc out-of-bounds write in fast_imageblit
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Complete KSave registers definition
- Support the mem=<size> kernel parameter
- Support BPF dynamic modification & trampoline
- Add MMC/SDIO controller nodes in dts
- Some bug fixes and other small changes
* tag 'loongarch-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: vDSO: Remove -nostdlib complier flag
LoongArch: dts: Add eMMC/SDIO controller support to Loongson-2K2000
LoongArch: dts: Add SDIO controller support to Loongson-2K1000
LoongArch: dts: Add SDIO controller support to Loongson-2K0500
LoongArch: BPF: Set bpf_jit_bypass_spec_v1/v4()
LoongArch: BPF: Fix the tailcall hierarchy
LoongArch: BPF: Fix jump offset calculation in tailcall
LoongArch: BPF: Add struct ops support for trampoline
LoongArch: BPF: Add basic bpf trampoline support
LoongArch: BPF: Add dynamic code modification support
LoongArch: BPF: Rename and refactor validate_code()
LoongArch: Add larch_insn_gen_{beq,bne} helpers
LoongArch: Don't use %pK through printk() in unwinder
LoongArch: Avoid in-place string operation on FDT content
LoongArch: Support mem=<size> kernel parameter
LoongArch: Make relocate_new_kernel_size be a .quad value
LoongArch: Complete KSave registers definition
|
|
In ksmbd_extract_shortname(), strscpy() is incorrectly called with the
length of the source string (excluding the NUL terminator) rather than
the size of the destination buffer. This results in "__" being copied
to 'extension' rather than "___" (two underscores instead of three).
Use the destination buffer size instead to ensure that the string "___"
(three underscores) is copied correctly.
Cc: stable@vger.kernel.org
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Repeated connections from clients with the same IP address may exhaust
the max connections and prevent other normal client connections.
This patch limit repeated connections from clients with the same IP.
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-fixes-6.17-2025-08-07:
amdgpu:
- GC 9.5.0 fixes
- SMU fix
- DCE 6 DC fixes
- mmhub client ID fixes
- VRR fix
- Backlight fix
- UserQ fix
- Legacy reset fix
- Misc fixes
amdkfd:
- CRIU fix
- Debugfs fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250807132030.1168068-1-alexander.deucher@amd.com
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
- SRIOV: PF fixes and removal of need of module param (Michal)
- Fix driver unbind around Devcoredump (Bala)
- Mark xe driver as BROKEN if kernel page size is not 4kB (Simon)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/aJNXnIAp2Cq-2pZj@intel.com
|
|
There's no need for separate conn_wait and disconn_wait queues.
This will simplify the move to common code, the server code
already a single wait_queue for this.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
_smbd_get_connection
It is already called long before we may hit this cleanup code path.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This matches the timeout for tcp connections.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
vmd_msi_alloc() allocates struct vmd_irq and stashes it into
irq_data->chip_data associated with the VMD's interrupt domain.
vmd_msi_free() extracts the pointer by calling irq_get_chip_data() and
frees it.
irq_get_chip_data() returns the chip_data associated with the top interrupt
domain. This worked in the past because VMD's interrupt domain was the top
domain.
But d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
changed the interrupt domain hierarchy so VMD's interrupt domain is not the
top domain anymore. irq_get_chip_data() now returns the chip_data at the
MSI devices' interrupt domains. It is therefore broken for vmd_msi_free()
to kfree() this chip_data.
Fix by extracting the chip_data associated with the VMD's interrupt domain.
Fixes: d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
Reported-by: Kenneth Crudup <kenny@panix.com>
Closes: https://lore.kernel.org/linux-pci/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/
Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Closes: https://lore.kernel.org/linux-pci/ed53280ed15d1140700b96cca2734bf327ee92539e5eb68e80f5bbbf0f01@linux.gnuweeb.org/
Tested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Tested-by: Kenneth Crudup <kenny@panix.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20250807081051.2253962-1-namcao@linutronix.de
|
|
Ilya Leoshkevich says:
====================
perf/s390: Regression: Move uid filtering to BPF filters
v4: https://lore.kernel.org/bpf/20250806114227.14617-1-iii@linux.ibm.com/
v4 -> v5: Fix a typo in the commit message (Yonghong).
v3: https://lore.kernel.org/bpf/20250805130346.1225535-1-iii@linux.ibm.com/
v3 -> v4: Rename the new field to dont_enable (Alexei, Eduard).
Switch the Fixes: tag in patch 2 (Alexander, Thomas).
Fix typos in the cover letter (Thomas).
v2: https://lore.kernel.org/bpf/20250728144340.711196-1-tmricht@linux.ibm.com/
v2 -> v3: Use no_ioctl_enable in perf.
v1: https://lore.kernel.org/bpf/20250725093405.3629253-1-tmricht@linux.ibm.com/
v1 -> v2: Introduce no_ioctl_enable (Jiri).
Hi,
This series fixes a regression caused by moving UID filtering to BPF.
The regression affects all events that support auxiliary data, most
notably, "cycles" events on s390, but also PT events on Intel. The
symptom is missing events when UID filtering is enabled.
Patch 1 introduces a new option for the
bpf_program__attach_perf_event_opts() function.
Patch 2 makes use of it in perf, and also contains a lot of technical
details of why exactly the problem is occurring.
Thanks to Thomas Richter for the investigation and the initial version
of this fix, and to Jiri Olsa for suggestions.
Best regards,
Ilya
====================
Link: https://patch.msgid.link/20250806162417.19666-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
On s390, and, in general, on all platforms where the respective event
supports auxiliary data gathering, the command:
# ./perf record -u 0 -aB --synth=no -- ./perf test -w thloop
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data ]
# ./perf report --stats | grep SAMPLE
#
does not generate samples in the perf.data file. On x86 the command:
# sudo perf record -e intel_pt// -u 0 ls
is broken too.
Looking at the sequence of calls in 'perf record' reveals this
behavior:
1. The event 'cycles' is created and enabled:
record__open()
+-> evlist__apply_filters()
+-> perf_bpf_filter__prepare()
+-> bpf_program.attach_perf_event()
+-> bpf_program.attach_perf_event_opts()
+-> __GI___ioctl(..., PERF_EVENT_IOC_ENABLE, ...)
The event 'cycles' is enabled and active now. However the event's
ring-buffer to store the samples generated by hardware is not
allocated yet.
2. The event's fd is mmap()ed to create the ring buffer:
record__open()
+-> record__mmap()
+-> record__mmap_evlist()
+-> evlist__mmap_ex()
+-> perf_evlist__mmap_ops()
+-> mmap_per_cpu()
+-> mmap_per_evsel()
+-> mmap__mmap()
+-> perf_mmap__mmap()
+-> mmap()
This allocates the ring buffer for the event 'cycles'. With mmap()
the kernel creates the ring buffer:
perf_mmap(): kernel function to create the event's ring
| buffer to save the sampled data.
|
+-> ring_buffer_attach(): Allocates memory for ring buffer.
| The PMU has auxiliary data setup function. The
| has_aux(event) condition is true and the PMU's
| stop() is called to stop sampling. It is not
| restarted:
|
| if (has_aux(event))
| perf_event_stop(event, 0);
|
+-> cpumsf_pmu_stop():
Hardware sampling is stopped. No samples are generated and saved
anymore.
3. After the event 'cycles' has been mapped, the event is enabled a
second time in:
__cmd_record()
+-> evlist__enable()
+-> __evlist__enable()
+-> evsel__enable_cpu()
+-> perf_evsel__enable_cpu()
+-> perf_evsel__run_ioctl()
+-> perf_evsel__ioctl()
+-> __GI___ioctl(., PERF_EVENT_IOC_ENABLE, .)
The second
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
is just a NOP in this case. The first invocation in (1.) sets the
event::state to PERF_EVENT_STATE_ACTIVE. The kernel functions
perf_ioctl()
+-> _perf_ioctl()
+-> _perf_event_enable()
+-> __perf_event_enable()
return immediately because event::state is already set to
PERF_EVENT_STATE_ACTIVE.
This happens on s390, because the event 'cycles' offers the possibility
to save auxilary data. The PMU callbacks setup_aux() and free_aux() are
defined. Without both callback functions, cpumsf_pmu_stop() is not
invoked and sampling continues.
To remedy this, remove the first invocation of
ioctl(..., PERF_EVENT_IOC_ENABLE, ...).
in step (1.) Create the event in step (1.) and enable it in step (3.)
after the ring buffer has been mapped.
Output after:
# ./perf record -aB --synth=no -u 0 -- ./perf test -w thloop 2
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.876 MB perf.data ]
# ./perf report --stats | grep SAMPLE
SAMPLE events: 16200 (99.5%)
SAMPLE events: 16200
#
The software event succeeded both before and after the patch:
# ./perf record -e cpu-clock -aB --synth=no -u 0 -- \
./perf test -w thloop 2
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 2.870 MB perf.data ]
# ./perf report --stats | grep SAMPLE
SAMPLE events: 53506 (99.8%)
SAMPLE events: 53506
#
Fixes: b4c658d4d63d61 ("perf target: Remove uid from target")
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250806162417.19666-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Automatically enabling a perf event after attaching a BPF prog to it is
not always desirable.
Add a new "dont_enable" field to struct bpf_perf_event_opts. While
introducing "enable" instead would be nicer in that it would avoid
a double negation in the implementation, it would make
DECLARE_LIBBPF_OPTS() less efficient.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Co-developed-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250806162417.19666-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
__qgroup_excl_accounting() uses the qgroup iterator machinery to
update the account of one qgroups usage for all its parent hierarchy,
when we either add or remove a relation and have only exclusive usage.
However, there is a small bug there: we loop with an extra iteration
temporary qgroup called `cur` but never actually refer to that in the
body of the loop. As a result, we redundantly account the same usage to
the first qgroup in the list.
This can be reproduced in the following way:
mkfs.btrfs -f -O squota <dev>
mount <dev> <mnt>
btrfs subvol create <mnt>/sv
dd if=/dev/zero of=<mnt>/sv/f bs=1M count=1
sync
btrfs qgroup create 1/100 <mnt>
btrfs qgroup create 2/200 <mnt>
btrfs qgroup assign 1/100 2/200 <mnt>
btrfs qgroup assign 0/256 1/100 <mnt>
btrfs qgroup show <mnt>
and the broken result is (note the 2MiB on 1/100 and 0Mib on 2/100):
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
1/100 2.03MiB 2.03MiB 2/100<1 member qgroup>
2/100 0.00B 0.00B <0 member qgroups>
With this fix, which simply re-uses `qgroup` as the iteration variable,
we see the expected result:
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
1/100 1.02MiB 1.02MiB 2/100<1 member qgroup>
2/100 1.02MiB 1.02MiB <0 member qgroups>
The existing fstests did not exercise two layer inheritance so this bug
was missed. I intend to add that testing there, as well.
Fixes: a0bdc04b0732 ("btrfs: qgroup: use qgroup_iterator in __qgroup_excl_accounting()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We call btrfs_zone_finish_one_bg() to zone finish one block group and make
room to activate another block group. Currently, we can choose a metadata
block group as a target. But, as we reserve an active metadata block group,
we no longer want to select a metadata block group. So, skip it in the
loop.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is an internal report that balance triggered transaction abort,
with the following call trace:
item 85 key (594509824 169 0) itemoff 12599 itemsize 33
extent refs 1 gen 197740 flags 2
ref#0: tree block backref root 7
item 86 key (594558976 169 0) itemoff 12566 itemsize 33
extent refs 1 gen 197522 flags 2
ref#0: tree block backref root 7
...
BTRFS error (device loop0): extent item not found for insert, bytenr 594526208 num_bytes 16384 parent 449921024 root_objectid 934 owner 1 offset 0
BTRFS error (device loop0): failed to run delayed ref for logical 594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117
------------[ cut here ]------------
BTRFS: Transaction aborted (error -117)
WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs]
And btrfs check doesn't report anything wrong related to the extent
tree.
[CAUSE]
The cause is a little complex, firstly the extent tree indeed doesn't
have the backref for 594526208.
The extent tree only have the following two backrefs around that bytenr
on-disk:
item 65 key (594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33
refs 1 gen 197740 flags TREE_BLOCK
tree block skinny level 0
(176 0x7) tree block backref root CSUM_TREE
item 66 key (594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33
refs 1 gen 197522 flags TREE_BLOCK
tree block skinny level 0
(176 0x7) tree block backref root CSUM_TREE
But the such missing backref item is not an corruption on disk, as the
offending delayed ref belongs to subvolume 934, and that subvolume is
being dropped:
item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439
generation 198229 root_dirid 256 bytenr 10741039104 byte_limit 0 bytes_used 345571328
last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0
drop_progress key (206324 EXTENT_DATA 2711650304) drop_level 2
level 2 generation_v2 198229
And that offending tree block 594526208 is inside the dropped range of
that subvolume. That explains why there is no backref item for that
bytenr and why btrfs check is not reporting anything wrong.
But this also shows another problem, as btrfs will do all the orphan
subvolume cleanup at a read-write mount.
So half-dropped subvolume should not exist after an RW mount, and
balance itself is also exclusive to subvolume cleanup, meaning we
shouldn't hit a subvolume half-dropped during relocation.
The root cause is, there is no orphan item for this subvolume.
In fact there are 5 subvolumes from around 2021 that have the same
problem.
It looks like the original report has some older kernels running, and
caused those zombie subvolumes.
Thankfully upstream commit 8d488a8c7ba2 ("btrfs: fix subvolume/snapshot
deletion not triggered on mount") has long fixed the bug.
[ENHANCEMENT]
For repairing such old fs, btrfs-progs will be enhanced.
Considering how delayed the problem will show up (at run delayed ref
time) and at that time we have to abort transaction already, it is too
late.
Instead here we reject any half-dropped subvolume for reloc tree at the
earliest time, preventing confusion and extra time wasted on debugging
similar bugs.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we only log an error message if we can't find the block group
for a log tree extent buffer when unaccounting it (while freeing a log
tree). A missing block group means something is seriously wrong and we
end up leaking space from the metadata space info. So return -ENOENT in
case we don't find the block group.
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inside nocow_one_range(), if the checksum cloning for data reloc inode
failed, we call btrfs_cleanup_ordered_extents() to cleanup the just
allocated ordered extents.
But unlike extent_clear_unlock_delalloc(),
btrfs_cleanup_ordered_extents() requires a length, not an inclusive end
bytenr.
This can be problematic, as the @end is normally way larger than @len.
This means btrfs_cleanup_ordered_extents() can be called on folios
out of the correct range, and if the out-of-range folio is under
writeback, we can incorrectly clear the ordered flag of the folio, and
trigger the DEBUG_WARN() inside btrfs_writepage_cow_fixup().
Fix the wrong parameter with correct length instead.
Fixes: 94f6c5c17e52 ("btrfs: move ordered extent cleanup to where they are allocated")
CC: stable@vger.kernel.org # 6.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When hitting a large folio, btrfs_cleanup_ordered_extents() will get the
same large folio multiple times, and clearing the same range again and
again.
Thankfully this is not causing anything wrong, just inefficiency.
This is caused by the fact that we're iterating folios using the old
page index, thus can hit the same large folio again and again.
Enhance it by increasing @index to the index of the folio end, and only
increase @index by 1 if we failed to grab a folio.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is a potential deadlock that can happen in
try_release_subpage_extent_buffer() because the irq-safe xarray spin
lock fs_info->buffer_tree is being acquired before the irq-unsafe
eb->refs_lock.
This leads to the potential race:
// T1 (random eb->refs user) // T2 (release folio)
spin_lock(&eb->refs_lock);
// interrupt
end_bbio_meta_write()
btrfs_meta_folio_clear_writeback()
btree_release_folio()
folio_test_writeback() //false
try_release_extent_buffer()
try_release_subpage_extent_buffer()
xa_lock_irq(&fs_info->buffer_tree)
spin_lock(&eb->refs_lock); // blocked; held by T1
buffer_tree_clear_mark()
xas_lock_irqsave() // blocked; held by T2
I believe that the spin lock can safely be replaced by an rcu_read_lock.
The xa_for_each loop does not need the spin lock as it's already
internally protected by the rcu_read_lock. The extent buffer is also
protected by the rcu_read_lock so it won't be freed before we take the
eb->refs_lock and check the ref count.
The rcu_read_lock is taken and released every iteration, just like the
spin lock, which means we're not protected against concurrent
insertions into the xarray. This is fine because we rely on
folio->private to detect if there are any ebs remaining in the folio.
There is already some precedent for this with find_extent_buffer_nolock,
which loads an extent buffer from the xarray with only rcu_read_lock.
lockdep warning:
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E N
-----------------------------------------------------
kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560
and this task is already holding:
ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
which would create a new lock dependency:
(&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&buffer_xa_class){-.-.}-{3:3}
... which became HARDIRQ-irq-safe at:
lock_acquire+0x178/0x358
_raw_spin_lock_irqsave+0x60/0x88
buffer_tree_clear_mark+0xc4/0x160
end_bbio_meta_write+0x238/0x398
btrfs_bio_end_io+0x1f8/0x330
btrfs_orig_write_end_io+0x1c4/0x2c0
bio_endio+0x63c/0x678
blk_update_request+0x1c4/0xa00
blk_mq_end_request+0x54/0x88
virtblk_request_done+0x124/0x1d0
blk_mq_complete_request+0x84/0xa0
virtblk_done+0x130/0x238
vring_interrupt+0x130/0x288
__handle_irq_event_percpu+0x1e8/0x708
handle_irq_event+0x98/0x1b0
handle_fasteoi_irq+0x264/0x7c0
generic_handle_domain_irq+0xa4/0x108
gic_handle_irq+0x7c/0x1a0
do_interrupt_handler+0xe4/0x148
el1_interrupt+0x30/0x50
el1h_64_irq_handler+0x14/0x20
el1h_64_irq+0x6c/0x70
_raw_spin_unlock_irq+0x38/0x70
__run_timer_base+0xdc/0x5e0
run_timer_softirq+0xa0/0x138
handle_softirqs.llvm.13542289750107964195+0x32c/0xbd0
____do_softirq.llvm.17674514681856217165+0x18/0x28
call_on_irq_stack+0x24/0x30
__irq_exit_rcu+0x164/0x430
irq_exit_rcu+0x18/0x88
el1_interrupt+0x34/0x50
el1h_64_irq_handler+0x14/0x20
el1h_64_irq+0x6c/0x70
arch_local_irq_enable+0x4/0x8
do_idle+0x1a0/0x3b8
cpu_startup_entry+0x60/0x80
rest_init+0x204/0x228
start_kernel+0x394/0x3f0
__primary_switched+0x8c/0x8958
to a HARDIRQ-irq-unsafe lock:
(&eb->refs_lock){+.+.}-{3:3}
... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0x178/0x358
_raw_spin_lock+0x4c/0x68
free_extent_buffer_stale+0x2c/0x170
btrfs_read_sys_array+0x1b0/0x338
open_ctree+0xeb0/0x1df8
btrfs_get_tree+0xb60/0x1110
vfs_get_tree+0x8c/0x250
fc_mount+0x20/0x98
btrfs_get_tree+0x4a4/0x1110
vfs_get_tree+0x8c/0x250
do_new_mount+0x1e0/0x6c0
path_mount+0x4ec/0xa58
__arm64_sys_mount+0x370/0x490
invoke_syscall+0x6c/0x208
el0_svc_common+0x14c/0x1b8
do_el0_svc+0x4c/0x60
el0_svc+0x4c/0x160
el0t_64_sync_handler+0x70/0x100
el0t_64_sync+0x168/0x170
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&eb->refs_lock);
local_irq_disable();
lock(&buffer_xa_class);
lock(&eb->refs_lock);
<Interrupt>
lock(&buffer_xa_class);
*** DEADLOCK ***
2 locks held by kswapd0/66:
#0: ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
#1: ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=Multi%2Dlock%20dependency%20rules%3A
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I accidentally added a bug in pptp_xmit() that syzbot caught for us.
Only call ip_rt_put() if a route has been allocated.
BUG: unable to handle page fault for address: ffffffffffffffdb
PGD df3b067 P4D df3b067 PUD df3d067 PMD 0
Oops: Oops: 0002 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6346 Comm: syz.0.336 Not tainted 6.16.0-next-20250804-syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
RIP: 0010:arch_atomic_add_return arch/x86/include/asm/atomic.h:85 [inline]
RIP: 0010:raw_atomic_sub_return_release include/linux/atomic/atomic-arch-fallback.h:846 [inline]
RIP: 0010:atomic_sub_return_release include/linux/atomic/atomic-instrumented.h:327 [inline]
RIP: 0010:__rcuref_put include/linux/rcuref.h:109 [inline]
RIP: 0010:rcuref_put+0x172/0x210 include/linux/rcuref.h:173
Call Trace:
<TASK>
dst_release+0x24/0x1b0 net/core/dst.c:167
ip_rt_put include/net/route.h:285 [inline]
pptp_xmit+0x14b/0x1a90 drivers/net/ppp/pptp.c:267
__ppp_channel_push+0xf2/0x1c0 drivers/net/ppp/ppp_generic.c:2166
ppp_channel_push+0x123/0x660 drivers/net/ppp/ppp_generic.c:2198
ppp_write+0x2b0/0x400 drivers/net/ppp/ppp_generic.c:544
vfs_write+0x27b/0xb30 fs/read_write.c:684
ksys_write+0x145/0x250 fs/read_write.c:738
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: de9c4861fb42 ("pptp: ensure minimal skb length in pptp_xmit()")
Reported-by: syzbot+27d7cfbc93457e472e00@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/689095a5.050a0220.1fc43d.0009.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250807142146.2877060-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Because it's only used in sbitmap.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250807032413.1469456-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently elevators will record internal 'async_depth' to throttle
asynchronous requests, and they both calculate shallow_dpeth based on
sb->shift, with the respect that sb->shift is the available tags in one
word.
However, sb->shift is not the availbale tags in the last word, see
__map_depth:
if (index == sb->map_nr - 1)
return sb->depth - (index << sb->shift);
For consequence, if the last word is used, more tags can be get than
expected, for example, assume nr_requests=256 and there are four words,
in the worst case if user set nr_requests=32, then the first word is
the last word, and still use bits per word, which is 64, to calculate
async_depth is wrong.
One the ohter hand, due to cgroup qos, bfq can allow only one request
to be allocated, and set shallow_dpeth=1 will still allow the number
of words request to be allocated.
Fix this problems by using shallow_depth to the whole sbitmap instead
of per word, also change kyber, mq-deadline and bfq to follow this,
a new helper __map_depth_with_shallow() is introduced to calculate
available bits in each word.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 528589947c180 ("nvmet: initialize discovery subsys after debugfs
is initialized") changed nvmet_init() to initialize nvme discovery after
"nvmet" debugfs directory is initialized. The change broke nvmet_exit()
because discovery subsystem now depends on debugfs. Debugfs should be
destroyed after discovery subsystem. Fix nvmet_exit() to do that.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs96AfFQpyDKF_MdfJsnOEo=2V7dQgqjFv+k3t7H-=yGhA@mail.gmail.com/
Fixes: 528589947c180 ("nvmet: initialize discovery subsys after debugfs is initialized")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lore.kernel.org/r/20250807053507.2794335-1-mkhalfella@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot managed to reach this WARN_ON_ONCE by passing a huge level
value, remove it.
WARNING: CPU: 0 PID: 5853 at net/netfilter/nft_socket.c:220 nft_socket_init+0x2f4/0x3d0 net/netfilter/nft_socket.c:220
Reported-by: syzbot+a225fea35d7baf8dbdc3@syzkaller.appspotmail.com
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|