Age | Commit message | Author |
|
Add of_match_table for aw88081 driver to make matching
between dts and driver more flexible
Signed-off-by: Weidong Wang <wangweidong.a@awinic.com>
Link: https://patch.msgid.link/20250410024953.26565-1-wangweidong.a@awinic.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
On a SNDRV_PCM_TRIGGER_START event, audio data pointers are not reset.
When multiple TRIGGER_START events are received, this leads to incorrect
buffer usage between user space and the driver: the driver can read data
that user space has not yet written, or user space and the driver can
write and read the same area at the same time.
Fix that by resetting the data pointers on each SNDRV_PCM_TRIGGER_START
event.
Fixes: 075c7125b11c ("ASoC: fsl: Add support for QMC audio")
Cc: stable@vger.kernel.org
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Link: https://patch.msgid.link/20250410091643.535627-1-herve.codina@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>
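A minimal userspace sketch of the pointer reset described in this entry. The struct, field names and enum are hypothetical stand-ins, not the QMC audio driver's own types; the only point illustrated is that every START begins again at offset 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stream state; not the driver's real structure. */
struct stream_state {
    size_t buffer_size;   /* total ring buffer size in bytes */
    size_t period_size;   /* bytes consumed per completed transfer */
    size_t read_pos;      /* next offset the "hardware" will consume */
};

enum trigger_cmd { TRIGGER_START, TRIGGER_STOP };

/* Simulate one completed transfer: advance and wrap the position. */
static void stream_transfer_done(struct stream_state *s)
{
    s->read_pos = (s->read_pos + s->period_size) % s->buffer_size;
}

/* Reset the position on every START so a restart begins at offset 0. */
static int stream_trigger(struct stream_state *s, enum trigger_cmd cmd)
{
    assert(s != NULL);
    switch (cmd) {
    case TRIGGER_START:
        s->read_pos = 0;
        return 0;
    case TRIGGER_STOP:
        return 0;
    }
    return -1;
}
```

Without the reset in the START branch, a second START would resume from wherever the previous run stopped, which is the mismatch with user space that the entry describes.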
|
|
Add entries for the various addresses that I have been using over
the years and remap all of them to kernel.org alias.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Link: https://patch.msgid.link/20250410103713.24875-3-srinivas.kandagatla@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
My Linaro email will stop working soon. Use @kernel.org email instead.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Link: https://patch.msgid.link/20250410103713.24875-2-srinivas.kandagatla@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains a Netfilter fix and improved test coverage:
1) Fix AVX2 matching in nft_pipapo, from Florian Westphal.
2) Extend existing test to improve coverage for the aforementioned bug,
also from Florian.
netfilter pull request 25-04-10
* tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: add test case for recent mismatch bug
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
====================
Link: https://patch.msgid.link/20250410103647.1030244-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Without "nft_set_pipapo: fix incorrect avx2 match of 5th field octet"
this fails:
TEST: reported issues
Add two elements, flush, re-add 1s [ OK ]
net,mac with reload 0s [ OK ]
net,port,proto 3s [ OK ]
avx2 false match 0s [FAIL]
False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001
Other tests do not detect the kernel bug as they only alter parts in
the /64 netmask.
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Given a set element like:
icmpv6 . dead:beef:00ff::1
The value of 'ff' is irrelevant: any address will be matched
as long as the other octets are the same.
This is because of too-early register clobbering:
ymm7 is reloaded with new packet data (pkt[9]) but it still holds data
of an earlier load that wasn't processed yet.
The existing tests in nft_concat_range.sh selftests do exercise this code
path, but do not trigger incorrect matching due to the network prefix
limitation.
Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reported-by: sontu mazumdar <sontu21@gmail.com>
Closes: https://lore.kernel.org/netfilter/CANgxkqwnMH7fXra+VUfODT-8+qFLgskq3set1cAzqqJaV4iEZg@mail.gmail.com/T/#t
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Ensure we have enough data in linear buffer from skb before accessing
initial bytes. This prevents potential out-of-bounds accesses
when processing short packets.
When ppp_sync_txmung receives an incoming packet with an empty
payload:
(remote) gef➤ p *(struct pppoe_hdr *) (skb->head + skb->network_header)
$18 = {
type = 0x1,
ver = 0x1,
code = 0x0,
sid = 0x2,
length = 0x0,
tag = 0xffff8880371cdb96
}
from the skb struct (trimmed)
tail = 0x16,
end = 0x140,
head = 0xffff88803346f400 "4",
data = 0xffff88803346f416 ":\377",
truesize = 0x380,
len = 0x0,
data_len = 0x0,
mac_len = 0xe,
hdr_len = 0x0,
it is not safe to access data[2].
Reported-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=29fc8991b0ecb186cf40
Tested-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Link: https://patch.msgid.link/20250408-bound-checking-ppp_txmung-v2-1-94bb6e1b92d0@arnaud-lcm.com
[pabeni@redhat.com: fixed subj typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
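A minimal userspace sketch of the bounds check described in this entry. The struct linear_buf type and the helper name are hypothetical and stand in for the skb's linear area; the kernel fix itself is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat view of a packet's linear data; not struct sk_buff. */
struct linear_buf {
    const uint8_t *data;
    size_t len;   /* bytes readable through data */
};

/*
 * Store data[2] in *out and return true only when at least three bytes
 * are available; short or empty packets are rejected without reading.
 */
static bool read_third_byte(const struct linear_buf *b, uint8_t *out)
{
    assert(b != NULL && out != NULL);
    if (b->data == NULL || b->len < 3)
        return false;
    *out = b->data[2];
    return true;
}
```

With len = 0x0, as in the dump above, the helper returns false instead of touching data[2].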
|
|
Compiling the kernel with gcc12.3 W=2 results in shadowing warnings:
warning: declaration of '__pptr' shadows a previous local [-Wshadow]
const struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr);
note: in definition of macro '__put_unaligned_t'
__pptr->x = (val);
note: in expansion of macro '__get_unaligned_t'
__put_unaligned_t(type, __get_unaligned_t(type, src), dst);
__get_unaligned_t() and __put_unaligned_t() use a local variable named
'__pptr', which can lead to variable shadowing when these macros are used in
the same scope. This results in a -Wshadow warning during compilation.
To address this issue, rename the local variables within the macros to
ensure uniqueness.
Signed-off-by: Peng Jiang <jiang.peng9@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250324191230477zpGtgIRSH4mEHdtxGtgx9@zte.com.cn
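A minimal sketch of the renaming idea in this entry, assuming GNU C extensions (statement expressions, __typeof__, the packed attribute). The macro names and the get_pptr_ / put_pptr_ identifiers are chosen for this sketch and are not claimed to match the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Read a possibly unaligned value through a packed single-member struct. */
#define get_unaligned_t(type, ptr) ({                               \
    const struct { type x; } __attribute__((packed)) *get_pptr_ =   \
        (__typeof__(get_pptr_))(ptr);                               \
    get_pptr_->x;                                                   \
})

/* Write a possibly unaligned value; uses a different local name. */
#define put_unaligned_t(type, val, ptr) do {                        \
    struct { type x; } __attribute__((packed)) *put_pptr_ =         \
        (__typeof__(put_pptr_))(ptr);                               \
    put_pptr_->x = (val);                                           \
} while (0)

/* The get expands inside the put; distinct names mean nothing is shadowed. */
#define copy_unaligned_t(type, dst, src) \
    put_unaligned_t(type, get_unaligned_t(type, src), dst)

/* Store a value at an odd offset, copy it to another odd offset, read it. */
static uint32_t copy_u32_at_odd_offset(uint32_t value)
{
    uint8_t src[8] = {0}, dst[8] = {0};
    uint32_t stored;

    put_unaligned_t(uint32_t, value, src + 1);
    stored = get_unaligned_t(uint32_t, src + 1);
    assert(stored == value);

    copy_unaligned_t(uint32_t, dst + 3, src + 1);
    return get_unaligned_t(uint32_t, dst + 3);
}
```

If both macros declared the same local name, the inner declaration in copy_unaligned_t would shadow the outer one and -Wshadow would report it, which is the warning quoted above.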
|
|
Use the cfg->ctrl_ops->io_init callback to make it work on all platforms.
This also gets rid of code duplication.
Fixes: 3f60dbd40d3f ("drm/rockchip: dw_hdmi_qp: Add platform ctrl callback")
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/r/20250317102757.565679-1-andyshrk@163.com
|
|
This is a copy-paste error, which affects DP1 usage.
Fixes: 328e6885996c ("drm/rockchip: vop2: Add platform specific callback")
Signed-off-by: Andy Yan <andy.yan@rock-chips.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/r/20250312064218.524143-1-andyshrk@163.com
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.15-2025-04-09:
amdgpu:
- MES FW version caching fixes
- Only use GTT as a fallback if we already have a backing store
- dma_buf fix
- IP discovery fix
- Replay and PSR with VRR fix
- DC FP fixes
- eDP fixes
- KIQ TLB invalidate fix
- Enable dmem groups support
- Allow pinning VRAM dma bufs if imports can do P2P
- Workload profile fixes
- Prevent possible division by 0 in fan handling
amdkfd:
- Queue reset fixes
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250409165238.1180153-1-alexander.deucher@amd.com
|
|
Remove duplicate code in function z_erofs_register_pcluster()
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250410042048.3044-2-liubo03@inspur.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Two new file system attributes were recently added. See MS-FSCC 2.5.1:
FILE_SUPPORTS_POSIX_UNLINK_RENAME and
FILE_RETURNS_CLEANUP_RESULT_INFO
Update the missing defines for ksmbd and cifs.ko
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cpufreq_cpu_get_raw() can return NULL when the target CPU is not present
in the policy->cpus mask. scpi_cpufreq_get_rate() does not check for
this case, which results in a NULL pointer dereference.
Fixes: 343a8d17fa8d ("cpufreq: scpi: remove arm_big_little dependency")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
cpufreq_cpu_get_raw() can return NULL when the target CPU is not present
in the policy->cpus mask. scmi_cpufreq_get_rate() does not check for
this case, which results in a NULL pointer dereference.
Add NULL check after cpufreq_cpu_get_raw() to prevent this issue.
Fixes: 99d6bdf33877 ("cpufreq: add support for CPU DVFS based on SCMI message protocol")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
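A minimal userspace sketch of the NULL check pattern shared by these cpufreq entries. struct policy, policy_get_raw() and get_rate() are hypothetical stand-ins, and returning 0 for an uncovered CPU is an assumption of this sketch, not a statement about the drivers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature policy; the real cpufreq policy is far larger. */
struct policy {
    unsigned int cpu_mask;   /* bit n set: CPU n is covered */
    unsigned int rate_khz;
};

static struct policy demo_policy = { .cpu_mask = 0x3, .rate_khz = 1800000 };

/* Stand-in lookup that may return NULL for a CPU outside the mask. */
static struct policy *policy_get_raw(unsigned int cpu)
{
    if (cpu >= 32 || !(demo_policy.cpu_mask & (1u << cpu)))
        return NULL;
    return &demo_policy;
}

/* Return the rate in kHz, or 0 when no policy covers the CPU. */
static unsigned int get_rate(unsigned int cpu)
{
    struct policy *p = policy_get_raw(cpu);

    if (!p)   /* guard against the NULL return before dereferencing */
        return 0;
    return p->rate_khz;
}
```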
|
|
cpufreq_cpu_get_raw() can return NULL when the target CPU is not present
in the policy->cpus mask. apple_soc_cpufreq_get_rate() does not check
for this case, which results in a NULL pointer dereference.
Fixes: 6286bbb40576 ("cpufreq: apple-soc: Add new driver to control Apple SoC CPU P-states")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
drm/i915 fixes for v6.15-rc2:
- Fix scanline offset for LNL+ and BMG+
- Fix GVT unterminated-string-initialization build warning
- Fix DP rate limit when sink doesn't support TPS4
- Handle GDDR + ECC memory type detection
- Fix VRR parameter change check
- Fix fence not released on early probe errors
- Disable render power gating during live selftests
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://lore.kernel.org/r/87lds9wlpq.fsf@intel.com
|
|
Hou Tao says:
====================
bpf: Support atomic update for htab of maps
From: Hou Tao <houtao1@huawei.com>
Hi,
The motivation for the patch set comes from the question raised by Cody
Haas [1]. When trying to concurrently lookup and update an existing
element in a htab of maps, the lookup procedure may return -ENOENT
unexpectedly. The first revision of the patch set tried to resolve the
problem by making the insertion of the new element and the deletion of
the old element atomic from the perspective of the lookup process.
While the solution would benefit all hash maps, it does not fully
resolve the problem due to the immediate reuse issue. Therefore, v2
of the patch set only fixes the problem for fd htab.
Please see individual patches for details. Comments are always welcome.
v3:
* rebase on bpf_next/for-next
* add Acked-by tags
v2: https://lore.kernel.org/bpf/20250308135110.953269-1-houtao@huaweicloud.com/
* only support atomic update for fd htab
v1: https://lore.kernel.org/bpf/20250204082848.13471-1-hotforest@gmail.com
[1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/
====================
Link: https://patch.msgid.link/20250401062250.543403-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a test case to verify the atomic update of existing elements in the
htab of maps. The test proceeds in three steps:
1) fill the outer map with keys in the range [0, 8]
For each inner array map, the value of its first element is set as the
key used to look up the inner map.
2) create 16 threads to look up these keys concurrently
Each lookup thread first looks up the inner map, then checks whether
the first value of the inner array map is the same as the key used to
look up the inner map.
3) create 8 threads to overwrite these keys concurrently
Each update thread first creates an inner array, sets the first value
of the array to the key used to update the outer map, then uses the
key and the inner map to update the outer map.
Without atomic update support, the lookup operation may return -ENOENT
during the lookup of outer map, or return -EINVAL during the comparison
of the first value in the inner map and the key used for inner map, and
the test will fail. After the atomic update change, both the lookup and
the comparison will succeed.
Given that the update of the outer map is slow, the test case sets the
loop count for each thread to 5 to reduce the total running time. However,
the loop count can also be adjusted through the FD_HTAB_LOOP_NR
environment variable.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The update of an element in fd htab is now done in place, so there is no
need to allocate per-cpu extra_elems; just remove them.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add is_fd_htab() helper to check whether the map is htab of maps.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
As reported by Cody Haas [1], when a map lookup and a map update of an
existing element run concurrently on a htab of maps, the map lookup
procedure may return -ENOENT unexpectedly.
The root cause is twofold:
1) the update of an existing element involves two separate list operations
In htab_map_update_elem(), it first inserts the new element at the head
of the list, then it deletes the old element. Therefore, it is possible
that a lookup operation has already iterated to the middle of the list
when a concurrent update operation begins, and the lookup operation will
fail to find the target element.
2) the immediate reuse of htab elements.
This is more subtle. Even though the lookup operation finds the old
element, it is possible that the target element has been removed by a
concurrent update operation, and that the element has been reused
immediately by another update operation which runs on the same CPU as the
previous update operation and inserts the element into the same bucket list.
After these steps, when the lookup operation tries to compare the
key in the old element with the expected key, the match will fail
because the key in the old element has been overwritten by the other
update operation.
The two-step update process is relatively straightforward to address.
The more challenging aspect is the immediate reuse. As Alexei pointed
out:
So since 2022 both prealloc and no_prealloc reuse elements.
We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
that will use _rcu() flavor of freeing into bpf_ma,
but it has to have a strong reason.
Given that htab of maps doesn't support special fields in the value and
directly stores the inner map pointer in htab_element, just do an in-place
update for htab of maps instead of attempting to address the immediate
reuse issue.
[1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place,
and add a new percpu argument for the helper to support in-place update
for both per-cpu htab and htab of maps.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
All hash maps store map key and map value together. The relative offset
of the map value compared to the map key is round_up(key_size, 8).
Therefore, factor out a common helper htab_elem_value() to calculate the
address of the map value instead of duplicating the logic.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
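A minimal userspace sketch of the offset rule stated in this entry, where the value sits round_up(key_size, 8) bytes after the key. The function names are hypothetical analogues and do not reproduce the kernel helper's signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of 8 (8 is a power of two). */
static size_t round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/*
 * Given the start of an element's key storage, return the address of the
 * value stored after it, at offset round_up8(key_size).
 */
static void *elem_value(void *key, size_t key_size)
{
    assert(key != NULL);
    return (uint8_t *)key + round_up8(key_size);
}
```

For example, a 4-byte key and an 8-byte key both place the value at offset 8, while a 9-byte key places it at offset 16.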
|
|
In the bpf_l4_csum_replace helper, the BPF_F_PSEUDO_HDR flag should only
be set if the modified header field is part of the pseudo-header.
If you modify for example the UDP ports and pass BPF_F_PSEUDO_HDR,
inet_proto_csum_replace4 will update skb->csum even though it shouldn't
(the port and the UDP checksum updates null each other).
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/5126ef84ba75425b689482cbc98bffe75e5d8ab0.1744102490.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
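A worked sketch of why the port update and the checksum update cancel out, assuming a UDP header and payload held as host-order 16-bit words and the incremental update formula from RFC 1624. The function names are hypothetical, and this models only the arithmetic, not the helper's handling of skb->csum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement sum of an array of 16-bit words (not inverted). */
static uint16_t sum_words(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return fold(sum);
}

/* RFC 1624 incremental update: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_replace16(uint16_t check, uint16_t old, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old;
    sum += new_val;
    return (uint16_t)~fold(sum);
}

/*
 * Rewrite the destination port (word 1), patch the checksum field (word 3),
 * and return the sum over all the words afterwards.
 */
static uint16_t rewrite_dport(uint16_t *udp, size_t n, uint16_t new_port)
{
    assert(n >= 4);
    udp[3] = csum_replace16(udp[3], udp[1], new_port);
    udp[1] = new_port;
    return sum_words(udp, n);
}
```

Because both the port and the checksum field live inside the summed data, the sum over the segment is the same before and after the rewrite, so a sum kept over that data needs no adjustment. A pseudo-header field such as an IP address lies outside the segment, so only the checksum field changes and the data sum changes with it.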
|
|
BPF_F_RECOMPUTE_CSUM doesn't update the actual L3 and L4 checksums in
the packet, but simply updates skb->csum (according to skb->ip_summed).
This patch clarifies that to avoid confusion.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ff6895d42936f03dbb82334d8bcfd50e00c79086.1744102490.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Willem de Bruijn says:
====================
support SKF_NET_OFF and SKF_LL_OFF on skb frags
From: Willem de Bruijn <willemb@google.com>
Address a longstanding issue that may lead to missed packets
depending on system configuration.
Ensure that reading from packet contents works regardless of skb
geometry, also when using the special SKF_.. negative offsets to
offset from L2 or L3 header.
Patch 2 is the selftest for the fix.
v2->v3
- do not remove bpf_internal_load_pointer_neg_helper, because it
is still used in the sparc32 JIT
v1->v2
- introduce bpf_skb_load_helper_convert_offset to avoid open
coding
- selftest: add comment why early demux must be disabled
v2: https://lore.kernel.org/netdev/20250404142633.1955847-1-willemdebruijn.kernel@gmail.com/
v1: https://lore.kernel.org/netdev/20250403140846.1268564-1-willemdebruijn.kernel@gmail.com/
====================
Link: https://patch.msgid.link/20250408132833.195491-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Verify that a classic BPF Linux socket filter correctly matches
packet contents, including when accessing contents in an
skb_frag.
1. Open a SOCK_RAW socket with a classic BPF filter on UDP dport 8000.
2. Open a tap device with IFF_NAPI_FRAGS to inject skbs with frags.
3. Send a packet for which the UDP header is in frag[0].
4. Receive this packet to demonstrate that the socket accepted it.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20250408132833.195491-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Classic BPF socket filters with SKF_NET_OFF and SKF_LL_OFF fail to
read when these offsets extend into frags.
This has been observed with iwlwifi and reproduced with tun with
IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port,
applied to a RAW socket, will silently miss matching packets.
const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt);
const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest);
struct sock_filter filter_code[] = {
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4),
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_NET_OFF + offset_proto),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2),
BPF_STMT(BPF_LD + BPF_H + BPF_ABS, SKF_NET_OFF + offset_dport),
This is unexpected behavior. Socket filter programs should be
consistent regardless of environment. Silent misses are
particularly concerning as they are hard to detect.
Use skb_copy_bits for offsets outside the linear area, same as done for
non-SKF_(LL|NET) offsets.
Offset is always positive after subtracting the reference threshold
SKF_(LL|NET)_OFF, so is always >= skb_(mac|network)_offset. The sum of
the two is an offset against skb->data, and may be negative, but it
cannot point before skb->head, as skb_(mac|network)_offset would too.
This appears to go back to when frag support was introduced to
sk_run_filter in linux-2.4.4, before the introduction of git.
The amount of code change and 8/16/32 bit duplication are unfortunate.
But any attempt I made to be smarter saved very few LoC while
complicating the code.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/20250122200402.3461154-1-maze@google.com/
Link: https://elixir.bootlin.com/linux/2.4.4/source/net/core/filter.c#L244
Reported-by: Matt Moeller <moeller.matt@gmail.com>
Co-developed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250408132833.195491-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
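A minimal userspace sketch of the offset translation discussed in this entry. The SKF_NET_OFF and SKF_LL_OFF values repeat those in the uapi header linux/filter.h so that the sketch builds without kernel headers. struct pkt_layout and convert_offset() are hypothetical, and ancillary SKF_AD_OFF loads are assumed to be handled elsewhere.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKF_NET_OFF (-0x100000)
#define SKF_LL_OFF  (-0x200000)

/* Hypothetical summary of header offsets relative to the data pointer. */
struct pkt_layout {
    int network_offset;   /* start of the L3 header */
    int mac_offset;       /* start of the L2 header, often negative */
    bool mac_set;         /* whether the L2 header position is known */
};

/*
 * Translate a classic BPF load offset into an offset from the data
 * pointer. Returns INT_MIN when the offset is outside the known ranges.
 */
static int convert_offset(const struct pkt_layout *p, int offset)
{
    assert(p != NULL);
    if (offset >= 0)
        return offset;
    if (offset >= SKF_NET_OFF)
        return offset - SKF_NET_OFF + p->network_offset;
    if (offset >= SKF_LL_OFF && p->mac_set)
        return offset - SKF_LL_OFF + p->mac_offset;
    return INT_MIN;
}
```

Once the special offset is reduced to a plain offset from the data pointer, the same bounds-checked copy can serve linear and fragmented data alike, which is the approach the entry describes.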
|
|
Jiayuan Chen says:
====================
bpf, sockmap: Fix data loss and panic issues
I was writing a benchmark based on sockmap + TCP and discovered several
issues:
1. When EAGAIN occurs, the direction of skb is incorrect, causing data
loss on retry.
2. When sending partial data, the offset is not recorded, leading to
duplicate data being sent on retry.
3. An unexpected BUG_ON() judgment in skb_linearize is triggered.
4. The memory of psock->ingress_skb is not limited by the socket buffer
and memcg.
Issues 1, 2, and 3 are described in each patch's commit message.
Regarding issue 4, this patchset does not cover it as it is difficult to
handle in practice, and I am still working on it.
Here is a brief description of the issue:
When using sockmap to skb/stream redirect, if the receiving end does not
perform read operations, all data will be buffered in ingress_skb.
For example:
'''
// set memory limit to 5000M
cgcreate -g memory:myGroup
cgset -r memory.max="5000M" myGroup
// start benchmark and disable consumer from reading
cgexec -g "memory:myGroup" ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --delay-consumer=-1 -d 100
Iter 0 ( 29.179us): Send Speed 2668.548 MB/s (20360.406 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 1 ( -7.237us): Send Speed 2694.467 MB/s (20557.149 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 2 ( -1.918us): Send Speed 2693.404 MB/s (20548.039 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 3 ( -0.684us): Send Speed 2693.138 MB/s (20548.014 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 4 ( 7.879us): Send Speed 2698.620 MB/s (20588.838 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 5 ( -3.224us): Send Speed 2696.553 MB/s (20573.066 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 6 ( -5.409us): Send Speed 2699.705 MB/s (20597.111 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 7 ( -0.439us): Send Speed 2699.691 MB/s (20597.009 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
...
// memory usage is not limited
cat /proc/slabinfo | grep skb
skbuff_small_head 11824024 11824024 704 46 8 : tunables 0 0 0 : slabdata 257044 257044 0
skbuff_fclone_cache 11822080 11822080 512 32 4 : tunables 0 0 0 : slabdata 369440 369440 0
'''
Thus, a simple socket in a large file upload/download model can eat the
entire OS memory.
We must charge the skb memory to psock->sk, and if we do not want to lose
skbs, we need to feed the error back to read_sock/read_skb when the
enqueue operation of psock->ingress_skb fails.
---
Another stability-related patch of mine also needs review when maintainers
can spare the time.
https://lore.kernel.org/bpf/20250317092257.68760-1-jiayuan.chen@linux.dev/T/#t
====================
Link: https://patch.msgid.link/20250407142234.47591-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add TCP+sockmap-based benchmark.
Since sockmap's own update and delete operations are generally less
critical, the performance of the fast forwarding framework built upon
it is the key aspect.
Also with cgset/cgexec, we can observe the behavior of sockmap under
memory pressure.
The benchmark can be run with:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
'''
In the future, we plan to move socket_helpers.h out of the prog_tests
directory to make it accessible for the benchmark. This will enable
better support for various socket types.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20250407142234.47591-5-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The panic can be reproduced by executing the command:
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --rx-strp 100000
Then a kernel panic was captured:
'''
[ 657.460555] kernel BUG at net/core/skbuff.c:2178!
[ 657.462680] Tainted: [W]=WARN
[ 657.463287] Workqueue: events sk_psock_backlog
...
[ 657.469610] <TASK>
[ 657.469738] ? die+0x36/0x90
[ 657.469916] ? do_trap+0x1d0/0x270
[ 657.470118] ? pskb_expand_head+0x612/0xf40
[ 657.470376] ? pskb_expand_head+0x612/0xf40
[ 657.470620] ? do_error_trap+0xa3/0x170
[ 657.470846] ? pskb_expand_head+0x612/0xf40
[ 657.471092] ? handle_invalid_op+0x2c/0x40
[ 657.471335] ? pskb_expand_head+0x612/0xf40
[ 657.471579] ? exc_invalid_op+0x2d/0x40
[ 657.471805] ? asm_exc_invalid_op+0x1a/0x20
[ 657.472052] ? pskb_expand_head+0xd1/0xf40
[ 657.472292] ? pskb_expand_head+0x612/0xf40
[ 657.472540] ? lock_acquire+0x18f/0x4e0
[ 657.472766] ? find_held_lock+0x2d/0x110
[ 657.472999] ? __pfx_pskb_expand_head+0x10/0x10
[ 657.473263] ? __kmalloc_cache_noprof+0x5b/0x470
[ 657.473537] ? __pfx___lock_release.isra.0+0x10/0x10
[ 657.473826] __pskb_pull_tail+0xfd/0x1d20
[ 657.474062] ? __kasan_slab_alloc+0x4e/0x90
[ 657.474707] sk_psock_skb_ingress_enqueue+0x3bf/0x510
[ 657.475392] ? __kasan_kmalloc+0xaa/0xb0
[ 657.476010] sk_psock_backlog+0x5cf/0xd70
[ 657.476637] process_one_work+0x858/0x1a20
'''
The panic originates from the assertion BUG_ON(skb_shared(skb)) in
skb_linearize(). A previous commit (see Fixes tag) introduced skb_get()
to avoid race conditions between skb operations in the backlog and skb
release in the recvmsg path. However, this caused the panic to always
occur when skb_linearize is executed.
The "--rx-strp 100000" parameter forces the RX path to use the strparser
module which aggregates data until it reaches 100KB before calling sockmap
logic. The 100KB payload exceeds MAX_MSG_FRAGS, triggering skb_linearize.
To fix this issue, just move skb_get into sk_psock_skb_ingress_enqueue.
'''
sk_psock_backlog:
sk_psock_handle_skb
skb_get(skb) <== we move it into 'sk_psock_skb_ingress_enqueue'
sk_psock_skb_ingress____________
↓
|
| → sk_psock_skb_ingress_self
| sk_psock_skb_ingress_enqueue
sk_psock_verdict_apply_________________↑ skb_linearize
'''
Note that for the verdict_apply path, the skb_get operation is unnecessary,
so we add a 'take_ref' param to control its behavior.
Fixes: a454d84ee20b ("bpf, sockmap: Fix skb refcnt race after locking changes")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20250407142234.47591-4-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In the !ingress path under sk_psock_handle_skb(), when sending data to the
remote under snd_buf limitations, partial skb data might be transmitted.
Although we preserved the partial transmission state (offset/length), the
state wasn't properly consumed during retries. This caused the retry path
to resend the entire skb data instead of continuing from the previous
offset, resulting in data overlap at the receiver side.
Fixes: 405df89dd52c ("bpf, sockmap: Improved check for empty queue")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20250407142234.47591-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
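A minimal userspace sketch of resuming a partial send from the saved offset, as this entry describes. struct send_state, fake_send() and send_resume() are hypothetical, and the -1 return merely stands in for an -EAGAIN style retry request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical saved progress for one buffer; not the kernel's state. */
struct send_state {
    size_t off;   /* bytes already sent */
    size_t len;   /* bytes still to send */
};

/* Stand-in transmit: accepts at most `room` bytes, returns bytes taken. */
static size_t fake_send(size_t len, size_t room)
{
    return len < room ? len : room;
}

/*
 * Try to send the remainder recorded in st. Returns 0 once everything is
 * sent, or -1 after saving the progress, so the next call continues from
 * st->off rather than from the beginning of the buffer.
 */
static int send_resume(struct send_state *st, size_t room, size_t *sent_total)
{
    size_t n;

    assert(st != NULL && sent_total != NULL);
    n = fake_send(st->len, room);
    st->off += n;
    st->len -= n;
    *sent_total += n;
    return st->len ? -1 : 0;
}
```

If the retry ignored st->off and st->len and started over, the receiver would see the first part of the buffer twice, which is the overlap the entry reports.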
|
|
We call skb_bpf_redirect_clear() to clean _sk_redir before handling skb in
backlog, but when sk_psock_handle_skb() returns EAGAIN due to the sk_rcvbuf
limit, the redirect info in _sk_redir is not recovered.
Fix skb redir loss during EAGAIN retries by restoring _sk_redir
information using skb_bpf_set_redir().
Before this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed 65.271 MB/s
Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed 0 MB/s
Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed 0 MB/s
Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed 0 MB/s
'''
Due to the high send rate, the RX processing path may frequently hit the
sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow
to mistakenly enter the "!ingress" path, leading to send failures.
(The Rcv speed depends on tcp_rmem).
After this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed 65.402 MB/s
Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed 65.536 MB/s
Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed 65.536 MB/s
'''
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20250407142234.47591-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Jiayuan Chen says:
====================
bpf: fix ktls panic with sockmap and add tests
We can reproduce the issue using the existing test program:
'./test_sockmap --ktls'
Or use the selftest I provided, which will cause a panic:
------------[ cut here ]------------
kernel BUG at lib/iov_iter.c:629!
PKRU: 55555554
Call Trace:
<TASK>
? die+0x36/0x90
? do_trap+0xdd/0x100
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x178/0x180
? do_error_trap+0x7d/0x110
? iov_iter_revert+0x178/0x180
? exc_invalid_op+0x50/0x70
? iov_iter_revert+0x178/0x180
? asm_exc_invalid_op+0x1a/0x20
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x5c/0x180
tls_sw_sendmsg_locked.isra.0+0x794/0x840
tls_sw_sendmsg+0x52/0x80
? inet_sendmsg+0x1f/0x70
__sys_sendto+0x1cd/0x200
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? __lock_release.isra.0+0x5e/0x170
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? lockdep_hardirqs_on_prepare+0xda/0x190
? ktime_get_coarse_real_ts64+0xc2/0xd0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x90/0x170
1. It looks like the issue was introduced when bpf support was added to
ktls, and the later addition of assertions to iov_iter turned it into a panic.
If my Fixes tag is incorrect, please help me correct it.
2. I have made minimal changes for now; they are enough to make ktls work
correctly.
---
v1->v2: Added more content to the commit message
https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r
---
====================
Link: https://patch.msgid.link/20250219052015.274405-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
add ktls selftest for sockmap
Test results:
sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK
sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK
sockmap_ktls/tls simple offload:OK
sockmap_ktls/tls tx cork:OK
sockmap_ktls/tls tx cork with push:OK
sockmap_ktls/tls simple offload:OK
sockmap_ktls/tls tx cork:OK
sockmap_ktls/tls tx cork with push:OK
sockmap_ktls:OK
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250219052015.274405-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
[ 2172.936997] ------------[ cut here ]------------
[ 2172.936999] kernel BUG at lib/iov_iter.c:629!
......
[ 2172.944996] PKRU: 55555554
[ 2172.945155] Call Trace:
[ 2172.945299] <TASK>
[ 2172.945428] ? die+0x36/0x90
[ 2172.945601] ? do_trap+0xdd/0x100
[ 2172.945795] ? iov_iter_revert+0x178/0x180
[ 2172.946031] ? iov_iter_revert+0x178/0x180
[ 2172.946267] ? do_error_trap+0x7d/0x110
[ 2172.946499] ? iov_iter_revert+0x178/0x180
[ 2172.946736] ? exc_invalid_op+0x50/0x70
[ 2172.946961] ? iov_iter_revert+0x178/0x180
[ 2172.947197] ? asm_exc_invalid_op+0x1a/0x20
[ 2172.947446] ? iov_iter_revert+0x178/0x180
[ 2172.947683] ? iov_iter_revert+0x5c/0x180
[ 2172.947913] tls_sw_sendmsg_locked.isra.0+0x794/0x840
[ 2172.948206] tls_sw_sendmsg+0x52/0x80
[ 2172.948420] ? inet_sendmsg+0x1f/0x70
[ 2172.948634] __sys_sendto+0x1cd/0x200
[ 2172.948848] ? find_held_lock+0x2b/0x80
[ 2172.949072] ? syscall_trace_enter+0x140/0x270
[ 2172.949330] ? __lock_release.isra.0+0x5e/0x170
[ 2172.949595] ? find_held_lock+0x2b/0x80
[ 2172.949817] ? syscall_trace_enter+0x140/0x270
[ 2172.950211] ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 2172.950632] ? ktime_get_coarse_real_ts64+0xc2/0xd0
[ 2172.951036] __x64_sys_sendto+0x24/0x30
[ 2172.951382] do_syscall_64+0x90/0x170
......
After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase,
e.g., when the BPF program executes bpf_msg_push_data().
If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes,
it will return -ENOSPC and attempt to roll back to the non-zero copy
logic. However, during rollback, msg->msg_iter is reset, but since
msg_pl->sg.size has been increased, subsequent executions will exceed the
actual size of msg_iter.
'''
iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size);
'''
The changes in this commit are based on the following considerations:
1. When cork_bytes is set, rolling back to the non-zero-copy logic is
pointless; we can go directly to the zero-copy logic.
2. We cannot calculate the correct number of bytes to revert msg_iter.
Assume the original data is "abcdefgh" (8 bytes), and after 3 pushes
by the BPF program, it becomes 11-byte data: "abc?de?fgh?".
Then, we set cork_bytes to 6, which means the first 6 bytes have been
processed, and the remaining 5 bytes "?fgh?" will be cached until the
length meets the cork_bytes requirement.
However, some data in "?fgh?" is not within 'msg->msg_iter'
(but in msg_pl instead), especially the data "?" we pushed.
So it doesn't seem as simple as just reverting through an offset of
msg_iter.
3. For non-TLS sockets in tcp_bpf_sendmsg, when a "cork" situation occurs,
the user-space send() doesn't return an error, and the returned length is
the same as the input length parameter, even if some data is cached.
Additionally, I saw that the current non-zero-copy logic for handling
corking is written as:
'''
line 1177
else if (ret != -EAGAIN) {
if (ret == -ENOSPC)
ret = 0;
goto send_end;
'''
So it's ok to just return 'copied' without error when a "cork" situation
occurs.
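The "abcdefgh" example in point 2 can be sketched in user space. This is only a toy model, not kernel code: struct toy_msg, toy_msg_push() and the other toy_* names are hypothetical stand-ins for msg_pl and bpf_msg_push_data(), and orig_size is assumed to be 0. It tags every byte with its origin to show that sg.size - orig_size (11) exceeds the 8 bytes taken from the iterator, and that only 3 of the 5 cached bytes came from it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 64

/* One payload byte plus its origin. */
struct toy_byte {
	char value;
	int from_iter;	/* 1: copied from the user's iterator, 0: pushed by BPF */
};

/* Toy stand-in for msg_pl; not the kernel's struct sk_msg. */
struct toy_msg {
	struct toy_byte data[TOY_MAX];
	size_t size;
};

/* Copy the user's bytes in, as the sendmsg path would. */
static void toy_msg_from_iter(struct toy_msg *m, const char *user)
{
	size_t i, n = strlen(user);

	assert(n <= TOY_MAX);
	for (i = 0; i < n; i++)
		m->data[i] = (struct toy_byte){ user[i], 1 };
	m->size = n;
}

/* Insert one byte at pos, loosely modelling bpf_msg_push_data(). */
static void toy_msg_push(struct toy_msg *m, size_t pos, char c)
{
	assert(m->size < TOY_MAX && pos <= m->size);
	memmove(&m->data[pos + 1], &m->data[pos],
		(m->size - pos) * sizeof(m->data[0]));
	m->data[pos] = (struct toy_byte){ c, 0 };
	m->size++;
}

/* Count the bytes at or after 'start' that came from the iterator. */
static size_t toy_iter_bytes_from(const struct toy_msg *m, size_t start)
{
	size_t i, n = 0;

	for (i = start; i < m->size; i++)
		n += m->data[i].from_iter;
	return n;
}

/* Build the "abc?de?fgh?" example from the text. */
static void toy_build_example(struct toy_msg *m)
{
	toy_msg_from_iter(m, "abcdefgh");
	toy_msg_push(m, 3, '?');	/* abc?defgh   */
	toy_msg_push(m, 6, '?');	/* abc?de?fgh  */
	toy_msg_push(m, 10, '?');	/* abc?de?fgh? */
}
```

With cork_bytes = 6, calling toy_iter_bytes_from(&m, 6) after toy_build_example(&m) counts the iterator bytes in the cached tail "?fgh?", which is why a plain offset into msg_iter cannot describe it.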
Fixes: fcb14cb1bdac ("new iov_iter flavour - ITER_UBUF")
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250219052015.274405-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
# lsmod | grep cifs
cifs 1159168 0
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
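The lifetime rule above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: struct toy_module, struct toy_sock and all toy_* functions are hypothetical, and the real code uses the kernel's own module refcounting and sk->sk_owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy module: unloadable only while nothing holds a reference. */
struct toy_module {
	int refcnt;
	bool loaded;
};

/* Toy socket that remembers which module owns its lock class. */
struct toy_sock {
	struct toy_module *owner;
};

/* Plain init: no custom lock class, so no owner to pin. */
static void toy_sock_lock_init(struct toy_sock *sk)
{
	sk->owner = NULL;
}

/* Custom lock class: pin the module that defines the class. */
static void toy_sock_lock_init_class(struct toy_sock *sk,
				     struct toy_module *owner)
{
	assert(owner->loaded);
	owner->refcnt++;
	sk->owner = owner;
}

/* Final free of the socket: drop the reference taken above. */
static void toy_sock_free(struct toy_sock *sk)
{
	if (sk->owner) {
		sk->owner->refcnt--;
		sk->owner = NULL;
	}
}

/* rmmod stand-in: refuses while a socket still pins the module. */
static bool toy_module_unload(struct toy_module *mod)
{
	if (mod->refcnt > 0)
		return false;
	mod->loaded = false;
	return true;
}
```

In this model, a socket lingering in FIN_WAIT_1 simply has not reached toy_sock_free() yet, so toy_module_unload() keeps failing until it does.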
[0]:
CIFS_SERVER="10.0.0.137"
CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST"
DEV="enp0s3"
CRED="/root/WindowsCredential.txt"
MNT=$(mktemp -d /tmp/XXXXXX)
mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1
iptables -A INPUT -s ${CIFS_SERVER} -j DROP
for i in $(seq 10);
do
umount ${MNT}
rmmod cifs
sleep 1
done
rm -r ${MNT}
iptables -D INPUT -s ${CIFS_SERVER} -j DROP
[1]:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223)
...
Call Trace:
<IRQ>
__lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178)
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
...
BUG: kernel NULL pointer dereference, address: 00000000000000c4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178)
Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44
RSP: 0018:ffa0000000468a10 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027
RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88
RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff
R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88
R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1
FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816)
_raw_spin_lock_nested (kernel/locking/spinlock.c:379)
tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1))
ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234)
ip_sublist_rcv_finish (net/ipv4/ip_input.c:576)
ip_list_rcv_finish (net/ipv4/ip_input.c:628)
ip_list_rcv (net/ipv4/ip_input.c:670)
__netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986)
netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129)
napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815)
__napi_poll.constprop.0 (net/core/dev.c:7191)
net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382)
handle_softirqs (kernel/softirq.c:561)
__irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662)
irq_exit_rcu (kernel/softirq.c:680)
common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14))
</IRQ>
<TASK>
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693)
RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744)
Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202
RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5
RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118)
do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325)
cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:315)
common_startup_64 (arch/x86/kernel/head_64.S:421)
</TASK>
Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CR2: 00000000000000c4
Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit 4d0ab3a6885e ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fall back to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fall back to 'match' when the
first nexthop matches, but has a negative score.
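The aligned behavior can be sketched as a stand-alone function. This is a simplified model, not the kernel's selection code: struct toy_nexthop and toy_multipath_select() are hypothetical, the score is precomputed rather than obtained from rt6_score_route(), and the hash region is assumed to be inclusive of its upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Toy nexthop: a hash region end and a precomputed score. */
struct toy_nexthop {
	unsigned int upper_bound;	/* end of this nexthop's hash region */
	int score;			/* negative means unusable */
};

/*
 * Return the index of the chosen nexthop. 'match' is the index with
 * which the selection started and serves as the fallback.
 */
static size_t toy_multipath_select(const struct toy_nexthop *nh, size_t n,
				   unsigned int hash, size_t match)
{
	size_t i;

	assert(match < n);
	for (i = 0; i < n; i++) {
		if (hash > nh[i].upper_bound)
			continue;	/* hash is outside this region */
		if (nh[i].score >= 0)
			return i;	/* region matches and score is usable */
		break;			/* region matches, score is negative */
	}
	return match;	/* same fallback for the first nexthop and the rest */
}
```

The first nexthop gets no special case here: a negative score ends the search at whichever index matched the hash.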
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N")
Fixes: 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop")
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
phylink-controlled PHY
DSA has 2 kinds of drivers:
1. Those who call dsa_switch_suspend() and dsa_switch_resume() from
their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz
2. Those who don't: all others. The above methods should be optional.
For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(),
and dsa_switch_resume() calls dsa_user_resume() -> phylink_start().
These seem good candidates for setting mac_managed_pm = true because
that is essentially its definition [1], but that does not seem to be the
biggest problem for now, and is not what this change focuses on.
Talking strictly about the 2nd category of DSA drivers here (which
do not have MAC managed PM, meaning that for their attached PHYs,
mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full),
I have noticed that the following warning from mdio_bus_phy_resume() is
triggered:
WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY &&
phydev->state != PHY_UP);
because the PHY state machine is running.
It's running as a result of a previous dsa_user_open() -> ... ->
phylink_start() -> phy_start() having been initiated by the user.
The previous mdio_bus_phy_suspend() was supposed to have called
phy_stop_machine(), but it didn't. So this is why the PHY is in state
PHY_NOLINK by the time mdio_bus_phy_resume() runs.
mdio_bus_phy_suspend() did not call phy_stop_machine() because for
phylink, the phydev->adjust_link function pointer is NULL. This seems a
technicality introduced by commit fddd91016d16 ("phylib: fix PAL state
machine restart on resume"). That commit was written before phylink
existed, and was intended to avoid crashing with consumer drivers which
don't use the PHY state machine - phylink always does, when using a PHY.
But phylink itself has historically not been developed with
suspend/resume in mind, and apparently not tested too much in that
scenario, allowing this bug to exist unnoticed for so long. Plus, prior
to the WARN_ON(), it would have likely been invisible.
This issue is not in fact restricted to type 2 DSA drivers (according to
the above ad-hoc classification), but can be extrapolated to any MAC
driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where
the issue was reported. Assuming mac_managed_pm is set correctly, a
quick search indicates the following other drivers might be affected:
$ grep -Zlr PHYLINK_NETDEV drivers/ | xargs -0 grep -L mac_managed_pm
drivers/net/ethernet/atheros/ag71xx.c
drivers/net/ethernet/microchip/sparx5/sparx5_main.c
drivers/net/ethernet/microchip/lan966x/lan966x_main.c
drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
drivers/net/ethernet/freescale/ucc_geth.c
drivers/net/ethernet/freescale/enetc/enetc_pf_common.c
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
drivers/net/ethernet/marvell/mvneta.c
drivers/net/ethernet/marvell/prestera/prestera_main.c
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/altera/altera_tse_main.c
drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c
drivers/net/ethernet/meta/fbnic/fbnic_phylink.c
drivers/net/ethernet/tehuti/tn40_phy.c
drivers/net/ethernet/mscc/ocelot_net.c
Make the existing conditions dependent on the PHY device having a
phydev->phy_link_change() implementation equal to the default
phy_link_change() provided by phylib. Otherwise, we implicitly know that
the phydev has the phylink-provided phylink_phy_change() callback, and
when phylink is used, the PHY state machine always needs to be stopped/
started on the suspend/resume path. The code is structured such that
if phydev->phy_link_change() is absent, it is only a matter of time until
the kernel crashes - no need to further complicate the test.
Thus, when PM is not managed by the MAC, the MDIO bus PM ops will treat
phylink-controlled PHYs identically to phylib-controlled PHYs that supply
an adjust_link() callback. In both cases, the MDIO bus PM ops should stop
and restart the PHY state machine.
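The resulting decision can be sketched in user space as follows. This is a simplified model and not the phylib code: struct toy_phy and every toy_* name are hypothetical, and the "existing conditions" are reduced to the presence of an adjust_link() callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_phy;
typedef void (*toy_link_change_t)(struct toy_phy *phy, bool up);
typedef void (*toy_adjust_link_t)(struct toy_phy *phy);

struct toy_phy {
	toy_link_change_t phy_link_change;
	toy_adjust_link_t adjust_link;	/* NULL when the consumer has none */
	bool mac_managed_pm;
};

/* Stand-in for the default phy_link_change() provided by phylib. */
static void toy_phylib_link_change(struct toy_phy *phy, bool up)
{
	(void)up;
	if (phy->adjust_link)
		phy->adjust_link(phy);
}

/* Stand-in for the phylink-provided phylink_phy_change() callback. */
static void toy_phylink_phy_change(struct toy_phy *phy, bool up)
{
	(void)phy;
	(void)up;
}

/* Example consumer callback for a phylib-controlled PHY. */
static void toy_adjust_link(struct toy_phy *phy)
{
	(void)phy;
}

/* Does this PHY rely on the PHY state machine? */
static bool toy_uses_state_machine(const struct toy_phy *phy)
{
	/* phylib client: keep the existing adjust_link condition */
	if (phy->phy_link_change == toy_phylib_link_change)
		return phy->adjust_link != NULL;
	/* any other callback is implicitly phylink's: always yes */
	return true;
}

/* Should the MDIO bus PM ops stop/restart the state machine? */
static bool toy_bus_pm_handles_state_machine(const struct toy_phy *phy)
{
	return !phy->mac_managed_pm && toy_uses_state_machine(phy);
}
```

A phylink-controlled PHY has a NULL adjust_link in this model, yet toy_bus_pm_handles_state_machine() still returns true for it, which is the case the old condition missed.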
[1] https://lore.kernel.org/netdev/Z-1tiW9zjcoFkhwc@shell.armlinux.org.uk/
Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Reported-by: Wei Fang <wei.fang@nxp.com>
Tested-by: Wei Fang <wei.fang@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250407094042.2155633-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In an upcoming change, mdio_bus_phy_may_suspend() will need to
distinguish a phylib-based PHY client from a phylink PHY client.
For that, it will need to compare the phydev->phy_link_change() function
pointer with the eponymous phy_link_change() provided by phylib.
To avoid forward function declarations, the default PHY link state
change method should be moved upwards. There is no functional change
associated with this patch; it only serves to reduce the noise in the
real bug fix.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250407093900.2155112-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On systems without the iptables-legacy tool, this selftest fails.
Add a check for whether iptables-legacy is available and skip the
test if it is missing.
Fixes: de9c8d848d90 ("selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test")
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250409095633.33653-1-skb99@linux.ibm.com
|
|
The link_create.flags field is currently not used for multi-uprobes, so
return -EINVAL if it is set, same as for other attach APIs.
We allow target_fd to have an arbitrary value for multi-uprobe, though,
as there are existing users (libbpf) relying on this.
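The rule described above amounts to a check of roughly this shape. It is a user-space sketch only: struct toy_link_create and toy_uprobe_multi_check() are hypothetical names, not the kernel's attribute layout or function.

```c
#include <assert.h>
#include <errno.h>

/* Toy subset of the link-create attributes discussed above. */
struct toy_link_create {
	unsigned int flags;
	int target_fd;
};

/* Reject any flags; leave target_fd unchecked for existing users. */
static int toy_uprobe_multi_check(const struct toy_link_create *attr)
{
	if (attr->flags)
		return -EINVAL;
	/* target_fd is deliberately not validated */
	return 0;
}
```

Any non-zero flags value fails in this model, while an arbitrary target_fd passes.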
Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
|
|
The link_create.flags field is currently not used for multi-kprobes, so
return -EINVAL if it is set, same as for other attach APIs.
We allow target_fd, on the other hand, to have an arbitrary value for
multi-kprobe, as there are existing users (libbpf) relying on this.
Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
|
|
Mykyta Yatsenko says:
====================
libbpf: introduce line_info and func_info getters
From: Mykyta Yatsenko <yatsenko@meta.com>
This patchset introduces new libbpf API getters that enable the retrieval
of .BTF.ext line and func info.
This change enables users to load a bpf_program directly using bpf_prog_load,
bypassing the higher-level bpf_object__load API. Providing line and
function info is essential for BPF program verification in some cases.
v3 -> v5
* Fix tests on s390x, nits.
v2 -> v3
* Return ENOTSUPP if func or line info struct size differs from the one in
uapi linux headers.
* Add selftests.
v1 -> v2
Move bpf_line_info_min and bpf_func_info_min from libbpf_internal.h to
btf.h. Did not remove _min suffix, because there already are bpf_line_info
and bpf_func_info structs in uapi/../bpf.h.
====================
Link: https://patch.msgid.link/20250408234417.452565-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Add selftests checking that the line and func info retrieved by the newly
added libbpf APIs match what the kernel returns via bpf_prog_get_info_by_fd.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250408234417.452565-3-mykyta.yatsenko5@gmail.com
|
|
Introduce new libbpf API getters for .BTF.ext func and line info,
namely:
bpf_program__func_info
bpf_program__func_info_cnt
bpf_program__line_info
bpf_program__line_info_cnt
This change enables scenarios in which the user needs to load a
bpf_program directly using `bpf_prog_load` instead of the higher-level
`bpf_object__load`. Line and func info are required for checking BTF
info in the verifier; verification may fail without these fields if, for
example, the program calls `bpf_obj_new`.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250408234417.452565-2-mykyta.yatsenko5@gmail.com
|
|
Extend commit e3c9abd0d14b ("selftests/bpf: Implement setting global
variables in veristat") to support applying presets to members of
the global structs or unions in veristat.
For example:
```
./veristat set_global_vars.bpf.o -G "union1.struct3.var_u8_h = 0xBB"
```
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250408104544.140317-1-mykyta.yatsenko5@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- Fixes tpm2, futex, and mincore tests
- Create a dedicated .gitignore for tpm2 tests
* tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/mincore: Allow read-ahead pages to reach the end of the file
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests: tpm2: create a dedicated .gitignore
|