Age | Commit message (Collapse) | Author |
|
To ease backports in setup.c, let's group changes from
__lsm_ro_after_init to __ro_after_init with commit f22f9aaf6c3d
("selinux: remove the runtime disable functionality"), and the
landlock_lsmid addition with commit f3b8788cde61 ("LSM: Identify modules
by more than name").
That will help to backport the following errata.
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250318161443.279194-2-mic@digikod.net
Fixes: f3b8788cde61 ("LSM: Identify modules by more than name")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
|
|
In case the fib match is used from the input hook we can avoid the fib
lookup if early demux assigned a socket for us: check that the input
interface matches sk-cached one.
Rework the existing 'lo bypass' logic to first check sk, then
for loopback interface type to elide the fib lookup.
This speeds up fib matching a little, before:
93.08 GBit/s (no rules at all)
75.1 GBit/s ("fib saddr . iif oif missing drop" in prerouting)
75.62 GBit/s ("fib saddr . iif oif missing drop" in input)
After:
92.48 GBit/s (no rules at all)
75.62 GBit/s (fib rule in prerouting)
90.37 GBit/s (fib rule in input).
Numbers for the 'no rules' and 'prerouting' are expected to
closely match in-between runs, the 3rd/input test case exercises the
the 'avoid lookup if cached ifindex in sk matches' case.
Test used iperf3 via veth interface, lo can't be used due to existing
loopback test.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This change removes a redundant null check found by Smatch.
Fixes: 403a0293f1c2 ("mmc: core: Add open-ended Ext memory addressing")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-mmc/345be6cd-f2f3-472e-a897-ca4b7c4cf826@stanley.mountain/
Signed-off-by: Avri Altman <avri.altman@sandisk.com>
Link: https://lore.kernel.org/r/20250319203642.778016-1-avri.altman@sandisk.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over
event channels, to prevent PCI code from attempting to toggle the maskbit,
as it's Xen that controls the bit.
However, the pci_msi_ignore_mask being global will affect devices that use
MSI interrupts but are not routing those interrupts over event channels
(not using the Xen pIRQ chip). One example is devices behind a VMD PCI
bridge. In that scenario the VMD bridge configures MSI(-X) using the
normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
bridge configure the MSI entries using indexes into the VMD bridge MSI
table. The VMD bridge then demultiplexes such interrupts and delivers to
the destination device(s). Having pci_msi_ignore_mask set in that scenario
prevents (un)masking of MSI entries for devices behind the VMD bridge.
Move the signaling of no entry masking into the MSI domain flags, as that
allows setting it on a per-domain basis. Set it for the Xen MSI domain
that uses the pIRQ chip, while leaving it unset for the rest of the
cases.
Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and
with Xen dropping usage the variable is unneeded.
This fixes using devices behind a VMD bridge on Xen PV hardware domains.
Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean
Linux cannot use them. By inhibiting the usage of
VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask
bodge devices behind a VMD bridge do work fine when use from a Linux Xen
hardware domain. That's the whole point of the series.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-ID: <20250219092059.90850-4-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
MSI remapping bypass (directly configuring MSI entries for devices on the
VMD bus) won't work under Xen, as Xen is not aware of devices in such bus,
and hence cannot configure the entries using the pIRQ interface in the PV
case, and in the PVH case traps won't be setup for MSI entries for such
devices.
Until Xen is aware of devices in the VMD bus prevent the
VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as
any kind of Xen guest.
The MSI remapping bypass is an optional feature of VMD bridges, and hence
when running under Xen it will be masked and devices will be forced to
redirect its interrupts from the VMD bridge. That mode of operation must
always be supported by VMD bridges and works when Xen is not aware of
devices behind the VMD bridge.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-ID: <20250219092059.90850-3-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
around compiler segfault
Due to pending percpu improvements in -next, GCC9 and GCC10 are
crashing during the build with:
lib/zstd/compress/huf_compress.c:1033:1: internal compiler error: Segmentation fault
1033 | {
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-9/README.Bugs> for instructions.
The DYNAMIC_BMI2 feature is a known-challenging feature of
the ZSTD library, with an existing GCC quirk turning it off
for GCC versions below 4.8.
Increase the DYNAMIC_BMI2 version cutoff to GCC 11.0 - GCC 10.5
is the last version known to crash.
Reported-by: Michael Kelley <mhklinux@outlook.com>
Debugged-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: https://lore.kernel.org/r/SN6PR02MB415723FBCD79365E8D72CA5FD4D82@SN6PR02MB4157.namprd02.prod.outlook.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Clang does not tolerate the use of non-TLS symbols for the per-CPU stack
protector very well, and to work around this limitation, the symbol
passed via the -mstack-protector-guard-symbol= option is never defined
in C code, but only in the linker script, and it is exported from an
assembly file. This is necessary because Clang will fail to generate the
correct %GS based references in a compilation unit that includes a
non-TLS definition of the guard symbol being used to store the stack
cookie.
This problem is only triggered by symbol definitions, not by
declarations, but nonetheless, the declaration in <asm/asm-prototypes.h>
is conditional on __GENKSYMS__ being #define'd, so that only genksyms
will observe it, but for ordinary compilation, it will be invisible.
This is causing problems with the genksyms alternative gendwarfksyms,
which does not #define __GENKSYMS__, does not observe the symbol
declaration, and therefore lacks the information it needs to version it.
Adding the #define creates problems in other places, so that is not a
straight-forward solution. So take the easy way out, and drop the
conditional on __GENKSYMS__, as this is not really needed to begin with.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250320213238.4451-2-ardb@kernel.org
|
|
The current hypercall interface for doing PCI device operations always uses
a segment field that has a 16 bit width. However on Linux there are buses
like VMD that hook up devices into the PCI hierarchy at segment >= 0x10000,
after the maximum possible segment enumerated in ACPI.
Attempting to register or manage those devices with Xen would result in
errors at best, or overlaps with existing devices living on the truncated
equivalent segment values. Note also that the VMD segment numbers are
arbitrarily assigned by the OS, and hence there would need to be some
negotiation between Xen and the OS to agree on how to enumerate VMD
segments and devices behind them.
Skip notifying Xen about those devices. Given how VMD bridges can
multiplex interrupts on behalf of devices behind them there's no need for
Xen to be aware of such devices for them to be usable by Linux.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250219092059.90850-2-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Mounting a corrupted filesystem with directory which contains '.' dir
entry with rec_len == block size results in out-of-bounds read (later
on, when the corrupted directory is removed).
ext4_empty_dir() assumes every ext4 directory contains at least '.'
and '..' as directory entries in the first data block. It first loads
the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry()
and then uses its rec_len member to compute the location of '..' dir
entry (in ext4_next_entry). It assumes the '..' dir entry fits into the
same data block.
If the rec_len of '.' is precisely one block (4KB), it slips through the
sanity checks (it is considered the last directory entry in the data
block) and leaves "struct ext4_dir_entry_2 *de" point exactly past the
memory slot allocated to the data block. The following call to
ext4_check_dir_entry() on new value of de then dereferences this pointer
which results in out-of-bounds mem access.
Fix this by extending __ext4_check_dir_entry() to check for '.' dir
entries that reach the end of data block. Make sure to ignore the phony
dir entries for checksum (by checking name_len for non-zero).
Note: This is reported by KASAN as use-after-free in case another
structure was recently freed from the slot past the bound, but it is
really an OOB read.
This issue was found by syzkaller tool.
Call Trace:
[ 38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710
[ 38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375
[ 38.595158]
[ 38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1
[ 38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 38.595304] Call Trace:
[ 38.595308] <TASK>
[ 38.595311] dump_stack_lvl+0xa7/0xd0
[ 38.595325] print_address_description.constprop.0+0x2c/0x3f0
[ 38.595339] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595349] print_report+0xaa/0x250
[ 38.595359] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595368] ? kasan_addr_to_slab+0x9/0x90
[ 38.595378] kasan_report+0xab/0xe0
[ 38.595389] ? __ext4_check_dir_entry+0x67e/0x710
[ 38.595400] __ext4_check_dir_entry+0x67e/0x710
[ 38.595410] ext4_empty_dir+0x465/0x990
[ 38.595421] ? __pfx_ext4_empty_dir+0x10/0x10
[ 38.595432] ext4_rmdir.part.0+0x29a/0xd10
[ 38.595441] ? __dquot_initialize+0x2a7/0xbf0
[ 38.595455] ? __pfx_ext4_rmdir.part.0+0x10/0x10
[ 38.595464] ? __pfx___dquot_initialize+0x10/0x10
[ 38.595478] ? down_write+0xdb/0x140
[ 38.595487] ? __pfx_down_write+0x10/0x10
[ 38.595497] ext4_rmdir+0xee/0x140
[ 38.595506] vfs_rmdir+0x209/0x670
[ 38.595517] ? lookup_one_qstr_excl+0x3b/0x190
[ 38.595529] do_rmdir+0x363/0x3c0
[ 38.595537] ? __pfx_do_rmdir+0x10/0x10
[ 38.595544] ? strncpy_from_user+0x1ff/0x2e0
[ 38.595561] __x64_sys_unlinkat+0xf0/0x130
[ 38.595570] do_syscall_64+0x5b/0x180
[ 38.595583] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Mahmoud Adam <mngyadam@amazon.com>
Cc: stable@vger.kernel.org
Cc: security@kernel.org
Link: https://patch.msgid.link/b3ae36a6794c4a01944c7d70b403db5b@amazon.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
A user complained that a message such as:
EXT4-fs (nvme0n1p3): re-mounted UUID ro. Quota mode: none.
implied that the file system was previously mounted read/write and was
now remounted read-only, when it could have been some other mount
state that had changed by the "mount -o remount" operation. Fix this
by only logging "ro"or "r/w" when it has changed.
https://bugzilla.kernel.org/show_bug.cgi?id=219132
Signed-off-by: Nicolas Bretz <bretznic@gmail.com>
Link: https://patch.msgid.link/20250319171011.8372-1-bretznic@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The error out label of file_modified() should be out_inode_lock in
ext4_fallocate().
Fixes: 2890e5e0f49e ("ext4: move out common parts into ext4_fallocate()")
Reported-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250319023557.2785018-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently, outside error paths, we auto commit the super block after 1
hour has passed and 16MB worth of updates have been written since last
commit. This is a policy decision so make this tunable while keeping the
defaults same. This is useful if user wants to tweak the superblock
behavior or for debugging the codepath by allowing to trigger it more
frequently.
We can now tweak the super block update using sb_update_sec and
sb_update_kb files in /sys/fs/ext4/<dev>/
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/950fb8c9b2905620e16f02a3b9eeea5a5b6cb87e.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:
(umount)
ext4_kill_sb
kill_block_super
generic_shutdown_super
sync_filesystem /* commits all txns */
evict_inodes
/* might start a new txn */
ext4_put_super
flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
jbd2_journal_destroy
journal_kill_thread
journal->j_flags |= JBD2_UNMOUNT;
jbd2_journal_commit_transaction
jbd2_journal_get_descriptor_buffer
jbd2_journal_bmap
ext4_journal_bmap
ext4_map_blocks
...
ext4_inode_error
ext4_handle_error
schedule_work(&sbi->s_sb_upd_work)
/* work queue kicks in */
update_super_work
jbd2_journal_start
start_this_handle
BUG_ON(journal->j_flags &
JBD2_UNMOUNT)
Hence, introduce a new mount flag to indicate journal is destroying and only do
a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
fallback to an un-journaled commit.
Further, in the journal destroy path, we have the following sequence:
1. Set mount flag indicating journal is destroying
2. force a commit and wait for it
3. flush pending sb updates
This sequence is important as it ensures that, after this point, there is no sb
update that might be journaled so it is safe to update the sb outside the
journal. (To avoid race discussed in 2d01ddc86606)
Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/9613c465d6ff00cd315602f99283d5f24018c3f7.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Define an ext4 wrapper over jbd2_journal_destroy to make sure we
have consistent behavior during journal destruction. This will also
come useful in the next patch where we add some ext4 specific logic
in the destroy path.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/c3ba78c5c419757e6d5f2d8ebb4a8ce9d21da86a.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
sizeof(char) evaluates to 1. Remove the churn.
Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20250316-ext4-hash-kcalloc-v2-1-2a99e93ec6e0@ethancedwards.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When the filesystem performs file or filesystem synchronization (e.g.,
ext4_sync_file()), it queries the journal to determine whether to flush
the file device through jbd2_trans_will_send_data_barrier(). If the
target transaction has not started committing, it assumes that the
journal will submit the flush command, allowing the filesystem to bypass
a redundant flush command. However, this assumption is not always valid.
If the journal is not located on the filesystem device, the journal
commit thread will not submit the flush command unless the variable
->t_need_data_flush is set to 1. Consequently, the flush may be missed,
and data may be lost following a power failure or system crash, even if
the synchronization appears to succeed.
Unfortunately, we cannot determine with certainty whether the target
transaction will flush to the filesystem device before it commits.
However, if it has not started committing, it must be the running
transaction. Therefore, fix it by always set its t_need_data_flush to 1,
ensuring that the committing thread will flush the filesystem device.
Fixes: bbd2be369107 ("jbd2: Add function jbd2_trans_will_send_data_barrier()")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241206111327.4171337-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Pull drm fixes from Dave Airlie:
"Just the usual spread of a bunch for amdgpu, and small changes to
others.
scheduler:
- fix fence reference leak
xe:
- Fix for an error if exporting a dma-buf multiple time
amdgpu:
- Fix video caps limits on several asics
- SMU 14.x fixes
- GC 12 fixes
- eDP fixes
- DMUB fix
amdkfd:
- GC 12 trap handler fix
- GC 7/8 queue validation fix
radeon:
- VCE IB parsing fix
v3d:
- fix job error handling bugs
qaic:
- fix two integer overflows
host1x:
- fix NULL domain handling"
* tag 'drm-fixes-2025-03-21' of https://gitlab.freedesktop.org/drm/kernel: (21 commits)
drm/xe: Fix exporting xe buffers multiple times
gpu: host1x: Do not assume that a NULL domain means no DMA IOMMU
drm/amdgpu/pm: Handle SCLK offset correctly in overdrive for smu 14.0.2
drm/amd/display: Fix incorrect fw_state address in dmub_srv
drm/amd/display: Use HW lock mgr for PSR1 when only one eDP
drm/amd/display: Fix message for support_edp0_on_dp1
drm/amdkfd: Fix user queue validation on Gfx7/8
drm/amdgpu: Restore uncached behaviour on GFX12
drm/amdgpu/gfx12: correct cleanup of 'me' field with gfx_v12_0_me_fini()
drm/amdkfd: Fix instruction hazard in gfx12 trap handler
drm/amdgpu/pm: wire up hwmon fan speed for smu 14.0.2
drm/amd/pm: add unique_id for gfx12
drm/amdgpu: Remove JPEG from vega and carrizo video caps
drm/amdgpu: Fix JPEG video caps max size for navi1x and raven
drm/amdgpu: Fix MPEG2, MPEG4 and VC1 video caps max size
drm/radeon: fix uninitialized size issue in radeon_vce_cs_parse()
accel/qaic: Fix integer overflow in qaic_validate_req()
accel/qaic: Fix possible data corruption in BOs > 2G
drm/v3d: Set job pointer to NULL when the job's fence has an error
drm/v3d: Don't run jobs that have errors flagged in its fence
...
|
|
Pull smb client fix from Steve French:
"smb3 client reconnect fix"
* tag 'v6.14-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: don't retry IO on failed negprotos with soft mounts
|
|
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.
This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.
As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For loop target, write cache isn't enabled, and each write isn't be
marked as DSYNC too.
Fix it by enabling write cache, meantime fix FLUSH implementation
by not taking LBA range into account, and there isn't such info
for FLUSH command.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250321004758.152572-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.14-2025-03-20:
amdgpu:
- Fix video caps limits on several asics
- SMU 14.x fixes
- GC 12 fixes
- eDP fixes
- DMUB fix
amdkfd:
- GC 12 trap handler fix
- GC 7/8 queue validation fix
radeon:
- VCE IB parsing fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250320210800.1358992-1-alexander.deucher@amd.com
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix for an error if exporting a dma-buf multiple time (Tomasz)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Z9xalLaCWsNbh0P0@fedora
|
|
ssh://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A sched fence reference leak fix, two fence fixes for v3d, two overflow
fixes for quaic, and a iommu handling fix for host1x.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250320-valiant-outstanding-nightingale-e9acae@houat
|
|
for-6.15/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.15
- Secure concatenation for TCP transport (Hannes)
- Multipath sysfs visibility (Nilay)
- Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
- Correct use of 64-bit BARs for pci-epf target (Niklas)
- Socket fix for selinux when used in containers (Peijie)"
* tag 'nvme-6.15-2025-03-20' of git://git.infradead.org/nvme: (22 commits)
nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
nvme-tcp: fix selinux denied when calling sock_sendmsg
nvmet: pci-epf: Always configure BAR0 as 64-bit
nvmet: Remove duplicate uuid_copy
nvme: zns: Simplify nvme_zone_parse_entry()
nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
nvmet-fc: Remove unused functions
nvme-pci: remove stale comment
nvme-fc: Utilise min3() to simplify queue count calculation
nvme-multipath: Add visibility for queue-depth io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for round-robin io-policy
nvmet: add tls_concat and tls_key debugfs entries
nvmet-tcp: support secure channel concatenation
nvmet: Add 'sq' argument to alloc_ctrl_args
nvme-fabrics: reset admin connection for secure concatenation
nvme-tcp: request secure channel concatenation
nvme-keyring: add nvme_tls_psk_refresh()
nvme: add nvme_auth_derive_tls_psk()
nvme: add nvme_auth_generate_digest()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- fix missing clear bdr in check_ram_in_range_map() (Baochen Qiang)
* tag 'dma-mapping-6.14-2025-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: fix missing clear bdr in check_ram_in_range_map()
|
|
This patch replaces max(a, min(b, c)) by clamp(val, lo, hi) in the nvme
driver. The clamp() macro explicitly expresses the intent of constraining
a value within bounds, improving code readability.
Signed-off-by: Li Haoran <li.haoran7@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In a SELinux enabled kernel, socket_create() initializes the security
label of the socket using the security label of the calling process,
this typically works well.
However, in a containerized environment like Kubernetes, problem arises
when a privileged container(domain spc_t) connects to an NVMe target and
mounts the NVMe as persistent storage for unprivileged containers(domain
container_t).
This is because the container_t domain cannot access resources labeled
with spc_t, resulting in socket_sendmsg returning -EACCES.
The solution is to use socket_create_kern() instead of socket_create(),
which labels the socket context to kernel_t. Access control will then
be handled by the VFS layer rather than the socket itself.
Signed-off-by: Peijie Shao <shaopeijie@cestc.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
NVMe PCIe Transport Specification 1.1, section 2.1.10, claims that the
BAR0 type is Implementation Specific.
However, in NVMe 1.1, the type is required to be 64-bit.
Thus, to make our PCI EPF work on as many host systems as possible,
always configure the BAR0 type to be 64-bit.
In the rare case that the underlying PCI EPC does not support configuring
BAR0 as 64-bit, the call to pci_epc_set_bar() will fail, and we will
return a failure back to the user.
This should not be a problem, as most PCI EPCs support configuring a BAR
as 64-bit (and those EPCs with .only_64bit set to true in epc_features
only support configuring the BAR as 64-bit).
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
We do uuid_copy twice in nvmet_alloc_ctrl so this patch deletes one
of the calls.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Instead of passing a pointer to a struct nvme_ctrl and a pointer to a
struct nvme_ns_head as the first two arguments of
nvme_zone_parse_entry(), pass only a pointer to a struct nvme_ns as both
the controller structure and ns head structure can be infered from the
namespace structure.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
'destroy_workqueue()' already drains the queue before destroying it, so
there is no need to flush it explicitly.
Remove the redundant 'flush_workqueue()' calls.
This was generated with coccinelle:
@@
expression E;
@@
- flush_workqueue(E);
destroy_workqueue(E);
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The functions nvmet_fc_iodnum() and nvmet_fc_fodnum() are currently
unutilized.
Following commit c53432030d86 ("nvme-fabrics: Add target support for FC
transport"), which introduced these two functions, they have not been
used at all in practice.
Remove them to resolve the compiler warnings.
Fix follow errors with clang-19 when W=1e:
drivers/nvme/target/fc.c:177:1: error: unused function 'nvmet_fc_iodnum' [-Werror,-Wunused-function]
177 | nvmet_fc_iodnum(struct nvmet_fc_ls_iod *iodptr)
| ^~~~~~~~~~~~~~~
drivers/nvme/target/fc.c:183:1: error: unused function 'nvmet_fc_fodnum' [-Werror,-Wunused-function]
183 | nvmet_fc_fodnum(struct nvmet_fc_fcp_iod *fodptr)
| ^~~~~~~~~~~~~~~
2 errors generated.
make[8]: *** [scripts/Makefile.build:207: drivers/nvme/target/fc.o] Error 1
make[7]: *** [scripts/Makefile.build:465: drivers/nvme/target] Error 2
make[6]: *** [scripts/Makefile.build:465: drivers/nvme] Error 2
make[6]: *** Waiting for unfinished jobs....
Fixes: c53432030d86 ("nvme-fabrics: Add target support for FC transport")
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The ns variable has been removed in commit 62451a2b2e7e ("nvme: separate
command prep and issue"). Drop reference to ns in comment.
Fixes: 62451a2b2e7e ("nvme: separate command prep and issue")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Refactor nvme_fc_create_io_queues() and nvme_fc_recreate_io_queues() to
use the min3() macro to find the minimum between 3 values instead of
multiple min()'s. This shortens the code and makes it easier to read.
Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This patch helps add nvme native multipath visibility for queue-depth
io-policy. It adds a new attribute file named "queue_depth" under
namespace device path node which would print the number of active/
in-flight I/O requests currently queued for the given path.
For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head block node of the shared namespace
would show the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For queue-depth io-policy we can then refer the "queue_depth"
attribute file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth
518
$cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth
504
>From the above output, we can infer that I/O workload targeted at nvme1n1
uses two paths nvme1c1n1 and nvme1c3n1 and the current queue depth of each
path is 518 and 504 respectively. Reading "queue_depth" file when
configured io-policy is anything but queue-depth would show no output.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This patch helps add nvme native multipath visibility for numa io-policy.
It adds a new attribute file named "numa_nodes" under namespace gendisk
device path node which prints the list of numa nodes preferred by the
given namespace path. The numa nodes value is comma delimited list of
nodes or A-B range of nodes.
For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head node of the shared namespace would
show the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and this namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For numa io-policy we can then refer the "numa_nodes" attribute
file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
0-1
$ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes
2-3
>From the above output, we infer that I/O workload targeted at nvme1n1
and running on numa nodes 0 and 1 would prefer using path nvme1c1n1.
Similarly, I/O workload running on numa nodes 2 and 3 would prefer
using path nvme1c3n1. Reading "numa_nodes" file when configured
io-policy is anything but numa would show no output.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This patch helps add nvme native multipath visibility for round-robin
io-policy. It creates a "multipath" sysfs directory under head gendisk
device node directory and then from "multipath" directory it adds a link
to each namespace path device the head node refers.
For instance, if we have a shared namespace accessible from two different
controllers/paths then we create a soft link to each path device from head
disk node as shown below:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths.
For round-robin I/O policy, we could easily infer from the above output
that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
and nvme1c3n1.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add debugfs entries to display the 'concat' and 'tls_key' controller
attributes.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Evaluate the SC_C flag during DH-CHAP-HMAC negotiation to check if secure
concatenation as specified in the NVMe Base Specification v2.1, section
8.3.4.3: "Secure Channel Concatenationand" is requested. If requested the
generated PSK is inserted into the keyring once negotiation has finished
allowing for an encrypted connection once the admin queue is restarted.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
For secure concatenation the result of the TLS handshake will be
stored in the 'sq' struct, so add it to the alloc_ctrl_args struct.
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When secure concatenation is requested the connection needs to be
reset to enable TLS encryption on the new cnnection.
That implies that the original connection used for the DH-CHAP
negotiation really shouldn't be used, and we should reset as soon
as the DH-CHAP negotiation has succeeded on the admin queue.
Based on an idea from Sagi.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a fabrics option 'concat' to request secure channel concatenation as
specified the NVME Base Specification v2.1, section 8.3.4.3: Secure Channel
Concatenation.
When secure channel concatenation is enabled a 'generated PSK' is inserted
into the keyring such that it's available after reset.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a function to refresh a generated PSK in the specified keyring.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a function to derive the TLS PSK as specified TP8018.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a function to calculate the PSK digest as specified in TP8018.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a function to generate a NVMe PSK from the shared credentials
negotiated by DH-HMAC-CHAP.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Separate out the HKDF functions into a separate module to
to make them available to other callers.
And add a testsuite to the module with test vectors
from RFC 5869 (and additional vectors for SHA384 and SHA512)
to ensure the integrity of the algorithm.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Some user decides test result by exit code only, and wouldn't like to be
bothered by the test result.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250320013743.4167489-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_drv may be built-in, so don't show modprobe failure, and we
do check `/dev/ublk-control` for skipping test if ublk_drv isn't
enabled.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250320013743.4167489-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add one dependency helper which can include new uapi definition which
isn't synced from kernel.
This way also helps a lot for downstream test deployment.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250320013743.4167489-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|