Age | Commit message (Collapse) | Author |
|
Pidfs supports extensible and non-extensible ioctls. The extensible
ioctls need to check for the ioctl number itself not just the ioctl
command otherwise both backward- and forward compatibility are broken.
The pidfs ioctl handler also needs to look at the type of the ioctl
command to guard against cases where "[...] a daemon receives some
random file descriptor from a (potentially less privileged) client and
expects the FD to be of some specific type, it might call ioctl() on
this FD with some type-specific command and expect the call to fail if
the FD is of the wrong type; but due to the missing type check, the
kernel instead performs some action that userspace didn't expect."
(cf. [1]]
Link: https://lore.kernel.org/r/20250204-work-pidfs-ioctl-v1-1-04987d239575@kernel.org
Link: https://lore.kernel.org/r/CAG48ez2K9A5GwtgqO31u9ZL292we8ZwAA=TJwwEv7wRuJ3j4Lw@mail.gmail.com [1]
Fixes: 8ce352818820 ("pidfs: check for valid ioctl commands")
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org # v6.13; please backport with 8ce352818820 ("pidfs: check for valid ioctl commands")
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Amir Goldstein <amir73il@gmail.com> says:
The two Fix patches have been tested by Alex together and each one
independently.
I also verified that they pass the LTP inoityf/fanotify tests.
* patches from https://lore.kernel.org/r/20250203223205.861346-1-amir73il@gmail.com:
fsnotify: disable pre-content and permission events by default
fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*
Link: https://lore.kernel.org/r/20250203223205.861346-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
After introducing pre-content events, we had a regression related to
disabling huge faults on files that should never have pre-content events
enabled.
This happened because the default f_mode of allocated files (0) does
not disable pre-content events.
Pre-content events are disabled in file_set_fsnotify_mode_by_watchers()
but internal files may not get to call this helper.
Initialize f_mode to disable permission and pre-content events for all
files and if needed they will be enabled for the callers of
file_set_fsnotify_mode_by_watchers().
Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
STATMOUNT_MNT_OPTS can actually be missing if there are no options. This
is a change of behavior since 75ead69a7173 ("fs: don't let statmount return
empty strings").
The other checks shouldn't actually trigger, but add them for correctness
and for easier debugging if the test fails.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129160641.35485-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Most pseudo files are not applicable for fsnotify events at all,
let alone to the new pre-content events.
Disable notifications to all files allocated with alloc_file_pseudo()
and enable legacy inotify events for the specific cases of pipe and
socket, which have known users of inotify events.
Pre-content events are also kept disabled for sockets and pipes.
Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wi2pThSVY=zhO=ZKxViBj5QCRX-=AS2+rVknQgJnHXDFg@mail.gmail.com/
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Prepending security options was made conditional on sb->s_op->show_options,
but security options are independent of sb options.
Fixes: 056d33137bf9 ("fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()")
Fixes: f9af549d1fd3 ("fs: export mount options via statmount()")
Cc: stable@vger.kernel.org # v6.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129151253.33241-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The FMODE_NONOTIFY_* bits are a 2-bits mode. Open coding manipulation
of those bits is risky. Use an accessor file_set_fsnotify_mode() to
set the mode.
Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers()
to make way for the simple accessor name.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Andreas Gruenbacher <agruenba@redhat.com> says:
Here's an updated version with an additional comment saying that
lockref_init() initializes count to 1.
* patches from https://lore.kernel.org/r/20250130135624.1899988-1-agruenba@redhat.com:
lockref: remove count argument of lockref_init
gfs2: switch to lockref_init(..., 1)
gfs2: use lockref_init for gl_lockref
Link: https://lore.kernel.org/r/20250130135624.1899988-1-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
All users of lockref_init() now initialize the count to 1, so hardcode
that and remove the count argument.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-4-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In qd_alloc(), initialize the lockref count to 1 to cover the common
case. Compensate for that in gfs2_quota_init() by adjusting the count
back down to 0; this only occurs when mounting the filesystem rw.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-3-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the initialization of gl_lockref from gfs2_init_glock_once() to
gfs2_glock_get(). This allows to use lockref_init() there.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-2-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Just like it's normal for unset values to be zero, unset strings should be
empty instead of containing random values.
It seems to be a typical mistake that the mask returned by statmount is not
checked, which can result in various bugs.
With this fix, these bugs are prevented, since it is highly likely that
userspace would just want to turn the missing mask case into an empty
string anyway (most of the recently found cases are of this type).
Link: https://lore.kernel.org/all/CAJfpegsVCPfCn2DpM8iiYSS5DpMsLB8QBUCHecoj6s0Vxf4jzg@mail.gmail.com/
Fixes: 68385d77c05b ("statmount: simplify string option retrieval")
Fixes: 46eae99ef733 ("add statmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250130121500.113446-1-mszeredi@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Building with GCC 15 results in build error
fs/vboxsf/super.c:24:54: error: initializer-string for array of ‘unsigned char’ is too long [-Werror=unterminated-string-initialization]
24 | static const unsigned char VBSF_MOUNT_SIGNATURE[4] = "\000\377\376\375";
| ^~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Due to GCC having enabled -Werror=unterminated-string-initialization[0]
by default. Separately initializing each array element of
VBSF_MOUNT_SIGNATURE to ensure NUL termination, thus satisfying GCC 15
and fixing the build error.
[0]: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wno-unterminated-string-initialization
Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
Link: https://lore.kernel.org/r/20250121162648.1408743-1-brahmajit.xyz@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Clang static checker(scan-build) warning:
fs/stat.c:287:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
287 | stat->result_mask |= STATX_MNT_ID_UNIQUE;
| ~~~~~~~~~~~~~~~~~ ^
fs/stat.c:290:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
290 | stat->result_mask |= STATX_MNT_ID;
When vfs_getattr() failed because of security_inode_getattr(), 'stat' is
uninitialized. In this case, there is a harmless garbage problem in
vfs_statx_path(). It's better to return error directly when
vfs_getattr() failed, avoiding garbage value and more clearly.
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20250119025946.1168957-1-suhui@nfschina.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit b729cc1ec21a
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.
This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.
Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:
WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
CPU: 8 UID: 0 PID: 0 Comm: swapper/8
RIP: 0010:tmigr_requires_handle_remote
Call Trace:
<IRQ>
? tmigr_requires_handle_remote
? hrtimer_run_queues
update_process_times
tick_periodic
tick_handle_periodic
__sysvec_apic_timer_interrupt
sysvec_apic_timer_interrupt
</IRQ>
Fix the problem by taking the old root into account in the children count
of the new root so the connection is not omitted.
Also warn when more than one top level group exists to better detect
similar issues in the future.
Fixes: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
|
|
The space separator was factored out from the multiple chip name prints,
but several irq_chip::irq_print_chip() callbacks still print a leading
space. Remove the superfluous double spaces.
Fixes: 9d9f204bdf7243bf ("genirq/proc: Add missing space separator back")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/893f7e9646d8933cd6786d5a1ef3eb076d263768.1738764803.git.geert+renesas@glider.be
|
|
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Fix the build error with clamp after WARN_ON on gcc 13.x+ (Guenter)
- HDCP related fixes (Suraj)
- PMU fix zero delta busyness issue (Umesh)
- Fix page cleanup on DMA remap failure (Brian)
- Drop 64bpp YUV formats from ICL+ SDR planes (Ville)
- GuC log related fix (Daniele)
- DisplayPort related fixes (Ankit, Jani)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Z6TDHpgI6dnOc0KI@intel.com
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
UAPI Changes:
- OA uAPI related fixes (Ashutosh)
Driver Changes:
- Fix SRIOV migration initialization (Michal)
- Restore devcoredump to a sane state (Lucas)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Z6S9rI1ScT_5Aw6_@intel.com
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A couple of fixes for ivpu to error handling, komeda for format
handling, AST DP timeout fix when enabling the output, locking fix for
zynqmp DP support, tiled format handling in drm/client, and refcounting
fix for bochs
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250206-encouraging-judicious-quoll-adc1dc@houat
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.14-2025-02-05:
amdgpu:
- Add BO metadata flag for DCC
- Fix potential out of bounds access in display
- Seamless boot fix
- CONFIG_FRAME_WARN fix
- PSR1 fix
UAPI:
- Add new tiling flag for DCC write compress disable
Proposed userspace: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33255
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250205214910.3664690-1-alexander.deucher@amd.com
|
|
bch_extent_rebalance
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Spotted by sparse.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In the previous commit b3d82c2f2761, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.
However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.
So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
Fixes: b3d82c2f2761 ("bcachefs: Guard against journal seq overflow")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.
Therefore, I think it would be appropriate to fix the incorrect pointer checking
through this patch.
Fixes: 4bd06f07bcb5 ("bcachefs: Fixes for snapshot_tree.master_subvol")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add an (initial?) patch submission checklist, focusing mainly on
testing.
Yes, all patches must be tested, and that starts (but does not end) with
the patch author.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The destination argument of memtostr*() and strtomem*() must be a
fixed-size char array at compile time, so there is no need to use
__builtin_object_size() (which is useful for when an argument is
either a pointer or unknown). Instead use ARRAY_SIZE(), which has the
benefit of working around a bug in Clang (fixed[1] in 15+) that got
__builtin_object_size() wrong sometimes.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501310832.kiAeOt2z-lkp@intel.com/
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://github.com/llvm/llvm-project/commit/d8e0a6d5e9dd2311641f9a8a5d2bf90829951ddc [1]
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
In preparation for adding stricter type checking to the str/mem*()
helpers, provide a way to check that a variable is a byte array
via __must_be_byte_array().
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
The C kernel helpers for evaluating C Strings were positioned where they
were visible to assembly inclusion, which was not intended. Move them
into the kernel and C-only area of the header so future changes won't
confuse the assembler.
Fixes: d7a516c6eeae ("compiler.h: Fix undefined BUILD_BUG_ON_ZERO()")
Fixes: 559048d156ff ("string: Check for "nonstring" attribute on strscpy() arguments")
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: managing MSI-X in driver
Michal Swiatkowski says:
It is another try to allow user to manage amount of MSI-X used for each
feature in ice. First was via devlink resources API, it wasn't accepted
in upstream. Also static MSI-X allocation using devlink resources isn't
really user friendly.
This try is using more dynamic way. "Dynamic" across whole kernel when
platform supports it and "dynamic" across the driver when not.
To achieve that reuse global devlink parameter pf_msix_max and
pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount
of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for
VFs also). Having pf_msix_max allow user to statically set how many
MSI-X he wants on PF and how many should be reserved for VFs.
pf_msix_min is used to set minimum number of MSI-X with which ice driver
should probe correctly.
Meaning of this field in case of dynamic vs static allocation:
- on system with dynamic MSI-X allocation support
* alloc pf_msix_min as static, rest will be allocated dynamically
- on system without dynamic MSI-X allocation support
* try alloc pf_msix_max as static, minimum acceptable result is
pf_msix_min
As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an
probably should) be stored in NVM. This patchset isn't implementing
that.
Dynamic (kernel or driver) way means that splitting MSI-X across the
RDMA and eth in case there is a MSI-X shortage isn't correct. Can work
when dynamic is only on driver site, but can't when dynamic is on kernel
site.
Let's remove this code and move to MSI-X allocation feature by feature.
If there is no more MSI-X for a feature, a feature is working with less
MSI-X or it is turned off.
There is a regression here. With MSI-X splitting user can run RDMA and
eth even on system with not enough MSI-X. Now only eth will work. RDMA
can be turned on by changing number of PF queues (lowering) and reprobe
RDMA driver.
Example:
72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147.
We set pf_msix_min = 2, pf_msix_max = 128
OICR: 1
eth: 72
flow director: 1
RDMA: 128 - 74 = 54
We can change number of queues on pf to 36 and do devlink reinit
OICR: 1
eth: 36
RDMA: 73
flow director: 1
We can also (implemented in "ice: enable_rdma devlink param") turned
RDMA off.
OICR: 1
eth: 72
RDMA: 0 (turned off)
flow director: 1
After this changes we have a static base vector for SRIOV (SIOV probably
in the feature). Last patch from this series is simplifying managing VF
MSI-X code based on static vector.
Now changing queues using ethtool is also changing MSI-X. If there is
enough MSI-X it is always one to one. When there is not enough there
will be more queues than MSI-X.
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: init flow director before RDMA
ice: simplify VF MSI-X managing
ice: enable_rdma devlink param
ice: treat dyn_allowed only as suggestion
ice, irdma: move interrupts code to irdma
ice: get rid of num_lan_msix field
ice: remove splitting MSI-X between features
ice: devlink PF MSI-X max and min parameter
ice: count combined queues using Rx/Tx count
====================
Link: https://patch.msgid.link/20250205185512.895887-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simplify miic_parse_dt() by using the for_each_available_child_of_node()
helper instead of manually skipping unavailable child nodes.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/3e394d4cf8204bcf17b184bfda474085aa8ed0dd.1738771631.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Populate the PCS supported_interfaces bitmap with the interfaces that
this PCS supports. This makes the manual checking in miic_validate()
redundant, so remove that.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tfhYq-003aTm-Nx@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A call to rtnl_nets_destroy() is needed to release references taken on
netns put in rtnl_nets.
CC: stable@vger.kernel.org
Fixes: 636af13f213b ("rtnetlink: Register rtnl_dellink() and rtnl_setlink() with RTNL_FLAG_DOIT_PERNET_WIP.")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250205221037.2474426-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
John Daley says:
====================
enic: Use Page Pool API for receiving packets
Use the Page Pool API for RX. The Page Pool API improves bandwidth and
CPU overhead by recycling pages instead of allocating new buffers in the
driver. Also, page pool fragment allocation for smaller MTUs is used
allow multiple packets to share pages.
RX code was moved to its own file and some refactoring was done
beforehand to make the page pool changes more trasparent and to simplify
the resulting code.
====================
Link: https://patch.msgid.link/20250205235416.25410-1-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the move to using the Page Pool API for RX, rx copybreak was not
showing any improvement in host CPU overhead, latency or bandwidth so
the driver no longer makes use of the rx_copybreak setting. This patch
removes the ethtool tuneable hooks to set and get the rx copybreak since
they and now no-ops. Rx copybreak was the only tunable supported, so
remove the set and get tunable callbacks all together.
Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250205235416.25410-5-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The Page Pool API improves bandwidth and CPU overhead by recycling pages
instead of allocating new buffers in the driver. Make use of page pool
fragment allocation for smaller MTUs so that multiple packets can share
a page. For MTUs larger than PAGE_SIZE, adjust the 'order' page
parameter so that contiguous pages can be used to receive the larger
packets.
The RQ descriptor field 'os_buf' is repurposed to hold page pointers
allocated from page_pool instead of SKBs. When packets arrive, SKBs are
allocated and the page pointers are attached instead of preallocating SKBs.
'alloc_fail' netdev statistic is incremented when page_pool_dev_alloc()
fails.
Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250205235416.25410-4-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Split up RX handler functions in preparation for moving
to a page pool based implementation.
No functional changes.
Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250205235416.25410-3-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move RX handler code into its own file in preparation for further
changes. Some formatting changes were necessary in order to satisfy
checkpatch but there were no functional changes.
Co-developed-by: Nelson Escobar <neescoba@cisco.com>
Signed-off-by: Nelson Escobar <neescoba@cisco.com>
Co-developed-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: John Daley <johndale@cisco.com>
Link: https://patch.msgid.link/20250205235416.25410-2-johndale@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If an AX25 device is bound to a socket by setting the SO_BINDTODEVICE
socket option, a refcount leak will occur in ax25_release().
Commit 9fd75b66b8f6 ("ax25: Fix refcount leaks caused by ax25_cb_del()")
added decrement of device refcounts in ax25_release(). In order for that
to work correctly the refcounts must already be incremented when the
device is bound to the socket. An AX25 device can be bound to a socket
by either calling ax25_bind() or setting SO_BINDTODEVICE socket option.
In both cases the refcounts should be incremented, but in fact it is done
only in ax25_bind().
This bug leads to the following issue reported by Syzkaller:
================================================================
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 1 PID: 5932 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 1 UID: 0 PID: 5932 Comm: syz-executor424 Not tainted 6.13.0-rc4-syzkaller-00110-g4099a71718b0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:336 [inline]
refcount_dec include/linux/refcount.h:351 [inline]
ref_tracker_free+0x710/0x820 lib/ref_tracker.c:236
netdev_tracker_free include/linux/netdevice.h:4156 [inline]
netdev_put include/linux/netdevice.h:4173 [inline]
netdev_put include/linux/netdevice.h:4169 [inline]
ax25_release+0x33f/0xa10 net/ax25/af_ax25.c:1069
__sock_release+0xb0/0x270 net/socket.c:640
sock_close+0x1c/0x30 net/socket.c:1408
...
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
================================================================
Fix the implementation of ax25_setsockopt() by adding increment of
refcounts for the new device bound, and decrement of refcounts for
the old unbound device.
Fixes: 9fd75b66b8f6 ("ax25: Fix refcount leaks caused by ax25_cb_del()")
Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Link: https://patch.msgid.link/20250203091203.1744-1-m.masimov@mt-integration.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are at least two cases where napi_id may not present and the
napi_id should be elided:
1. Queues could be created, but napi_enable may not have been called
yet. In this case, there may be a NAPI but it may not have an ID and
output of a napi_id should be elided.
2. TX-only NAPIs currently do not have NAPI IDs. If a TX queue happens
to be linked with a TX-only NAPI, elide the NAPI ID from the netlink
output as a NAPI ID of 0 is not useful for users.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250205193751.297211-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the netlink type for hardware timestamp flags, which are represented
as a bitset of flags. Although only one flag is supported currently, the
correct netlink bitset type should be used instead of u32 to keep
consistency with other fields. Address this by adding a new named string
set description for the hwtstamp flag structure.
The code has been introduced in the current release so the uAPI change is
still okay.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Link: https://patch.msgid.link/20250205110304.375086-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
David Wei says:
====================
io_uring zero copy rx
This patchset contains net/ patches needed by a new io_uring request
implementing zero copy rx into userspace pages, eliminating a kernel
to user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetime of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.
This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.
In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.
===========
Performance
===========
Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow
With application thread + net rx softirq pinned to _different_ cores:
+-------------------------------+
| epoll | io_uring |
|-----------|-------------------|
| 82.2 Gbps | 116.2 Gbps (+41%) |
+-------------------------------+
Pinned to _same_ core:
+-------------------------------+
| epoll | io_uring |
|-----------|-------------------|
| 62.6 Gbps | 80.9 Gbps (+29%) |
+-------------------------------+
=====
Links
=====
Broadcom bnxt support:
[1]: https://lore.kernel.org/20241003160620.1521626-8-ap420073@gmail.com
Linux kernel branch including io_uring bits:
[2]: https://github.com/isilence/linux.git zcrx/v13
liburing for testing:
[3]: https://github.com/isilence/liburing.git zcrx/next
kperf for testing:
[4]: https://git.kernel.dk/kperf.git
====================
Link: https://patch.msgid.link/20250204215622.695511-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add helpers that properly prep or remove a memory provider for an rx
queue then restart the queue.
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-11-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add helpers for memory providers to interact with page pools.
net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov
with a page pool. If used, the memory provider is responsible to match
"set" calls with "clear" once a net_iov is not going to be used by a page
pool anymore, changing a page pool, etc.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-10-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is a good bunch of places in generic paths assuming that the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch it up and explicitly check
the provider type when we branch into devmem TCP code.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-9-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep
the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of
devmem sticking directly out of the genetic path, add a mp function.
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-8-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a mandatory callback that prints information about the memory
provider to netlink.
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-7-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a nested attribute for io_uring memory provider info. For now it is
empty and its presence indicates that a particular page pool or queue
has an io_uring memory provider attached.
$ ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get
[{'id': 80,
'ifindex': 2,
'inflight': 64,
'inflight-mem': 262144,
'napi-id': 525},
{'id': 79,
'ifindex': 2,
'inflight': 320,
'inflight-mem': 1310720,
'io_uring': {},
'napi-id': 525},
...
$ ./cli.py --spec netlink/specs/netdev.yaml --dump queue-get
[{'id': 0, 'ifindex': 1, 'type': 'rx'},
{'id': 0, 'ifindex': 1, 'type': 'tx'},
{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'},
{'id': 1, 'ifindex': 2, 'napi-id': 514, 'type': 'rx'},
...
{'id': 12, 'ifindex': 2, 'io_uring': {}, 'napi-id': 525, 'type': 'rx'},
...
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-6-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A spin off from the original page pool memory providers patch by Jakub,
which allows extending page pools with custom allocators. One of such
providers is devmem TCP, and the other is io_uring zerocopy added in
following patches.
Link: https://lore.kernel.org/netdev/20230707183935.997267-7-kuba@kernel.org/
Co-developed-by: Jakub Kicinski <kuba@kernel.org> # initial mp proposal
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-5-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov to point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250204215622.695511-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|