|
Add kernel-doc for the externally linked dc_stream_remove_writeback() function.
Signed-off-by: James Flowers <bold.zone2373@fastmail.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write. If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.
The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size). We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.
The actual software atomic write max is still computed based on
tr_atomic_ioend, the same way it has been for the past few commits. Note
also that xfs_calc_atomic_write_log_geometry is non-static because mkfs
will need it.
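As a sketch, the validation might look like the following (illustrative
only: xfs_calc_atomic_write_log_geometry is named by this commit, but
its signature here is guessed and the other identifiers are
hypothetical stand-ins):

int
xfs_set_max_atomic_write_opt(
	struct xfs_mount	*mp,
	xfs_filblks_t		new_max)
{
	unsigned int		new_min_logblocks;
	unsigned int		logres;

	/* Too big for the filesystem geometry (AG/rtgroup size)? */
	if (new_max > xfs_atomic_write_group_limit(mp))	/* hypothetical */
		return -EINVAL;

	/* Recompute the tr_atomic_write reservation for this size... */
	xfs_calc_atomic_write_log_geometry(mp, new_max, &new_min_logblocks,
			&logres);

	/* ...and require the current log to satisfy the new minimum. */
	if (mp->m_sb.sb_logblocks < new_min_logblocks)
		return -EINVAL;

	mp->m_awu_max_bytes = XFS_FSB_TO_B(mp, new_max);	/* hypothetical */
	return 0;
}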
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
|
|
Update the limits returned from xfs_get_atomic_write_{min,max,max_opt}().
No reflink support always means no CoW-based atomic writes.
For xfs_get_atomic_write_min(), we support only 1x blocksize, and that
depends on HW or reflink support.
For xfs_get_atomic_write_max(), with no reflink we are limited to 1x
blocksize, and only with HW support. Otherwise we are limited to the
combined limit in mp->m_atomic_write_unit_max.
For xfs_get_atomic_write_max_opt(), we are ultimately limited by the
bdev atomic write limit. If xfs_get_atomic_write_max() does not report
> 1x blocksize, then just continue to report 0 as before.
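A sketch of the resulting helper logic (illustrative; the helper names
follow this series, the bodies and units are paraphrased):

static unsigned int
xfs_get_atomic_write_min(struct xfs_inode *ip)
{
	struct xfs_mount	*mp = ip->i_mount;

	/* 1x block needs either HW support or reflink for the CoW path. */
	if (xfs_inode_can_hw_atomic_write(ip) || xfs_has_reflink(mp))
		return mp->m_sb.sb_blocksize;
	return 0;
}

static unsigned int
xfs_get_atomic_write_max(struct xfs_inode *ip)
{
	struct xfs_mount	*mp = ip->i_mount;

	/* No reflink: HW support caps us at 1x block, else nothing. */
	if (!xfs_has_reflink(mp))
		return xfs_inode_can_hw_atomic_write(ip) ?
				mp->m_sb.sb_blocksize : 0;

	/* Otherwise the combined software/hardware limit applies. */
	return XFS_FSB_TO_B(mp, mp->m_atomic_write_unit_max);
}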
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.
The limit of a CoW-based atomic write is determined by the number of
log items which can fit into a single transaction.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
Function xfs_atomic_write_logitems() is added to find the number of log
items which can fit in a single transaction.
Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
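The greatest power-of-two factor of the agsize is simply its lowest set
bit, which a sketch in plain C can compute as:

#include <stdint.h>

/* Largest power of two dividing x: isolate the lowest set bit. */
static inline uint32_t max_pow_of_two_factor(uint32_t x)
{
	return x & -x;
}

For example, an agsize of 0x65000 blocks yields a factor of 0x1000, so
the atomic write size is capped at 0x1000 blocks and allocations stay
aligned within the AG.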
[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
Now HW offload is no longer required to support atomic writes and is
an optional feature.
CoW-based atomic writes will be supported with out-of-place writes and
atomic extent remapping.
Either mode of operation may be used for an atomic write, depending on the
circumstances.
The preferred method is HW offload as it will be faster. If HW offload is
not available then we always use the CoW-based method. If HW offload is
available but not possible to use, then again we use the CoW-based method.
Even when available, HW offload would not be possible when the write
length exceeds the HW offload limit, when the write spans multiple
extents, when disk blocks are unaligned, etc.
Apart from the write exceeding the HW offload limit, other conditions for
HW offload usage can only be detected in the iomap handling for the write.
As such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. The special
code -ENOPROTOOPT is returned from ->iomap_begin() to signal that HW
offload is not possible.
In other words, atomic writes are supported on any filesystem that can
perform out of place write remapping atomically (i.e. reflink) up to
some fairly large size. If the conditions are right (a single correctly
aligned overwrite mapping) then the filesystem will use any available
hardware support to avoid the filesystem metadata updates.
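The resulting dispatch might be sketched as follows (illustrative:
iomap_dio_rw() is the real iomap entry point, but the iomap_ops names
are hypothetical stand-ins for the two paths):

static ssize_t
xfs_file_dio_write_atomic(struct kiocb *iocb, struct iov_iter *from)
{
	const struct iomap_ops	*ops = &xfs_atomic_hw_iomap_ops;
	ssize_t			ret;

retry:
	ret = iomap_dio_rw(iocb, from, ops, &xfs_dio_write_ops, 0, NULL, 0);

	/* ->iomap_begin() says HW offload can't do this write: use CoW. */
	if (ret == -ENOPROTOOPT && ops == &xfs_atomic_hw_iomap_ops) {
		ops = &xfs_atomic_cow_iomap_ops;
		goto retry;
	}
	return ret;
}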
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps (e.g. with 4KB
blocks), which is quite small.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Once large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.
Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
sparse or mixed-mapping extents
- the write length is greater than HW offload can support
- no hardware support at all
In those cases, we need to fall back to the CoW-based atomic write mode.
For this, report the special code -ENOPROTOOPT to inform the caller that
the HW offload-based method is not possible.
In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate fewer blocks than required for the write, then again the HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.
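A condensed sketch of the checks (illustrative; the helper name is
hypothetical, the conditions mirror the list above):

static int
xfs_atomic_write_hw_possible(const struct xfs_bmbt_irec *imap,
		xfs_fileoff_t offset_fsb, xfs_filblks_t count_fsb,
		xfs_filblks_t hw_max_fsb)
{
	/* no hardware support at all */
	if (!hw_max_fsb)
		return -ENOPROTOOPT;

	/* write length greater than HW offload can support */
	if (count_fsb > hw_max_fsb)
		return -ENOPROTOOPT;

	/* mapping must cover the full write: no sparse/mixed mappings */
	if (imap->br_startoff > offset_fsb ||
	    imap->br_startoff + imap->br_blockcount <
	    offset_fsb + count_fsb)
		return -ENOPROTOOPT;

	/* unwritten extents would need conversion, so take the CoW path */
	if (imap->br_state != XFS_EXT_NORM)
		return -ENOPROTOOPT;

	return 0;
}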
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.
Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.
The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
- if part or full extent is found, proceed to process found extent
- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return
The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.
Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.
Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().
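A condensed sketch of that flow (illustrative; xfs_find_cow_mapping(),
xfs_alloc_cow_blocks() and xfs_convert_cow_unwritten() are hypothetical
stand-ins for the steps above, and the iomap sequence cookie is elided):

static int
xfs_atomic_write_cow_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_bmbt_irec	cmap;
	int			error = 0;

	xfs_ilock(ip, XFS_ILOCK_EXCL);		/* the noted bottleneck */

	/* find an extent mapping in the CoW fork for the range... */
	if (!xfs_find_cow_mapping(ip, offset, length, &cmap)) {
		/* ...or map new blocks into the CoW fork */
		error = xfs_alloc_cow_blocks(ip, offset, length, &cmap);
		if (error)
			goto out_unlock;
	}

	/* convert unwritten blocks in the extent if required */
	error = xfs_convert_cow_unwritten(ip, &cmap);
	if (error)
		goto out_unlock;

	/* update the iomap extent mapping and return */
	error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
			IOMAP_F_SHARED, 0);
out_unlock:
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	return error;
}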
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Currently the size of atomic write allowed is fixed at the blocksize.
To start to lift this restriction, partly refactor
xfs_report_atomic_write() into helpers -
xfs_get_atomic_write_{min,max}() - and use those helpers to find the
per-inode atomic write limits and check against them.
Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.
This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.
This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Currently only HW which can write at least 1x block is supported.
For supporting atomic writes > 1x block, a CoW-based method will also be
used and this will not be restricted to using HW which can write >= 1x
block.
However for deciding if HW-based atomic writes can be used, we need to
start adding checks for write length < HW min, which complicates the
code. Indeed, a statx field similar to unit_max_opt should also be
added for this minimum, which is undesirable.
HW which can only write > 1x blocks would be uncommon and quite weird,
so let's just not support it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Separate out setting buftarg atomic writes limits into a dedicated
function, xfs_configure_buftarg_atomic_writes(), to keep the specific
functionality self-contained.
For naming consistency, rename xfs_setsize_buftarg() ->
xfs_configure_buftarg().
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In future we will want to be able to check if specifically HW offload-based
atomic writes are possible, so rename xfs_inode_can_atomicwrite() ->
xfs_inode_can_hw_atomicwrite().
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the
block device LBA size because we don't need to ask the block layer to
validate a geometry number that it provided us. Instead, set the
preliminary bt_meta_sector* fields to the LBA size in preparation for
reading the primary super.
However, we still want to flush and invalidate the pagecache for all
three block devices before we start reading metadata from those devices,
so call sync_blockdev() per bdev in xfs_alloc_buftarg().
This will enable a subsequent patch to validate hw atomic write geometry
against the filesystem geometry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically a new method of operation based on FS atomic extent
remapping will be supported in addition to the current HW offload-based
method.
The FS method will generally perform appreciably slower than the
HW-offload method. However, the FS method will typically be able to
achieve a larger atomic write unit max limit.
XFS will support a hybrid mode, where the HW offload method is used when
possible, i.e. when the length of the write is supported, and FS-based
atomic writes are used otherwise.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
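Userspace can query the new field along with the existing limits; a
minimal example, assuming kernel and libc headers new enough to define
STATX_WRITE_ATOMIC and the stx_atomic_write_* fields:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2)
		return 1;
	if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx)) {
		perror("statx");
		return 1;
	}
	if (stx.stx_mask & STATX_WRITE_ATOMIC)
		printf("unit min %u max %u max_opt %u\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_unit_max_opt);
	return 0;
}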
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.
But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.
This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
These just indicate that we're shutting down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A pre-existing valid cfid returned from find_or_create_cached_dir might
race with a lease break, meaning open_cached_dir doesn't consider it
valid, and thinks it's newly-constructed. This leaks a dentry reference
if the allocation occurs before the queued lease break work runs.
Avoid the race by holding the cfid_list_lock across
find_or_create_cached_dir and the check of its result.
Cc: stable@vger.kernel.org
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Intel hybrid processors have CPUs of different capacity. Populate the
interface /sys/devices/system/cpu/cpuN/cpu_capacity.
This interface uses the per-CPU variable `cpu_scale`. On x86 this
variable has no other use besides feeding the sysfs entries. Initialize
it when setting CPU capacity for the scheduler and scale-invariant code.
Feed it with arch_scale_cpu_capacity() as it gives capacity normalized
to the interval [0, SCHED_CAPACITY_SCALE].
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
arch_topology.c provides functionality to parse and scale CPU capacity.
It also provides a corresponding sysfs interface. Some architectures
parse and scale CPU capacity differently as per their own needs. On
Intel processors, for instance, it is the responsibility of the Intel
P-state driver.
Relocate the implementation of that interface to a common location in
topology.c. Architectures can use the interface and populate it using
their own mechanisms.
An alternative approach would be to compile arch_topology.c even if
not needed only to get this interface. This approach would create
duplicated and conflicting functionality and data structures.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20250419025504.9760-2-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Update documentation of attributes for platform temperature control.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20250429000110.236243-4-srinivas.pandruvada@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Enable the Platform Temperature Control feature for Lunar Lake and
Panther Lake.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20250429000110.236243-3-srinivas.pandruvada@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Platform Temperature Control is a dynamic control loop implemented in
hardware to manage the skin or any board temperature of a device. The
reported skin or board temperature is controlled by comparing to a
configured target temperature and adjusting the SoC (System on Chip)
performance accordingly. The feature supports up to three platform
sensors.
OEMs (Original Equipment Manufacturers) can configure this feature
through the BIOS and provide temperature input directly to the hardware
via the Platform Environment Control Interface (PECI). As a result,
this feature can operate independently of any OS-level control.
The OS interface can be used to further fine-tune the default OEM
configuration. Here are some scenarios where the OS interface is
beneficial:
Verification of Firmware Control: Check if firmware-based control is
enabled. If it is, thermal controls from the OS/user space can be
backed out.
Adjusting Target Limits: While OEMs can set an aggressive target limit,
the OS can adjust this to a less aggressive limit based on operating
modes or conditions.
Given that this is platform temperature control, it is expected that a
single user-level manager owns and manages the controls. If multiple
user-level software applications attempt to write different targets, it
can lead to unexpected behavior. For instance, on a Linux desktop, the
Linux thermal daemon can manage these temperature controls, as it has
access to all other temperature control settings.
The hardware control interface is via MMIO offsets in the processor
thermal device MMIO space. There are three instances of MMIO registers.
Refer to the platform_temperature_control.c for MMIO details.
Expose "enable" and "temperature_target" via sysfs.
There are three instances of these controls, so up to three different
sensors can be controlled independently.
Sysfs interface:
tree /sys/bus/pci/devices/0000\:00\:04.0/ptc_?_control/
/sys/bus/pci/devices/0000:00:04.0/ptc_0_control/
├── enable
└── temperature_target
/sys/bus/pci/devices/0000:00:04.0/ptc_1_control/
├── enable
└── temperature_target
/sys/bus/pci/devices/0000:00:04.0/ptc_2_control/
├── enable
└── temperature_target
Description of attributes:
enable: 1 for enable, 0 for disable. This attribute can be used to
read the current status. User space can write 0 or 1 to disable or
enable this feature respectively.
temperature_target: target temperature limit, in milli degrees Celsius,
to which the hardware will try to constrain the reported temperature.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20250429000110.236243-2-srinivas.pandruvada@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Append PCIe Gen5 limitations to xe_firmware document.
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250506054835.3395220-4-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Expose sysfs attributes for PCIe link downgrade capability and status.
v2: Move from debugfs to sysfs (Lucas, Rodrigo, Badal)
Rework macros and their naming (Rodrigo)
v3: Use sysfs_create_files() (Riana)
Fix checkpatch warning (Riana)
v4: s/downspeed/downgrade (Lucas, Rodrigo, Riana)
v5: Use PCIe Gen agnostic naming (Rodrigo)
v6: s/pcie_gen/auto_link (Lucas)
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
Link: https://lore.kernel.org/r/20250506054835.3395220-3-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Since xe_device_sysfs_init() exposes device specific attributes, a better
place for it is xe_device_probe().
Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Riana Tauro <riana.tauro@intel.com>
Link: https://lore.kernel.org/r/20250506054835.3395220-2-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
xe_force_wake_get() is dependent on xe_pm_runtime_get(), so on the
release path xe_force_wake_put() should be called first, then
xe_pm_runtime_put().
Combine the error path and normal path together with goto.
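The resulting shape is roughly (illustrative sketch, not the actual
diff):

	unsigned int fw_ref;
	int ret = 0;

	xe_pm_runtime_get(gt_to_xe(gt));
	fw_ref = xe_force_wake_get(gt_to_fw(gt), XE_FORCEWAKE_ALL);
	if (!fw_ref) {
		ret = -ETIMEDOUT;
		goto fw_put;
	}

	/* ... access the hardware ... */

fw_put:
	/* release in reverse order of acquisition */
	xe_force_wake_put(gt_to_fw(gt), fw_ref);
	xe_pm_runtime_put(gt_to_xe(gt));
	return ret;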
Fixes: 85d547608ef5 ("drm/xe/xe_gt_debugfs: Update handling of xe_force_wake_get return")
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://lore.kernel.org/r/20250507022302.2187527-1-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.
While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/2317800.iZASKD2KPV@rjwysocki.net
|
|
Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net
|
|
In 'lookup_or_create_module_kobject()', an internal kobject is created
using 'module_ktype'. So a call to 'kobject_put()' on the error handling
path causes an attempt to use an uninitialized completion pointer in
'module_kobject_release()'. In this scenario, we just want to release
the kobject without the extra synchronization required for a regular
module unloading process, so adding an extra check of whether
'complete()' is actually required makes 'kobject_put()' safe.
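A sketch of the described check, assuming module_kobject keeps its
completion behind the kobj_completion pointer:

static void module_kobject_release(struct kobject *kobj)
{
	struct module_kobject *mk = to_module_kobject(kobj);

	/* Only complete() when a waiter armed the completion, i.e. a
	 * regular module unload; bare error-path puts skip it. */
	if (mk->kobj_completion)
		complete(mk->kobj_completion);
}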
Reported-by: syzbot+7fb8a372e1f6add936dd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7fb8a372e1f6add936dd
Fixes: 942e443127e9 ("module: Fix mod->mkobj.kobj potentially freed too early")
Cc: stable@vger.kernel.org
Suggested-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20250507065044.86529-1-dmantipov@yandex.ru
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
|
|
Without CONFIG_DRM_XE_GPUSVM set, GPU SVM is not initialized and thus
the warning below pops. Make the flush work code conditional on the
config to avoid the warning:
"
[ 453.132028] ------------[ cut here ]------------
[ 453.132527] WARNING: CPU: 9 PID: 4491 at kernel/workqueue.c:4205 __flush_work+0x379/0x3a0
[ 453.133355] Modules linked in: xe drm_ttm_helper ttm gpu_sched drm_buddy drm_suballoc_helper drm_gpuvm drm_exec
[ 453.134352] CPU: 9 UID: 0 PID: 4491 Comm: xe_exec_mix_mod Tainted: G U W 6.15.0-rc3+ #7 PREEMPT(full)
[ 453.135405] Tainted: [U]=USER, [W]=WARN
...
[ 453.136921] RIP: 0010:__flush_work+0x379/0x3a0
[ 453.137417] Code: 8b 45 00 48 8b 55 08 89 c7 48 c1 e8 04 83 e7 08 83 e0 0f 83 cf 02 89 c6 48 0f ba 6d 00 03 e9 d5 fe ff ff 0f 0b e9 db fd ff ff <0f> 0b 45 31 e4 e9 d1 fd ff ff 0f 0b e9 03 ff ff ff 0f 0b e9 d6 fe
[ 453.139250] RSP: 0018:ffffc90000c67b18 EFLAGS: 00010246
[ 453.139782] RAX: 0000000000000000 RBX: ffff888108a24000 RCX: 0000000000002000
[ 453.140521] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8881016d61c8
[ 453.141253] RBP: ffff8881016d61c8 R08: 0000000000000000 R09: 0000000000000000
[ 453.141985] R10: 0000000000000000 R11: 0000000008a24000 R12: 0000000000000001
[ 453.142709] R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107db8c00
[ 453.143450] FS: 00007f44853d4c80(0000) GS:ffff8882f469b000(0000) knlGS:0000000000000000
[ 453.144276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 453.144853] CR2: 00007f4487629228 CR3: 00000001016aa000 CR4: 00000000000406f0
[ 453.145594] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 453.146320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 453.147061] Call Trace:
[ 453.147336] <TASK>
[ 453.147579] ? tick_nohz_tick_stopped+0xd/0x30
[ 453.148067] ? xas_load+0x9/0xb0
[ 453.148435] ? xa_load+0x6f/0xb0
[ 453.148781] __xe_vm_bind_ioctl+0xbd5/0x1500 [xe]
[ 453.149338] ? dev_printk_emit+0x48/0x70
[ 453.149762] ? _dev_printk+0x57/0x80
[ 453.150148] ? drm_ioctl+0x17c/0x440
[ 453.150544] ? __drm_dev_vprintk+0x36/0x90
[ 453.150983] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.151575] ? drm_ioctl_kernel+0x9f/0xf0
[ 453.151998] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.152560] drm_ioctl_kernel+0x9f/0xf0
[ 453.152968] drm_ioctl+0x20f/0x440
[ 453.153332] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe]
[ 453.153893] ? ioctl_has_perm.constprop.0.isra.0+0xae/0x100
[ 453.154489] ? memory_bm_test_bit+0x5/0x60
[ 453.154935] xe_drm_ioctl+0x47/0x70 [xe]
[ 453.155419] __x64_sys_ioctl+0x8d/0xc0
[ 453.155824] do_syscall_64+0x47/0x110
[ 453.156228] entry_SYSCALL_64_after_hwframe+0x76/0x7e
"
v2 (Matt):
refine commit message to have more details
add Fixes tag
move the code to xe_svm.h which already has the config
remove a blank line per codestyle suggestion
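A sketch of the resulting helper in xe_svm.h (illustrative; the names
follow the commit text, the stub variant is an assumption):

#if IS_ENABLED(CONFIG_DRM_XE_GPUSVM)
static inline void xe_svm_flush(struct xe_vm *vm)
{
	if (xe_vm_in_fault_mode(vm))
		flush_work(&vm->svm.garbage_collector.work);
}
#else
static inline void xe_svm_flush(struct xe_vm *vm)
{
	/* GPU SVM not initialized without the config: nothing to flush */
}
#endif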
Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250502170052.1787973-1-shuicheng.lin@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Add a new reviewer, Hongbo Li, for better community development
- Fix an I/O hang out of file-backed mounts
- Address a rare data corruption caused by concurrent I/Os on the same
deduplicated compressed data
- Minor cleanup
* tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: ensure the extra temporary copy is valid for shortened bvecs
erofs: remove unused enum type
fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
MAINTAINERS: erofs: add myself as reviewer
|
|
Device flags could be updated while MGMT_OP_ADD_DEVICE is pending on
hci_update_passive_scan_sync, so instead of stashing current_flags as
cmd->user_data, just do a lookup using hci_conn_params_lookup and use
the latest stored flags.
Fixes: a182d9c84f9c ("Bluetooth: MGMT: Fix Add Device to responding before completing")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
copy_from_user() performs more checks and is safer than
__copy_from_user().
Suggested-by: Kees Cook <kees@kernel.org>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://lore.kernel.org/r/acabf20aa8621c7bc8de09b1bffb8d14b5376484.1746126614.git.harish.chegondi@intel.com
|
|
To receive 428dc9fc0873 ("sched_ext: bpf_iter_scx_dsq_new() should always
initialize iterator") which conflicts with cdf5a6faa8cf ("sched_ext: Move
dsq_hash into scx_sched"). The conflict is a simple context conflict which
can be resolved by taking the changes from both commits in the right order.
|
|
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.
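A sketch of the fix (illustrative; the lookup helper name is a
stand-in and the remaining cursor setup is elided):

__bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it,
				     u64 dsq_id, u64 flags)
{
	struct bpf_iter_scx_dsq_kern *kit = (void *)it;

	/* Clear the iterator first so that next()/destroy() are no-ops
	 * even if we error out below. */
	kit->dsq = NULL;

	if (flags & ~__SCX_DSQ_ITER_USER_FLAGS)
		return -EINVAL;

	kit->dsq = find_user_dsq(dsq_id);	/* stand-in lookup */
	if (!kit->dsq)
		return -ENOENT;

	/* remaining cursor initialization elided */
	return 0;
}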
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b131e ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Avoid the need for manual of_node_put() cleanup in early exits from the
loop.
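Presumably this is the scoped-iterator conversion; a before/after
sketch of the pattern (setup_child() is a hypothetical stand-in):

	/* before: every early exit must drop the reference manually */
	struct device_node *child;

	for_each_child_of_node(parent, child) {
		if (setup_child(child) < 0) {
			of_node_put(child);
			return -EINVAL;
		}
	}

	/* after: the scoped variant puts the reference automatically,
	 * even on early return */
	for_each_child_of_node_scoped(parent, child) {
		if (setup_child(child) < 0)
			return -EINVAL;
	}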
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20240830073824.3539690-1-ruanjinjie@huawei.com
|
|
In tegra_crtc_reset(), new memory is allocated with kzalloc(), but no
check is performed. Check state before calling
__drm_atomic_helper_crtc_reset() to prevent a possible NULL pointer
dereference.
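One possible shape of the check (illustrative; the helper accepts a
NULL state and simply leaves crtc->state unset):

static void tegra_crtc_reset(struct drm_crtc *crtc)
{
	struct tegra_dc_state *state = kzalloc(sizeof(*state), GFP_KERNEL);

	if (crtc->state)
		tegra_crtc_atomic_destroy_state(crtc, crtc->state);

	/* pass NULL on allocation failure instead of dereferencing it */
	__drm_atomic_helper_crtc_reset(crtc, state ? &state->base : NULL);
}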
Fixes: b7e0b04ae450 ("drm/tegra: Convert to using __drm_atomic_helper_crtc_reset() for reset.")
Cc: stable@vger.kernel.org
Signed-off-by: Qiu-ji Chen <chenqiuji666@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20241106095906.15247-1-chenqiuji666@gmail.com
|
|
The of_get_child_by_name() increments the refcount in tegra_dc_rgb_probe,
but the driver does not decrement the refcount during unbind. Fix the
unbalanced reference count using the devm_add_action_or_reset() helper.
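A condensed sketch of the pattern (probe body trimmed; note that
devm_add_action_or_reset() runs the action itself on failure, so no
reference can leak):

static void tegra_dc_rgb_put_node(void *data)
{
	of_node_put(data);
}

int tegra_dc_rgb_probe(struct tegra_dc *dc)
{
	struct device_node *np;
	int err;

	np = of_get_child_by_name(dc->dev->of_node, "rgb");
	if (!np)
		return -ENODEV;

	/* drop the child reference automatically on unbind */
	err = devm_add_action_or_reset(dc->dev, tegra_dc_rgb_put_node, np);
	if (err < 0)
		return err;

	/* remainder of probe elided */
	return 0;
}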
Fixes: d8f4a9eda006 ("drm: Add NVIDIA Tegra20 support")
Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20250205112137.36055-1-biju.das.jz@bp.renesas.com
|
|
The current code can issue CDMA flushes (DMAPUT bumps) in the middle
of a job, before all opcodes have been written into the pushbuffer.
This can happen when the pushbuffer fills up. Presumably this made
sense at some point in the past, but it doesn't anymore, as it cannot
lead to more space appearing in the pushbuffer: space is only reclaimed
one full job at a time.
Mid-job flushes can also cause problems, as in an extreme situation
(seen in practice), the hardware can run through the entire pushbuffer
including the prefix of a partially written job without the driver
being able to process any CDMA updates. This can cause the engine
MLOCK to be taken and held for extended periods as the tail of the
job is not yet available to hardware.
Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20250204024546.1168126-1-mperttunen@nvidia.com
|
|
Some platform devices create child devices dynamically and require the
parent device's msi-map to map device IDs to actual sideband information.
A typical use case is using ITS as a PCIe Endpoint Controller (EPC)'s
doorbell function, where PCI hosts send TLP memory writes to the EP
controller. The EP controller converts these writes to AXI transactions
and appends platform-specific sideband information.
The EPC's DTS will provide such information via msi-map and msi-mask. A
simplified dts:
pcie-ep@10000000 {
	...
	msi-map = <0 &its 0xc 8>;
	          ^^^ 0xc is implementation-defined sideband information,
	              which is appended to the AXI write transaction.
	       ^ 0 is the function index.
	msi-mask = <0x7>;
}
Check msi-map if msi-parent is missing, to keep compatibility with
existing systems.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250414-ep-msi-v18-5-f69b49917464@nxp.com
|
|
Document the use of (msi|iommu)-map for PCI Endpoint (EP) controllers,
which can use MSI as a doorbell mechanism. Each EP controller can support
up to 8 physical functions and 65,536 virtual functions.
Define how to construct device IDs using function bits [2:0] and virtual
function index bits [31:3], enabling (msi|iommu)-map to associate each
child device with a specific (msi|iommu)-specifier.
The EP cannot rely on PCI Requester ID (RID) because the RID is determined
by the PCI topology of the host system. Since the EP may be connected to
different PCI hosts, the RID can vary between systems and is therefore not
a reliable identifier.
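As a worked example of that layout (illustrative):

#include <stdint.h>

/* bits [2:0]: physical function, bits [31:3]: virtual function index */
static inline uint32_t ep_msi_device_id(uint32_t func, uint32_t vf_index)
{
	return (vf_index << 3) | (func & 0x7);
}

So PF 4/VF 2 yields device ID 0x14, which an msi-map entry can then
translate to a specific (msi|iommu)-specifier.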
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/all/20250414-ep-msi-v18-4-f69b49917464@nxp.com
|
|
Set the IRQ_DOMAIN_FLAG_MSI_IMMUTABLE flag for ITS, as it does not change
the address/data pair after setup.
Ensure compatibility with MSI users, such as PCIe Endpoint Doorbell, which
require the address/data pair to remain unchanged. Enable PCIe endpoints to
use ITS for triggering doorbells from the PCIe Root Complex (RC) side.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250414-ep-msi-v18-3-f69b49917464@nxp.com
|
|
Add the flag IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and the API function
irq_domain_is_msi_immutable() to check if the MSI controller retains an
immutable address/data pair during irq_set_affinity().
Ensure compatibility with MSI users like PCIe Endpoint Doorbell, which
require the address/data pair to remain unchanged after setup. Use this
function to verify if the MSI controller is immutable.
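A hypothetical caller on the endpoint side might gate its doorbell
setup on the new check like so (sketch):

static int ep_doorbell_check(struct device *dev)
{
	struct irq_domain *domain = dev_get_msi_domain(dev);

	/* the doorbell address handed to the host must never move */
	if (!domain || !irq_domain_is_msi_immutable(domain))
		return -EOPNOTSUPP;

	return 0;
}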
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250414-ep-msi-v18-2-f69b49917464@nxp.com
|
|
platform_device_msi_init_and_alloc_irqs() performs two tasks: allocating
the MSI domain for a platform device, and allocating a number of MSIs
in that domain.
platform_device_msi_free_irqs_all() only frees the MSIs, and leaves the MSI
domain alive.
Given that platform_device_msi_init_and_alloc_irqs() is the sole tool a
platform device has to allocate platform MSIs, it makes sense for
platform_device_msi_free_irqs_all() to tear down the MSI domain at the
same time as the MSIs.
This avoids warnings and unexpected behaviours when a driver repeatedly
allocates and frees MSIs.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/20250414-ep-msi-v18-1-f69b49917464@nxp.com
|