Age | Commit message (Collapse) | Author |
|
extent-tree.h is included more than once, added in a0231804affe ("btrfs:
move extent-tree helpers into their own header file").
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When debugging a scrub related metadata error, it turns out that our
metadata error reporting is not ideal.
The only 3 error messages are:
- BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
Showing we have metadata generation mismatch errors.
- BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
Showing which tree blocks are corrupted.
- BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
Showing which physical range the corrupted metadata is at.
We have to combine the above 3 to know we have a corrupted metadata with
generation mismatch.
And this is already the better case, if we have other problems, like
fsid mismatch, we can not even know the cause.
[CAUSE]
The problem is caused by the fact that, scrub_checksum_tree_block()
never outputs any error message.
It just return two bits for scrub: sblock->header_error, and
sblock->generation_error.
And later we report error in scrub_print_warning(), but unfortunately we
only have two bits, there is not really much thing we can done to print
any detailed errors.
[FIX]
This patch will do the following to enhance the error reporting of
metadata scrub:
- Add extra warning (ratelimited) for every error we hit
This can help us to distinguish the different types of errors.
Some errors can help us to know what's going wrong immediately,
like bytenr mismatch.
- Re-order the checks
Currently we check bytenr first, then immediately generation.
This can lead to false generation mismatch reports, while the fsid
mismatches.
Here is the new output for the bug I'm debugging (we forgot to
writeback tree blocks for commit roots):
BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
Now we can immediately know it's some tree blocks didn't even get written
back, other than the original confusing generation mismatch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a file system has ZNS devices which are constrained by a maximum
number of active block groups, then not being able to use all the block
groups for every allocation is not ideal, and could cause us to loop a
ton with mixed size allocations.
In general, since zoned doesn't write into gaps behind where block
groups are writing, it is not susceptible to the same sort of
fragmentation that size classes are designed to solve, so we can skip
size classes for zoned file systems in general, even though there would
probably be no harm for SMR devices.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the size class is an artifact of an arbitrary anti fragmentation
strategy, it doesn't really make sense to persist it. Furthermore, most
of the size class logic assumes fresh block groups. That is of course
not a reasonable assumption -- we will be upgrading kernels with
existing filesystems whose block groups are not classified.
To work around those issues, implement logic to compute the size class
of the block groups as we cache them in. To perfectly assess the state
of a block group, we would have to read the entire extent tree (since
the free space cache mashes together contiguous extent items) which
would be prohibitively expensive for larger file systems with more
extents.
We can do it relatively cheaply by implementing a simple heuristic of
sampling a handful of extents and picking the smallest one we see. In
the happy case where the block group was classified, we will only see
extents of the correct size. In the unhappy case, we will hopefully find
one of the smaller extents, but there is no perfect answer anyway.
Autorelocation will eventually churn up the block group if there is
significant freeing anyway.
There was no regression in mount performance at end state of the fsperf
test suite, and the delay until the block group is marked cached is
minimized by the constant number of extent samples.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new tests: where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric baseline current stdev diff
======================================================================================
avg_commit_ms 31.13 29.20 2.67 -6.22%
bg_count 14 15.60 0 11.43%
commits 11.10 12.20 0.32 9.91%
elapsed 27.30 26.40 2.98 -3.30%
end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38%
end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07%
find_free_extent_calls 116244.30 114354.30 964.56 -1.63%
find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67%
find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80%
find_free_extent_ns_min 500 512 6.67 2.40%
find_free_extent_ns_p50 2848 2876 37.65 0.98%
find_free_extent_ns_p95 4916 5000 75.45 1.71%
find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90%
frag_pct_max 61.67 0 8.05 -100.00%
frag_pct_mean 43.59 0 6.10 -100.00%
frag_pct_min 25.91 0 16.60 -100.00%
frag_pct_p50 42.53 0 7.25 -100.00%
frag_pct_p95 61.67 0 8.05 -100.00%
frag_pct_p99 61.67 0 8.05 -100.00%
fragmented_bg_count 6.10 0 1.45 -100.00%
max_commit_ms 49.80 46 5.37 -7.63%
sys_cpu 2.59 2.62 0.29 1.39%
write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23%
write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14%
write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62%
write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90%
write_io_kbytes 4194304 4194304 0 0.00%
write_iops 2476.15 2556.10 274.29 3.23%
write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11%
write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22%
write_lat_ns_min 17353.10 16250 1646.08 -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
find_free_extent is a complicated function. It consists (at least) of:
- a hint that jumps into the middle of a for loop macro
- a middle loop trying every raid level
- an outer loop ascending through ffe loop levels
- complicated logic for skipping some of those ffe loop levels
- multiple underlying in-bg allocators (zoned, cluster, no cluster)
Which is all to say that more tracing is helpful for debugging its
behavior. Add two new tracepoints: at the entrance to the block_groups
loop (hit for every raid level and every ffe_ctl loop) and at the point
we seriously consider a block_group for allocation. This way we can see
the whole path through the algorithm, including hints, multiple loops,
etc.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The allocator tracepoints currently have a pile of values from ffe_ctl.
In modifying the allocator and adding more tracepoints, I found myself
adding to the already long argument list of the tracepoints. It makes it
a lot simpler to just send in the ffe_ctl itself.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Given that wait is always set to 1, so remove the argument.
Last use of wait with 0 was in 0c304304feab ("Btrfs: remove
csum_bytes_left").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We currently use 'ret' and 'err' to track the return value for
log_dir_items(), which is confusing and likely the cause for previous
bugs where log_dir_items() did not return an error when it should, fixed
in previous patches.
So change this and use only a single variable, 'ret', to track the return
value. This is simpler and makes it similar to most of the existing code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value
has a few inconveniences:
1) If it's ever used by btrfs_log_inode(), or any function down the call
chain, we have to remember to btrfs_set_log_full_commit(), which is
repetitive and has a chance to be forgotten in future use cases.
btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when
it gets a negative value from btrfs_log_inode();
2) Down the call chain of btrfs_log_inode(), we may have functions that
need to force a log commit, but can return either an error (negative
value), false (0) or true (1). So they are forced to return some
random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT
would make the intention more clear. Currently the only example is
flush_dir_items_batch().
So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value
is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes
it easier to debug.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED,
PAGE_ALIGN_DOWN macros. Use these macros to make code more
concise.
Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs_get_chunk_map fails to allocate a new em the cleanup does not
need to be done so the goto target is out_err, which is consistent with
current coding style.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We had a recent bug that would have been caught by a newer compiler with
-Wmaybe-uninitialized and would have saved us a month of failing tests
that I didn't have time to investigate.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With -Wmaybe-uninitialized compiler complains about ret being possibly
uninitialized, which isn't possible as the WQ_ constants are set only
from our code, however we can handle the default case and get rid of the
warning.
The value is set to BLK_STS_IOERR so it does not issue any IO and could
be potentially detected, but this is basically a "cannot happen" error.
To catch any problems during development use the assert.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ set the error in default: ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fix an uninitialized warning we get with -Wmaybe-uninitialized where it
thought zno may have been uninitialized, in both cases it depends on
zinfo->zone_cache but we know the value won't change between checks.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/linux-btrfs/af6c527cbd8bdc782e50bd33996ee83acc3a16fb.1671221596.git.josef@toxicpanda.com/
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We only have 3 possible mirrors, and we have ASSERT()'s to make sure
we're not passing in an invalid super mirror into this function, so
technically this value isn't uninitialized. However
-Wmaybe-uninitialized will complain, so set it to U64_MAX so if we don't
have ASSERT()'s turned on it'll error out later on when it see's the
zone is beyond our maximum zones.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
convert_extent_bit
We will pass in the parent and p pointer into our tree_search function
to avoid doing a second search when inserting a new extent state into
the tree. However because this is conditional upon passing in these
pointers the compiler seems to think these values can be uninitialized
if we're using -Wmaybe-uninitialized. Fix this by initializing these
values.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
reclaim isn't set in the alloc case, however we only care about
reclaim in the !alloc case. This isn't an actual problem, however
-Wmaybe-uninitialized will complain, so initialize reclaim to quiet the
compiler.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Anybody that calls get_inode_gen() can have an uninitialized gen if
there's an error. This isn't a big deal because all the users just exit
if they get an error, however it makes -Wmaybe-uninitialized complain,
so fix this up to always initialize the passed in gen, this quiets all
of the uninitialized warnings in send.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can conditionally pass in a locked page, and then we'll use that page
range to skip marking errors as that will happen in another layer.
However this causes the compiler to complain because it doesn't
understand we only use these values when we have the page. Make the
compiler stop complaining by setting these values to 0.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While trying to sync messages.[ch] I ended up with this dependency on
messages.h in the rest of btrfs-progs code base because it's where
btrfs_abort_transaction() was now held. We want to keep messages.[ch]
limited to the kernel code, and the btrfs_abort_transaction() code
better fits in the transaction code and not in messages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the __cold attributes ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that none of the functions called by btrfs_merge_delayed_refs() needs
a btrfs_trans_handle, directly pass in a btrfs_fs_info to
btrfs_merge_delayed_refs().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that drop_delayed_ref() doesn't need a btrfs_trans_handle, drop it
from insert_delayed_ref() as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that drop_delayed_ref() doesn't get the btrfs_trans_handle passed in
anymore, we can get rid of it in merge_ref() as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
drop_delayed_ref() doesn't use the btrfs_trans_handle it gets passed in,
so remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fix from Hans de Goede:
"Intel vsec driver Meteor Lake PCI ids addition"
* tag 'platform-drivers-x86-v6.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/intel/vsec: Add support for Meteor Lake
|
|
CONFIG_DRM_USE_DYNAMIC_DEBUG breaks debug prints for (at least modular)
drm drivers. The debug prints can be reinstated by manually frobbing
/sys/module/drm/parameters/debug after the fact, but at that point the
damage is done and all debugs from driver probe are lost. This makes
drivers totally undebuggable.
There's a more complete fix in progress [1], with further details, but
we need this fixed in stable kernels. Mark the feature as broken and
disable it by default, with hopes distros follow suit and disable it as
well.
[1] https://lore.kernel.org/r/20230125203743.564009-1-jim.cromie@gmail.com
Fixes: 84ec67288c10 ("drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro")
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.1+
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Maxime Ripard <maxime@cerno.tech>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230207143337.2126678-1-jani.nikula@intel.com
|
|
The runtime Power Management of CPU topology is not compatible with
PREEMPT_RT:
1. Core cpuidle path disables IRQs.
2. Core cpuidle calls cpuidle-psci.
3. cpuidle-psci in __psci_enter_domain_idle_state() calls
pm_runtime_put_sync_suspend() and pm_runtime_get_sync() which use
spinlocks (which are sleeping on PREEMPT_RT).
Deep sleep modes are not a priority of Realtime kernels because the
latencies might become unpredictable. On the other hand the PSCI CPU
idle power domain is a parent of other devices and power domain
controllers, thus it cannot be simply skipped (e.g. on Qualcomm SM8250).
Disable the idle callbacks in cpuidle-psci and mark the domain as
always on. This is a trade-off between making PREEMPT_RT working and
still having a proper power domain hierarchy in the system.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Adrien Thierry <athierry@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The obsolete cpufreq driver was removed, drop the platform device
and data accordingly.
Link: https://lore.kernel.org/all/20230112135342.3927338-1-keguang.zhang@gmail.com
Signed-off-by: Keguang Zhang <keguang.zhang@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When setting the power limit time window, software updates the 'y' bits
and 'f' bits in the power limit register, and the value hardware takes
follows the formula below
Time window = 2 ^ y * (1 + f / 4) * Time_Unit
When handling large time window input from userspace, using left
shifting breaks in two cases:
1. when ilog2(value) is bigger than 31, in expression "1 << y", left
shifting by more than 31 bits has undefined behavior. This breaks
'y'. For example, on an Alderlake platform, "1 << 32" returns 1.
2. when ilog2(value) equals 31, "1 << 31" returns negative value
because '1' is recognized as signed int. And this breaks 'f'.
Given that 'y' has 5 bits and hardware can never take a value larger
than 31, fix the first problem by clamp the time window to the maximum
possible value that the hardware can take.
Fix the second problem by using unsigned bit left shift.
Note that hardware has its own maximum time window limitation, which
may be lower than the time window value retrieved from the power limit
register. When this happens, hardware clamps the input to its maximum
time window limitation. That is why a software clamp is preferred to
handle the problem on hand.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Adjusted the comment added by this change ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently, uring_cmd with UBLK_IO_FETCH_REQ or
UBLK_IO_COMMIT_AND_FETCH_REQ is always checked whether
userspace server has provided IO buffer even flag
UBLK_F_NEED_GET_DATA is configured.
This is a excessive check. If UBLK_F_NEED_GET_DATA is
configured, FETCH_RQ doesn't need to provide IO buffer;
COMMIT_AND_FETCH_REQ also doesn't need to do that if
the IO type is not READ.
Check ub_cmd->addr together with ublk_need_get_data()
and IO type in ublk_ch_uring_cmd().
With this fix, userspace server doesn't need to preserve
buffers for every ublk_io when flag UBLK_F_NEED_GET_DATA
is configured, in order to save memory.
Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Fixes: c86019ff75c1 ("ublk_drv: add support for UBLK_IO_NEED_GET_DATA")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230210141356.112321-1-xiaodong.liu@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), a successful call to sched_setaffinity() should always save
the user requested cpu affinity mask in a task's user_cpus_ptr. However,
when the given cpu mask is the same as the current one, user_cpus_ptr
is not updated. Fix this by saving the user mask in this case too.
Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230203181849.221943-1-longman@redhat.com
|
|
Tetsuo-San noted that commit f5d39b020809 ("freezer,sched: Rewrite
core freezer logic") broke call_usermodehelper_exec() for the KILLABLE
case.
Specifically it was missed that the second, unconditional,
wait_for_completion() was not optional and ensures the on-stack
completion is unused before going out-of-scope.
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Reported-by: syzbot+6cd18e123583550cf469@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y90ar35uKQoUrLEK@hirez.programming.kicks-ass.net
|
|
Texts in numbered lists are rendered as continous paragraph when there
should have been breaks between first line text in the beginning of list
item and the description. Fix this by adding appropriate line breaks and
indent the rest of lines to match the first line of numbered list item.
Fixes: d6d71ee4a14ae6 ("PM: Introduce Intel PowerClamp Driver")
Fixes: 6bbe6f5732faea ("docs: thermal: convert to ReST")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
kernel test robot reported htmldocs warning:
Documentation/admin-guide/thermal/intel_powerclamp.rst:328: WARNING: Inline emphasis start-string without end-string.
The mistaken asterisk in /proc/irq/*/smp_affinity is rendered as hyperlink
as the result.
Escape the asterisk to fix above warning.
Link: https://lore.kernel.org/linux-doc/202302122247.N4S791c4-lkp@intel.com/
Fixes: ebf51971021881 ("thermal: intel: powerclamp: Add two module parameters")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
kernel test robot reported htmldocs warnings:
Documentation/admin-guide/index.rst:62: WARNING: toctree contains reference to nonexisting document 'admin-guide/thermal'
Documentation/admin-guide/thermal/intel_powerclamp.rst: WARNING: document isn't included in any toctree
Add toctree entry for thermal/ docs to fix these warnings.
Link: https://lore.kernel.org/linux-doc/202302121759.MmJgDTxc-lkp@intel.com/
Fixes: 707bf8e1dfd51d ("Documentation: admin-guide: Move intel_powerclamp documentation")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
If the cpuidle driver provides the target residency and exit latency in
nanoseconds, the corresponding values in microseconds need to be set to
reflect the provided numbers in order for the sysfs interface to show
them correctly, so make __cpuidle_driver_init() do that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/defconfig
More ARM64 defconfig updates for v6.3
Here are two more defconfig updates for 6.3, enabling the SM8450 Display
clock controller driver, as well as the SDAM driver, a driver exposing
SRAM on newer Qualcomm PMICs to other devices.
* tag 'qcom-arm64-defconfig-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: defconfig: enable Qualcomm SDAM nvmem driver
arm64: defconfig: enable SM8450 DISPCC clock driver
arm64: defconfig: enable the clock driver for Qualcomm SA8775P platforms
arm64: defconfig: enable Visionox VTDR6130 DSI Panel driver
arm64: defconfig: enable SM8550 DISPCC clock driver
arm64: defconfig: enable Qualcomm PCIe modem drivers
arm64: defconfig: Enable SC8280XP Display Clock Controller
arm64: defconfig: Enable GCC, TCSRCC, pinctrl and interconnect for SM8550
arm64: defconfig: enable crypto userspace API
arm64: defconfig: build SDM_LPASSCC_845 as a module
arm64: defconfig: enable camera on Thundercomm RB5 platform
arm64: defconfig: build PINCTRL_SM8250_LPASS_LPI as module
arm64: defconfig: Enable Qualcomm EUD
Link: https://lore.kernel.org/r/20230210181516.2021902-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/dt
More Qualcomm ARM64 DT updates for 6.3
The new Qualcomm QDU1000 and QRU1000 platforms, and the IDP device on
these are introduced. New support for a couple of USB modem sticks from
THWC are introduced, so is support for Xiaomi Mi Pad 5 Pro and the Pro
SKU of the Herobrine device.
The Core Bus Fabric (CBF) is introduced on MSM8996. Interconnect paths
for UFS are also described.
A few fixes related to the power-grid of herobrine, on SC7280, are
introduced.
QFPROM is introduced on IPQ8074 and Interconnect providers are added for
SDM670.
On SDM845 the duplicated wcd9340 audio coded description is moved from
devices to a common file, audio devices are added to the OnePlus 6 and
6T.
On SM6115 debug UART, SMP2P, watchdog nodes are introduced, and the
platform is switched to use #address/size-cells of 2, in line with most
other platforms.
Camera control interface and clock controllers are added for SM6350, and
the CCI interface is enabled on the Fairphone FP4.
On SM8350 the interconnect reference of SDHCI controller is corrected,
DSI1 PHY clocks are properly described as sources for the Display clock
controller and DSI1 is wired up to the display controller.
The firmware paths are corrected for the Sony Xperia Nagara platform.
The GPR bus, audio servic3es and LPASS pinctrl nodes are added for the
SM8550 platform. Additionally a few small typos/errors are corrected.
gpio-ranges are corrected across MSM8953, SM6115 and SC8280XP and a
range of DT validation issues are corrected.
* tag 'qcom-arm64-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (81 commits)
arm64: dts: qcom: sc7280: Power herobrine's 3.3 eDP/TS rail more properly
arm64: dts: qcom: pmk8550: fix PON compatible
arm64: dts: qcom: sm8550: fix DSI controller compatible
arm64: dts: qcom: sc7280: Hook up the touchscreen IO rail on evoker
arm64: dts: qcom: sc7280: Hook up the touchscreen IO rail on villager
arm64: dts: qcom: sc7280: Add 3ms ramp to herobrine's pp3300_left_in_mlb
arm64: dts: qcom: sc7280: On QCard, regulator L3C should be 1.8V
arm64: dts: qcom: sc8280xp: correct LPASS GPIO gpio-ranges
arm64: dts: qcom: msm8992-lg-bullhead: Enable regulators
arm64: dts: qcom: sm6115: correct TLMM gpio-ranges
arm64: dts: qcom: msm8953: correct TLMM gpio-ranges
arm64: dts: qcom: msm8992-lg-bullhead: Correct memory overlaps with the SMEM and MPSS memory regions
arm64: dts: qcom: sm8350-hdk: correct LT9611 pin function
arm64: dts: qcom: sm8350-hdk: align pin config node names with bindings
arm64: dts: qcom: sm6350: Use specific qmpphy compatible
arm64: dts: qcom: sm6115: Add smp2p nodes
arm64: dts: qcom: sm7225-fairphone-fp4: Enable CCI busses
arm64: dts: qcom: sm6350: Add CCI nodes
arm64: dts: qcom: sm6350: Add camera clock controller
dt-bindings: clock: add QCOM SM6350 camera clock bindings
...
Link: https://lore.kernel.org/r/20230210192908.2039976-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/dt
More Qualcomm ARM32 DTS updates for 6.3
This adds backlight, notification LED, vibrator, volume keys and hall
sensor to the OnePlus One, and provides a range of Devicetree validation
fixes across various platforms.
* tag 'qcom-dts-for-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (22 commits)
ARM: dts: qcom: align OPP table names with DT schema
ARM: dts: qcom: msm8974-oneplus-bacon: Add notification LED
ARM: dts: qcom: msm8974-oneplus-bacon: Add backlight
ARM: dts: qcom: msm8974-oneplus-bacon: Add volume keys and hall sensor
ARM: dts: qcom: msm8974-oneplus-bacon: Add vibrator
ARM: dts: qcom: pm8941: Add vibrator node
ARM: dts: qcom: sdx55: correct TLMM gpio-ranges
dt-bindings: arm: qcom: add the sa8775p-ride board
ARM: dts: qcom: apq8064: add second DSI host and PHY
ARM: dts: qcom: apq8060-dragonboard: align MPP pin node names with DT schema
dt-bindings: arm: qcom: Add Xiaomi Mi Pad 5 Pro (xiaomi-elish)
ARM: dts: qcom-sdx65: align RPMh regulator nodes with bindings
ARM: dts: qcom-sdx55: align RPMh regulator nodes with bindings
ARM: dts: qcom: use "okay" for status
ARM: dts: qcom: sdx65: Add Qcom SMMU-500 as the fallback for IOMMU node
ARM: dts: qcom: sdx55: Add Qcom SMMU-500 as the fallback for IOMMU node
ARM: dts: qcom: apq8064: use hdmi_phy for the MMCC's hdmipll clock
ARM: dts: qcom: apq8064: add #clock-cells to the HDMI PHY node
ARM: dts: qcom: ipq8064: move reg-less nodes outside soc node
dt-bindings: qcom: Document msm8916-thwc-uf896 and ufi001c
...
Link: https://lore.kernel.org/r/20230210185846.2032601-1-andersson@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into arm/dt
Samsung DTS ARM changes for v6.3, part two
Several cleanups pointed out by `make dtbs_check`:
1. Align LED status node name with bindings.
2. Drop redundant properties.
3. Move i2c-gpio node out of soc to top-level, as soc node is expected
to have only MMIO nodes.
4. Correct SPI NOR flash compatible in SMDK5250 and SMDKv310.
5. Align GPIO property names in WM1811-family codec nodes with bindings.
6. Correct MAX98090 codec DAI cells in Snow.
* tag 'samsung-dt-6.3-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
ARM: dts: exynos: correct max98090 DAI argument in Snow
ARM: dts: s5pv210: add "gpios" suffix to wlf,ldo1ena on Aries
ARM: dts: exynos: add "gpios" suffix to wlf,ldo1ena on Arndale
ARM: dts: exynos: add "gpios" suffix to wlf,ldo1ena on Midas
ARM: dts: exynos: correct SPI nor compatible in SMDK5250
ARM: dts: exynos: correct SPI nor compatible in SMDKv310
ARM: dts: exynos: move I2C10 out of soc node on Arndale
ARM: dts: exynos: drop redundant address/size cells from I2C10 on Arndale
ARM: dts: exynos: drop default status from I2C10 on Arndale
ARM: dts: exynos: align status led name with bindings on Origen4210
Link: https://lore.kernel.org/r/20230211113103.58894-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
It is reported that amd_pmf driver is missing "depends on" for
CONFIG_POWER_SUPPLY causing the following build error.
ld: drivers/platform/x86/amd/pmf/core.o: in function `amd_pmf_remove':
core.c:(.text+0x10): undefined reference to `power_supply_unreg_notifier'
ld: drivers/platform/x86/amd/pmf/core.o: in function `amd_pmf_probe':
core.c:(.text+0x38f): undefined reference to `power_supply_reg_notifier'
make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
make: *** [Makefile:1248: vmlinux] Error 2
Add this to the Kconfig file.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217028
Fixes: c5258d39fc4c ("platform/x86/amd/pmf: Add helper routine to update SPS thermals")
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Link: https://lore.kernel.org/r/20230213121457.1764463-1-Shyam-sundar.S-k@amd.com
Cc: stable@vger.kernel.org
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
The only caller to get_kernel_pages() [shm_get_kernel_pages()] has been
updated to not need it.
Remove get_kernel_pages().
Cc: Mel Gorman <mgorman@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Andrew Morton <akpm@linux-foudation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
|
|
The kernel pages used by shm_get_kernel_pages() are allocated using
GFP_KERNEL through the following call stack:
trusted_instantiate()
trusted_payload_alloc() -> GFP_KERNEL
<trusted key op>
tee_shm_register_kernel_buf()
register_shm_helper()
shm_get_kernel_pages()
Where <trusted key op> is one of:
trusted_key_unseal()
trusted_key_get_random()
trusted_key_seal()
Because the pages can't be from highmem get_kernel_pages() boils down to
a get_page() call.
Remove the get_kernel_pages() call and open code the get_page().
In case a highmem page does slip through warn on once for a kmap'ed
address.
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
|
|
The kernel pages used by shm_get_kernel_pages() are allocated using
GFP_KERNEL through the following call stack:
trusted_instantiate()
trusted_payload_alloc() -> GFP_KERNEL
<trusted key op>
tee_shm_register_kernel_buf()
register_shm_helper()
shm_get_kernel_pages()
Where <trusted key op> is one of:
trusted_key_unseal()
trusted_key_get_random()
trusted_key_seal()
Remove the vmalloc page support from shm_get_kernel_pages(). Replace
with a warn on once.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
|
|
is_kmap_addr() is only looking at the kmap() address range which may
cause check_heap_object() to miss checking an overflow on a
kmap_local_page() page.
Add a check for the kmap_local_page() address range to is_kmap_addr().
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Andrew Morton <akpm@linux-foudation.org>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
|
|
Add CLOCK_EVT_FEAT_DYNIRQ to allow the IRQ could be runtime set affinity
to the cores that needs wake up, otherwise saying core0 has to send
IPI to wakeup core1. With CLOCK_EVT_FEAT_DYNIRQ set, when broadcast
timer could wake up the cores, IPI is not needed.
After enabling this feature, especially the scene where cpuidle is
enabled can benefit.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Link: https://lore.kernel.org/r/20230209040239.24710-1-frank.li@vivo.com
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
The comment in the remove callback suggests that the driver is not
supposed to be unbound. However returning an error code in the remove
callback doesn't accomplish that. Instead set the suppress_bind_attrs
property (which makes it impossible to unbind the driver via sysfs).
The only remaining way to unbind a em_sti device would be module
unloading, but that doesn't apply here, as the driver cannot be built as
a module.
Also drop the useless remove callback.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230207193010.469495-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
The comment in the remove callback suggests that the driver is not
supposed to be unbound. However returning an error code in the remove
callback doesn't accomplish that. Instead set the suppress_bind_attrs
property (which makes it impossible to unbind the driver via sysfs).
The only remaining way to unbind a sh_tmu device would be module
unloading, but that doesn't apply here, as the driver cannot be built as
a module.
Also drop the useless remove callback.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230207193614.472060-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|
|
A static key is used to select between SBI and Sstc timer usage in
riscv_clock_next_event(), but currently the direction is resolved
after cpuhp_setup_state() is called (which sets the next event). The
first event will therefore fall through the sbi_set_timer() path; this
breaks Sstc-only systems. So, apply the jump patching before first
use.
Fixes: 9f7a8ff6391f ("RISC-V: Prefer sstc extension if available")
Signed-off-by: Matt Evans <mev@rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/CDDAB2D0-264E-42F3-8E31-BA210BEB8EC1@rivosinc.com
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
|