After calling btrfs_search_slot it is common practice to check whether
the slot found is beyond the number of items in the current leaf, and if
so, to continue the search for the same key in the next leaf by calling
btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job.
Calling btrfs_next_item in the same situation would end up in the same
code flow, since
* btrfs_next_item
* btrfs_next_old_item
* if slot >= nritems(curr_leaf)
btrfs_next_old_leaf
Change btrfs_verify_dev_extents and calculate_emulated_zone_size
functions to use btrfs_next_leaf in the same situation.
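In code, the common pattern reads roughly as follows (a minimal sketch
with illustrative error handling, not a quote of either changed
function):

  ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
  if (ret < 0)
      goto out;
  /* Slot is past the last item: continue the search in the next leaf. */
  if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
      ret = btrfs_next_leaf(root, path);
      if (ret != 0)   /* < 0 is an error, > 0 means no more leaves */
          goto out;
  }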
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently all the callers of btrfs_find_all_roots() pass a value of false
for its ignore_offset argument. This makes the argument pointless and we
can remove it and make btrfs_find_all_roots() always pass false as the
ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During a fast fsync, if we have already fsynced the file before and in the
current transaction, we can make the inode item update more efficient and
avoid acquiring a write lock on the leaf's parent.
To update the inode item we are always using btrfs_insert_empty_item() to
get a path pointing to the inode item, which calls btrfs_search_slot()
with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) +
sizeof(struct btrfs_item)', and that always results in the search taking
a write lock on the level 1 node that is the parent of the leaf that
contains the inode item. This adds unnecessary lock contention on log
trees when we have multiple fsyncs in parallel against inodes in the same
subvolume, which has a very significant impact due to the fact that log
trees are short lived and their height very rarely goes beyond level 2.
Also, by using btrfs_insert_empty_item() when we need to update the
inode item, we end up splitting the leaf of the existing inode item when
the leaf has less free space than the size of an inode item.
Improve this by using btrfs_search_slot(), with a 0 "ins_len" argument,
when we know the inode item already exists in the log. This avoids these
two inefficiencies.
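As a sketch, assuming the inode item is known to already be in the log
tree (the function name and error handling here are illustrative, not
the actual patch):

  static int search_existing_inode_item(struct btrfs_trans_handle *trans,
                                        struct btrfs_root *log,
                                        const struct btrfs_key *key,
                                        struct btrfs_path *path)
  {
      int ret;

      /*
       * ins_len == 0: no space is reserved, so no leaf split can happen
       * and, when the leaf was already COWed in the current log
       * transaction, the search can avoid write locking its parent.
       * cow == 1: we still intend to modify the leaf.
       */
      ret = btrfs_search_slot(trans, log, key, path, 0, 1);
      if (ret > 0)
          ret = -ENOENT;  /* should not happen if the item was logged */
      return ret;
  }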
The following script, using fio, was used to perform the tests:
$ cat fio-test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 4 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
BLOCK_SIZE=$4
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=$BLOCK_SIZE
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
echo "mount options: $MOUNT_OPTIONS"
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were done on a physical machine with 12 cores, 64G of RAM,
using an NVMe device and using a non-debug kernel config (the default one
from Debian). The summary line from fio is provided below for each test
run.
With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec
After: WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec
+1.4% throughput, -1.2% runtime
With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec
After: WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec
+2.2% throughput, -2.1% runtime
The changes are small, but the gains may be larger on faster hardware, as
on the test machine disk utilization was pretty much 100% during the
whole time the tests were running (observed with 'iostat -xz 1').
The tests also included the previous patch with the subject of:
"btrfs: avoid unnecessary log mutex contention when syncing log".
So they compared a branch without that patch and without this patch versus
a branch with these two patches applied.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
One of the last steps of syncing the log is to remove all log contexts
from the root's list of contexts, done at btrfs_remove_all_log_ctxs().
There we iterate over all the contexts in the list and delete each one
from the list, and after that we call INIT_LIST_HEAD() on the list. That
is unnecessary since at that point the list is empty.
So just remove the INIT_LIST_HEAD() call. It's not needed, it increases
code size (bloat-o-meter reported a delta of -122 for btrfs_sync_log()
after this change) and it lengthens two critical sections delimited by
log mutexes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When syncing the log we acquire the root's log mutex just to update the
root's last_log_commit. This is unnecessary because:
1) At this point there can only be one task updating this value, which is
the task committing the current log transaction. Any task that enters
btrfs_sync_log() has to wait for the previous log transaction to commit
and wait for the current log transaction to commit if someone else
already started it (in this case it never reaches the point of
updating last_log_commit, as that is done by the committing task);
2) None of the readers of the root's last_log_commit acquire the root's
log mutex. This is to avoid blocking the readers, potentially for too
long, and because getting a stale value of last_log_commit does not
cause any functional problem; in the worst case a stale value
results in logging an inode unnecessarily. Plus it's actually very
rare to get a stale value that results in unnecessarily logging the
inode.
So in order to avoid unnecessary contention on the root's log mutex,
which is used for several different purposes, like starting/joining a
log transaction and starting writeback of a log transaction, stop
acquiring the log mutex for updating the root's last_log_commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When using the NO_HOLES feature and expanding the size of an inode, we
update the inode's last_trans, last_sub_trans and last_log_commit fields
at maybe_insert_hole() so that an fsync knows that the inode needs to
be logged (by making sure that btrfs_inode_in_log() returns false). This
happens for expanding truncate operations, buffered writes, direct IO
writes and when cloning extents to an offset greater than the inode's
i_size.
However the way we do it is racy, because in between setting the inode's
last_sub_trans and last_log_commit fields, the log transaction ID that was
assigned to last_sub_trans might be committed before we read the root's
last_log_commit and assign that value to last_log_commit. If that happens
it would make a future call to btrfs_inode_in_log() return true. This is
a race that should be extremely unlikely to be hit in practice, and it is
the same that was described by commit bc0939fcfab0d7 ("btrfs: fix race
between marking inode needs to be logged and log syncing").
The fix would simply be to set last_log_commit to the value we assigned
to last_sub_trans minus 1, like it was done in that commit. However
updating these two fields plus the last_trans field is pointless here
because all the callers of btrfs_cont_expand() (which is the only
caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
next fsync will log the inode, as it makes btrfs_inode_in_log() return
false.
So just remove the code that explicitly sets the inode's last_trans,
last_sub_trans and last_log_commit fields.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In commit 351cbf6e4410e7 ("btrfs: use nofs allocations for running delayed
items") we wrapped all btree updates when running delayed items with
memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
detected by lockdep involving reclaim and the mutex of delayed nodes.
The problem is that the ref verify tool does some memory allocations
with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
eviction, which requires locking the mutex of an inode's delayed node.
On the other hand the ref verify tool is called when allocating metadata
extents as part of operations that modify a btree, which is a problem when
running delayed nodes, where we do btree updates while holding the mutex
of a delayed node. This is what caused the lockdep warning.
Instead of wrapping every btree update when running delayed nodes, change
the ref verify tool to never do GFP_KERNEL allocations, because:
1) We get less repeated code, which at the moment does not even have a
comment mentioning why we need to set up the NOFS context, which is a
recommended good practice as mentioned at
Documentation/core-api/gfp_mask-from-fs-io.rst
2) The ref verify tool is something meant only for debugging and not
something that should be enabled on non-debug / non-development
kernels;
3) We may have yet more places outside delayed-inode.c with a similar
problem: doing btree updates while holding some lock and then having
the GFP_KERNEL memory allocations from the ref verify tool trigger
reclaim, which tries to acquire the same lock again through the
reclaim path.
Or we could get more such cases in the future; this change therefore
prevents running into similar problems when using the ref verify tool.
Curiously most of the memory allocations done by the ref verify tool
were already using GFP_NOFS, except a few, for no apparent reason.
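For illustration, the two approaches contrasted above look roughly like
this (do_btree_updates() is a hypothetical stand-in; struct block_entry
is one of the ref-verify structures):

  #include <linux/sched/mm.h>
  #include <linux/slab.h>

  static int do_btree_updates(void);  /* hypothetical helper */

  /* The pattern being removed: wrap a whole section in a NOFS context
   * so that every allocation inside implicitly behaves like GFP_NOFS. */
  static int run_updates_with_nofs_context(void)
  {
      unsigned int nofs_flag = memalloc_nofs_save();
      int ret = do_btree_updates();

      memalloc_nofs_restore(nofs_flag);
      return ret;
  }

  /* The approach taken instead: ref verify allocates with GFP_NOFS
   * directly, so reclaim can never recurse into the filesystem. */
  static struct block_entry *alloc_block_entry(void)
  {
      return kzalloc(sizeof(struct block_entry), GFP_NOFS);
  }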
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we insert the delayed items of an inode, which corresponds to the
directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
do the following:
1) Pick the first delayed item from the rbtree and insert it into the
fs/subvolume btree, using btrfs_insert_empty_item() for that;
2) Without releasing the path returned by btrfs_insert_empty_item(),
keep collecting as many consecutive delayed items from the rbtree
as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
immediate successor of the previously picked item and as long as
they fit in the available space of the leaf the path points to;
3) Then insert all the collected items into the leaf;
4) Release the reserve metadata space for each collected item and
release each item (implies deleting from the rbtree);
5) Unlock the path.
While this is much better than inserting items one by one, it can be
improved in a few aspects:
1) Instead of adding items based on the remaining free space of the
leaf, collect as many items as can fit in a leaf and bulk insert
them. This results in fewer and larger batches, reducing the total
amount of time to insert the delayed items. For example, when adding
100K files to a directory, we ended up creating 1658 batches with
highly variable sizes, ranging from 1 item to 118 items, on a filesystem
with a node/leaf size of 16K. After this change, we end up with 839
batches, with the vast majority of them having exactly 120 items;
2) We do the search for more items to batch by iterating the rbtree
while holding a write lock on the leaf;
3) While still holding the leaf locked, we are releasing the reserved
metadata for each item and then deleting each item, keeping a write
lock on the leaf for longer than necessary. Releasing the delayed items
one by one can take a significant amount of time, because deleting
them from the rbtree can often be a bit slow when the deletion results
in rebalancing the rbtree.
So change this so that we try to create larger batches, with a total
item size up to the maximum a leaf can support, and by unlocking the leaf
immediately after inserting the items, releasing the reserved metadata
space of each item and releasing each item without holding the write lock
on the leaf.
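The collection step can be pictured with this rough sketch; every name
except BTRFS_LEAF_DATA_SIZE() and struct btrfs_item is a hypothetical
stand-in for the real delayed-item types:

  struct delayed_item {              /* stand-in for the real type */
      u32 data_len;
      struct delayed_item *next;     /* next item in index-key order */
  };

  static int collect_batch(struct btrfs_fs_info *fs_info,
                           struct delayed_item *first,
                           struct delayed_item **batch, int max_items)
  {
      const u32 leaf_space = BTRFS_LEAF_DATA_SIZE(fs_info);
      struct delayed_item *item;
      u32 total = 0;
      int nr = 0;

      for (item = first; item && nr < max_items; item = item->next) {
          u32 size = item->data_len + sizeof(struct btrfs_item);

          /* Stop when the batch would overflow an empty leaf, not
           * merely the free space left in the current leaf. */
          if (total + size > leaf_space)
              break;
          total += size;
          batch[nr++] = item;
      }
      return nr;
  }

The insertion itself then happens in one go, after which the leaf is
unlocked before the reserved metadata is released and the items freed.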
The following script that runs fs_mark was used to test this change:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
FILES=1000000
THREADS=16
FILE_SIZE=0
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
It was run on a machine with 12 cores, 64G of RAM, using an NVMe device
and using a non-debug kernel config (Debian's default config).
Results before this change:
FSUse% Count Size Files/sec App Overhead
1 16000000 0 76182.1 72223046
3 32000000 0 62746.9 80776528
5 48000000 0 77029.0 93022381
6 64000000 0 73691.6 95251075
8 80000000 0 66288.0 85089634
Results after this change:
FSUse% Count Size Files/sec App Overhead
1 16000000 0 79049.5 (+3.7%) 69700824
3 32000000 0 65248.9 (+3.9%) 80583693
5 48000000 0 77991.4 (+1.2%) 90040908
6 64000000 0 75096.8 (+1.9%) 89862241
8 80000000 0 66926.8 (+1.0%) 84429169
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the extent tree gets corrupted, normally it's not the extent tree
root but one toasted tree leaf/node.
In that case, the rescue=ibadroots mount option won't help, as it can
only handle extent tree root corruption.
This patch will enhance the behavior by:
- Allow fill_dummy_bgs() to ignore -EEXIST error
This means we may have some block group items read from disk, but
then hit some error halfway.
- Fallback to fill_dummy_bgs() if any error gets hit in
btrfs_read_block_groups()
Of course, this still needs the rescue=ibadroots mount option.
With that, rescue=ibadroots can handle extent tree corruption more
gracefully and allow a better chance of recovery.
Reported-by: Zhenyu Wu <wuzy001@gmail.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html
Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Passing a transaction to btrfs_search_slot is only useful when we are
searching in order to add or modify the tree. When the function is used
purely for searching, the insert length and mod arguments are 0 and
there is no need to pass a transaction.
No functional changes, changing for consistency.
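To make the distinction concrete, a read-only lookup takes the following
shape (key values are illustrative):

  struct btrfs_key key;

  key.objectid = objectid;          /* illustrative values */
  key.type = BTRFS_DIR_INDEX_KEY;
  key.offset = 0;
  /* trans == NULL, ins_len == 0, cow == 0: a pure search */
  ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);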
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At reada_for_search(), when attempting to readahead a node or leaf's
siblings, we skip the readahead of the siblings if the node/leaf is
already in memory. That is probably fine for the READA_FORWARD and
READA_BACK readahead types, as they are used in contexts where we
end up reading some consecutive leaves, but usually not the whole btree.
However for a READA_FORWARD_ALWAYS mode, currently only used for full
send operations, it does not make sense to skip the readahead if the
target node or leaf is already loaded in memory, since we know the caller
is visiting every node and leaf of the btree in ascending order.
So change the behaviour to not skip the readahead when the target node is
already in memory and the readahead mode is READA_FORWARD_ALWAYS.
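A sketch of the adjusted check in reada_for_search(), simplified from
context rather than quoted from the diff:

  eb = find_extent_buffer(fs_info, search);
  if (eb) {
      free_extent_buffer(eb);
      /* Already cached: only worth skipping the readahead when we
       * are not going to visit every node anyway. */
      if (path->reada != READA_FORWARD_ALWAYS)
          return;
  }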
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk, with 32GiB of RAM and
using a non-debug kernel config (Debian's default config).
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
file_count=2000000
add_files $file_count 0 4
echo
echo "Creating snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
The durations of the full send operations, in seconds, were the following:
Before this change: 85 seconds
After this change: 76 seconds (-11.2%)
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The pages in block_ctx have never been allocated from highmem (in
btrfsic_read_block) so the mapping is pointless and can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The pages in compressed_pages are not from highmem anymore so we can
drop the mapping for checksum calculation and inline extent.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.
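In essence the transformation is (a generic before/after sketch, not a
specific call site):

  /* before: needed only while the page could come from highmem */
  kaddr = kmap(page);
  memcpy(buf, kaddr + offset, len);
  kunmap(page);

  /* after: lowmem pages are always directly addressable */
  kaddr = page_address(page);
  memcpy(buf, kaddr + offset, len);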
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The highmem flag is used for allocating pages for compression and for
raid56 pages. The high memory makes sense on 32bit systems but is not
without problems. On 64bit systems it's just another layer of wrappers.
The time the pages are allocated for compression or raid56 is relatively
short (about a transaction commit), so the pages are not blocked
indefinitely. As the number of pages depends on the amount of data being
written/read, there's a theoretical problem. A fast device on a 32bit
system could use most of the low memory pool, while with the highmem
allocation that would not happen. This was possibly the original idea
a long time ago, but nowadays we optimize for 64bit systems.
This patch removes all usage of the __GFP_HIGHMEM flag for page
allocation, the kmap/kunmap are still in place and will be removed in
followup patches. What remains is masking out the bit in
alloc_extent_state and __lookup_free_space_inode; that can safely stay.
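The core of the change is mechanical, along these lines (generic sketch):

  /* before */
  page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);

  /* after: allocate from the normal, directly mapped zone */
  page = alloc_page(GFP_NOFS);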
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Drop the variable 'devices' (used only once) and add a new variable for
the fs_devices, so it is used at two locations within the btrfs_trim_fs()
function and also helps to access fs_devices->devices.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Both callers use btrfs_header_nritems to feed the max argument. Remove
the argument and let generic_bin_search call it itself.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
One of the final things that must be done to add a new chunk is
inserting its device extent items in the device tree. They describe
the portion of allocated device physical space during phase 1 of
chunk allocation. This is currently done in btrfs_finish_chunk_alloc
whose name isn't very informative. What's more, this function is only
used in block-group.c but is defined as public. There isn't anything
special about it that would warrant it being defined in volumes.c.
Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to
block-group.c, make the former static and rename both functions to
insert_dev_extents and insert_dev_extent respectively.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function prototypes below aren't necessary, as the functions are
defined before they are called. Remove them.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On 64K pages the size of the extent_buffer::pages array is 1 and
compilation with -Warray-bounds warns due to
kaddr = page_address(eb->pages[idx + 1]);
when reading a byte range crossing a page boundary.
This never actually overflows the array, because on 64K pages all the
data fits in one page and the bounds are checked by check_setget_bounds.
To fix the reported overflows and warnings add a compile-time condition
that will allow the compiler to eliminate the dead code that reads from the
idx + 1 page.
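A hedged sketch of the idea, as a hypothetical helper rather than the
actual setget code: when INLINE_EXTENT_BUFFER_PAGES is 1, the constant
condition lets the compiler discard the cross-page branch entirely:

  static u64 read_u64_across_pages(const struct extent_buffer *eb,
                                   unsigned long idx, size_t off)
  {
      char *kaddr = page_address(eb->pages[idx]);
      u64 val;

      if (INLINE_EXTENT_BUFFER_PAGES == 1 ||
          off + sizeof(val) <= PAGE_SIZE) {
          /* 64K pages always take this branch; the compiler can
           * prove the else arm dead and drop the idx + 1 access. */
          memcpy(&val, kaddr + off, sizeof(val));
      } else {
          size_t part = PAGE_SIZE - off;

          memcpy(&val, kaddr + off, part);
          kaddr = page_address(eb->pages[idx + 1]);
          memcpy((char *)&val + part, kaddr, sizeof(val) - part);
      }
      return val;
  }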
Link: https://lore.kernel.org/lkml/20210623083901.1d49d19d@canb.auug.org.au/
CC: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There used to be a patch in the original series for zoned support which
limited the extent size to max_zone_append_size, but this patch has been
dropped somewhere around v9.
We've decided to go in the opposite direction: instead of limiting
extents in the first place, we split them before submission to comply
with the device's limits.
Remove the related code, btrfs_fs_info::max_zone_append_size and
btrfs_zoned_device_info::max_zone_append_size.
This also removes the workaround for dm-crypt introduced in
1d68128c107a ("btrfs: zoned: fail mount if the device does not support
zone append") because the fix has been merged as f34ee1dce642 ("dm
crypt: Fix zoned block device support").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This patch changes the data type of the variable 'val' from
int to u32.
Addresses-Coverity: argument of type "int *" is incompatible with parameter of type "u32 *"
Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/925cebbe4eb73c7d0a536da204748d33c7100d8c.1624448778.git.michal.simek@xilinx.com
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
|
|
According to the Armada XP datasheet, the bit at position 0 corresponds
to the TxInProg indication.
Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit")
Signed-off-by: Maxim Kiselev <bigunclemax@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When taprio offload is not enabled, the error handling path causes a
kernel crash due to a NULL pointer dereference.
Fix this by adding a check for NULL before attempting to access
'plat->est' in the mutex_lock() call.
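The shape of the fix, with the field names taken from the message and
the surrounding error path assumed:

  /* Offload was never configured: nothing to tear down, and
   * dereferencing plat->est would crash. */
  if (!priv->plat->est)
      return ret;

  mutex_lock(&priv->plat->est->lock);
  /* ... disable and tear down the EST state ... */
  mutex_unlock(&priv->plat->est->lock);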
The following kernel panic is observed without this patch:
RIP: 0010:mutex_lock+0x10/0x20
Call Trace:
tc_setup_taprio+0x482/0x560 [stmmac]
kmem_cache_alloc_trace+0x13f/0x490
taprio_disable_offload.isra.0+0x9d/0x180 [sch_taprio]
taprio_destroy+0x6c/0x100 [sch_taprio]
qdisc_create+0x2e5/0x4f0
tc_modify_qdisc+0x126/0x740
rtnetlink_rcv_msg+0x12b/0x380
_raw_spin_lock_irqsave+0x19/0x40
_raw_spin_unlock_irqrestore+0x18/0x30
create_object+0x212/0x340
rtnl_calcit.isra.0+0x110/0x110
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x191/0x230
netlink_sendmsg+0x243/0x470
sock_sendmsg+0x5e/0x60
____sys_sendmsg+0x20b/0x280
copy_msghdr_from_user+0x5c/0x90
__mod_memcg_state+0x87/0xf0
___sys_sendmsg+0x7c/0xc0
lru_cache_add+0x7f/0xa0
_raw_spin_unlock+0x16/0x30
wp_page_copy+0x449/0x890
handle_mm_fault+0x921/0xfc0
__sys_sendmsg+0x59/0xa0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace b1f19b24368a96aa ]---
Fixes: b60189e0392f ("net: stmmac: Integrate EST with TAPRIO scheduler API")
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-08-20
This series contains updates to igc and e1000e drivers.
Aaron Ma resolves a page fault which occurs when thunderbolt is
unplugged for igc.
Toshiki Nishioka fixes Tx queue looping to use actual number of queues
instead of max value for igc.
Sasha fixes an incorrect latency comparison by decoding the values before
comparing and prevents attempted writes to read-only NVMs for e1000e.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A successful 'xge_mdio_config()' call should be balanced by a corresponding
'xge_mdio_remove()' call in the error handling path of the probe, as
already done in the remove function.
Update the error handling path accordingly.
Fixes: ea8ab16ab225 ("drivers: net: xgene-v2: Add MDIO support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.
I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.
This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
The helper to send IRQ notifications for regulator errors still had an
old description mentioning calling BUG() as a last resort when error
status reading has kept failing more times than a given threshold.
The implementation calling BUG() never ended up in-tree but was replaced
by a hopefully more sophisticated handler that tries to power off the
system.
Fix the documentation to reflect the actual behaviour.
Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
Link: https://lore.kernel.org/r/20210823075651.GA3717293@localhost.localdomain
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
* irq/qcom-pdc-nowake-cleanup:
: Fix the QCOM PDC mishandling of the interrupt hierarchy by trimming
: it when necessary.
: Export irq_domain_disconnect_hierarchy as a consequence of it.
irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
irqdomain: Export irq_domain_disconnect_hierarchy()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The QCOM PDC driver creates a bunch of unnecessary levels in
the interrupt hierarchy when dealing with non-wakeup-capable
interrupts. By definition, these lines are terminated at the
PDC level, and everything below this is completely fake.
This also results in additional complexity as most of the
callbacks have to check for the validity of the parent level.
Needless to say, this doesn't look very good.
Solve this by disconnecting the interrupt hierarchy below
the last valid level, and considerably simplify the handling
of all the other interrupts by avoiding now-unnecessary checks.
In most cases, the standard irq_*_parent() handlers are directly
used.
This also cures an issue reported by Maulik where gpio_to_irq()
returns an error after having observed a set of invalid levels.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Tested-by: Maulik Shah <mkshah@codeaurora.org>
Link: https://lore.kernel.org/r/1629705880-27877-3-git-send-email-mkshah@codeaurora.org
|
|
Export irq_domain_disconnect_hierarchy() so irqchip module drivers
can use it.
Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1629705880-27877-2-git-send-email-mkshah@codeaurora.org
|
|
Commit 4fa82a87ba55 ("opp: Allow required-opps to be used for non genpd
use cases") dereferences the pointers in required_opp_tables but these
might be set to an ERR_PTR if the list still has lazy links pending,
resulting in segfaults. Prior to this patch IS_ERR was also checked on
required_opp_tables[i] before reading ->is_genpd inside
_opp_table_alloc_required_tables, which is at the same time the
predicate to add this table to the lazy list. This segfault is solved
by reordering the checks to bail on lazy pending tables before reading
->is_genpd.
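A sketch of the reordered loop body; the field names follow the message,
while the error handling and surrounding function are assumed:

  for (i = 0; i < opp_table->required_opp_count; i++) {
      /* Bail before dereferencing: lazily linked entries are still
       * ERR_PTR() placeholders at this point. */
      if (IS_ERR(opp_table->required_opp_tables[i]))
          return -EPROBE_DEFER;

      if (opp_table->required_opp_tables[i]->is_genpd) {
          /* ... genpd-specific handling ... */
      }
  }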
Fixes: 4fa82a87ba55 ("opp: Allow required-opps to be used for non genpd use cases")
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
Signed-off-by: Marijn Suijten <marijn.suijten@somainline.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
Mark die() as a function which accepts printf-style arguments so that
the compiler can typecheck them against the supplied format string.
Use the C99 inttypes.h format specifiers as relocs.c gets built for both
32- and 64-bit.
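Presumably the annotation is the standard GCC/Clang format attribute,
along these lines:

  #include <inttypes.h>

  /* Let the compiler typecheck arguments against the format string. */
  void die(const char *fmt, ...) __attribute__((format(printf, 1, 2)));

  /* A mismatch, such as passing a 32-bit value for a "%" PRIu64
   * format, is then flagged at build time on both 32- and 64-bit. */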
Original version of the patch by Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/YNnb6Q4QHtNYC049@zn.tnic
|
|
Validate csum_start in gre_handle_offloads before we call _gre_xmit so
that we do not crash later when the csum_start value is used in the
lco_csum function call.
This patch deals with ipv6 code.
Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
Reported-by: syzbot+ff8e1b9f2f36481e2efc@syzkaller.appspotmail.com
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Validate csum_start in gre_handle_offloads before we call _gre_xmit so
that we do not crash later when the csum_start value is used in the
lco_csum function call.
This patch deals with ipv4 code.
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Reported-by: syzbot+ff8e1b9f2f36481e2efc@syzkaller.appspotmail.com
Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix random crashes on some 32-bit CPUs by adding isync() after
locking/unlocking KUEP
- Fix intermittent crashes when loading modules with strict module RWX
- Fix a section mismatch introduced by a previous fix.
Thanks to Christophe Leroy, Fabiano Rosas, Laurent Vivier, Murilo
Opsfelder Araújo, Nathan Chancellor, and Stan Johnson.
* tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix set_memory_*() against concurrent accesses
powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
powerpc/xive: Do not mark xive_request_ipi() as __init
|
|
cc-option, __cc-option, cc-option-yn, and cc-disable-warning all invoke
the compiler during build time, and can slow down the build when these
checks become stale for our supported compilers, whose minimum
supported versions increase over time.
See Documentation/process/changes.rst for the current supported minimal
versions (GCC 4.9+, clang 10.0.1+). Compiler version support for these
flags may be verified on godbolt.org.
The following flags are supported by all supported versions of GCC and
Clang. Remove their cc-option, __cc-option, and cc-option-yn tests.
-Wno-address-of-packed-member
-mno-avx
-m32
-mno-80387
-march=k8
-march=nocona
-march=core2
-march=atom
-mtune=generic
-mfentry
[ mingo: Fixed regression on GCC, via partial revert of the stack-boundary changes. ]
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1436
Link: https://lkml.kernel.org/r/20210812183848.1519994-1-ndesaulniers@google.com
|
|
The recent commit
064855a69003 ("x86/resctrl: Fix default monitoring groups reporting")
caused a RHEL build failure with an uninitialized variable warning
treated as an error because it removed the default case snippet.
The RHEL Makefile uses '-Werror=maybe-uninitialized' to force possibly
uninitialized variable warnings to be treated as errors. This is also
reported by smatch via the 0day robot.
The error from the RHEL build is:
arch/x86/kernel/cpu/resctrl/monitor.c: In function ‘__mon_event_count’:
arch/x86/kernel/cpu/resctrl/monitor.c:261:12: error: ‘m’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
m->chunks += chunks;
^~
The upstream Makefile does not build using '-Werror=maybe-uninitialized'.
So, the problem is not seen there. Fix the problem by putting back the
default case snippet.
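The restored snippet presumably has the usual shape of a defensive
default arm (a sketch, not the exact hunk):

  switch (rr->evtid) {
  /* ... the existing event cases that initialize 'm' ... */
  default:
      /* Never reached for a valid event id, but it lets the compiler
       * see that 'm' cannot be used uninitialized. */
      return RMID_VAL_ERROR;
  }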
[ bp: note that there's nothing wrong with the code and other compilers
do not trigger this warning - this is being done just so the RHEL compiler
is happy. ]
Fixes: 064855a69003 ("x86/resctrl: Fix default monitoring groups reporting")
Reported-by: Terry Bowman <Terry.Bowman@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/162949631908.23903.17090272726012848523.stgit@bmoger-ubuntu
|
|
There is an existing lock hierarchy of
&dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock
from the following call chain:
input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);
&dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled
while grabbing &fasync_struct.fa_lock, otherwise we invert the lock
hierarchy. However, since kill_fasync, which calls kill_fasync_rcu, is
an exported symbol, it may not necessarily be called with interrupts
disabled.
As kill_fasync_rcu may be called with interrupts disabled (for
example, in the call chain above), we replace calls to
read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu
with read_lock_irqsave/read_unlock_irqrestore.
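Reconstructed from the description, the function then looks roughly like
this (simplified; not the exact upstream hunk):

  static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
  {
      while (fa) {
          struct fown_struct *fown;
          unsigned long flags;

          /* irqsave: callers may arrive with interrupts enabled or
           * disabled, and fa_lock nests under HARDIRQ-safe locks. */
          read_lock_irqsave(&fa->fa_lock, flags);
          if (fa->fa_file) {
              fown = &fa->fa_file->f_owner;
              if (!(sig == SIGURG && fown->signum == 0))
                  send_sigio(fown, fa->fa_fd, band);
          }
          read_unlock_irqrestore(&fa->fa_lock, flags);
          fa = rcu_dereference(fa->fa_next);
      }
  }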
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
Syzbot reports a potential deadlock in do_fcntl:
========================================================
WARNING: possible irq lock inversion dependency detected
5.12.0-syzkaller #0 Not tainted
--------------------------------------------------------
syz-executor132/8391 just changed the state of lock:
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: f_getown_ex fs/fcntl.c:211 [inline]
ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: do_fcntl+0x8b4/0x1200 fs/fcntl.c:395
but this lock was taken by another, HARDIRQ-safe lock in the past:
(&dev->event_lock){-...}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&f->f_owner.lock);
                               local_irq_disable();
                               lock(&dev->event_lock);
                               lock(&new->fa_lock);
  <Interrupt>
    lock(&dev->event_lock);

 *** DEADLOCK ***
This happens because there is a lock hierarchy of
&dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
from the following call chain:
input_inject_event():
spin_lock_irqsave(&dev->event_lock,...);
input_handle_event():
input_pass_values():
input_to_handler():
evdev_events():
evdev_pass_values():
spin_lock(&client->buffer_lock);
__pass_event():
kill_fasync():
kill_fasync_rcu():
read_lock(&fa->fa_lock);
send_sigio():
read_lock_irqsave(&fown->lock,...);
However, since &dev->event_lock is HARDIRQ-safe, interrupts have to be
disabled while grabbing &f->f_owner.lock, otherwise we invert the lock
hierarchy.
Hence, we replace calls to read_lock/read_unlock on &f->f_owner.lock
with read_lock_irq/read_unlock_irq.
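For instance, an f_getown_ex()-style reader becomes something like this
(sketch, under the assumption of process context):

  struct f_owner_ex owner = { };

  /* These readers run in process context with interrupts enabled, so
   * the non-saving _irq variants are sufficient. */
  read_lock_irq(&filp->f_owner.lock);
  if (filp->f_owner.pid)
      owner.pid = pid_vnr(filp->f_owner.pid);
  read_unlock_irq(&filp->f_owner.lock);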
Reported-and-tested-by: syzbot+e6d5398a02c516ce5e70@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk driver fixes from Stephen Boyd:
- Make the regulator state match the GDSC power domain state at boot on
Qualcomm SoCs so that the regulator isn't turned off inadvertently.
- Fix earlycon on i.MX6Q SoCs
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gdsc: Ensure regulator init state matches GDSC state
clk: imx6q: fix uart earlycon unwork
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small driver fixes for 5.14-rc7.
They consist of:
- revert for an interconnect patch that was found to have problems
- ipack tpci200 driver fixes for reported problems
- slimbus messaging and ngd fixes for reported problems
All are small and have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
ipack: tpci200: fix memory leak in the tpci200_register
ipack: tpci200: fix many double free issues in tpci200_pci_probe
slimbus: ngd: reset dma setup during runtime pm
slimbus: ngd: set correct device for pm
slimbus: messaging: check for valid transaction id
slimbus: messaging: start transaction ids from 1 instead of zero
Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fix from Greg KH:
"Here is a single USB typec tcpm fix for a reported problem for
5.14-rc7. It showed up in 5.13 and resolves an issue that Hans found.
It has been in linux-next this week with no reported problems"
* tag 'usb-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: tcpm: Fix VDMs sometimes not being forwarded to alt-mode drivers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- fix the sifive-l2-cache device tree bindings for json-schema
compatibility. This does not change the intended behavior of the
binding.
- avoid improperly freeing necessary resources during early boot.
* tag 'riscv-for-linus-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix a number of free'd resources in init_resources()
dt-bindings: sifive-l2-cache: Fix 'select' matching
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fix from Vasily Gorbik:
- fix use after free of zpci_dev in pci code
* tag 's390-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: fix use after free of zpci_dev
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull mandatory file locking deprecation warning from Jeff Layton:
"As discussed on the list, this patch just adds a new warning for folks
who still have mandatory locking enabled and actually mount with '-o
mand'. I'd like to get this in for v5.14 so we can push this out into
stable kernels and hopefully reach folks who have mounts with -o mand.
For now, I'm operating under the assumption that we'll fully remove
this support in v5.15, but we can move that out if any legitimate
users of this facility speak up between now and then"
* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: warn about impending deprecation of mandatory locks
|
|
Commit
79419e13e808 ("x86/boot/compressed/64: Setup IDT in startup_32 boot path")
introduced an IDT into the 32-bit boot path of the decompressor stub.
But the IDT is set up before ExitBootServices() is called, and some UEFI
firmwares rely on their own IDT.
Save the firmware IDT on boot and restore it before calling into EFI
functions to fix boot failures introduced by above commit.
Fixes: 79419e13e808 ("x86/boot/compressed/64: Setup IDT in startup_32 boot path")
Reported-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: stable@vger.kernel.org # 5.13+
Link: https://lkml.kernel.org/r/20210820125703.32410-1-joro@8bytes.org
|