Age | Commit message (Collapse) | Author |
|
This adds a new watermark, higher priority than BCH_WATERMARK_reclaim,
for interior btree updates. We've seen a deadlock where journal replay
triggers a ton of btree node merges, and these use up all available open
buckets and then interior updates get stuck.
One cause of this is that we're currently lacking btree node merging on
write buffer btrees - that needs to be fixed as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Sign error when checking the watermark - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The UAF bug is due to smb2_reconnect_server() accessing a session that
is already being teared down by another thread that is executing
__cifs_put_smb_ses(). This can happen when (a) the client has
connection to the server but no session or (b) another thread ends up
setting @ses->ses_status again to something different than
SES_EXITING.
To fix this, we need to make sure to unconditionally set
@ses->ses_status to SES_EXITING and prevent any other threads from
setting a new status while we're still tearing it down.
The following can be reproduced by adding some delay to right after
the ipc is freed in __cifs_put_smb_ses() - which will give
smb2_reconnect_server() worker a chance to run and then accessing
@ses->ipc:
kinit ...
mount.cifs //srv/share /mnt/1 -o sec=krb5,nohandlecache,echo_interval=10
[disconnect srv]
ls /mnt/1 &>/dev/null
sleep 30
kdestroy
[reconnect srv]
sleep 10
umount /mnt/1
...
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed
CIFS: VFS: \\srv Send error in SessSetup = -126
general protection fault, probably for non-canonical address
0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc2 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__list_del_entry_valid_or_report+0x33/0xf0
Code: 4f 08 48 85 d2 74 42 48 85 c9 74 59 48 b8 00 01 00 00 00 00 ad
de 48 39 c2 74 61 48 b8 22 01 00 00 00 00 74 69 <48> 8b 01 48 39 f8 75
7b 48 8b 72 08 48 39 c6 0f 85 88 00 00 00 b8
RSP: 0018:ffffc900001bfd70 EFLAGS: 00010a83
RAX: dead000000000122 RBX: ffff88810da53838 RCX: 6b6b6b6b6b6b6b6b
RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc02f6878 RDI: ffff88810da53800
RBP: ffff88810da53800 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88810c064000
R13: 0000000000000001 R14: ffff88810c064000 R15: ffff8881039cc000
FS: 0000000000000000(0000) GS:ffff888157c00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe3728b1000 CR3: 000000010caa4000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
? die_addr+0x36/0x90
? exc_general_protection+0x1c1/0x3f0
? asm_exc_general_protection+0x26/0x30
? __list_del_entry_valid_or_report+0x33/0xf0
__cifs_put_smb_ses+0x1ae/0x500 [cifs]
smb2_reconnect_server+0x4ed/0x710 [cifs]
process_one_work+0x205/0x6b0
worker_thread+0x191/0x360
? __pfx_worker_thread+0x10/0x10
kthread+0xe2/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Get rid of the unnecessary refcounting on callback structs.
Copy interesting callback info into the lkb struct rather
than maintaining pointers to callback structs from the lkb.
This goes back to the way things were done prior to
commit 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks").
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch fixes the following issue:
node 1 is dir
node 2 is master
node 3 is other
1->2: unlock
2: put final lkb, rsb moved to toss
2->1: unlock_reply
1: queue lkb callback with EUNLOCK
2->1: remove
1: receive_remove ignored (rsb on keep because of queued lkb callback)
1: complete lkb callback, put_lkb, move rsb to toss
3->1: lookup
1->3: lookup_reply master=2
3->2: request
2->3: request_reply EBADR
In summary:
An unexpected lkb reference causes the rsb to remain on the wrong list.
The rsb being on the wrong list causes receive_remove to be ignored.
An ignored receive_remove causes inconsistent dir and master state.
This sequence requires an unusually long delay in delivering the unlock
callback, because the remove message from 2->1 usually happens after
some seconds. So, it's not known exactly how frequently this sequence
occurs in pratice. It's possible that the same end result could also
have another unknown cause.
The solution for this issue is to further separate callback state
from the lkb, so that an lkb reference (and from that, an rsb ref)
are not held while a callback remains queued. Then, within the
unlock_reply, the lkb will be freed and the rsb moved to the toss
list. So, the receive_remove will not be ignored.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch combines the failure and default cases for enqueue and
dequeue a callback to the lkb callback queue that should end in both
cases as it should never happen.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Save lkb callback info when queueing the callback so that the
lkb struct is not needed in the callback workqueue processing.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Remove the ability to dump pending lkb callbacks from debugfs.
The prepares for separating lkb structs from callbacks.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Stop using lkb structs in the callback tracepoints so that lkb
references are not needed. This prepares for separating lkb
structs from callbacks.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
This patch fixes the copy lvb decision for user space lock requests.
Checking dlm_lvb_operations is done earlier, where granted/requested
lock modes are available to use in the matrix.
The decision had been moved to the wrong location, where granted mode
and requested mode where the same, which causes the dlm_lvb_operations
matix to produce the wrong copy decision. For PW or EX requests, the
caller could get invalid lvb data.
Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
Use bio_list_merge_init instead of open coding bio_list_merge and
bio_list_init.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240328084147.2954434-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bitmap_set_bits() does not start with the FS' prefix and may collide
with a new generic helper one day. It operates with the FS-specific
types, so there's no change those two could do the same thing.
Just add the prefix to exclude such possible conflict.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
bitmap_size() is a pretty generic name and one may want to use it for
a generic bitmap API function. At the same time, its logic is
NTFS-specific, as it aligns to the sizeof(u64), not the sizeof(long)
(although it uses ideologically right ALIGN() instead of division).
Add the prefix 'ntfs3_' used for that FS (not just 'ntfs_' to not mix
it with the legacy module) and use generic BITS_TO_U64() while at it.
Suggested-by: Yury Norov <yury.norov@gmail.com> # BITS_TO_U64()
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There's an issue that if special files is created before quota
project is enabled, then it's not possible to link this file. This
works fine for normal files. This happens because xfs_quota skips
special files (no ioctls to set necessary flags). The check for
having the same project ID for source and destination then fails as
source file doesn't have any ID.
mkfs.xfs -f /dev/sda
mount -o prjquota /dev/sda /mnt/test
mkdir /mnt/test/foo
mkfifo /mnt/test/foo/fifo1
xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> Setting up project 9 (path /mnt/test/foo)...
> xfs_quota: skipping special file /mnt/test/foo/fifo1
> Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).
ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link
mkfifo /mnt/test/foo/fifo2
ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link
Fix this by allowing linking of special files to the project quota
if special files doesn't have any ID set (ID = 0).
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
overlapping extent repair was colliding with extent past end of inode
checks - don't update "extent ends at" until we know we have an extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were missing an iter_traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If something is wrong with a logged op, we just want to delete it -
there's nothing to repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds opts.recovery_pass_limit, and redoes -o norecovery to make use
of it; this fixes some issues with -o norecovery so it can be safely
used for data recovery.
Norecovery means "don't do journal replay"; it's an important data
recovery tool when we're getting stuck in journal replay.
When using it this way we need to make sure we don't free journal keys
after startup, so we continue to overlay them: thus it needs to imply
retain_recovery_info, as well as nochanges.
recovery_pass_limit is an explicit option for telling recovery to exit
after a specific recovery pass; this is a much cleaner way of
implementing -o norecovery, as well as being a useful debug feature in
its own right.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Flag that we need to run a recovery pass and run it - persistenly, so if
we crash it'll still get run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This makes bch_sb_field_ext more consistent with the rest of -o
nochanges - we don't want to be varying other codepaths based on -o
nochanges, since it's used for testing in dry run mode; also fixes some
potential null ptr derefs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Finishing logged ops requires the filesystem to be in a reasonably
consistent state - and other fsck passes don't require it to have
completed, so just run it last.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.
So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we haven't yet allocated any btree nodes for a given btree, we
first need to call the regular split path to allocate one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Remove some duplication, and inconsistency between check_fix_ptrs and
the main ptr marking paths
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When dropping keys now outside a now because we're changing the node
min/max, we need to redo the node's accounting as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
when writing file with direct_IO on bcachefs, then performance is
much lower than other fs due to write back throttle in block layer:
wbt_wait+1
__rq_qos_throttle+32
blk_mq_submit_bio+394
submit_bio_noacct_nocheck+649
bch2_submit_wbio_replicas+538
__bch2_write+2539
bch2_direct_write+1663
bch2_write_iter+318
aio_write+355
io_submit_one+1224
__x64_sys_io_submit+169
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+110
add set REQ_SYNC and REQ_IDLE in bio->bi_opf as standard dirct-io
Signed-off-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(),
and replace them with an improved version,
bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full
range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error;
in particular, this adds a check in btree_iter_down() - if we don't find
a key while walking down the btree that's indicative of a topology error
and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remaining BUG_ONS(), because we
already checked the node for topology errors when starting the update,
and the assertions indicate that we _just_ corrupted the btree node -
i.e. the problem can't be that existing on disk corruption, they
indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which
recovery pass corrects them; the open coded "run explicit recovery pass
or fatal error" in bch2_btree_node_check_topology() will in the future
be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't pick a pivot that's going to be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The old code doesn't consider the mem alloced from mempool when call
krealloc on trans->mem. Also in bch2_trans_put, using mempool_free to
free trans->mem by condition "trans->mem_bytes == BTREE_TRANS_MEM_MAX"
is inaccurate when trans->mem was allocated by krealloc function.
Instead, we use used_mempool stuff to record the situation, and realloc
or free the trans->mem in elegant way.
Also, after krealloc failed in __bch2_trans_kmalloc, the old data
should be copied to the new buffer when alloc from mempool_alloc.
Fixes: 31403dca5bb1 ("bcachefs: optimize __bch2_trans_get(), kill DEBUG_TRANSACTIONS")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't normally do extent updates this early in recovery, but some of
the repair paths have to and when we do, we don't want to do anything
that requires the snapshots table.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we assumed that keys were consistent with the snapshots
btree - but that's not correct as fsck may not have been run or may not
be complete.
This adds checks and error handling when using the in-memory snapshots
table (that mirrors the snapshots btree).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
before:
u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0: mode=40700
flags= (15300000)
journal_seq=4
bi_size=0
bi_sectors=0
bi_version=0bi_atime=227064388944
...
after:
u64s 18 type inode_v3 0:1879048192:U32_MAX len 0 ver 0: mode=40700
flags= (15300000)
journal_seq=4
bi_size=0
bi_sectors=0
bi_version=0
bi_atime=227064388944
...
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree write buffer flush has two phases
- in natural key order, which is more efficient but may fail
- then in journal order
The journal order flush was assuming that keys were still correctly
ordered by journal sequence number - but due to coalescing by the
previous phase, we need an additional sort.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Backpointers that point to invalid devices are caught by fsck, not
.key_invalid; so .key_invalid needs to check for them instead of hitting
asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
In cifssmb.c:
Using strncpy with a length argument equal to strlen(src) is generally
dangerous because it can cause string buffers to not be NUL-terminated.
In this case, however, there was extra effort made to ensure the buffer
was NUL-terminated via a manual NUL-byte assignment. In an effort to rid
the kernel of strncpy() use, let's swap over to using strscpy() which
guarantees NUL-termination on the destination buffer.
To handle the case where ea_name is NULL, let's use the ?: operator to
substitute in an empty string, thereby allowing strscpy to still
NUL-terminate the destintation string.
Interesting note: this flex array buffer may go on to also have some
value encoded after the NUL-termination:
| if (ea_value_len)
| memcpy(parm_data->list.name + name_len + 1,
| ea_value, ea_value_len);
Now for smb2ops.c and smb2transport.c:
Both of these cases are simple, strncpy() is used to copy string
literals which have a length less than the destination buffer's size. We
can simply swap in the new 2-argument version of strscpy() introduced in
Commit e6584c3964f2f ("string: Allow 2-argument strscpy()").
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
- Fix unselectable choice entry in MIPS Kconfig, and forbid this
structure
- Remove unused include/asm-generic/export.h
- Fix a NULL pointer dereference bug in modpost
- Enable -Woverride-init warning consistently with W=1
- Drop KCSAN flags from *.mod.c files
* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: Fix typo HEIGTH to HEIGHT
Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
kbuild: make -Woverride-init warnings more consistent
modpost: do not make find_tosym() return NULL
export.h: remove include/asm-generic/export.h
kconfig: do not reparent the menu inside a choice block
MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
|
|
Commit(f55c096f62f1 exfat: do not zero the extended part) changed
the timing of synchronizing bitmap and inode in exfat_cont_expand().
The change caused xfstests generic/013 to fail if 'dirsync' or 'sync'
is enabled. So this commit restores the timing.
Fixes: f55c096f62f1 ("exfat: do not zero the extended part")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|