Age | Commit message (Collapse) | Author |
|
Fix race between rmmod and /proc/XXX's inode instantiation.
The bug is that pde->proc_ops don't belong to /proc, it belongs to a
module, therefore dereferencing it after /proc entry has been registered
is a bug unless use_pde/unuse_pde() pair has been used.
use_pde/unuse_pde can be avoided (2 atomic ops!) because pde->proc_ops
never changes so information necessary for inode instantiation can be
saved _before_ proc_register() in PDE itself and used later, avoiding
pde->proc_ops->... dereference.
rmmod lookup
sys_delete_module
proc_lookup_de
pde_get(de);
proc_get_inode(dir->i_sb, de);
mod->exit()
proc_remove
remove_proc_subtree
proc_entry_rundown(de);
free_module(mod);
if (S_ISREG(inode->i_mode))
if (de->proc_ops->proc_read_iter)
--> As module is already freed, will trigger UAF
BUG: unable to handle page fault for address: fffffbfff80a702b
PGD 817fc4067 P4D 817fc4067 PUD 817fc0067 PMD 102ef4067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 26 UID: 0 PID: 2667 Comm: ls Tainted: G
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:proc_get_inode+0x302/0x6e0
RSP: 0018:ffff88811c837998 EFLAGS: 00010a06
RAX: dffffc0000000000 RBX: ffffffffc0538140 RCX: 0000000000000007
RDX: 1ffffffff80a702b RSI: 0000000000000001 RDI: ffffffffc0538158
RBP: ffff8881299a6000 R08: 0000000067bbe1e5 R09: 1ffff11023906f20
R10: ffffffffb560ca07 R11: ffffffffb2b43a58 R12: ffff888105bb78f0
R13: ffff888100518048 R14: ffff8881299a6004 R15: 0000000000000001
FS: 00007f95b9686840(0000) GS:ffff8883af100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff80a702b CR3: 0000000117dd2000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
proc_lookup_de+0x11f/0x2e0
__lookup_slow+0x188/0x350
walk_component+0x2ab/0x4f0
path_lookupat+0x120/0x660
filename_lookup+0x1ce/0x560
vfs_statx+0xac/0x150
__do_sys_newstat+0x96/0x110
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[adobriyan@gmail.com: don't do 2 atomic ops on the common path]
Link: https://lkml.kernel.org/r/3d25ded0-1739-447e-812b-e34da7990dcf@p183
Fixes: 778f3dd5a13c ("Fix procfs compat_ioctl regression")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a 9p tree was mounted with option 'posixacl', parent directory had a
default ACL set for its subdirectories, e.g.:
setfacl -m default:group:simpsons:rwx parentdir
then creating a subdirectory crashed 9p client, as v9fs_fid_add() call in
function v9fs_vfs_mkdir_dotl() sets the passed 'fid' pointer to NULL
(since dafbe689736) even though the subsequent v9fs_set_create_acl() call
expects a valid non-NULL 'fid' pointer:
[ 37.273191] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[ 37.322338] Call Trace:
[ 37.323043] <TASK>
[ 37.323621] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 37.324448] ? page_fault_oops (arch/x86/mm/fault.c:714)
[ 37.325532] ? search_module_extables (kernel/module/main.c:3733)
[ 37.326742] ? p9_client_walk (net/9p/client.c:1165) 9pnet
[ 37.328006] ? search_bpf_extables (kernel/bpf/core.c:804)
[ 37.329142] ? exc_page_fault (./arch/x86/include/asm/paravirt.h:686 arch/x86/mm/fault.c:1488 arch/x86/mm/fault.c:1538)
[ 37.330196] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:574)
[ 37.331330] ? p9_client_walk (net/9p/client.c:1165) 9pnet
[ 37.332562] ? v9fs_fid_xattr_get (fs/9p/xattr.c:30) 9p
[ 37.333824] v9fs_fid_xattr_set (fs/9p/fid.h:23 fs/9p/xattr.c:121) 9p
[ 37.335077] v9fs_set_acl (fs/9p/acl.c:276) 9p
[ 37.336112] v9fs_set_create_acl (fs/9p/acl.c:307) 9p
[ 37.337326] v9fs_vfs_mkdir_dotl (fs/9p/vfs_inode_dotl.c:411) 9p
[ 37.338590] vfs_mkdir (fs/namei.c:4313)
[ 37.339535] do_mkdirat (fs/namei.c:4336)
[ 37.340465] __x64_sys_mkdir (fs/namei.c:4354)
[ 37.341455] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 37.342447] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Fix this by simply swapping the sequence of these two calls in
v9fs_vfs_mkdir_dotl(), i.e. calling v9fs_set_create_acl() before
v9fs_fid_add().
Fixes: dafbe689736f ("9p fid refcount: cleanup p9_fid_put calls")
Reported-by: syzbot+5b667f9a1fee4ba3775a@syzkaller.appspotmail.com
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <E1tsiI6-002iMG-Kh@kylie.crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.
This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!
- bch2_bkey_pick_read_device() is substantially reworked, and
bch2_dev_io_failures is expanded to record more information about the
type of failure (i.e. number of checksum errors).
It now returns an error code that describes more precisely the reason
for the failure - checksum error, io error, or offline device, instead
of the previous generic "insufficient devices". This is important for
the next patches that add poisoning, as we only want to poison extents
when we've got real checksum errors (or perhaps IO errors?) - not
because a device was offline.
- Add a new option and superblock field for the number of checksum
retries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.
Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.
When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.
But the retry path propagates read flags differently, so needs special
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.
This is going to be used for the data move path, to move but poison
extents with checksum errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Only needed in retry path, no point in wasting stack space.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Small optimization for bch2_bkey_sectors_need_rebalance()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
[12308.606480] watchdog: BUG: soft lockup - CPU#18 stuck for 26s! [umount:48479]
[12308.606485] Modules linked in: bcachefs lz4hc_compress lz4_compress lz4_decompress sunrpc overlay nf_conntrack_netlink xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE bridge stp llc xfrm_user ip6table_nat ip6table_filter ip6_tables iptable_nat xt_addrtype iptable_filter ip_tables x_tables nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample ext4 mbcache jbd2 nls_iso8859_1 nls_cp850 vfat fat binfmt_misc skx_edac_common nfit edac_core libnvdimm cbc encrypted_keys intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common ipmi_ssif x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm drivetemp rapl intel_cstate coretemp mgag200 i2c_algo_bit ixgbe drm_shmem_helper drm_kms_helper mdio_devres xfrm_algo mdio drm ptp intel_uncore mei_me efi_pstore evdev uas pl2303 pps_core libphy usb_storage usbserial lpc_ich mei drm_panel_orientation_quirks acpi_power_meter tiny_power_button ipmi_si mfd_core intel_pch_thermal acpi_tad acpi_ipmi ioatdma
[12308.606541] ipmi_devintf ipmi_msghandler dca wmi button efivarfs polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 sha1_generic xhci_pci xhci_hcd aesni_intel ehci_pci ehci_hcd gf128mul crypto_simd cryptd usbcore hpwdt usb_common
[12308.606557] CPU: 18 UID: 0 PID: 48479 Comm: umount Tainted: G L 6.14.0-rc6-x86_64-00159-ga09496a03e63 #1
[12308.606560] Tainted: [L]=SOFTLOCKUP
[12308.606561] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 07/20/2023
[12308.606563] RIP: 0010:clear_page_erms+0x7/0x10
[12308.606570] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[12308.606572] RSP: 0018:ffff9ed5b622fba0 EFLAGS: 00010246
[12308.606574] RAX: 0000000000000000 RBX: ffff90347fffe6c0 RCX: 00000000000004c0
[12308.606575] RDX: ffffe34ea9bec1c0 RSI: 00000000000405f0 RDI: ffff902eafb07b40
[12308.606576] RBP: ffff9ed5b622fbf0 R08: 0000000000000001 R09: 0000000000000006
[12308.606577] R10: 0000000000040001 R11: 0000000000000000 R12: ffffe34ea9bec000
[12308.606578] R13: 0000000000000000 R14: 0000000000000006 R15: ffffe34ea9bed000
[12308.606580] FS: 00007fe704ecfb68(0000) GS:ffff9053fea00000(0000) knlGS:0000000000000000
[12308.606581] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12308.606582] CR2: 00007f18159068ae CR3: 00000001314d0005 CR4: 00000000007726f0
[12308.606583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12308.606584] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12308.606584] PKRU: 55555554
[12308.606585] Call Trace:
[12308.606587] <IRQ>
[12308.606590] ? show_regs.cold+0x19/0x28
[12308.606595] ? watchdog_timer_fn.cold+0x3d/0x9d
[12308.606598] ? __pfx_watchdog_timer_fn+0x10/0x10
[12308.606602] ? __hrtimer_run_queues+0x12e/0x250
[12308.606607] ? hrtimer_interrupt+0xfd/0x220
[12308.606609] ? __sysvec_apic_timer_interrupt+0x53/0xe0
[12308.606614] ? sysvec_apic_timer_interrupt+0x76/0xa0
[12308.606619] </IRQ>
[12308.606620] <TASK>
[12308.606620] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[12308.606626] ? clear_page_erms+0x7/0x10
[12308.606628] ? __free_pages_ok+0x374/0x640
[12308.606633] free_frozen_pages+0x34/0x570
[12308.606636] __folio_put+0x87/0xe0
[12308.606641] free_large_kmalloc+0x70/0x80
[12308.606645] kfree+0x2f6/0x390
[12308.606648] kvfree+0x2d/0x40
[12308.606653] __btree_node_data_free+0xaf/0xf0 [bcachefs]
[12308.606726] btree_node_data_free+0x6a/0x80 [bcachefs]
[12308.606778] bch2_fs_btree_cache_exit+0x262/0x440 [bcachefs]
[12308.606829] bch2_fs_release+0xe8/0x340 [bcachefs]
[12308.606905] kobject_put+0x60/0xc0
[12308.606908] bch2_fs_free+0xdd/0x120 [bcachefs]
[12308.606981] bch2_kill_sb+0x1e/0x30 [bcachefs]
[12308.607051] deactivate_locked_super+0x32/0xb0
[12308.607055] deactivate_super+0x40/0x50
[12308.607057] cleanup_mnt+0xc3/0x160
[12308.607060] __cleanup_mnt+0x12/0x20
[12308.607062] task_work_run+0x5f/0xa0
[12308.607064] syscall_exit_to_user_mode+0x194/0x1a0
[12308.607066] do_syscall_64+0x67/0x170
[12308.607068] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[12308.607070] RIP: 0033:0x7fe704e66eed
[12308.607073] Code: 08 49 89 ca b8 a5 00 00 00 0f 05 48 89 c7 e8 8a e6 ff ff 48 83 c4
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
These are commonly needed when debugging, and saves from having to ask
users to dig.
Also, rebalance_status now includes pending rebalance work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
There's no need to record "." dirents in the directory data (while
they could be used for sanity checks, they aren't very useful.)
Omitting "." dirents also improves directory data deduplication.
Use a per-inode (instead of per-sb) flag to indicate if the "." dirent
is omitted or not, ensuring compatibility with incremental builds. It
also reuses EROFS_I_NLINK_1_BIT, as it has very limited use cases for
directories with `nlink = 1`.
Emit the "." entry as the last virtual dirent in the directory because
it is _much_ less frequently used than the ".." dirent. It also keeps
`f_pos` meaningful, as it strictly follows the directory data when it's
less than i_size.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-6-hsiangkao@linux.alibaba.com
|
|
It adapts the on-disk changes from the previous commit. It also
supports EROFS_NULL_ADDR (all 1's) for EROFS_INODE_FLAT_PLAIN inodes
to indicate 0-filled inodes, as it's common for composefs use cases.
As a result, EROFS_INODE_CHUNK_BASED is no longer needed.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-5-hsiangkao@linux.alibaba.com
|
|
The current 32-bit block addressing limits EROFS to a 16TiB maximum
volume size with 4KiB blocks. However, several new use cases now
require larger capacity support:
- Massive datasets for model training in order to boost random
sampling performance for each epoch;
- Object storage clients using EROFS direct passthrough.
This extends core on-disk structures to support 48-bit block addressing,
such as inodes, device slots, and inode chunks.
Additionally:
- Expand superblock root NID to 8-byte `rootnid_8b` to enable full
out-of-place update incremental builds;
- Introduce `epoch` field in the superblock as well as add `mtime`
field to 32-byte compact inodes for basic timestamp support.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-4-hsiangkao@linux.alibaba.com
|
|
- Switch to on-stack `copied` since it's just 64 bytes;
- Get rid of `nblks` and derive `i_blocks` directly;
- Use `inode_set_mtime()` instead of `inode_set_ctime()`
to follow the ondisk naming;
- Rearrange the code.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-3-hsiangkao@linux.alibaba.com
|
|
It's simple enough to be folded into erofs_map_blocks().
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250310095459.2620647-2-hsiangkao@linux.alibaba.com
|
|
It seems that all compressors need those two values, so just move
them into the common structure.
`struct z_erofs_lz4_decompress_ctx` can be dropped too.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250305124007.1810731-1-hsiangkao@linux.alibaba.com
|
|
Simplify the logic in z_erofs_fill_inode_lazy() by combining the
handling of ztailpacking and fragments, as they are mutually exclusive.
Note that `h->h_clusterbits >> Z_EROFS_FRAGMENT_INODE_BIT` is handled
above, so no need to duplicate the check.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250224123747.1387122-2-hsiangkao@linux.alibaba.com
|
|
Use `z_idata_size != 0` to indicate that ztailpacking is enabled.
`Z_EROFS_ADVISE_INLINE_PCLUSTER` cannot be ignored, as `h_idata_size`
could be non-zero prior to erofs-utils 1.6 [1].
Additionally, merge `z_idataoff` and `z_fragmentoff` since these two
features are mutually exclusive for a given inode.
[1] https://git.kernel.org/xiang/erofs-utils/c/547bea3cb71a
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250225114038.3259726-1-hsiangkao@linux.alibaba.com
|
|
Actually, volume name doesn't need to include the NIL terminator if
the string length matches the on-disk field size as mentioned in [1].
I tend to relax it together with the upcoming 48-bit block addressing
(or stable kernels which backport this fix) so that we could have a
chance to record a 16-byte volume name like ext4.
Since in-memory `volume_name` has no user, just get rid of the unneeded
check for now. `sbi->uuid` is useless and avoid it too.
[1] https://lore.kernel.org/r/96efe46b-dcce-4490-bba1-a0b00932d1cc@linux.alibaba.com
Fixes: a64d9493f587 ("staging: erofs: refuse to mount images with malformed volume name")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250225033934.2542635-1-hsiangkao@linux.alibaba.com
|
|
Since EROFS_KMAP_ATOMIC is no longer valid, get rid of erofs_kmap_type too.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250217093141.2659-1-liubo03@inspur.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
There's no need to enumerate each type. No logic changes.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250210032923.3382136-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify reverts from Jan Kara:
"Syzbot has found out that fsnotify HSM events generated on page fault
can be generated while we already hold freeze protection for the
filesystem (when you do buffered write from a buffer which is mmapped
file on the same filesystem) which violates expectations for HSM
events and could lead to deadlocks of HSM clients with filesystem
freezing.
Since it's quite late in the cycle we've decided to revert changes
implementing HSM events on page fault for now and instead just
generate one event for the whole range on mmap(2) so that HSM client
can fetch the data at that moment"
* tag 'fsnotify_for_v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
Revert "fanotify: disable readahead if we have pre-content watches"
Revert "mm: don't allow huge faults for files with pre content watches"
Revert "fsnotify: generate pre-content permission event on page fault"
Revert "xfs: add pre-content fsnotify hook for DAX faults"
Revert "ext4: add pre-content fsnotify hook for DAX faults"
fsnotify: add pre-content hooks on mmap()
|
|
Pull smb server fixes from Steve French:
- Two fixes for oplock break/lease races
* tag 'v6.14-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: prevent connection release during oplock break notification
ksmbd: fix use-after-free in ksmbd_free_work_struct
|
|
Single caller, so inline it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Found with CC=clang W=1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Use max() to simplify gen_after() and improve its readability.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The extra byte is not used - remove it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
And the stripes heap gets deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a simple tracepoint for stripe creation, we'll want to expand this
later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Convert to the new persistent stripe LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Convert to the new persistent stripe LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We're improving our handling of write errors - we shouldn't write
degraded data just because a write failed once, we should retry it (on
other devices, if possible).
But for this to work, we need to kick devices out when they're only
returning errors - otherwise those retries will loop infinitely.
This adds a configurable timeout - if writes are failing for too long,
we'll set that device read-only.
In the future we should also implement more tracking and another knob
for an "allowed error rate", so that we can kick out drives that are
acting "unhealthy".
Another thing we'll want is a mechanism (likely in userspace) for
bringing a device back in after a transient error - perhaps a cable was
jiggled, or there was a controller reset.
After transient errors we also need a mechanism to walk (from the
journal) recent btree updates that weren't flushed to that device and
treat them as "degraded", since unflushed data may well not have been
written. Out of scope for this patch, but becoming relevant.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we woudn't try to read at all from a failed device - that
doesn't make much sense, the device may be unhealthy (perhaps taking
longer than it should to service reads), but if it's our only option we
should still try to read from it.
Now, bch2_bkey_pick_read_device() will pick failed devices only if there
are no non-failed replicas to read from.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The next patch implementing freezing will change bch2_dev_get_ioref() to
sleep if a device is currently frozen.
Add an annotation and fix the journal code accordingly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This was completely fubar; it's now simplified a bit as well.
Note that for_each_online_member() takes and releases io_refs as it
iterates, so we need to release that if we break.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can't use the standard fs_holder_ops because they're meant for single
device filesystems - fs_bdev_mark_dead() in particular - and they assume
that the blk_holder is the super_block, which also doesn't work for a
multi device filesystem.
These generally follow the standard fs_holder_ops; the
locking/refcounting is a bit simplified because c->ro_ref suffices, and
bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This is necessary for the new blk_holder_ops, which want the vfs
super_block available for synchronization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Note that we open block devices before we allocate bch_fs, but once
attached to a filesystem they will be closed before the bch_fs is torn
down - so stashing a pointer without a refcount looks incorrect but it's
not.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
More prep work for automatically kicking devices out after too many IO
errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We need to start accounting successes for every IO, not just failures,
so introduce a unified hook for io completion accounting and convert
io_read.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were using our device pointer after we'd released our ref to it.
Unlikely to be a race that's practical to hit, since actually removing a
member device is a whole process besides just taking it offline, but -
needs to be fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If a device is ro or failed, we might not have anywhere to move a
replica.
Check for this early, before doing the read and attempting to write.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This implements a new extent field bitflags that apply to the whole
extent. There's been a couple things we've wanted this for in the past,
but the immediate need is extent poisoning, to solve a rebalance issue.
Unknown extent fields can't be parsed (we won't known their size, so we
can't advance to the next field), so this is an incompat feature, and
using it prevents the filesystem from being mounted by old versions.
This also adds the BCH_EXTENT_poisoned flag; this indicates that the
data is known to be bad (i.e. there was a checksum error, and we had to
write a new checksum) and reads will return errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For future usage, we'll want a dedicated error code for better
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|