summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-12-12xfs: don't move nondir/nonreg temporary repair files to the metadir namespaceDarrick J. Wong
Only directories or regular files are allowed in the metadata directory tree. Don't move the repair tempfile to the metadir namespace if this is not true; this will cause the inode verifiers to trip. xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip from the regular directory tree to the metadata directory tree if sc->ip is part of the metadata directory tree. However, the scrub setup functions grab sc->ip and create sc->tempip before we actually get around to checking if the file mode is the right type for the scrubber. IOWs, you can invoke the symlink scrubber with the file handle of a subdirectory in the metadir. xrep_setup_symlink will create a temporary symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to set the METADATA flag on the temp symlink, which trips the inode verifier in the inode item precommit, which shuts down the filesystem when expensive checks are turned on. If they're /not/ turned on, then xchk_symlink will return ENOENT when it sees that it's been passed a symlink, but the invalid inode could still get flushed to disk. We don't want that. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 9dc31acb01a1c7 ("xfs: move repair temporary files to the metadata directory tree") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix sb_spino_align checks for large fsblock sizesDarrick J. Wong
For a sparse inodes filesystem, mkfs.xfs computes the values of sb_spino_align and sb_inoalignmt with the following code: int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE; if (cfg->sb_feat.crcs_enabled) cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE; sbp->sb_spino_align = cluster_size >> cfg->blocklog; sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK * cfg->inodesize >> cfg->blocklog; On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results in cluster_size = 8192 * (512 / 256) = 16384. As a result, sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately, this trips the new sb_spino_align check that was just added to xfs_validate_sb_common, and the mkfs fails: # mkfs.xfs -f -b size=64k, /dev/sda meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=65536 blocks=324544, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0 log =internal log bsize=65536 blocks=5006, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=65536 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents Discarding blocks...Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! Sparse inode alignment (0) is invalid. Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200 libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1 mkfs.xfs: writing AG headers failed, err=22 Prior to commit 59e43f5479cce1 this all worked fine, even if "sparse" inodes are somewhat meaningless when everything fits in a single fsblock. Adjust the checks to handle existing filesystems. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 59e43f5479cce1 ("xfs: sb_spino_align is not verified") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: convert quotacheck to attach dquot buffersDarrick J. Wong
Now that we've converted the dquot logging machinery to attach the dquot buffer to the li_buf pointer so that the AIL dqflush doesn't have to allocate or read buffers in a reclaim path, do the same for the quotacheck code so that the reclaim shrinker dqflush call doesn't have to do that either. Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: attach dquot buffer to dquot log item bufferDarrick J. Wong
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel when running fstests with quotas enabled: WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18 CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c <snip> Call trace: __alloc_pages_noprof+0xc9c/0xf18 alloc_pages_mpol_noprof+0x94/0x240 alloc_pages_noprof+0x68/0xf8 new_slab+0x3e0/0x568 ___slab_alloc+0x5a0/0xb88 __slab_alloc.constprop.0+0x7c/0xf8 __kmalloc_noprof+0x404/0x4d0 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88] xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88] xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88] kthread+0x110/0x128 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- This corresponds to the line: WARN_ON_ONCE(current->flags & PF_MEMALLOC); within the NOFAIL checks. What's happening here is that the XFS AIL is trying to write a disk quota update back into the filesystem, but for that it needs to read the ondisk buffer for the dquot. The buffer is not in memory anymore, probably because it was evicted. Regardless, the buffer cache tries to allocate a new buffer, but those allocations are NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim) since commit 43ff2122e6492b ("xfs: on-stack delayed write buffer lists") presumably because reclaim can push on XFS to push on the AIL. An easy way to fix this probably would have been to drop the NOFAIL flag from the xfs_buf allocation and open code a retry loop, but then there's still the problem that for bs>ps filesystems, the buffer itself could require up to 64k worth of pages. Inode items had similar behavior (multi-page cluster buffers that we don't want to allocate in the AIL) which we solved by making transaction precommit attach the inode cluster buffers to the dirty log item. Let's solve the dquot problem in the same way. So: Make a real precommit handler to read the dquot buffer and attach it to the log item; pass it to dqflush in the push method; and have the iodone function detach the buffer once we've flushed everything. Add a state flag to the log item to track when a thread has entered the precommit -> push mechanism to skip the detaching if it turns out that the dquot is very busy, as we don't hold the dquot lock between log item commit and AIL push). Reading and attaching the dquot buffer in the precommit hook is inspired by the work done for inode cluster buffers some time ago. Cc: <stable@vger.kernel.org> # v6.12 Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: clean up log item accesses in xfs_qm_dqflush{,_done}Darrick J. Wong
Clean up these functions a little bit before we move on to the real modifications, and make the variable naming consistent for dquot log items. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: separate dquot buffer reads from xfs_dqflushDarrick J. Wong
The first step towards holding the dquot buffer in the li_buf instead of reading it in the AIL is to separate the part that reads the buffer from the actual flush code. There should be no functional changes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't lose solo dquot update transactionsDarrick J. Wong
Quota counter updates are tracked via incore objects which hang off the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_dquot_deltas just prior to commiting the log items to the CIL. However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure quota counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb. Cc: <stable@vger.kernel.org> # v2.6.35 Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't lose solo superblock counter update transactionsDarrick J. Wong
Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to commiting the log items to the CIL. However, updating the per-transaction counter deltas do not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction. This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: avoid nested calls to __xfs_trans_commitDarrick J. Wong
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls __xfs_trans_commit again on the same transaction. In other words, there's a nested function call (albeit with slightly different arguments) that has caused minor amounts of confusion in the past. There's no reason to keep this around, since there's only one place where we actually want the xfs_defer_finish_noroll, and that is in the top level xfs_trans_commit call. This also reduces stack usage a little bit. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: only run precommits once per transaction objectDarrick J. Wong
Committing a transaction tx0 with a defer ops chain of (A, B, C) creates a chain of transactions that looks like this: tx0 -> txA -> txB -> txC Prior to commit cb042117488dbf, __xfs_trans_commit would run precommits on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C]. Unfortunately, after the finish_noroll loop we forgot to run precommits on txC. That was fixed by adding the second precommit call. Unfortunately, none of us remembered that xfs_defer_finish_noroll calls __xfs_trans_commit a second time to commit tx0 before finishing work A in txA and committing that. In other words, we run precommits twice on tx0: xfs_trans_commit(tx0) __xfs_trans_commit(tx0, false) xfs_trans_run_precommits(tx0) xfs_defer_finish_noroll(tx0) xfs_trans_roll(tx0) txA = xfs_trans_dup(tx0) __xfs_trans_commit(tx0, true) xfs_trans_run_precommits(tx0) This currently isn't an issue because the inode item precommit is idempotent; the iunlink item precommit deletes itself so it can't be called again; and the buffer/dquot item precommits only check the incore objects for corruption. However, it doesn't make sense to run precommits twice. Fix this situation by only running precommits after finish_noroll. Cc: <stable@vger.kernel.org> # v6.4 Fixes: cb042117488dbf ("xfs: defered work could create precommits") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: unlock inodes when erroring out of xfs_trans_alloc_dirDarrick J. Wong
Debugging a filesystem patch with generic/475 caused the system to hang after observing the following sequences in dmesg: XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5 XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5 XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5 XFS (dm-0): log I/O error -5 XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem. XFS (dm-0): Please unmount the filesystem and rectify the problem(s) XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs] XFS (dm-0): Corruption detected. Unmount and run xfs_repair XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7 The system is stuck in unmount trying to lock a couple of inodes so that they can be purged. The dquot corruption notice above is a clue to what happened -- a link() call tried to set up a transaction to link a child into a directory. Quota reservation for the transaction failed after IO errors shut down the filesystem, but then we forgot to unlock the inodes on our way out. Fix that. Cc: <stable@vger.kernel.org> # v6.10 Fixes: bd5562111d5839 ("xfs: Hold inode locks in xfs_trans_alloc_dir") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix scrub tracepoints when inode-rooted btrees are involvedDarrick J. Wong
Fix a minor mistakes in the scrub tracepoints that can manifest when inode-rooted btrees are enabled. The existing code worked fine for bmap btrees, but we should tighten the code up to be less sloppy. Cc: <stable@vger.kernel.org> # v5.7 Fixes: 92219c292af8dd ("xfs: convert btree cursor inode-private member names") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: update btree keys correctly when _insrec splits an inode root blockDarrick J. Wong
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec would erroneously try to update the parent's key for a block that had been split if we decided to insert the new record into the new block. The solution was to detect this situation and update the in-core key value that we pass up to the caller so that the caller will (eventually) add the new block to the parent level of the tree with the correct key. However, I missed a subtlety about the way inode-rooted btrees work. If the full block was a maximally sized inode root block, we'll solve that fullness by moving the root block's records to a new block, resizing the root block, and updating the root to point to the new block. We don't pass a pointer to the new block to the caller because that work has already been done. The new record will /always/ land in the new block, so in this case we need to use xfs_btree_update_keys to update the keys. This bug can theoretically manifest itself in the very rare case that we split a bmbt root block and the new record lands in the very first slot of the new block, though I've never managed to trigger it in practice. However, it is very easy to reproduce by running generic/522 with the realtime rmapbt patchset if rtinherit=1. Cc: <stable@vger.kernel.org> # v4.8 Fixes: 2c813ad66a7218 ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix error bailout in xfs_rtginode_createDarrick J. Wong
smatch reported that we screwed up the error cleanup in this function. Fix it. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: ae897e0bed0f54 ("xfs: support creating per-RTG files in growfs") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix null bno_hint handling in xfs_rtallocate_rtgDarrick J. Wong
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka NULLFSBLOCK). If the allocation request is for a file range that's adjacent to an existing mapping, it will then change bno_hint to the blkno hint in the bmalloca structure. In other words, bno_hint is either a rt block number, or it's all 1s. Unfortunately, commit ec12f97f1b8a8f didn't take the NULLRTBLOCK state into account, which means that it tries to translate that into a realtime extent number. We then end up with an obnoxiously high rtx number and pointlessly feed that to the near allocator. This often fails and falls back to the by-size allocator. Seeing as we had no locality hint anyway, this is a waste of time. Fix the code to detect a lack of bno_hint correctly. This was detected by running xfs/009 with metadir enabled and a 28k rt extent size. Cc: <stable@vger.kernel.org> # v6.12 Fixes: ec12f97f1b8a8f ("xfs: make the rtalloc start hint a xfs_rtblock_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: mark metadir repair tempfiles with IRECOVERYDarrick J. Wong
Once in a long while, xfs/566 and xfs/801 report directory corruption in one of the metadata subdirectories while it's forcibly rebuilding all filesystem metadata. I observed the following sequence of events: 1. Initiate a repair of the parent pointers for the /quota/user file. This is the secret file containing user quota data. 2. The pptr repair thread creates a temporary file and begins staging parent pointers in the ondisk metadata in preparation for an exchange-range to commit the new pptr data. 3. At the same time, initiate a repair of the /quota directory itself. 4. The dir repair thread finds the temporary file from (2), scans it for parent pointers, and stages a dirent in its own temporary dir in preparation to commit the fixed directory. 5. The parent pointer repair completes and frees the temporary file. 6. The dir repair commits the new directory and scans it again. It finds the dirent that points to the old temporary file in (2) and marks the directory corrupt. Oops! Repair code must never scan the temporary files that other repair functions create to stage new metadata. They're not supposed to do that, but the predicate function xrep_is_tempfile is incorrect because it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary file, but xrep_tempfile_adjust_directory_tree creates exactly that. Fix this by setting the IRECOVERY flag on temporary metadata directory inodes and using that to correct the predicate. Repair code is supposed to erase all the data in temporary files before releasing them, so it's ok if a thread scans the temporary file after we drop IRECOVERY. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: bb6cdd5529ff67 ("xfs: hide metadata inodes from everyone because they are special") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlinkDarrick J. Wong
If we need to reset a symlink target to the "durr it's busted" string, then we clear the zapped flag as well. However, this should be using the provided helper so that we don't set the zapped state on an otherwise ok symlink. Cc: <stable@vger.kernel.org> # v6.10 Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: separate healthy clearing mask during repairDarrick J. Wong
In commit d9041681dd2f53 we introduced some XFS_SICK_*ZAPPED flags so that the inode record repair code could clean up a damaged inode record enough to iget the inode but still be able to remember that the higher level repair code needs to be called. As part of that, we introduced a xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED state to be removed if that higher level metadata actually checks out. This was done by setting additional bits in sick_mask hoping that xchk_update_health will clear all those bits after a healthy scrub. Unfortunately, that's not quite what sick_mask means -- bits in that mask are indeed cleared if the metadata is healthy, but they're set if the metadata is NOT healthy. fsck is only intended to set the ZAPPED bits explicitly. If something else sets the CORRUPT/XCORRUPT state after the xchk_mark_healthy_if_clean call, we end up marking the metadata zapped. This can happen if the following sequence happens: 1. Scrub runs, discovers that the metadata is fine but could be optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag. That causes the ZAPPED flag to be set in sick_mask because the metadata is not CORRUPT or XCORRUPT. 2. Repair runs to optimize the metadata. 3. Some other metadata used for cross-referencing in (1) becomes corrupt. 4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due to the events in (3). 5. Now the xchk_health_update sets the ZAPPED flag on the metadata we just repaired. This is not the correct state. Fix this by moving the "if healthy" mask to a separate field, and only ever using it to clear the sick state. Cc: <stable@vger.kernel.org> # v6.8 Fixes: d9041681dd2f53 ("xfs: set inode sick state flags when we zap either ondisk fork") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: don't drop errno values when we fail to ficlone the entire rangeDarrick J. Wong
Way back when we first implemented FICLONE for XFS, life was simple -- either the the entire remapping completed, or something happened and we had to return an errno explaining what happened. Neither of those ioctls support returning partial results, so it's all or nothing. Then things got complicated when copy_file_range came along, because it actually can return the number of bytes copied, so commit 3f68c1f562f1e4 tried to make it so that we could return a partial result if the REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can indicate that the kernel performed a partial deduplication. Unfortunately, the logic is wrong if an error stops the remapping and CAN_SHORTEN is not set. Because those callers cannot return partial results, it is an error for ->remap_file_range to return a positive quantity that is less than the @len passed in. Implementations really should be returning a negative errno in this case, because that's what btrfs (which introduced FICLONE{,RANGE}) did. Therefore, ->remap_range implementations cannot silently drop an errno that they might have when the number of bytes remapped is less than the number of bytes requested and CAN_SHORTEN is not set. Found by running generic/562 on a 64k fsblock filesystem and wondering why it reported corrupt files. Cc: <stable@vger.kernel.org> # v4.20 Fixes: 3fc9f5e409319e ("xfs: remove xfs_reflink_remap_range") Really-Fixes: 3f68c1f562f1e4 ("xfs: support returning partial reflink results") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: return a 64-bit block count from xfs_btree_count_blocksDarrick J. Wong
With the nrext64 feature enabled, it's possible for a data fork to have 2^48 extent mappings. Even with a 64k fsblock size, that maps out to a bmbt containing more than 2^32 blocks. Therefore, this predicate must return a u64 count to avoid an integer wraparound that will cause scrub to do the wrong thing. It's unlikely that any such filesystem currently exists, because the incore bmbt would consume more than 64GB of kernel memory on its own, and so far nobody except me has driven a filesystem that far, judging from the lack of complaints. Cc: <stable@vger.kernel.org> # v5.19 Fixes: df9ad5cc7a5240 ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: keep quota directory inode loadedDarrick J. Wong
In the same vein as the previous patch, there's no point in the metapath scrub setup function doing a lookup on the quota metadir just so it can validate that lookups work correctly. Instead, retain the quota directory inode in memory for the lifetime of the mount so that we can check this meaningfully. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: metapath scrubber should use the already loaded inodesDarrick J. Wong
Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of the quota inodes, just grab them from the quotainfo structure. The whole point of this scrubber is to make sure that the dirents exist, so it's completely silly to do lookups. Cc: <stable@vger.kernel.org> # v6.13-rc1 Fixes: 128a055291ebbc ("xfs: scrub quota file metapaths") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12xfs: fix off-by-one error in fsmap's end_daddr usageDarrick J. Wong
In commit ca6448aed4f10a, we created an "end_daddr" variable to fix fsmap reporting when the end of the range requested falls in the middle of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't notice that the the code sets end_daddr to the last sector of the device but then uses that quantity to compute the length of the synthesized mapping. Zizhi Wo later observed that when end_daddr isn't set, we still don't report the last fsblock on a device because in that case (aka when info->last is true), the info->high mapping that we pass to xfs_getfsmap_group_helper has a startblock that points to the last fsblock. This is also wrong because the code uses startblock to compute the length of the synthesized mapping. Fix the second problem by setting end_daddr unconditionally, and fix the first problem by setting start_daddr to one past the end of the range to query. Cc: <stable@vger.kernel.org> # v6.11 Fixes: ca6448aed4f10a ("xfs: Fix missing interval for missing_owner in xfs fsmap") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reported-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-12Merge tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - fix ctime setting in setattr - fix reference count on user session to avoid potential race with session expire - fix query dir issue * tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: set ATTR_CTIME flags when setting mtime ksmbd: fix racy issue from session lookup and expire ksmbd: retry iterate_dir in smb2_query_dir
2024-12-13erofs: add erofs_sb_free() helperGao Xiang
Unify the common parts of erofs_fc_free() and erofs_kill_sb() as erofs_sb_free(). Thus, fput() in erofs_fc_get_tree() is no longer needed, too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212133504.2047178-1-hsiangkao@linux.alibaba.com
2024-12-13erofs: fix PSI memstall accountingGao Xiang
Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is incorrect in the 6.11.9 kernel. The root cause appears to be that, since the problematic commit, bio can be NULL, causing psi_memstall_leave() to be skipped in z_erofs_submit_queue(). Reported-by: Max Kellermann <max.kellermann@ionos.com> Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com Fixes: 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241127085236.3538334-1-hsiangkao@linux.alibaba.com
2024-12-13erofs: fix rare pcluster memory leak after unmountingGao Xiang
There may still exist some pcluster with valid reference counts during unmounting. Instead of introducing another synchronization primitive, just try again as unmounting is relatively rare. This approach is similar to z_erofs_cache_invalidate_folio(). It was also reported by syzbot as a UAF due to commit f5ad9f9a603f ("erofs: free pclusters if no cached folio is attached"): BUG: KASAN: slab-use-after-free in do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123 .. queued_spin_trylock include/asm-generic/qspinlock.h:92 [inline] do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123 __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline] _raw_spin_trylock+0x20/0x80 kernel/locking/spinlock.c:138 spin_trylock include/linux/spinlock.h:361 [inline] z_erofs_put_pcluster fs/erofs/zdata.c:959 [inline] z_erofs_decompress_pcluster fs/erofs/zdata.c:1403 [inline] z_erofs_decompress_queue+0x3798/0x3ef0 fs/erofs/zdata.c:1425 z_erofs_decompressqueue_work+0x99/0xe0 fs/erofs/zdata.c:1437 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa68/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f2/0x390 kernel/kthread.c:389 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> However, it seems a long outstanding memory leak. Fix it now. Fixes: f5ad9f9a603f ("erofs: free pclusters if no cached folio is attached") Reported-by: syzbot+7ff87b095e7ca0c5ac39@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/674c1235.050a0220.ad585.0032.GAE@google.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241203072821.1885740-1-hsiangkao@linux.alibaba.com
2024-12-12fuse: fix direct io folio offset and length calculationJoanne Koong
For the direct io case, the pages from userspace may be part of a huge folio, even if all folios in the page cache for fuse are small. Fix the logic for calculating the offset and length of the folio for the direct io case, which currently incorrectly assumes that all folios encountered are one page size. Fixes: 3b97c3652d91 ("fuse: convert direct io to use folios") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-12-11Revert "unicode: Don't special case ignorable code points"Linus Torvalds
This reverts commit 5c26d2f1d3f5e4be3e196526bead29ecb139cf91. It turns out that we can't do this, because while the old behavior of ignoring ignorable code points was most definitely wrong, we have case-folding filesystems with on-disk hash values with that wrong behavior. So now you can't look up those names, because they hash to something different. Of course, it's also entirely possible that in the meantime people have created *new* files with the new ("more correct") case folding logic, and reverting will just make other things break. The correct solution is to not do case folding in filesystems, but sadly, people seem to never really understand that. People still see it as a feature, not a bug. Reported-by: Qi Han <hanqi@vivo.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219586 Cc: Gabriel Krisman Bertazi <krisman@suse.de> Requested-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-11fs: don't block write during exec on pre-content watched filesAmir Goldstein
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed the legacy behavior of getting ETXTBSY on attempt to open and executable file for write while it is being executed. This commit was reverted because an application that depends on this legacy behavior was broken by the change. We need to allow HSM writing into executable files while executed to fill their content on-the-fly. To that end, disable the ETXTBSY legacy behavior for files that are watched by pre-content events. This change is not expected to cause regressions with existing systems which do not have any pre-content event listeners. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241128142532.465176-1-amir73il@gmail.com
2024-12-11fs: enable pre-content events on supported file systemsJosef Bacik
Now that all the code has been added for pre-content events, and the various file systems that need the page fault hooks for fsnotify have been updated, add SB_I_ALLOW_HSM to the supported file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
2024-12-11ext4: add pre-content fsnotify hook for DAX faultsJan Kara
ext4 has its own handling for DAX faults. Add the pre-content fsnotify hook for this case. Signed-off-by: Jan Kara <jack@suse.cz>
2024-12-11btrfs: disable defrag on pre-content watched filesJosef Bacik
We queue up inodes to be defrag'ed asynchronously, which means we do not have their original file for readahead. This means that the code to skip readahead on pre-content watched files will not run, and we could potentially read in empty pages. Handle this corner case by disabling defrag on files that are currently being watched for pre-content events. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/4cc5bcea13db7904174353d08e85157356282a59.1731684329.git.josef@toxicpanda.com
2024-12-11xfs: add pre-content fsnotify hook for DAX faultsJosef Bacik
xfs has it's own handling for DAX faults, so we need to add the pre-content fsnotify hook for this case. Other faults go through filemap_fault so they're handled properly there. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9eccdf59a65b72f0a1a5e2f2b9bff8eda2d4f2d9.1731684329.git.josef@toxicpanda.com
2024-12-11iomap: fix zero padding data issue in concurrent append writesLong Li
During concurrent append writes to XFS filesystem, zero padding data may appear in the file after power failure. This happens due to imprecise disk size updates when handling write completion. Consider this scenario with concurrent append writes same file: Thread 1: Thread 2: ------------ ----------- write [A, A+B] update inode size to A+B submit I/O [A, A+BS] write [A+B, A+B+C] update inode size to A+B+C <I/O completes, updates disk size to min(A+B+C, A+BS)> <power failure> After reboot: 1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C] |< Block Size (BS) >| |DDDDDDDDDDDDDDDD0000000000000000| ^ ^ ^ A A+B A+B+C (EOF) 2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS] |< Block Size (BS) >|< Block Size (BS) >| |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000| ^ ^ ^ ^ A A+B A+BS A+B+C (EOF) D = Valid Data 0 = Zero Padding The issue stems from disk size being set to min(io_offset + io_size, inode->i_size) at I/O completion. Since io_offset+io_size is block size granularity, it may exceed the actual valid file data size. In the case of concurrent append writes, inode->i_size may be larger than the actual range of valid file data written to disk, leading to inaccurate disk size updates. This patch modifies the meaning of io_size to represent the size of valid data within EOF in an ioend. If the ioend spans beyond i_size, io_size will be trimmed to provide the file with more accurate size information. This is particularly useful for on-disk size updates at completion time. After this change, ioends that span i_size will not grow or merge with other ioends in concurrent scenarios. However, these cases that need growth/merging rarely occur and it seems no noticeable performance impact. Although rounding up io_size could enable ioend growth/merging in these scenarios, we decided to keep the code simple after discussion [1]. Another benefit is that it makes the xfs_ioend_is_append() check more accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio() in certain scenarios, such as repeated writes at the file tail without extending the file size. Link [1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") # goes further back than this Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20241209114241.3725722-3-leo.lilong@huawei.com Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-11iomap: pass byte granular end position to iomap_add_to_ioendLong Li
This is a preparatory patch for fixing zero padding issues in concurrent append write scenarios. In the following patches, we need to obtain byte-granular writeback end position for io_size trimming after EOF handling. Due to concurrent writeback and truncate operations, inode size may shrink. Resampling inode size would force writeback code to handle the newly appeared post-EOF blocks, which is undesirable. As Dave explained in [1]: "Really, the issue is that writeback mappings have to be able to handle the range being mapped suddenly appear to be beyond EOF. This behaviour is a longstanding writeback constraint, and is what iomap_writepage_handle_eof() is attempting to handle. We handle this by only sampling i_size_read() whilst we have the folio locked and can determine the action we should take with that folio (i.e. nothing, partial zeroing, or skip altogether). Once we've made the decision that the folio is within EOF and taken action on it (i.e. moved the folio to writeback state), we cannot then resample the inode size because a truncate may have started and changed the inode size." To avoid resampling inode size after EOF handling, we convert end_pos to byte-granular writeback position and return it from EOF handling function. Since iomap_set_range_dirty() can handle unaligned lengths, this conversion has no impact on it. However, iomap_find_dirty_range() requires aligned start and end range to find dirty blocks within the given range, so the end position needs to be rounded up when passed to it. LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/ Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://lore.kernel.org/r/20241209114241.3725722-2-leo.lilong@huawei.com Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-10smb: client: destroy cfid_put_wq on module exitEnzo Matsumiya
Fix potential problem in rmmod Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10cifs: Use str_yes_no() helper in cifs_ses_add_channel()Thorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10cifs: Fix rmdir failure due to ongoing I/O on deleted fileDavid Howells
The cifs_io_request struct (a wrapper around netfs_io_request) holds open the file on the server, even beyond the local Linux file being closed. This can cause problems with Windows-based filesystems as the file's name still exists after deletion until the file is closed, preventing the parent directory from being removed and causing spurious test failures in xfstests due to inability to remove a directory. The symptom looks something like this in the test output: rm: cannot remove '/mnt/scratch/test/p0/d3': Directory not empty rm: cannot remove '/mnt/scratch/test/p1/dc/dae': Directory not empty Fix this by waiting in unlink and rename for any outstanding I/O requests to be completed on the target file before removing that file. Note that this doesn't prevent Linux from trying to start new requests after deletion if it still has the file open locally - something that's perfectly acceptable on a UNIX system. Note also that whilst I've marked this as fixing the commit to make cifs use netfslib, I don't know that it won't occur before that. Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10Merge tag 'for-6.13-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes. Apart from the one liners and updated bio splitting error handling there's a fix for subvolume mount with different flags. This was known and fixed for some time but I've delayed it to give it more testing. - fix unbalanced locking when swapfile activation fails when the subvolume gets deleted in the meantime - add btrfs error handling after bio_split() calls that got error handling recently - during unmount, flush delalloc workers at the right time before the cleaner thread is shut down - fix regression in buffered write folio conversion, explicitly wait for writeback as FGP_STABLE flag is currently a no-op on btrfs - handle race in subvolume mount with different flags, the conversion to the new mount API did not handle the case where multiple subvolumes get mounted in parallel, which is a distro use case" * tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount btrfs: handle bio_split() errors btrfs: properly wait for writeback before buffered write btrfs: fix missing snapshot drew unlock when root is dead during swap activation btrfs: fix mount failure due to remount races
2024-12-10ksmbd: set ATTR_CTIME flags when setting mtimeNamjae Jeon
David reported that the new warning from setattr_copy_mgtime is coming like the following. [ 113.215316] ------------[ cut here ]------------ [ 113.215974] WARNING: CPU: 1 PID: 31 at fs/attr.c:300 setattr_copy+0x1ee/0x200 [ 113.219192] CPU: 1 UID: 0 PID: 31 Comm: kworker/1:1 Not tainted 6.13.0-rc1+ #234 [ 113.220127] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [ 113.221530] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] [ 113.222220] RIP: 0010:setattr_copy+0x1ee/0x200 [ 113.222833] Code: 24 28 49 8b 44 24 30 48 89 53 58 89 43 6c 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 48 89 df e8 77 d6 ff ff e9 cd fe ff ff <0f> 0b e9 be fe ff ff 66 0 [ 113.225110] RSP: 0018:ffffaf218010fb68 EFLAGS: 00010202 [ 113.225765] RAX: 0000000000000120 RBX: ffffa446815f8568 RCX: 0000000000000003 [ 113.226667] RDX: ffffaf218010fd38 RSI: ffffa446815f8568 RDI: ffffffff94eb03a0 [ 113.227531] RBP: ffffaf218010fb90 R08: 0000001a251e217d R09: 00000000675259fa [ 113.228426] R10: 0000000002ba8a6d R11: ffffa4468196c7a8 R12: ffffaf218010fd38 [ 113.229304] R13: 0000000000000120 R14: ffffffff94eb03a0 R15: 0000000000000000 [ 113.230210] FS: 0000000000000000(0000) GS:ffffa44739d00000(0000) knlGS:0000000000000000 [ 113.231215] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 113.232055] CR2: 00007efe0053d27e CR3: 000000000331a000 CR4: 00000000000006b0 [ 113.232926] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 113.233812] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 113.234797] Call Trace: [ 113.235116] <TASK> [ 113.235393] ? __warn+0x73/0xd0 [ 113.235802] ? setattr_copy+0x1ee/0x200 [ 113.236299] ? report_bug+0xf3/0x1e0 [ 113.236757] ? handle_bug+0x4d/0x90 [ 113.237202] ? exc_invalid_op+0x13/0x60 [ 113.237689] ? asm_exc_invalid_op+0x16/0x20 [ 113.238185] ? setattr_copy+0x1ee/0x200 [ 113.238692] btrfs_setattr+0x80/0x820 [btrfs] [ 113.239285] ? get_stack_info_noinstr+0x12/0xf0 [ 113.239857] ? __module_address+0x22/0xa0 [ 113.240368] ? handle_ksmbd_work+0x6e/0x460 [ksmbd] [ 113.240993] ? __module_text_address+0x9/0x50 [ 113.241545] ? __module_address+0x22/0xa0 [ 113.242033] ? unwind_next_frame+0x10e/0x920 [ 113.242600] ? __pfx_stack_trace_consume_entry+0x10/0x10 [ 113.243268] notify_change+0x2c2/0x4e0 [ 113.243746] ? stack_depot_save_flags+0x27/0x730 [ 113.244339] ? set_file_basic_info+0x130/0x2b0 [ksmbd] [ 113.244993] set_file_basic_info+0x130/0x2b0 [ksmbd] [ 113.245613] ? process_scheduled_works+0xbe/0x310 [ 113.246181] ? worker_thread+0x100/0x240 [ 113.246696] ? kthread+0xc8/0x100 [ 113.247126] ? ret_from_fork+0x2b/0x40 [ 113.247606] ? ret_from_fork_asm+0x1a/0x30 [ 113.248132] smb2_set_info+0x63f/0xa70 [ksmbd] ksmbd is trying to set the atime and mtime via notify_change without also setting the ctime. so This patch add ATTR_CTIME flags when setting mtime to avoid a warning. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10ksmbd: fix racy issue from session lookup and expireNamjae Jeon
Increment the session reference count within the lock for lookup to avoid racy issue with session expire. Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25737 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10ksmbd: retry iterate_dir in smb2_query_dirHobin Woo
Some file systems do not ensure that the single call of iterate_dir reaches the end of the directory. For example, FUSE fetches entries from a daemon using 4KB buffer and stops fetching if entries exceed the buffer. And then an actor of caller, KSMBD, is used to fill the entries from the buffer. Thus, pattern searching on FUSE, files located after the 4KB could not be found and STATUS_NO_SUCH_FILE was returned. Signed-off-by: Hobin Woo <hobin.woo@samsung.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Tested-by: Yoonho Shin <yoonho.shin@samsung.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-10fanotify: allow to set errno in FAN_DENY permission responseAmir Goldstein
With FAN_DENY response, user trying to perform the filesystem operation gets an error with errno set to EPERM. It is useful for hierarchical storage management (HSM) service to be able to deny access for reasons more diverse than EPERM, for example EAGAIN, if HSM could retry the operation later. Allow fanotify groups with priority FAN_CLASSS_PRE_CONTENT to responsd to permission events with the response value FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. Limit custom error values to errors expected on read(2)/write(2) and open(2) of regular files. This list could be extended in the future. Userspace can test for legitimate values of FAN_DENY_ERRNO(errno) by writing a response to an fanotify group fd with a value of FAN_NOFD in the fd field of the response. The change in fanotify_response is backward compatible, because errno is written in the high 8 bits of the 32bit response field and old kernels reject respose value with high bits set. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/1e5fb6af84b69ca96b5c849fa5f10bdf4d1dc414.1731684329.git.josef@toxicpanda.com
2024-12-10fanotify: report file range info with pre-content eventsAmir Goldstein
With group class FAN_CLASS_PRE_CONTENT, report offset and length info along with FAN_PRE_ACCESS pre-content events. This information is meant to be used by hierarchical storage managers that want to fill partial content of files on first access to range. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/b90a9e6c809dd3cad5684da90f23ea93ec6ce8c8.1731684329.git.josef@toxicpanda.com
2024-12-10fanotify: introduce FAN_PRE_ACCESS permission eventAmir Goldstein
Similar to FAN_ACCESS_PERM permission event, but it is only allowed with class FAN_CLASS_PRE_CONTENT and only allowed on regular files and dirs. Unlike FAN_ACCESS_PERM, it is safe to write to the file being accessed in the context of the event handler. This pre-content event is meant to be used by hierarchical storage managers that want to fill the content of files on first read access. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/b80986f8d5b860acea2c9a73c0acd93587be5fe4.1731684329.git.josef@toxicpanda.com
2024-12-10fsnotify: generate pre-content permission event on truncateAmir Goldstein
Generate FS_PRE_ACCESS event before truncate, without sb_writers held. Move the security hooks also before sb_start_write() to conform with other security hooks (e.g. in write, fallocate). The event will have a range info of the page surrounding the new size to provide an opportunity to fill the conetnt at the end of file before truncating to non-page aligned size. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/23af8201db6ac2efdea94f09ab067d81ba5de7a7.1731684329.git.josef@toxicpanda.com
2024-12-10fsnotify: pass optional file access range in pre-content eventAmir Goldstein
We would like to add file range information to pre-content events. Pass a struct file_range with offset and length to event handler along with pre-content permission event. The offset and length are aligned to page size, but we may need to align them to minimum folio size for filesystems with large block size. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/88eddee301231d814aede27fb4d5b41ae37c9702.1731684329.git.josef@toxicpanda.com
2024-12-10fsnotify: introduce pre-content permission eventsAmir Goldstein
The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM, but it meant for a different use case of filling file content before access to a file range, so it has slightly different semantics. Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, so content scanners could inspect the content filled by pre-content event handler. Unlike FS_ACCESS_PERM, FS_PRE_ACCESS is also called before a file is modified by syscalls as write() and fallocate(). FS_ACCESS_PERM is reported also on blockdev and pipes, but the new pre-content events are only reported for regular files and dirs. The pre-content events are meant to be used by hierarchical storage managers that want to fill the content of files on first access. There are some specific requirements from filesystems that could be used with pre-content events, so add a flag for fs to opt-in for pre-content events explicitly before they can be used. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/b934c5e3af205abc4e0e4709f6486815937ddfdf.1731684329.git.josef@toxicpanda.com
2024-12-10fanotify: rename a misnamed constantAmir Goldstein
FANOTIFY_PIDFD_INFO_HDR_LEN is not the length of the header. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/8776ab90fe538225aeb561c560296bafd16b97c4.1731684329.git.josef@toxicpanda.com