summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-04-21zonefs: Add active seq file accountingDamien Le Moal
Modify struct zonefs_sb_info to add the s_active_seq_files atomic to count the number of seq files representing a zone that is partially written or explicitly open, that is, to count sequential files with a zone that is in an active state on the device. The helper function zonefs_account_active() is introduced to update this counter whenever a file is written or truncated. This helper is also used in the zonefs_seq_file_write_open() and zonefs_seq_file_write_close() functions when the explicit_open mount option is used. The s_active_seq_files counter is exported through sysfs using the read-only attribute nr_active_seq_files. The device maximum number of active zones is also exported through sysfs with the read-only attribute max_active_seq_files. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Export open zone resource information through sysfsDamien Le Moal
To allow applications to easily check the current usage status of the open zone resources of the mounted device, export through sysfs the counter of write open sequential files s_wro_seq_files field of struct zonefs_sb_info. The attribute is named nr_wro_seq_files and is read only. The maximum number of write open sequential files (zones) indicated by the s_max_wro_seq_files field of struct zonefs_sb_info is also exported as the read only attribute max_wro_seq_files. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Always do seq file write open accountingDamien Le Moal
The explicit_open mount option forces an explicitly open of the zone of sequential files that are open for writing to ensure that the open file can be written without the device failing write operations due to open zone resources limit being exceeded. To implement this, zonefs accounts all write open seq file when this mount option is used. This accounting however can be easily performed even when the explicit_open mount option is not used, thus allowing applications to control zone resources on their own, without relying on open() system call failures from zonefs. To implement this, the helper zonefs_file_use_exp_open() is removed and replaced with the helper zonefs_seq_file_need_wro() which test if a file is a sequential file being open with write access. zonefs_open_zone() and zonefs_close_zone() are renamed respectively to zonefs_seq_file_write_open() and zonefs_seq_file_write_close() and modified to update the s_wro_seq_files counter regardless of the explicit_open mount option use. If the explicit_open mount option is used, zonefs_seq_file_write_open() execute an explicit zone open operation for a sequential file open for writing for the first time, as before. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Rename super block information fieldsDamien Le Moal
The s_open_zones field of struct zonefs_sb_info is used to count the number of files that are open for writing and may not necessarilly correspond to the number of open zones on the device. For instance, an application may open for writing a sequential zone file, fully write it and keep the file open. In such case, the zone of the file is not open anymore (it is in the full state). Avoid confusion about this counter meaning by renaming it to s_wro_seq_files. To keep things consistent, the field s_max_open_zones is renamed to s_max_wro_seq_files. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Fix management of open zonesDamien Le Moal
The mount option "explicit_open" manages the device open zone resources to ensure that if an application opens a sequential file for writing, the file zone can always be written by explicitly opening the zone and accounting for that state with the s_open_zones counter. However, if some zones are already open when mounting, the device open zone resource usage status will be larger than the initial s_open_zones value of 0. Ensure that this inconsistency does not happen by closing any sequential zone that is open when mounting. Furthermore, with ZNS drives, closing an explicitly open zone that has not been written will change the zone state to "closed", that is, the zone will remain in an active state. Since this can then cause failures of explicit open operations on other zones if the drive active zone resources are exceeded, we need to make sure that the zone is not active anymore by resetting it instead of closing it. To address this, zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request into a REQ_OP_ZONE_RESET for sequential zones that have not been written. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21zonefs: Clear inode information flags on inode creationDamien Le Moal
Ensure that the i_flags field of struct zonefs_inode_info is cleared to 0 when initializing a zone file inode, avoiding seeing the flag ZONEFS_ZONE_OPEN being incorrectly set. Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
2022-04-21xfs: simplify local variable assignment in file write codeKaixu Xia
Get the struct inode pointer from iocb->ki_filp->f_mapping->host directly and the other variables are unnecessary, so simplify the local variables assignment. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: reorder iunlink remove operation in xfs_ifreeDave Chinner
The O_TMPFILE creation implementation creates a specific order of operations for inode allocation/freeing and unlinked list modification. Currently both are serialised by the AGI, so the order doesn't strictly matter as long as the are both in the same transaction. However, if we want to move the unlinked list insertions largely out from under the AGI lock, then we have to be concerned about the order in which we do unlinked list modification operations. O_TMPFILE creation tells us this order is inode allocation/free, then unlinked list modification. Change xfs_ifree() to use this same ordering on unlinked list removal. This way we always guarantee that when we enter the iunlinked list removal code from this path, we already have the AGI locked and we don't have to worry about lock nesting AGI reads inside unlink list locks because it's already locked and attached to the transaction. We can do this safely as the inode freeing and unlinked list removal are done in the same transaction and hence are atomic operations with respect to log recovery. Reported-by: Frank Hofmann <fhofmann@cloudflare.com> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: convert buffer flags to unsigned.Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. This manifests as a compiler error such as: /kisskb/src/fs/xfs/./xfs_trace.h:432:2: note: in expansion of macro 'TP_printk' TP_printk("dev %d:%d daddr 0x%llx bbcount 0x%x hold %d pincount %d " ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:5: note: in expansion of macro '__print_flags' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/xfs_buf.h:67:4: note: in expansion of macro 'XBF_UNMAPPED' { XBF_UNMAPPED, "UNMAPPED" } ^ /kisskb/src/fs/xfs/./xfs_trace.h:440:40: note: in expansion of macro 'XFS_BUF_FLAGS' __print_flags(__entry->flags, "|", XFS_BUF_FLAGS), ^ /kisskb/src/fs/xfs/./xfs_trace.h: In function 'trace_raw_output_xfs_buf_flags_class': /kisskb/src/fs/xfs/xfs_buf.h:46:23: error: initializer element is not constant #define XBF_UNMAPPED (1 << 31)/* do not map the buffer */ as __print_flags assigns XFS_BUF_FLAGS to a structure that uses an unsigned long for the flag. Since this results in the value of XBF_UNMAPPED causing a signed integer overflow, the result is technically undefined behavior, which gcc-5 does not accept as an integer constant. This is based on a patch from Arnd Bergman <arnd@arndb.de>. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-20Merge tag 'erofs-for-5.18-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "One patch to fix a use-after-free race related to the on-stack z_erofs_decompressqueue, which happens very rarely but needs to be fixed properly soon. The other patch fixes some sysfs Sphinx warnings" * tag 'erofs-for-5.18-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: Documentation/ABI: sysfs-fs-erofs: Fix Sphinx errors erofs: fix use-after-free of on-stack io[]
2022-04-20Revert "fs/pipe: use kvcalloc to allocate a pipe_buffer array"Linus Torvalds
This reverts commit 5a519c8fe4d620912385f94372fc8472fa98c662. It turns out that making the pipe almost arbitrarily large has some rather unexpected downsides. The kernel test robot reports a kernel warning that is due to pipe->max_usage now growing to the point where the iter_file_splice_write() buffer allocation can no longer be satisfied as a slab allocation, and the int nbufs = pipe->max_usage; struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec), GFP_KERNEL); code sequence there will now always fail as a result. That code could be modified to use kvcalloc() too, but I feel very uncomfortable making those kinds of changes for a very niche use case that really should have other options than make these kinds of fundamental changes to pipe behavior. Maybe the CRIU process dumping should be multi-threaded, and use multiple pipes and multiple cores, rather than try to use one larger pipe to minimize splice() calls. Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/all/20220420073717.GD16310@xsang-OptiPlex-9020/ Cc: Andrei Vagin <avagin@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-20f2fs: fix wrong condition check when failing metapage readJaegeuk Kim
This patch fixes wrong initialization. Fixes: 50c63009f6ab ("f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20f2fs: keep io_flags to avoid IO split due to different op_flags in two fio ↵Jaegeuk Kim
holders Let's attach io_flags to bio only, so that we can merge IOs given original io_flags only. Fixes: 64bf0eef0171 ("f2fs: pass the bio operation to bio_alloc_bioset") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20f2fs: remove obsolete whint_modeJaegeuk Kim
This patch removes obsolete whint_mode. Fixes: 41d36a9f3e53 ("fs: remove kiocb.ki_hint") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-19binfmt_flat: Drop vestiges of coredump supportEric W. Biederman
There is the briefest start of coredump support in binfmt_flat. It is actually a pain to maintain as binfmt_flat is not built on most architectures so it is easy to overlook. Since the support does not do anything remove it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/87mtgh17li.fsf_-_@email.froward.int.ebiederm.org
2022-04-19fs: fix acl translationChristian Brauner
Last cycle we extended the idmapped mounts infrastructure to support idmapped mounts of idmapped filesystems (No such filesystem yet exist.). Since then, the meaning of an idmapped mount is a mount whose idmapping is different from the filesystems idmapping. While doing that work we missed to adapt the acl translation helpers. They still assume that checking for the identity mapping is enough. But they need to use the no_idmapping() helper instead. Note, POSIX ACLs are always translated right at the userspace-kernel boundary using the caller's current idmapping and the initial idmapping. The order depends on whether we're coming from or going to userspace. The filesystem's idmapping doesn't matter at the border. Consequently, if a non-idmapped mount is passed we need to make sure to always pass the initial idmapping as the mount's idmapping and not the filesystem idmapping. Since it's irrelevant here it would yield invalid ids and prevent setting acls for filesystems that are mountable in a userns and support posix acls (tmpfs and fuse). I verified the regression reported in [1] and verified that this patch fixes it. A regression test will be added to xfstests in parallel. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1] Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems") Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> # 5.17 Cc: <regressions@lists.linux.dev> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-19fs: jfs: fix possible NULL pointer dereference in dbFree()Zixuan Fu
In our fault-injection testing, the variable "nblocks" in dbFree() can be zero when kmalloc_array() fails in dtSearch(). In this case, the variable "mp" in dbFree() would be NULL and then it is dereferenced in "write_metapage(mp)". The failure log is listed as follows: [ 13.824137] BUG: kernel NULL pointer dereference, address: 0000000000000020 ... [ 13.827416] RIP: 0010:dbFree+0x5f7/0x910 [jfs] [ 13.834341] Call Trace: [ 13.834540] <TASK> [ 13.834713] txFreeMap+0x7b4/0xb10 [jfs] [ 13.835038] txUpdateMap+0x311/0x650 [jfs] [ 13.835375] jfs_lazycommit+0x5f2/0xc70 [jfs] [ 13.835726] ? sched_dynamic_update+0x1b0/0x1b0 [ 13.836092] kthread+0x3c2/0x4a0 [ 13.836355] ? txLockFree+0x160/0x160 [jfs] [ 13.836763] ? kthread_unuse_mm+0x160/0x160 [ 13.837106] ret_from_fork+0x1f/0x30 [ 13.837402] </TASK> ... This patch adds a NULL check of "mp" before "write_metapage(mp)" is called. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2022-04-19btrfs: fix direct I/O writes for split bios on zoned devicesChristoph Hellwig
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_end_dio_bio to record the write location for zone devices is incorrect for subsequent bios. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19btrfs: fix direct I/O read repair for split biosChristoph Hellwig
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add a file_offset field to struct btrfs_bio to pass along the correct offset. Given that check_data_csum only uses start of an error message this means problems with this miscalculation will only show up when I/O fails or checksums mismatch. The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical member") but we need it due to the bio splitting. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19btrfs: fix and document the zoned device choice in alloc_new_bioChristoph Hellwig
Zone Append bios only need a valid block device in struct bio, but not the device in the btrfs_bio. Use the information from btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on multi-device file system with non-homogeneous capabilities and remove the pointless btrfs_bio.device assignment. Add big fat comments explaining what is going on here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19btrfs: fix leaked plug after failure syncing log on zoned filesystemsFilipe Manana
On a zoned filesystem, if we fail to allocate the root node for the log root tree while syncing the log, we end up returning without finishing the IO plug we started before, resulting in leaking resources as we have started writeback for extent buffers of a log tree before. That allocation failure, which typically is either -ENOMEM or -ENOSPC, is not fatal and the fsync can safely fallback to a full transaction commit. So release the IO plug if we fail to allocate the extent buffer for the root of the log root tree when syncing the log on a zoned filesystem. Fixes: 3ddebf27fcd3a9 ("btrfs: zoned: reorder log node allocation on zoned filesystem") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-18binfmt_flat: do not stop relocating GOT entries prematurely on riscvNiklas Cassel
bFLT binaries are usually created using elf2flt. The linker script used by elf2flt has defined the .data section like the following for the last 19 years: .data : { _sdata = . ; __data_start = . ; data_start = . ; *(.got.plt) *(.got) FILL(0) ; . = ALIGN(0x20) ; LONG(-1) . = ALIGN(0x20) ; ... } It places the .got.plt input section before the .got input section. The same is true for the default linker script (ld --verbose) on most architectures except x86/x86-64. The binfmt_flat loader should relocate all GOT entries until it encounters a -1 (the LONG(-1) in the linker script). The problem is that the .got.plt input section starts with a GOTPLT header (which has size 16 bytes on elf64-riscv and 8 bytes on elf32-riscv), where the first word is set to -1. See the binutils implementation for riscv [1]. This causes the binfmt_flat loader to stop relocating GOT entries prematurely and thus causes the application to crash when running. Fix this by skipping the whole GOTPLT header, since the whole GOTPLT header is reserved for the dynamic linker. The GOTPLT header will only be skipped for bFLT binaries with flag FLAT_FLAG_GOTPIC set. This flag is unconditionally set by elf2flt if the supplied ELF binary has the symbol _GLOBAL_OFFSET_TABLE_ defined. ELF binaries without a .got input section should thus remain unaffected. Tested on RISC-V Canaan Kendryte K210 and RISC-V QEMU nommu_virt_defconfig. [1] https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elfnn-riscv.c;hb=binutils-2_38#l3275 Cc: <stable@vger.kernel.org> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220414091018.896737-1-niklas.cassel@wdc.com Fixed-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202204182333.OIUOotK8-lkp@intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2022-04-18cifs: Use kzalloc instead of kmalloc/memsetHaowen Bai
Use kzalloc rather than duplicating its implementation, which makes code simple and easy to understand. Signed-off-by: Haowen Bai <baihaowen@meizu.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-17direct-io: remove random prefetchesChristoph Hellwig
Randomly poking into block device internals for manual prefetches isn't exactly a very maintainable thing to do. And none of the performance critical direct I/O implementations still use this library function anyway, so just drop it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220415045258.199825-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARDChristoph Hellwig
Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_discard_granularity helperChristoph Hellwig
Abstract away implementation details from file systems by providing a block_device based helper to retrieve the discard granularity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: remove QUEUE_FLAG_DISCARDChristoph Hellwig
Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_max_discard_sectors helperChristoph Hellwig
Add a helper to query the number of sectors support per each discard bio based on the block device and use this helper to stop various places from poking into the request_queue to see if discard is supported and if so how much. This mirrors what is done e.g. for write zeroes as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-24-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_max_zone_append_sectors helperChristoph Hellwig
Add a helper to check the max supported sectors for zone append based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_stable_writes helperChristoph Hellwig
Add a helper to check the stable writes flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_fua helperChristoph Hellwig
Add a helper to check the FUA flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_write_cache helperChristoph Hellwig
Add a helper to check the write cache flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: add a bdev_nonrot helperChristoph Hellwig
Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17ntfs3: use bdev_logical_block_size instead of open coding itChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220415045258.199825-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17btrfs: use bdev_max_active_zones instead of open coding itChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: turn bio_kmalloc into a simple kmalloc wrapperChristoph Hellwig
Remove the magic autofree semantics and require the callers to explicitly call bio_init to initialize the bio. This allows bio_free to catch accidental bio_put calls on bio_init()ed bios as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17squashfs: always use bio_kmalloc in squashfs_bio_readChristoph Hellwig
If a plain kmalloc that is not backed by a mempool is safe here for a large read (and the actual page allocations), it must also be for a small one, so simplify the code a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220406061228.410163-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17btrfs: simplify ->flush_bio handlingChristoph Hellwig
Use and embedded bios that is initialized when used instead of bio_kmalloc plus bio_reset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220406061228.410163-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17io_uring: fix leaks on IOPOLL and CQE_SKIPPavel Begunkov
If all completed requests in io_do_iopoll() were marked with REQ_F_CQE_SKIP, we'll not only skip CQE posting but also io_free_batch_list() leaking memory and resources. Move @nr_events increment before REQ_F_CQE_SKIP check. We'll potentially return the value greater than the real one, but iopolling will deal with it and the userspace will re-iopoll if needed. In anyway, I don't think there are many use cases for REQ_F_CQE_SKIP + IOPOLL. Fixes: 83a13a4181b0e ("io_uring: tweak iopoll CQE_SKIP event counting") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5072fc8693fbfd595f89e5d4305bfcfd5d2f0a64.1650186611.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-16io_uring: free iovec if file assignment failsJens Axboe
We just return failure in this case, but we need to release the iovec first. If we're doing IO with more than FAST_IOV segments, then the iovec is allocated and must be freed. Reported-by: syzbot+96b43810dfe9c3bb95ed@syzkaller.appspotmail.com Fixes: 584b0180f0f4 ("io_uring: move read/write file prep state into actual opcode handler") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "14 patches. Subsystems affected by this patch series: MAINTAINERS, binfmt, and mm (tmpfs, secretmem, kasan, kfence, pagealloc, zram, compaction, hugetlb, vmalloc, and kmemleak)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: kmemleak: take a full lowmem check in kmemleak_*_phys() mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" hugetlb: do not demote poisoned hugetlb pages mm: compaction: fix compiler warning when CONFIG_COMPACTION=n mm: fix unexpected zeroed page mapping with zram swap mm, page_alloc: fix build_zonerefs_node() mm, kfence: support kmem_dump_obj() for KFENCE objects kasan: fix hw tags enablement when KUNIT tests are disabled irq_work: use kasan_record_aux_stack_noalloc() record callstack mm/secretmem: fix panic when growing a memfd_secret tmpfs: fix regressions from wider use of ZERO_PAGE MAINTAINERS: Broadcom internal lists aren't maintainers
2022-04-15revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"Andrew Morton
Despite Mike's attempted fix (925346c129da117122), regressions reports continue: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info So revert this patch. Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fangrui Song <maskray@google.com> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"Andrew Morton
Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") was an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE"). But regressionss continue to be reported: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info This patch reverts the fix, so the original can also be reverted. Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Song Liu <songliubraving@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Fangrui Song <maskray@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15Merge tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Ensure we check and -EINVAL any use of reserved or struct padding. Although we generally always do that, it's missed in two spots for resource updates, one for the ring fd registration from this merge window, and one for the extended arg. Make sure we have all of them handled. (Dylan) - A few fixes for the deferred file assignment (me, Pavel) - Add a feature flag for the deferred file assignment so apps can tell we handle it correctly (me) - Fix a small perf regression with the current file position fix in this merge window (me) * tag 'io_uring-5.18-2022-04-14' of git://git.kernel.dk/linux-block: io_uring: abort file assignment prior to assigning creds io_uring: fix poll error reporting io_uring: fix poll file assign deadlock io_uring: use right issue_flags for splice/tee io_uring: verify pad field is 0 in io_get_ext_arg io_uring: verify resv is 0 in ringfd register/unregister io_uring: verify that resv2 is 0 in io_uring_rsrc_update2 io_uring: move io_uring_rsrc_update2 validation io_uring: fix assign file locking issue io_uring: stop using io_wq_work as an fd placeholder io_uring: move apoll->events cache io_uring: io_kiocb_update_pos() should not touch file for non -1 offset io_uring: flag the fact that linked file assignment is sane
2022-04-15erofs: fix use-after-free of on-stack io[]Hongyu Jin
The root cause is the race as follows: Thread #1 Thread #2(irq ctx) z_erofs_runqueue() struct z_erofs_decompressqueue io_A[]; submit bio A z_erofs_decompress_kickoff(,,1) z_erofs_decompressqueue_endio(bio A) z_erofs_decompress_kickoff(,,-1) spin_lock_irqsave() atomic_add_return() io_wait_event() -> pending_bios is already 0 [end of function] wake_up_locked(io_A[]) // crash Referenced backtrace in kernel 5.4: [ 10.129422] Unable to handle kernel paging request at virtual address eb0454a4 [ 10.364157] CPU: 0 PID: 709 Comm: getprop Tainted: G WC O 5.4.147-ab09225 #1 [ 11.556325] [<c01b33b8>] (__wake_up_common) from [<c01b3300>] (__wake_up_locked+0x40/0x48) [ 11.565487] [<c01b3300>] (__wake_up_locked) from [<c044c8d0>] (z_erofs_vle_unzip_kickoff+0x6c/0xc0) [ 11.575438] [<c044c8d0>] (z_erofs_vle_unzip_kickoff) from [<c044c854>] (z_erofs_vle_read_endio+0x16c/0x17c) [ 11.586082] [<c044c854>] (z_erofs_vle_read_endio) from [<c06a80e8>] (clone_endio+0xb4/0x1d0) [ 11.595428] [<c06a80e8>] (clone_endio) from [<c04a1280>] (blk_update_request+0x150/0x4dc) [ 11.604516] [<c04a1280>] (blk_update_request) from [<c06dea28>] (mmc_blk_cqe_complete_rq+0x144/0x15c) [ 11.614640] [<c06dea28>] (mmc_blk_cqe_complete_rq) from [<c04a5d90>] (blk_done_softirq+0xb0/0xcc) [ 11.624419] [<c04a5d90>] (blk_done_softirq) from [<c010242c>] (__do_softirq+0x184/0x56c) [ 11.633419] [<c010242c>] (__do_softirq) from [<c01051e8>] (irq_exit+0xd4/0x138) [ 11.641640] [<c01051e8>] (irq_exit) from [<c010c314>] (__handle_domain_irq+0x94/0xd0) [ 11.650381] [<c010c314>] (__handle_domain_irq) from [<c04fde70>] (gic_handle_irq+0x50/0xd4) [ 11.659641] [<c04fde70>] (gic_handle_irq) from [<c0101b70>] (__irq_svc+0x70/0xb0) Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220401115527.4935-1-hongyu.jin.cn@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-04-14ext4: update the cached overhead value in the superblockTheodore Ts'o
If we (re-)calculate the file system overhead amount and it's different from the on-disk s_overhead_clusters value, update the on-disk version since this can take potentially quite a while on bigalloc file systems. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-04-14io_uring: abort file assignment prior to assigning credsJens Axboe
We need to either restore creds properly if we fail on the file assignment, or just do the file assignment first instead. Let's do the latter as it's simpler, should make no difference here for file assignment. Link: https://lore.kernel.org/lkml/000000000000a7edb305dca75a50@google.com/ Reported-by: syzbot+60c52ca98513a8760a91@syzkaller.appspotmail.com Fixes: 6bf9c47a3989 ("io_uring: defer file assignment") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-14ext4: force overhead calculation if the s_overhead_cluster makes no senseTheodore Ts'o
If the file system does not use bigalloc, calculating the overhead is cheap, so force the recalculation of the overhead so we don't have to trust the precalculated overhead in the superblock. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-04-14ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATIONNamjae Jeon
Currently ksmbd is using ->f_bsize from vfs_statfs() as sector size. If fat/exfat is a local share, ->f_bsize is a cluster size that is too large to be used as a sector size. Sector sizes larger than 4K cause problem occurs when mounting an iso file through windows client. The error message can be obtained using Mount-DiskImage command, the error is: "Mount-DiskImage : The sector size of the physical disk on which the virtual disk resides is not supported." This patch reports fixed 4KB sector size if ->s_blocksize is bigger than 4KB. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-04-14ksmbd: increment reference count of parent fpNamjae Jeon
Add missing increment reference count of parent fp in ksmbd_lookup_fd_inode(). Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>