path: root/fs
Age | Commit message | Author
2020-09-18  fuse: split fuse_mount off of fuse_conn (Max Reitz)
We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn.

We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must not invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
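The ownership model described above can be pictured with a small user-space sketch: several mount objects share one connection object, and connection-level teardown happens exactly once, when the last mount goes away. All names below are invented stand-ins, not the kernel's structures, and the sketch uses a plain refcount where the actual patch tracks which fuse_mount is the root mount:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-ins for fuse_conn / fuse_mount; not the kernel structs. */
struct conn {
	int refcount;           /* one reference per mount */
};

struct mount {
	struct conn *fc;        /* all mounts share the same connection */
	const char *name;
};

static struct conn *conn_get(struct conn *fc)
{
	fc->refcount++;
	return fc;
}

static void conn_put(struct conn *fc)
{
	if (--fc->refcount == 0) {
		/* Last mount gone: connection-level teardown runs exactly once. */
		printf("sending DESTROY, freeing connection\n");
		free(fc);
	}
}

static struct mount *mount_new(struct conn *fc, const char *name)
{
	struct mount *fm = malloc(sizeof(*fm));
	fm->fc = conn_get(fc);
	fm->name = name;
	return fm;
}

static void mount_destroy(struct mount *fm)
{
	printf("releasing mount %s\n", fm->name);
	conn_put(fm->fc);
	free(fm);
}

int main(void)
{
	struct conn *fc = calloc(1, sizeof(*fc));
	struct mount *root = mount_new(fc, "root");
	struct mount *sub = mount_new(fc, "submount");

	mount_destroy(sub);     /* connection stays alive */
	mount_destroy(root);    /* last mount: DESTROY happens exactly once */
	return 0;
}

The point of the sketch is only the "last one out turns off the lights" behavior; per-mount state (superblock, device ID) lives in the mount object, shared state stays in the connection.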
2020-09-18  fuse: drop fuse_conn parameter where possible (Max Reitz)
With the last commit, all functions that handle some existing fuse_req no longer need to be given the associated fuse_conn, because they can get it from the fuse_req object. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18  fuse: store fuse_conn in fuse_req (Max Reitz)
Every fuse_req belongs to a fuse_conn. Right now, we always know which fuse_conn that is based on the respective device, but we want to allow multiple (sub)mounts per single connection, and then the corresponding filesystem is not going to be so trivial to obtain. Storing a pointer to the associated fuse_conn in every fuse_req will allow us to trivially find any request's superblock (and thus filesystem) even then. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18  fuse: fix page dereference after free (Miklos Szeredi)
After unlock_request() pages from the ap->pages[] array may be put (e.g. by aborting the connection) and the pages can be freed. Prevent use after free by grabbing a reference to the page before calling unlock_request(). The original patch was created by Pradeep P V K. Reported-by: Pradeep P V K <ppvk@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
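The fix follows a common lifetime rule: take your own reference on the object before dropping the lock that currently keeps it alive, and drop that reference only when you are done with the object. A minimal user-space sketch with a refcounted buffer (all names invented; the kernel code pins the page with get_page()/put_page() around unlock_request()):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy refcounted "page"; stands in for struct page. */
struct buf {
	int refcount;
	char data[64];
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void buf_get(struct buf *b) { b->refcount++; }

static void buf_put(struct buf *b)
{
	if (--b->refcount == 0)
		free(b);
}

/* Copy out of the buffer without holding the lock across the copy. */
static void copy_from_buf(struct buf *b, char *dst, size_t len)
{
	pthread_mutex_lock(&lock);
	buf_get(b);                 /* pin it before dropping the lock */
	pthread_mutex_unlock(&lock);

	/* Even if another thread drops its reference now, ours keeps b alive. */
	memcpy(dst, b->data, len);

	pthread_mutex_lock(&lock);
	buf_put(b);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	struct buf *b = calloc(1, sizeof(*b));
	b->refcount = 1;
	strcpy(b->data, "hello");

	char out[64];
	copy_from_buf(b, out, sizeof(out));
	printf("%s\n", out);

	pthread_mutex_lock(&lock);
	buf_put(b);
	pthread_mutex_unlock(&lock);
	return 0;
}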
2020-09-10  virtiofs: add logic to free up a memory range (Vivek Goyal)
Add logic to free up a busy memory range. A freed memory range is returned to the free pool. Add a worker which can be started to select and free some busy memory ranges. A process can also steal one of its own busy dax ranges if no free range is available; I will refer to this as direct reclaim. If no free range is available and nothing can be stolen from the same inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold the following locks in the specified order.

  down_write(&fi->i_mmap_sem);
  down_write(&fi->dax->sem);

We look for a free range in the following order.

  A. Try to get a free range.
  B. If not, try direct reclaim.
  C. If not, wait for a memory range to become free.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
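The A/B/C lookup order above (free pool, then direct reclaim from the same inode, then wait) can be sketched in user space as follows. The structures and helpers are invented for illustration only, and the real code must additionally take the locks listed above and unmap/write back a range before stealing it:

#include <stdio.h>
#include <stdlib.h>

struct range {
	long start;             /* offset into the DAX window */
	struct range *next;
};

struct inode_dax {
	struct range *busy;     /* ranges currently mapped for this inode */
};

static struct range *free_list;

/* A. Try the free pool. */
static struct range *take_free(void)
{
	struct range *r = free_list;
	if (r)
		free_list = r->next;
	return r;
}

/* B. Direct reclaim: steal one of this inode's own busy ranges. */
static struct range *steal_from_inode(struct inode_dax *in)
{
	struct range *r = in->busy;
	if (r)
		in->busy = r->next;     /* real code also unmaps and writes back */
	return r;
}

static struct range *alloc_range(struct inode_dax *in)
{
	struct range *r = take_free();
	if (!r)
		r = steal_from_inode(in);
	if (!r) {
		/* C. Nothing available: the real code sleeps on a waitq here. */
		return NULL;
	}
	r->next = in->busy;
	in->busy = r;
	return r;
}

int main(void)
{
	/* Seed the free pool with one range. */
	struct range *seed = calloc(1, sizeof(*seed));
	free_list = seed;

	struct inode_dax in = { 0 };
	printf("first alloc:  %p (from free pool)\n", (void *)alloc_range(&in));
	printf("second alloc: %p (stolen from same inode)\n", (void *)alloc_range(&in));
	return 0;
}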
2020-09-10  virtiofs: maintain a list of busy elements (Vivek Goyal)
This list will be used to select a fuse_dax_mapping to free when the number of free mappings drops below a threshold.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: serialize truncate/punch_hole and dax fault path (Vivek Goyal)
Currently in fuse we don't seem to have any lock which can serialize the fault path with the truncate/punch_hole path. With dax support I need one for the following reasons.

1. Dax requirement

   DAX fault code relies on the inode size being stable for the duration of the fault, wants to serialize with truncate/punch_hole, and says so explicitly:

     static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
                                           const struct iomap_ops *ops)
     /*
      * Check whether offset isn't beyond end of file now. Caller is
      * supposed to hold locks serializing us with truncate / punch hole so
      * this is a reliable test.
      */
     max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);

2. Make sure there are no users of pages being truncated/punched out

   get_user_pages() might take references to pages and then do some DMA to said pages. The filesystem might truncate those pages without knowing that DMA or other I/O is in progress. So use dax_layout_busy_page() to make sure there are no such references and no I/O is in progress on said pages before moving ahead with the truncation.

3. Limitation of kvm page fault error reporting

   If we truncate the file on the host first and only later remove mappings in the guest (truncate page cache etc.), this can lead to a problem with KVM. Say a mapping is in place in the guest and the truncation happens on the host. Now if the guest accesses that mapping, the host will take a fault and kvm will either exit to qemu or spin infinitely. IOW, before we do the truncation on the host, we need to make sure that the guest inode does not have any mapping in that region or the whole file.

4. virtiofs memory range reclaim

   Soon I will introduce the notion of being able to reclaim dax memory ranges from a fuse dax inode. There too I need to make sure that no I/O or fault is going on in the reclaimed range and nobody is using it, so that the range can be reclaimed without issues.

Currently if we take the inode lock, that serializes read/write, but it does not do anything for faults. So I add another semaphore, fuse_inode->i_mmap_sem, for this purpose. It can be used to serialize with faults. As of now, I am taking this semaphore only in the dax fault path and not the regular fault path, because the existing code does not have one. Maybe the existing code can benefit from it as well to take care of some races, but that we can fix later if need be. For now, I am focusing only on the DAX path, which is new.

Also added logic to take fuse_inode->i_mmap_sem in the truncate/punch_hole/open(O_TRUNC) paths to make sure file truncation and fuse dax faults are mutually exclusive and to avoid all the above problems.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
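The new i_mmap_sem is a reader/writer semaphore: the DAX fault path takes it shared, while truncate/punch_hole/open(O_TRUNC) take it exclusive. A user-space sketch of the same pattern with a pthread rwlock (function names and sizes are invented, and the real truncate path does much more under the write lock):

#include <pthread.h>
#include <stdio.h>

/* Stand-in for fuse_inode->i_mmap_sem. */
static pthread_rwlock_t i_mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static long i_size = 4096;

/* Fault path: the inode size must stay stable while we use it. */
static void dax_fault(long offset)
{
	pthread_rwlock_rdlock(&i_mmap_sem);
	if (offset < i_size)
		printf("fault at %ld served against stable i_size %ld\n", offset, i_size);
	else
		printf("fault at %ld is beyond EOF\n", offset);
	pthread_rwlock_unlock(&i_mmap_sem);
}

/* Truncate path: excludes all faults while the size changes. */
static void truncate_to(long new_size)
{
	pthread_rwlock_wrlock(&i_mmap_sem);
	/* Real code also waits for busy pages and removes mappings here. */
	i_size = new_size;
	pthread_rwlock_unlock(&i_mmap_sem);
}

int main(void)
{
	dax_fault(1024);
	truncate_to(512);
	dax_fault(1024);
	return 0;
}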
2020-09-10  virtiofs: define dax address space operations (Vivek Goyal)
This is done along the lines of ext4 and xfs. I primarily wanted the ->writepages hook at this time so that I could call into dax_writeback_mapping_range(). This in turn will decide which pfns need to be written back.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: add DAX mmap support (Stefan Hajnoczi)
Add DAX mmap() support. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: implement dax read/write operations (Vivek Goyal)
This patch implements basic DAX support. mmap() is not implemented yet and will come in later patches; this patch covers read/write. We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file-extending writes; instead just send a WRITE message to the daemon (like we do for the direct I/O path). This keeps the write and the i_size change atomic w.r.t. crash.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: implement FUSE_INIT map_alignment field (Stefan Hajnoczi)
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment constraints via the FUSE_INIT map_alignment field. Parse this field and ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are already 2MB aligned. Just check the value when the connection is established. If it becomes necessary to honor arbitrary alignments in the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will work even when emulating x86 on Power and similar combinations where the host page sizes are different.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
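The check itself is simple: the fixed 2MB mapping size must satisfy whatever alignment the device advertises. The sketch below assumes the constraint is expressed as a size in bytes; how map_alignment is actually encoded is defined by the FUSE protocol header and is not spelled out in this log, so treat the helper as illustrative only:

#include <stdbool.h>
#include <stdio.h>

#define DAX_MAPPING_SIZE (2UL * 1024 * 1024)   /* 2MB mappings, as in the text */

/*
 * Check that our fixed mapping size satisfies the device's alignment
 * constraint, here assumed to be given in bytes (an assumption for
 * this sketch, not a statement about the protocol encoding).
 */
static bool mapping_alignment_ok(unsigned long map_alignment)
{
	if (map_alignment == 0)
		return true;                        /* no constraint advertised */
	return (DAX_MAPPING_SIZE % map_alignment) == 0;
}

int main(void)
{
	printf("4KiB constraint ok: %d\n", mapping_alignment_ok(4096));
	printf("3MB constraint ok:  %d\n", mapping_alignment_ok(3UL * 1024 * 1024));
	return 0;
}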
2020-09-10  virtiofs: keep a list of free dax memory ranges (Vivek Goyal)
Divide the dax memory range into fixed-size ranges (2MB for now) and put them in a list. This will track free ranges. Once an inode requires a free range, we will take one from here and put it in the interval tree of ranges assigned to the inode.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
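Setting up the free pool amounts to slicing the DAX window into fixed-size chunks and pushing each onto a list. A minimal sketch (structure names invented, no locking shown):

#include <stdio.h>
#include <stdlib.h>

#define RANGE_SIZE (2UL * 1024 * 1024)          /* 2MB ranges, as in the text */

struct dax_range {
	unsigned long window_offset;                /* offset into the DAX window */
	struct dax_range *next;
};

static struct dax_range *free_ranges;
static unsigned long nr_free;

static int setup_free_ranges(unsigned long window_size)
{
	for (unsigned long off = 0; off + RANGE_SIZE <= window_size; off += RANGE_SIZE) {
		struct dax_range *r = malloc(sizeof(*r));
		if (!r)
			return -1;
		r->window_offset = off;
		r->next = free_ranges;                  /* push onto the free list */
		free_ranges = r;
		nr_free++;
	}
	return 0;
}

int main(void)
{
	if (setup_free_ranges(64UL * 1024 * 1024))  /* 64MB window -> 32 ranges */
		return 1;
	printf("%lu free ranges of %lu bytes\n", nr_free, RANGE_SIZE);
	return 0;
}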
2020-09-10  virtiofs: add a mount option to enable dax (Vivek Goyal)
Add a mount option to allow using dax with virtio_fs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: set up virtio_fs dax_device (Stefan Hajnoczi)
Set up a dax device. Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have struct pages (at least on x86). Use devm_memremap_pages() to map the DAX window PCI BAR and allocate struct page.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: get rid of no_mount_options (Vivek Goyal)
This option was introduced so that virtio_fs does not show any mount options in fuse_show_options(), because we don't offer any of these options to be controlled by the mounter. Very soon we are planning to introduce the option "dax", which the mounter should be able to specify, and then no_mount_options does not work anymore.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  virtiofs: provide a helper function for virtqueue initialization (Vivek Goyal)
This reduces code duplication and makes the code a little easier to read.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10  dax: Create a range version of dax_layout_busy_page() (Vivek Goyal)
The virtiofs device has a range of memory which is mapped into file inodes using dax. This memory is mapped in qemu on the host and maps different sections of a real file on the host. The size of this memory is limited (determined by the administrator) and, depending on the filesystem size, we will soon reach a situation where all the memory is in use and we need to reclaim some.

As part of the reclaim process, we will need to make sure that there are no active references to pages (taken by get_user_pages()) on the memory range we are trying to reclaim. I am planning to use dax_layout_busy_page() for this. But in its current form it is per inode and scans through all the pages of the inode. We want to reclaim only a portion of memory (say a 2MB page), so we want to make sure that only that 2MB range of pages has no references (and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function named dax_layout_busy_page_range() which can be passed a range which needs to be unmapped.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Cc: Jan Kara <jack@suse.cz>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: "Weiny, Ira" <ira.weiny@intel.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
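The only conceptual difference from the full-inode version is the scan window: walk just the pages covering the requested byte range instead of every page of the inode. A toy model with an array of refcounted pages (everything here is invented for illustration, not the dax.c implementation):

#include <stdio.h>

#define NR_PAGES   16
#define PAGE_SIZE  4096UL

/* Toy refcounts: > 1 means someone (e.g. get_user_pages()) holds an extra reference. */
static int page_refcount[NR_PAGES] = { [3] = 2 };   /* page 3 is busy */

/* Return the index of the first busy page covering [start, end), or -1. */
static int layout_busy_page_range(unsigned long start, unsigned long end)
{
	unsigned long first = start / PAGE_SIZE;
	unsigned long last = (end + PAGE_SIZE - 1) / PAGE_SIZE;

	if (last > NR_PAGES)
		last = NR_PAGES;
	for (unsigned long i = first; i < last; i++)
		if (page_refcount[i] > 1)
			return (int)i;
	return -1;
}

int main(void)
{
	/* Only the sub-range we want to reclaim is checked. */
	printf("busy page in [0, 8KiB):     %d\n", layout_busy_page_range(0, 2 * PAGE_SIZE));
	printf("busy page in [8KiB, 64KiB): %d\n", layout_busy_page_range(2 * PAGE_SIZE, 16 * PAGE_SIZE));
	return 0;
}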
2020-09-04  fuse: update project homepage (André Almeida)
As stated in https://sourceforge.net/projects/fuse/, "the FUSE project has moved to https://github.com/libfuse/" in 22-Dec-2015. Update URLs to reflect this. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-30  Merge tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6 (Linus Torvalds)
Pull cifs fix from Steve French:
 "DFS fix for referral problem when using SMB1"

* tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix check of tcon dfs in smb1
2020-08-29  Merge tag 'fallthrough-fixes-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux (Linus Torvalds)
Pull fallthrough fixes from Gustavo A. R. Silva:
 "Fix some minor issues introduced by the recent treewide fallthrough conversions:
  - Fix indentation issue
  - Fix erroneous fallthrough annotation
  - Remove unnecessary fallthrough annotation
  - Fix code comment changed by fallthrough conversion"

* tag 'fallthrough-fixes-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  arm64/cpuinfo: Remove unnecessary fallthrough annotation
  media: dib0700: Fix identation issue in dib8096_set_param_override()
  afs: Remove erroneous fallthough annotation
  iio: dpot-dac: fix code comment in dpot_dac_read_raw()
2020-08-28  Merge tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block (Linus Torvalds)
Pull io_uring fixes from Jens Axboe:
 "A few fixes in here, all based on reports and test cases from folks using it. Most of it is stable material as well:
  - Hashed work cancelation fix (Pavel)
  - poll wakeup signalfd fix
  - memlock accounting fix
  - nonblocking poll retry fix
  - ensure we never return -ERESTARTSYS for reads
  - ensure offset == -1 is consistent with preadv2() as documented
  - IOPOLL -EAGAIN handling fixes
  - remove useless task_work bounce for block based -EAGAIN retry"

* tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
  io_uring: don't bounce block based -EAGAIN retry off task_work
  io_uring: fix IOPOLL -EAGAIN retries
  io_uring: clear req->result on IOPOLL re-issue
  io_uring: make offset == -1 consistent with preadv2/pwritev2
  io_uring: ensure read requests go through -ERESTART* transformation
  io_uring: don't use poll handler if file can't be nonblocking read/written
  io_uring: fix imbalanced sqo_mm accounting
  io_uring: revert consumed iov_iter bytes on error
  io-wq: fix hang after cancelling pending hashed work
  io_uring: don't recurse on tsk->sighand->siglock with signalfd
2020-08-28  Merge tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs (Linus Torvalds)
Pull writeback fixes from Jan Kara:
 "Fixes for writeback code occasionally skipping writeback of some inodes or livelocking sync(2)"

* tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: Drop I_DIRTY_TIME_EXPIRE
  writeback: Fix sync livelock due to b_dirty_time processing
  writeback: Avoid skipping inode writeback
  writeback: Protect inode->i_io_list with inode->i_lock
2020-08-28  Merge tag 'gfs2-v5.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (Linus Torvalds)
Pull gfs2 fix from Andreas Gruenbacher:
 "Fix a memory leak on filesystem withdraw. We didn't detect this bug because we have slab merging on by default (CONFIG_SLAB_MERGE_DEFAULT). Adding 'slub_nomerge' to the kernel command line exposed the problem"

* tag 'gfs2-v5.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: add some much needed cleanup for log flushes that fail
2020-08-28  Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client (Linus Torvalds)
Pull ceph fixes from Ilya Dryomov:
 "We have an inode number handling change, prompted by s390x which is a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases to avoid potential data integrity issues when CephFS is re-exported via NFS or CIFS, and a fix for the bulk of W=1 compilation warnings"

* tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client:
  ceph: don't allow setlease on cephfs
  ceph: fix inode number handling on arches with 32-bit ino_t
  libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
2020-08-28  cifs: fix check of tcon dfs in smb1 (Paulo Alcantara)
For SMB1, the DFS flag should be checked against tcon->Flags rather than tcon->share_flags. While at it, add an is_tcon_dfs() helper to check for DFS capability in a more generic way. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
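The shape of such a helper is a per-dialect check that consults the right field. The sketch below is purely illustrative: the flag constants and struct layout are invented here, only the idea of "SMB1 looks at one field, SMB2+ at another" comes from the message above.

#include <stdbool.h>
#include <stdio.h>

/* Invented constants and layout, for illustration only. */
#define SMB1_FLAG_DFS   0x0001
#define SMB2_SHARE_DFS  0x0008

struct tcon {
	bool is_smb1;
	unsigned int Flags;         /* consulted for SMB1 */
	unsigned int share_flags;   /* consulted for SMB2+ */
};

/* Generic "is this tree connect a DFS share?" check, per dialect. */
static bool is_tcon_dfs(const struct tcon *tcon)
{
	if (tcon->is_smb1)
		return tcon->Flags & SMB1_FLAG_DFS;
	return tcon->share_flags & SMB2_SHARE_DFS;
}

int main(void)
{
	struct tcon smb1 = { .is_smb1 = true,  .Flags = SMB1_FLAG_DFS };
	struct tcon smb2 = { .is_smb1 = false, .share_flags = SMB2_SHARE_DFS };
	printf("smb1 dfs: %d, smb2 dfs: %d\n", is_tcon_dfs(&smb1), is_tcon_dfs(&smb2));
	return 0;
}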
2020-08-27  io_uring: don't bounce block based -EAGAIN retry off task_work (Jens Axboe)
These events happen inline from submission, so there's no need to bounce them through the original task. Just set them up for retry and issue retry directly instead of going over task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-27  io_uring: fix IOPOLL -EAGAIN retries (Jens Axboe)
This normally isn't hit, as polling is mostly done on NVMe with deep queue depths. But if we do run into request starvation, we need to ensure that retries are properly serialized. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-27  afs: Remove erroneous fallthough annotation (Dan Carpenter)
The fall through annotation comes after a return statement so it's not reachable. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-26  io_uring: clear req->result on IOPOLL re-issue (Jens Axboe)
Make sure we clear req->result, which was set to -EAGAIN for retry purposes, when moving it to the reissue list. Otherwise we can end up retrying a request more than once, which leads to weird results in the io-wq handling (and other spots). Cc: stable@vger.kernel.org Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-26  io_uring: make offset == -1 consistent with preadv2/pwritev2 (Jens Axboe)
The man page for io_uring generally claims we're consistent with what preadv2 and pwritev2 accept, but it turns out there's a slight discrepancy in how offset == -1 is handled for pipes/streams. preadv doesn't allow it, but preadv2 does. This currently causes io_uring to return -EINVAL if that is attempted, but we should allow it as documented.

This change makes us consistent with preadv2/pwritev2 by just passing in a NULL ppos for streams if the offset is -1.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
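The userspace-visible rule being matched is easy to demonstrate with preadv2() itself: offset == -1 on a pipe reads from the current position like readv(), while an explicit offset fails with ESPIPE. A small standalone demonstration (assumes a kernel and glibc recent enough to provide preadv2()):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	char buf[16] = { 0 };
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) - 1 };

	if (pipe(pipefd) < 0)
		return 1;
	if (write(pipefd[1], "hello", 5) != 5)
		return 1;

	/* offset == -1 on a pipe: read from the current position, like readv(). */
	ssize_t n = preadv2(pipefd[0], &iov, 1, -1, 0);
	printf("preadv2(offset=-1) returned %zd: %s\n", n, buf);

	/* An explicit offset on a pipe fails with ESPIPE instead. */
	n = preadv2(pipefd[0], &iov, 1, 0, 0);
	printf("preadv2(offset=0) returned %zd\n", n);
	return 0;
}

The commit above makes an io_uring read/write with offset -1 on a stream behave like the first call rather than returning -EINVAL.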
2020-08-25  Merge tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6 (Linus Torvalds)
Pull nfs server fixes from Chuck Lever:
 - Eliminate an oops introduced in v5.8
 - Remove a duplicate #include added by nfsd-5.9

* tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  SUNRPC: remove duplicate include
  nfsd: fix oops on mixed NFSv4/NFSv3 client access
2020-08-25  Merge tag 'm68knommu-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu (Linus Torvalds)
Pull m68knommu fix from Greg Ungerer:
 "Only a single fix for the binfmt_flat loader (reverting a recent change)"

* tag 'm68knommu-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  binfmt_flat: revert "binfmt_flat: don't offset the data start"
2020-08-25  io_uring: ensure read requests go through -ERESTART* transformation (Jens Axboe)
We need to call kiocb_done() for any ret < 0 to ensure that we always get the proper -ERESTARTSYS (and friends) transformation done. At some point this should be tied into general error handling, so we can get rid of the various (mostly network) related commands that check and perform this substitution. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25  io_uring: don't use poll handler if file can't be nonblocking read/written (Jens Axboe)
There's no point in using the poll handler if we can't do a nonblocking IO attempt of the operation, since we'll need to go async anyway. In fact this is actively harmful, as reading from e.g. pipes won't return 0 to indicate EOF.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: Benedikt Ames <wisp3rwind@posteo.eu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25  io_uring: fix imbalanced sqo_mm accounting (Jens Axboe)
We do the initial accounting of locked_vm and pinned_vm before we have setup ctx->sqo_mm, which means we can end up having not accounted the memory at setup time, but still decrement it when we exit. This causes an imbalance in the accounting. Setup ctx->sqo_mm earlier in io_uring_create(), before we do the first accounting of mm->{locked,pinned}_vm. This also unifies the state grabbing for the ctx, and eliminates a failure case in io_sq_offload_start(). Fixes: f74441e6311a ("io_uring: account locked memory before potential error case") Reported-by: Robert M. Muncrief <rmuncrief@humanavance.com> Reported-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Robert M. Muncrief <rmuncrief@humanavance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-25  io_uring: revert consumed iov_iter bytes on error (Jens Axboe)
Some consumers of the iov_iter will return an error, but still have bytes consumed in the iterator. This is an issue for -EAGAIN, since we rely on a sane iov_iter state across retries.

Fix this by ensuring that we revert consumed bytes, if any, if the file operations have consumed any bytes from the iterator. This is similar to what generic_file_read_iter() does, and is always safe as we have the previous bytes count handy already.

Fixes: ff6165b2d7f6 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
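The pattern: remember how much of the iterator was consumed before issuing the call, and if the call fails, rewind by the difference so a later retry starts from the same place (the kernel has iov_iter_revert() for the rewind). A toy model with an invented cursor-style iterator:

#include <stdio.h>

/* Toy iterator: just a cursor over a buffer. */
struct iter {
	const char *buf;
	unsigned long len;
	unsigned long pos;      /* bytes consumed so far */
};

/* A consumer that eats some bytes and then fails, as some ->read_iter
 * implementations can: error returned, iterator already advanced. */
static int flaky_consume(struct iter *it)
{
	it->pos += 4;           /* consumed 4 bytes... */
	return -11;             /* ...but reports an -EAGAIN-style error anyway */
}

static int issue_with_retry(struct iter *it)
{
	unsigned long before = it->pos;
	int ret = flaky_consume(it);

	if (ret < 0 && it->pos > before) {
		/* Revert what the failed attempt consumed so a retry sees
		 * the same iterator state (iov_iter_revert() in the kernel). */
		it->pos = before;
	}
	return ret;
}

int main(void)
{
	struct iter it = { .buf = "hello world", .len = 11, .pos = 0 };
	int ret = issue_with_retry(&it);
	printf("ret=%d, pos after failed attempt=%lu (ready for retry)\n", ret, it.pos);
	return 0;
}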
2020-08-24  Merge tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux (Linus Torvalds)
Pull btrfs fixes from David Sterba:
 - fix swapfile activation on subvolumes with deleted snapshots
 - error value mixup when removing directory entries from tree log
 - fix lzo compression level reset after previous level setting
 - fix space cache memory leak after transaction abort
 - fix const function attribute
 - more error handling improvements

* tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: detect nocow for swap after snapshot delete
  btrfs: check the right error variable in btrfs_del_dir_entries_in_log
  btrfs: fix space cache memory leak after transaction abort
  btrfs: use the correct const function attribute for btrfs_get_num_csums
  btrfs: reset compression level for lzo on remount
  btrfs: handle errors from async submission
2020-08-24  ceph: don't allow setlease on cephfs (Jeff Layton)
Leases don't currently work correctly on kcephfs, as they are not broken when caps are revoked. They could eventually be implemented similarly to how we did them in libcephfs, but for now don't allow them. [ idryomov: no need for simple_nosetlease() in ceph_dir_fops and ceph_snapdir_fops ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-24  ceph: fix inode number handling on arches with 32-bit ino_t (Jeff Layton)
Tuan and Ulrich mentioned that they were hitting a problem on s390x, which has a 32-bit ino_t value, even though it's a 64-bit arch (for historical reasons).

I think the current handling of inode numbers in the ceph driver is wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's actually not a problem. 32-bit arches can deal with 64-bit inode numbers just fine when userland code is compiled with LFS support (the common case these days).

What we really want to do is just use 64-bit numbers everywhere, unless someone has mounted with the ino32 mount option. In that case, we want to ensure that we hash the inode number down to something that will fit in 32 bits before presenting the value to userland.

Add new helper functions that do this, and only do the conversion before presenting these values to userland in getattr and readdir. The inode table hashvalue is changed to just cast the inode number to unsigned long, as low-order bits are the most likely to vary anyway.

While it's not strictly required, we do want to put something in inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on the size of the ino_t type.

NOTE: This is a user-visible change on 32-bit arches:

1/ inode numbers will be seen to have changed between kernel versions. 32-bit arches will see large inode numbers now instead of the hashed ones they saw before.

2/ any really old software not built with LFS support may start failing stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we can do about these, but hopefully the intersection of people running such code on ceph will be very small.

The workaround for both problems is to mount with "-o ino32".

[ idryomov: changelog tweak ]

URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
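"Hash the inode number down to something that will fit in 32 bits" can be as simple as folding the halves together. The exact hash the ceph client uses is not given in the log above; the xor-fold below is just one straightforward way to illustrate the ino32 conversion:

#include <inttypes.h>
#include <stdio.h>

/*
 * Fold a 64-bit inode number into 32 bits for an "-o ino32" style mount.
 * Illustrative only; not necessarily the hash the ceph client uses.
 */
static uint32_t ino32_hash(uint64_t ino)
{
	return (uint32_t)(ino ^ (ino >> 32));
}

int main(void)
{
	uint64_t ino = 0x100000003f6ad5aULL;    /* an example of a large cephfs ino */
	printf("64-bit ino 0x%" PRIx64 " -> 32-bit 0x%" PRIx32 "\n",
	       ino, ino32_hash(ino));
	return 0;
}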
2020-08-24  gfs2: add some much needed cleanup for log flushes that fail (Bob Peterson)
When a log flush fails due to io errors, it signals the failure but does not clean up after itself very well. This is because buffers are added to the transaction tr_buf and tr_databuf queue, but the io error causes gfs2_log_flush to bypass the "after_commit" functions responsible for dequeueing the bd elements. If the bd elements are added to the ail list before the error, function ail_drain takes care of dequeueing them. But if they haven't gotten that far, the elements are forgotten and make the transactions unable to be freed. This patch introduces new function trans_drain which drains the bd elements from the transaction so they can be freed properly. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-08-24  binfmt_flat: revert "binfmt_flat: don't offset the data start" (Max Filippov)
binfmt_flat loader uses the gap between text and data to store data segment pointers for the libraries. Even in the absence of shared libraries it stores at least one pointer to the executable's own data segment. Text and data can go back to back in the flat binary image and without offsetting data segment last few instructions in the text segment may get corrupted by the data segment pointer. Fix it by reverting commit a2357223c50a ("binfmt_flat: don't offset the data start"). Cc: stable@vger.kernel.org Fixes: a2357223c50a ("binfmt_flat: don't offset the data start") Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
2020-08-23  treewide: Use fallthrough pseudo-keyword (Gustavo A. R. Silva)
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23  io-wq: fix hang after cancelling pending hashed work (Pavel Begunkov)
Don't forget to update wqe->hash_tail after cancelling a pending work item, if it was hashed. Cc: stable@vger.kernel.org # 5.7+ Reported-by: Dmitry Shulyak <yashulyak@gmail.com> Fixes: 86f3cd1b589a1 ("io-wq: handle hashed writes in chains") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-23  io_uring: don't recurse on tsk->sighand->siglock with signalfd (Jens Axboe)
If an application is doing reads on signalfd, and we arm the poll handler because there's no data available, then the wakeup can recurse on the task's sighand->siglock as the signal delivery from task_work_add() will use TWA_SIGNAL and that attempts to lock it again.

We can detect the signalfd case pretty easily by comparing the poll->head wait_queue_head_t with the target task's signalfd wait queue. Just use normal task wakeup for this case.

Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-22  Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (Linus Torvalds)
Pull epoll fixes from Al Viro:
 "Fix reference counting and clean up exit paths"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_epoll_ctl(): clean the failure exits up a bit
  epoll: Keep a reference on files added to the check list
2020-08-22  do_epoll_ctl(): clean the failure exits up a bit (Al Viro)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-22  epoll: Keep a reference on files added to the check list (Marc Zyngier)
When adding a new fd to an epoll, and this new fd is an epoll fd itself, we recursively scan the fds attached to it to detect cycles, and add non-epoll files to a "check list" that gets subsequently parsed.

However, this check list isn't completely safe when deletions can happen concurrently. To sidestep the issue, make sure that a struct file placed on the check list sees its f_count increased, ensuring that a concurrent deletion won't result in the file disappearing from under our feet.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
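The fix is the usual rule for putting an object on a list that outlives the lock you found it under: the list itself must own a reference, taken when adding and dropped when the entry is removed. A toy model of list-owned references (all helpers invented; the kernel uses get_file()/fput() on struct file's f_count):

#include <stdio.h>
#include <stdlib.h>

/* Toy "struct file": only the refcount matters here. */
struct file {
	int f_count;
	int id;
};

struct check_entry {
	struct file *file;
	struct check_entry *next;
};

static struct check_entry *check_list;

static void file_get(struct file *f) { f->f_count++; }

static void file_put(struct file *f)
{
	if (--f->f_count == 0) {
		printf("file %d freed\n", f->id);
		free(f);
	}
}

/* Adding to the check list takes a reference on behalf of the list. */
static void check_list_add(struct file *f)
{
	struct check_entry *e = malloc(sizeof(*e));
	file_get(f);
	e->file = f;
	e->next = check_list;
	check_list = e;
}

/* Walking the list drops the list's reference as each entry is removed. */
static void check_list_drain(void)
{
	while (check_list) {
		struct check_entry *e = check_list;
		check_list = e->next;
		printf("checking file %d (still alive: f_count=%d)\n",
		       e->file->id, e->file->f_count);
		file_put(e->file);
		free(e);
	}
}

int main(void)
{
	struct file *f = calloc(1, sizeof(*f));
	f->f_count = 1;
	f->id = 7;

	check_list_add(f);
	file_put(f);           /* concurrent close: drops the opener's reference */
	check_list_drain();    /* the list's reference kept the file alive until here */
	return 0;
}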
2020-08-21  Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block (Linus Torvalds)
Pull io_uring fixes from Jens Axboe:
 - Make sure the head link cancelation includes async work
 - Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as a separate function since you moved it into io_uring itself
 - io_import_iovec cleanups (Pavel, me)
 - Use system_unbound_wq for ring exit work, to avoid spawning tons of these if we have tons of rings exiting at the same time
 - Fix req->flags overflow flag manipulation (Pavel)

* tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block:
  io_uring: kill extra iovec=NULL in import_iovec()
  io_uring: comment on kfree(iovec) checks
  io_uring: fix racy req->flags modification
  io_uring: use system_unbound_wq for ring exit work
  io_uring: cleanup io_import_iovec() of pre-mapped request
  io_uring: get rid of kiocb_wait_page_queue_init()
  io_uring: find and cancel head link async work on files exit
2020-08-21  Merge branch 'akpm' (patches from Andrew) (Linus Torvalds)
Merge misc fixes from Andrew Morton:
 "11 patches.

  Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc, romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, page_alloc: fix core hung in free_pcppages_bulk()
  mm: include CMA pages in lowmem_reserve at boot
  squashfs: avoid bio_alloc() failure with 1Mbyte blocks
  uprobes: __replace_page() avoid BUG in munlock_vma_page()
  kernel/relay.c: fix memleak on destroy relay channel
  romfs: fix uninitialized memory leak in romfs_dev_read()
  mm/rodata_test.c: fix missing function declaration
  mm/vunmap: add cond_resched() in vunmap_pmd_range
  khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
  hugetlb_cgroup: convert comma to semicolon
  mailmap: add Andi Kleen
2020-08-21  Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (Linus Torvalds)
Pull ext4 updates from Ted Ts'o:
 "Improvements to ext4's block allocator performance for very large file systems, especially when the file system or files are highly fragmented. There is a new mount option, prefetch_block_bitmaps, which will pull in the block bitmaps and set up the in-memory buddy bitmaps when the file system is initially mounted.

  Beyond that, a lot of bug fixes and cleanups. In particular, a number of changes to make ext4 more robust in the face of write errors or file system corruptions"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
  ext4: limit the length of per-inode prealloc list
  ext4: reorganize if statement of ext4_mb_release_context()
  ext4: add mb_debug logging when there are lost chunks
  ext4: Fix comment typo "the the".
  jbd2: clean up checksum verification in do_one_pass()
  ext4: change to use fallthrough macro
  ext4: remove unused parameter of ext4_generic_delete_entry function
  mballoc: replace seq_printf with seq_puts
  ext4: optimize the implementation of ext4_mb_good_group()
  ext4: delete invalid comments near ext4_mb_check_limits()
  ext4: fix typos in ext4_mb_regular_allocator() comment
  ext4: fix checking of directory entry validity for inline directories
  fs: prevent BUG_ON in submit_bh_wbc()
  ext4: correctly restore system zone info when remount fails
  ext4: handle add_system_zone() failure in ext4_setup_system_zone()
  ext4: fold ext4_data_block_valid_rcu() into the caller
  ext4: check journal inode extents more carefully
  ext4: don't allow overlapping system zones
  ext4: handle error of ext4_setup_system_zone() on remount
  ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
  ...