summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-04-26fs/affs: remove node generation checkFabian Frederick
node generation has to be stored on disk. AFAICS we won't be able to manage it on AFFS. This patch removes relevant check in affs_nfs_get_inode() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26fs/affs: import amigaffs.hFabian Frederick
Have that file in global include/linux is not needed. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26fs/affs: bugfix: make symbolic links work againFabian Frederick
AFFS symbolic links were broken since kernel 2.6.29 Problem was bisected to the following commit ebd09abbd969 ("vfs: ensure page symlinks are NUL-terminated") commit 035146851cfa ("vfs: introduce helper function to safely NUL-terminate symlinks") AFFS wasn't setting inode size when reading symbolic link from disk or creating a new one. Result was zero allocation in pagecache. ln -s file symlink ls -lrt file symlink -> This patch adds inode isize information on inode get and symbolic link addition. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATHDavid Howells
With the new statx() syscall, the following both allow the attributes of the file attached to a file descriptor to be retrieved: statx(dfd, NULL, 0, ...); and: statx(dfd, "", AT_EMPTY_PATH, ...); Change the code to reject the first option, though this means copying the path and engaging pathwalk for the fstat() equivalent. dfd can be a non-directory provided path is "". [ The timing of this isn't wonderful, but applying this now before we have statx() in any released kernel, before anybody starts using the NULL special case. - Linus ] Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available") Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Sandeen <sandeen@sandeen.net> cc: fstests@vger.kernel.org cc: linux-api@vger.kernel.org cc: linux-man@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26orangefs: handle zero size write in debugfsDan Carpenter
If we write zero bytes to this debugfs file, then it will cause an underflow when we do copy_from_user(buf, ubuf, count - 1). Debugfs can normally only be written to by root so the impact of this is low. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: do not wait for timeout if umountingMartin Brandenburg
When the computer is turned off, all the processes are killed and then all the filesystems are umounted. OrangeFS should not wait for the userspace daemon to come back in that case. This only works for plain umount(2). To actually take advantage of this interactively, `umount -f' is needed; otherwise umount will issue a statfs first, which will wait for the userspace daemon to come back. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: return from orangefs_devreq_read quickly if possibleMartin Brandenburg
It is not necessary to take the lock and search through the request list if the list is empty. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: ensure the userspace component is unmounted if mount failsMartin Brandenburg
If the mount is aborted after userspace has been asked to mount, userspace must be told to unmount. Ordinarily orangefs_kill_sb does the unmount. However it cannot be called if the superblock has not been set up. This is a very narrow window. The NULL fs_id is not unmounted. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: do not check possibly stale size on truncateMartin Brandenburg
Let the server figure this out because our size might be out of date or not present. The bug was that xfs_io -f -t -c "pread -v 0 100" /mnt/foo echo "Test" > /mnt/foo xfs_io -f -t -c "pread -v 0 100" /mnt/foo fails because the second truncate did not happen if nothing had requested the size after the write in echo. Thus i_size was zero (not present) and the orangefs_setattr though i_size was zero and there was nothing to do. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: implement statxMartin Brandenburg
Fortunately OrangeFS has had a getattr request mask for a long time. The server basically has two difficulty levels for attributes. Fetching any attribute except size requires communicating with the metadata server for that handle. Since all the attributes are right there, it makes sense to return them all. Fetching the size requires communicating with every I/O server (that the file is distributed across). Therefore if asked for anything except size, get everything except size, and if asked for size, get everything. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: remove ORANGEFS_READDIR macrosMartin Brandenburg
They are clones of the ORANGEFS_ITERATE macros in use elsewhere. Delete ORANGEFS_ITERATE_NEXT which is a hack previously used by readdir. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: support very large directoriesMartin Brandenburg
This works by maintaining a linked list of pages which the directory has been read into rather than one giant fixed-size buffer. This replaces code which limits the total directory size to the total amount that could be returned in one server request. Since filenames are usually considerably shorter than the maximum, the old code could usually handle several server requests before running out of space. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: support llseek on directoriesMartin Brandenburg
This and the previous commit fix xfstests generic/257. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: rewrite readdir to fix several bugsMartin Brandenburg
In the past, readdir assumed that the user buffer will be large enough that all entries from the server will fit. If this was not true, entries would be skipped. Since it works now, request 512 entries rather than 96 per server operation. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: do not set getattr_time on orangefs_lookupMartin Brandenburg
Since orangefs_lookup calls orangefs_iget which calls orangefs_inode_getattr, getattr_time will get set. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: clean up oversize xattr validationMartin Brandenburg
Also don't check flags as this has been validated by the VFS already. Fix an off-by-one error in the max size checking. Stop logging just because userspace wants to write attributes which do not fit. This and the previous commit fix xfstests generic/020. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: fix bounds check for listxattrMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26orangefs: remove unused get_fsid_from_inoMartin Brandenburg
Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26NFSv4: Don't special case "launder"Trond Myklebust
If the client receives a fatal server error from nfs_pageio_add_request(), then we should always truncate the page on which the error occurred. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26NFS: Add a few more fatal I/O errors to nfs_error_is_fatal()Trond Myklebust
EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS I/O is concerned. They need to be reported back to the application. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', ↵Al Viro
'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess
2017-04-26Btrfs: fix reported number of inode blocksFilipe Manana
Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26Btrfs: send, fix file hole not being preserved due to inline extentFilipe Manana
Normally we don't have inline extents followed by regular extents, but there's currently at least one harmless case where this happens. For example, when the page size is 4Kb and compression is enabled: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar In this case we get a compressed inline extent, representing 4Kb of data, followed by a hole extent and then a regular data extent. The inline extent was not expanded/converted to a regular extent exactly because it represents 4Kb of data. This does not cause any apparent problem (such as the issue solved by commit e1699d2d7bf6 ("btrfs: add missing memset while reading compressed inline extents")) except trigger an unexpected case in the incremental send code path that makes us issue an operation to write a hole when it's not needed, resulting in more writes at the receiver and wasting space at the receiver. So teach the incremental send code to deal with this particular case. The issue can be currently triggered by running fstests btrfs/137 with compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26Btrfs: fix extent map leak during fallocate error pathFilipe Manana
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an extent map structure. The failure can happen either due to an -ENOMEM condition or, when quotas are enabled, due to -EDQUOT for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
2017-04-26Btrfs: fix incorrect space accounting after failure to insert inline extentFilipe Manana
When using compression, if we fail to insert an inline extent we incorrectly end up attempting to free the reserved data space twice, once through extent_clear_unlock_delalloc(), because we pass it the flag EXTENT_DO_ACCOUNTING, and once through a direct call to btrfs_free_reserved_data_space_noquota(). This results in a trace like the following: [ 834.576240] ------------[ cut here ]------------ [ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] [ 834.596103] Call Trace: [ 834.596103] dump_stack+0x67/0x90 [ 834.596103] __warn+0xc2/0xdd [ 834.596103] warn_slowpath_null+0x1d/0x1f [ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs] [ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs] [ 834.596103] async_cow_start+0x32/0x4d [btrfs] [ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs] [ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs] [ 834.596103] process_one_work+0x273/0x4e4 [ 834.596103] worker_thread+0x1eb/0x2ca [ 834.596103] ? rescuer_thread+0x2b6/0x2b6 [ 834.596103] kthread+0x100/0x108 [ 834.596103] ? __list_del_entry+0x22/0x22 [ 834.596103] ret_from_fork+0x2e/0x40 [ 834.611656] ---[ end trace 719902fe6bdef08f ]--- So fix this by not calling directly btrfs_free_reserved_data_space_noquota() if an error happened. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26Btrfs: fix invalid attempt to free reserved space on failure to cow rangeFilipe Manana
When attempting to COW a file range (we are starting writeback and doing COW), if we manage to reserve an extent for the range we will write into but fail after reserving it and before creating the respective ordered extent, we end up in an error path where we attempt to decrement the data space's bytes_may_use counter after we already did it while reserving the extent, leading to a warning/trace like the following: [ 847.621524] ------------[ cut here ]------------ [ 847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg [ 847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 847.648601] Call Trace: [ 847.648601] dump_stack+0x67/0x90 [ 847.648601] __warn+0xc2/0xdd [ 847.648601] warn_slowpath_null+0x1d/0x1f [ 847.648601] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.648601] btrfs_clear_bit_hook+0x140/0x258 [btrfs] [ 847.648601] clear_state_bit+0x87/0x128 [btrfs] [ 847.648601] __clear_extent_bit+0x222/0x2b7 [btrfs] [ 847.648601] clear_extent_bit+0x17/0x19 [btrfs] [ 847.648601] extent_clear_unlock_delalloc+0x3b/0x6b [btrfs] [ 847.648601] cow_file_range.isra.39+0x387/0x39a [btrfs] [ 847.648601] run_delalloc_nocow+0x4d7/0x70e [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] run_delalloc_range+0xa7/0x2b5 [btrfs] [ 847.648601] writepage_delalloc.isra.31+0xb9/0x15c [btrfs] [ 847.648601] __extent_writepage+0x249/0x2e8 [btrfs] [ 847.648601] extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] ? mark_lock+0x24/0x201 [ 847.648601] extent_writepages+0x4b/0x5c [btrfs] [ 847.648601] ? btrfs_writepage_start_hook+0xed/0xed [btrfs] [ 847.648601] btrfs_writepages+0x28/0x2a [btrfs] [ 847.648601] do_writepages+0x23/0x2c [ 847.648601] __filemap_fdatawrite_range+0x5a/0x61 [ 847.648601] filemap_fdatawrite_range+0x13/0x15 [ 847.648601] btrfs_fdatawrite_range+0x20/0x46 [btrfs] [ 847.648601] start_ordered_ops+0x19/0x23 [btrfs] [ 847.648601] btrfs_sync_file+0x136/0x42c [btrfs] [ 847.648601] vfs_fsync_range+0x8c/0x9e [ 847.648601] vfs_fsync+0x1c/0x1e [ 847.648601] do_fsync+0x31/0x4a [ 847.648601] SyS_fsync+0x10/0x14 [ 847.648601] entry_SYSCALL_64_fastpath+0x18/0xad [ 847.648601] RIP: 0033:0x7f5b05200800 [ 847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800 [ 847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003 [ 847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f [ 847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046 [ 847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740 [ 847.648601] ? trace_hardirqs_off_caller+0x3f/0xaa [ 847.685787] ---[ end trace 2a4a3e15382508e8 ]--- So fix this by not attempting to decrement the data space info's bytes_may_use counter if we already reserved the extent and an error happened before creating the ordered extent. We are already correctly freeing the reserved extent if an error happens, so there's no additional measure needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26btrfs: Handle delalloc error correctly to avoid ordered extent hangQu Wenruo
[BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>| |<---------- cleanup range --------->| || \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>|<-------- ----------->|<------ old error handler --------->| || || || \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum errorQu Wenruo
[BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- cleanup range --------------->| |<----------- ----------->| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- ----------->|<---------- cleanup range --------->| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26ovl: do not set overlay.opaque on non-dir createAmir Goldstein
The optimization for opaque dir create was wrongly being applied also to non-dir create. Fixes: 97c684cc9110 ("ovl: create directories inside merged parent opaque") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.10
2017-04-25lockd: remove redundant check on blockColin Ian King
A null check followed by a return is being performed already, so block is always non-null at the second check on block, hence we can remove this redundant null-check (Detected by PVS-Studio). Also re-work comment to clean up a check-patch warning. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25NFS: don't try to cross a mountpount when there isn't one there.NeilBrown
consider the sequence of commands: mkdir -p /import/nfs /import/bind /import/etc mount --bind / /import/bind mount --make-private /import/bind mount --bind /import/etc /import/bind/etc exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/ mount -o vers=4 localhost:/ /import/nfs ls -l /import/nfs/etc You would not expect this to report a stale file handle. Yet it does. The manipulations under /import/bind cause the dentry for /etc to get the DCACHE_MOUNTED flag set, even though nothing is mounted on /etc. This causes nfsd to call nfsd_cross_mnt() even though there is no mountpoint. So an upcall to mountd for "/etc" is performed. The 'crossmnt' flag on the export of / causes mountd to report that /etc is exported as it is a descendant of /. It assumes the kernel wouldn't ask about something that wasn't a mountpoint. The filehandle returned identifies the filesystem and the inode number of /etc. When this filehandle is presented to rpc.mountd, via "nfsd.fh", the inode cannot be found associated with any name in /etc/exports, or with any mountpoint listed by getmntent(). So rpc.mountd says the filehandle doesn't exist. Hence ESTALE. This is fixed by teaching nfsd not to trust DCACHE_MOUNTED too much. It is just a hint, not a guarantee. Change nfsd_mountpoint() to return '1' for a certain mountpoint, '2' for a possible mountpoint, and 0 otherwise. Then change nfsd_crossmnt() to check if follow_down() actually found a mountpount and, if not, to avoid performing a lookup if the location is not known to certainly require an export-point. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd4: remove pointless strdup_if_nonnullNeilBrown
kstrdup() already checks for NULL. (Brought to our attention by Jason Yann noticing (from sparse output) that it should have been declared static.) Signed-off-by: NeilBrown <neilb@suse.com> Reported-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd: check for oversized NFSv2/v3 argumentsJ. Bruce Fields
A client can append random data to the end of an NFSv2 or NFSv3 RPC call without our complaining; we'll just stop parsing at the end of the expected data and ignore the rest. Encoded arguments and replies are stored together in an array of pages, and if a call is too large it could leave inadequate space for the reply. This is normally OK because NFS RPC's typically have either short arguments and long replies (like READ) or long arguments and short replies (like WRITE). But a client that sends an incorrectly long reply can violate those assumptions. This was observed to cause crashes. So, insist that the argument not be any longer than we expect. Also, several operations increment rq_next_page in the decode routine before checking the argument size, which can leave rq_next_page pointing well past the end of the page array, causing trouble later in svc_free_pages. As followup we may also want to rewrite the encoding routines to check more carefully that they aren't running off the end of the page array. Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25f2fs: enable small discard by defaultChao Yu
This patch start to enable 4K granularity small discard by default when realtime discard is on, so, in seriously fragmented space, small size discard can be issued in time to avoid useless storage space occupying of invalid filesystem's data, then performance of flash storage can be recovered. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25f2fs: delay awaking discard threadChao Yu
It's better to delay awaking discard thread while queuing discard commands in checkpoint, it will help to give more chances for merging big and small discard. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25f2fs: seperate read nat page from nat_tree_lockYunlei He
This patch seperate nat page read io from nat_tree_lock. -lock_page -get_node_info() -current_nat_addr ...... -> write_checkpoint -get_meta_page Because we lock node page, we can make sure no other threads modify this nid concurrently. So we just obtain current_nat_addr under nat_tree_lock, node info is always same in both nat pack. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25f2fs: fix multiple f2fs_add_link() having same name for inline dentrySheng Yong
Commit 88c5c13a5027 (f2fs: fix multiple f2fs_add_link() calls having same name) does not cover the scenario where inline dentry is enabled. In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will lookup dentries one more time. This patch fixes it by moving the assigment of current task to a upper level to cover both normal and inline dentry. Cc: <stable@vger.kernel.org> Fixes: 88c5c13a5027 (f2fs: fix multiple f2fs_add_link() calls having same name) Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25nfsd: stricter decoding of write-like NFSv2/v3 opsJ. Bruce Fields
The NFSv2/v3 code does not systematically check whether we decode past the end of the buffer. This generally appears to be harmless, but there are a few places where we do arithmetic on the pointers involved and don't account for the possibility that a length could be negative. Add checks to catch these. Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Reviewed-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd4: minor NFSv2/v3 write decoding cleanupJ. Bruce Fields
Use a couple shortcuts that will simplify a following bugfix. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd: check for oversized NFSv2/v3 argumentsJ. Bruce Fields
A client can append random data to the end of an NFSv2 or NFSv3 RPC call without our complaining; we'll just stop parsing at the end of the expected data and ignore the rest. Encoded arguments and replies are stored together in an array of pages, and if a call is too large it could leave inadequate space for the reply. This is normally OK because NFS RPC's typically have either short arguments and long replies (like READ) or long arguments and short replies (like WRITE). But a client that sends an incorrectly long reply can violate those assumptions. This was observed to cause crashes. Also, several operations increment rq_next_page in the decode routine before checking the argument size, which can leave rq_next_page pointing well past the end of the page array, causing trouble later in svc_free_pages. So, following a suggestion from Neil Brown, add a central check to enforce our expectation that no NFSv2/v3 call has both a large call and a large reply. As followup we may also want to rewrite the encoding routines to check more carefully that they aren't running off the end of the page array. We may also consider rejecting calls that have any extra garbage appended. That would be safer, and within our rights by spec, but given the age of our server and the NFS protocol, and the fact that we've never enforced this before, we may need to balance that against the possibility of breaking some oddball client. Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25NFSv3: nfs3_nlm_alloc_call should be declared staticTrond Myklebust
Fix compiler warnings. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25block: remove block_device_operations ->direct_access()Dan Williams
Now that all the producers and consumers of dax interfaces have been converted to using dax_operations on a dax_device, remove the block device direct_access enabling. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25block, dax: convert bdev_dax_supported() to dax_direct_access()Dan Williams
Kill of the final user of bdev_direct_access() and struct blk_dax_ctl. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25filesystem-dax: convert to dax_direct_access()Dan Williams
Now that a dax_device is plumbed through all dax-capable drivers we can switch from block_device_operations to dax_operations for invoking ->direct_access. This also lets us kill off some usages of struct blk_dax_ctl on the way to its eventual removal. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25Revert "block: use DAX for partition table reads"Dan Williams
commit d1a5f2b4d8a1 ("block: use DAX for partition table reads") was part of a stalled effort to allow dax mappings of block devices. Since then the device-dax mechanism has filled the role of dax-mapping static device ranges. Now that we are moving ->direct_access() from a block_device operation to a dax_inode operation we would need block devices to map and carry their own dax_inode reference. Unless / until we decide to revive dax mapping of raw block devices through the dax_inode scheme, there is no need to carry read_dax_sector(). Its removal in turn allows for the removal of bdev_direct_access() and should have been included in commit 223757016837 ("block_dev: remove DAX leftovers"). Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25ext2, ext4, xfs: retrieve dax_device for iomap operationsDan Williams
In preparation for converting fs/dax.c to use dax_direct_access() instead of bdev_direct_access(), add the plumbing to retrieve the dax_device associated with a given block_device. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25NFS: Don't write back further requests if there is a pending write errorTrond Myklebust
If the server has already returned a fatal write error that the user has not yet received on this file, then don't write back the other pages. Instead, act as if they have been sent, and have returned with the same error. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25pNFS: Fix use after free issues in pnfs_do_read()Trond Myklebust
The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr has been consumed, and so we should not be testing hdr->task.tk_status. If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and free hdr. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25ceph: fix recursion between ceph_set_acl() and __ceph_setattr()Yan, Zheng
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs to modify inode's i_mode. __ceph_setattr() updates inode's i_mode, then calls posix_acl_chmod(). The problem is that __ceph_setattr() calls posix_acl_chmod() before sending the setattr request. The get_acl() call in posix_acl_chmod() can trigger a getxattr request. The reply of the getxattr request can restore inode's i_mode to its old value. The set_acl() call in posix_acl_chmod() sees old value of inode's i_mode, so it calls __ceph_setattr() again. Cc: stable@vger.kernel.org # needs backporting for < 4.9 Link: http://tracker.ceph.com/issues/19688 Reported-by: Jerry Lee <leisurelysw24@gmail.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Tested-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>