summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-10-26Merge tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block and io_uring fixes from Jens Axboe: "A bit bigger than usual at this point in time, mostly due to some good bug hunting work by Pavel that resulted in three io_uring fixes from him and two from me. Anyway, this pull request contains: - Revert of the submit-and-wait optimization for io_uring, it can't always be done safely. It depends on commands always making progress on their own, which isn't necessarily the case outside of strict file IO. (me) - Series of two patches from me and three from Pavel, fixing issues with shared data and sequencing for io_uring. - Lastly, two timeout sequence fixes for io_uring (zhangyi) - Two nbd patches fixing races (Josef) - libahci regulator_get_optional() fix (Mark)" * tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block: nbd: verify socket is supported during setup ata: libahci_platform: Fix regulator_get_optional() misuse nbd: handle racing with error'ed out commands nbd: protect cmd->status with cmd->lock io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: used cached copies of sq->dropped and cq->overflow io_uring: Fix race for sqes with userspace io_uring: Fix broken links with offloading io_uring: Fix corrupted user_data io_uring: correct timeout req sequence when inserting a new entry io_uring : correct timeout req sequence when waiting timeout io_uring: revert "io_uring: optimize submit_and_wait API"
2019-10-26Merge tag 'dax-fix-5.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fix from Dan Williams: "Fix a performance regression that followed from a fix to the conversion of the fsdax implementation to the xarray. v5.3 users report that they stop seeing huge page mappings on an application + filesystem layout that was seeing huge pages previously on v5.2" * tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: fs/dax: Fix pmd vs pte conflict detection
2019-10-25fcntl: fix typo in RWH_WRITE_LIFE_NOT_SET r/w hint nameEugene Syromiatnikov
According to commit message in the original commit c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints"), as well as userspace library[1] and man page update[2], R/W hint constants are intended to have RWH_* prefix. However, RWF_WRITE_LIFE_NOT_SET retained "RWF_*" prefix used in the early versions of the proposed patch set[3]. Rename it and provide the old name as a synonym for the new one for backward compatibility. [1] https://github.com/axboe/fio/commit/bd553af6c849 [2] https://github.com/mkerrisk/man-pages/commit/580082a186fd [3] https://www.mail-archive.com/linux-block@vger.kernel.org/msg09638.html Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25make __d_alloc() staticAl Viro
no users outside of fs/dcache.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-25Btrfs: fix race leading to metadata space leak after task received signalFilipe Manana
When a task that is allocating metadata needs to wait for the async reclaim job to process its ticket and gets a signal (because it was killed for example) before doing the wait, the task ends up erroring out but with space reserved for its ticket, which never gets released, resulting in a metadata space leak (more specifically a leak in the bytes_may_use counter of the metadata space_info object). Here's the sequence of steps leading to the space leak: 1) A task tries to create a file for example, so it ends up trying to start a transaction at btrfs_create(); 2) The filesystem is currently in a state where there is not enough metadata free space to satisfy the transaction's needs. So at space-info.c:__reserve_metadata_bytes() we create a ticket and add it to the list of tickets of the space info object. Also, because the metadata async reclaim job is not running, we queue a job ro run metadata reclaim; 3) In the meanwhile the task receives a signal (like SIGTERM from a kill command for example); 4) After queing the async reclaim job, at __reserve_metadata_bytes(), we unlock the metadata space info and call handle_reserve_ticket(); 5) That last function calls wait_reserve_ticket(), which acquires the lock from the metadata space info. Then in the first iteration of its while loop, it calls prepare_to_wait_event(), which returns -ERESTARTSYS because the task has a pending signal. As a result, we set the error field of the ticket to -EINTR and exit the while loop without deleting the ticket from the list of tickets (in the space info object). After exiting the loop we unlock the space info; 6) The async reclaim job is able to release enough metadata, acquires the metadata space info's lock and then reserves space for the ticket, since the ticket is still in the list of (non-priority) tickets. The space reservation happens at btrfs_try_granting_tickets(), called from maybe_fail_all_tickets(). This increments the bytes_may_use counter from the metadata space info object, sets the ticket's bytes field to zero (meaning success, that space was reserved) and removes it from the list of tickets; 7) wait_reserve_ticket() returns, with the error field of the ticket set to -EINTR. Then handle_reserve_ticket() just propagates that error to the caller. Because an error was returned, the caller does not release the reserved space, since the expectation is that any error means no space was reserved. Fix this by removing the ticket from the list, while holding the space info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns an error. Also add some comments and an assertion to guarantee we never end up with a ticket that has an error set and a bytes counter field set to zero, to more easily detect regressions in the future. This issue could be triggered sporadically by some test cases from fstests such as generic/269 for example, which tries to fill a filesystem and then kills fsstress processes running in the background. When this issue happens, we get a warning in syslog/dmesg when unmounting the filesystem, like the following: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs] (...) CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs] (...) RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286 RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8 RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001 R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508 R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100 FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x1ad/0x390 [btrfs] generic_shutdown_super+0x6c/0x110 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0xa0 [btrfs] deactivate_locked_super+0x3a/0x70 cleanup_mnt+0xb4/0x160 task_work_run+0x7e/0xc0 exit_to_usermode_loop+0xfa/0x100 do_syscall_64+0x1cb/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f4274d2cb37 (...) RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240 RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015 R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64 R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0 softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bcf4b235461b26f6 ]--- BTRFS info (device sdb): space_info 4 has 19116032 free, is full BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536 BTRFS info (device sdb): global_block_rsv: size 0 reserved 0 BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0 Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25btrfs: tree-checker: Fix wrong check on max devidQu Wenruo
[BUG] The following script will cause false alert on devid check. #!/bin/bash dev1=/dev/test/test dev2=/dev/test/scratch1 mnt=/mnt/btrfs umount $dev1 &> /dev/null umount $dev2 &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f $dev1 mount $dev1 $mnt _fail() { echo "!!! FAILED !!!" exit 1 } for ((i = 0; i < 4096; i++)); do btrfs dev add -f $dev2 $mnt || _fail btrfs dev del $dev1 $mnt || _fail dev_tmp=$dev1 dev1=$dev2 dev2=$dev_tmp done [CAUSE] Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as upper limit for devid. But we can have devid holes just like above script. So the check for devid is incorrect and could cause false alert. [FIX] Just remove the whole devid check. We don't have any hard requirement for devid assignment. Furthermore, even devid could get corrupted by a bitflip, we still have dev extents verification at mount time, so corrupted data won't sneak in. This fixes fstests btrfs/194. Reported-by: Anand Jain <anand.jain@oracle.com> Fixes: ab4ba2e13346 ("btrfs: tree-checker: Verify dev item") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25btrfs: Consider system chunk array size for new SYSTEM chunksQu Wenruo
For SYSTEM chunks, despite the regular chunk item size limit, there is another limit due to system chunk array size. The extra limit was removed in a refactoring, so add it back. Fixes: e3ecdb3fdecf ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk") CC: stable@vger.kernel.org # 5.3+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREADJens Axboe
We currently assume that submissions from the sqthread are successful, and if IO polling is enabled, we use that value for knowing how many completions to look for. But if we overflowed the CQ ring or some requests simply got errored and already completed, they won't be available for polling. For the case of IO polling and SQTHREAD usage, look at the pending poll list. If it ever hits empty then we know that we don't have anymore pollable requests inflight. For that case, simply reset the inflight count to zero. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25io_uring: used cached copies of sq->dropped and cq->overflowJens Axboe
We currently use the ring values directly, but that can lead to issues if the application is malicious and changes these values on our behalf. Created in-kernel cached versions of them, and just overwrite the user side when we update them. This is similar to how we treat the sq/cq ring tail/head updates. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25f2fs: cache global IPU bioChao Yu
In commit 8648de2c581e ("f2fs: add bio cache for IPU"), we added f2fs_submit_ipu_bio() in __write_data_page() as below: __write_data_page() if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode)) { f2fs_submit_ipu_bio(sbi, bio, page); .... } in order to avoid below deadlock: Thread A Thread B - __write_data_page (inode x, page y) - f2fs_do_write_data_page - set_page_writeback ---- set writeback flag in page y - f2fs_inplace_write_data - f2fs_balance_fs - lock gc_mutex - lock gc_mutex - f2fs_gc - do_garbage_collect - gc_data_segment - move_data_page - f2fs_wait_on_page_writeback - wait_on_page_writeback --- wait writeback of page y However, the bio submission breaks the merge of IPU IOs. So in this patch let's add a global bio cache for merged IPU pages, then f2fs_wait_on_page_writeback() is able to submit bio if a writebacked page is cached in global bio cache. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-10-25io_uring: Fix race for sqes with userspacePavel Begunkov
io_ring_submit() finalises with 1. io_commit_sqring(), which releases sqes to the userspace 2. Then calls to io_queue_link_head(), accessing released head's sqe Reorder them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25io_uring: Fix broken links with offloadingPavel Begunkov
io_sq_thread() processes sqes by 8 without considering links. As a result, links will be randomely subdivided. The easiest way to fix it is to call io_get_sqring() inside io_submit_sqes() as do io_ring_submit(). Downsides: 1. This removes optimisation of not grabbing mm_struct for fixed files 2. It submitting all sqes in one go, without finer-grained sheduling with cq processing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25io_uring: Fix corrupted user_dataPavel Begunkov
There is a bug, where failed linked requests are returned not with specified @user_data, but with garbage from a kernel stack. The reason is that io_fail_links() uses req->user_data, which is uninitialised when called from io_queue_sqe() on fail path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25autofs: fix a leak in autofs_expire_indirect()Al Viro
if the second call of should_expire() in there ends up grabbing and returning a new reference to dentry, we need to drop it before continuing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-24cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occursDave Wysochanski
There's a deadlock that is possible and can easily be seen with a test where multiple readers open/read/close of the same file and a disruption occurs causing reconnect. The deadlock is due a reader thread inside cifs_strict_readv calling down_read and obtaining lock_sem, and then after reconnect inside cifs_reopen_file calling down_read a second time. If in between the two down_read calls, a down_write comes from another process, deadlock occurs. CPU0 CPU1 ---- ---- cifs_strict_readv() down_read(&cifsi->lock_sem); _cifsFileInfo_put OR cifs_new_fileinfo down_write(&cifsi->lock_sem); cifs_reopen_file() down_read(&cifsi->lock_sem); Fix the above by changing all down_write(lock_sem) calls to down_write_trylock(lock_sem)/msleep() loop, which in turn makes the second down_read call benign since it will never block behind the writer while holding lock_sem. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed--by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-10-24CIFS: Fix use after free of file info structuresPavel Shilovsky
Currently the code assumes that if a file info entry belongs to lists of open file handles of an inode and a tcon then it has non-zero reference. The recent changes broke that assumption when putting the last reference of the file info. There may be a situation when a file is being deleted but nothing prevents another thread to reference it again and start using it. This happens because we do not hold the inode list lock while checking the number of references of the file info structure. Fix this by doing the proper locking when doing the check. Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo") Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-24CIFS: Fix retry mid list corruption on reconnectsPavel Shilovsky
When the client hits reconnect it iterates over the mid pending queue marking entries for retry and moving them to a temporary list to issue callbacks later without holding GlobalMid_Lock. In the same time there is no guarantee that mids can't be removed from the temporary list or even freed completely by another thread. It may cause a temporary list corruption: [ 430.454897] list_del corruption. prev->next should be ffff98d3a8f316c0, but was 2e885cb266355469 [ 430.464668] ------------[ cut here ]------------ [ 430.466569] kernel BUG at lib/list_debug.c:51! [ 430.468476] invalid opcode: 0000 [#1] SMP PTI [ 430.470286] CPU: 0 PID: 13267 Comm: cifsd Kdump: loaded Not tainted 5.4.0-rc3+ #19 [ 430.473472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 430.475872] RIP: 0010:__list_del_entry_valid.cold+0x31/0x55 ... [ 430.510426] Call Trace: [ 430.511500] cifs_reconnect+0x25e/0x610 [cifs] [ 430.513350] cifs_readv_from_socket+0x220/0x250 [cifs] [ 430.515464] cifs_read_from_socket+0x4a/0x70 [cifs] [ 430.517452] ? try_to_wake_up+0x212/0x650 [ 430.519122] ? cifs_small_buf_get+0x16/0x30 [cifs] [ 430.521086] ? allocate_buffers+0x66/0x120 [cifs] [ 430.523019] cifs_demultiplex_thread+0xdc/0xc30 [cifs] [ 430.525116] kthread+0xfb/0x130 [ 430.526421] ? cifs_handle_standard+0x190/0x190 [cifs] [ 430.528514] ? kthread_park+0x90/0x90 [ 430.530019] ret_from_fork+0x35/0x40 Fix this by obtaining extra references for mids being retried and marking them as MID_DELETED which indicates that such a mid has been dequeued from the pending list. Also move mid cleanup logic from DeleteMidQEntry to _cifs_mid_q_entry_release which is called when the last reference to a particular mid is put. This allows to avoid any use-after-free of response buffers. The patch needs to be backported to stable kernels. A stable tag is not mentioned below because the patch doesn't apply cleanly to any actively maintained stable kernel. Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-24Merge tag 'gfs2-v5.4-rc4.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: "Fix a memory leak introduced in -rc1" * tag 'gfs2-v5.4-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix memory leak when gfs2meta's fs_context is freed
2019-10-24xfs: Sanity check flags of Q_XQUOTARM callJan Kara
Flags passed to Q_XQUOTARM were not sanity checked for invalid values. Fix that. Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl") Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-24gfs2: Fix memory leak when gfs2meta's fs_context is freedAndrew Price
gfs2 and gfs2meta share an ->init_fs_context function which allocates an args structure stored in fc->fs_private. gfs2 registers a ->free function to free this memory when the fs_context is cleaned up, but there was not one registered for gfs2meta, causing a leak. Register a ->free function for gfs2meta. The existing gfs2_fc_free function does what we need. Reported-by: syzbot+c2fdfd2b783754878fb6@syzkaller.appspotmail.com Fixes: 1f52aa08d12f ("gfs2: Convert gfs2 to fs_context") Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-24ext2: return error when fail to allocating memory in ioctlChengguang Xu
Currently, we do not check memory allocation result for ei->i_block_alloc_info in ioctl, this patch checks it and returns error in failure case. Link: https://lore.kernel.org/r/20191023135643.28837-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-23io_uring: correct timeout req sequence when inserting a new entryzhangyi (F)
The sequence number of the timeout req (req->sequence) indicate the expected completion request. Because of each timeout req consume a sequence number, so the sequence of each timeout req on the timeout list shouldn't be the same. But now, we may get the same number (also incorrect) if we insert a new entry before the last one, such as submit such two timeout reqs on a new ring instance below. req->sequence req_1 (count = 2): 2 req_2 (count = 1): 2 Then, if we submit a nop req, req_2 will still timeout even the nop req finished. This patch fix this problem by adjust the sequence number of each reordered reqs when inserting a new entry. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23io_uring : correct timeout req sequence when waiting timeoutzhangyi (F)
The sequence number of reqs on the timeout_list before the timeout req should be adjusted in io_timeout_fn(), because the current timeout req will consumes a slot in the cq_ring and cq_tail pointer will be increased, otherwise other timeout reqs may return in advance without waiting for enough wait_nr. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23io_uring: revert "io_uring: optimize submit_and_wait API"Jens Axboe
There are cases where it isn't always safe to block for submission, even if the caller asked to wait for events as well. Revert the previous optimization of doing that. This reverts two commits: bf7ec93c644cb c576666863b78 Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23xfs: add mising include of xfs_pnfs.h for missing declarationsBen Dooks (Codethink)
The xfs_pnfs.c file is missing an include of xfs_pnfs.h to add the prototypes of the functions it exports. Include this file to fix the following sparse warnings: fs/xfs/xfs_pnfs.c:27:1: warning: symbol 'xfs_break_leased_layouts' was not declared. Should it be static? fs/xfs/xfs_pnfs.c:52:1: warning: symbol 'xfs_fs_get_uuid' was not declared. Should it be static? fs/xfs/xfs_pnfs.c:77:1: warning: symbol 'xfs_fs_map_blocks' was not declared. Should it be static? fs/xfs/xfs_pnfs.c:226:1: warning: symbol 'xfs_fs_commit_blocks' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-23xfs: don't set bmapi total block req where minleft isBrian Foster
xfs_bmapi_write() takes a total block requirement parameter that is passed down to the block allocation code and is used to specify the total block requirement of the associated transaction. This is used to try and select an AG that can not only satisfy the requested extent allocation, but can also accommodate subsequent allocations that might be required to complete the transaction. For example, additional bmbt block allocations may be required on insertion of the resulting extent to an inode data fork. While it's important for callers to calculate and reserve such extra blocks in the transaction, it is not necessary to pass the total value to xfs_bmapi_write() in all cases. The latter automatically sets minleft to ensure that sufficient free blocks remain after the allocation attempt to expand the format of the associated inode (i.e., such as extent to btree conversion, btree splits, etc). Therefore, any callers that pass a total block requirement of the bmap mapping length plus worst case bmbt expansion essentially specify the additional reservation requirement twice. These callers can pass a total of zero to rely on the bmapi minleft policy. Beyond being superfluous, the primary motivation for this change is that the total reservation logic in the bmbt code is dubious in scenarios where minlen < maxlen and a maxlen extent cannot be allocated (which is more common for data extent allocations where contiguity is not required). The total value is based on maxlen in the xfs_bmapi_write() caller. If the bmbt code falls back to an allocation between minlen and maxlen, that allocation will not succeed until total is reset to minlen, which essentially throws away any additional reservation included in total by the caller. In addition, the total value is not reset until after alignment is dropped, which means that such callers drop alignment far too aggressively than necessary. Update all callers of xfs_bmapi_write() that pass a total block value of the mapping length plus bmbt reservation to instead pass zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation requirement. This trades off slightly less conservative AG selection for the ability to preserve alignment in more scenarios. xfs_bmapi_write() callers that incorporate unrelated or additional reservations in total beyond what is already included in minleft must continue to use the former. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-23xfs: cap longest free extent to maximum allocatableDave Chinner
Cap longest extent to the largest we can allocate based on limits calculated at mount time. Dynamic state (such as finobt blocks) can result in the longest free extent exceeding the size we can allocate, and that results in failure to align full AG allocations when the AG is empty. Result: xfs_io-4413 [003] 426.412459: xfs_alloc_vextent_loopfailed: dev 8:96 agno 0 agbno 32 minlen 243968 maxlen 244000 mod 0 prod 1 minleft 1 total 262148 alignment 32 minalignslop 0 len 0 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x5 firstblock 0xffffffffffffffff minlen and maxlen are now separated by the alignment size, and allocation fails because args.total > free space in the AG. [bfoster: Added xfs_bmap_btalloc() changes.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-23ext4: add kunit test for decoding extended timestampsIurii Zaikin
KUnit tests for decoding extended 64 bit timestamps that verify the seconds part of [a/c/m] timestamps in ext4 inode structs are decoded correctly. Test data is derived from the table in the Inode Timestamps section of Documentation/filesystems/ext4/inodes.rst. KUnit tests run during boot and output the results to the debug log in TAP format (http://testanything.org/). Only useful for kernel devs running KUnit test harness and are not for inclusion into a production build. Signed-off-by: Iurii Zaikin <yzaikin@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Tested-by: Brendan Higgins <brendanhiggins@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-10-23pipe: Reduce #inclusion of pipe_fs_i.hDavid Howells
Remove some #inclusions of linux/pipe_fs_i.h that don't seem to be necessary any more. Signed-off-by: David Howells <dhowells@redhat.com>
2019-10-23compat_ioctl: move SG_GET_REQUEST_TABLE handlingArnd Bergmann
SG_GET_REQUEST_TABLE is now the last ioctl command that needs a conversion handler. This is only used in a single file, so the implementation should be there. I'm trying to simplify it in the process, to get rid of the compat_alloc_user_space() and extra copy, by adding a put_compat_request_table() function instead, which copies the data in the right format to user space. Cc: linux-scsi@vger.kernel.org Cc: Doug Gilbert <dgilbert@interlog.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: ppp: move simple commands into ppp_generic.cArnd Bergmann
All ppp commands that are not already handled in ppp_compat_ioctl() are compatible, so they can now handled by calling the native ppp_ioctl() directly. Without CONFIG_BLOCK, the generic compat_ioctl table is now empty, so add a check to avoid a build failure in the looking function for that configuration. Cc: netdev@vger.kernel.org Cc: linux-ppp@vger.kernel.org Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: handle PPPIOCGIDLE for 64-bit time_tArnd Bergmann
The ppp_idle structure is defined in terms of __kernel_time_t, which is defined as 'long' on all architectures, and this usage is not affected by the y2038 problem since it transports a time interval rather than an absolute time. However, the ppp user space defines the same structure as time_t, which may be 64-bit wide on new libc versions even on 32-bit architectures. It's easy enough to just handle both possible structure layouts on all architectures, to deal with the possibility that a user space ppp implementation comes with its own ppp_idle structure definition, as well as to document the fact that the driver is y2038-safe. Doing this also avoids the need for a special compat mode translation, since 32-bit and 64-bit kernels now support the same interfaces. The old 32-bit structure is also available on native 64-bit architectures now, but this is harmless. Cc: netdev@vger.kernel.org Cc: linux-ppp@vger.kernel.org Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move PPPIOCSCOMPRESS to ppp_genericAl Viro
Rather than using a compat_alloc_user_space() buffer, moving this next to the native handler allows sharing most of the code, leaving only the user copy portion distinct. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: netdev@vger.kernel.org Cc: linux-ppp@vger.kernel.org Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: unify copy-in of ppp filtersAl Viro
Now that isdn4linux is gone, the is only one implementation of PPPIOCSPASS and PPPIOCSACTIVE in ppp_generic.c, so this is where the compat_ioctl support should be implemented. The two commands are implemented in very similar ways, so introduce new helpers to allow sharing between the two and between native and compat mode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [arnd: rebased, and added changelog text] Cc: netdev@vger.kernel.org Cc: linux-ppp@vger.kernel.org Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move SIOCOUTQ out of compat_ioctl.cArnd Bergmann
All users of this call are in socket or tty code, so handling it there means we can avoid the table entry in fs/compat_ioctl.c. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: reimplement SG_IO handlingArnd Bergmann
There are two code locations that implement the SG_IO ioctl: the old sg.c driver, and the generic scsi_ioctl helper that is in turn used by multiple drivers. To eradicate the old compat_ioctl conversion handler for the SG_IO command, I implement a readable pair of put_sg_io_hdr() /get_sg_io_hdr() helper functions that can be used for both compat and native mode, and then I call this from both drivers. For the iovec handling, there is already a compat_import_iovec() function that can simply be called in place of import_iovec(). To avoid having to pass the compat/native state through multiple indirections, I mark the SG_IO command itself as compatible in fs/compat_ioctl.c and use in_compat_syscall() to figure out where we are called from. As a side-effect of this, the sg.c driver now also accepts the 32-bit sg_io_hdr format in compat mode using the read/write interface, not just ioctl. This should improve compatiblity with old 32-bit binaries, but it would break if any application intentionally passes the 64-bit data structure in compat mode here. Steffen Maier helped debug an issue in an earlier version of this patch. Cc: Steffen Maier <maier@linux.ibm.com> Cc: linux-scsi@vger.kernel.org Cc: Doug Gilbert <dgilbert@interlog.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move WDIOC handling into wdt driversArnd Bergmann
All watchdog drivers implement the same set of ioctl commands, and fortunately all of them are compatible between 32-bit and 64-bit architectures. Modern drivers always go through drivers/watchdog/wdt.c as an abstraction layer, but older ones implement their own file_operations on a character device for this. Move the handling from fs/compat_ioctl.c into the individual drivers. Note that most of the legacy drivers will never be used on 64-bit hardware, because they are for an old 32-bit SoC implementation, but doing them all at once is safer than trying to guess which ones do or do not need the compat_ioctl handling. Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23fs: compat_ioctl: move FITRIM emulation into file systemsArnd Bergmann
Remove the special case for FITRIM, and make file systems handle that like all other ioctl commands with their own handlers. Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: linux-nilfs@vger.kernel.org Cc: ocfs2-devel@oss.oracle.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23gfs2: add compat_ioctl supportArnd Bergmann
Out of the four ioctl commands supported on gfs2, only FITRIM works in compat mode. Add a proper handler based on the ext4 implementation. Fixes: 6ddc5c3ddf25 ("gfs2: getlabel support") Reviewed-by: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove unused convert_in_user macroArnd Bergmann
The last users are all gone, so let's remove the macro as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove last RAID handling codeArnd Bergmann
Commit aa98aa31987a ("md: move compat_ioctl handling into md.c") already removed the COMPATIBLE_IOCTL() table entries and added a complete implementation, but a few lines got left behind and should also be removed here. Cc: linux-raid@vger.kernel.org Cc: Song Liu <song@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove /dev/raw ioctl translationArnd Bergmann
The /dev/rawX implementation already handles these just fine, so the entries in the table are not needed any more. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove PCI ioctl translationArnd Bergmann
The /proc/pci/ implementation already handles these just fine, so the entries in the table are not needed any more. Cc: linux-pci@vger.kernel.org Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove joystick ioctl translationArnd Bergmann
The joystick driver already handles these just fine, so the entries in the table are not needed any more. Cc: linux-input@vger.kernel.org Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove /dev/random commandsArnd Bergmann
These are all handled by the random driver, so instead of listing each ioctl, we can use the generic compat_ptr_ioctl() helper. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove IGNORE_IOCTL()Arnd Bergmann
Since commit 07d106d0a33d ("vfs: fix up ENOIOCTLCMD error handling"), we don't warn about unhandled compat-ioctl command code any more, but just return the same error that a native file descriptor returns when there is no handler. This means the IGNORE_IOCTL() annotations are completely useless and can all be removed. TIOCSTART/TIOCSTOP and KDGHWCLK/KDSHWCLK fall into the same category, but for some reason were listed as COMPATIBLE_IOCTL(). Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove translation for sound ioctlsArnd Bergmann
The SNDCTL_* and SOUND_* commands are the old OSS user interface. I checked all the sound ioctl commands listed in fs/compat_ioctl.c to see if we still need the translation handlers. Here is what I found: - sound/oss/ is (almost) gone from the kernel, this is what actually needed all the translations - The ALSA emulation for OSS correctly handles all compat_ioctl commands already. - sound/oss/dmasound/ is the last holdout of the original OSS code, this is only used on arch/m68k, which has no 64-bit mode and hence needs no compat handlers - arch/um/drivers/hostaudio_kern.c may run in 64-bit mode with 32-bit x86 user space underneath it. This rare corner case is the only one that still needs the compat handlers. By adding a simple redirect of .compat_ioctl to .unlocked_ioctl in the UML driver, we can remove all the COMPATIBLE_IOCTL() annotations without a change in functionality. For completeness, I'm adding the same thing to the dmasound file, knowing that it makes no difference. The compat_ioctl list contains one comment about SNDCTL_DSP_MAPINBUF and SNDCTL_DSP_MAPOUTBUF, which actually would need a translation handler if implemented. However, the native implementation just returns -EINVAL, so we don't care. Reviewed-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove HIDIO translationArnd Bergmann
The two drivers implementing these both gained proper compat_ioctl() handlers a long time ago with commits bb6c8d8fa9b5 ("HID: hiddev: Add 32bit ioctl compatibilty") and ae5e49c79c05 ("HID: hidraw: add compatibility ioctl() for 32-bit applications."), so the lists in fs/compat_ioctl.c are no longer used. It appears that the lists were also incomplete, so the translation didn't actually work correctly when it was still in use. Remove them as cleanup. Cc: linux-bluetooth@vger.kernel.org Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: remove HCIUART handlingArnd Bergmann
As of commit f0193d3ea73b ("change semantics of ldisc ->compat_ioctl()"), all hciuart ioctl commands are handled correctly in the driver, and we never need to go through the table here. Cc: linux-bluetooth@vger.kernel.org Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move hci_sock handlers into driverArnd Bergmann
All these ioctl commands are compatible, so we can handle them with a trivial wrapper in hci_sock.c and remove the listing in fs/compat_ioctl.c. A few of the commands pass integer arguments instead of pointers, so for correctness skip the compat_ptr() conversion here. Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>