path: root/fs
Age | Commit message | Author
2019-11-18btrfs: Open-code name_in_log_ref in replay_one_nameNikolay Borisov
That function adds unnecessary indirection between backref_in_log and the caller. Furthermore, it also "downgrades" backref_in_log's return value to a boolean, when in fact it could very well be an error. Rectify the situation by simply open-coding name_in_log_ref in replay_one_name and properly handling the possible return codes from backref_in_log. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Properly handle backref_in_log retvalNikolay Borisov
This function can return a negative error value if btrfs_search_slot errors for whatever reason or if btrfs_alloc_path runs out of memory. This is currently problematic because backref_in_log is treated by its callers as if it returned a boolean. Fix this by adding proper error handling in the callers. That also enables the function to return the error code from btrfs_search_slot directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
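For illustration, a minimal sketch of the caller-side handling this change describes; the surrounding variable names and labels are assumptions, not the verbatim tree-log code:

  ret = backref_in_log(log_root, search_key, inode_objectid, name, namelen);
  if (ret < 0) {
      /* -ENOMEM from btrfs_alloc_path or an error from btrfs_search_slot */
      goto out;
  } else if (ret) {
      /* the backref is already present in the log tree */
      goto out;
  }
  /* ret == 0: the backref is not in the log, continue the normal path */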
2019-11-18btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_logNikolay Borisov
Direct replacement, though note that the inside of the loop in btrfs_find_name_in_backref is organized in a slightly different way but is equivalent. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKEDQu Wenruo
The state was introduced in commit 4a9d8bdee368 ("Btrfs: make the state of the transaction more readable"), then in commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") setting of the state was completely removed. So we can just clean up the state since it's only compared but never set. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: transaction: describe transaction states and transitionsQu Wenruo
Add an overview of the basic btrfs transaction transitions, including the following states: - No transaction states - Transaction N [[TRANS_STATE_RUNNING]] - Transaction N [[TRANS_STATE_COMMIT_START]] - Transaction N [[TRANS_STATE_COMMIT_DOING]] - Transaction N [[TRANS_STATE_UNBLOCKED]] - Transaction N [[TRANS_STATE_COMPLETED]] For each state, the comment will include: - Basic explanation of the current state - How to move to the next state - What will happen if we call various start_transaction() functions - Relationship to transaction N+1 This doesn't provide technical details, but serves as a cheat sheet for readers to get into the code a little more easily. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
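For orientation, a hedged sketch of the state list the new comments walk through, using the names above; the real enum in transaction.h may carry additional values:

  enum btrfs_trans_state {
      TRANS_STATE_RUNNING,
      TRANS_STATE_COMMIT_START,
      TRANS_STATE_COMMIT_DOING,
      TRANS_STATE_UNBLOCKED,
      TRANS_STATE_COMPLETED,
      TRANS_STATE_MAX,
  };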
2019-11-18btrfs: use has_single_bit_set for clarityDavid Sterba
Replace is_power_of_2 with the self-documenting helper and remove the open-coded call in alloc_profile_is_valid. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add 64bit safe helper for power of two checksDavid Sterba
As is_power_of_2 takes an unsigned long, it's not safe on 32bit architectures, but we could pass any u64 value in several places. Add a separate helper and also an alias that better expresses the purpose for which the helper is used. Signed-off-by: David Sterba <dsterba@suse.com>
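A minimal sketch of such a helper, assuming the usual n & (n - 1) bit trick; the names follow the changelog above and may not match the final patch exactly:

  static inline bool is_power_of_two_u64(u64 n)
  {
      return n != 0 && (n & (n - 1)) == 0;
  }

  /* alias that better describes the intent at call sites */
  static inline bool has_single_bit_set(u64 n)
  {
      return is_power_of_two_u64(n);
  }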
2019-11-18btrfs: balance: use term redundancy instead of integrity in messageAnand Jain
When balance reduces the number of copies of metadata, it reduces the redundancy, so use the term redundancy instead of integrity in the message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move btrfs_unlock_up_safe to other locking functionsDavid Sterba
The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move btrfs_set_path_blocking to other locking functionsDavid Sterba
The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: make btrfs_assert_tree_locked static inlineDavid Sterba
The function btrfs_assert_tree_locked is used outside of the locking code so it is exported, however we can make it static inline as it's fairly trivial. This is the only locking assertion used in release builds; inlining reduces the text size by 174 bytes and reduces stack consumption in the callers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: make locking assertion helpers static inlineDavid Sterba
I've noticed that none of the btrfs_assert_*lock* debugging helpers is inlined, even though they're short and mostly a value update. Making them inline shaves 67 bytes off the text size, reduces stack consumption and perhaps also slightly improves performance by avoiding unnecessary calls. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
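As a hedged illustration of why inlining helps, a sketch of the shape of such a helper; the lock-counter field is an assumption here, not the exact extent_buffer layout:

  static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
  {
      BUG_ON(!atomic_read(&eb->read_locks));   /* assumed counter field */
  }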
2019-11-18btrfs: get rid of pointless wtag variable in async-thread.cOmar Sandoval
Commit ac0c7cf8be00 ("btrfs: fix crash when tracepoint arguments are freed by wq callbacks") added a void pointer, wtag, which is passed into trace_btrfs_all_work_done() instead of the freed work item. This is silly for a few reasons: 1. The freed work item still has the same address. 2. work is still in scope after it's freed, so assigning wtag doesn't stop anyone from using it. 3. The tracepoint has always taken a void * argument, so assigning wtag doesn't actually make things any more type-safe. (Note that the original bug in commit bc074524e123 ("btrfs: prefix fsid to all trace events") was that the void * was implicitly casted when it was passed to btrfs_work_owner() in the trace point itself). Instead, let's add some clearer warnings as comments. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: get rid of unique workqueue helper functionsOmar Sandoval
Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work. However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyways. So, let's just get rid of the extra indirection. While we're here, rename normal_work_helper() to the more informative btrfs_work_helper(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: don't prematurely free work in scrub_missing_raid56_worker()Omar Sandoval
Currently, scrub_missing_raid56_worker() puts and potentially frees sblock (which embeds the work item) and then submits a bio through scrub_wr_submit(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". Fix it by dropping the reference after we submit the bio. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: don't prematurely free work in reada_start_machine_worker()Omar Sandoval
Currently, reada_start_machine_worker() frees the reada_machine_work and then calls __reada_start_machine() to do readahead. This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". There _might_ already be a deadlock here: reada_start_machine_worker() can depend on itself through stacked filesystems (__reada_start_machine() -> reada_start_machine_dev() -> reada_tree_block_flagged() -> read_extent_buffer_pages() -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() -> submit_bio() onto a loop device can trigger readahead on the lower filesystem). Either way, let's fix it by freeing the work at the end. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: don't prematurely free work in end_workqueue_fn()Omar Sandoval
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
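A minimal sketch of the ordering these fixes establish, with illustrative type names rather than the actual btrfs end_io structures: anything that may depend on other work items runs first, and the structure embedding the work item is freed only as the very last step.

  struct example_end_io_wq {
      struct bio *bio;
      struct btrfs_work work;
  };

  static void example_end_workqueue_fn(struct btrfs_work *work)
  {
      struct example_end_io_wq *end_io_wq =
          container_of(work, struct example_end_io_wq, work);

      bio_endio(end_io_wq->bio);   /* may queue or wait on other work items */
      kfree(end_io_wq);            /* freed last so the item cannot be recycled early */
  }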
2019-11-18btrfs: don't prematurely free work in run_ordered_work()Omar Sandoval
We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: get rid of unnecessary memset() of work itemOmar Sandoval
Commit fc97fab0ea59 ("btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be initialized with btrfs_init_work(), but it left behind an unnecessary memset(). Get rid of the memset(). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move the failrec tree stuff into extent-io-tree.hJosef Bacik
This needs to be cleaned up in the future, but for now it belongs to the extent-io-tree stuff since it uses the internal tree search code. Needed to export get_state_failrec and set_state_failrec as well since we're not going to move the actual IO part of the failrec stuff out at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: export find_delalloc_rangeJosef Bacik
This utilizes stuff internal to the extent_io_tree, so we need to export it before we move it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move extent_io_tree defs to their own headerJosef Bacik
extent_io.c/h are huge, encompassing a bunch of different things. The extent_io_tree code can live on its own, so separate this out. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: separate out the extent io init functionJosef Bacik
We are moving extent_io_tree into its own file, so separate out the extent_state init stuff from extent_io_tree_init(). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: separate out the extent leak codeJosef Bacik
We check both extent buffer and extent state leaks in the same function; separate these two checks out into their own functions so we can move them around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: ctree: Remove stray comment of setting up path lockQu Wenruo
The following comment shows up in btrfs_search_slot() without much sense: /* * setup the path here so we can release it under lock * contention with the cow code */ if (cow) { /* code touching path->lock[] is far away from here */ } This comment hasn't been cleaned up after the relevant code was removed. The original code was introduced in commit 65b51a009e29 ("btrfs_search_slot: reduce lock contention by cowing in two stages"): + + /* + * setup the path here so we can release it under lock + * contention with the cow code + */ + p->nodes[level] = b; + if (!p->skip_locking) + p->locks[level] = 1; + But in the current code, we have different timing for modifying the path lock, so just remove the comment. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: ctree: Reduce one indent level for btrfs_search_old_slot()Qu Wenruo
Similar to btrfs_search_slot() done in previous patch, make a shortcut for the level 0 case and allow to reduce indentation for the remaining case. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: ctree: Reduce one indent level for btrfs_search_slot()Qu Wenruo
In btrfs_search_slot(), we have something like: if (level != 0) { /* Do search inside tree nodes */ } else { /* Do search inside tree leaves */ goto done; } This caused an extra indent level for the tree node search code. Change it to something like: if (level == 0) { /* Do search inside tree leaves */ goto done; } /* Do search inside tree nodes */ So we have more space to maneuver our code; this is especially useful as the tree node search code is more complex than the leaf search code. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: tree-checker: Add check for INODE_REFQu Wenruo
For INODE_REF we will check: - Objectid (ino) against the previous key, to detect a missing INODE_ITEM - No overflow/padding in the data payload Much like DIR_ITEM, but with fewer members to check. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: tree-checker: Try to detect missing INODE_ITEMQu Wenruo
For the following items, key->objectid is the inode number: - DIR_ITEM - DIR_INDEX - XATTR_ITEM - EXTENT_DATA - INODE_REF So in the subvolume tree, such items must have their previous item share the same objectid, e.g.: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 INODE_ITEM 0) (258 INODE_REF 0) (258 XATTR_ITEM 0) (258 EXTENT_DATA 0) But if we have the following sequence, then there is definitely something wrong, normally an INODE_ITEM is missing, like: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 XATTR_ITEM 0) <<< objectid suddenly changed to 258 (258 EXTENT_DATA 0) So just by checking the previous key for the above inode-based key types, we can detect a missing inode item. For the INODE_REF key type, the check will be added along with the INODE_REF checker. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
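A hedged sketch of the previous-key comparison described above; the surrounding variables are illustrative and the real tree-checker code carries more context in its error reporting:

  /* key->type is one of DIR_ITEM, DIR_INDEX, XATTR_ITEM, EXTENT_DATA, ... */
  if (prev_key->objectid != key->objectid) {
      /* previous item belongs to another inode, so INODE_ITEM is missing */
      return -EUCLEAN;
  }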
2019-11-18Btrfs: make btrfs_wait_extents() staticFilipe Manana
It's not used outside of transaction.c. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Add assert to catch nested transaction commitNikolay Borisov
A recent patch to btrfs showed that there was at least one case where a nested transaction was committed. A nested transaction here means code which has a transaction handle calls some function which in turn obtains a copy of the same transaction handle. In such cases the correct thing to do is for the lower callee to call btrfs_end_transaction, which contains appropriate checks so as to not commit the transaction, which would result in a stale trans handle for the caller. To catch such cases add an assert in btrfs_commit_transaction ensuring btrfs_trans_handle::use_count is always 1. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
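A one-line sketch of the assertion described; its exact placement inside btrfs_commit_transaction() is an assumption:

  ASSERT(trans->use_count == 1);   /* a nested handle would have use_count > 1 */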
2019-11-18btrfs: simplify inode locking for RWF_NOWAITGoldwyn Rodrigues
This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression"). Apparently our current rwsem code doesn't like doing the trylock-then-lock-for-real scheme. This causes extra contention on the lock and can be measured e.g. by the AIM7 benchmark. So change our read/write methods to only do the trylock for the RWF_NOWAIT case. Fixes: edf064e7c6fe ("btrfs: nowait aio support") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
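A hedged sketch of the locking pattern described, not the verbatim btrfs change:

  if (iocb->ki_flags & IOCB_NOWAIT) {
      /* RWF_NOWAIT: only try the lock, never sleep on it */
      if (!inode_trylock(inode))
          return -EAGAIN;
  } else {
      inode_lock(inode);
  }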
2019-11-15Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes that should make it into this release. This contains: - io_uring: - The timeout command assumes sequence == 0 means that we want one completion, but this kind of overloading is unfortunate as it prevents users from doing a pure time based wait. Since this operation was introduced in this cycle, let's correct it now, while we can. (me) - One-liner to fix an issue with dependent links and fixed buffer reads. The actual IO completed fine, but the link got severed since we stored the wrong expected value. (me) - Add TIMEOUT to list of opcodes that don't need a file. (Pavel) - rsxx missing workqueue destroy calls. Old bug. (Chuhong) - Fix blk-iocost active list check (Jiufei) - Fix impossible-to-hit overflow merge condition, that still hit some folks very rarely (Junichi) - Fix bfq hang issue from 5.3. This didn't get marked for stable, but will go into stable post this merge (Paolo)" * tag 'for-linus-20191115' of git://git.kernel.dk/linux-block: rsxx: add missed destroy_workqueue calls in remove iocost: check active_list of all the ancestors in iocg_activate() block, bfq: deschedule empty bfq_queues not referred by any process io_uring: ensure registered buffer import returns the IO length io_uring: Fix getting file for timeout block: check bi_size overflow before merge io_uring: make timeout sequence == 0 mean no sequence
2019-11-15Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Two fixes for the buffered reads and O_DIRECT writes serialization patch that went into -rc1 and a fixup for a bogus warning on older gcc versions" * tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client: rbd: silence bogus uninitialized warning in rbd_object_map_update_finish() ceph: increment/decrement dio counter on async requests ceph: take the inode lock before acquiring cap refs
2019-11-15afs: Fix race in commit bulk status fetchDavid Howells
When a lookup is done, the afs filesystem will perform a bulk status-fetch operation on the requested vnode (file) plus the next 49 other vnodes from the directory list (in AFS, directory contents are downloaded as blobs and parsed locally). When the results are received, it will speculatively populate the inode cache from the extra data. However, if the lookup races with another lookup on the same directory, but for a different file - one that's in the 49 extra fetches, then if the bulk status-fetch operation finishes first, it will try and update the inode from the other lookup. If this other inode is still in the throes of being created, however, this will cause an assertion failure in afs_apply_status(): BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags)); on or about fs/afs/inode.c:175 because it expects data to be there already that it can compare to. Fix this by skipping the update if the inode is being created as the creator will presumably set up the inode with the same information. Fixes: 39db9815da48 ("afs: Fix application of the results of a inline bulk status fetch") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs fixes from Al Viro: "Assorted fixes all over the place; some of that is -stable fodder, some regressions from the last window" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable ecryptfs: fix unlink and rmdir in face of underlying fs modifications audit_get_nd(): don't unlock parent too early exportfs_decode_fh(): negative pinned may become positive without the parent locked cgroup: don't put ERR_PTR() into fc->root autofs: fix a leak in autofs_expire_indirect() aio: Fix io_pgetevents() struct __compat_aio_sigset layout fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15pipe: Remove sync on wake_upsDavid Howells
2019-11-15pipe: Increase the writer-wakeup threshold to reduce context-switch countDavid Howells
Increase the threshold at which the reader sends a wake event to the writers in the queue such that the queue must be half empty before the wake is issued, rather than the wake being issued when just a single slot becomes available. This reduces the number of context switches in the tests significantly, without altering the amount of work achieved. With my pipe-bench program, there's a 20% reduction versus an unpatched kernel. Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David Howells <dhowells@redhat.com>
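A hedged sketch of the half-empty condition; the occupancy helper and field names here are assumptions about the pipe code of this series, not guaranteed to match the final patch:

  /* wake sleeping writers only once at least half of the ring is free */
  if (pipe_occupancy(pipe->head, pipe->tail) <= pipe->max_usage / 2)
      wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);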
2019-11-15pipe: Check for ring full inside of the spinlock in pipe_write()David Howells
Make pipe_write() check to see if the ring has become full between it taking the pipe mutex, checking the ring status and then taking the spinlock. This can happen if a notification is written into the pipe as that happens without the pipe mutex. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Remove redundant wakeup from pipe_write()David Howells
Remove a redundant wakeup from pipe_write(). Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Rearrange sequence in pipe_write() to preallocate slotDavid Howells
Rearrange the sequence in pipe_write() so that the allocation of the new buffer, the allocation of a ring slot and the attachment to the ring are done under the pipe wait spinlock, and then the lock is dropped and the buffer can be filled. The data copy needs to be done without the spinlock held and with irqs enabled, so the lock needs to be dropped first. However, the reader can't progress as we're holding pipe->mutex. We also need to drop the lock as holding it would impact others looking at the pipe waitqueue, such as poll(), the consumer and a future kernel message writer. We just abandon the preallocated slot if we get a copy error. Future writes may continue it and a future read will eventually recycle it. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Conditionalise wakeup in pipe_read()David Howells
Only do a wakeup in pipe_read() if we made space in a completely full buffer. The producer shouldn't be waiting on pipe->wait otherwise. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Advance tail pointer inside of wait spinlock in pipe_read()David Howells
Advance the pipe ring tail pointer inside of the wait spinlock in pipe_read() so that the pipe can be written into with kernel notifications from contexts where pipe->mutex cannot be taken. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Allow pipes to have kernel-reserved slotsDavid Howells
Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>
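A sketch of the split, trimmed to the two fields named above with the rest of pipe_inode_info omitted:

  struct pipe_inode_info {
      /* ... */
      unsigned int ring_size;   /* hard size of the pipe ring */
      unsigned int max_usage;   /* slots userspace-driven writes may fill */
      /* ... */
  };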
2019-11-15y2038: timerfd: Use timespec64 internallyArnd Bergmann
timerfd_show() uses a 'struct itimerspec' internally, but that is deprecated because of the time_t overflow and a conflict with the glibc type of the same name that is now incompatible in user space. Use a pair of timespec64 variables instead as a simple replacement. This removes the last use of itimerspec from the kernel, allowing the removal of the definition from the uapi headers along with timespec and timeval later. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: elfcore: Use __kernel_old_timeval for process timesArnd Bergmann
We store elapsed time for a crashed process in struct elf_prstatus using 'timeval' structures. Once glibc starts using 64-bit time_t, this becomes incompatible with the kernel's idea of timeval since the structure layout no longer matches on 32-bit architectures. This changes the definition of the elf_prstatus structure to use __kernel_old_timeval instead, which is hardcoded to the currently used binary layout. There is no risk of overflow in y2038 though, because the time values are all relative times, and can store up to 68 years of process elapsed time. There is a risk of applications breaking at build time when they use the new kernel headers and expect the type to be exactly 'timeval' rather than a structure that has the same fields as before. Those applications have to be modified to deal with 64-bit time_t anyway. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: syscalls: change remaining timeval to __kernel_old_timevalArnd Bergmann
All of the remaining syscalls that pass a timeval (gettimeofday, utime, futimesat) can trivially be changed to pass a __kernel_old_timeval instead, which has a compatible layout, but avoids ambiguity with the timeval type in user space. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: remove CONFIG_64BIT_TIMEArnd Bergmann
The CONFIG_64BIT_TIME option is defined on all architectures, and can be removed for simplicity now. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-14ext4: fix a bug in ext4_wait_for_tail_page_commityangerkun
No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still being dirty (it will be cleared by truncate_pagecache() in ext4_setattr()) but the buffers having been freed, and then trigger a bug as shown below:

  [ 26.057508] ------------[ cut here ]------------
  [ 26.058531] kernel BUG at fs/ext4/inode.c:2134!
  ...
  [ 26.088130] Call trace:
  [ 26.088695]  ext4_writepage+0x914/0xb28
  [ 26.089541]  writeout.isra.4+0x1b4/0x2b8
  [ 26.090409]  move_to_new_page+0x3b0/0x568
  [ 26.091338]  __unmap_and_move+0x648/0x988
  [ 26.092241]  unmap_and_move+0x48c/0xbb8
  [ 26.093096]  migrate_pages+0x220/0xb28
  [ 26.093945]  kernel_mbind+0x828/0xa18
  [ 26.094791]  __arm64_sys_mbind+0xc8/0x138
  [ 26.095716]  el0_svc_common+0x190/0x490
  [ 26.096571]  el0_svc_handler+0x60/0xd0
  [ 26.097423]  el0_svc+0x8/0xc

Run the procedure (generated by syzkaller) in parallel with ext3:

  void main()
  {
      int fd, fd1, ret;
      void *addr;
      size_t length = 4096;
      int flags;
      off_t offset = 0;
      char *str = "12345";

      fd = open("a", O_RDWR | O_CREAT);
      assert(fd >= 0);

      /* Truncate to 4k */
      ret = ftruncate(fd, length);
      assert(ret == 0);

      /* Journal data mode */
      flags = 0xc00f;
      ret = ioctl(fd, _IOW('f', 2, long), &flags);
      assert(ret == 0);

      /* Truncate to 0 */
      fd1 = open("a", O_TRUNC | O_NOATIME);
      assert(fd1 >= 0);

      addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset);
      assert(addr != (void *)-1);
      memcpy(addr, str, 5);
      mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
  }

And the bug will be triggered once we see the below order:

  reproduce1                          reproduce2

  ...                          |      ...
  truncate to 4k               |
  change to journal data mode  |
                               |      memcpy(set page dirty)
  truncate to 0:               |
   ext4_setattr:               |
    ...                        |
    ext4_wait_for_tail_page_commit
                               |      mbind(trigger bug)
    truncate_pagecache(clean dirty)
  ...                          |      ...

mbind will call ext4_writepage() since the page is still dirty, and then report the bug since the buffers have been freed. Fix it by returning directly once offset equals 0, which means the page has been fully truncated. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: bio_alloc with __GFP_DIRECT_RECLAIM never failsGao Xiang
Similar to [1] [2], bio_alloc with __GFP_DIRECT_RECLAIM flags guarantees bio allocation under some given restrictions, as stated in block/bio.c and fs/direct-io.c So here it's ok to not check for NULL value from bio_alloc(). [1] https://lore.kernel.org/r/20191030035518.65477-1-gaoxiang25@huawei.com [2] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Link: https://lore.kernel.org/r/20191031092315.139267-1-gaoxiang25@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
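A minimal sketch of the pattern; the call site is illustrative rather than a specific ext4 function:

  /*
   * GFP_NOIO includes __GFP_DIRECT_RECLAIM, so for a small number of vecs
   * bio_alloc() will not return NULL and the check can be dropped.
   */
  bio = bio_alloc(GFP_NOIO, 1);
  bio_set_dev(bio, inode->i_sb->s_bdev);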