2021-04-19btrfs: remove stale comment and logic from btrfs_inode_in_log()Filipe Manana
Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why. It used to be necessary to make sure that if we did something like the following:

  mmap write range A
  mmap write range B
  msync range A (ranged fsync)
  msync range B (ranged fsync)

we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing, because btrfs_inode_in_log() would return true. This was added in 125c4cf9f37c98 ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and test case generic/325 from fstests exercises that scenario.

However, as of commit 487781796d3022 ("btrfs: make fast fsyncs wait only for writeback"), every ranged fsync is now turned into a full ranged fsync (operating on the range from 0 to LLONG_MAX), so it is now pointless to test for emptiness of the list of modified extents, and the comment is clearly outdated. So just remove the comment and the list emptiness check, while also changing the function's return type to a boolean instead of an integer. In case we get support for ranged fsyncs again one day, it will be easy to notice that the check is necessary again, because it will make generic/325 always fail.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
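As a rough illustration (not the verbatim patch, and the locking shown is only an assumption of this sketch), the check described above boils down to something like:

    /* Illustrative sketch: the modified_extents emptiness test is gone and
     * the return type is now bool. Field names follow the btrfs_inode members
     * quoted elsewhere in this log. */
    static bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
    {
            bool ret = false;

            spin_lock(&inode->lock);
            if (inode->logged_trans == generation &&
                inode->last_sub_trans <= inode->last_log_commit &&
                inode->last_sub_trans <= inode->root->last_log_commit)
                    ret = true;
            spin_unlock(&inode->lock);

            return ret;
    }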
2021-04-19btrfs: fix race between marking inode needs to be logged and log syncingFilipe Manana
We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens.

1) We are at transaction N;

2) Inode I was previously fsynced in the current transaction so it has:

     inode->logged_trans set to N

3) The inode's root currently has:

     root->log_transid set to 1
     root->last_log_commit set to 0

   Which means only one log transaction was committed so far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());

4) One more range of pages is dirtied in inode I;

5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()...

6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file();

7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following:

     spin_lock(&inode->lock);
     inode->last_trans = trans->transaction->transid;
     inode->last_sub_trans = inode->root->log_transid;
     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit...

8) Task A is at btrfs_sync_log():

   - it increments root->log_transid to 2
   - starts writeback for all log tree extent buffers
   - waits for the writeback to complete
   - writes the super blocks
   - updates root->last_log_commit to 1

   It's a lot of slow steps between updating root->log_transid and root->last_log_commit;

9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs:

     inode->last_log_commit = inode->root->last_log_commit;
     spin_unlock(&inode->lock);

   Which results in inode->last_log_commit being set to 1. The ordered extent completes;

10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met:

    - inode->logged_trans == N, which matches fs_info->generation, and
    - inode->last_sub_trans (1) <= inode->last_log_commit (1), and
    - inode->last_sub_trans (1) <= root->last_log_commit (1), and
    - list inode->extent_tree.modified_extents is empty

    And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync.

It should be impossible in practice for one task to be able to make so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled, we know the task at btrfs_set_inode_last_trans() can not be preempted, because it is holding the inode's spinlock.

However, there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at:

  vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
  {
     (...)
     BTRFS_I(inode)->last_trans = fs_info->generation;
     BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
     BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
     (...)

So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task make enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks.

So fix this in two different ways:

1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1;

2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock.

This is an extremely unlikely race to happen in practice.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
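A sketch of the two fixes described above (the helper name is assumed for illustration; the real patch may differ in detail):

    /* 1) In btrfs_set_inode_last_trans(): derive ->last_log_commit from
     *    ->last_sub_trans instead of reading root->last_log_commit, so a
     *    concurrent btrfs_sync_log() can no longer make the two values
     *    equal by accident. */
    static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
                                                  struct btrfs_inode *inode)
    {
            spin_lock(&inode->lock);
            inode->last_trans = trans->transaction->transid;
            inode->last_sub_trans = inode->root->log_transid;
            inode->last_log_commit = inode->last_sub_trans - 1;
            spin_unlock(&inode->lock);
    }

    /* 2) Shared helper (name assumed) used by both btrfs_page_mkwrite() and
     *    btrfs_file_write_iter(): only bump ->last_sub_trans, under the
     *    inode's spinlock. */
    static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
    {
            spin_lock(&inode->lock);
            inode->last_sub_trans = inode->root->log_transid;
            spin_unlock(&inode->lock);
    }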
2021-04-19btrfs: fix race between memory mapped writes and fsyncFilipe Manana
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush any new delalloc that might have been created before taking the lock and then wait either for the ordered extents to complete or just for the writeback to complete (depending on whether the full sync flag is set or not). We then start logging the inode and assume that while we are doing it no one else is touching the inode's file extent items (or adding new ones).

That is generally true because all operations that modify an inode acquire the inode's lock first, including buffered and direct IO writes. However there is one exception: memory mapped writes, which do not and can not acquire the inode's lock.

This can cause two types of issues: ending up logging file extent items with overlapping ranges, which is detected by the tree checker and will result in aborting the transaction when starting writeback for a log tree's extent buffers, or a silent corruption where we log a version of the file that never existed.

Scenario 1 - logging overlapping extents

The following steps explain how we can end up with file extent items with overlapping ranges in a log tree due to a race between an fsync and memory mapped writes:

1) Task A starts an fsync on inode X, which has the full sync runtime flag set. First it starts by flushing all delalloc for the inode;

2) Task A then locks the inode and flushes any other delalloc that might have been created after the previous flush and waits for all ordered extents to complete;

3) In the inode's root we have the following leaf:

   Leaf N, generation == current transaction id:
   ---------------------------------------------------------
   | (...)  [ file extent item, offset 640K, length 128K ] |
   ---------------------------------------------------------

   The last file extent item in leaf N covers the file range from 640K to 768K;

4) Task B does a memory mapped write for the page corresponding to the file range from 764K to 768K;

5) Task A starts logging the inode. At copy_inode_items_to_log() it uses btrfs_search_forward() to search for leafs modified in the current transaction that contain items for the inode. It finds leaf N and copies all the inode items from that leaf into the log tree. Now the log tree has a copy of the last file extent item from leaf N. At the end of the while loop at copy_inode_items_to_log(), we have the minimum key set to:

     min_key.objectid = <inode X number>
     min_key.type = BTRFS_EXTENT_DATA_KEY
     min_key.offset = 640K

   Then we increment the key's offset by 1 so that the next call to btrfs_search_forward() leaves us at the first key greater than the key we just processed. But before btrfs_search_forward() is called again...

6) Delalloc for the page at offset 764K, dirtied by task B, is started. It can be started for several reasons:

   - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc;
   - Due to memory pressure the VMM triggers writeback of dirty pages;
   - The system call sync_file_range(2) is called from user space.

7) When the respective ordered extent completes, it trims the length of the existing file extent item for file offset 640K from 128K to 124K, and a new file extent item is added with a key offset of 764K and a length of 4K;

8) Task A calls btrfs_search_forward(), which returns us a path pointing to the leaf (can be leaf N or some other) containing the new file extent item for file offset 764K. We end up copying this item to the log tree, which overlaps with the last copied file extent item, which covers the file range from 640K to 768K.

When writeback is triggered for the log tree's extent buffers, the issue will be detected by the tree checker, which will dump a trace and an error message on dmesg/syslog. If the writeback is triggered when syncing the log, which typically is, then we also end up aborting the current transaction.

This is the same type of problem fixed in 0c713cbab6200b ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges").

Scenario 2 - logging a version of the file that never existed

This scenario only happens when using the NO_HOLES feature and results in a silent corruption, in the sense that it is not detectable by 'btrfs check' or the tree checker:

1) We have an inode I with a size of 1M and two file extent items, one covering an extent with disk_bytenr == X for the file range [0, 512K) and another one covering another extent with disk_bytenr == Y for the file range [512K, 1M);

2) A hole is punched for the file range [512K, 1M);

3) Task A starts an fsync of inode I, which has the full sync runtime flag set. It starts by flushing all existing delalloc, locks the inode (VFS lock), starts any new delalloc that might have been created before taking the lock and waits for all ordered extents to complete;

4) Some other task does a memory mapped write for the page corresponding to the file range [640K, 644K) for example;

5) Task A then logs all items of the inode with the call to copy_inode_items_to_log();

6) In the meanwhile delalloc for the range [640K, 644K) is started. It can be started for several reasons:

   - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc;
   - Due to memory pressure the VMM triggers writeback of dirty pages;
   - The system call sync_file_range(2) is called from user space.

7) The ordered extent for the range [640K, 644K) completes and a file extent item for that range is added to the subvolume tree, pointing to a 4K extent with a disk_bytenr == Z;

8) Task A then calls btrfs_log_holes(), to scan for implicit holes in the subvolume tree. It finds two implicit holes:

   - one for the file range [512K, 640K)
   - one for the file range [644K, 1M)

   As a result we end up neither logging a hole for the range [640K, 644K) nor logging the file extent item with a disk_bytenr == Z. This means that if we have a power failure and replay the log tree we end up getting the following file extent layout:

     [ disk_bytenr X ]    [ hole ]    [ disk_bytenr Y ]    [ hole ]
     0            512K    512K  640K  640K          644K   644K  1M

   Which does not correspond to any layout the file ever had before the power failure. The only two valid layouts would be:

     [ disk_bytenr X ]    [ hole ]
     0            512K    512K  1M

   and

     [ disk_bytenr X ]    [ hole ]    [ disk_bytenr Z ]    [ hole ]
     0            512K    512K  640K  640K          644K   644K  1M

This can be fixed by serializing memory mapped writes with fsync, and there are two ways to do it:

1) Make an fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX in the inode's io tree. This prevents the race but also blocks any reads during the duration of the fsync, which has a negative impact for many common workloads;

2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This semaphore was recently added by Josef's patch set:

     btrfs: add a i_mmap_lock to our inode
     btrfs: cleanup inode_lock/inode_unlock uses
     btrfs: exclude mmaps while doing remap
     btrfs: exclude mmap from happening during all fallocate operations

   and is used to solve races between memory mapped writes and clone/dedupe/fallocate. This also makes us have the same behaviour we have regarding other writes (buffered and direct IO) and fsync - block them while the inode logging is in progress.

This change uses the second approach due to the performance impact of the first one.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: exclude mmap from happening during all fallocate operationsJosef Bacik
There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe:

"""
When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode;

3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flush_workers workqueue) to flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket. The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task.

So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks.

When this happens stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W   5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30

 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W   5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30

 (...) several tasks waiting for the inode lock held by the fallocate task below (...)

 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0

 task:fdm-stress      state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
  btrfs_fallocate+0xdfb/0x1260 [btrfs]
  ? filename_lookup+0xf1/0x180
  vfs_fallocate+0x14f/0x440
  ioctl_preallocate+0x92/0xc0
  do_vfs_ioctl+0x66b/0x750
  ? __do_sys_newfstat+0x53/0x60
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: exclude mmaps while doing remapJosef Bacik
Darrick reported a potential issue to me where we could allow mmap writes after validating a page range matched in the case of dedupe. Generally we rely on lock page -> lock extent with the ordered flush to protect us, but this is done after we check the pages because we use the generic helpers, so we could modify the page in between doing the check and locking the range. There also exists a deadlock, as described by Filipe """ When cloning a file range, we lock the inodes, flush any delalloc within the respective file ranges, wait for any ordered extents and then lock the file ranges in both inodes. This means that right after we flush delalloc and before we lock the file ranges, memory mapped writes can come in and dirty pages in the file ranges of the clone operation. Most of the time this is harmless and causes no problems. However, if we are low on available metadata space, we can later end up in a deadlock when starting a transaction to replace file extent items. This happens if when allocating metadata space for the transaction, we need to wait for the async reclaim thread to release space and the reclaim thread needs to flush delalloc for the inode that got the memory mapped write and has its range locked by the clone task. Basically what happens is the following: 1) A clone operation locks inodes A and B, flushes delalloc for both inodes in the respective file ranges and waits for any ordered extents in those ranges to complete; 2) Before the clone task locks the file ranges, another task does a memory mapped write (which does not lock the inode) for one of the inodes of the clone operation. So now we have a dirty page in one of the ranges used by the clone operation; 3) The clone operation locks the file ranges for inodes A and B; 4) Later, when iterating over the file extents of inode A, the clone task attempts to start a transaction. There's not enough available free metadata space, so the async reclaim task is started (if not running already) and we wait for someone to wake us up on our reservation ticket; 5) The async reclaim task is not able to release space by any other means and decides to flush delalloc for the inode of the clone operation; 6) The workqueue job used to flush the inode blocks when starting delalloc for the inode, since the file range is currently locked by the clone task; 7) But the clone task is waiting on its reservation ticket and the async reclaim task is waiting on the flush job to complete, which can't progress since the clone task has the file range locked. So unless some other task is able to release space, for example an ordered extent for some other inode completes, we have a deadlock between all these tasks; When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? 
__filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several other tasks blocked on inode locks held by the clone task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? lock_release+0x20e/0x4c0 btrfs_clone+0x3e4/0x7e0 [btrfs] ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs] btrfs_clone_files+0xf6/0x150 [btrfs] btrfs_remap_file_range+0x324/0x3d0 [btrfs] do_clone_file_range+0xd4/0x1f0 vfs_clone_file_range+0x4d/0x230 ? lock_release+0x20e/0x4c0 ioctl_file_clone+0x8f/0xc0 do_vfs_ioctl+0x342/0x750 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix both of these issues by excluding mmaps from happening we are doing any sort of remap, which prevents this race completely. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpersJosef Bacik
In a few places we intermix btrfs_inode_lock with a plain inode_unlock, and in some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: add a i_mmap_lock to our inodeJosef Bacik
We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add an i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode; the sizes are as follows:

  no lockdep:
    before: 1120 (3 per 4k page)
    after:  1160 (3 per 4k page)

  lockdep:
    before: 2072 (1 per 4k page)
    after:  2224 (1 per 4k page)

We're slightly larger but it doesn't change how many objects we can fit per page.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
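A minimal sketch of how the new lock is meant to be used (the flag value and the page_mkwrite body are illustrative, not the exact patch):

    /* New field in struct btrfs_inode (other members omitted): */
    struct rw_semaphore i_mmap_lock;

    /* New ILOCK flag so btrfs_inode_lock() callers can also take i_mmap_lock
     * for writing; the numeric value here is an assumption of this sketch. */
    #define BTRFS_ILOCK_MMAP        (1U << 2)

    /* page_mkwrite takes the semaphore shared, so faults can run
     * concurrently with each other but are excluded by write lockers. */
    vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
    {
            struct inode *inode = file_inode(vmf->vma->vm_file);

            down_read(&BTRFS_I(inode)->i_mmap_lock);
            /* ... existing mkwrite work: reserve space, dirty the page ... */
            up_read(&BTRFS_I(inode)->i_mmap_lock);

            return VM_FAULT_LOCKED;
    }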
2021-04-19btrfs: remove mirror argument from btrfs_csum_verify_data()Goldwyn Rodrigues
The parameter mirror is not used and does not make sense for checksum verification of the given bio. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: remove force argument from run_delalloc_nocow()Goldwyn Rodrigues
force_cow can be calculated from the inode and does not need to be passed as an argument. This simplifies the run_delalloc_nocow() call from btrfs_run_delalloc_range(). A new function, should_nocow(), checks if the range should be NOCOWed or not. The function returns true iff either the BTRFS_INODE_NODATACOW or BTRFS_INODE_PREALLOC flag is set and the extent is not a defrag extent. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
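A sketch of what such a helper can look like (the defrag check is reduced to a hypothetical predicate here, purely for illustration):

    /* NOCOW the range only when the inode is flagged NODATACOW or has
     * preallocated extents, and the range is not scheduled for defrag
     * (defrag extents must always be COWed). */
    static bool should_nocow(struct btrfs_inode *inode, u64 start, u64 end)
    {
            if (!(inode->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
                    return false;

            /* range_is_defrag() is a placeholder for the real io_tree check */
            return !range_is_defrag(inode, start, end);
    }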
2021-04-19btrfs: don't opencode extent_changeset_freeNikolay Borisov
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: assign proper values to a bool variable in dev_extent_hole_check_zonedJiapeng Chong
Fix the following coccicheck warnings: ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function 'dev_extent_hole_check_zoned' with return type bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: add btree read ahead for incremental send operationsFilipe Manana
Currently we do not do btree read ahead when doing an incremental send, however we know that we will read and process any node or leaf in the send root that has a generation greater than the generation of the parent root. So triggering read ahead for such nodes and leafs is beneficial for an incremental send. This change does that, triggers read ahead of any node or leaf in the send root that has a generation greater then the generation of the parent root. As for the parent root, no readahead is triggered because knowing in advance which nodes/leaves are going to be read is not so linear and there's often a large time window between visiting nodes or leaves of the parent root. So I opted to leave out the parent root, and triggering read ahead for its nodes/leaves seemed to have not made significant difference. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of ram: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." 
start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, incremental send duration: with $initial_file_count == 200000: 51 seconds with $initial_file_count == 500000: 168 seconds After this change, incremental send duration: with $initial_file_count == 200000: 39 seconds (-26.7%) with $initial_file_count == 500000: 125 seconds (-29.4%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, and 77759 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2. While for $initial_file_count == 500000 there are 152476 nodes and leaves in the btree of the first snapshot, and 190511 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2 as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: add btree read ahead for full send operationsFilipe Manana
When doing a full send we know that we are going to be reading every node and leaf of the send root, so we benefit from enabling read ahead for the btree. This change enables read ahead for full send operations only, incremental sends will have read ahead enabled in a different way by a separate patch. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, full send duration: with $initial_file_count == 200000: 165 seconds with $initial_file_count == 500000: 407 seconds After this change, full send duration: with $initial_file_count == 200000: 149 seconds (-10.2%) with $initial_file_count == 500000: 353 seconds (-14.2%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, while for $initial_file_count == 500000 there are 152476 nodes and leaves. The roots were at level 2. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadataNikolay Borisov
btrfs_block_rsv_add can return only ENOSPC since it's called with the NO_FLUSH modifier, so simplify the logic in btrfs_delayed_inode_reserve_metadata to exploit this invariant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadataNikolay Borisov
It's only used for tracepoint to obtain the inode number, but we already have the ino from btrfs_delayed_node::inode_id. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: simplify commit logic in try_flush_qgroupNikolay Borisov
It's no longer expected to call this function with an open transaction so all the workarounds concerning this can be removed. In fact it'll constitute a bug to call this function with a transaction already held so WARN in this case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: scrub: drop a few function declarationsAnand Jain
Drop function declarations at the beginning of the file scrub.c. These functions are defined before they are used in the same file and don't need forward declaration. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: change return type to bool in btrfs_extent_readonlyAnand Jain
btrfs_extent_readonly() checks if the block group is readonly, so the bool return type should be used. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: unexport btrfs_extent_readonly() and make it staticAnand Jain
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So move it from extent-tree.c to inode.c and declare it as static. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: replace open coded while loop with proper constructNikolay Borisov
btrfs_inc_block_group_ro wants to ensure that the current transaction is not running dirty block groups, if it is it waits and loops again. That logic is currently implemented using a goto label. Actually using a proper do {} while() construct doesn't hurt readability nor does it introduce excessive nesting and makes the relevant code stand out by being encompassed in the loop construct. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: replace offset_in_entry with in_rangeNikolay Borisov
No point in duplicating the functionality just use the generic helper that has the same semantics. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
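For reference, in_range(x, start, len) is the generic helper that is true when start <= x < start + len, so the ordered extent check can be expressed as in this small sketch (field names as commonly used by btrfs_ordered_extent; illustrative only):

    /* Does this ordered extent cover the given file offset? */
    static bool ordered_covers_offset(const struct btrfs_ordered_extent *entry,
                                      u64 file_offset)
    {
            return in_range(file_offset, entry->file_offset, entry->num_bytes);
    }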
2021-04-19btrfs: make find_desired_extent take btrfs_inodeNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: make btrfs_replace_file_extents take btrfs_inodeNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: fix comment for btrfs ordered extent flag bitsQu Wenruo
There is a small error in the comment about the BTRFS_ORDERED_* flags, added in commit 3c198fe06449 ("btrfs: rework the order of btrfs_ordered_extent::flags"), but the fixup did not get merged in time. The 4 types are for the ordered extent itself, not for direct io. Only 3 types support direct io: REGULAR/NOCOW/PREALLOC. Fix the comment to reflect that. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19Merge series "spi: stm32-qspi: Fix and update" from <patrice.chotard@foss.st.com>Mark Brown
Patrice Chotard <patrice.chotard@foss.st.com>:

From: Patrice Chotard <patrice.chotard@foss.st.com>

Christophe Kerello (1):
  spi: stm32-qspi: fix pm_runtime usage_count counter

Patrice Chotard (2):
  spi: stm32-qspi: Trigger DMA only if more than 4 bytes to transfer
  spi: stm32-qspi: Add dirmap support

 drivers/spi/spi-stm32-qspi.c | 106 +++++++++++++++++++++++++++--------
 1 file changed, 84 insertions(+), 22 deletions(-)
2021-04-19Merge tag 'qcom-dts-for-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/dtArnd Bergmann
More Qualcomm DTS updates for 5.13

This adds CPUfreq, interconnect providers, IPC, remoteproc and IPA to the SDX55 platform and then adds board files for the Telit FN980 TLB and Thundercomm TurboX T55.

* tag 'qcom-dts-for-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
  ARM: dts: qcom: sdx55: add IPA information
  ARM: dts: qcom: sdx55: Add basic devicetree support for Thundercomm T55
  dt-bindings: arm: qcom: Add binding for Thundercomm T55 kit
  ARM: dts: qcom: sdx55: Add basic devicetree support for Telit FN980 TLB
  dt-bindings: arm: qcom: Add binding for Telit FN980 TLB board
  ARM: dts: qcom: sdx55: Add Modem remoteproc node
  ARM: dts: qcom: Fix node name for NAND controller node
  ARM: dts: qcom: sdx55: Add interconnect nodes
  ARM: dts: qcom: sdx55: Add SCM node
  dt-bindings: firmware: scm: Add compatible for SDX55
  ARM: dts: qcom: sdx55: Add IMEM and PIL info region
  ARM: dts: qcom: sdx55: Add modem SMP2P node
  ARM: dts: qcom: sdx55: Add CPUFreq support
  ARM: dts: qcom: sdx55: Add support for APCS block
  ARM: dts: qcom: sdx55: Add support for A7 PLL clock

Link: https://lore.kernel.org/r/20210419150956.860423-1-bjorn.andersson@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19arm64: dts: qcom: sc7180: Update iommu property for simultaneous playbackV Sujith Kumar Reddy
Update iommu property in lpass cpu node for supporting simultaneous playback on headset and speaker. Reviewed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: V Sujith Kumar Reddy <vsujithk@codeaurora.org> Signed-off-by: Srinivasa Rao Mandadapu <srivasam@codeaurora.org> Link: https://lore.kernel.org/r/20210406163330.11996-1-srivasam@codeaurora.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2021-04-19arm64: dts: qcom: sc7180: pompom: Add "dmic_clk_en" + sound modelDouglas Anderson
Match what's downstream for this board. Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Cc: Srinivasa Rao Mandadapu <srivasam@codeaurora.org> Cc: Ajit Pandey <ajitp@codeaurora.org> Cc: Judy Hsiao <judyhsiao@chromium.org> Cc: Cheng-Yi Chiang <cychiang@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20210315133924.v2.2.If218189eff613a6c48ba12d75fad992377d8f181@changeid Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2021-04-19arm64: dts: qcom: sc7180: coachz: Add "dmic_clk_en"Douglas Anderson
This was present downstream. Add upstream too. NOTE: upstream I managed to get some sort of halfway state and got one pinctrl entry in the coachz-r1 device tree. Remove that as part of this since it's now in the dtsi. Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Cc: Srinivasa Rao Mandadapu <srivasam@codeaurora.org> Cc: Ajit Pandey <ajitp@codeaurora.org> Cc: Judy Hsiao <judyhsiao@chromium.org> Cc: Cheng-Yi Chiang <cychiang@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20210315133924.v2.1.I601a051cad7cfd0923e55b69ef7e5748910a6096@changeid Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2021-04-19ARM: dts: mstar: Add a dts for M5Stack UnitV2Daniel Palmer
M5Stack are releasing a new widget based on the SigmaStar SSD202D. We have some support for the SSD202D so lets add a dts for it. Signed-off-by: Daniel Palmer <daniel@0x0f.com> Link: https://m5stack-store.myshopify.com/products/unitv2-ai-camera-gc2145 Link: https://lore.kernel.org/r/20210417011015.2105280-4-daniel@0x0f.com' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19dt-bindings: arm: mstar: Add compatible for M5Stack UnitV2Daniel Palmer
Add a compatible for the M5Stack UnitV2 that is based on the SigmaStar SSD202D (infinity2m). Signed-off-by: Daniel Palmer <daniel@0x0f.com> Link: https://lore.kernel.org/r/20210417011015.2105280-3-daniel@0x0f.com' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19dt-bindings: vendor-prefixes: Add vendor prefix for M5StackDaniel Palmer
M5Stack make various modules for STEM, Makers, IoT. Their UnitV2 is based on a SigmaStar SSD202D SoC which we already have some minimal support for so add a prefix in preparation for UnitV2 board support. Signed-off-by: Daniel Palmer <daniel@0x0f.com> Link: https://m5stack.com/ Link: https://lore.kernel.org/r/20210417011015.2105280-2-daniel@0x0f.com' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19arm64: dts: mt8183: fix dtbs_check warningMatthias Brugger
Fix unit names to make dtbs_check happy. Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Reviewed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Link: https://lore.kernel.org/r/20210414144643.17435-2-matthias.bgg@kernel.org Link: https://lore.kernel.org/r/20210416143923.23406-3-matthias.bgg@kernel.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19arm64: dts: mt8183-pumpkin: fix dtbs_check warningMatthias Brugger
Fix unit names to make dtbs_check happy. Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Reviewed-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Link: https://lore.kernel.org/r/20210414144643.17435-1-matthias.bgg@kernel.org Link: https://lore.kernel.org/r/20210416143923.23406-2-matthias.bgg@kernel.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19Merge tag 'memory-controller-drv-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into arm/driversArnd Bergmann
Memory controller drivers for v5.13, part two

1. Renesas RPC: fix possible NULL pointer.
2. Exynos5422 DMC: add proper error checking for clk_prepare.
3. Mediatek SMI: use device-links instead of explicit PM runtime calls.

* tag 'memory-controller-drv-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
  memory: mtk-smi: Add device-link between smi-larb and smi-common
  memory: samsung: exynos5422-dmc: handle clk_set_parent() failure
  memory: renesas-rpc-if: fix possible NULL pointer dereference of resource

Link: https://lore.kernel.org/r/20210415065514.7385-1-krzysztof.kozlowski@canonical.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-04-19spi: Handle SPI device setup callback failure.Joe Burmeister
If the setup callback failed, but the controller has auto_runtime_pm and set_cs, the setup failure could be missed. Signed-off-by: Joe Burmeister <joe.burmeister@devtank.co.uk> Link: https://lore.kernel.org/r/20210419130631.4586-1-joe.burmeister@devtank.co.uk Signed-off-by: Mark Brown <broonie@kernel.org>
2021-04-19spi: sync up initial chipselect stateDavid Bauer
When initially probing the SPI slave device, the call for disabling an SPI device without the SPI_CS_HIGH flag is not applied, as the condition for checking whether or not the state to be applied equals the one currently set evaluates to true. This however might not necessarily be the case, as the chipselect might be active. Add a force flag to spi_set_cs which allows to override this early exit condition. Set it to false everywhere except when called from spi_setup to sync up the initial CS state. Fixes commit d40f0b6f2e21 ("spi: Avoid setting the chip select if we don't need to") Signed-off-by: David Bauer <mail@david-bauer.net> Link: https://lore.kernel.org/r/20210416195956.121811-1-mail@david-bauer.net Signed-off-by: Mark Brown <broonie@kernel.org>
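A simplified sketch of the idea (the surrounding logic of the real spi_set_cs() is omitted and the controller field names are assumptions of this sketch):

    /* 'force' is only passed as true from spi_setup(), so the initial CS
     * state is written out even when it looks unchanged. */
    static void spi_set_cs(struct spi_device *spi, bool enable, bool force)
    {
            if (!force &&
                spi->controller->last_cs_enable == enable &&
                spi->controller->last_cs_mode_high == (spi->mode & SPI_CS_HIGH))
                    return;

            /* ... drive the CS GPIO or call the controller's set_cs() ... */
    }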
2021-04-19spi: stm32-qspi: Add dirmap supportPatrice Chotard
Add stm32_qspi_dirmap_read() and stm32_qspi_dirmap_create() to get dirmap support. Update the exec_op callback, which no longer allows memory map access. Memory map accesses are only available through the dirmap_read callback. Signed-off-by: Patrice Chotard <patrice.chotard@foss.st.com> Link: https://lore.kernel.org/r/20210419121541.11617-4-patrice.chotard@foss.st.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-04-19spi: stm32-qspi: Trigger DMA only if more than 4 bytes to transferPatrice Chotard
In order to optimize accesses to SPI flashes, trigger a DMA only if more than 4 bytes have to be transferred. DMA transfer preparation's cost becomes negligible above 4 bytes to transfer. Below this threshold, indirect transfer gives more throughput. mtd_speedtest shows that page write throughput increases:

  - from 779 to 853 KiB/s (~9.5%) with s25fl512s SPI-NOR.
  - from 5283 to 5666 KiB/s (~7.25%) with Micron SPI-NAND.

Signed-off-by: Christophe Kerello <christophe.kerello@foss.st.com> Signed-off-by: Patrice Chotard <patrice.chotard@foss.st.com> Link: https://lore.kernel.org/r/20210419121541.11617-3-patrice.chotard@foss.st.com Signed-off-by: Mark Brown <broonie@kernel.org>
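The threshold logic amounts to something like the following sketch (helper and field names are assumptions for illustration, not the driver's exact code):

    #define STM32_QSPI_DMA_MIN_BYTES 4

    /* Use DMA only when the transfer is large enough to amortize the DMA
     * setup cost; otherwise fall back to indirect (polled) transfers. */
    static int stm32_qspi_tx(struct stm32_qspi *qspi, const struct spi_mem_op *op)
    {
            if (!op->data.nbytes)
                    return 0;

            if (op->data.nbytes > STM32_QSPI_DMA_MIN_BYTES && qspi->dma_ch)
                    return stm32_qspi_tx_dma(qspi, op);

            return stm32_qspi_tx_poll(qspi, op);
    }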
2021-04-19spi: stm32-qspi: fix pm_runtime usage_count counterChristophe Kerello
The pm_runtime usage_count counter is not well managed. The pm_runtime_put_autosuspend callback drops the usage counter, but it has never been increased. Add a pm_runtime_get_sync callback to bump up the usage counter. It is also needed to use the pm_runtime_force_suspend and pm_runtime_force_resume APIs to handle the clock properly. Fixes: 9d282c17b023 ("spi: stm32-qspi: Add pm_runtime support") Signed-off-by: Christophe Kerello <christophe.kerello@foss.st.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210419121541.11617-2-patrice.chotard@foss.st.com Signed-off-by: Mark Brown <broonie@kernel.org>
2021-04-19x86/build: Disable HIGHMEM64G selection for M486SXMaciej W. Rozycki
Fix a regression caused by making the 486SX separately selectable in Kconfig, for which the HIGHMEM64G setting has not been updated and therefore has become exposed as a user-selectable option for the M486SX configuration setting unlike with original M486 and all the other settings that choose non-PAE-enabled processors:

  High Memory Support
  > 1. off (NOHIGHMEM)
    2. 4GB (HIGHMEM4G)
    3. 64GB (HIGHMEM64G)
  choice[1-3?]:

With the fix in place the setting is now correctly removed:

  High Memory Support
  > 1. off (NOHIGHMEM)
    2. 4GB (HIGHMEM4G)
  choice[1-2?]:

[ bp: Massage commit message. ]

Fixes: 87d6021b8143 ("x86/math-emu: Limit MATH_EMULATION to 486SX compatibles")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.5+
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.2104141221340.44318@angie.orcam.me.uk
2021-04-19m68k: sun3x: Remove unneeded semicolonWan Jiabing
Fix the following coccicheck warning: ./arch/m68k/include/asm/sun3xflop.h:109:2-3: Unneeded semicolon Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Link: https://lore.kernel.org/r/20210415031450.23379-1-wanjiabing@vivo.com Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2021-04-19platform/x86: touchscreen_dmi: Add info for the Teclast Tbook 11 tabletHans de Goede
Add touchscreen info for the Teclast Tbook 11 tablet. This includes info for getting the firmware directly from the UEFI, so that the user does not need to manually install the firmware in /lib/firmware/silead. This change will make the touchscreen on these devices work OOTB, without requiring any manual setup. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210417173105.4134-1-hdegoede@redhat.com
2021-04-19platform/x86: intel_pmc_core: Add support for Alder Lake PCH-PDavid E. Box
Alder PCH-P is based on Tiger Lake PCH. Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com> Link: https://lore.kernel.org/r/20210417031252.3020837-10-david.e.box@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-04-19platform/x86: intel_pmc_core: Add LTR registers for Tiger LakeGayatri Kammela
Just like Ice Lake, Tiger Lake uses Cannon Lake's LTR information and supports a few additional registers. Hence add the LTR registers specific to Tiger Lake to the cnp_ltr_show_map[]. Also adjust the number of LTR IPs for Tiger Lake to the correct amount. Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com> Link: https://lore.kernel.org/r/20210417031252.3020837-9-david.e.box@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-04-19platform/x86: intel_pmc_core: Add option to set/clear LPM modeDavid E. Box
By default the Low Power Mode (LPM or sub-state) status registers will latch condition status on every entry into Package C10. This is configurable in the PMC to allow latching on any achievable sub-state. Add a debugfs file to support this. Also add the option to clear the status registers to 0. Clearing the status registers before testing removes ambiguity around when the current values were set. The new file, latch_lpm_mode, looks like this: [c10] S0i2.0 S0i3.0 S0i2.1 S0i3.1 S0i3.2 clear Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210417031252.3020837-8-david.e.box@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-04-19platform/x86: intel_pmc_core: Add requirements file to debugfsGayatri Kammela
Add the debugfs file, substate_requirements, to view the low power mode (LPM) requirements for each enabled mode alongside the last latched status of the condition. After this patch, the new file will look like this:

  Element                     |  S0i2.0  |  S0i3.0  |  S0i2.1  |  S0i3.1  |  S0i3.2  |  Status |
  USB2PLL_OFF_STS             | Required | Required | Required | Required | Required |         |
  PCIe/USB3.1_Gen2PLL_OFF_STS | Required | Required | Required | Required | Required |         |
  PCIe_Gen3PLL_OFF_STS        | Required | Required | Required | Required | Required |   Yes   |
  OPIOPLL_OFF_STS             | Required | Required | Required | Required | Required |   Yes   |
  OCPLL_OFF_STS               | Required | Required | Required | Required | Required |   Yes   |
  MainPLL_OFF_STS             |          | Required |          | Required | Required |         |

Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Co-developed-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210417031252.3020837-7-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-04-19platform/x86: intel_pmc_core: Get LPM requirements for Tiger LakeGayatri Kammela
Platforms that support low power modes (LPM) such as Tiger Lake maintain requirements for each sub-state that are readable in the PMC. However, unlike the LPM status registers, the requirement registers are not memory mapped but are available from an ACPI _DSM. Collect the requirements for Tiger Lake using the _DSM method and store them in a buffer. Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Co-developed-by: David E. Box <david.e.box@linux.intel.com> Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210417031252.3020837-6-david.e.box@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
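A rough sketch of fetching such a buffer through an ACPI _DSM (the GUID, revision and function index below are placeholders, not the ones the driver actually uses):

    #include <linux/acpi.h>

    /* Placeholder GUID: the real driver defines its own. */
    static const guid_t pmc_lpm_dsm_guid = GUID_INIT(0x00000000, 0x0000, 0x0000,
                    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00);

    static u32 *pmc_core_get_lpm_req(acpi_handle handle, int *len)
    {
            union acpi_object *out;
            u32 *buf = NULL;

            out = acpi_evaluate_dsm(handle, &pmc_lpm_dsm_guid, 0, 1, NULL);
            if (!out)
                    return NULL;

            if (out->type == ACPI_TYPE_BUFFER) {
                    buf = kmemdup(out->buffer.pointer, out->buffer.length, GFP_KERNEL);
                    if (buf)
                            *len = out->buffer.length;
            }

            ACPI_FREE(out);
            return buf;
    }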
2021-04-19platform/x86: intel_pmc_core: Show LPM residency in microsecondsGayatri Kammela
Modify the low power mode (LPM or sub-state) residency counters to display in microseconds just like the slp_s0_residency counter. The granularity of the counter is approximately 30.5us per tick; since that is not a whole number of microseconds, the conversion doubles this value (to 61) for the multiplication and then divides the result by two to maintain accuracy. Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com> Link: https://lore.kernel.org/r/20210417031252.3020837-5-david.e.box@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-04-19platform/x86: intel_pmc_core: Handle sub-states genericallyGayatri Kammela
The current implementation of pmc_core_substate_res_show() is written specifically for Tiger Lake. However, new platforms will also have sub-states and may support different modes. Therefore rewrite the code to handle sub-states generically.

Obtain the number and type of enabled states from the PMC. Use the Low Power Mode (LPM) priority register to store the states in order from shallowest to deepest for display. Add a for_each macro to simplify this. While changing the sub-state display it makes sense to show only the "enabled" sub-states instead of showing all possible ones.

After this patch, the debugfs file looks like this:

  Substate   Residency
  S0i2.0     0
  S0i3.0     0
  S0i2.1     9329279
  S0i3.1     0
  S0i3.2     0

Suggested-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com>
Link: https://lore.kernel.org/r/20210417031252.3020837-4-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>