|
This commit resolves a minor bug in the selection/discovery of more
specific USB device drivers for devices that are currently bound to
generic USB device drivers.
The bug is related to the way a candidate USB device driver is
compared against the generic USB device driver. The code in
is_dev_usb_generic_driver() assumes that the device driver in question
is a USB device driver by calling to_usb_device_driver(dev->driver)
to downcast; however, code instrumentation has shown that this assumption
does not always hold.
This commit avoids the incorrect downcast altogether by comparing
the USB device's driver (i.e., dev->driver) to the generic USB
device driver directly. This method was suggested by Alan Stern.
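A minimal sketch of such a direct comparison, assuming the usb_generic_driver
instance exported by usbcore (illustrative only, not necessarily the exact
hunk that was applied):
    /* Compare the bound driver against the generic USB device driver
     * directly instead of downcasting dev->driver. */
    static bool is_dev_usb_generic_driver(struct device *dev)
    {
        return dev->driver == &usb_generic_driver.drvwrap.driver;
    }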
This bug was found while investigating Andrey Konovalov's report
indicating usbip device driver misbehaviour with the recently merged
generic USB device driver selection feature. The report is linked
below.
Fixes: d5643d2249b2 ("USB: Fix device driver race")
Cc: <stable@vger.kernel.org> # 5.8
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: <syzkaller@googlegroups.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-4-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This commit resolves a bug in the selection/discovery of more
specific USB device drivers for devices that are currently bound to
generic USB device drivers.
The bug is in the logic that determines whether a device currently
bound to a generic USB device driver should be re-probed by a
more specific USB device driver or not. The code in
__usb_bus_reprobe_drivers() used to have the following lines:
if (usb_device_match_id(udev, new_udriver->id_table) == NULL &&
(!new_udriver->match || new_udriver->match(udev) != 0))
return 0;
ret = device_reprobe(dev);
As the reader will notice, the code checks whether the USB device in
consideration matches the identifier table (id_table) of a specific
USB device_driver (new_udriver), followed by a similar check, but this
time with the USB device driver's match function. However, the match
function's return value is not checked correctly. When match() returns
zero, it means that the specific USB device driver is *not* applicable
to the USB device in question, but the code then goes on to reprobe the
device with the new USB device driver under consideration. All this to
say, the logic is inverted.
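A sketch of the corrected condition (the key change being that match()
returning zero, i.e. "does not apply", is what should lead to skipping the
re-probe):
    if (usb_device_match_id(udev, new_udriver->id_table) == NULL &&
        (!new_udriver->match || new_udriver->match(udev) == 0))
        return 0;
    ret = device_reprobe(dev);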
This bug was found by code inspection and instrumentation while
investigating the root cause of the issue reported by Andrey Konovalov,
where usbip took over syzkaller's virtual USB devices in an undesired
manner. The report is linked below.
Fixes: d5643d2249b2 ("USB: Fix device driver race")
Cc: <stable@vger.kernel.org> # 5.8
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: <syzkaller@googlegroups.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-3-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This commit reverts commit 7a2f2974f265 ("usbip: Implement a match
function to fix usbip").
In summary, commit d5643d2249b2 ("USB: Fix device driver race")
inadvertently broke usbip functionality, which I resolved in an incorrect
manner by introducing a match function to usbip, usbip_match(), that
unconditionally returns true.
However, the usbip_match function, as is, causes usbip to take over
virtual devices used by syzkaller for USB fuzzing, which is a regression
reported by Andrey Konovalov.
Furthermore, in conjunction with the fix of another bug, handled by another
patch titled "usbcore/driver: Fix specific driver selection" in this patch
set, the usbip_match function causes unexpected USB subsystem behaviour
when the usbip_host driver is loaded. The unexpected behaviour can be
summarized as follows:
- If commit 41160802ab8e ("USB: Simplify USB ID table match") is included
in the kernel, then all USB devices are bound to the usbip_host
driver, which appears to the user as if all USB devices were
disconnected.
- If the same commit (41160802ab8e) is not in the kernel (as is the case
with v5.8.10) then all USB devices are re-probed and re-bound to their
original device drivers, which appears to the user as a disconnection
and re-connection of USB devices.
Please note that this commit will make usbip non-operational again,
until yet another patch in this patch set is merged, titled
"usbcore/driver: Accommodate usbip".
Cc: <stable@vger.kernel.org> # 5.8: 41160802ab8e: USB: Simplify USB ID table match
Cc: <stable@vger.kernel.org> # 5.8
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <syzkaller@googlegroups.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-2-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed. Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During ice_vsi_setup, if ice_cfg_vsi_lan fails, it does not properly
release memory associated with the VSI rings. If we had used devres
allocations for the rings, this would be ok. However, we use kzalloc and
kfree_rcu for these ring structures.
Using the correct label to cleanup the rings during ice_vsi_setup
highlights an issue in the ice_vsi_clear_rings function: it can leave
behind stale ring pointers in the q_vectors structure.
When releasing rings, we must also ensure that no q_vector associated
with the VSI will point to this ring again. To resolve this, loop over
all q_vectors and release their ring mapping. Because we are about to
free all rings, no q_vector should remain pointing to any of the rings
in this VSI.
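A hedged sketch of that cleanup loop (field names as commonly used in the
ice driver; treat as illustrative rather than the exact hunk):
    /* Before freeing the rings, make sure no q_vector of this VSI
     * still points at any of them. */
    for (i = 0; i < vsi->num_q_vectors; i++) {
        vsi->q_vectors[i]->tx.ring = NULL;
        vsi->q_vectors[i]->rx.ring = NULL;
    }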
Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The ice_setup_pf_sw function can cause a memory leak if register_netdev
fails, due to accidentally failing to free the VSI rings. Fix the memory
leak by using ice_vsi_release, ensuring we actually go through the full
teardown process.
This should be safe even if the netdevice is not registered because we
will have set the netdev pointer to NULL, ensuring ice_vsi_release won't
call unregister_netdev.
An alternative fix would be moving management of the PF VSI netdev into
the main VSI setup code. This is complicated and likely requires a
significant refactor of how we manage VSIs.
Fixes: 3a858ba392c3 ("ice: Add support for VSI allocation and deallocation")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
It appears that the ice_suspend flow is missing a call to pci_save_state
and this is triggering the message "State of device not saved by
ice_suspend" and a call trace. Fix it.
Fixes: 769c500dcc1e ("ice: Add advanced power mgmt for WoL")
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When calling iavf_resume there was a crash because the wrong function
was used to get the iavf_adapter and net_device pointers.
Change how iavf_resume gets the iavf_adapter and net_device
pointers from the pci_dev.
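A hypothetical sketch of the corrected lookups, assuming the pci_dev's
driver data is the net_device set at probe time:
    struct net_device *netdev = pci_get_drvdata(pdev);
    struct iavf_adapter *adapter = netdev_priv(netdev);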
Fixes: 5eae00c57f5e ("i40evf: main driver core")
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness, but one w/o any debt but with an active delay will have to wait
until its delay decays out.
This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.
Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.
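A sketch of the widened trigger condition (field names as in blk-iocost;
illustrative only):
    /* Act when either debt or delay is active. */
    if (!iocg->abs_vdebt && !iocg->delay)
        continue;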
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.
This patch reimplements debt forgiveness so that:
* Use the average usage over the forgiveness period instead of counting
consecutive periods.
* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
ioc period duration.
* Usage threshold is raised to 50%. Combined with the preceding changes and
the switch to average usage, this makes debt forgiveness a lot more
effective at reducing the amount of unnecessary idleness.
* Constants are renamed with DFGV_ prefix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.
Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.
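A sketch of the change (illustrative; the halving shown is the existing
debt reduction):
    iocg->abs_vdebt >>= 1;   /* existing debt reduction */
    iocg->delay = 0;         /* let iocg_kick_waitq() recompute the delay */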
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.
Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.
Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.
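A sketch of the new gating (illustrative):
    /* Don't forgive debts while the device looks saturated. */
    if (ioc->busy_level > 0)
        return;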
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux
Pull devfreq updates for 5.9-rc7 from Chanwoo Choi:
"1. Update devfreq core
- Add missing timer type to devfreq_summary debugfs node.
2. Fix devfreq device driver
- Fix the exception handling about clock on tegra30-devfreq.c"
* tag 'devfreq-fixes-for-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux:
PM / devfreq: tegra30: Disable clock on error in probe
PM / devfreq: Add timer type to devfreq_summary debugfs
|
|
Add the DM target feature flag DM_TARGET_NOWAIT, which advertises that
the target works with REQ_NOWAIT bios.
Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.
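A hedged sketch of a bio-based target advertising the new capability (the
target shown is hypothetical):
    static struct target_type example_target = {
        .name     = "example-nowait",
        .version  = {1, 0, 0},
        .features = DM_TARGET_NOWAIT,
        .module   = THIS_MODULE,
        /* .ctr/.dtr/.map as usual for the target */
    };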
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.
Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.
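A minimal sketch of how a bio-based driver would opt in (the queue variable
is illustrative):
    blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);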
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to go through the hd_struct to find the partition number.
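A sketch of the simplification (illustrative):
    /* before */
    partno = bdev->bd_part->partno;
    /* after */
    partno = bdev->bd_partno;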
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to go through the hd_struct to find the partition number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_contains is never NULL for an open block device. In addition, ibd_bd
is always set to a block device that was exclusively opened by the
target code, so the holder is guaranteed to be ib_dev as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The ->bd_contains field is set by __blkdev_get and drivers have no
business manipulating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_disk is set on all block devices, including those for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_disk is set on all block devices, including those for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a little helper to make the somewhat arcane bd_contains checks a
little more obvious.
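One plausible shape for such a helper (a sketch; the helper name and exact
test are assumptions here):
    static inline bool bdev_is_partition(struct block_device *bdev)
    {
        return bdev->bd_partno != 0;
    }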
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_contains is an implementation detail and should not be mentioned in
a userspace API documentation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
commit 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE") removed the
REQ_NOWAIT_INLINE related code, but the diff wasn't applied to
blk_types.h somehow.
Then commit 2771cefeac49 ("block: remove the REQ_NOWAIT_INLINE flag")
removed the REQ_NOWAIT_INLINE flag while the BLK_QC_T_EAGAIN flag still
remains.
Fixes: 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.10/drivers
Pull MD updates from Song.
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid10: improve discard request for far layout
md/raid10: improve raid10 discard request
md/raid10: pull codes that wait for blocked dev into one function
md/raid10: extend r10bio devs to raid disks
md: add md_submit_discard_bio() for submitting discard bio
md: Simplify code with existing definition RESYNC_SECTORS in raid10.c
md/raid5: reallocate page array after setting new stripe_size
md/raid5: resize stripe_head when reshape array
md/raid5: let multiple devices of stripe_head share page
md/raid6: let async recovery function support different page offset
md/raid6: let syndrome computor support different page offset
md/raid5: convert to new xor compution interface
md/raid5: add new xor function to support different page offset
md/raid5: make async_copy_data() to support different page offset
md/raid5: add a new member of offset into r5dev
md: only calculate blocksize once and use i_blocksize()
|
|
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org
Fixes: e62753e4e292 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
toggled inadvertently skipped the MMU context reset due to the mask
of bits that triggers PDPTR loads also being used to trigger MMU context
resets.
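A hedged sketch of the idea (the mask below is illustrative rather than the
exact upstream constant):
    unsigned long mmu_role_bits = X86_CR4_SMEP | X86_CR4_SMAP |
                                  X86_CR4_PKE | X86_CR4_PGE |
                                  X86_CR4_PAE | X86_CR4_PSE;
    if ((cr4 ^ old_cr4) & mmu_role_bits)
        kvm_mmu_reset_context(vcpu);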
Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Based on Google-internal RSEQ work done by Paul Turner and Andrew
Hunter.
This patch adds a selftest for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
The test quite often fails without the previous patch in this
patchset, but consistently passes with it.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-3-posk@google.com
|
|
This patch adds rseq_offset_deref_addv() function to
tools/testing/selftests/rseq/rseq-x86.h, to be used in a selftest in
the next patch in the patchset.
Once an architecture adds support for this function, it should define
"RSEQ_ARCH_HAS_OFFSET_DEREF_ADDV".
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-2-posk@google.com
|
|
This patchset is based on Google-internal RSEQ work done by Paul
Turner and Andrew Hunter.
When working with per-CPU RSEQ-based memory allocations, it is
sometimes important to make sure that a global memory location is no
longer accessed from RSEQ critical sections. For example, there can be
two per-CPU lists, one is "active" and accessed per-CPU, while another
one is inactive and worked on asynchronously "off CPU" (e.g. garbage
collection is performed). Then at some point the two lists are
swapped, and a fast RCU-like mechanism is required to make sure that
the previously active list is no longer accessed.
This patch introduces such a mechanism: in short, membarrier() syscall
issues an IPI to a CPU, restarting a potentially active RSEQ critical
section on the CPU.
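A hedged userspace usage sketch of the new command (error handling omitted):
    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    /* One-time registration for the calling process. */
    syscall(__NR_membarrier,
            MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
    /* ... swap the active/inactive per-CPU lists ... */
    /* Restart any RSEQ critical section still running elsewhere so the
     * previously active list is no longer accessed. */
    syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);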
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
|
|
Barry Song noted the following
Something is wrong. In find_busiest_group(), we are checking if
src has higher load, however, in task_numa_find_cpu(), we are
checking if dst will have higher load after balancing. It seems
it is not sensible to check src.
It may produce a wrong imbalance value; for example,
if dst_running = env->dst_stats.nr_running + 1 results in 3 or
above, and src_running = env->src_stats.nr_running - 1 results
in 1, the current code treats the imbalance as 0 since src_running is
smaller than 2. This is inconsistent with the load balancer.
Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.
This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases. For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2 socket box with 80 cpus.
specjbb
5.9.0-rc4 5.9.0-rc4
vanilla dstbalance-v1r1
Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*
The performance at the mid-point is better but not universally better. The
patch is a mixed bag depending on the workload, machine and overall
levels of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this area and more people seem to prefer
the patch, it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle down.
Jirka Hladky added the following observation.
Our results are mostly in line with what you see. We observe
big gains (20-50%) when the system is loaded to 1/3 of the
maximum capacity and mixed results at the full load - some
workloads benefit from the patch at the full load, others not,
but performance changes at the full load are mostly within the
noise of results (+/-5%). Overall, we think this patch is helpful.
[mgorman@techsingularity.net: Rewrote changelog]
Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
|
|
The busy_factor, which increases the load balance interval when a CPU is busy,
is set to 32 by default. This value generates some huge LB intervals on a
large system like the THX2, made of 2 nodes x 28 cores x 4 threads.
For such a system, the interval increases from 112ms to 3584ms at the MC
level, and from 228ms to 7168ms at the NUMA level.
Even on smaller systems, a lower busy factor has shown an improvement in the
fair distribution of the running time, so let's reduce it for all.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
|
|
Sched domains tend to trigger the load balance loop simultaneously, but the
larger domains often need more time to collect statistics. This slowness
makes the larger domains try to detach tasks from a rq whereas the tasks
have already migrated somewhere else at a sub-domain level. This is not a
real problem for idle LB because the period of the smaller domains will
increase as their CPUs get busy, which leaves time for the higher ones
to pull tasks. But it becomes a problem when all CPUs are already busy,
because all domains stay synced when they trigger their LB.
A simple way to minimize simultaneous LB of all domains is to decrement
the busy interval by 1 jiffy. Because of the busy_factor, the interval of
a larger domain will no longer be a multiple of the smaller ones.
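A sketch of where this lands in get_sd_balance_interval() (illustrative):
    /* scale ms to jiffies */
    interval = msecs_to_jiffies(interval);
    /* Keep busy intervals of nested domains from being exact
     * multiples of each other. */
    if (cpu_busy)
        interval -= 1;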
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
|
|
The 25% default imbalance threshold for the DIE and NUMA domains is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The 20% imbalance
between the 2 groups of 4 cores is just low enough not to trigger
load balancing between the 2 groups. We will always have the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUs. With fair time sharing in each group, we end up with
+20% running time for the group of 5 threads.
Consider decreasing the imbalance threshold for the overloaded case, where
we use the load to balance tasks and to ensure fair time sharing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
|
|
Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced, and
the load balancer currently migrates the waiting task between the CPUs in an
almost random manner. The success of a rq pulling a task depends on the
value of nr_balance_failed of its domains and on its ability to detach the
task faster than others. This behavior results in an unfair distribution
of the running time between tasks because some CPUs will run most of the
time, if not always, the same task whereas others will share their time
between several tasks.
Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance fight.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
|
|
In fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used and
sometimes update_tg_load_avg(cfs_rq, false) is used.
update_tg_load_avg() has a 'force' parameter, but the current code
never sets it to 1 or true, so remove the force parameter.
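The resulting call shape (a sketch):
    /* before */
    update_tg_load_avg(cfs_rq, 0);
    /* after */
    update_tg_load_avg(cfs_rq);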
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
|
|
We've hit problems in our production environment where tasks with a full
cpumask (e.g. from being placed into a cpuset or having full affinity set)
were occasionally migrated to our isolated CPUs.
After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.
Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
(thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.
Fix it by checking the valid domain mask in select_idle_smt().
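A hedged sketch of the added check inside the sibling scan of
select_idle_smt() (surrounding code simplified):
    for_each_cpu(cpu, cpu_smt_mask(target)) {
        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
            !cpumask_test_cpu(cpu, sched_domain_span(sd)))
            continue;
        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
            return cpu;
    }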
Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
|
|
There is no caller in tree, so it can be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
|
|
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time, while leaving more space for non-real-time tasks to run on the CPUs
that lend rt_runtime.
The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.
This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.
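The change amounts to flipping the default in kernel/sched/features.h
(sketch):
    SCHED_FEAT(RT_RUNTIME_SHARE, false)
Users who want the old behaviour can still re-enable it at runtime through
the sched_features debugfs interface.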
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
|
|
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.
Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is runnable,
but not actually on the rq, and neither an immediate replenishment happens,
nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.
Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.
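A sketch of the fix idea (placement illustrative):
    /* When skipping the replenish because the task is leaving
     * SCHED_DEADLINE, also clear the throttled state so a later boost
     * does not find a stale dl_throttled flag. */
    p->dl.dl_throttled = 0;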
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
|
|
Use runnable_avg to classify numa node state similarly to what is done for
the normal load balancer. This helps ensure that the numa and normal
balancers use the same view of the state of the system.
Large arm64 system: 2 nodes / 224 CPUs:
hackbench -l (256000/#grp) -g #grp
grp    tip/sched/core        +patchset             improvement
1      14.008(+/- 4.99 %)    13.800(+/- 3.88 %)    1.48 %
4      4.340(+/- 5.35 %)     4.283(+/- 4.85 %)     1.33 %
16     3.357(+/- 0.55 %)     3.359(+/- 0.54 %)    -0.06 %
32     3.050(+/- 0.94 %)     3.039(+/- 1.06 %)     0.38 %
64     2.968(+/- 1.85 %)     3.006(+/- 2.92 %)    -1.27 %
128    3.290(+/-12.61 %)     3.108(+/- 5.97 %)     5.51 %
256    3.235(+/- 3.95 %)     3.188(+/- 2.83 %)     1.45 %
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
|
|
The struct of_device_id is not defined with !CONFIG_OF, so its forward
declaration should be hidden as well. This addresses the following clang
compile warning:
drivers/mmc/host/sdhci-s3c.c:464:34: warning: tentative array definition assumed to have one element
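A sketch of the resulting pattern (the match-table name is an assumption
here):
    #ifdef CONFIG_OF
    static const struct of_device_id sdhci_s3c_dt_match[];
    #endif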
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200925072532.10272-1-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Fix inconsistent indenting, reported by Smatch:
drivers/mmc/host/sdhci-esdhc-imx.c:1380 sdhci_esdhc_imx_hwinit() warn: inconsistent indenting
drivers/mmc/host/sdhci-sprd.c:390 sdhci_sprd_request_done() warn: inconsistent indenting
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200923153739.30327-2-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The 'struct mmc_host *mmc' comes from drvdata set at the end of probe,
so it cannot be NULL. The code already dereferences it a few lines before
the check with mmc_priv(). This also fixes a smatch warning:
drivers/mmc/host/moxart-mmc.c:692 moxart_remove() warn: variable dereferenced before check 'mmc' (see line 688)
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200923153739.30327-1-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The MMC core has now a generic check if some tuning is in progress. Its
protected area is a bit larger than the custom one in this driver but we
concluded that this works equally well for the intended case. So, drop
the local flag and switch to the generic one.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Tested-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://lore.kernel.org/r/20200922172253.4458-1-wsa@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Simplify the return expression.
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Link: https://lore.kernel.org/r/20200921131042.92340-1-miaoqinglang@huawei.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Add documentation for mmc_hw_reset to make sure the intended use case is
clear.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20200918215446.65654-1-wsa+renesas@sang-engineering.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|