|
This commit resolves a minor bug in the selection/discovery of more
specific USB device drivers for devices that are currently bound to
generic USB device drivers.
The bug is related to the way a candidate USB device driver is
compared against the generic USB device driver. The code in
is_dev_usb_generic_driver() assumes that the device driver in question
is a USB device driver by calling to_usb_device_driver(dev->driver)
to downcast; however, code instrumentation has shown that this assumption
does not always hold.
This commit avoids the incorrect downcast altogether by comparing
the USB device's driver (i.e., dev->driver) to the generic USB
device driver directly. This method was suggested by Alan Stern.
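A minimal sketch of such a direct comparison, assuming the usb_generic_driver
instance exported by usbcore (illustrative only, not necessarily the exact
hunk that was applied):
    /* Compare the bound driver against the generic USB device driver
     * directly instead of downcasting dev->driver. */
    static bool is_dev_usb_generic_driver(struct device *dev)
    {
        return dev->driver == &usb_generic_driver.drvwrap.driver;
    }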
This bug was found while investigating Andrey Konovalov's report
indicating usbip device driver misbehaviour with the recently merged
generic USB device driver selection feature. The report is linked
below.
Fixes: d5643d2249b2 ("USB: Fix device driver race")
Cc: <stable@vger.kernel.org> # 5.8
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: <syzkaller@googlegroups.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-4-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This commit resolves a bug in the selection/discovery of more
specific USB device drivers for devices that are currently bound to
generic USB device drivers.
The bug is in the logic that determines whether a device currently
bound to a generic USB device driver should be re-probed by a
more specific USB device driver or not. The code in
__usb_bus_reprobe_drivers() used to have the following lines:
if (usb_device_match_id(udev, new_udriver->id_table) == NULL &&
(!new_udriver->match || new_udriver->match(udev) != 0))
return 0;
ret = device_reprobe(dev);
As the reader will notice, the code checks whether the USB device in
consideration matches the identifier table (id_table) of a specific
USB device_driver (new_udriver), followed by a similar check, but this
time with the USB device driver's match function. However, the match
function's return value is not checked correctly. When match() returns
zero, it means that the specific USB device driver is *not* applicable
to the USB device in question, but the code then goes on to reprobe the
device with the new USB device driver under consideration. All this to
say, the logic is inverted.
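A sketch of the corrected condition (the key change being that match()
returning zero, i.e. "does not apply", is what should lead to skipping the
re-probe):
    if (usb_device_match_id(udev, new_udriver->id_table) == NULL &&
        (!new_udriver->match || new_udriver->match(udev) == 0))
        return 0;
    ret = device_reprobe(dev);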
This bug was found by code inspection and instrumentation while
investigating the root cause of the issue reported by Andrey Konovalov,
where usbip took over syzkaller's virtual USB devices in an undesired
manner. The report is linked below.
Fixes: d5643d2249b2 ("USB: Fix device driver race")
Cc: <stable@vger.kernel.org> # 5.8
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: <syzkaller@googlegroups.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-3-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This commit reverts commit 7a2f2974f265 ("usbip: Implement a match
function to fix usbip").
In summary, commit d5643d2249b2 ("USB: Fix device driver race")
inadvertently broke usbip functionality, which I resolved in an incorrect
manner by introducing a match function to usbip, usbip_match(), that
unconditionally returns true.
However, the usbip_match function, as is, causes usbip to take over
virtual devices used by syzkaller for USB fuzzing, which is a regression
reported by Andrey Konovalov.
Furthermore, in conjunction with the fix of another bug, handled by another
patch titled "usbcore/driver: Fix specific driver selection" in this patch
set, the usbip_match function causes unexpected USB subsystem behaviour
when the usbip_host driver is loaded. The unexpected behaviour can be
summarized as follows:
- If commit 41160802ab8e ("USB: Simplify USB ID table match") is included
in the kernel, then all USB devices are bound to the usbip_host
driver, which appears to the user as if all USB devices were
disconnected.
- If the same commit (41160802ab8e) is not in the kernel (as is the case
with v5.8.10) then all USB devices are re-probed and re-bound to their
original device drivers, which appears to the user as a disconnection
and re-connection of USB devices.
Please note that this commit will make usbip non-operational again,
until yet another patch in this patch set is merged, titled
"usbcore/driver: Accommodate usbip".
Cc: <stable@vger.kernel.org> # 5.8: 41160802ab8e: USB: Simplify USB ID table match
Cc: <stable@vger.kernel.org> # 5.8
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: <syzkaller@googlegroups.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: M. Vefa Bicakci <m.v.b@runbox.com>
Link: https://lore.kernel.org/r/20200922110703.720960-2-m.v.b@runbox.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed. Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During ice_vsi_setup, if ice_cfg_vsi_lan fails, it does not properly
release memory associated with the VSI rings. If we had used devres
allocations for the rings, this would be ok. However, we use kzalloc and
kfree_rcu for these ring structures.
Using the correct label to cleanup the rings during ice_vsi_setup
highlights an issue in the ice_vsi_clear_rings function: it can leave
behind stale ring pointers in the q_vectors structure.
When releasing rings, we must also ensure that no q_vector associated
with the VSI will point to this ring again. To resolve this, loop over
all q_vectors and release their ring mapping. Because we are about to
free all rings, no q_vector should remain pointing to any of the rings
in this VSI.
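A hedged sketch of that cleanup loop (field names as commonly used in the
ice driver; treat as illustrative rather than the exact hunk):
    /* Before freeing the rings, make sure no q_vector of this VSI
     * still points at any of them. */
    for (i = 0; i < vsi->num_q_vectors; i++) {
        vsi->q_vectors[i]->tx.ring = NULL;
        vsi->q_vectors[i]->rx.ring = NULL;
    }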
Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The ice_setup_pf_sw function can cause a memory leak if register_netdev
fails, due to accidentally failing to free the VSI rings. Fix the memory
leak by using ice_vsi_release, ensuring we actually go through the full
teardown process.
This should be safe even if the netdevice is not registered because we
will have set the netdev pointer to NULL, ensuring ice_vsi_release won't
call unregister_netdev.
An alternative fix would be moving management of the PF VSI netdev into
the main VSI setup code. This is complicated and likely requires a
significant refactor of how we manage VSIs.
Fixes: 3a858ba392c3 ("ice: Add support for VSI allocation and deallocation")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
It appears that the ice_suspend flow is missing a call to pci_save_state
and this is triggering the message "State of device not saved by
ice_suspend" and a call trace. Fix it.
Fixes: 769c500dcc1e ("ice: Add advanced power mgmt for WoL")
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When calling iavf_resume there was a crash because the wrong function
was used to get the iavf_adapter and net_device pointers.
Change how iavf_resume gets the iavf_adapter and net_device
pointers from the pci_dev.
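A hypothetical sketch of the corrected lookups, assuming the pci_dev's
driver data is the net_device set at probe time:
    struct net_device *netdev = pci_get_drvdata(pdev);
    struct iavf_adapter *adapter = netdev_priv(netdev);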
Fixes: 5eae00c57f5e ("i40evf: main driver core")
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness, but one w/o any debt but with an active delay will have to wait
until its delay decays out.
This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.
Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.
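A sketch of the widened trigger condition (field names as in blk-iocost;
illustrative only):
    /* Act when either debt or delay is active. */
    if (!iocg->abs_vdebt && !iocg->delay)
        continue;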
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.
This patch reimplements debt forgiveness so that:
* Use the average usage over the forgiveness period instead of counting
consecutive periods.
* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
ioc period duration.
* Usage threshold is raised to 50%. Combined with the preceding changes and
the switch to average usage, this makes debt forgiveness a lot more
effective at reducing the amount of unnecessary idleness.
* Constants are renamed with DFGV_ prefix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.
Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.
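A sketch of the change (illustrative; the halving shown is the existing
debt reduction):
    iocg->abs_vdebt >>= 1;   /* existing debt reduction */
    iocg->delay = 0;         /* let iocg_kick_waitq() recompute the delay */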
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.
Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.
Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.
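A sketch of the new gating (illustrative):
    /* Don't forgive debts while the device looks saturated. */
    if (ioc->busy_level > 0)
        return;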
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux
Pull devfreq updates for 5.9-rc7 from Chanwoo Choi:
"1. Update devfreq core
- Add missing timer type to devfreq_summary debugfs node.
2. Fix devfreq device driver
- Fix the exception handling about clock on tegra30-devfreq.c"
* tag 'devfreq-fixes-for-5.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux:
PM / devfreq: tegra30: Disable clock on error in probe
PM / devfreq: Add timer type to devfreq_summary debugfs
|
|
Add the DM target feature flag DM_TARGET_NOWAIT, which advertises that
the target works with REQ_NOWAIT bios.
Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.
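A hedged sketch of a bio-based target advertising the new capability (the
target shown is hypothetical):
    static struct target_type example_target = {
        .name     = "example-nowait",
        .version  = {1, 0, 0},
        .features = DM_TARGET_NOWAIT,
        .module   = THIS_MODULE,
        /* .ctr/.dtr/.map as usual for the target */
    };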
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.
Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.
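A minimal sketch of how a bio-based driver would opt in (the queue variable
is illustrative):
    blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);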
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to go through the hd_struct to find the partition number.
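A sketch of the simplification (illustrative):
    /* before */
    partno = bdev->bd_part->partno;
    /* after */
    partno = bdev->bd_partno;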
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to go through the hd_struct to find the partition number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_contains is never NULL for an open block device. In addition, ibd_bd
is always set to a block device that was exclusively opened by the
target code, so the holder is guaranteed to be ib_dev as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The ->bd_contains field is set by __blkdev_get and drivers have no
business manipulating it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_disk is set on all block devices, including those for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_disk is set on all block devices, including those for partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a little helper to make the somewhat arcane bd_contains checks a
little more obvious.
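One plausible shape for such a helper (a sketch; the helper name and exact
test are assumptions here):
    static inline bool bdev_is_partition(struct block_device *bdev)
    {
        return bdev->bd_partno != 0;
    }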
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bd_contains is an implementation detail and should not be mentioned in
a userspace API documentation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
commit 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE") removed the
REQ_NOWAIT_INLINE related code, but the diff wasn't applied to
blk_types.h somehow.
Then commit 2771cefeac49 ("block: remove the REQ_NOWAIT_INLINE flag")
removed the REQ_NOWAIT_INLINE flag while the BLK_QC_T_EAGAIN flag still
remains.
Fixes: 7b6620d7db56 ("block: remove REQ_NOWAIT_INLINE")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.10/drivers
Pull MD updates from Song.
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid10: improve discard request for far layout
md/raid10: improve raid10 discard request
md/raid10: pull codes that wait for blocked dev into one function
md/raid10: extend r10bio devs to raid disks
md: add md_submit_discard_bio() for submitting discard bio
md: Simplify code with existing definition RESYNC_SECTORS in raid10.c
md/raid5: reallocate page array after setting new stripe_size
md/raid5: resize stripe_head when reshape array
md/raid5: let multiple devices of stripe_head share page
md/raid6: let async recovery function support different page offset
md/raid6: let syndrome computor support different page offset
md/raid5: convert to new xor compution interface
md/raid5: add new xor function to support different page offset
md/raid5: make async_copy_data() to support different page offset
md/raid5: add a new member of offset into r5dev
md: only calculate blocksize once and use i_blocksize()
|
|
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org
Fixes: e62753e4e292 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
toggled inadvertently skipped the MMU context reset due to the mask
of bits that triggers PDPTR loads also being used to trigger MMU context
resets.
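A hedged sketch of the idea (the mask below is illustrative rather than the
exact upstream constant):
    unsigned long mmu_role_bits = X86_CR4_SMEP | X86_CR4_SMAP |
                                  X86_CR4_PKE | X86_CR4_PGE |
                                  X86_CR4_PAE | X86_CR4_PSE;
    if ((cr4 ^ old_cr4) & mmu_role_bits)
        kvm_mmu_reset_context(vcpu);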
Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Based on Google-internal RSEQ work done by Paul Turner and Andrew
Hunter.
This patch adds a selftest for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
The test quite often fails without the previous patch in this
patchset, but consistently passes with it.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-3-posk@google.com
|
|
This patch adds rseq_offset_deref_addv() function to
tools/testing/selftests/rseq/rseq-x86.h, to be used in a selftest in
the next patch in the patchset.
Once an architecture adds support for this function, it should define
"RSEQ_ARCH_HAS_OFFSET_DEREF_ADDV".
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-2-posk@google.com
|
|
This patchset is based on Google-internal RSEQ work done by Paul
Turner and Andrew Hunter.
When working with per-CPU RSEQ-based memory allocations, it is
sometimes important to make sure that a global memory location is no
longer accessed from RSEQ critical sections. For example, there can be
two per-CPU lists, one is "active" and accessed per-CPU, while another
one is inactive and worked on asynchronously "off CPU" (e.g. garbage
collection is performed). Then at some point the two lists are
swapped, and a fast RCU-like mechanism is required to make sure that
the previously active list is no longer accessed.
This patch introduces such a mechanism: in short, membarrier() syscall
issues an IPI to a CPU, restarting a potentially active RSEQ critical
section on the CPU.
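A hedged userspace usage sketch of the new command (error handling omitted):
    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    /* One-time registration for the calling process. */
    syscall(__NR_membarrier,
            MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
    /* ... swap the active/inactive per-CPU lists ... */
    /* Restart any RSEQ critical section still running elsewhere so the
     * previously active list is no longer accessed. */
    syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);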
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
|
|
Barry Song noted the following
Something is wrong. In find_busiest_group(), we are checking if
src has higher load, however, in task_numa_find_cpu(), we are
checking if dst will have higher load after balancing. It seems
it is not sensible to check src.
It may produce a wrong imbalance value; for example,
if dst_running = env->dst_stats.nr_running + 1 results in 3 or
above, and src_running = env->src_stats.nr_running - 1 results
in 1, the current code treats the imbalance as 0 since src_running is
smaller than 2. This is inconsistent with the load balancer.
Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.
This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases. For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2 socket box with 80 cpus.
specjbb
5.9.0-rc4 5.9.0-rc4
vanilla dstbalance-v1r1
Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*
The performance at the mid-point is better but not universally better. The
patch is a mixed bag depending on the workload, machine and overall
levels of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this area and more people seem to prefer
the patch, it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle down.
Jirka Hladky added the following observation.
Our results are mostly in line with what you see. We observe
big gains (20-50%) when the system is loaded to 1/3 of the
maximum capacity and mixed results at the full load - some
workloads benefit from the patch at the full load, others not,
but performance changes at the full load are mostly within the
noise of results (+/-5%). Overall, we think this patch is helpful.
[mgorman@techsingularity.net: Rewrote changelog]
Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
|
|
The busy_factor, which increases the load balance interval when a CPU is busy,
is set to 32 by default. This value generates some huge LB intervals on a
large system like the THX2, made of 2 nodes x 28 cores x 4 threads.
For such a system, the interval increases from 112ms to 3584ms at the MC
level, and from 228ms to 7168ms at the NUMA level.
Even on smaller systems, a lower busy factor has shown an improvement in the
fair distribution of the running time, so let's reduce it for all.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
|
|
Sched domains tend to trigger the load balance loop simultaneously, but the
larger domains often need more time to collect statistics. This slowness
makes the larger domains try to detach tasks from a rq whereas the tasks
have already migrated somewhere else at a sub-domain level. This is not a
real problem for idle LB because the period of the smaller domains will
increase as their CPUs get busy, which leaves time for the higher ones
to pull tasks. But it becomes a problem when all CPUs are already busy,
because all domains stay synced when they trigger their LB.
A simple way to minimize simultaneous LB of all domains is to decrement
the busy interval by 1 jiffy. Because of the busy_factor, the interval of
a larger domain will no longer be a multiple of the smaller ones.
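A sketch of where this lands in get_sd_balance_interval() (illustrative):
    /* scale ms to jiffies */
    interval = msecs_to_jiffies(interval);
    /* Keep busy intervals of nested domains from being exact
     * multiples of each other. */
    if (cpu_busy)
        interval -= 1;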
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
|
|
The 25% default imbalance threshold for the DIE and NUMA domains is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The 20% imbalance
between the 2 groups of 4 cores is just low enough not to trigger
load balancing between the 2 groups. We will always have the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUs. With fair time sharing in each group, we end up with
+20% running time for the group of 5 threads.
Consider decreasing the imbalance threshold for the overloaded case, where
we use the load to balance tasks and to ensure fair time sharing.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
|
|
Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced, and
the load balancer currently migrates the waiting task between the CPUs in an
almost random manner. The success of a rq pulling a task depends on the
value of nr_balance_failed of its domains and on its ability to detach the
task faster than others. This behavior results in an unfair distribution
of the running time between tasks because some CPUs will run most of the
time, if not always, the same task whereas others will share their time
between several tasks.
Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance fight.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
|
|
In fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used and
sometimes update_tg_load_avg(cfs_rq, false) is used.
update_tg_load_avg() has a 'force' parameter, but the current code
never sets it to 1 or true, so remove the force parameter.
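The resulting call shape (a sketch):
    /* before */
    update_tg_load_avg(cfs_rq, 0);
    /* after */
    update_tg_load_avg(cfs_rq);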
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
|
|
We've hit problems in our production environment where tasks with a full
cpumask (e.g. from being placed into a cpuset or having full affinity set)
were occasionally migrated to our isolated CPUs.
After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.
Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
(thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.
Fix it by checking the valid domain mask in select_idle_smt().
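A hedged sketch of the added check inside the sibling scan of
select_idle_smt() (surrounding code simplified):
    for_each_cpu(cpu, cpu_smt_mask(target)) {
        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
            !cpumask_test_cpu(cpu, sched_domain_span(sd)))
            continue;
        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
            return cpu;
    }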
Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
|
|
There is no caller in tree, so it can be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com
|
|
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time, while leaving more space for non-real-time tasks to run on the CPUs
that lend rt_runtime.
The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.
This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.
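The change amounts to flipping the default in kernel/sched/features.h
(sketch):
    SCHED_FEAT(RT_RUNTIME_SHARE, false)
Users who want the old behaviour can still re-enable it at runtime through
the sched_features debugfs interface.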
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Wei Wang <wvw@google.com>
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
|
|
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.
Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is runnable,
but not actually on the rq, and neither an immediate replenishment happens,
nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.
Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.
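A sketch of the fix idea (placement illustrative):
    /* When skipping the replenish because the task is leaving
     * SCHED_DEADLINE, also clear the throttled state so a later boost
     * does not find a stale dl_throttled flag. */
    p->dl.dl_throttled = 0;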
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
|
|
Use runnable_avg to classify numa node state similarly to what is done for
the normal load balancer. This helps ensure that the numa and normal
balancers use the same view of the state of the system.
Large arm64 system: 2 nodes / 224 CPUs:
hackbench -l (256000/#grp) -g #grp
grp    tip/sched/core        +patchset             improvement
1      14.008(+/- 4.99 %)    13.800(+/- 3.88 %)    1.48 %
4      4.340(+/- 5.35 %)     4.283(+/- 4.85 %)     1.33 %
16     3.357(+/- 0.55 %)     3.359(+/- 0.54 %)    -0.06 %
32     3.050(+/- 0.94 %)     3.039(+/- 1.06 %)     0.38 %
64     2.968(+/- 1.85 %)     3.006(+/- 2.92 %)    -1.27 %
128    3.290(+/-12.61 %)     3.108(+/- 5.97 %)     5.51 %
256    3.235(+/- 3.95 %)     3.188(+/- 2.83 %)     1.45 %
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
|
|
The struct of_device_id is not defined with !CONFIG_OF, so its forward
declaration should be hidden as well. This addresses the following clang
compile warning:
drivers/mmc/host/sdhci-s3c.c:464:34: warning: tentative array definition assumed to have one element
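A sketch of the resulting pattern (the match-table name is an assumption
here):
    #ifdef CONFIG_OF
    static const struct of_device_id sdhci_s3c_dt_match[];
    #endif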
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200925072532.10272-1-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Fix inconsistent indenting, reported by Smatch:
drivers/mmc/host/sdhci-esdhc-imx.c:1380 sdhci_esdhc_imx_hwinit() warn: inconsistent indenting
drivers/mmc/host/sdhci-sprd.c:390 sdhci_sprd_request_done() warn: inconsistent indenting
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200923153739.30327-2-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The 'struct mmc_host *mmc' comes from drvdata set at the end of probe,
so it cannot be NULL. The code already dereferences it a few lines before
the check with mmc_priv(). This also fixes a smatch warning:
drivers/mmc/host/moxart-mmc.c:692 moxart_remove() warn: variable dereferenced before check 'mmc' (see line 688)
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20200923153739.30327-1-krzk@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The MMC core has now a generic check if some tuning is in progress. Its
protected area is a bit larger than the custom one in this driver but we
concluded that this works equally well for the intended case. So, drop
the local flag and switch to the generic one.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Tested-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://lore.kernel.org/r/20200922172253.4458-1-wsa@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Simplify the return expression.
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Link: https://lore.kernel.org/r/20200921131042.92340-1-miaoqinglang@huawei.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Add documentation for mmc_hw_reset to make sure the intended use case is
clear.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20200918215446.65654-1-wsa+renesas@sang-engineering.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|