Age | Commit message (Collapse) | Author |
|
To allow building up the infrastructure required to support dynamically
enabled FPU features, add:
- XFEATURES_MASK_DYNAMIC
This constant will hold xfeatures which can be dynamically enabled.
- fpu_state_size_dynamic()
A static branch for 64-bit and a simple 'return false' for 32-bit.
This helper allows to add dynamic-feature-specific changes to common
code which is shared between 32-bit and 64-bit without #ifdeffery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-8-chang.seok.bae@intel.com
|
|
Dynamically enabled XSTATE features are by default disabled for all
processes. A process has to request permission to use such a feature.
To support this implement a architecture specific prctl() with the options:
- ARCH_GET_XCOMP_SUPP
Copies the supported feature bitmap into the user space provided
u64 storage. The pointer is handed in via arg2
- ARCH_GET_XCOMP_PERM
Copies the process wide permitted feature bitmap into the user space
provided u64 storage. The pointer is handed in via arg2
- ARCH_REQ_XCOMP_PERM
Request permission for a feature set. A feature set can be mapped to a
facility, e.g. AMX, and can require one or more XSTATE components to
be enabled.
The feature argument is the number of the highest XSTATE component
which is required for a facility to work.
The request argument is not a user supplied bitmap because that makes
filtering harder (think seccomp) and even impossible because to
support 32bit tasks the argument would have to be a pointer.
The permission mechanism works this way:
Task asks for permission for a facility and kernel checks whether that's
supported. If supported it does:
1) Check whether permission has already been granted
2) Compute the size of the required kernel and user space buffer
(sigframe) size.
3) Validate that no task has a sigaltstack installed
which is smaller than the resulting sigframe size
4) Add the requested feature bit(s) to the permission bitmap of
current->group_leader->fpu and store the sizes in the group
leaders fpu struct as well.
If that is successful then the feature is still not enabled for any of the
tasks. The first usage of a related instruction will result in a #NM
trap. The trap handler validates the permission bit of the tasks group
leader and if permitted it installs a larger kernel buffer and transfers
the permission and size info to the new fpstate container which makes all
the FPU functions which require per task information aware of the extended
feature set.
[ tglx: Adopted to new base code, added missing serialization,
massaged namings, comments and changelog ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-7-chang.seok.bae@intel.com
|
|
The upcoming prctl() which is required to request the permission for a
dynamically enabled feature will also provide an option to retrieve the
supported features. If the CPU does not support XSAVE, the supported
features would be 0 even when the CPU supports FP and SSE.
Provide separate storage for the legacy feature set to avoid that and fill
in the bits in the legacy init function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-6-chang.seok.bae@intel.com
|
|
Dynamically enabled features can be requested by any thread of a running
process at any time. The request does neither enable the feature nor
allocate larger buffers. It just stores the permission to use the feature
by adding the features to the permission bitmap and by calculating the
required sizes for kernel and user space.
The reallocation of the kernel buffer happens when the feature is used
for the first time which is caught by an exception. The permission
bitmap is then checked and if the feature is permitted, then it becomes
fully enabled. If not, the task dies similarly to a task which uses an
undefined instruction.
The size information is precomputed to allow proper sigaltstack size checks
once the feature is permitted, but not yet in use because otherwise this
would open race windows where too small stacks could be installed causing
a later fail on signal delivery.
Initialize them to the default feature set and sizes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-5-chang.seok.bae@intel.com
|
|
Split out the size calculation from the paranoia check so it can be used
for recalculating buffer sizes when dynamically enabled features are
supported.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
[ tglx: Adopted to changed base code ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-4-chang.seok.bae@intel.com
|
|
For historical reasons MINSIGSTKSZ is a constant which became already too
small with AVX512 support.
Add a mechanism to enforce strict checking of the sigaltstack size against
the real size of the FPU frame.
The strict check can be enabled via a config option and can also be
controlled via the kernel command line option 'strict_sas_size' independent
of the config switch.
Enabling it might break existing applications which allocate a too small
sigaltstack but 'work' because they never get a signal delivered. Though it
can be handy to filter out binaries which are not yet aware of
AT_MINSIGSTKSZ.
Also the upcoming support for dynamically enabled FPU features requires a
strict sanity check to ensure that:
- Enabling of a dynamic feature, which changes the sigframe size fits
into an enabled sigaltstack
- Installing a too small sigaltstack after a dynamic feature has been
added is not possible.
Implement the base check which is controlled by config and command line
options.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-3-chang.seok.bae@intel.com
|
|
New x86 FPU features will be very large, requiring ~10k of stack in
signal handlers. These new features require a new approach called
"dynamic features".
The kernel currently tries to ensure that altstacks are reasonably
sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
However, that 2k is a constant. Simply raising that 2k requirement
to >10k for the new features would break existing apps which have a
compiled-in size of 2k.
Instead of universally enforcing a larger stack, prohibit a process from
using dynamic features without properly-sized altstacks. This must be
enforced in two places:
* A dynamic feature can not be enabled without an large-enough altstack
for each process thread.
* Once a dynamic feature is enabled, any request to install a too-small
altstack will be rejected
The dynamic feature enabling code must examine each thread in a
process to ensure that the altstacks are large enough. Add a new lock
(sigaltstack_lock()) to ensure that threads can not race and change
their altstack after being examined.
Add the infrastructure in form of a config option and provide empty
stubs for architectures which do not need dynamic altstack size checks.
This implementation will be fleshed out for x86 in a future patch called
x86/arch_prctl: Add controls for dynamic XSTATE components
[dhansen: commit message. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com
|
|
Reading out the DP encoders' DPCD during booting or resume is only
required for enabled encoders: such encoders may be modesetted during
the initial commit and the link training this involves depends on an
initialized DPCD. For DDI encoders reading out the DPCD is skipped, do
the same on pre-DDI platforms.
Atm, the first DPCD readout without a sink connected - which is a likely
scneario if the encoder is disabled - leaves intel_dp->num_common_rates
at 0, which resulted in
intel_dp_sync_state()->intel_dp_max_common_rate()
in a
intel_dp->common_rates[-1]
access. This by definition results in an undefined behaviour, though to
my best knowledge in all HW/compiler configurations it actually results
in accessing the array item type value preceding the array. In this
case the preceding value happens to be intel_dp->num_common_rates,
which is 0, so this issue - by luck - didn't cause a user visible
problem.
Nevertheless it's still an undefined behaviour and in CONFIG_UBSAN
builds leads to a kernel BUG() (which revealed this problem for us),
hence CC:stable.
A related problem in case the encoder is enabled but the sink is not
connected or the DPCD readout fails is fixed by the next patch.
v2: Amend the commit message describing the root cause of the
CONFIG_UBSAN BUG().
Fixes: a532cde31de3 ("drm/i915/tc: Fix TypeC port init/resume time sanitization")
References: https://gitlab.freedesktop.org/drm/intel/-/issues/4297
Reported-and-tested-by: Mat Jonczyk <mat.jonczyk@o2.pl>
Cc: Mat Jonczyk <mat.jonczyk@o2.pl>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211018094154.1407705-2-imre.deak@intel.com
(cherry picked from commit 4ec5ffc341cecbea060739aea1d53398ac2ec3f8)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
|
Replace the unconditional clflush() with drm_clflush_virt_range()
which does the wbinvd() fallback when clflush is not available.
This time no justification is given for the clflush in the
offending commit.
Cc: stable@vger.kernel.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Fixes: 2c8ab3339e39 ("drm/i915: Pin timeline map after first timeline pin, v4.")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014090941.12159-4-ville.syrjala@linux.intel.com
Reviewed-by: Dave Airlie <airlied@redhat.com>
(cherry picked from commit 9ced12182d0d8401d821e9602e56e276459900fc)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
|
This one is apparently a "clflush for good measure", so bit more
justification (if you can call it that) than some of the others.
Convert to drm_clflush_virt_range() again so that machines without
clflush will survive the ordeal.
Cc: stable@vger.kernel.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com> #v1
Fixes: 12ca695d2c1e ("drm/i915: Do not share hwsp across contexts any more, v8.")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014090941.12159-3-ville.syrjala@linux.intel.com
Reviewed-by: Dave Airlie <airlied@redhat.com>
(cherry picked from commit af7b6d234eefa30c461cc16912bafb32b9e6141c)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
|
Improve a few comments. These were extracted from the patch
"fscrypt: add support for hardware-wrapped keys"
(https://lore.kernel.org/r/20211021181608.54127-4-ebiggers@kernel.org).
Link: https://lore.kernel.org/r/20211026021042.6581-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
In commit c46ed2281bbe ("tpm_tis_spi: add missing SPI device ID entries")
we added SPI IDs for all the DT aliases to handle the fact that we always
use SPI modaliases to load modules even when probed via DT however the
mentioned commit missed that the SPI and OF device ID entries did not
match and were different and so DT nodes with compatible
"tcg,tpm_tis-spi" will not match. Add an extra ID for tpm_tis-spi
rather than just fix the existing one since what's currently there is
going to be better for anyone actually using SPI IDs to instantiate.
Fixes: c46ed2281bbe ("tpm_tis_spi: add missing SPI device ID entries")
Fixes: 96c8395e2166 ("spi: Revert modalias changes")
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
The Atmel TPM 1.2 chips crash with error
`tpm_try_transmit: send(): error -62` since kernel 4.14.
It is observed from the kernel log after running `tpm_sealdata -z`.
The error thrown from the command is as follows
```
$ tpm_sealdata -z
Tspi_Key_LoadKey failed: 0x00001087 - layer=tddl,
code=0087 (135), I/O error
```
The issue was reproduced with the following Atmel TPM chip:
```
$ tpm_version
T0 TPM 1.2 Version Info:
Chip Version: 1.2.66.1
Spec Level: 2
Errata Revision: 3
TPM Vendor ID: ATML
TPM Version: 01010000
Manufacturer Info: 41544d4c
```
The root cause of the issue is due to the TPM calls to msleep()
were replaced with usleep_range() [1], which reduces
the actual timeout. Via experiments, it is observed that
the original msleep(5) actually sleeps for 15ms.
Because of a known timeout issue in Atmel TPM 1.2 chip,
the shorter timeout than 15ms can cause the error described above.
A few further changes in kernel 4.16 [2] and 4.18 [3, 4] further
reduced the timeout to less than 1ms. With experiments,
the problematic timeout in the latest kernel is the one
for `wait_for_tpm_stat`.
To fix it, the patch reverts the timeout of `wait_for_tpm_stat`
to 15ms for all Atmel TPM 1.2 chips, but leave it untouched
for Ateml TPM 2.0 chip, and chips from other vendors.
As explained above, the chosen 15ms timeout is
the actual timeout before this issue introduced,
thus the old value is used here.
Particularly, TPM_ATML_TIMEOUT_WAIT_STAT_MIN is set to 14700us,
TPM_ATML_TIMEOUT_WAIT_STAT_MIN is set to 15000us according to
the existing TPM_TIMEOUT_RANGE_US (300us).
The fixed has been tested in the system with the affected Atmel chip
with no issues observed after boot up.
References:
[1] 9f3fc7bcddcb tpm: replace msleep() with usleep_range() in TPM
1.2/2.0 generic drivers
[2] cf151a9a44d5 tpm: reduce tpm polling delay in tpm_tis_core
[3] 59f5a6b07f64 tpm: reduce poll sleep time in tpm_transmit()
[4] 424eaf910c32 tpm: reduce polling time to usecs for even finer
granularity
Fixes: 9f3fc7bcddcb ("tpm: replace msleep() with usleep_range() in TPM 1.2/2.0 generic drivers")
Link: https://patchwork.kernel.org/project/linux-integrity/patch/20200926223150.109645-1-hao.wu@rubrik.com/
Signed-off-by: Hao Wu <hao.wu@rubrik.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
The "4 * be32_to_cpu(data->count)" multiplication can potentially
overflow which would lead to memory corruption. Add a check for that.
Cc: stable@vger.kernel.org
Fixes: 745b361e989a ("tpm: infrastructure for TPM spaces")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
COMPILE_TEST is helpful to find compilation errors in other platform(e.g.X86).
In this case, the support of COMPILE_TEST is added, so this module could
be compiled in other platform(e.g.X86), without ARCH_SYNQUACER configuration.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
When the driver fails to allocate a new Rx buffer, it passes an empty Rx
descriptor (contains zero address and size) to the device and marks it
as invalid by setting the skb pointer in the descriptor's metadata to
NULL.
After processing enough Rx descriptors, the driver will try to process
the invalid descriptor, but will return immediately seeing that the skb
pointer is NULL. Since the driver no longer passes new Rx descriptors to
the device, the Rx queue will eventually become full and the device will
start to drop packets.
Fix this by recycling the received packet if allocation of the new
packet failed. This means that allocation is no longer performed at the
end of the Rx routine, but at the start, before tearing down the DMA
mapping of the received packet.
Remove the comment about the descriptor being zeroed as it is no longer
correct. This is OK because we either use the descriptor as-is (when
recycling) or overwrite its address and size fields with that of the
newly allocated Rx buffer.
The issue was discovered when a process ("perf") consumed too much
memory and put the system under memory pressure. It can be reproduced by
injecting slab allocation failures [1]. After the fix, the Rx queue no
longer comes to a halt.
[1]
# echo 10 > /sys/kernel/debug/failslab/times
# echo 1000 > /sys/kernel/debug/failslab/interval
# echo 100 > /sys/kernel/debug/failslab/probability
FAULT_INJECTION: forcing a failure.
name failslab, interval 1000, probability 100, space 0, times 8
[...]
Call Trace:
<IRQ>
dump_stack_lvl+0x34/0x44
should_fail.cold+0x32/0x37
should_failslab+0x5/0x10
kmem_cache_alloc_node+0x23/0x190
__alloc_skb+0x1f9/0x280
__netdev_alloc_skb+0x3a/0x150
mlxsw_pci_rdq_skb_alloc+0x24/0x90
mlxsw_pci_cq_tasklet+0x3dc/0x1200
tasklet_action_common.constprop.0+0x9f/0x100
__do_softirq+0xb5/0x252
irq_exit_rcu+0x7a/0xa0
common_interrupt+0x83/0xa0
</IRQ>
asm_common_interrupt+0x1e/0x40
RIP: 0010:cpuidle_enter_state+0xc8/0x340
[...]
mlxsw_spectrum2 0000:06:00.0: Failed to alloc skb for RDQ
Fixes: eda6500a987a ("mlxsw: Add PCI bus implementation")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20211024064014.1060919-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Originally all DAX access when through block_device operations and thus
needed a queue reference. But since commit cccbce671582
("filesystem-dax: convert to dax_direct_access()") all this happens at
the DAX device level which uses its own refcounting. Having the external
refcount thus wasn't needed but has otherwise been harmless for long
time.
But now that "block: drain file system I/O on del_gendisk" waits for
q_usage_count to reach 0 in del_gendisk this whole scheme can't work
anymore (and pmem is the only driver abusing q_usage_count like that).
So switch to the internal reference and remove the unbalanced
blk_freeze_queue_start that is taken care of by del_gendisk.
Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019073641.2323410-2-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
PTP is currently only supported on E810 devices, it is checked
in ice_ptp_init(). However, there is no check in ice_ptp_release().
For other E800 series devices, ice_ptp_release() will be wrongly executed.
Fix the following calltrace.
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
Workqueue: ice ice_service_task [ice]
Call Trace:
dump_stack_lvl+0x5b/0x82
dump_stack+0x10/0x12
register_lock_class+0x495/0x4a0
? find_held_lock+0x3c/0xb0
__lock_acquire+0x71/0x1830
lock_acquire+0x1e6/0x330
? ice_ptp_release+0x3c/0x1e0 [ice]
? _raw_spin_lock+0x19/0x70
? ice_ptp_release+0x3c/0x1e0 [ice]
_raw_spin_lock+0x38/0x70
? ice_ptp_release+0x3c/0x1e0 [ice]
ice_ptp_release+0x3c/0x1e0 [ice]
ice_prepare_for_reset+0xcb/0xe0 [ice]
ice_do_reset+0x38/0x110 [ice]
ice_service_task+0x138/0xf10 [ice]
? __this_cpu_preempt_check+0x13/0x20
process_one_work+0x26a/0x650
worker_thread+0x3f/0x3b0
? __kthread_parkme+0x51/0xb0
? process_one_work+0x650/0x650
kthread+0x161/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
Fixes: 4dd0d5c33c3e ("ice: add lock around Tx timestamp tracker flush")
Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When the PF is a member of a link aggregate, and the driver
is removed, the process will hang unless we respond to the
NETDEV_UNREGISTER event that is sent to the event_handler
for LAG.
Add a case statement for the ice_lag_event_handler to unlink
the PF from the link aggregate.
Also remove code that was incorrectly applying a dev_hold to
peer_netdevs that were associated with the ice driver.
Fixes: df006dd4b1dc ("ice: Add initial support framework for LAG")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tony.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
sm8250 target"
This reverts commit 001ce9785c0674d913531345e86222c965fc8bf4.
This upstream commit broke AOSP (post Android 12 merge) build
on RB5. The device either silently crashes into USB crash mode
after android boot animation or we see a blank blue screen
with following dpu errors in dmesg:
[ T444] hw recovery is not complete for ctl:3
[ T444] [drm:dpu_encoder_phys_vid_prepare_for_kickoff:539] [dpu error]enc31 intf1 ctl 3 reset failure: -22
[ T444] [drm:dpu_encoder_phys_vid_wait_for_commit_done:513] [dpu error]vblank timeout
[ T444] [drm:dpu_kms_wait_for_commit_done:454] [dpu error]wait for commit done returned -110
[ C7] [drm:dpu_encoder_frame_done_timeout:2127] [dpu error]enc31 frame done timeout
[ T444] [drm:dpu_encoder_phys_vid_wait_for_commit_done:513] [dpu error]vblank timeout
[ T444] [drm:dpu_kms_wait_for_commit_done:454] [dpu error]wait for commit done returned -110
Fixes: 001ce9785c06 ("arm64: dts: qcom: sm8250: remove bus clock from the mdss node for sm8250 target")
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20211014135410.4136412-1-dmitry.baryshkov@linaro.org
|
|
Currently we use fixed size u16 bitmap for subpage bitmap. This is fine
for 4K sectorsize with 64K page size.
But for 4K sectorsize and larger page size, the bitmap is too small,
while for smaller page size like 16K, u16 bitmaps waste too much space.
Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to
record the proper bitmap size, and where each bitmap should start at.
By this, we can later compact all subpage bitmaps into one u32 bitmap.
This patch is the first step.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The existing calling convention of btrfs_alloc_subpage() is pretty
awful. Change it to a more common pattern by returning struct
btrfs_subpage directly and let the caller to determine if the call
succeeded.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
than PAGE_SIZE
There are two call sites of btrfs_alloc_subpage():
- btrfs_attach_subpage()
We have ensured sectorsize is smaller than PAGE_SIZE
- alloc_extent_buffer()
We call btrfs_alloc_subpage() unconditionally.
The alloc_extent_buffer() forces us to check the sectorsize size against
page size inside btrfs_alloc_subpage().
Since the function name, btrfs_alloc_subpage(), already indicates it
should only get called for subpage cases, do the check in
alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Update it since commit 944d3f9fac61 ("btrfs: switch seed device to
list api") did conversion from fs_devices::seed to fs_devices::seed_list.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need for the variable ret after d66105cfa873 ("btrfs:
allocate btrfs_ioctl_quota_rescan_args on stack"), remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The out label is being overused, we can simply return if the condition
permits.
No functional changes.
Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The user facing function used to allocate new chunks is
btrfs_chunk_alloc, unfortunately there is yet another similar sounding
function - btrfs_alloc_chunk. This creates confusion, especially since
the latter function can be considered "private" in the sense that it
implements the first stage of chunk creation and as such is called by
btrfs_chunk_alloc.
To avoid the awkwardness that comes with having similarly named but
distinctly different in their purpose function rename btrfs_alloc_chunk
to btrfs_create_chunk, given that the main purpose of this function is
to orchestrate the whole process of allocating a chunk - reserving space
into devices, deciding on characteristics of the stripe size and
creating the in-memory structures.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
attempted to fix the problem of secretmem_users wrapping to zero and
allowing suspend once again.
But it was reverted in commit 87066fdd2e30 ("Revert 'mm/secretmem: use
refcount_t instead of atomic_t'") because of the problems it caused - a
refcount_t was not semantically the right type to use.
Instead prevent secretmem_users from wrapping to zero by forbidding new
users if the number of users has wrapped from positive to negative.
This stops a long way short of reaching the necessary 4 billion users
where it wraps to zero again, so there's no need to be clever with
special anti-wrap types or checking the return value from atomic_inc().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jordy Zomer <jordy@pwning.systems>
Cc: Kees Cook <keescook@chromium.org>,
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit efafec27c565 ("spi: Fix tegra20 build with CONFIG_PM=n") already
fixed the build without PM support once. There was an alternative fix
by Guenter in commit 2bab94090b01 ("spi: tegra20-slink: Declare runtime
suspend and resume functions conditionally"), and Mark then merged the
two correctly in ffb1e76f4f32 ("Merge tag 'v5.15-rc2' into spi-5.15").
But for some inexplicable reason, Mark then merged things _again_ in
commit 59c4e190b10c ("Merge tag 'v5.15-rc3' into spi-5.15"), and screwed
things up at that point, and the __maybe_unused attribute on
tegra_slink_runtime_resume() went missing.
Reinstate it, so that alpha (and other architectures without PM support)
builds cleanly again.
Btw, this is another prime example of how random back-merges are not
good. Just don't do them. Subsystem developers should not merge my
tree in any normal circumstances. Both of those merge commits pointed
to above are bad: even the one that got the merge result right doesn't
even mention _why_ it was done, and the one that got it wrong is
obviously broken.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull ARM fixes from Russell King:
- Fix clang-related relocation warning in futex code
- Fix incorrect use of get_kernel_nofault()
- Fix bad code generation in __get_user_check() when kasan is enabled
- Ensure TLB function table is correctly aligned
- Remove duplicated string function definitions in decompressor
- Fix link-time orphan section warnings
- Fix old-style function prototype for arch_init_kprobes()
- Only warn about XIP address when not compile testing
- Handle BE32 big endian for keystone2 remapping
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9148/1: handle CONFIG_CPU_ENDIAN_BE32 in arch/arm/kernel/head.S
ARM: 9141/1: only warn about XIP address when not compile testing
ARM: 9139/1: kprobes: fix arch_init_kprobes() prototype
ARM: 9138/1: fix link warning with XIP + frame-pointer
ARM: 9134/1: remove duplicate memcpy() definition
ARM: 9133/1: mm: proc-macros: ensure *_tlb_fns are 4B aligned
ARM: 9132/1: Fix __get_user_check failure with ARM KASAN images
ARM: 9125/1: fix incorrect use of get_kernel_nofault()
ARM: 9122/1: select HAVE_FUTEX_CMPXCHG
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull libata fix from Damien Le Moal:
"A single fix in this pull request addressing an invalid error code
return in the sata_mv driver (from Zheyu)"
* tag 'libata-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: sata_mv: Fix the error handling of mv_chip_id()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Some late pin control fixes, the most generally annoying will probably
be the AMD IRQ storm fix affecting the Microsoft surface.
Summary:
- Three fixes pertaining to Broadcom DT bindings. Some stuff didn't
work out as inteded, we need to back out
- A resume bug fix in the STM32 driver
- Disable and mask the interrupts on probe in the AMD pinctrl driver,
affecting Microsoft surface"
* tag 'pinctrl-v5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: disable and mask interrupts on probe
pinctrl: stm32: use valid pin identifier in stm32_pinctrl_resume()
Revert "pinctrl: bcm: ns: support updated DT binding as syscon subnode"
dt-bindings: pinctrl: brcm,ns-pinmux: drop unneeded CRU from example
Revert "dt-bindings: pinctrl: bcm4708-pinmux: rework binding to use syscon"
|
|
KCSAN complaints about the sbitmap hint update:
==================================================================
BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear
write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
blk_mq_put_tag+0x82/0x90
__blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
__blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
blk_complete_reqs block/blk-mq.c:584 [inline]
blk_done_softirq+0x69/0x90 block/blk-mq.c:589
__do_softirq+0x12c/0x26e kernel/softirq.c:558
run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
kthread+0x262/0x280 kernel/kthread.c:319
ret_from_fork+0x1f/0x30
write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
blk_mq_put_tag+0x82/0x90
__blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
__blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
blk_complete_reqs block/blk-mq.c:584 [inline]
blk_done_softirq+0x69/0x90 block/blk-mq.c:589
__do_softirq+0x12c/0x26e kernel/softirq.c:558
run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
kthread+0x262/0x280 kernel/kthread.c:319
ret_from_fork+0x1f/0x30
value changed: 0x00000035 -> 0x00000044
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================
which is a data race, but not an important one. This is just updating the
percpu alloc hint, and the reader of that hint doesn't ever require it to
be valid.
Just annotate it with data_race() to silence this one.
Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.
Now that everybody is passing in zero, kill off the extra argument.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The USB gadget code is the only code that every tried to utilize the
2nd argument of the aio completions, but there are strong suspicions
that it was never actually used by anything on the userspace side.
Out of the 3 cases that touch it, two of them just pass in the same
as res, and the last one passes in error/transfer in res like any
other normal use case.
Remove the 2nd argument, pass 0 like the rest of the in-kernel users
of kiocb based IO.
Link: https://lore.kernel.org/linux-block/20211021174021.273c82b1.john@metanate.com/
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently in net_ns_get_ownership() it may not be able to set uid or gid
if make_kuid or make_kgid returns an invalid value, and an uninit-value
issue can be triggered by this.
This patch is to fix it by initializing the uid and gid before calling
net_ns_get_ownership(), as it does in kobject_get_ownership()
Fixes: e6dee9f3893c ("net-sysfs: add netdev_change_owner()")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The tx queues are not stopped during the live migration. As a result, the
ndo_start_xmit() may access netfront_info->queues which is freed by
talk_to_netback()->xennet_destroy_queues().
This patch is to netif_device_detach() at the beginning of xen-netfront
resuming, and netif_device_attach() at the end of resuming.
CPU A CPU B
talk_to_netback()
-> if (info->queues)
xennet_destroy_queues(info);
to free netfront_info->queues
xennet_start_xmit()
to access netfront_info->queues
-> err = xennet_create_queues(info, &num_queues);
The idea is borrowed from virtio-net.
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
to set the queue count and offset for each TC. So the queue count
and offset for the TCs may be zero for a short period after dev->num_tc
has been set. If a TX packet is being transmitted at this time in the
code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
nonzero dev->num_tc but zero qcount for the TC. The while loop that
keeps looping while hash >= qcount will not end.
Fix it by checking the TC's qcount to be nonzero before using it.
Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When copying the device name, the length of the data memcpy copied exceeds
the length of the source buffer, which cause the KASAN issue below. Use
strscpy_pad() instead.
BUG: KASAN: slab-out-of-bounds in ib_nl_set_path_rec_attrs+0x136/0x320 [ib_core]
Read of size 64 at addr ffff88811a10f5e0 by task rping/140263
CPU: 3 PID: 140263 Comm: rping Not tainted 5.15.0-rc1+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x7d
print_address_description.constprop.0+0x1d/0xa0
kasan_report+0xcb/0x110
kasan_check_range+0x13d/0x180
memcpy+0x20/0x60
ib_nl_set_path_rec_attrs+0x136/0x320 [ib_core]
ib_nl_make_request+0x1c6/0x380 [ib_core]
send_mad+0x20a/0x220 [ib_core]
ib_sa_path_rec_get+0x3e3/0x800 [ib_core]
cma_query_ib_route+0x29b/0x390 [rdma_cm]
rdma_resolve_route+0x308/0x3e0 [rdma_cm]
ucma_resolve_route+0xe1/0x150 [rdma_ucm]
ucma_write+0x17b/0x1f0 [rdma_ucm]
vfs_write+0x142/0x4d0
ksys_write+0x133/0x160
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f26499aa90f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 5c fd ff ff 48
RSP: 002b:00007f26495f2dc0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00000000000007d0 RCX: 00007f26499aa90f
RDX: 0000000000000010 RSI: 00007f26495f2e00 RDI: 0000000000000003
RBP: 00005632a8315440 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f26495f2e00
R13: 00005632a83154e0 R14: 00005632a8315440 R15: 00005632a830a810
Allocated by task 131419:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc+0x7c/0x90
proc_self_get_link+0x8b/0x100
pick_link+0x4f1/0x5c0
step_into+0x2eb/0x3d0
walk_component+0xc8/0x2c0
link_path_walk+0x3b8/0x580
path_openat+0x101/0x230
do_filp_open+0x12e/0x240
do_sys_openat2+0x115/0x280
__x64_sys_openat+0xce/0x140
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 2ca546b92a02 ("IB/sa: Route SA pathrecord query through netlink")
Link: https://lore.kernel.org/r/72ede0f6dab61f7f23df9ac7a70666e07ef314b0.1635055496.git.leonro@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A hard hang is observed whenever the ethernet interface is brought
down. If the PHY is stopped before the LPC core block is reset,
the SoC will hang. Comparing lpc_eth_close() and lpc_eth_open() I
re-arranged the ordering of the functions calls in lpc_eth_close() to
reset the hardware before stopping the PHY.
Fixes: b7370112f519 ("lpc32xx: Added ethernet driver")
Signed-off-by: Trevor Woerner <twoerner@gmail.com>
Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Our test reports a null pointer dereference:
[ 168.534653] ==================================================================
[ 168.535614] Disabling lock debugging due to kernel taint
[ 168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 168.537274] #PF: supervisor read access in kernel mode
[ 168.537964] #PF: error_code(0x0000) - not-present page
[ 168.538667] PGD 0 P4D 0
[ 168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
[ 168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G B 5.15.0-rc2-next-202100
[ 168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
[ 168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
[ 168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
[ 168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
[ 168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
[ 168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
[ 168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
[ 168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
[ 168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
[ 168.551221] FS: 00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
[ 168.552278] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
[ 168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 168.555888] Call Trace:
[ 168.556221] <TASK>
[ 168.556510] blkg_create+0x1c0/0x8c0
[ 168.556989] blkg_conf_prep+0x574/0x650
[ 168.557502] ? stack_trace_save+0x99/0xd0
[ 168.558033] ? blkcg_conf_open_bdev+0x1b0/0x1b0
[ 168.558629] tg_set_conf.constprop.0+0xb9/0x280
[ 168.559231] ? kasan_set_track+0x29/0x40
[ 168.559758] ? kasan_set_free_info+0x30/0x60
[ 168.560344] ? tg_set_limit+0xae0/0xae0
[ 168.560853] ? do_sys_openat2+0x33b/0x640
[ 168.561383] ? do_sys_open+0xa2/0x100
[ 168.561877] ? __x64_sys_open+0x4e/0x60
[ 168.562383] ? __kasan_check_write+0x20/0x30
[ 168.562951] ? copyin+0x48/0x70
[ 168.563390] ? _copy_from_iter+0x234/0x9e0
[ 168.563948] tg_set_conf_u64+0x17/0x20
[ 168.564467] cgroup_file_write+0x1ad/0x380
[ 168.565014] ? cgroup_file_poll+0x80/0x80
[ 168.565568] ? __mutex_lock_slowpath+0x30/0x30
[ 168.566165] ? pgd_free+0x100/0x160
[ 168.566649] kernfs_fop_write_iter+0x21d/0x340
[ 168.567246] ? cgroup_file_poll+0x80/0x80
[ 168.567796] new_sync_write+0x29f/0x3c0
[ 168.568314] ? new_sync_read+0x410/0x410
[ 168.568840] ? __handle_mm_fault+0x1c97/0x2d80
[ 168.569425] ? copy_page_range+0x2b10/0x2b10
[ 168.570007] ? _raw_read_lock_bh+0xa0/0xa0
[ 168.570622] vfs_write+0x46e/0x630
[ 168.571091] ksys_write+0xcd/0x1e0
[ 168.571563] ? __x64_sys_read+0x60/0x60
[ 168.572081] ? __kasan_check_write+0x20/0x30
[ 168.572659] ? do_user_addr_fault+0x446/0xff0
[ 168.573264] __x64_sys_write+0x46/0x60
[ 168.573774] do_syscall_64+0x35/0x80
[ 168.574264] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 168.574960] RIP: 0033:0x7fac74915130
[ 168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
[ 168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
[ 168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
[ 168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
[ 168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
[ 168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
[ 168.583757] </TASK>
[ 168.584063] Modules linked in:
[ 168.584494] CR2: 0000000000000008
[ 168.584964] ---[ end trace 2475611ad0f77a1a ]---
This is because blkg_alloc() is called from blkg_conf_prep() without
holding 'q->queue_lock', and elevator is exited before blkg_create():
thread 1 thread 2
blkg_conf_prep
spin_lock_irq(&q->queue_lock);
blkg_lookup_check -> return NULL
spin_unlock_irq(&q->queue_lock);
blkg_alloc
blkcg_policy_enabled -> true
pd = ->pd_alloc_fn
blkg->pd[i] = pd
blk_mq_exit_sched
bfq_exit_queue
blkcg_deactivate_policy
spin_lock_irq(&q->queue_lock);
__clear_bit(pol->plid, q->blkcg_pols);
spin_unlock_irq(&q->queue_lock);
q->elevator = NULL;
spin_lock_irq(&q->queue_lock);
blkg_create
if (blkg->pd[i])
->pd_init_fn -> q->elevator is NULL
spin_unlock_irq(&q->queue_lock);
Because blkcg_deactivate_policy() requires queue to be frozen, we can
grab q_usage_counter to synchoronize blkg_conf_prep() against
blkcg_deactivate_policy().
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the
caller to do iov_iter_advance(). Also get rid of __bio_iov_bvec_set(),
which was duplicated in the final binary, and replace a weird
iov_iter_truncate() of a temporal iter copy with min() better reflecting
the intention.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As with __blkdev_direct_IO_simple(), we can implement direct IO more
efficiently if there is only one bio. Add __blkdev_direct_IO_async() and
blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with
nullblk to 4.7+.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As it turns out, my earlier patch in commit 86d46fdaa12a (block:
ataflop: fix breakage introduced at blk-mq refactoring) was
incomplete. This patch fixes any remaining issues found during
more testing and code review.
Requests exceeding 4 k are handled in 4k segments but
__blk_mq_end_request() is never called on these (still
sectors outstanding on the request). With redo_fd_request()
removed, there is no provision to kick off processing of the
next segment, causing requests exceeding 4k to hang. (By
setting /sys/block/fd0/queue/max_sectors_k <= 4 as workaround,
this behaviour can be avoided).
Instead of reintroducing redo_fd_request(), requeue the remainder
of the request by calling blk_mq_requeue_request() on incomplete
requests (i.e. when blk_update_request() still returns true), and
rely on the block layer to queue the residual as new request.
Both error handling and formatting needs to release the
ST-DMA lock, so call finish_fdc() on these (this was previously
handled by redo_fd_request()). finish_fdc() may be called
legitimately without the ST-DMA lock held - make sure we only
release the lock if we actually held it. In a similar way,
early exit due to errors in ataflop_queue_rq() must release
the lock.
After minor errors, fd_error sets up to recalibrate the drive
but never re-runs the current operation (another task handled by
redo_fd_request() before). Call do_fd_action() to get the next
steps (seek, retry read/write) underway.
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Fixes: 6ec3938cff95f (ataflop: convert to blk-mq)
CC: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20211024002013.9332-1-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ioprio setup doesn't depend on other fields that are modified in
io_prep_rw() and we can move it down in the function without worrying
about performance. It's useful as it makes iocb->ki_flags
accesses/modifications closer together, so it's more likely the compiler
will cache it in a register and avoid extra reloads.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8ee98779c06f1b59f6039b1e292db4332efd664b.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_file_supports_nowait() doesn't use rw argument anymore, remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4bd6709fc573d70c866ea656cb7a7dbe94be8026.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
opcode prep functions are one of the first things that are called, we
can't have ->async_data allocated at this point and it's certainly a
bug. Reflect this assumption in io_timeout_prep() and add a WARN_ONCE
just in case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75a28ca7dbcc5af8b6cd9092819e8384c24dedd4.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If an opcode doesn't support polling, just let it be executed
synchronously in iowq, otherwise it will do a nonblock attempt just to
fail in io_arm_poll_handler() and return back to blocking execution.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6401256db01b88f448f15fcd241439cb76f5b940.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
->pollout or ->pollin are set only for opcodes that need a file, so if
io_arm_poll_handler() tests them first we can be sure that the request
has file set and the ->file check can be removed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9adfe4f543d984875e516fce6da35348aab48668.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|