Age | Commit message (Collapse) | Author |
|
As the KVM/arm64 selftests are routed via the kvmarm tree,
add the relevant references to the MAINTAINERS file.
Suggested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210622070732.zod7gaqhqo344vg6@gator
|
|
Since KVM commit 11663111cd49 ("KVM: arm64: Hide PMU registers from
userspace when not available") the get-reg-list* tests have been
failing with
...
... There are 74 missing registers.
The following lines are missing registers:
...
where the 74 missing registers are all PMU registers. This isn't a
bug in KVM that the selftest found, even though it's true that a
KVM userspace that wasn't setting the KVM_ARM_VCPU_PMU_V3 VCPU
flag, but still expecting the PMU registers to be in the reg-list,
would suddenly no longer have their expectations met. In that case,
the expectations were wrong, though, so that KVM userspace needs to
be fixed, and so does this selftest. The fix for this selftest is to
pull the PMU registers out of the base register sublist into their
own sublist and then create new, pmu-enabled vcpu configs which can
be tested.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210531103344.29325-6-drjones@redhat.com
|
|
Now that we can easily run the test for multiple vcpu configs, let's
merge get-reg-list and get-reg-list-sve into just get-reg-list. We
also add a final change to make it more possible to run multiple
tests, which is to fork the test, rather than directly run it. That
allows a test to fail, but subsequent tests can still run.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210531103344.29325-5-drjones@redhat.com
|
|
Add a new command line option that allows the user to select a specific
configuration, e.g. --config=sve will give the sve config. Also provide
help text and the --help/-h options.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210531103344.29325-4-drjones@redhat.com
|
|
We don't want to have to create a new binary for each vcpu config, so
prepare to run the test for multiple vcpu configs in a single binary.
We do this by factoring out the test from main() and then looping over
configs. When given '--list' we still never print more than a single
reg-list for a single vcpu config though, because it would be confusing
otherwise.
No functional change intended.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210531103344.29325-3-drjones@redhat.com
|
|
We already break register lists into sublists that get selected based
on vcpu config. However, since we only had two configs (vregs and sve),
we didn't structure the code very well to manage them. Restructure it
now to more cleanly handle register sublists that are dependent on the
vcpu config.
This patch has no intended functional change (except for the vcpu
config name now being prepended to all output).
Signed-off-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210531103344.29325-2-drjones@redhat.com
|
|
This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20.
Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
doorbell."
This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3.
Reason for revert: Side effect of enlarging CP_MEC_DOORBELL_RANGE may
cause some APUs fail to enter gfxoff in certain user cases.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
Once drm_framebuffer_init has returned 0, the framebuffer is hooked up
to the reference counting machinery and can no longer be destroyed with
a simple kfree. Therefore, it must be called last.
If drm_framebuffer_init returns 0 but its caller then returns non-0,
there will likely be memory corruption fireworks down the road.
The following lead me to this fix:
[ 12.891228] kernel BUG at lib/list_debug.c:25!
[...]
[ 12.891263] RIP: 0010:__list_add_valid+0x4b/0x70
[...]
[ 12.891324] Call Trace:
[ 12.891330] drm_framebuffer_init+0xb5/0x100 [drm]
[ 12.891378] amdgpu_display_gem_fb_verify_and_init+0x47/0x120 [amdgpu]
[ 12.891592] ? amdgpu_display_user_framebuffer_create+0x10d/0x1f0 [amdgpu]
[ 12.891794] amdgpu_display_user_framebuffer_create+0x126/0x1f0 [amdgpu]
[ 12.891995] drm_internal_framebuffer_create+0x378/0x3f0 [drm]
[ 12.892036] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm]
[ 12.892075] drm_mode_addfb2+0x34/0xd0 [drm]
[ 12.892115] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm]
[ 12.892153] drm_ioctl_kernel+0xe2/0x150 [drm]
[ 12.892193] drm_ioctl+0x3da/0x460 [drm]
[ 12.892232] ? drm_internal_framebuffer_create+0x3f0/0x3f0 [drm]
[ 12.892274] amdgpu_drm_ioctl+0x43/0x80 [amdgpu]
[ 12.892475] __se_sys_ioctl+0x72/0xc0
[ 12.892483] do_syscall_64+0x33/0x40
[ 12.892491] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: f258907fdd835e "drm/amdgpu: Verify bo size can fit framebuffer size on init."
Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
It's not sufficient to skip reading when the pos is beyond the EOF.
There may be data at the head of the page that we need to fill in
before the write.
Add a new helper function that corrects and clarifies the logic of
when we can skip reads, and have it only zero out the part of the page
that won't have data copied in for the write.
Finally, don't set the page Uptodate after zeroing. It's not up to date
since the write data won't have been copied in yet.
[DH made the following changes:
- Prefixed the new function with "netfs_".
- Don't call zero_user_segments() for a full-page write.
- Altered the beyond-last-page check to avoid a DIV instruction and got
rid of then-redundant zero-length file check.
]
Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper")
Reported-by: Andrew W Elble <aweits@rit.edu>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
Link: https://lore.kernel.org/r/20210613233345.113565-1-jlayton@kernel.org/
Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2
|
|
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page. Two things are necessary:
(1) If the page is not up to date, then we should just return 0
(ie. indicating a zero-length copy). The loop in
generic_perform_write() will go around again, possibly breaking up the
iterator into discrete chunks[1].
This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a
for ceph.
(2) The page should not have been set uptodate if it wasn't completely set
up by netfs_write_begin() (this will be fixed in the next patch), so
we need to set uptodate here in such a case.
Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above. The
assertion was from when uptodate was set elsewhere.
Changes:
v3: Remove the handling of len exceeding the end of the page.
Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1]
Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
|
|
A disabled/masked interrupt marked as wakeup source must be re-enable
and unmasked in order to be able to wake-up the host. That can be done
by flaging the irqchip with IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND.
Note: It 'sometimes' works without that change, but only thanks to the
lazy generic interrupt disabling (keeping interrupt unmasked).
Reported-by: Michal Koziel <michal.koziel@emlogic.no>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
|
|
on SA8155p-adp board" from Bhupesh Sharma <bhupesh.sharma@linaro.org>:
Changes since v2:
-----------------
- v2 series can be found here: https://lore.kernel.org/linux-arm-msm/20210615074543.26700-1-bhupesh.sharma@linaro.org/T/#m8303d27d561b30133992da88198abb78ea833e21
- Addressed review comments from Bjorn and Mark.
- As per suggestion from Bjorn, seperated the patches in different
patchsets (specific to each subsystem) to ease review and patch application.
Changes since v1:
-----------------
- v1 series can be found here: https://lore.kernel.org/linux-arm-msm/20210607113840.15435-1-bhupesh.sharma@linaro.org/T/#mc524fe82798d4c4fb75dd0333318955e0406ad18
- Addressed review comments from Bjorn and Vinod received on the v1
series.
This series adds the regulator support code for SA8155p-adp board
which is based on Qualcomm snapdragon sa8155p SoC which in turn is
simiar to the sm8150 SoC.
This board supports a new PMIC PMM8155AU.
While at it, also make some cosmetic changes to the regulator driver
and dt-bindings to make sure the compatibles are alphabetical and also
fix issues with extra comma(s) at the end of terminator line(s).
Cc: Mark Brown <broonie@kernel.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Bhupesh Sharma (5):
dt-bindings: regulator: qcom,rpmh-regulator: Arrange compatibles
alphabetically
dt-bindings: regulator: qcom,rpmh-regulator: Add compatible for
SA8155p-adp board pmic
regulator: qcom-rpmh: Cleanup terminator line commas
regulator: qcom-rpmh: Add terminator at the end of pm7325x_vreg_data[]
array
regulator: qcom-rpmh: Add new regulator found on SA8155p adp board
.../regulator/qcom,rpmh-regulator.yaml | 17 ++---
drivers/regulator/qcom-rpmh-regulator.c | 62 +++++++++++++++----
2 files changed, 59 insertions(+), 20 deletions(-)
--
2.31.1
|
|
<matti.vaittinen@fi.rohmeurope.com>:
Extend regulator notification support
This series extends the regulator notification and error flag support.
Initial discussion on the topic can be found here:
https://lore.kernel.org/lkml/6046836e22b8252983f08d5621c35ececb97820d.camel@fi.rohmeurope.com/
In a nutshell - the series adds:
1. WARNING level events/error flags. (Patch 3)
Current regulator 'ERROR' event notifications for over/under
voltage, over current and over temperature are used to indicate
condition where monitored entity is so badly "off" that it actually
indicates a hardware error which can not be recovered. The most
typical hanling for that is believed to be a (graceful)
system-shutdown. Here we add set of 'WARNING' level flags to allow
sending notifications to consumers before things are 'that badly off'
so that consumer drivers can implement recovery-actions.
2. Device-tree properties for specifying limit values. (Patches 1, 5)
Add limits for above mentioned 'ERROR' and 'WARNING' levels (which
send notifications to consumers) and also for a 'PROTECTION' level
(which will be used to immediately shut-down the regulator(s) W/O
informing consumer drivers. Typically implemented by hardware).
Property parsing is implemented in regulator core which then calls
callback operations for limit setting from the IC drivers. A
warning is emitted if protection is requested by device tree but the
underlying IC does not support configuring requested protection.
3. Helpers which can be registered by IC. (Patch 4)
Target is to avoid implementing IRQ handling and IRQ storm protection
in each IC driver. (Many of the ICs implementin these IRQs do not allow
masking or acking the IRQ but keep the IRQ asserted for the whole
duration of problem keeping the processor in IRQ handling loop).
4. Emergency poweroff function (refactored out of the thermal_core to
kernel/reboot.c) which is called if IC fires error IRQs but IC reading
fails and given retry-count is exceeded. (Patches 2, 4)
Please note that the mutex in the emergency shutdown was replaced by a
simple atomic in order to allow call from any context.
The helper was attempted to be done so it could be used to implement
roughly same logic as is used in qcom-labibb regulator. This means
amongst other things a safety shut-down if IC registers are not readable.
Using these shut-down retry counters are optional. The idea is that the
helper could be also used by simpler ICs which do not provide status
register(s) which can be used to check if error is still active.
ICs which do not have such status register can simply omit the 'renable'
callback (and retry-counts etc) - and helper assumes the situation is Ok
and re-enables IRQ after given time period. If problem persists the
handler is ran again and another notification is sent - but at least the
delay allows processor to avoid IRQ loop.
Patch 7 takes this notification support in use at BD9576MUF.
Patch 8 is related to MFD change which is not really related to the RFC
here. It was added to this series in order to avoid potential conflicts.
Patch 9 adds a maintainers entry.
Changelog v10-RESEND:
- rebased on v5.13-rc4
Changelog v10:
- rebased on v5.13-rc2
- Move rdev_*() print macros to the internal.h and use rdev_dbg()
from irq_helpers.c
- Export rdev_get_name() and move it from coupler.h to driver.h for
others to use. (It was already in coupler.h but not exported -
usage was limited and coupler.h does not sound like optimal place
as rdev_name is not only used by coupled regulators)
- Send all regulator notifications from irq_helpers.c at one OR'd
event for the sake of simplicity. For BD9576 this does not matter
as it has own IRQ for each event case. Header defining events says
they may be OR'd.
- Change WARN() at protection shutdown to pr_emerg as suggested by
Petr.
Changelog v9:
- rebases on v5.13-rc1
- Update thermal documentation
- Fix regulator notification event number
Changelog v8:
- split shutdown API adding and thermal core taking it in use to
own patches.
- replace the spinlock with atomic when ensuring the emergency
shutdown is only called once.
Changelog v7:
general:
- rebased on v5.12-rc7
- new patch for refactoring the hw-failure reboot logic out of
thermal_core.c for others to use.
notification helpers:
- fix regulator error_flags query
- grammar/typos
- do not BUG() but attempt to shut-down the system
- use BITS_PER_TYPE()
Changelog v6:
Add MAINTAINERS entry
Changes to IRQ notifiers
- move devm functions to drivers/regulator/devres.c
- drop irq validity check
- use devm_add_action_or_reset()
- fix styling issues
- fix kerneldocs
Changelog v5:
- Fix the badly formatted pr_emerg() call.
Changelog v4:
- rebased on v5.12-rc6
- dropped RFC
- fix external FET DT-binding.
- improve prints for cases when expecting HW failure.
- styling and typos
Changelog v3:
Regulator core:
- Fix dangling pointer access at regulator_irq_helper()
stpmic1_regulator:
- fix function prototype (compile error)
bd9576-regulator:
- Update over current limits to what was given in new data-sheet
(REV00K)
- Allow over-current monitoring without external FET. Set limits to
values given in data-sheet (REV00K).
Changelog v2:
Generic:
- rebase on v5.12-rc2 + BD9576 series
- Split devm variant of delayed wq to own series
Regulator framework:
- Provide non devm variant of IRQ notification helpers
- shorten dt-property names as suggested by Rob
- unconditionally call map_event in IRQ handling and require it to be
populated
BD9576 regulators:
- change the FET resistance property to micro-ohms
- fix voltage computation in OC limit setting
|
|
ARM64_SWAPPER_USES_SECTION_MAPS implies that a PMD level huge page mappings
are used for swapper, idmap and vmemmap. Lets make it PMD explicit removing
any possible confusion with generic memory sections and also bit generic as
it's applicable for idmap and vmemmap mappings as well. Hence rename it as
ARM64_KERNEL_USES_PMD_MAPS instead.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/1623991622-24294-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Calculate the max VMCS index for vmcs12 by walking the array to find the
actual max index. Hardcoding the index is prone to bitrot, and the
calculation is only done on KVM bringup (albeit on every CPU, but there
aren't _that_ many null entries in the array).
Fixes: 3c0f99366e34 ("KVM: nVMX: Add a TSC multiplier field in VMCS12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210618214658.2700765-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As part of smaller maxphyaddr emulation, kvm needs to intercept
present page faults to see if it needs to add the RSVD flag (bit 3) to
the error code. However, there is no need to intercept page faults
that already have the RSVD flag set. When setting up the page fault
intercept, add the RSVD flag into the #PF error code mask field (but
not the #PF error code match field) to skip the intercept when the
RSVD flag is already set.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210618235941.1041604-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Pull ARM fix from Russell King:
- fix gcc 10 compiler regression with cpu_init()
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9081/1: fix gcc-10 thumb2-kernel regression
|
|
While checking the master status of the DRM file in
drm_is_current_master(), the device's master mutex should be
held. Without the mutex, the pointer fpriv->master may be freed
concurrently by another process calling drm_setmaster_ioctl(). This
could lead to use-after-free errors when the pointer is subsequently
dereferenced in drm_lease_owner().
The callers of drm_is_current_master() from drm_auth.c hold the
device's master mutex, but external callers do not. Hence, we implement
drm_is_current_master_locked() to be used within drm_auth.c, and
modify drm_is_current_master() to grab the device's master mutex
before checking the master status.
Reported-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210620110327.4964-2-desmondcheongzx@gmail.com
|
|
Because the __x86_indirect_alt* symbols are just that, objtool will
try and validate them as regular symbols, instead of the alternative
replacements that they are.
This goes sideways for FRAME_POINTER=y builds; which generate a fair
amount of warnings.
Fixes: 9bc0bb50727c ("objtool/x86: Rewrite retpoline thunk calls")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YNCgxwLBiK9wclYJ@hirez.programming.kicks-ass.net
|
|
Split up the #VC handler code into a from-user and a from-kernel part.
This allows clean and correct state tracking, as the #VC handler needs
to enter NMI-state when raised from kernel mode and plain IRQ state when
raised from user-mode.
Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org
|
|
The #VC handler only cares about IRQs being disabled while the GHCB is
active, as it must not be interrupted by something which could cause
another #VC while it holds the GHCB (NMI is the exception for which the
backup GHCB exits).
Make sure nothing interrupts the code path while the GHCB is active
by making sure that callers of __sev_{get,put}_ghcb() have disabled
interrupts upfront.
[ bp: Massage commit message. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-2-joro@8bytes.org
|
|
Function wait_current_trans_commit_start is now fairly trivial so it can
be inlined in its only caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's only one caller left btrfs_ioctl_start_sync that passes 0, so we
can remove the switch in btrfs_commit_transaction_async.
A cleanup 9babda9f33fd ("btrfs: Remove async_transid from
btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
1, so this is a followup.
As this removes last call of wait_current_trans_commit_start_and_unblock,
remove the function as well.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
clang warns:
fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set
but not used [-Wunused-but-set-variable]
int total_data_size = 0, total_size = 0;
^
1 warning generated.
This variable's value has been unused since commit fc0d82e103c7 ("btrfs:
sink total_data parameter in setup_items_for_insert"). Eliminate it.
Link: https://github.com/ClangBuiltLinux/linux/issues/1391
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
By way of inverting the list_empty conditional the insert label can be
eliminated, making the function's flow entirely linear.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With current btrfs subpage rw support, the following script can lead to
fs hang:
$ mkfs.btrfs -f -s 4k $dev
$ mount $dev -o nospace_cache $mnt
$ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
The fs will hang at btrfs_start_ordered_extent().
[CAUSE]
In above test case, btrfs_invalidate() will be called with the following
parameters:
offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000
Since @offset is 0, btrfs_invalidate() will try to invalidate the full
page, and finally call clear_page_extent_mapped() which will detach
subpage structure from the page.
And since the page no longer has subpage structure, the subpage dirty
bitmap will be cleared, preventing the dirty range from being written
back, thus no way to wake up the ordered extent.
[FIX]
Just follow other filesystems, only to invalidate the page if the range
covers the full page.
There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.
Although the old code will still try to invalidate the full page, we are
still safe to just wait for ordered extent to finish.
So it shouldn't cause extra problems.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With current subpage RW support, the following script can hang the fs
with 64K page size.
# mkfs.btrfs -f -s 4k $dev
# mount $dev -o nospace_cache $mnt
# fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.
We exit the loop until we meet all the following conditions:
- No ordered extent in the lock range
- No page is in the lock range
The latter condition has a pitfall, it only works for sector size ==
PAGE_SIZE case.
While can't handle the following subpage case:
0 32K 64K 96K 128K
| |///////||//////| ||
lockstart=32K
lockend=96K - 1
In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.
Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
will never meet the loop exit condition.
[FIX]
Fix the problem by doing page alignment for the lock range.
Function filemap_range_has_page() has already handled lend < lstart
case, we only need to round up @lockstart, and round_down @lockend for
truncate_pagecache_range().
This modification should not change any thing for sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The modifications are:
- Page copy destination
For subpage case, one page can contain multiple sectors, thus we can
no longer expect the memcpy_to_page()/btrfs_decompress() to copy
data into page offset 0.
The correct offset is offset_in_page(file_offset) now, which should
handle both regular sectorsize and subpage cases well.
- Page status update
Now we need to use subpage helper to handle the page status update.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Only set_page_dirty() and SetPageUptodate() is not subpage compatible.
Convert them to subpage helpers, so that __extent_writepage_io() can
submit page content correctly.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_truncate_block() itself is already mostly subpage compatible, the
only missing part is the page dirtying code.
Currently if we have a sector that needs to be truncated, we set the
sector aligned range delalloc, then set the full page dirty.
The problem is, current subpage code requires subpage dirty bit to be
set, or __extent_writepage_io() won't submit bio, thus leads to ordered
extent never to finish.
So this patch will make btrfs_truncate_block() to call
btrfs_page_set_dirty() helper to replace set_page_dirty() to fix the
problem.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
__extent_writepage_io() function originally just iterates through all
the extent maps of a page, and submits any regular extents.
This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
need to submit the only sector contained in the page.
But for subpage case, one dirty page can contain several clean sectors
with at least one dirty sector.
If __extent_writepage_io() still submit all regular extent maps, it can
submit data which is already written to disk.
And since such already written data won't have corresponding ordered
extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submit the dirty range other than the
full extent.
Since we're also here, also modify the following calls to be subpage
compatible:
- SetPageError()
- end_page_writeback()
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Function btrfs_set_range_writeback() currently just sets the page
writeback unconditionally.
Change it to call the subpage helper so that we can handle both cases
well.
Since the subpage helpers needs btrfs_fs_info, also change the parameter
to accept btrfs_inode.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
__process_pages_contig()
In cow_file_range(), after we have succeeded creating an inline extent,
we unlock the page with extent_clear_unlock_delalloc() by passing
locked_page == NULL.
For sectorsize == PAGE_SIZE case, this is just making the page lock and
unlock harder to grab.
But for incoming subpage case, it can be a big problem.
For incoming subpage case, page locking have two entry points:
- __process_pages_contig()
In that case, we know exactly the range we want to lock (which only
requires sector alignment).
To handle the subpage requirement, we introduce btrfs_subpage::writers
to page::private, and will update it in __process_pages_contig().
- Other directly lock/unlock_page() call sites
Those won't touch btrfs_subpage::writers at all.
This means, page locked by __process_pages_contig() can only be unlocked
by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.
Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
creating an inline extent is the exception.
It intentionally call extent_clear_unlock_delalloc() with locked_page ==
NULL, to also unlock current page (and clear its dirty/writeback bits).
To co-operate with incoming subpage modifications, and make the page
lock/unlock pair easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and only unlock the
page in __extent_writepage().
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
bit is updated, but dirty/writeback/error bits are all skipped.
There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
All for their error handling branches.
For those call sites, since we skip the locked page for
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit remaining.
Normally it's the call sites which locked the page to handle the locked
page, but it won't hurt if we also do the update.
Especially there are already other call sites doing the same thing by
manually passing NULL as locked_page.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves the following modification:
- Ordered extent creation
This is done in process_one_page(), now PAGE_SET_ORDERED will call
subpage helper to do the work.
- endio functions
This is done in btrfs_mark_ordered_io_finished().
- btrfs_invalidatepage()
- btrfs_cleanup_ordered_extents()
Use the subpage page helper, and add an extra branch to exit if the
locked page have covered the full range.
Now the usage of page Ordered flag for ordered extent accounting is fully
subpage compatible.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This patch introduces the following functions to handle btrfs subpage
ordered (Private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
These helpers can handle both regular sector size and subpage without
problem.
These functions are here to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of those
functions can finish the ordered extent.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Introduce a new data inodes specific subpage member, writers, to record
how many sectors are under page lock for delalloc writing.
This member acts pretty much the same as readers, except it's only for
delalloc writes.
This is important for delalloc code to trace which page can really be
freed, as we have cases like run_delalloc_nocow() where we may exit
processing nocow range inside a page, but need to exit to do cow half
way.
In that case, we need a way to determine if we can really unlock a full
page.
With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
process_one_page()
There are still tons of call sites manually lock and unlock a page,
without updating btrfs_subpage::writers.
So if we lock a page through process_one_page() then it must be
unlocked by process_one_page() to keep btrfs_subpage::writers
consistent.
This will be handled in next patch.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now in end_bio_extent_writepage(), the only subpage incompatible code is
the end_page_writeback().
Just call the subpage helpers.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
status
For __process_pages_contig() and process_one_page(), to handle subpage
we only need to pass bytenr in and call subpage helpers to handle
dirty/error/writeback status.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status update to use
subpage helpers.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Just like read page, for subpage support we only require sector size
alignment.
So change the error message condition to only require sector alignment.
This should not affect existing code, as for regular sectorsize ==
PAGE_SIZE case, we are still requiring page alignment.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the coming subpage RW supports, there are a lot of page status update
calls which need to be converted to subpage compatible version, which
needs @start and @len.
Some call sites already have such @start/@len and are already in
page range, like various endio functions.
But there are also call sites which need to clamp the range for subpage
case, like btrfs_dirty_pagse() and __process_contig_pages().
Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the
clamp for subpage version.
Although in theory all existing btrfs_page_*() calls can be converted to
use btrfs_page_clamp_*() directly, but that would make us to do
unnecessary clamp operations.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In __process_pages_contig() we update page status according to page_ops.
That update process is a bunch of 'if' branches, which lie inside
two loops, this makes it pretty hard to expand for later subpage
operations.
So this patch will extract these operations into its own function,
process_one_pages().
Also since we're refactoring __process_pages_contig(), also move the new
helper and __process_pages_contig() before the first caller of them, to
remove the forward declaration.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.
So change the parameter and all callers to pass bytenr in.
With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.
Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.
If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().
So here for @processed_end, we use two different return value patterns.
If we have locked any page, @processed_end will be the last byte of
locked page.
Or it will be @start otherwise.
This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When running subpage preparation patches on x86, btrfs/125 will hang
forever with one ordered extent never finished.
[CAUSE]
The test case btrfs/125 itself will always fail as the fix is never merged.
When the test fails at balance, btrfs needs to cleanup the ordered
extent in btrfs_cleanup_ordered_extents() for data reloc inode.
The problem is in the sequence how we cleanup the page Order bit.
Currently it works like:
btrfs_cleanup_ordered_extents()
|- find_get_page();
|- btrfs_page_clear_ordered(page);
| Now the page doesn't have Ordered bit anymore.
| !!! This also includes the first (locked) page !!!
|
|- offset += PAGE_SIZE
| This is to skip the first page
|- __endio_write_update_ordered()
|- btrfs_mark_ordered_io_finished(NULL)
Except the first page, all ordered extents are finished.
Then the locked page is cleaned up in __extent_writepage():
__extent_writepage()
|- If (PageError(page))
|- end_extent_writepage()
|- btrfs_mark_ordered_io_finished(page)
|- if (btrfs_test_page_ordered(page))
|- !!! The page gets skipped !!!
The ordered extent is not decreased as the page doesn't
have ordered bit anymore.
This leaves the ordered extent with bytes_left == sectorsize, thus never
finish.
[FIX]
The fix is to ensure we never clear page Ordered bit without running the
ordered extent accounting.
Here we choose to skip the locked page in
btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
properly finish the ordered extent.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inside btrfs we use Private2 page status to indicate we have an ordered
extent with pending IO for the sector.
But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
And with extra comment about the bit added, so reader who is still
uncertain about the page Ordered status, will find the comment pretty
easily.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.
The involved modifications are:
- Use while() loop instead of "goto again;"
- Use single variable to determine whether to delete extent states
Each branch will also have comments why we can or cannot delete the
extent states
- Do qgroup free and extent states deletion per-loop
Current code can only work for PAGE_SIZE == sectorsize case.
This refactor also makes it clear what we do for different sectors:
- Sectors without ordered extent
We're completely safe to remove all extent states for the sector(s)
- Sectors with ordered extent, but no Private2 bit
This means the endio has already been executed, we can't remove all
extent states for the sector(s).
- Sectors with ordere extent, still has Private2 bit
This means we need to decrease the ordered extent accounting.
And then it comes to two different variants:
* We have finished and removed the ordered extent
Then it's the same as "sectors without ordered extent"
* We didn't finished the ordered extent
We can remove some extent states, but not all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although we already have btrfs_lookup_first_ordered_extent() and
btrfs_lookup_ordered_extent(), they all have their own limitations:
- btrfs_lookup_ordered_extent() can't do extra range check
It's only designed to lookup any ordered extent before certain bytenr.
- btrfs_lookup_first_ordered_extent() may not return the first ordered
extent in the range
It doesn't ensure the first ordered extent is returned.
The existing callers are only interested in exhausting all ordered
extents in a range, the order is not important.
For incoming btrfs_invalidatepage() refactoring, we need a way to
properly iterate all ordered extents in their bytenr order of a range.
So this patch will introduce a new function,
btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr
order awareness and extra range check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|