Age | Commit message (Collapse) | Author |
|
Devices with compatible="pmbus" field have zero initial page count,
and pmbus_clear_faults() being called before the page count auto-
detection does not actually clear faults because it depends on the
page count. Non-cleared faults in its turn may fail the subsequent
page count auto-detection.
This patch fixes this problem by calling pmbus_clear_fault_page()
for currently set page and calling pmbus_clear_faults() after the
page count was detected.
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Bazhenov <bazhenov.dn@gmail.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Daniel Borkmann says:
====================
This work adds a generic sk_msg layer and converts both sockmap
and later ktls over to make use of it as a common data structure
for application data (similarly as sk_buff for network packets).
With that in place the sk_msg framework spans accross ULP layer
in the kernel and allows for introspection or filtering of L7
data with the help of BPF programs operating on a common input
context.
In a second step, we enable the latter for ktls which was previously
not possible, meaning, ktls and sk_msg verdict programs were
mutually exclusive in the ULP layer which created challenges for
the orchestrator when trying to apply TCP based policy, for
example. Leveraging the prior consolidation we can finally overcome
this limitation.
Note, there's no change in behavior when ktls is not used in
combination with BPF, and also no change in behavior for stand
alone sockmap. The kselftest suites for ktls, sockmap and ktls
with sockmap combined also runs through successfully. For further
details please see individual patches.
Thanks!
v1 -> v2:
- Removed leftover comment spotted by Alexei
- Improved commit messages, rebase
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a MAINTAINERS entry to the skmsg and related files such that
patches, features, bug reports land with the right Cc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This adds a --ktls option to test_sockmap in order to enable the
combination of ktls and sockmap to run, which makes for another
batch of 648 test cases for both in combination.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This work adds BPF sk_msg verdict program support to kTLS
allowing BPF and kTLS to be combined together. Previously kTLS
and sk_msg verdict programs were mutually exclusive in the
ULP layer which created challenges for the orchestrator when
trying to apply TCP based policy, for example. To resolve this,
leveraging the work from previous patches that consolidates
the use of sk_msg, we can finally enable BPF sk_msg verdict
programs so they continue to run after the kTLS socket is
created. No change in behavior when kTLS is not used in
combination with BPF, the kselftest suite for kTLS also runs
successfully.
Joint work with Daniel.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Instead of re-implementing poll routine use the poll callback to
trigger read from kTLS, we reuse the stream_memory_read callback
which is simpler and achieves the same. This helps to align sockmap
and kTLS so we can more easily embed BPF in kTLS.
Joint work with Daniel.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Convert kTLS over to make use of sk_msg interface for plaintext and
encrypted scattergather data, so it reuses all the sk_msg helpers
and data structure which later on in a second step enables to glue
this to BPF.
This also allows to remove quite a bit of open coded helpers which
are covered by the sk_msg API. Recent changes in kTLs 80ece6a03aaf
("tls: Remove redundant vars from tls record structure") and
4e6d47206c32 ("tls: Add support for inplace records encryption")
changed the data path handling a bit; while we've kept the latter
optimization intact, we had to undo the former change to better
fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
been brought back and are linked into the sk_msg sgs. Now the kTLS
record contains a msg_plaintext and msg_encrypted sk_msg each.
In the original code, the zerocopy_from_iter() has been used out
of TX but also RX path. For the strparser skb-based RX path,
we've left the zerocopy_from_iter() in decrypt_internal() mostly
untouched, meaning it has been moved into tls_setup_from_iter()
with charging logic removed (as not used from RX). Given RX path
is not based on sk_msg objects, we haven't pursued setting up a
dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
could be an option to prusue in a later step.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In order to prepare sockmap logic to be used in combination with kTLS
we need to detangle it from ULP, and further split it in later commits
into a generic API.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Whenever the ULP data on the socket is mangled, enforce that the
caller has the socket lock held as otherwise things may race with
initialization and cleanup callbacks from ulp ops as both would
mangle internal socket state.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
lockdep_assert_held() is better suited to checking locking requirements,
since it won't get confused when someone else holds the lock. This is
also a step towards possibly removing spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
lockdep_assert_held() is better suited to checking locking requirements,
since it won't get confused when someone else holds the lock. This is
also a step towards possibly removing spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Cliff Whickman <cpw@sgi.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Robin Holt <robinmholt@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The EEPROMs which hold the SPD data on DDR4 memory modules are no
longer standard AT24C02-compatible EEPROMs. They are 512-byte EEPROMs
which use only 1 I2C address for data access. You need to switch
between the lower page and the upper page of data by sending commands
on the SMBus.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/misc/eeprom/at25.c: In function 'at25_remove':
drivers/misc/eeprom/at25.c:384:20: warning:
variable 'at25' set but not used [-Wunused-but-set-variable]
Since commit 96d08fb43e30 ("eeprom: at25: use devm_nvmem_register()"),
at25_remove is do nothing, so can be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
unsigned, count for max size).
IAD Register is yet readable trough the "iad" sys file.
A write to the "iad" sys file enables or disables the current
measurement, but it was not possible to get the measured value by
reading it.
Fix: %u in snprintf for unsigned values (vdd and vad)
Fix: Avoid possibles overflows (Usage of the 'count' variables)
Signed-off-by: Julien Folly <julien.folly@gmail.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/misc/mic/scif/scif_dma.c: In function 'scif_rma_list_dma_copy_wrapper':
drivers/misc/mic/scif/scif_dma.c:1558:27: warning:
variable 'dst_dma_addr' set but not used [-Wunused-but-set-variable]
drivers/misc/mic/scif/scif_dma.c:1558:13: warning:
variable 'src_dma_addr' set but not used [-Wunused-but-set-variable]
They never used since introduction in
commit 7cc31cd27752 ("misc: mic: SCIF DMA and CPU copy interface")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In _scif_prog_signal(), the boolean variable 'x100' is used to indicate
whether the MIC Coprocessor is X100. If 'x100' is true, the status
descriptor will be used to write the value to the destination. Otherwise, a
DMA pool will be allocated for this purpose. Specifically, if the DMA pool
is allocated successfully, two memory addresses will be returned. One is
for the CPU and the other is for the device to access the DMA pool. The
former is stored to the variable 'status' and the latter is stored to the
variable 'src'. After the allocation, the address in 'src' is saved to
'status->src_dma_addr', which is actually in the DMA pool, and 'src' is
then modified.
Later on, if an error occurs, the execution flow will transfer to the label
'dma_fail', which will check 'x100' and free up the allocated DMA pool if
'x100' is false. The point here is that 'status->src_dma_addr' is used for
freeing up the DMA pool. As mentioned before, 'status->src_dma_addr' is in
the DMA pool. And thus, the device is able to modify this data. This can
potentially cause failures when freeing up the DMA pool because of the
modified device address.
This patch avoids the above issue by using the variable 'src' (with
necessary calculation) to free up the DMA pool.
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
checkpacth: Missing a blank line after declarations
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This casting is not required.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Undo effects of misc_register if driver's init fails after
misc_register.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
into the driver state
This is the last patch in the series of patches to move file-scope
variables into the driver state. This change will help to introduce
another version of the pipe driver (with different state) for the
older host interface or having several instances of this device.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
into the driver state
This is a series of patches to move mutable file-scope variables
into the driver state. This change will help to introduce another
version of the pipe driver (with different state) for the older
host interface or having several instances of this device.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
variable into the driver state
This is a series of patches to move mutable file-scope variables
into the driver state. This change will help to introduce another
version of the pipe driver (with different state) for the older
host interface or having several instances of this device.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add new GSMI commands (GSMI_CMD_LOG_S0IX_SUSPEND = 0xa,
GSMI_CMD_LOG_S0IX_RESUME = 0xb) that allow firmware to log any
information during S0ix suspend/resume paths.
Traditional ACPI suspend S3 involves BIOS both during the suspend and
the resume paths. However, modern suspend type like S0ix does not
involve firmware on either of the paths. This command gives the
firmware an opportunity to log any required information about the
suspend and resume operations e.g. wake sources.
Additionally, this change adds a module parameter to allow platforms
to specifically enable S0ix logging if required. This prevents any
other platforms from unnecessarily making a GSMI call which could have
any side-effects.
Tested by verifying that wake sources are correctly logged in eventlog.
Signed-off-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
Signed-off-by: Furquan Shaikh <furquan@google.com>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
[zwisler: update changelog for upstream]
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Instead of selecting EFI and EFI_VARS automatically when GSMI is
enabled let that portion of the driver be conditionally compiled
if EFI and EFI_VARS are enabled.
This allows the rest of the driver (specifically event log) to
be used if EFI_VARS is not enabled.
To test:
1) verify that EFI_VARS is not automatically selected when
CONFIG_GOOGLE_GSMI is enabled
2) verify that the kernel boots on Link and that GSMI event log
is still available and functional
3) specifically boot the kernel on Alex to ensure it does not
try to load efivars and that gsmi also does not load because it
is not in the supported DMI table
Signed-off-by: Duncan Laurie <dlaurie@chromium.org>
Reviewed-by: Olof Johansson <olofj@chromium.org>
Signed-off-by: Benson Leung <bleung@chromium.org>
Signed-off-by: Ben Zhang <benzh@chromium.org>
Signed-off-by: Filipe Brandenburger <filbranden@chromium.org>
Signed-off-by: Furquan Shaikh <furquan@google.com>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
[zwisler: update changelog for upstream]
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In order to use this coreboot needs board support for:
CONFIG_ELOG=y
CONFIG_ELOG_GSMI=y
And the kernel driver needs enabled:
CONFIG_GOOGLE_GSMI=y
To test, verify that clean shutdown event is added to the log:
> mosys eventlog list | grep 'Clean Shutdown'
11 | 2012-06-25 09:49:24 | Kernl Event | Clean Shutdown
Signed-off-by: Duncan Laurie <dlaurie@chromium.org>
Reviewed-by: Vadim Bendebury <vbendeb@chromium.org>
Reviewed-by: Stefan Reinauer <reinauer@chromium.org>
Signed-off-by: Furquan Shaikh <furquan@google.com>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Justin TerAvest <teravest@chromium.org>
[zwisler: update changelog for upstream]
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The sysfs handler should return the number of bytes consumed, which in the
case of a successful write is the entire buffer. Also fix a bug where
param.data_len was being set to (count - (2 * sizeof(u32))) instead of just
(count - sizeof(u32)). The latter is correct because we skip over the
leading u32 which is our param.type, but we were also incorrectly
subtracting sizeof(u32) on the line where we were actually setting
param.data_len:
param.data_len = count - sizeof(u32);
This meant that for our example event.kernel_software_watchdog with total
length 10 bytes, param.data_len was just 2 prior to this change.
To test, successfully append an event to the log with gsmi sysfs.
This sample event is for a "Kernel Software Watchdog"
> xxd -g 1 event.kernel_software_watchdog
0000000: 01 00 00 00 ad de 06 00 00 00
> cat event.kernel_software_watchdog > /sys/firmware/gsmi/append_to_eventlog
> mosys eventlog list | tail -1
14 | 2012-06-25 10:14:14 | Kernl Event | Software Watchdog
Signed-off-by: Duncan Laurie <dlaurie@chromium.org>
Reviewed-by: Vadim Bendebury <vbendeb@chromium.org>
Reviewed-by: Stefan Reinauer <reinauer@chromium.org>
Signed-off-by: Furquan Shaikh <furquan@google.com>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Justin TerAvest <teravest@chromium.org>
[zwisler: updated changelog for 2nd bug fix and upstream]
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
I am one of the main engineers working on ashmem. I have been fixing
bugs in the driver and have been involved in the memfd conversion
discussions and sending patches about that. I also have an understanding
of the binder driver and was involved with some of the development on
finer grained locking. So I would like to be added to the MAINTAINERS
file for android drivers for review and maintenance of ashmem and other
Android drivers.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Todd Kjos <tkjos@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add ELAN061C to the ACPI table to support Elan touchpad found in Lenovo
IdeaPad 330-15IGM.
Signed-off-by: Mikhail Nikiforov <jackxviichaos@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
All the other functions that deal with the sd_ail_list run the list
from the tail back to the head, iow, in reverse. We should do the
same while writing revokes, otherwise we might miss removing entries
properly from the list when we hit the limit of how many revokes we
can write at one time (based on block size, which determines how
many block pointers will fit in the revoke block).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Following warnings are generated when compiled with W=1,
drivers/dma/owl-dma.c:170: warning: Function parameter or member 'cyclic'
not described in 'owl_dma_txd'
drivers/dma/owl-dma.c:198: warning: Function parameter or member 'cfg' not
described in 'owl_dma_vchan'
drivers/dma/owl-dma.c:198: warning: Function parameter or member 'drq' not
described in 'owl_dma_vchan'
drivers/dma/owl-dma.c:225: warning: Function parameter or member 'irq' not
described in 'owl_dma'
Fix this by adding comments for relevant struct members to appear in
kernel-doc.
Fixes: d64e1b3f5cce ("dmaengine: owl: Add Slave and Cyclic mode support for
Actions Semi Owl S900 SoC")
Reported-by: Vinod Koul <vinod.koul@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
|
|
Using bool is more suitable than int here, and add the comment about the
return_bigger.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The avg_delayed_ref_runtime can be referenced from the transaction
handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It can be referenced from the transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().
No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head(). No
functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no reason to put this check in (!qgroup)'s else branch because
if qgroup is null, it will goto out directly. So move it out to reduce
indentation level. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 581c1760415c ("btrfs: Validate child tree block's level and first
key") has made tree block level check mandatory.
So if tree block level doesn't match, we won't get a valid extent
buffer. The extra WARN_ON() check can be removed completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
And add one line comment explaining what we're doing for each loop.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.
This is caused by the lack of qgroup status check before calling some
qgroup functions. Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.
This patch will do earlier check before triggering related trace events.
And for enabled <-> disabled race case:
1) For enabled->disabled case
Disable will wipe out all qgroups data including reservation and
excl/rfer. Even if we leak some reservation or numbers, it will
still be cleared, so nothing will go wrong.
2) For disabled -> enabled case
Current btrfs_qgroup_release_data() will use extent_io tree to ensure
we won't underflow reservation. And for delayed_ref we use
head->qgroup_reserved to record the reserved space, so in that case
head->qgroup_reserved should be 0 and we won't underflow.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In a scenario like the following:
mkdir /mnt/A # inode 258
mkdir /mnt/B # inode 259
touch /mnt/B/bar # inode 260
sync
mv /mnt/B/bar /mnt/A/bar
mv -T /mnt/A /mnt/B
fsync /mnt/B/bar
<power fail>
After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.
This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:
[195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
[195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
[195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[195191.943726] RIP: 0010:inc_nlink+0x33/0x40
[195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
[195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
[195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
[195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
[195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
[195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
[195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
[195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
[195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[195191.943741] Call Trace:
[195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs]
[195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs]
[195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943825] walk_log_tree+0xae/0x1d0 [btrfs]
[195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
[195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs]
[195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs]
[195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs]
[195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943899] ? pcpu_alloc+0x55b/0x7c0
[195191.943906] ? mount_fs+0x3b/0x140
[195191.943908] mount_fs+0x3b/0x140
[195191.943912] ? __init_waitqueue_head+0x36/0x50
[195191.943916] vfs_kern_mount+0x62/0x160
[195191.943927] btrfs_mount+0x134/0x890 [btrfs]
[195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943938] ? pcpu_alloc+0x55b/0x7c0
[195191.943943] ? mount_fs+0x3b/0x140
[195191.943952] ? btrfs_remount+0x570/0x570 [btrfs]
[195191.943954] mount_fs+0x3b/0x140
[195191.943956] ? __init_waitqueue_head+0x36/0x50
[195191.943960] vfs_kern_mount+0x62/0x160
[195191.943963] do_mount+0x1f9/0xd40
[195191.943967] ? memdup_user+0x4b/0x70
[195191.943971] ksys_mount+0x7e/0xd0
[195191.943974] __x64_sys_mount+0x21/0x30
[195191.943977] do_syscall_64+0x60/0x1b0
[195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[195191.943983] RIP: 0033:0x7f570e4e524a
[195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
[195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
[195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
[195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
[195191.944002] irq event stamp: 8688
[195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
[195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
[195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
[195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
[195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
This happens because the inode does not have the flag I_LINKABLE set,
which is a runtime only flag, not meant to be persisted, set when the
inode is created through open(2) if the flag O_EXCL is not passed to it.
Except for the warning, there are no other consequences (like corruptions
or metadata inconsistencies).
Since it's pointless to replay a tmpfile as it would be deleted in a
later phase of the log replay procedure (it has a link count of 0), fix
this by not logging tmpfiles and if a tmpfile is found in a log (created
by a kernel without this change), skip the replay of the inode.
A test case for fstests follows soon.
Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.18+
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need it, rsv->size is set once and never changes throughout
its lifetime, so just use that for the reserve size.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I ran into an issue where there was some reference being held on an
inode that I couldn't track. This assert wasn't triggered, but it at
least rules out we're doing something stupid.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations. So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs's btree locking has two modes, spinning mode and blocking mode,
while searching btree, locking is always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locking back to spinning mode by btrfs_clear_path_blocking().
When acquiring locks, both of reader and writer need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().
The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.
When switching to blocking mode from spinning mode, it consists of
step 1) bumping up blocking readers counter and
step 2) read_unlock()/write_unlock(),
this has caused serious ping-pong effect if there're a great amount of
concurrent readers/writers, as waiters will be woken up and go to
sleep immediately.
1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,
MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/def $MNT
time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
-d $MNT/0 -d $MNT/1 \
-d $MNT/2 -d $MNT/3 \
-d $MNT/4 -d $MNT/5 \
-d $MNT/6 -d $MNT/7 \
-d $MNT/8 -d $MNT/9 \
-d $MNT/10 -d $MNT/11 \
-d $MNT/12 -d $MNT/13 \
-d $MNT/14 -d $MNT/15
w/o patch:
real 2m27.307s
user 0m12.839s
sys 13m42.831s
w/ patch:
real 1m2.273s
user 0m15.802s
sys 8m16.495s
1.1) latency histogram from funclatency[1]
Overall with the patch, there're ~50% less write lock acquisition and
the 95% max latency that write lock takes also reduces to ~100ms from
>500ms.
--------------------------------------------
w/o patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 2385222 |****************************************|
2 -> 3 : 37147 | |
4 -> 7 : 20452 | |
8 -> 15 : 13131 | |
16 -> 31 : 3877 | |
32 -> 63 : 3900 | |
64 -> 127 : 2612 | |
128 -> 255 : 974 | |
256 -> 511 : 165 | |
512 -> 1023 : 13 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6743860 |****************************************|
2 -> 3 : 2146 | |
4 -> 7 : 190 | |
8 -> 15 : 38 | |
16 -> 31 : 4 | |
--------------------------------------------
w/ patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 1318454 |****************************************|
2 -> 3 : 6800 | |
4 -> 7 : 3664 | |
8 -> 15 : 2145 | |
16 -> 31 : 809 | |
32 -> 63 : 219 | |
64 -> 127 : 10 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6854317 |****************************************|
2 -> 3 : 2383 | |
4 -> 7 : 601 | |
8 -> 15 : 92 | |
2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16
w/o patch:
Throughput 158.363 MB/sec
w/ patch:
Throughput 449.52 MB/sec
3) xfstests didn't show any additional failures.
One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means callers are
ready to not sleep before releasing the path, but it won't cause
problems if they don't want to sleep in blocking mode.
[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The value of blocking_readers is increased only when the lock is taken
for read, no way we can fail the condition with the write lock.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|