RXI300 is a hardware design that manages traffic across buses, e.g. AXI,
AHB, APB. It reports an error when the host performs an invalid bus
operation. For example,
* a bus master issues a request while power/clock is off.
* the host reads/writes/accesses an invalid address.
Such operations might lead to problems such as bus timeout, platform
hang, etc. So, once RXI300 reports an error, the driver needs an L2 SER
(system error recovery) to reset things.
Previously, the driver did not parse the error scenario for RXI300. We add
it and assign a corresponding error code which makes the SER flow do an
L2 reset.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230119063529.61563-3-pkshih@realtek.com
|
|
Only the 8852C chip has valid pages on RTW89_DBG_SEL_MAC_30. On other
chips, this section is an address hole, and trying to access it leads to
a crash. So, avoid accessing this section on chips other than 8852C.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230119063529.61563-2-pkshih@realtek.com
|
|
Add the MAC function rtw89_mac_port_get_tsf() to get the TSF by port.
It will be used when MCC (multi-channel concurrency) performs its
timing calculations.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230119062453.58341-4-pkshih@realtek.com
|
|
Originally, rtw89_mac_port_tsf_sync() contained randomization logic
internally. However, the randomization is not needed in every situation.
So, split the generic part out of it, and keep the full logic of the
original function in rtw89_mac_port_tsf_sync_rand(), which is still
used by the original caller as before. The generic function will be
used in the MCC (multi-channel concurrency) management flow, where MCC
will implement its own logic to decide the offset for TSF sync.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230119062453.58341-3-pkshih@realtek.com
|
|
Strictly speaking, the unit of the offset should be TU instead of ms.
So, correct it and the macro for calculation. Then, to make the macro
generic, the factor n is moved outside.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230119062453.58341-2-pkshih@realtek.com
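As a rough illustration of the unit correction described in this commit
message: in IEEE 802.11, one time unit (TU) is 1024 microseconds rather
than one millisecond. The sketch below is not the driver's macro. The
names TU_IN_US, TU_TO_US and port_offset_us are hypothetical, and it only
shows the idea of a generic conversion with the factor n applied by the
caller.

```c
#include <stdint.h>

/* One 802.11 time unit (TU) is 1024 microseconds. */
#define TU_IN_US 1024u

/* Hypothetical generic helper: convert a duration in TU to microseconds. */
#define TU_TO_US(tu) ((uint32_t)(tu) * TU_IN_US)

/*
 * Hypothetical caller: the factor n is applied at the call site instead
 * of being baked into the conversion macro.
 */
static inline uint32_t port_offset_us(uint32_t offset_tu, uint32_t n)
{
	return n * TU_TO_US(offset_tu);
}
```

With these definitions, an offset of 100 TU is 102400 us, which is 2400 us
more than treating the same value as 100 ms.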
|
|
Update the driver to match the supported firmware versions listed
below.
Bluetooth firmware BT_Coex_Ver: 0x07
Wi-Fi firmware version: RTL8852C->v0.27.56.10, RTL8852A->v0.24.36
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-9-pkshih@realtek.com
|
|
The original code enables TDMA with WL:BT = 50:50 to prevent Wi-Fi
throughput from suddenly dropping to 0 at the moment A2DP pauses.
This protection is meant to last only a short moment, and switches to the
Bluetooth idle case when A2DP enters sniff mode. But an A2DP device
simulated on a Raspberry Pi does not enter sniff mode, so the protection
causes a throughput drop.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-8-pkshih@realtek.com
|
|
When Wi-Fi enters and leaves LPS, the coexistence driver needs to know
whether A2DP exists. When Wi-Fi sleeps in the deep power save state, it
is not able to receive mailbox messages sent from Bluetooth. So update
the A2DP existence information by reading a register.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-7-pkshih@realtek.com
|
|
When Wi-Fi is receiving, set the Bluetooth background scan to low
priority so that it does not break Wi-Fi packets. Depending on hardware
ability, Bluetooth can still receive even when its RX request has been
rejected. This improves Wi-Fi RX throughput.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-6-pkshih@realtek.com
|
|
Adjust the log format and correct variable reference to make the log
more readable.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-5-pkshih@realtek.com
|
|
Force firmware to update the TDMA parameters when entering/exiting power
saving. The TDMA instant feature makes firmware update the TDMA
parameters immediately when the TDMA parameter H2C is sent to firmware.
Without this feature, there is a low failure rate at which a Bluetooth
audio glitch is triggered while Wi-Fi is in power saving, or Wi-Fi fails
to enter the power save state.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-4-pkshih@realtek.com
|
|
Reset the counter regardless of whether Wi-Fi is notified of turning into
power save. With the counter reset, coexistence will easily recognize
that Bluetooth has hung.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-3-pkshih@realtek.com
|
|
This patch updates the external control (Wi-Fi firmware control) type of
TDMA-related parameters. Almost all of these cases are related to Wi-Fi
multi-role situations and AP mode.
Signed-off-by: Ching-Te Ku <ku920601@realtek.com>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230117114109.4298-2-pkshih@realtek.com
|
|
The function cfg80211_set_channel() is a static function in brcmfmac
but the name is misleading. Hence rename it with a 'brcmf' prefix. Also
change the parameters to avoid additional conversions.
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230103124117.271988-4-arend.vanspriel@broadcom.com
|
|
This code re-uses "i" to be the iterator for both the inside and outside
loops. It means the outside loop will exit earlier than intended.
Fixes: d219b7eb3792 ("mwifiex: handle BT coex event to adjust Rx BA window size")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/Y+ERnaDaZD7RtLvX@kili
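A minimal sketch of the bug class described in this commit message, using
hypothetical function names rather than the mwifiex code. When the inner
loop reuses the outer loop's counter, the outer loop resumes from wherever
the inner loop stopped.

```c
#include <stddef.h>

/*
 * Buggy pattern: the inner loop reuses "i", so after the first pass the
 * outer counter already sits at "inner".
 */
static size_t outer_passes_shared(size_t outer, size_t inner)
{
	size_t i, passes = 0;

	for (i = 0; i < outer; i++) {
		passes++;
		for (i = 0; i < inner; i++)
			; /* per-element work would go here */
	}
	return passes;
}

/* Fixed pattern: each loop has its own counter. */
static size_t outer_passes_separate(size_t outer, size_t inner)
{
	size_t i, j, passes = 0;

	for (i = 0; i < outer; i++) {
		passes++;
		for (j = 0; j < inner; j++)
			; /* per-element work would go here */
	}
	return passes;
}
```

With outer = 4 and inner = 8, the shared-counter version makes a single
outer pass while the fixed version makes four. When inner is much smaller
than outer, the shared-counter version may never terminate, so it should
only be tried with inner >= outer.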
|
|
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element array with flexible-array
member in struct mwifiex_ie_types_rates_param_set.
These are the only binary differences I see after the change:
mwifiex.o
_@@ -50154,7 +50154,7 @@
23514: R_X86_64_32S kmalloc_caches+0x50
23518: call 2351d <mwifiex_scan_networks+0x11d>
23519: R_X86_64_PLT32 __tsan_read8-0x4
- 2351d: mov $0x225,%edx
+ 2351d: mov $0x224,%edx
23522: mov $0xdc0,%esi
23527: mov 0x0(%rip),%rdi # 2352e <mwifiex_scan_networks+0x12e>
2352a: R_X86_64_PC32 kmalloc_caches+0x4c
scan.o
_@@ -5582,7 +5582,7 @@
4394: R_X86_64_32S kmalloc_caches+0x50
4398: call 439d <mwifiex_scan_networks+0x11d>
4399: R_X86_64_PLT32 __tsan_read8-0x4
- 439d: mov $0x225,%edx
+ 439d: mov $0x224,%edx
43a2: mov $0xdc0,%esi
43a7: mov 0x0(%rip),%rdi # 43ae <mwifiex_scan_networks+0x12e>
43aa: R_X86_64_PC32 kmalloc_caches+0x4c
and the reason for that is the following line:
drivers/net/wireless/marvell/mwifiex/scan.c:
1517 scan_cfg_out = kzalloc(sizeof(union mwifiex_scan_cmd_config_tlv),
1518 GFP_KERNEL);
sizeof(union mwifiex_scan_cmd_config_tlv) is now one byte smaller due to the
flex-array transformation:
46 union mwifiex_scan_cmd_config_tlv {
47 /* Scan configuration (variable length) */
48 struct mwifiex_scan_cmd_config config;
49 /* Max allocated block */
50 u8 config_alloc_buf[MAX_SCAN_CFG_ALLOC];
51 };
Notice that MAX_SCAN_CFG_ALLOC is defined in terms of
sizeof(struct mwifiex_ie_types_rates_param_set), see:
26 /* Memory needed to store supported rate */
27 #define RATE_TLV_MAX_SIZE (sizeof(struct mwifiex_ie_types_rates_param_set) \
28 + HOSTCMD_SUPPORTED_RATES)
37 /* Maximum memory needed for a mwifiex_scan_cmd_config with all TLVs at max */
38 #define MAX_SCAN_CFG_ALLOC (sizeof(struct mwifiex_scan_cmd_config) \
39 + sizeof(struct mwifiex_ie_types_num_probes) \
40 + sizeof(struct mwifiex_ie_types_htcap) \
41 + CHAN_TLV_MAX_SIZE \
42 + RATE_TLV_MAX_SIZE \
43 + WILDCARD_SSID_TLV_MAX_SIZE)
This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [1].
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/252
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/Y9xkjXeElSEQ0FPY@work
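The one-byte size difference discussed in this commit message can be
reproduced with a small sketch. The struct names below are hypothetical
stand-ins, not the mwifiex definitions, and the packed attribute assumes
GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TLV header; packed so the struct has no padding. */
struct tlv_header_sketch {
	uint16_t type;
	uint16_t len;
} __attribute__((packed));

/* Old style: the one-element array counts toward sizeof. */
struct rates_one_element {
	struct tlv_header_sketch header;
	uint8_t rates[1];
} __attribute__((packed));

/* New style: a flexible array member contributes nothing to sizeof. */
struct rates_flex {
	struct tlv_header_sketch header;
	uint8_t rates[];
} __attribute__((packed));

/* Bytes needed for a TLV carrying n rates, in each style. */
static inline size_t old_alloc_size(size_t n)
{
	/* The "- 1" compensates for the element already counted by sizeof. */
	return sizeof(struct rates_one_element) + n - 1;
}

static inline size_t flex_alloc_size(size_t n)
{
	return sizeof(struct rates_flex) + n;
}
```

Any size computed as sizeof(struct) plus a fixed maximum, without the
"- 1" adjustment, shrinks by one byte after the conversion, which is the
kind of change the 0x225 to 0x224 difference above reflects.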
|
|
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element arrays with flexible-array
members in multiple structures.
This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [1].
This results in no differences in binary output.
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/256
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/Y9xkECG3uTZ6T1dN@work
|
|
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element array with flexible-array
member in struct brcmf_tlv.
This helps with the ongoing efforts to tighten the FORTIFY_SOURCE
routines on memcpy() and help us make progress towards globally
enabling -fstrict-flex-arrays=3 [1].
This results in no differences in binary output.
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/253
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/Y9xjizhMujNEtpB4@work
|
|
Support the firmware hotfix version in GET_HW_SPEC responses to avoid an
irritating "Unknown api_id: 5" message on probe.
Based on this commit in NXP's GPLv2-licensed out-of-tree driver:
https://github.com/nxp-imx/mwifiex/commit/27fd8ecca504
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/111c7ee895f12d951e95a2edcd06d87ca26a7d0f.1674827105.git.lukas@wunner.de
|
|
The Marvell SD8978 (aka NXP IW416) uses the same registers as SD8987,
so reuse the existing mwifiex_reg_sd8987 definition.
Note that mwifiex_reg_sd8977 and mwifiex_reg_sd8997 are likewise
identical, save for the fw_dump_ctrl register: They define it as 0xf0
whereas mwifiex_reg_sd8987 defines it as 0xf9. I've verified that
0xf9 is the correct value on SD8978. NXP's out-of-tree driver uses
0xf9 for all of them, so there's a chance that 0xf0 is not correct
in the mwifiex_reg_sd8977 and mwifiex_reg_sd8997 definitions. I cannot
test that for lack of hardware, hence am leaving it as is.
NXP has only released a firmware which runs Bluetooth over UART.
Perhaps Bluetooth over SDIO is unsupported by this chipset.
Consequently, only an "sdiouart" firmware image is referenced, not an
alternative "sdsd" image.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/536b4f17a72ca460ad1b07045757043fb0778988.1674827105.git.lukas@wunner.de
|
|
Commit e3fffc1f0b47 ("devicetree: document new marvell-8xxx and
pwrseq-sd8787 options") documented a compatible string for SD8787 in
the devicetree bindings, but neglected to add it to the mwifiex driver.
Fixes: e3fffc1f0b47 ("devicetree: document new marvell-8xxx and pwrseq-sd8787 options")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.11+
Cc: Matt Ranostay <mranostay@ti.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/320de5005ff3b8fd76be2d2b859fd021689c3681.1674827105.git.lukas@wunner.de
|
|
This chip exists in two revisions (B2=r3 and B3=r4) on different
platforms, and was added without regard to doing proper firmware
selection or differentiating between them. Fix this to have proper
per-revision firmwares and support Apple NVRAM selection.
Revision B2 is present on at least these Apple T2 Macs:
kauai: MacBook Pro 15" (Touch/2018-2019)
maui: MacBook Pro 13" (Touch/2018-2019)
lanai: Mac mini (Late 2018)
ekans: iMac Pro 27" (5K, Late 2017)
And these non-T2 Macs:
nihau: iMac 27" (5K, 2019)
Revision B3 is present on at least these Apple T2 Macs:
bali: MacBook Pro 16" (2019)
trinidad: MacBook Pro 13" (2020, 4 TB3)
borneo: MacBook Pro 16" (2019, 5600M)
kahana: Mac Pro (2019)
kahana: Mac Pro (2019, Rack)
hanauma: iMac 27" (5K, 2020)
kure: iMac 27" (5K, 2020, 5700/XT)
Also fix the firmware interface for 4364, from BCA to WCC.
Fixes: 24f0bd136264 ("brcmfmac: add the BRCM 4364 found in MacBook Pro 15,2")
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230212063813.27622-5-marcan@marcan.st
|
|
This chip is present on at least these Apple T2 Macs:
* tahiti: MacBook Pro 13" (2020, 2 TB3)
* formosa: MacBook Pro 13" (Touch/2019)
* fiji: MacBook Air 13" (Scissor, 2020)
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230212063813.27622-4-marcan@marcan.st
|
|
This chip is present on at least these Apple T2 Macs:
* hawaii: MacBook Air 13" (Late 2018)
* hawaii: MacBook Air 13" (True Tone, 2019)
Users report seeing PCI revision ID 12 for this chip, which Arend
reports should be revision C2, but Apple has the firmware tagged as
revision C1. Assume the right cutoff point for firmware versions is
revision ID 11 then, and leave older revisions using the non-versioned
firmware filename (Apple only uses C1 firmware builds).
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230212063813.27622-3-marcan@marcan.st
|
|
The commit that introduced support for this chip incorrectly claimed it
is a Cypress-specific part, while in actuality it is just a variant of
BCM4355 silicon (as evidenced by the chip ID).
The relationship between Cypress products and Broadcom products isn't
entirely clear but given what little information is available and prior
art in the driver, it seems the convention should be that originally
Broadcom parts should retain the Broadcom name.
Thus, rename the relevant constants and firmware file. Also rename the
specific 89459 PCIe ID to BCM43596, which seems to be the original
subvariant name for this PCI ID (as defined in the out-of-tree bcmdhd
driver).
Since Cypress added this part and will presumably be providing its
supported firmware, we keep the CYW designation for this device.
We also drop the RAW device ID in this commit. We don't do this for the
other chips since apparently some devices with them exist in the wild,
but there is already a 4355 entry with the Broadcom subvendor and WCC
firmware vendor, so adding a generic fallback to Cypress seems
redundant (no reason why a device would have the raw device ID *and* an
explicitly programmed subvendor).
Fixes: dce45ded7619 ("brcmfmac: Support 89459 pcie")
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230212063813.27622-2-marcan@marcan.st
|
|
pahole flags for v1.25
v1.25 of pahole supports filtering out functions with multiple inconsistent
function prototypes or optimized-out parameters from the BTF representation.
These present problems because there is no additional info in BTF saying which
inconsistent prototype matches which function instance to help guide attachment,
and functions with optimized-out parameters can lead to incorrect assumptions
about register contents.
So for now, filter out such functions while adding BTF representations for
functions that have "."-suffixes (foo.isra.0) but not optimized-out parameters.
This patch assumes that below linked changes land in pahole for v1.25.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/1675790102-23037-1-git-send-email-alan.maguire@oracle.com
Link: https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com
|
|
In order to replace the open coded name cache in send with the lru cache,
we need an API for the lru cache to delete a specific entry for which we
did a previous lookup. This adds the API for it, and a next patch in the
series will use it.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This allows an optional generation number to be associated with each entry
of the lru cache. Entries with the same key but different generations are
stored in the linked list to which the maple tree points. This is meant
to be used when there's a small number of different generations, so the
impact of searching a linked list is negligible. The goal is to get rid of
the open coded name cache in the send code (which uses a radix tree and
a similar linked list of values/entries) and use instead the lru cache
module. For that particular use case we have at most 2 generations that
are associated to each key (inode number): one generation for the send
root and another generation for the parent root. The actual migration of
the send name cache is done in the next patch in the series.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During an incremental send, when processing the reference for an inode
we need to check if the directory where the new reference is located was
already created before creating the new reference. This check, which is
done by the helper did_create_dir(), can be expensive if the directory
has many entries, since it consists in searching the send root's b+tree
and visiting every single dir index key until we either find one which
points to an inode with a number smaller than the current inode's number
or until we visited all index keys. So it doesn't scale well for very
large directories.
So improve on this by caching created directories using a lru cache, and
limiting its size to 64 entries, which results in using at most 4096
bytes of memory. The caching is optional, if we fail to allocate memory,
we just proceed as before and use the existing slower path.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The lru cache is backed by a maple tree, which uses the unsigned long
type for keys, and that type has a width of 32 bits on 32-bit systems
and a width of 64 bits on 64-bit systems.
Currently there is only one user of the lru cache, the send backref cache,
which uses a sector number as a key, a logical address right shifted by
fs_info->sectorsize_bits, so a 32 bits width is not yet a problem (the
same happens with the radix tree we use to track extent buffers,
fs_info->buffer_radix).
However the next patches in the series will start using the lru cache for
cases where inode numbers are the keys, and the inode numbers are always
64 bits, even if we are running on a 32 bits system.
So adapt the lru cache to allow multiple values under the same key, by
having the maple tree store a head entry that points to a list of entries
instead of pointing to a single entry. This is a similar approach to what
we currently do for the name cache in send (which uses a radix tree that
has indexes with an unsigned long type as well), and will allow later to
use the lru cache for the send name cache as well.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
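The approach described in this commit message, where an index of type
unsigned long points to a list of entries that carry the full 64-bit key,
can be sketched as follows. This is a simplified stand-in and not the
btrfs implementation: a small array replaces the maple tree, uint32_t
simulates a 32-bit unsigned long, LRU eviction is omitted, and all names
are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define NSLOTS 64	/* tiny table standing in for the maple tree */

struct entry_sketch {
	uint64_t key;			/* full 64-bit key */
	struct entry_sketch *next;	/* entries sharing the same index */
};

struct cache_sketch {
	struct entry_sketch *heads[NSLOTS];
};

/* Simulate a 32-bit "unsigned long" index by truncating the key. */
static uint32_t index_of(uint64_t key)
{
	return (uint32_t)key;
}

/* Add an entry at the front of the list for its index. */
static int cache_store(struct cache_sketch *c, uint64_t key)
{
	struct entry_sketch *e = malloc(sizeof(*e));
	uint32_t slot = index_of(key) % NSLOTS;

	if (!e)
		return -1;
	e->key = key;
	e->next = c->heads[slot];
	c->heads[slot] = e;
	return 0;
}

/* Walk the list for the index and compare the full 64-bit key. */
static struct entry_sketch *cache_lookup(const struct cache_sketch *c,
					 uint64_t key)
{
	struct entry_sketch *e = c->heads[index_of(key) % NSLOTS];

	for (; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}
```

Keys such as 0x1 and 0x100000001 truncate to the same index here, yet a
lookup still tells them apart because the comparison uses the full key
stored in each list entry.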
|
|
The backref cache is a cache backed by a maple tree and a linked list to
keep track of temporal access to cached entries (the LRU entry is always at
the head of the list). This type of caching method is going to be useful
in other scenarios, so make the cache implementation more generic and
move it into its own header and source files.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After we allocate the send context object and before we initialize all
the red black trees, we can jump to the 'out' label if some errors happen,
and then under the 'out' label we use RB_EMPTY_ROOT() against some of
those trees, which we have not yet initialized. This happens to work out
ok because the send context object was initialized to zeroes with kzalloc
and the RB_ROOT initializer just happens to have the following definition:
#define RB_ROOT (struct rb_root) { NULL, }
But it's really neither clean nor a good practice as RB_ROOT is supposed
to be opaque and in case it changes or we change those red black trees to
some other data structure, it leaves us in a precarious situation.
So initialize all the red black trees immediately after allocating the
send context and before any jump into the 'out' label.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
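The ordering argued for in this commit message can be sketched in user
space as follows. The tree type, its initializer and the context struct
are hypothetical stand-ins for the kernel's rb_root, RB_ROOT and the send
context. The point is only that every root is explicitly initialized
before the first jump to the cleanup label.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for an opaque tree type and its helpers. */
struct tree_root_sketch { void *node; };
#define TREE_ROOT_INIT ((struct tree_root_sketch){ NULL })
#define TREE_EMPTY(root) ((root)->node == NULL)

struct send_ctx_sketch {
	struct tree_root_sketch pending_moves;
	struct tree_root_sketch orphan_dirs;
	char *buf;
};

/*
 * Returns 0 on success and -1 on failure. The trees are initialized
 * right after allocation, so the cleanup label may inspect them on every
 * path without relying on the allocator's zero fill.
 */
static int send_sketch(size_t buf_size, int fail_early)
{
	struct send_ctx_sketch *sctx;
	int ret = 0;

	sctx = calloc(1, sizeof(*sctx));
	if (!sctx)
		return -1;

	sctx->pending_moves = TREE_ROOT_INIT;
	sctx->orphan_dirs = TREE_ROOT_INIT;

	if (fail_early) {
		ret = -1;
		goto out;
	}

	sctx->buf = malloc(buf_size);
	if (!sctx->buf) {
		ret = -1;
		goto out;
	}
out:
	/* Safe on every path: both roots were initialized above. */
	if (!TREE_EMPTY(&sctx->pending_moves) || !TREE_EMPTY(&sctx->orphan_dirs))
		ret = -1;
	free(sctx->buf);
	free(sctx);
	return ret;
}
```

If the tree type later changed so that an all-zero object was no longer a
valid empty tree, this ordering would keep working, while code that
depends on the zeroed allocation would not.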
|
|
When processing the new references for an inode, we unnecessarily iterate
the waiting dir moves rbtree twice, once with is_waiting_for_move() and,
if we find an entry in the rbtree, again with a call to
get_waiting_dir_move(). This is pointless, we can make this simpler and
more efficient by calling only get_waiting_dir_move(), so just do that.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During an incremental send, every time we remove a reference (dentry) for
an inode and the parent directory does not exist anymore in the send
root, we go check if we can remove the directory by making a call to
can_rmdir(). This helper can only return true (value 1) if all dentries
were already removed, and for that it always does a search on the parent
root for dir index keys - if it finds any dentry referring to an inode
with a number higher than the inode currently being processed, then the
directory can not be removed and it must return false (value 0).
However that means if a directory that was deleted had 1000 dentries, and
each one pointed to an inode with a number higher than the number of the
directory's inode, we end up doing 1000 searches on the parent root.
Typically files are created in a directory after the directory was created
and therefore they get a higher inode number than the directory. It's
also common to have each dentry pointing to an inode with a higher
number than the inodes the previous dentries point to, for example when
creating a series of files inside a directory, a very common pattern.
So improve on that by having the first call to can_rmdir() for a directory
to check the number of the inode that the last dentry points to and cache
that inode number in the orphan dir structure. Then every subsequent call
to can_rmdir() can avoid doing a search on the parent root if the number
of the inode currently being processed is smaller than the cached inode number
at the directory's orphan dir structure.
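The caching described above can be modelled roughly as follows. This is
a toy sketch, not the kernel implementation: the struct, its field names
and the array of dentry inode numbers are hypothetical, and the
"searches" counter only simulates the cost of a search on the parent
root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-directory orphan dir structure. */
struct orphan_dir_sketch {
	const uint64_t *dentry_inos;	/* inodes the dentries point to, in dir index order */
	size_t nr_dentries;
	uint64_t last_dentry_ino;	/* cached inode number of the last dentry */
	bool cached;
	unsigned long searches;		/* simulated parent root searches */
};

/* Simulated expensive search: does any dentry point to a higher inode? */
static bool has_unprocessed_dentry(struct orphan_dir_sketch *od, uint64_t cur_ino)
{
	od->searches++;
	for (size_t i = 0; i < od->nr_dentries; i++)
		if (od->dentry_inos[i] > cur_ino)
			return true;
	return false;
}

static bool can_rmdir_sketch(struct orphan_dir_sketch *od, uint64_t cur_ino)
{
	/* First call: remember which inode the last dentry points to. */
	if (!od->cached) {
		od->last_dentry_ino = od->nr_dentries ?
			od->dentry_inos[od->nr_dentries - 1] : 0;
		od->cached = true;
	}
	/* Fast path: the last dentry's inode is not processed yet, skip the search. */
	if (cur_ino < od->last_dentry_ino)
		return false;
	return !has_unprocessed_dentry(od, cur_ino);
}
```

With dentries pointing to inodes 258, 259 and 260, calls made while
processing inodes 258 and 259 return false without any search, and only
the call for inode 260 performs one.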
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At can_rmdir() we start by searching the orphan dirs rbtree for an orphan
dir object for the target directory. Later when iterating over the dir
index keys, if we find that any dir entry points to an inode for which there
is a pending dir move or the inode was not yet processed, we exit because
we can't remove the directory yet. However we end up always calling
add_orphan_dir_info(), which will iterate the rbtree again and if there is
already an orphan dir object (created by the first call to can_rmdir()),
it returns the existing object. This is unnecessary work because in case
there is already an existing orphan dir object, we got a reference to it
at the start of can_rmdir(). So skip the call to add_orphan_dir_info()
if we already have a reference for an orphan dir object.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At can_rmdir() we are allocating and initializing an orphan dir object
twice. This can be deduplicated outside of the loop that iterates over
the dir index keys. So deduplicate that code, especially because another patch
in the series will need to add more initialization code and another one
will add one more condition.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers of can_rmdir() pass sctx->cur_ino as the value for the
send_progress argument, so remove the argument and directly use
sctx->cur_ino.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During an incremental send, when processing the new references of an inode
(either it's a new inode or an existing one renamed/moved), we will search
the b+tree of the send or parent roots in order to find out the inode item
of the parent directory and extract its generation. However we are doing
that search twice, once with is_inode_existent() -> get_cur_inode_state()
and then again at did_overwrite_ref() or will_overwrite_ref().
So avoid that and get the generation at get_cur_inode_state() and then
propagate it up to did_overwrite_ref() and will_overwrite_ref().
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are no resources to release before will_overwrite_ref() returns, so
we don't really need the 'out' label and jumping to it when conditions are
met - we can directly return and get rid of the label and jumps. Also we
can deal with -ENOENT and other errors in a single if-else logic, as it's
more straightforward.
This helps the next patch in the series to be simpler as well.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At did_overwrite_ref() we always call get_inode_gen() to find out the
generation of the inode 'ow_inode'. However we don't always need to use
that generation, and in fact it's very common to not use it, so we end
up doing a b+tree search on the send root, allocating a path, etc, for
nothing. So improve on this by getting the generation only if we need
to use it.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are no resources to release before did_overwrite_ref() returns, so
we don't really need the 'out' label and jumping to it when conditions are
met - we can directly return and get rid of the label and jumps. Also we
can deal with -ENOENT and other errors in a single if-else logic, as it's
more straightforward.
This helps the next patch in the series to be simpler as well.
This patch is part of a larger patchset and the changelog of the last
patch in the series contains a sample performance test and results.
The patches that comprise the patchset are the following:
btrfs: send: directly return from did_overwrite_ref() and simplify it
btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
btrfs: send: directly return from will_overwrite_ref() and simplify it
btrfs: send: avoid extra b+tree searches when checking reference overrides
btrfs: send: remove send_progress argument from can_rmdir()
btrfs: send: avoid duplicated orphan dir allocation and initialization
btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
btrfs: send: reduce searches on parent root when checking if dir can be removed
btrfs: send: iterate waiting dir move rbtree only once when processing refs
btrfs: send: initialize all the red black trees earlier
btrfs: send: genericize the backref cache to allow it to be reused
btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
btrfs: send: cache information about created directories
btrfs: allow a generation number to be associated with lru cache entries
btrfs: add an api to delete a specific entry from the lru cache
btrfs: send: use the lru cache to implement the name cache
btrfs: send: update size of roots array for backref cache entries
btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Since the introduction of per-fs feature sysfs interface
(/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
updated.
Thus for the following case, that directory will not show the new
features like RAID56:
# mkfs.btrfs -f $dev1 $dev2 $dev3
# mount $dev1 $mnt
# btrfs balance start -f -mconvert=raid5 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes skinny_metadata
After unmount and mount, we get the correct features:
# umount $mnt
# mount $dev1 $mnt
# ls /sys/fs/btrfs/$uuid/features/
extended_iref free_space_tree no_holes raid56 skinny_metadata
[CAUSE]
We never actually update the content of the per-fs features/
directory.
We had an attempt to update the features directory dynamically in commit
14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs
files"), but unfortunately it got reverted in commit e410e34fad91
("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
The problem in the original patch is, in the context of
btrfs_create_chunk(), we can not afford to update the sysfs group.
The exported but never used function btrfs_sysfs_feature_update() is a
leftover of that attempt. Even if we call sysfs_update_group(), new
files need extra memory allocation, and we have no way to make the
sysfs update use GFP_NOFS.
[FIX]
This patch will address the old problem by doing asynchronous sysfs
update in the cleaner thread.
This involves the following changes:
- Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set
BTRFS_FS_FEATURE_CHANGED flag when needed
- Update btrfs_sysfs_feature_update() to use sysfs_update_group()
And drop unnecessary arguments.
- Call btrfs_sysfs_feature_update() in cleaner_kthread
If we have the BTRFS_FS_FEATURE_CHANGED flag set.
- Wake up cleaner_kthread in btrfs_commit_transaction if we have
BTRFS_FS_FEATURE_CHANGED flag
With this, all the previously dangerous call sites like
btrfs_create_chunk() need no new changes, as the above helpers will
have already set the BTRFS_FS_FEATURE_CHANGED flag.
The real work happens at cleaner_kthread, so we pay the cost of
delaying the update to the sysfs directory. The delay should be
small enough that the end user can not notice it, though it might get
longer if the cleaner thread is busy with removing subvolumes or
defrag.
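The flag-then-defer pattern above can be sketched in userspace C as
follows. This is only an illustration under stated assumptions: the
struct, the function names and the feature bit are hypothetical, a
C11 atomic_bool stands in for the BTRFS_FS_FEATURE_CHANGED bit, and
copying one integer stands in for the real sysfs group update.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit only, not necessarily the value the kernel uses. */
#define FEATURE_RAID56_SKETCH (1ULL << 7)

struct fs_info_sketch {
	uint64_t incompat_flags;	/* authoritative feature bits */
	atomic_bool feature_changed;	/* models BTRFS_FS_FEATURE_CHANGED */
	uint64_t sysfs_visible_flags;	/* models what features/ currently shows */
};

/* Cheap and allocation-free, so it is safe in restricted contexts. */
static void set_incompat_sketch(struct fs_info_sketch *fs, uint64_t flag)
{
	if (!(fs->incompat_flags & flag)) {
		fs->incompat_flags |= flag;
		atomic_store(&fs->feature_changed, true);
	}
}

/*
 * Runs later in a context that may allocate (the cleaner thread in the
 * text). Returns true if it actually refreshed the visible state.
 */
static bool cleaner_pass_sketch(struct fs_info_sketch *fs)
{
	if (!atomic_exchange(&fs->feature_changed, false))
		return false;
	fs->sysfs_visible_flags = fs->incompat_flags;
	return true;
}
```

Between the flag being set and the next cleaner pass, the visible state
is stale, which is the delay the changelog accepts as a trade-off.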
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
extent-tree.h is included more than once, added in a0231804affe ("btrfs:
move extent-tree helpers into their own header file").
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When debugging a scrub related metadata error, it turns out that our
metadata error reporting is not ideal.
The only 3 error messages are:
- BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
Showing we have metadata generation mismatch errors.
- BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
Showing which tree blocks are corrupted.
- BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
Showing which physical range the corrupted metadata is at.
We have to combine the above 3 to know we have corrupted metadata with
a generation mismatch.
And this is already the better case. If we have other problems, like an
fsid mismatch, we can not even know the cause.
[CAUSE]
The problem is caused by the fact that, scrub_checksum_tree_block()
never outputs any error message.
It just returns two bits for scrub: sblock->header_error, and
sblock->generation_error.
Later we report the error in scrub_print_warning(), but unfortunately we
only have two bits, so there is not much we can do to print
any detailed errors.
[FIX]
This patch will do the following to enhance the error reporting of
metadata scrub:
- Add extra warning (ratelimited) for every error we hit
This can help us to distinguish the different types of errors.
Some errors can help us to know what's going wrong immediately,
like bytenr mismatch.
- Re-order the checks
Currently we check bytenr first, then immediately generation.
This can lead to false generation mismatch reports when it is actually
the fsid that mismatches.
Here is the new output for the bug I'm debugging (we forgot to
writeback tree blocks for commit roots):
BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
Now we can immediately know that some tree blocks didn't even get written
back, rather than seeing the original confusing generation mismatch.
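To illustrate why the order of the checks matters, here is a small
sketch, not the scrub code itself. The header struct, the enum and the
function name are hypothetical, and the only ordering taken from the
text is that the fsid is checked before the generation. The first
failing check determines the error that gets reported.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down tree block header. */
struct tree_block_hdr_sketch {
	uint64_t bytenr;
	uint8_t fsid[16];
	uint64_t generation;
};

enum hdr_error_sketch {
	HDR_OK,
	HDR_BAD_BYTENR,
	HDR_BAD_FSID,
	HDR_BAD_GENERATION,
};

/* Report the first, most fundamental mismatch between 'found' and 'want'. */
static enum hdr_error_sketch
check_header_sketch(const struct tree_block_hdr_sketch *found,
		    const struct tree_block_hdr_sketch *want)
{
	if (found->bytenr != want->bytenr)
		return HDR_BAD_BYTENR;
	/* fsid before generation: a stale block from another fs is not a "gen" error. */
	if (memcmp(found->fsid, want->fsid, sizeof(found->fsid)) != 0)
		return HDR_BAD_FSID;
	if (found->generation != want->generation)
		return HDR_BAD_GENERATION;
	return HDR_OK;
}
```

A block that was never written back typically differs in both fsid and
generation. With this order it is reported as a bad fsid, which points
at the real problem.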
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a file system has ZNS devices which are constrained by a maximum
number of active block groups, then not being able to use all the block
groups for every allocation is not ideal, and could cause us to loop a
ton with mixed size allocations.
In general, since zoned doesn't write into gaps behind where block
groups are writing, it is not susceptible to the same sort of
fragmentation that size classes are designed to solve, so we can skip
size classes for zoned file systems in general, even though there would
probably be no harm for SMR devices.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the size class is an artifact of an arbitrary anti fragmentation
strategy, it doesn't really make sense to persist it. Furthermore, most
of the size class logic assumes fresh block groups. That is of course
not a reasonable assumption -- we will be upgrading kernels with
existing filesystems whose block groups are not classified.
To work around those issues, implement logic to compute the size class
of the block groups as we cache them in. To perfectly assess the state
of a block group, we would have to read the entire extent tree (since
the free space cache mashes together contiguous extent items) which
would be prohibitively expensive for larger file systems with more
extents.
We can do it relatively cheaply by implementing a simple heuristic of
sampling a handful of extents and picking the smallest one we see. In
the happy case where the block group was classified, we will only see
extents of the correct size. In the unhappy case, we will hopefully find
one of the smaller extents, but there is no perfect answer anyway.
Autorelocation will eventually churn up the block group if there is
significant freeing anyway.
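A rough sketch of the sampling heuristic follows. The changelog only
says that a handful of extents are sampled and the smallest is picked,
so the evenly strided sampling, the function name and its parameters
are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look at no more than 'max_samples' extent sizes, spread evenly over
 * the array, and return the smallest one seen. Returns 0 when there is
 * nothing to sample.
 */
static uint64_t smallest_sampled_extent(const uint64_t *sizes, size_t nr,
					size_t max_samples)
{
	uint64_t min = 0;
	size_t step;

	if (nr == 0 || max_samples == 0)
		return 0;

	/* Constant work per block group, regardless of how many extents it has. */
	step = nr > max_samples ? nr / max_samples : 1;
	for (size_t i = 0, taken = 0; i < nr && taken < max_samples;
	     i += step, taken++) {
		if (taken == 0 || sizes[i] < min)
			min = sizes[i];
	}
	return min;
}
```

The result can then be mapped to a size class. Because only a few
extents are inspected, the smallest extent in the block group can be
missed, which matches the "no perfect answer" caveat above.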
There was no regression in mount performance at end state of the fsperf
test suite, and the delay until the block group is marked cached is
minimized by the constant number of extent samples.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.
This patch categorizes extents into size classes:
- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"
and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.
Size classes are implemented in the following way:
- Mark each new block group with a size class of the first allocation
that goes into it.
- Add two new passes to ffe: "unset size class" and "wrong size class".
First, try only matching block groups, then try unset ones, then allow
allocation of new ones, and finally allow mismatched block groups.
- Filtering is done just by skipping inappropriate ones, there is no
special size class indexing.
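The classification and the pass-based filtering can be sketched as
below. This is a simplified model, not the allocator code: all names
are hypothetical, and putting extents of exactly 128KiB or 8MiB into
the lower class is an assumption, since the list above does not say
where the boundary values go.

```c
#include <stdbool.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

enum size_class_sketch { SC_NONE, SC_SMALL, SC_MEDIUM, SC_LARGE };

/* Search passes, in the order the text lists them. */
enum search_pass_sketch {
	PASS_MATCHING,		/* only block groups of the same class */
	PASS_UNSET,		/* also block groups with no class yet */
	PASS_ALLOC_NEW,		/* a new block group may be allocated */
	PASS_MISMATCHED,	/* finally, any block group */
};

static enum size_class_sketch classify_extent(uint64_t size)
{
	/* Assumption: exact boundary sizes fall into the lower class. */
	if (size <= 128 * KIB)
		return SC_SMALL;
	if (size <= 8 * MIB)
		return SC_MEDIUM;
	return SC_LARGE;
}

/* Filtering is just skipping: no per-class index is kept. */
static bool block_group_usable(enum size_class_sketch bg_class,
			       uint64_t extent_size,
			       enum search_pass_sketch pass)
{
	enum size_class_sketch want = classify_extent(extent_size);

	if (bg_class == want)
		return true;
	if (bg_class == SC_NONE)
		return pass >= PASS_UNSET;
	return pass >= PASS_MISMATCHED;
}

/* The first allocation into a block group stamps its class. */
static enum size_class_sketch class_after_alloc(enum size_class_sketch bg_class,
						uint64_t extent_size)
{
	return bg_class == SC_NONE ? classify_extent(extent_size) : bg_class;
}
```

Adding another size class in this model only needs a new enum value and
one more comparison in classify_extent().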
Other solutions I considered were:
- A best fit allocator with an rb-tree. This worked well, as small
writes didn't leak big holes from large freed extents, but led to
regressions in ffe and write performance due to lock contention on
the rb-tree with every allocation possibly updating it in parallel.
Perhaps something clever could be done to do the updates in the
background while being "right enough".
- A fixed size "working set". This prevents freeing an extent
drastically changing where writes currently land, and seems like a
good option too. Doesn't take advantage of size in any way.
- The same size class idea, but implemented with xarray marks. This
turned out to be slower than looping the linked list and skipping
wrong block groups, and is also less flexible since we must have only
3 size classes (max #marks). With the current approach we can have as
many as we like.
Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.
A brief summary of the testing results:
- Neutral results on existing tests. There are some minor regressions
and improvements here and there, but nothing that truly stands out as
notable.
- Improvement on new tests where size class and extent lifetime are
correlated. Fragmentation in these cases is completely eliminated
and write performance is generally a little better. There is also
significant improvement where extent sizes are just a bit larger than
the size class boundaries.
- Regression on one new test, where the allocations are sized
intentionally a hair under the borders of the size classes. Results
are neutral on the test that intentionally attacks this new scheme by
mixing extent size and lifetime.
The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)
Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:
bufferedappendvsfallocate results
metric                    baseline      current       stdev         diff
============================================================================
avg_commit_ms             31.13         29.20         2.67          -6.22%
bg_count                  14            15.60         0             11.43%
commits                   11.10         12.20         0.32          9.91%
elapsed                   27.30         26.40         2.98          -3.30%
end_state_mount_ns        11122551.90   10635118.90   851143.04     -4.38%
end_state_umount_ns       1.36e+09      1.35e+09      12248056.65   -1.07%
find_free_extent_calls    116244.30     114354.30     964.56        -1.63%
find_free_extent_ns_max   599507.20     1047168.20    103337.08     74.67%
find_free_extent_ns_mean  3607.19       3672.11       101.20        1.80%
find_free_extent_ns_min   500           512           6.67          2.40%
find_free_extent_ns_p50   2848          2876          37.65         0.98%
find_free_extent_ns_p95   4916          5000          75.45         1.71%
find_free_extent_ns_p99   20734.49      20920.48      1670.93       0.90%
frag_pct_max              61.67         0             8.05          -100.00%
frag_pct_mean             43.59         0             6.10          -100.00%
frag_pct_min              25.91         0             16.60         -100.00%
frag_pct_p50              42.53         0             7.25          -100.00%
frag_pct_p95              61.67         0             8.05          -100.00%
frag_pct_p99              61.67         0             8.05          -100.00%
fragmented_bg_count       6.10          0             1.45          -100.00%
max_commit_ms             49.80         46            5.37          -7.63%
sys_cpu                   2.59          2.62          0.29          1.39%
write_bw_bytes            1.62e+08      1.68e+08      17975843.50   3.23%
write_clat_ns_mean        57426.39      54475.95      2292.72       -5.14%
write_clat_ns_p50         46950.40      42905.60      2101.35       -8.62%
write_clat_ns_p99         148070.40     143769.60     2115.17       -2.90%
write_io_kbytes           4194304       4194304       0             0.00%
write_iops                2476.15       2556.10       274.29        3.23%
write_lat_ns_max          2101667.60    2251129.50    370556.59     7.11%
write_lat_ns_mean         59374.91      55682.00      2523.09       -6.22%
write_lat_ns_min          17353.10      16250         1646.08       -6.36%
There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.
On balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.
Some considerations for future work:
- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
find_free_extent is a complicated function. It consists (at least) of:
- a hint that jumps into the middle of a for loop macro
- a middle loop trying every raid level
- an outer loop ascending through ffe loop levels
- complicated logic for skipping some of those ffe loop levels
- multiple underlying in-bg allocators (zoned, cluster, no cluster)
Which is all to say that more tracing is helpful for debugging its
behavior. Add two new tracepoints: at the entrance to the block_groups
loop (hit for every raid level and every ffe_ctl loop) and at the point
we seriously consider a block_group for allocation. This way we can see
the whole path through the algorithm, including hints, multiple loops,
etc.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The allocator tracepoints currently have a pile of values from ffe_ctl.
In modifying the allocator and adding more tracepoints, I found myself
adding to the already long argument list of the tracepoints. It makes it
a lot simpler to just send in the ffe_ctl itself.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The wait argument is always set to 1, so remove it.
Last use of wait with 0 was in 0c304304feab ("Btrfs: remove
csum_bytes_left").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|