Age | Commit message | Author
2023-02-13 | arm64: dts: qcom: sc7280: Adjust zombie PWM frequency | Owen Yang
Tune the PWM to solve screen flashing issue and high frequency noise. While at it, the comment for the PWM settings incorrectly said we were using a 5kHz duty cycle. It should have said "period", not "duty cycle". Correct this while updating the values. Signed-off-by: Owen Yang <ecs.taipeikernel@gmail.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213105803.v2.1.I610cef0ead2d5df1f7bd18bc0e0ae040b03725d0@changeid
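The period versus duty cycle distinction in the message above can be illustrated with a small calculation. This is only a sketch: the helper names are hypothetical, and the 5 kHz figure is the one quoted from the old comment, not the new value chosen by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Period, in nanoseconds, of a PWM signal running at freq_hz. */
static uint64_t pwm_period_ns(uint32_t freq_hz)
{
	assert(freq_hz > 0);
	return 1000000000ull / freq_hz;
}

/*
 * Time the output is active within one period, for a duty cycle
 * expressed in percent. The duty cycle is a fraction of the period,
 * so a frequency such as 5 kHz describes the period, not the duty cycle.
 */
static uint64_t pwm_active_ns(uint64_t period_ns, unsigned int duty_percent)
{
	assert(duty_percent <= 100);
	return period_ns * duty_percent / 100;
}
```

With these helpers, a 5 kHz signal has a period of 200000 ns, and a 50% duty cycle keeps the output active for 100000 ns of each period.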
2023-02-13 | arm64: dts: qcom: sc8280xp-pmics: Specify interrupt parent explicitly | Manivannan Sadhasivam
Nodes like pwrkey, resin, iadc, adc-tm, temp-alarm which are the grandchildren of spmi_bus node represent the interrupt generating devices but don't have "interrupt-parent" property. As per the devicetree spec v0.3, section 2.4: "The physical wiring of an interrupt source to an interrupt controller is represented in the devicetree with the interrupt-parent property. Nodes that represent interrupt-generating devices contain an interrupt-parent property which has a phandle value that points to the device to which the device’s interrupts are routed, typically an interrupt controller. If an interrupt-generating device does not have an interrupt-parent property, its interrupt parent is assumed to be its devicetree parent." This clearly says that if the "interrupt-parent" property is absent, then the immediate devicetree parent will be assumed as the interrupt parent. But the immediate parents of these nodes are not interrupt controllers themselves. This may lead to failure while wiring the interrupt for these nodes by an operating system. But a few operating systems, like Linux, work around this issue by walking up the parent nodes until they find the "interrupt-cells" property. Then the node that has the "interrupt-cells" property will be used as the interrupt parent. But this workaround is not as per the DT spec and is not being implemented by other operating systems such as OpenBSD. Hence, fix this issue by adding the "interrupts-extended" property that explicitly specifies the spmi_bus node as the interrupt parent. Note that the "interrupts-extended" property is chosen over "interrupt-parent" as it allows specifying both interrupt parent phandle and interrupt specifiers in a single property. 
Reported-by: Patrick Wildt <patrick@blueri.se> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213090118.11527-1-manivannan.sadhasivam@linaro.org
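The difference between the two lookup strategies described above can be sketched with a toy node structure. This is a simplified illustration, not the kernel's or OpenBSD's actual code: `struct dt_node` and both functions are hypothetical, and a boolean flag stands in for the presence of the interrupt-cells property.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dt_node {
	const char *name;
	struct dt_node *parent;           /* devicetree parent */
	struct dt_node *interrupt_parent; /* explicit phandle, or NULL */
	bool has_interrupt_cells;         /* node carries the interrupt-cells property */
};

/* Reading of the spec text quoted above: explicit phandle, else the immediate parent. */
static struct dt_node *irq_parent_spec(const struct dt_node *n)
{
	assert(n != NULL);
	return n->interrupt_parent ? n->interrupt_parent : n->parent;
}

/* Workaround described above: walk up until a node with interrupt-cells is found. */
static struct dt_node *irq_parent_walk(const struct dt_node *n)
{
	assert(n != NULL);
	if (n->interrupt_parent)
		return n->interrupt_parent;
	for (struct dt_node *p = n->parent; p; p = p->parent)
		if (p->has_interrupt_cells)
			return p;
	return NULL;
}
```

For a pwrkey node two levels below spmi_bus, the first function returns the intermediate node (which is not an interrupt controller), while the second returns spmi_bus. Once the interrupt parent is given explicitly, both agree.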
2023-02-13 | arm64: dts: qcom: sm7225-fairphone-fp4: enable remaining i2c busses | Luca Weiss
Enable all i2c busses where something is connected on this phone. Add comments as placeholders for which components are still missing. Also enable gpi_dma and the other qupv3 for that. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213-fp4-more-i2c-v2-2-1c459c572f80@fairphone.com
2023-02-13 | arm64: dts: qcom: sm7225-fairphone-fp4: move status property down | Luca Weiss
Currently the dts contains a mix of status-as-first-property (old qcom style) and status-as-last-property (new style). Move all status properties down to the bottom once and for all so that the style is consistent between different nodes. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213-fp4-more-i2c-v2-1-1c459c572f80@fairphone.com
2023-02-13 | arm64: dts: qcom: pmk8350: Use the correct PON compatible | Konrad Dybcio
A special compatible was introduced for PMK8350 both in the driver and the bindings to facilitate for 2 base registers (PBS & HLOS). Use it. Fixes: b2de43136058 ("arm64: dts: qcom: pmk8350: Add peripherals for pmk8350") Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213212930.2115182-1-konrad.dybcio@linaro.org
2023-02-13 | arm64: defconfig: Enable DisplayPort on SC8280XP laptops | Bjorn Andersson
The QCOM_PMIC_GLINK implements the parts of a TCPM necessary for negotiating DP altmode and the TYPEC_MUX_GPIO_SBU driver is used for controlling connection and orientation switching of the SBU lanes in the USB-C connector. Enable these to enable USB Type-C DisplayPort on SC8280XP laptops. Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213215619.1362566-5-quic_bjorande@quicinc.com
2023-02-13 | arm64: dts: qcom: sc8280xp-x13s: Enable external display | Bjorn Andersson
Like on the CRD, add the necessary nodes to enable USB Type-C altmode-based external display on the Lenovo ThinkPad X13s. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213215619.1362566-4-quic_bjorande@quicinc.com
2023-02-13 | arm64: dts: qcom: sc8280xp-crd: Introduce pmic_glink | Bjorn Andersson
The SC8280XP CRD controls battery management and its two USB Type-C ports using pmic_glink and two GPIO-based SBU muxes. Enable the two DisplayPort instances, GPIO SBU mux instance and pmic_glink with the two connectors on the CRD. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213215619.1362566-3-quic_bjorande@quicinc.com
2023-02-13 | arm64: dts: qcom: sc8280xp: Add USB-C-related DP blocks | Bjorn Andersson
Add the two DisplayPort controllers that are attached to QMP phys for providing display output on USB Type-C. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230213215619.1362566-2-quic_bjorande@quicinc.com
2023-02-13 | arm64: dts: qcom: sm8350-hdk: enable GPU | Dmitry Baryshkov
Enable the GPU on the SM8350-HDK device. The ZAP shader is required for the GPU to function properly. Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230209133839.762631-7-dmitry.baryshkov@linaro.org
2023-02-13 | arm64: dts: qcom: sm8350: add GPU, GMU, GPU CC and SMMU nodes | Dmitry Baryshkov
Add device nodes required to enable GPU on the SM8350 platform. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> [bjorn: Workaround for lacking RPMH_REGULATOR_LEVEL_LOW_SVS_L1 constant] Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230209133839.762631-6-dmitry.baryshkov@linaro.org
2023-02-13 | Merge tag 'mm-hotfixes-stable-2023-02-13-13-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds
Pull misc fixes from Andrew Morton: "Twelve hotfixes, mostly against mm/. Five of these fixes are cc:stable"
* tag 'mm-hotfixes-stable-2023-02-13-13-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem
  scripts/gdb: fix 'lx-current' for x86
  lib: parser: optimize match_NUMBER apis to use local array
  mm: shrinkers: fix deadlock in shrinker debugfs
  mm: hwpoison: support recovery from ksm_might_need_to_copy()
  kasan: fix Oops due to missing calls to kasan_arch_is_ready()
  revert "squashfs: harden sanity check in squashfs_read_xattr_id_table"
  fsdax: dax_unshare_iter() should return a valid length
  mm/gup: add folio to list when folio_isolate_lru() succeed
  aio: fix mremap after fork null-deref
  mailmap: add entry for Alexander Mikhalitsyn
  mm: extend max struct page size for kmsan
2023-02-13 | arm64: dts: qcom: sm8350: finish reordering nodes | Dmitry Baryshkov
Finish reordering DT nodes by their address. Move PDC, tsens, AOSS, SRAM, SPMI and TLMM nodes to the proper position. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230209133839.762631-5-dmitry.baryshkov@linaro.org
2023-02-13 | arm64: dts: qcom: sm8350: move more nodes to correct place | Dmitry Baryshkov
Continue ordering DT nodes by their address. Move RNG, UFS, system NoC and SLPI nodes to the proper position. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230209133839.762631-4-dmitry.baryshkov@linaro.org
2023-02-13 | arm64: dts: qcom: sm8350: reorder device nodes | Dmitry Baryshkov
Somehow sm8350 got its device nodes not fully sorted. Start reordering DT nodes by their address. Move apps SMMU, GIC, timer, apps RSC, cpufreq, ADSP and cDSP nodes to their proper position at the end of /soc/. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Bjorn Andersson <andersson@kernel.org> Link: https://lore.kernel.org/r/20230209133839.762631-3-dmitry.baryshkov@linaro.org
2023-02-13 | char/agp: introduce asm-generic/agp.h | Mike Rapoport
There are several architectures that duplicate definitions of map_page_into_agp(), unmap_page_from_agp() and flush_agp_cache(). Define those in asm-generic/agp.h and use it instead of duplicated per-architecture headers. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 | char/agp: consolidate {alloc,free}_gatt_pages() | Mike Rapoport
There is a copy of alloc_gatt_pages() and free_gatt_pages in several architectures in arch/$ARCH/include/asm/agp.h. All the copies do exactly the same: alias alloc_gatt_pages() to __get_free_pages(GFP_KERNEL) and alias free_gatt_pages() to free_pages(). Define alloc_gatt_pages() and free_gatt_pages() in drivers/char/agp/agp.h and drop per-architecture definitions. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 | arm64: configs: Add virtconfig | Mark Brown
Provide a slimline configuration intended to be booted on virtual machines, with the goal of providing a light configuration which will boot on and enable features available in mach-virt. This is defined in terms of the standard defconfig, with an additional virt.config fragment which disables options unneeded in a virtual configuration. As a first step we just disable all the ARCH_ configuration options, disabling the build of all the SoC specific drivers. This results in a kernel that builds about 25% faster in my testing. If this approach works for people we can add further options. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230203-arm64-defconfigs-v1-3-cd0694a05f13@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 | kbuild: Provide a version of merge_into_defconfig without override warnings | Mark Brown
While warning on overridden Kconfig options is a good default for merging config fragments, sometimes overriding is our explicit intent and the warnings are unhelpful. Add a new merge_into_defconfig_override which does the merge but with warnings suppressed. Since merge_into_defconfig accepts any number of fragments it is difficult to allow it to accept the flag. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230203-arm64-defconfigs-v1-2-cd0694a05f13@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 | scripts: merge_config: Add option to suppress warning on overrides | Mark Brown
Currently merge_config.sh will unconditionally warn if a fragment overrides any already set symbol. This is generally desirable but is inconvenient in cases where we want to create a fragment which disables unwanted options in the base configuration, for example when attempting to produce a smaller version of another configuration. Add an option -Q which will suppress these warnings. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230203-arm64-defconfigs-v1-1-cd0694a05f13@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-02-13 | ice: fix lost multicast packets in promisc mode | Jesse Brandeburg
There was a problem reported to us where the addition of a VF with an IPv6 address ending with a particular sequence would cause the parent device on the PF to no longer be able to respond to neighbor discovery packets. In this case, we had an ovs-bridge device living on top of a VLAN, which was on top of a PF, and it would not be able to talk anymore (the neighbor entry would expire and couldn't be restored). The root cause of the issue is that if the PF is asked to be in IFF_PROMISC mode (promiscuous mode) and it had an ipv6 address that needed the 33:33:ff:00:00:04 multicast address to work, then when the VF was added with the need for the same multicast address, the VF would steal all the traffic destined for that address. The ice driver didn't auto-subscribe a request of IFF_PROMISC to the "multicast replication from other port's traffic" meaning that it won't get for instance, packets with an exact destination in the VF, as above. The VF's IPv6 address, which adds a "perfect filter" for 33:33:ff:00:00:04, results in no packets for that multicast address making it to the PF (which is in promisc but NOT "multicast replication"). The fix is to enable "multicast promiscuous" whenever the driver is asked to enable IFF_PROMISC, and make sure to disable it when appropriate. Fixes: e94d44786693 ("ice: Implement filter sync, NDO operations and bump version") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
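The 33:33:ff:00:00:04 address in the message above is the Ethernet mapping of an IPv6 solicited-node multicast group, which depends only on the low 24 bits of the unicast address. The sketch below shows why two interfaces whose IPv6 addresses end in the same three bytes need the same MAC filter. The function names are hypothetical and this is not code from the ice driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Ethernet multicast MAC used for the solicited-node group of a
 * 16-byte IPv6 address: 33:33:ff followed by the address's last
 * three bytes.
 */
static void solicited_node_mac(const uint8_t ipv6[16], uint8_t mac[6])
{
	assert(ipv6 != NULL && mac != NULL);
	mac[0] = 0x33;
	mac[1] = 0x33;
	mac[2] = 0xff;
	memcpy(&mac[3], &ipv6[13], 3);
}

/* Two addresses share a solicited-node group when their low 24 bits match. */
static bool same_solicited_node_group(const uint8_t a[16], const uint8_t b[16])
{
	assert(a != NULL && b != NULL);
	return memcmp(&a[13], &b[13], 3) == 0;
}
```

Any address ending in 00:00:04 maps to 33:33:ff:00:00:04, so a perfect filter installed for a VF with such an address covers the same multicast MAC the PF relies on for neighbor discovery.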
2023-02-13 | ice: Fix check for weight and priority of a scheduling node | Michal Wilczynski
Currently the checks for weight and priority ranges don't check the incoming value from devlink. Instead they check the node's current weight or priority. This makes those checks useless. Change the range checks in ice_set_object_tx_priority() and ice_set_object_tx_weight() to check against the incoming priority and weight. Fixes: 42c2eb6b1f43 ("ice: Implement devlink-rate API") Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-13 | btrfs: add an api to delete a specific entry from the lru cache | Filipe Manana
In order to replace the open coded name cache in send with the lru cache, we need an API for the lru cache to delete a specific entry for which we did a previous lookup. This adds the API for it, and a next patch in the series will use it. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following:
  btrfs: send: directly return from did_overwrite_ref() and simplify it
  btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
  btrfs: send: directly return from will_overwrite_ref() and simplify it
  btrfs: send: avoid extra b+tree searches when checking reference overrides
  btrfs: send: remove send_progress argument from can_rmdir()
  btrfs: send: avoid duplicated orphan dir allocation and initialization
  btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
  btrfs: send: reduce searches on parent root when checking if dir can be removed
  btrfs: send: iterate waiting dir move rbtree only once when processing refs
  btrfs: send: initialize all the red black trees earlier
  btrfs: send: genericize the backref cache to allow it to be reused
  btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
  btrfs: send: cache information about created directories
  btrfs: allow a generation number to be associated with lru cache entries
  btrfs: add an api to delete a specific entry from the lru cache
  btrfs: send: use the lru cache to implement the name cache
  btrfs: send: update size of roots array for backref cache entries
  btrfs: send: cache utimes operations for directories if possible
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 | btrfs: allow a generation number to be associated with lru cache entries | Filipe Manana
This allows an optional generation number to be associated to each entry of the lru cache. Entries with the same key but different generations, are stored in the linked list to which the maple tree points to. This is meant to be used when there's a small number of different generations, so the impact of searching a linked list is negligible. The goal is to get rid of the open coded name cache in the send code (which uses a radix tree and a similar linked list of values/entries) and use instead the lru cache module. For that particular use case we have at most 2 generations that are associated to each key (inode number): one generation for the send root and another generation for the parent root. The actual migration of the send name cache is done in the next patch in the series. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry 
from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 | btrfs: send: cache information about created directories | Filipe Manana
During an incremental send, when processing the reference for an inode we need to check if the directory where the new reference is located was already created before creating the new reference. This check, which is done by the helper did_create_dir(), can be expensive if the directory has many entries, since it consists in searching the send root's b+tree and visiting every single dir index key until we either find one which points to an inode with a number smaller than the current inode's number or until we visited all index keys. So it doesn't scale well for very large directories. So improve on this by caching created directories using a lru cache, and limiting its size to 64 entries, which results in using at most 4096 bytes of memory. The caching is optional, if we fail to allocate memory, we just proceed as before and use the existing slower path. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. 
The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 | btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems | Filipe Manana
The lru cache is backed by a maple tree, which uses the unsigned long type for keys, and that type has a width of 32 bits on 32 bits systems and a width of 64 bits on 64 bits systems. Currently there is only one user of the lru cache, the send backref cache, which uses a sector number as a key, a logical address right shifted by fs_info->sectorsize_bits, so a 32 bits width is not yet a problem (the same happens with the radix tree we use to track extent buffers, fs_info->buffer_radix). However the next patches in the series will start using the lru cache for cases where inode numbers are the keys, and the inode numbers are always 64 bits, even if we are running on a 32 bits system. So adapt the lru cache to allow multiple values under the same key, by having the maple tree store a head entry that points to a list of entries instead of pointing to a single entry. This is a similar approach to what we currently do for the name cache in send (which uses a radix tree that has indexes with an unsigned long type as well), and will allow later to use the lru cache for the send name cache as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. 
The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
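The idea of storing a list of entries under each index, so that 64-bit keys survive a 32-bit index type, can be sketched as follows. This is an illustration only: a small fixed array stands in for the maple tree, a `uint32_t` cast simulates a 32-bit `unsigned long`, and all names are hypothetical rather than taken from the btrfs lru cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16u /* hypothetical: stands in for the maple tree */

struct entry {
	uint64_t key;       /* full 64-bit key, e.g. an inode number */
	struct entry *next; /* next entry stored under the same index */
};

struct cache {
	struct entry *heads[NSLOTS]; /* each slot points to a list of entries */
};

/* Simulate a 32-bit index type: the upper half of the key is lost. */
static uint32_t index_of(uint64_t key)
{
	return (uint32_t)key;
}

static void cache_store(struct cache *c, struct entry *e)
{
	assert(c != NULL && e != NULL);
	struct entry **head = &c->heads[index_of(e->key) % NSLOTS];
	e->next = *head;
	*head = e;
}

/* Walk the list under the index and compare the full 64-bit key. */
static struct entry *cache_lookup(const struct cache *c, uint64_t key)
{
	assert(c != NULL);
	for (struct entry *e = c->heads[index_of(key) % NSLOTS]; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}
```

Keys 0x100000005 and 5 collide on the truncated index, yet both remain retrievable because the lookup compares the full key stored in each list entry.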
2023-02-13 | btrfs: send: genericize the backref cache to allow it to be reused | Filipe Manana
The backref cache is a cache backed by a maple tree and a linked list to keep track of temporal access to cached entries (the LRU entry always at the head of the list). This type of caching method is going to be useful in other scenarios, so make the cache implementation more generic and move it into its own header and source files. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 | btrfs: send: initialize all the red black trees earlier | Filipe Manana
After we allocate the send context object and before we initialize all the red black trees, we can jump to the 'out' label if some errors happen, and then under the 'out' label we use RB_EMPTY_ROOT() against some of the those trees, which we have not yet initialized. This happens to work out ok because the send context object was initialized to zeroes with kzalloc and the RB_ROOT initializer just happens to have the following definition: #define RB_ROOT (struct rb_root) { NULL, } But it's really neither clean nor a good practice as RB_ROOT is supposed to be opaque and in case it changes or we change those red black trees to some other data structure, it leaves us in a precarious situation. So initialize all the red black trees immediately after allocating the send context and before any jump into the 'out' label. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add 
an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
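The ordering issue described above can be sketched in userspace C. The `RB_ROOT` definition is the one quoted in the message; everything else is a stand-in: `calloc()` replaces `kzalloc`, `RB_EMPTY_ROOT()` is simplified, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct rb_node;
struct rb_root {
	struct rb_node *rb_node;
};

#define RB_ROOT (struct rb_root) { NULL, }
/* Simplified stand-in for the kernel macro. */
#define RB_EMPTY_ROOT(root) ((root)->rb_node == NULL)

/* Hypothetical, reduced version of a send context. */
struct send_ctx {
	struct rb_root pending_dir_moves;
	struct rb_root waiting_dir_moves;
	struct rb_root orphan_dirs;
};

static struct send_ctx *alloc_send_ctx(void)
{
	struct send_ctx *sctx = calloc(1, sizeof(*sctx));

	if (!sctx)
		return NULL;
	/*
	 * Initialize every tree right after allocation, before any error
	 * path can run, instead of relying on zeroed memory happening to
	 * match RB_ROOT.
	 */
	sctx->pending_dir_moves = RB_ROOT;
	sctx->waiting_dir_moves = RB_ROOT;
	sctx->orphan_dirs = RB_ROOT;
	return sctx;
}

/* Mirrors the role of the 'out' label: inspect the trees, then free. */
static void free_send_ctx(struct send_ctx *sctx)
{
	if (!sctx)
		return;
	assert(RB_EMPTY_ROOT(&sctx->pending_dir_moves));
	assert(RB_EMPTY_ROOT(&sctx->waiting_dir_moves));
	assert(RB_EMPTY_ROOT(&sctx->orphan_dirs));
	free(sctx);
}
```

With the explicit assignments in place, the cleanup path stays correct even if the tree type or its initializer changes later.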
2023-02-13 | btrfs: send: iterate waiting dir move rbtree only once when processing refs | Filipe Manana
When processing the new references for an inode, we unnecessarily iterate twice the waiting dir moves rbtree, once with is_waiting_for_move() and if we found an entry in the rbtree, we iterate it again with a call to get_waiting_dir_move(). This is pointless, we can make this simpler and more efficient by calling only get_waiting_dir_move(), so just do that. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 | btrfs: send: reduce searches on parent root when checking if dir can be removed | Filipe Manana
During an incremental send, every time we remove a reference (dentry) for an inode and the parent directory does not exist anymore in the send root, we go check if we can remove the directory by making a call to can_rmdir(). This helper can only return true (value 1) if all dentries were already removed, and for that it always does a search on the parent root for dir index keys - if it finds any dentry referring to an inode with a number higher than the inode currently being processed, then the directory can not be removed and it must return false (value 0). However that means if a directory that was deleted had 1000 dentries, and each one pointed to an inode with a number higher than the number of the directory's inode, we end up doing 1000 searches on the parent root. Typically files are created in a directory after the directory was created and therefore they get a higher inode number than the directory. It's also common to have each dentry pointing to an inode with a higher number than the inodes the previous dentries point to, for example when creating a series of files inside a directory, a very common pattern. So improve on that by having the first call to can_rmdir() for a directory check the number of the inode that the last dentry points to and cache that inode number in the orphan dir structure. Then every subsequent call to can_rmdir() can avoid doing a search on the parent root if the number of the inode currently being processed is smaller than the cached inode number at the directory's orphan dir structure. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. 
The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
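The caching idea described in this changelog can be illustrated with a small userspace sketch. It is not the kernel code: the struct, its field names and the flat array standing in for the dir index keys of the parent root are all hypothetical, and a counter stands in for the cost of a b+tree search.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the orphan dir structure. */
struct orphan_dir_sketch {
	bool cached;              /* has the last dentry's inode been cached? */
	uint64_t last_dentry_ino; /* inode number the last dentry points to */
	unsigned int searches;    /* number of simulated parent root searches */
};

/*
 * dentry_inos[] models the inode numbers that the directory's dentries
 * point to, in dir index order. Returns true if the directory can be
 * removed while processing inode cur_ino.
 */
static bool can_rmdir_sketch(struct orphan_dir_sketch *odi,
			     const uint64_t *dentry_inos, size_t n,
			     uint64_t cur_ino)
{
	size_t i;

	/* Fast path: the cached number proves a later dentry still exists. */
	if (odi->cached && cur_ino < odi->last_dentry_ino)
		return false;

	/* Slow path: simulate the search on the parent root. */
	odi->searches++;
	if (!odi->cached && n > 0) {
		odi->last_dentry_ino = dentry_inos[n - 1];
		odi->cached = true;
	}
	for (i = 0; i < n; i++) {
		if (dentry_inos[i] > cur_ino)
			return false; /* dentry for a not yet processed inode */
	}
	return true;
}
```

With three dentries pointing to inodes 258, 259 and 260, processing inode 259 is answered from the cached value alone, so only the first and the last call pay for a search.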
2023-02-13btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()Filipe Manana
At can_rmdir() we start by searching the orphan dirs rbtree for an orphan dir object for the target directory. Later when iterating over the dir index keys, if we find that any dir entry points to inode for which there is a pending dir move or the inode was not yet processed, we exit because we can't remove the directory yet. However we end up always calling add_orphan_dir_info(), which will iterate again the rbtree and if there is already an orphan dir object (created by the first call to can_rmdir()), it returns the existing object. This is unnecessary work because in case there is already an existing orphan dir object, we got a reference to it at the start of can_rmdir(). So skip the call to add_orphan_dir_info() if we already have a reference for an orphan dir object. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the 
lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: avoid duplicated orphan dir allocation and initializationFilipe Manana
At can_rmdir() we are allocating and initializing an orphan dir object twice. This can be deduplicated outside of the loop that iterates over the dir index keys. So deduplicate that code, especially because another patch in the series will need to add more initialization code and another one will add one more condition. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: remove send_progress argument from can_rmdir()Filipe Manana
All callers of can_rmdir() pass sctx->cur_ino as the value for the send_progress argument, so remove the argument and directly use sctx->cur_ino. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: avoid extra b+tree searches when checking reference overridesFilipe Manana
During an incremental send, when processing the new references of an inode (either it's a new inode or an existing one renamed/moved), we will search the b+tree of the send or parent roots in order to find out the inode item of the parent directory and extract its generation. However we are doing that search twice, once with is_inode_existent() -> get_cur_inode_state() and then again at did_overwrite_ref() or will_overwrite_ref(). So avoid that and get the generation at get_cur_inode_state() and then propagate it up to did_overwrite_ref() and will_overwrite_ref(). This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible 
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: directly return from will_overwrite_ref() and simplify itFilipe Manana
There are no resources to release before will_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: avoid unnecessary generation search at did_overwrite_ref()Filipe Manana
At did_overwrite_ref() we always call get_inode_gen() to find out the generation of the inode 'ow_inode'. However we don't always need to use that generation, and in fact it's very common to not use it, so we end up doing a b+tree search on the send root, allocating a path, etc, for nothing. So improve on this by getting the generation only if we need to use it. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: send: directly return from did_overwrite_ref() and simplify itFilipe Manana
There are no resources to release before did_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: sysfs: update fs features directory asynchronouslyQu Wenruo
[BUG] Since the introduction of per-fs feature sysfs interface (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never updated. Thus for the following case, that directory will not show the new features like RAID56: # mkfs.btrfs -f $dev1 $dev2 $dev3 # mount $dev1 $mnt # btrfs balance start -f -mconvert=raid5 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes skinny_metadata While after unmount and mount, we got the correct features: # umount $mnt # mount $dev1 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes raid56 skinny_metadata [CAUSE] Because we never really try to update the content of per-fs features/ directory. We had an attempt to update the features directory dynamically in commit 14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs files"), but unfortunately it got reverted in commit e410e34fad91 ("Revert "btrfs: synchronize incompat feature bits with sysfs files""). The problem in the original patch is, in the context of btrfs_create_chunk(), we can not afford to update the sysfs group. The exported but never utilized function, btrfs_sysfs_feature_update(), is a leftover of that attempt. Even if we go through sysfs_update_group(), new files will need extra memory allocation, and we have no way to specify the sysfs update to go GFP_NOFS. [FIX] This patch will address the old problem by doing asynchronous sysfs update in the cleaner thread. This involves the following changes: - Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set BTRFS_FS_FEATURE_CHANGED flag when needed - Update btrfs_sysfs_feature_update() to use sysfs_update_group() And drop unnecessary arguments. - Call btrfs_sysfs_feature_update() in cleaner_kthread If we have the BTRFS_FS_FEATURE_CHANGED flag set. 
- Wake up cleaner_kthread in btrfs_commit_transaction if we have BTRFS_FS_FEATURE_CHANGED flag By this, all the previously dangerous call sites like btrfs_create_chunk() need no new changes, as above helpers would have already set the BTRFS_FS_FEATURE_CHANGED flag. The real work happens at cleaner_kthread, thus we pay the cost of delaying the update to sysfs directory, but the delayed time should be small enough that the end user can not distinguish it, though it might get delayed if the cleaner thread is busy with removing subvolumes or defrag. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
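The "set a flag in the hot path, do the sysfs work later in the cleaner thread" pattern from the [FIX] section can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, a C11 atomic flag stands in for the BTRFS_FS_FEATURE_CHANGED bit, and a plain copy of the feature bits stands in for the sysfs group update.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-fs state involved. */
struct fs_sketch {
	uint64_t incompat;           /* in-memory incompat feature bits */
	uint64_t exported;           /* what the simulated features/ dir shows */
	atomic_bool feature_changed; /* stands in for BTRFS_FS_FEATURE_CHANGED */
};

/*
 * Called from contexts that must not allocate or touch sysfs: it only
 * flips bits and records that an update is pending.
 */
static void set_incompat_sketch(struct fs_sketch *fs, uint64_t bit)
{
	if (fs->incompat & bit)
		return; /* already set, nothing to publish */
	fs->incompat |= bit;
	atomic_store(&fs->feature_changed, true);
}

/*
 * One iteration of the simulated cleaner thread: consume the flag and,
 * if it was set, refresh the exported view. Returns true if it updated.
 */
static bool cleaner_iteration_sketch(struct fs_sketch *fs)
{
	if (!atomic_exchange(&fs->feature_changed, false))
		return false;
	fs->exported = fs->incompat;
	return true;
}
```

The exported view lags behind until the next cleaner iteration, which is the delay the changelog accepts as the cost of keeping the update out of contexts such as btrfs_create_chunk().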
2023-02-13btrfs: remove duplicate include header in extent-tree.cye xingchen
extent-tree.h is included more than once, added in a0231804affe ("btrfs: move extent-tree helpers into their own header file"). Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: scrub: improve tree block error reportingQu Wenruo
[BUG] When debugging a scrub related metadata error, it turns out that our metadata error reporting is not ideal. The only 3 error messages are: - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1 Showing we have metadata generation mismatch errors. - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1 Showing which tree blocks are corrupted. - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5 Showing which physical range the corrupted metadata is at. We have to combine the above 3 to know we have a corrupted metadata with generation mismatch. And this is already the better case, if we have other problems, like fsid mismatch, we can not even know the cause. [CAUSE] The problem is caused by the fact that scrub_checksum_tree_block() never outputs any error message. It just returns two bits for scrub: sblock->header_error, and sblock->generation_error. And later we report the error in scrub_print_warning(), but unfortunately we only have two bits, so there is not much we can do to print any detailed errors. [FIX] This patch will do the following to enhance the error reporting of metadata scrub: - Add extra warning (ratelimited) for every error we hit This can help us to distinguish the different types of errors. Some errors can help us to know what's going wrong immediately, like bytenr mismatch. - Re-order the checks Currently we check bytenr first, then immediately generation. This can lead to false generation mismatch reports when the real problem is an fsid mismatch. 
Here is the new output for the bug I'm debugging (we forgot to writeback tree blocks for commit roots): BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc Now we can immediately know that some tree blocks didn't even get written back, rather than seeing the original confusing generation mismatch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: don't use size classes for zoned file systemsBoris Burkov
When a file system has ZNS devices which are constrained by a maximum number of active block groups, then not being able to use all the block groups for every allocation is not ideal, and could cause us to loop a ton with mixed size allocations. In general, since zoned doesn't write into gaps behind where block groups are writing, it is not susceptible to the same sort of fragmentation that size classes are designed to solve, so we can skip size classes for zoned file systems in general, even though there would probably be no harm for SMR devices. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: load block group size class when cachingBoris Burkov
Since the size class is an artifact of an arbitrary anti fragmentation strategy, it doesn't really make sense to persist it. Furthermore, most of the size class logic assumes fresh block groups. That is of course not a reasonable assumption -- we will be upgrading kernels with existing filesystems whose block groups are not classified. To work around those issues, implement logic to compute the size class of the block groups as we cache them in. To perfectly assess the state of a block group, we would have to read the entire extent tree (since the free space cache mashes together contiguous extent items) which would be prohibitively expensive for larger file systems with more extents. We can do it relatively cheaply by implementing a simple heuristic of sampling a handful of extents and picking the smallest one we see. In the happy case where the block group was classified, we will only see extents of the correct size. In the unhappy case, we will hopefully find one of the smaller extents, but there is no perfect answer anyway. Autorelocation will eventually churn up the block group if there is significant freeing anyway. There was no regression in mount performance at end state of the fsperf test suite, and the delay until the block group is marked cached is minimized by the constant number of extent samples. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: introduce size class to block group allocatorBoris Burkov
The aim of this patch is to reduce the fragmentation of block groups under certain unhappy workloads. It is particularly effective when the size of extents correlates with their lifetime, which is something we have observed causing fragmentation in the fleet at Meta. This patch categorizes extents into size classes: - x < 128KiB: "small" - 128KiB < x < 8MiB: "medium" - x > 8MiB: "large" and as much as possible reduces allocations of extents into block groups that don't match the size class. This takes advantage of any (possible) correlation between size and lifetime and also leaves behind predictable re-usable gaps when extents are freed; small writes don't gum up bigger holes. Size classes are implemented in the following way: - Mark each new block group with a size class of the first allocation that goes into it. - Add two new passes to ffe: "unset size class" and "wrong size class". First, try only matching block groups, then try unset ones, then allow allocation of new ones, and finally allow mismatched block groups. - Filtering is done just by skipping inappropriate ones, there is no special size class indexing. Other solutions I considered were: - A best fit allocator with an rb-tree. This worked well, as small writes didn't leak big holes from large freed extents, but led to regressions in ffe and write performance due to lock contention on the rb-tree with every allocation possibly updating it in parallel. Perhaps something clever could be done to do the updates in the background while being "right enough". - A fixed size "working set". This prevents freeing an extent drastically changing where writes currently land, and seems like a good option too. Doesn't take advantage of size in any way. - The same size class idea, but implemented with xarray marks. This turned out to be slower than looping the linked list and skipping wrong block groups, and is also less flexible since we must have only 3 size classes (max #marks). 
With the current approach we can have as many as we like. Performance testing was done via: https://github.com/josefbacik/fsperf Of particular relevance are the new fragmentation specific tests. A brief summary of the testing results: - Neutral results on existing tests. There are some minor regressions and improvements here and there, but nothing that truly stands out as notable. - Improvement on new tests where size class and extent lifetime are correlated. Fragmentation in these cases is completely eliminated and write performance is generally a little better. There is also significant improvement where extent sizes are just a bit larger than the size class boundaries. - Regression on one new test: where the allocations are sized intentionally a hair under the borders of the size classes. Results are neutral on the test that intentionally attacks this new scheme by mixing extent size and lifetime. The full dump of the performance results can be found here: https://bur.io/fsperf/size-class-2022-11-15.txt (there are ANSI escape codes, so best to curl and view in terminal) Here is a snippet from the full results for a new test which mixes buffered writes appending to a long lived set of files and large short lived fallocates:
bufferedappendvsfallocate results
metric                        baseline       current        stdev          diff
======================================================================================
avg_commit_ms                 31.13          29.20          2.67           -6.22%
bg_count                      14             15.60          0              11.43%
commits                       11.10          12.20          0.32           9.91%
elapsed                       27.30          26.40          2.98           -3.30%
end_state_mount_ns            11122551.90    10635118.90    851143.04      -4.38%
end_state_umount_ns           1.36e+09       1.35e+09       12248056.65    -1.07%
find_free_extent_calls        116244.30      114354.30      964.56         -1.63%
find_free_extent_ns_max       599507.20      1047168.20     103337.08      74.67%
find_free_extent_ns_mean      3607.19        3672.11        101.20         1.80%
find_free_extent_ns_min       500            512            6.67           2.40%
find_free_extent_ns_p50       2848           2876           37.65          0.98%
find_free_extent_ns_p95       4916           5000           75.45          1.71%
find_free_extent_ns_p99       20734.49       20920.48       1670.93        0.90%
frag_pct_max                  61.67          0              8.05           -100.00%
frag_pct_mean                 43.59          0              6.10           -100.00%
frag_pct_min                  25.91          0              16.60          -100.00%
frag_pct_p50                  42.53          0              7.25           -100.00%
frag_pct_p95                  61.67          0              8.05           -100.00%
frag_pct_p99                  61.67          0              8.05           -100.00%
fragmented_bg_count           6.10           0              1.45           -100.00%
max_commit_ms                 49.80          46             5.37           -7.63%
sys_cpu                       2.59           2.62           0.29           1.39%
write_bw_bytes                1.62e+08       1.68e+08       17975843.50    3.23%
write_clat_ns_mean            57426.39       54475.95       2292.72        -5.14%
write_clat_ns_p50             46950.40       42905.60       2101.35        -8.62%
write_clat_ns_p99             148070.40      143769.60      2115.17        -2.90%
write_io_kbytes               4194304        4194304        0              0.00%
write_iops                    2476.15        2556.10        274.29         3.23%
write_lat_ns_max              2101667.60     2251129.50     370556.59      7.11%
write_lat_ns_mean             59374.91       55682.00       2523.09        -6.22%
write_lat_ns_min              17353.10       16250          1646.08        -6.36%
There are some mixed improvements/regressions in most metrics along with an elimination of fragmentation in this workload. On the balance, the drastic 1->0 improvement in the happy cases seems worth the mix of regressions and improvements we do observe. Some considerations for future work: - Experimenting with more size classes - More hinting/search ordering work to approximate a best-fit allocator Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
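The size class scheme from this changelog can be sketched as two small helpers: one that maps an extent size to a class and one that decides whether a block group may be used in a given search pass. All names are hypothetical. The changelog leaves the treatment of sizes exactly at 128KiB and 8MiB open, so the inclusive upper bounds below are an assumption, and the "allow allocation of new block groups" step between the unset and mismatched passes is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

enum size_class_sketch { SC_NONE, SC_SMALL, SC_MEDIUM, SC_LARGE };

/* Search passes, in the order they are tried. */
enum pass_sketch { PASS_MATCHING, PASS_UNSET, PASS_MISMATCHED };

#define KIB_SKETCH(x) ((uint64_t)(x) << 10)
#define MIB_SKETCH(x) ((uint64_t)(x) << 20)

/* Assumption: boundary values fall into the smaller class. */
static enum size_class_sketch classify_sketch(uint64_t bytes)
{
	if (bytes <= KIB_SKETCH(128))
		return SC_SMALL;
	if (bytes <= MIB_SKETCH(8))
		return SC_MEDIUM;
	return SC_LARGE;
}

/*
 * Filtering is done by skipping block groups: a matching class is always
 * fine, an unset class becomes acceptable from the second pass on, and a
 * mismatched class only in the final pass.
 */
static bool bg_usable_sketch(enum pass_sketch pass,
			     enum size_class_sketch bg_class,
			     enum size_class_sketch wanted)
{
	if (bg_class == wanted)
		return true;
	if (bg_class == SC_NONE)
		return pass >= PASS_UNSET;
	return pass >= PASS_MISMATCHED;
}
```

Because the filter is a pure predicate over the existing block group list, adding more classes only means extending the enum and the classifier, which matches the flexibility argument made against the xarray marks variant.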
2023-02-13btrfs: add more find_free_extent tracepointsBoris Burkov
find_free_extent is a complicated function. It consists (at least) of: - a hint that jumps into the middle of a for loop macro - a middle loop trying every raid level - an outer loop ascending through ffe loop levels - complicated logic for skipping some of those ffe loop levels - multiple underlying in-bg allocators (zoned, cluster, no cluster) Which is all to say that more tracing is helpful for debugging its behavior. Add two new tracepoints: at the entrance to the block_groups loop (hit for every raid level and every ffe_ctl loop) and at the point we seriously consider a block_group for allocation. This way we can see the whole path through the algorithm, including hints, multiple loops, etc. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: pass find_free_extent_ctl to allocator tracepointsBoris Burkov
The allocator tracepoints currently have a pile of values from ffe_ctl. In modifying the allocator and adding more tracepoints, I found myself adding to the already long argument list of the tracepoints. It makes it a lot simpler to just send in the ffe_ctl itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: remove the wait argument to btrfs_start_ordered_extentChristoph Hellwig
Given that wait is always set to 1, remove the argument. Last use of wait with 0 was in 0c304304feab ("Btrfs: remove csum_bytes_left"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: use a single variable to track return value for log_dir_items()Filipe Manana
We currently use 'ret' and 'err' to track the return value for log_dir_items(), which is confusing and likely the cause for previous bugs where log_dir_items() did not return an error when it should, fixed in previous patches. So change this and use only a single variable, 'ret', to track the return value. This is simpler and makes it similar to most of the existing code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: use a negative value for BTRFS_LOG_FORCE_COMMITFilipe Manana
Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value has a few inconveniences: 1) If it's ever used by btrfs_log_inode(), or any function down the call chain, we have to remember to btrfs_set_log_full_commit(), which is repetitive and has a chance to be forgotten in future use cases. btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when it gets a negative value from btrfs_log_inode(); 2) Down the call chain of btrfs_log_inode(), we may have functions that need to force a log commit, but can return either an error (negative value), false (0) or true (1). So they are forced to return some random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT would make the intention more clear. Currently the only example is flush_dir_items_batch(). So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes it easier to debug. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
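A short userspace sketch of why -(MAX_ERRNO + 1) is a convenient sentinel: it is negative, so a caller that falls back to a full commit on any negative return needs no special case, yet it lies outside the errno range. The macro and function names are hypothetical; the value 4095 mirrors the kernel's MAX_ERRNO.

```c
#include <stdbool.h>

#define MAX_ERRNO_SKETCH 4095
#define LOG_FORCE_COMMIT_SKETCH (-(MAX_ERRNO_SKETCH + 1))

/* True for values in the range reserved for real errno codes. */
static bool is_errno_sketch(int ret)
{
	return ret < 0 && ret >= -MAX_ERRNO_SKETCH;
}

/*
 * Models a caller that marks the log for a full commit whenever the
 * logging helper returns any negative value.
 */
static bool caller_forces_full_commit_sketch(int ret)
{
	return ret < 0;
}
```

With the old value of 1, the caller above would not have forced a full commit unless every call site remembered to do so explicitly.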
2023-02-13btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macroYushan Zhou
The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED, PAGE_ALIGN_DOWN macros. Use these macros to make code more concise. Signed-off-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
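For readers outside the kernel tree, the three macros behave roughly like the userspace approximations below. These are not the kernel definitions: the _SKETCH names are hypothetical and a fixed 4096 byte page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define PAGE_SIZE_SKETCH 4096ULL

/* Round up to the next page boundary. */
#define PAGE_ALIGN_SKETCH(x) \
	(((uint64_t)(x) + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1))

/* Round down to the previous page boundary. */
#define PAGE_ALIGN_DOWN_SKETCH(x) \
	((uint64_t)(x) & ~(PAGE_SIZE_SKETCH - 1))

/* Test whether a value is already page aligned. */
#define PAGE_ALIGNED_SKETCH(x) \
	(((uint64_t)(x) & (PAGE_SIZE_SKETCH - 1)) == 0)
```

Open-coded expressions of this shape are what the patch replaces with the named macros.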
2023-02-13btrfs: go to matching label when cleaning em in btrfs_submit_directPeng Hao
When btrfs_get_chunk_map fails to allocate a new em the cleanup does not need to be done so the goto target is out_err, which is consistent with current coding style. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>