summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-02-08kbuild: Add resolve_btfids clean to root clean targetJiri Olsa
The resolve_btfids tool is used during the kernel build, so we should clean it on kernel's make clean. Invoking the the resolve_btfids clean as part of root 'make clean'. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210205124020.683286-5-jolsa@kernel.org
2021-02-08tools/resolve_btfids: Set srctree variable unconditionallyJiri Olsa
We want this clean to be called from tree's root Makefile, which defines same srctree variable and that will screw the make setup. We actually do not use srctree being passed from outside, so we can solve this by setting current srctree value directly. Also changing the way how srctree is initialized as suggested by Andrri. Also root Makefile does not define the implicit RM variable, so adding RM initialization. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210205124020.683286-4-jolsa@kernel.org
2021-02-08tools/resolve_btfids: Check objects before removingJiri Olsa
We want this clean to be called from tree's root clean and that one is silent if there's nothing to clean. Adding check for all object to clean and display CLEAN messages only if there are objects to remove. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210205124020.683286-3-jolsa@kernel.org
2021-02-08tools/resolve_btfids: Build libbpf and libsubcmd in separate directoriesJiri Olsa
Setting up separate build directories for libbpf and libpsubcmd, so it's separated from other objects and we don't get them mixed in the future. It also simplifies cleaning, which is now simple rm -rf. Also there's no need for FEATURE-DUMP.libbpf and bpf_helper_defs.h files in .gitignore anymore. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210205124020.683286-2-jolsa@kernel.org
2021-02-08scsi: scsi_debug: Fix a memory leakMaurizio Lombardi
The sdebug_q_arr pointer must be freed when the module is unloaded. $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888e1cfb0000 (size 4096): comm "modprobe", pid 165555, jiffies 4325987516 (age 685.194s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000458f4f5d>] 0xffffffffc06702d9 [<000000003edc4b1f>] do_one_initcall+0xe9/0x57d [<00000000da7d518c>] do_init_module+0x1d1/0x6f0 [<000000009a6a9248>] load_module+0x36bd/0x4f50 [<00000000ddb0c3ce>] __do_sys_init_module+0x1db/0x260 [<000000009532db57>] do_syscall_64+0xa5/0x420 [<000000002916b13d>] entry_SYSCALL_64_after_hwframe+0x6a/0xdf Fixes: 87c715dcde63 ("scsi: scsi_debug: Add per_host_store option") Link: https://lore.kernel.org/r/20210208111734.34034-1-mlombard@redhat.com Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-02-08clk: xilinx: move xlnx_vcu clock driver from socMichael Tretter
The xlnx_vcu driver is actually a clock controller driver which provides clocks that can be used by a driver for the encoder/decoder units. There is no reason to keep this driver in soc. Move the driver to clk. NOTE: The register mapping actually contains registers for AXI performance monitoring, but these are not used by the driver. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-16-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: fix alignment to open parenthesisMichael Tretter
Fixes the following checkpatch check: CHECK: Alignment should match open parenthesis #610: FILE: drivers/soc/xilinx/xlnx_vcu.c:610: + xvcu->vcu_slcr_ba = devm_ioremap(&pdev->dev, res->start, + resource_size(res)); Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-15-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: fix repeated word the in commentMichael Tretter
Fixes the following checkpatch warning: WARNING: Possible repeated word: 'the' #703: FILE: drivers/soc/xilinx/xlnx_vcu.c:703: + /* Add the the Gasket isolation and put the VCU in reset. */ Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-14-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: use bitfields for register definitionMichael Tretter
This makes the register accesses more readable and is closer to what is usually used in the kernel. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Reviewed-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-13-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: remove calculation of PLL configurationMichael Tretter
As the consumers are now responsible for setting the clock rate via clock framework, the clock rate is now calculated using round_rate and the driver does not need to calculate the clock rate beforehand. Remove the code that calculates the PLL configuration. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-12-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: make the PLL configurableMichael Tretter
Do not configure the PLL when probing the driver, but register the clock in the clock framework and do the configuration based on the respective callbacks. This is necessary to allow the consumers, i.e., encoder and decoder drivers, of the xlnx_vcu clock provider to set the clock rate and actually enable the clocks without relying on some pre-configuration. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-11-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: make pll post divider explicitMichael Tretter
According to the downstream driver documentation due to timing constraints the output divider of the PLL has to be set to 1/2. Add a helper function for that check instead of burying the code in one large setup function. The bit is undocumented and marked as reserved in the register reference. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-10-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: implement clock provider for output clocksMichael Tretter
The VCU System-Level Control uses an internal PLL to drive the core and MCU clock for the allegro encoder and decoder based on an external PL clock. In order be able to ensure that the clocks are enabled and to get their rate from other drivers, the module must implement a clock provider and register the clocks at the common clock framework. Other drivers are then able to access the clock via devicetree bindings. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-9-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: register PLL as fixed rate clockMichael Tretter
Currently, xvcu_pll_set_rate configures the PLL to a clock rate that is pre-calculated when probing the driver. To still make the clock framework aware of the PLL and to allow to configure other clocks based on the PLL rate, register the PLL as a fixed rate clock. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-8-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: implement PLL disableMichael Tretter
The disabling of the PLL is not fully implemented, because according to the ZynqMP register reference the RESET, POR_IN and PWR_POR bits have to be set to bring the PLL into reset. Set the bits to disable the PLL. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-7-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: add helpers for configuring PLLMichael Tretter
The xvcu_set_vcu_pll_info function sets the rate of the PLL and enables it, which makes it difficult to cleanly convert the driver to the common clock framework. Split the function and add separate functions for setting the rate, enabling the clock and disabling the clock. Also move the enable of the reference clock from probe to the helper that enables the PLL. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-6-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: add helper to wait for PLL lockedMichael Tretter
Extract a helper function to wait until the PLL is locked. Also, disabling the bypass was buried in the exit path on the wait loop. Separate the different steps and add a helper function to make the code more readable. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-5-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08soc: xilinx: vcu: drop coreclk from struct xlnx_vcuMichael Tretter
The coreclk field is newer read after being written to xlnx_vcu. Remove the coreclk field from the xlnx_vcu and use a function local variable instead. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-4-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08clk: divider: fix initialization with parent_hwMichael Tretter
If a driver registers a divider clock with a parent_hw instead of the parent_name, the parent_hw is ignored and the clock does not have a parent. Fix this by initializing the parents the same way they are initialized for clock gates. Fixes: ff258817137a ("clk: divider: Add support for specifying parents via DT/pointers") Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Reviewed-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-3-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08ARM: dts: vcu: define indexes for output clocksMichael Tretter
The VCU System-Level Control has 4 output clocks. Define indexes for these clocks to allow to reference them in the device tree. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Michal Simek <michal.simek@xilinx.com> Link: https://lore.kernel.org/r/20210121071659.1226489-2-m.tretter@pengutronix.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08clk: axi-clkgen: use devm_platform_ioremap_resource() short-handAlexandru Ardelean
No major functional change. Noticed while checking the driver code that this could be used. Saves two lines. Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Link: https://lore.kernel.org/r/20210201151245.21845-5-alexandru.ardelean@analog.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08dt-bindings: clock: adi,axi-clkgen: add compatible string for ZynqMP supportAlexandru Ardelean
The axi-clkgen driver now supports ZynqMP (UltraScale) as well, however the driver needs to use different PFD & VCO limits. For ZynqMP, these needs to be selected by using the 'adi,zynqmp-axi-clkgen-2.00.a' string. Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Link: https://lore.kernel.org/r/20210201151245.21845-4-alexandru.ardelean@analog.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08clk: clk-axiclkgen: add ZynqMP PFD and VCO limitsAlexandru Ardelean
For ZynqMP (Ultrascale) the PFD and VCO limits are different. In order to support these, this change adds a compatible string (i.e. 'adi,zynqmp-axi-clkgen-2.00.a') which will take into account for these limits and apply them. Signed-off-by: Dragos Bogdan <dragos.bogdan@analog.com> Signed-off-by: Mathias Tausen <mta@gomspace.com> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Link: https://lore.kernel.org/r/20210201151245.21845-3-alexandru.ardelean@analog.com Acked-by: Moritz Fischer <mdf@kernel.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08clk: axi-clkgen: replace ARCH dependencies with driver depsAlexandru Ardelean
The intent is to be able to run this driver to access the IP core in setups where FPGA board is also connected via a PCIe bus. In such cases the number of combinations explodes, where the host system can be an x86 with Xilinx Zynq/ZynqMP/Microblaze board connected via PCIe. Or even a ZynqMP board with a ZynqMP/Zynq/Microblaze connected via PCIe. To accommodate for these cases, this change removes the limitation for this driver to be compilable only on Zynq/Microblaze architectures. And adds dependencies on the mechanisms required by the driver to work (OF and HAS_IOMEM). Signed-off-by: Dragos Bogdan <dragos.bogdan@analog.com> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Link: https://lore.kernel.org/r/20210201151245.21845-2-alexandru.ardelean@analog.com Acked-by: Michal Simek <michal.simek@xilinx.com> Reviewed-by: Moritz Fischer <mdf@kernel.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-02-08bpf: Simplify bool comparisonJiapeng Chong
Fix the following coccicheck warning: ./tools/bpf/bpf_dbg.c:893:32-36: WARNING: Comparison to bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1612777416-34339-1-git-send-email-jiapeng.chong@linux.alibaba.com
2021-02-08selftests/bpf: Add missing cleanup in atomic_bounds testBrendan Jackman
Add missing skeleton destroy call. Fixes: 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH") Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210208123737.963172-1-jackmanb@google.com
2021-02-08selftests/bpf: Remove unneeded semicolonYang Li
Eliminate the following coccicheck warning: ./tools/testing/selftests/bpf/test_flow_dissector.c:506:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1612780213-84583-1-git-send-email-yang.lee@linux.alibaba.com
2021-02-09btrfs: zoned: enable to mount ZONED incompat flagNaohiro Aota
This final patch adds the ZONED incompat flag to the supported flags and enables to mount ZONED flagged file system. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: deal with holes writing out tree-log pagesNaohiro Aota
Since the zoned filesystem requires sequential write out of metadata, we cannot proceed with a hole in tree-log pages. When such a hole exists, btree_write_cache_pages() will return -EAGAIN. This happens when someone, e.g., a concurrent transaction commit, writes a dirty extent in this tree-log commit. If we are not going to wait for the extents, we can hope the concurrent writing fills the hole for us. So, we can ignore the error in this case and hope the next write will succeed. If we want to wait for them and got the error, we cannot wait for them because it will cause a deadlock. So, let's bail out to a full commit in this case. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: reorder log node allocation on zoned filesystemNaohiro Aota
This is the 3/3 patch to enable tree-log on zoned filesystems. The allocation order of nodes of "fs_info->log_root_tree" and nodes of "root->log_root" is not the same as the writing order of them. So, the writing causes unaligned write errors. Reorder the allocation of them by delaying allocation of the root node of "fs_info->log_root_tree," so that the node buffers can go out sequentially to devices. Cc: Filipe Manana <fdmanana@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: serialize log transaction on zoned filesystemsNaohiro Aota
This is the 2/3 patch to enable tree-log on zoned filesystems. Since we can start more than one log transactions per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at the time of a log transaction commit. The nodes of the global log root tree (fs_info->log_root_tree), also have the same problem with mixed allocation. Serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when other subvolumes already allocated the global log root tree. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: extend zoned allocator to use dedicated tree-log block groupNaohiro Aota
This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: split alloc_log_tree()Naohiro Aota
This is a preparation patch for the next patch. Split alloc_log_tree() into two parts. The first one allocating the tree structure, remains in alloc_log_tree() and the second part allocating the tree node, which is moved into btrfs_alloc_log_tree_node(). Also export the latter part is to be used in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: relocate block group to repair IO failure in zoned filesystemsNaohiro Aota
When a bad checksum is found and if the filesystem has a mirror of the damaged data, we read the correct data from the mirror and writes it to damaged blocks. This however, violates the sequential write constraints of a zoned block device. We can consider three methods to repair an IO failure in zoned filesystems: (1) Reset and rewrite the damaged zone (2) Allocate new device extent and replace the damaged device extent to the new extent (3) Relocate the corresponding block group Method (1) is most similar to a behavior done with regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessary degrades non-damaged data. Method (2) is much like device replacing but done in the same device. It is safe because it keeps the device extent until the replacing finish. However, extending device replacing is non-trivial. It assumes "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing function should be extended to support replacing device extent position in one device. Method (3) invokes relocation of the damaged block group and is straightforward to implement. It relocates all the mirrored device extents, so it potentially is a more costly operation than method (1) or (2). But it relocates only used extents which reduce the total IO size. Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2). For protecting a block group gets relocated multiple time with multiple IO errors, this commit introduces "relocating_repair" bit to show it's now relocating to repair IO failures. Also it uses a new kthread "btrfs-relocating-repair", not to block IO path with relocating process. This commit also supports repairing in the scrub process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: enable relocation on a zoned filesystemNaohiro Aota
Currently fallocate() is disabled on a zoned filesystem. Since current relocation process relies on preallocation to move file data extents, it must be handled differently. On a zoned filesystem, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocations and submit IOs to the underlying layers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: support dev-replace in zoned filesystemsNaohiro Aota
This is 4/4 patch to implement device-replace on zoned filesystems. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before device-replace process, the extent is not copied, leaving a hole there. Synchronize the write pointers by writing zeroes to the destination device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: implement copying for zoned device-replaceNaohiro Aota
This is 3/4 patch to implement device-replace on zoned filesystems. This commit implements copying. To do this, it tracks the write pointer during the device replace process. As device-replace's copy process is smart enough to only copy used extents on the source device, we have to fill the gap to honor the sequential write requirement in the target device. The device-replace process on zoned filesystems must copy or clone all the extents in the source device exactly once. So, we need to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the removed code in the commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace"). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: implement cloning for zoned device-replaceNaohiro Aota
This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: mark block groups to copy for device-replaceNaohiro Aota
This is the 1/4 patch to support device-replace on zoned filesystems. We have two types of IOs during the device replace process. One is an IO to "copy" (by the scrub functions) all the device extents from the source device to the destination device. The other one is an IO to "clone" (by handle_ops_on_dev_replace()) new incoming write IOs from users to the source device into the target device. Cloning incoming IOs can break the sequential write rule in on target device. When a write is mapped in the middle of a block group, the IO is directed to the middle of a target device zone, which breaks the sequential write requirement. However, the cloning function cannot be disabled since incoming IOs targeting already copied device extents must be cloned so that the IO is executed on the target device. We cannot use dev_replace->cursor_{left,right} to determine whether a bio is going to a not yet copied region. Since we have a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have a newly allocated device extent which is never cloned nor copied. So the point is to copy only already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block groups as a target of copying. Then, handle_ops_on_dev_replace() and dev-replace can check the flag to do their job. Also, btrfs_finish_block_group_to_copy() will check if the copied stripe is the last stripe in the block group. With the last stripe copied, the to_copy flag is finally disabled. Afterwards we can safely clone incoming IOs on this block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: do not use async metadata checksum on zoned filesystemsNaohiro Aota
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize the metadata write IOs. Even with this serialization, write bios sent from btree_write_cache_pages can be reordered by async checksum workers as these workers are per CPU and not per zone. To preserve write bio ordering, we disable async metadata checksum on a zoned filesystem. This does not result in lower performance with HDDs as a single CPU core is fast enough to do checksum for a single zone write stream with the maximum possible bandwidth of the device. If multiple zones are being written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, resulting again in a per zone checksum serialization not affecting the performance. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: wait for existing extents before truncatingNaohiro Aota
When truncating a file, file buffers which have already been allocated but not yet written may be truncated. Truncating these buffers could cause breakage of a sequential write pattern in a block group if the truncated blocks are for example followed by blocks allocated to another file. To avoid this problem, always wait for write out of all unwritten buffers before proceeding with the truncate execution. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: serialize metadata IONaohiro Aota
We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using logical address. Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission because metadata blocks are allocated in an earlier stage to build up B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can break when attempting to write back blocks in an unfinished transaction. If the writing out failed because of a hole and the write out is for data integrity (WB_SYNC_ALL), it returns EAGAIN. A caller like fsync() code should handle this properly e.g. by falling back to a full transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: introduce dedicated data write path for zoned filesystemsNaohiro Aota
If more than one IO is issued for one file extent, these IO can be written to separate regions on a device. Since we cannot map one file extent to such a separate area on a zoned filesystem, we need to follow the "one IO == one ordered extent" rule. The normal buffered, uncompressed and not pre-allocated write path (used by cow_file_range()) sometimes does not follow this rule. It can write a part of an ordered extent when specified a region to write e.g., when its called from fdatasync(). Introduce a dedicated (uncompressed buffered) data write path for zoned filesystems, that will COW the region and write it at once. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: enable zone append writing for direct IONaohiro Aota
Likewise to buffered IO, enable zone append writing for direct IO when its used on a zoned block device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: use ZONE_APPEND write for zoned modeNaohiro Aota
Enable zone append writing for zoned mode. When using zone append, a bio is issued to the start of a target zone and the device decides to place it inside the zone. Upon completion the device reports the actual written position back to the host. Three parts are necessary to enable zone append mode. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point the beginning of the zone. Second, record the returned physical address (and disk/partno) to the ordered extent in end_bio_extent_writepage() after the bio has been completed. We cannot resolve the physical address to the logical address because we can neither take locks nor allocate a buffer in this end_bio context. So, we need to record the physical address to resolve it later in btrfs_finish_ordered_io(). And finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address using btrfs_rmap_block. If the returned address matches the originally allocated address, we can skip this rewriting process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: save irq flags when looking up an ordered extentJohannes Thumshirn
A following patch will add another caller of btrfs_lookup_ordered_extent(), but from a bio's endio context. btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally disables interrupts. Change this to spin_lock_irqsave() so interrupts aren't disabled and re-enabled unconditionally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: cache if block group is on a sequential zoneJohannes Thumshirn
On a zoned filesystem, cache if a block group is on a sequential write only zone. On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing data, therefore provide btrfs_use_zone_append() to figure out if IO is targeting a sequential write only zone and we can use REQ_OP_ZONE_APPEND for data writing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: extend btrfs_rmap_block for specifying a deviceNaohiro Aota
btrfs_rmap_block currently reverse-maps the physical addresses on all devices to the corresponding logical addresses. Extend the function to match to a specified device. The old functionality of querying all devices is left intact by specifying NULL as target device. A block_device instead of a btrfs_device is passed into btrfs_rmap_block, as this function is intended to reverse-map the result of a bio, which only has a block_device. Also export the function for later use. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: check if bio spans across an ordered extentJohannes Thumshirn
To ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Ensure that constructing bio does not span more than an ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09btrfs: zoned: split ordered extent when bio is sentNaohiro Aota
For a zone append write, the device decides the location the data is being written to. Therefore we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Implement splitting of an ordered extent and extent map on bio submission to adhere to the rule. extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the corresponding ordered extent so that the ordered extent's region fits into one bio and the corresponding device limits. Several sanity checks need to be done in extract_ordered_extent() e.g. - We cannot split once end_bio'd ordered extent because we cannot divide ordered->bytes_left for the split ones - We do not expect a compressed ordered extent - We should not have checksum list because we omit the list splitting. Since the function is called before btrfs_wq_submit_bio() or btrfs_csum_one_bio(), this should be always ensured. We also need to split an extent map by creating a new one. If not, unpin_extent_cache() complains about the difference between the start of the extent map and the file's logical offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>