Age | Commit message (Collapse) | Author |
|
The resolve_btfids tool is used during the kernel build,
so we should clean it on kernel's make clean.
Invoking the the resolve_btfids clean as part of root
'make clean'.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210205124020.683286-5-jolsa@kernel.org
|
|
We want this clean to be called from tree's root Makefile,
which defines same srctree variable and that will screw
the make setup.
We actually do not use srctree being passed from outside,
so we can solve this by setting current srctree value
directly.
Also changing the way how srctree is initialized as suggested
by Andrri.
Also root Makefile does not define the implicit RM variable,
so adding RM initialization.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210205124020.683286-4-jolsa@kernel.org
|
|
We want this clean to be called from tree's root clean
and that one is silent if there's nothing to clean.
Adding check for all object to clean and display CLEAN
messages only if there are objects to remove.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210205124020.683286-3-jolsa@kernel.org
|
|
Setting up separate build directories for libbpf and libpsubcmd,
so it's separated from other objects and we don't get them mixed
in the future.
It also simplifies cleaning, which is now simple rm -rf.
Also there's no need for FEATURE-DUMP.libbpf and bpf_helper_defs.h
files in .gitignore anymore.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210205124020.683286-2-jolsa@kernel.org
|
|
The sdebug_q_arr pointer must be freed when the module is unloaded.
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888e1cfb0000 (size 4096):
comm "modprobe", pid 165555, jiffies 4325987516 (age 685.194s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000458f4f5d>] 0xffffffffc06702d9
[<000000003edc4b1f>] do_one_initcall+0xe9/0x57d
[<00000000da7d518c>] do_init_module+0x1d1/0x6f0
[<000000009a6a9248>] load_module+0x36bd/0x4f50
[<00000000ddb0c3ce>] __do_sys_init_module+0x1db/0x260
[<000000009532db57>] do_syscall_64+0xa5/0x420
[<000000002916b13d>] entry_SYSCALL_64_after_hwframe+0x6a/0xdf
Fixes: 87c715dcde63 ("scsi: scsi_debug: Add per_host_store option")
Link: https://lore.kernel.org/r/20210208111734.34034-1-mlombard@redhat.com
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The xlnx_vcu driver is actually a clock controller driver which provides
clocks that can be used by a driver for the encoder/decoder units. There
is no reason to keep this driver in soc. Move the driver to clk.
NOTE: The register mapping actually contains registers for AXI
performance monitoring, but these are not used by the driver.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-16-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Fixes the following checkpatch check:
CHECK: Alignment should match open parenthesis
#610: FILE: drivers/soc/xilinx/xlnx_vcu.c:610:
+ xvcu->vcu_slcr_ba = devm_ioremap(&pdev->dev, res->start,
+ resource_size(res));
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-15-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Fixes the following checkpatch warning:
WARNING: Possible repeated word: 'the'
#703: FILE: drivers/soc/xilinx/xlnx_vcu.c:703:
+ /* Add the the Gasket isolation and put the VCU in reset. */
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-14-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
This makes the register accesses more readable and is closer to what is
usually used in the kernel.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-13-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
As the consumers are now responsible for setting the clock rate via
clock framework, the clock rate is now calculated using round_rate and
the driver does not need to calculate the clock rate beforehand.
Remove the code that calculates the PLL configuration.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-12-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Do not configure the PLL when probing the driver, but register the clock
in the clock framework and do the configuration based on the respective
callbacks.
This is necessary to allow the consumers, i.e., encoder and decoder
drivers, of the xlnx_vcu clock provider to set the clock rate and
actually enable the clocks without relying on some pre-configuration.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-11-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
According to the downstream driver documentation due to timing
constraints the output divider of the PLL has to be set to 1/2. Add a
helper function for that check instead of burying the code in one large
setup function.
The bit is undocumented and marked as reserved in the register
reference.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-10-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The VCU System-Level Control uses an internal PLL to drive the core and
MCU clock for the allegro encoder and decoder based on an external PL
clock.
In order be able to ensure that the clocks are enabled and to get their
rate from other drivers, the module must implement a clock provider and
register the clocks at the common clock framework. Other drivers are
then able to access the clock via devicetree bindings.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-9-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Currently, xvcu_pll_set_rate configures the PLL to a clock rate that is
pre-calculated when probing the driver. To still make the clock
framework aware of the PLL and to allow to configure other clocks based
on the PLL rate, register the PLL as a fixed rate clock.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-8-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The disabling of the PLL is not fully implemented, because according to
the ZynqMP register reference the RESET, POR_IN and PWR_POR bits have to
be set to bring the PLL into reset.
Set the bits to disable the PLL.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-7-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The xvcu_set_vcu_pll_info function sets the rate of the PLL and enables
it, which makes it difficult to cleanly convert the driver to the common
clock framework.
Split the function and add separate functions for setting the rate,
enabling the clock and disabling the clock.
Also move the enable of the reference clock from probe to the helper
that enables the PLL.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-6-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Extract a helper function to wait until the PLL is locked. Also,
disabling the bypass was buried in the exit path on the wait loop.
Separate the different steps and add a helper function to make the code
more readable.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-5-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The coreclk field is newer read after being written to xlnx_vcu. Remove
the coreclk field from the xlnx_vcu and use a function local variable
instead.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-4-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
If a driver registers a divider clock with a parent_hw instead of the
parent_name, the parent_hw is ignored and the clock does not have a
parent.
Fix this by initializing the parents the same way they are initialized
for clock gates.
Fixes: ff258817137a ("clk: divider: Add support for specifying parents via DT/pointers")
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-3-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The VCU System-Level Control has 4 output clocks. Define indexes for
these clocks to allow to reference them in the device tree.
Signed-off-by: Michael Tretter <m.tretter@pengutronix.de>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/20210121071659.1226489-2-m.tretter@pengutronix.de
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
No major functional change. Noticed while checking the driver code that
this could be used.
Saves two lines.
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Link: https://lore.kernel.org/r/20210201151245.21845-5-alexandru.ardelean@analog.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The axi-clkgen driver now supports ZynqMP (UltraScale) as well, however the
driver needs to use different PFD & VCO limits.
For ZynqMP, these needs to be selected by using the
'adi,zynqmp-axi-clkgen-2.00.a' string.
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Link: https://lore.kernel.org/r/20210201151245.21845-4-alexandru.ardelean@analog.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
For ZynqMP (Ultrascale) the PFD and VCO limits are different. In order to
support these, this change adds a compatible string (i.e.
'adi,zynqmp-axi-clkgen-2.00.a') which will take into account for these
limits and apply them.
Signed-off-by: Dragos Bogdan <dragos.bogdan@analog.com>
Signed-off-by: Mathias Tausen <mta@gomspace.com>
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Link: https://lore.kernel.org/r/20210201151245.21845-3-alexandru.ardelean@analog.com
Acked-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
The intent is to be able to run this driver to access the IP core in setups
where FPGA board is also connected via a PCIe bus. In such cases the number
of combinations explodes, where the host system can be an x86 with Xilinx
Zynq/ZynqMP/Microblaze board connected via PCIe.
Or even a ZynqMP board with a ZynqMP/Zynq/Microblaze connected via PCIe.
To accommodate for these cases, this change removes the limitation for this
driver to be compilable only on Zynq/Microblaze architectures.
And adds dependencies on the mechanisms required by the driver to work (OF
and HAS_IOMEM).
Signed-off-by: Dragos Bogdan <dragos.bogdan@analog.com>
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Link: https://lore.kernel.org/r/20210201151245.21845-2-alexandru.ardelean@analog.com
Acked-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Moritz Fischer <mdf@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
|
|
Fix the following coccicheck warning:
./tools/bpf/bpf_dbg.c:893:32-36: WARNING: Comparison to bool.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/1612777416-34339-1-git-send-email-jiapeng.chong@linux.alibaba.com
|
|
Add missing skeleton destroy call.
Fixes: 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210208123737.963172-1-jackmanb@google.com
|
|
Eliminate the following coccicheck warning:
./tools/testing/selftests/bpf/test_flow_dissector.c:506:2-3: Unneeded
semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/1612780213-84583-1-git-send-email-yang.lee@linux.alibaba.com
|
|
This final patch adds the ZONED incompat flag to the supported flags
and enables to mount ZONED flagged file system.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. This happens when someone,
e.g., a concurrent transaction commit, writes a dirty extent in this
tree-log commit.
If we are not going to wait for the extents, we can hope the concurrent
writing fills the hole for us. So, we can ignore the error in this case and
hope the next write will succeed.
If we want to wait for them and got the error, we cannot wait for them
because it will cause a deadlock. So, let's bail out to a full commit in
this case.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the 3/3 patch to enable tree-log on zoned filesystems.
The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the writing order of them. So, the
writing causes unaligned write errors.
Reorder the allocation of them by delaying allocation of the root node of
"fs_info->log_root_tree," so that the node buffers can go out sequentially
to devices.
Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the 2/3 patch to enable tree-log on zoned filesystems.
Since we can start more than one log transactions per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at
the time of a log transaction commit. The nodes of the global log root
tree (fs_info->log_root_tree), also have the same problem with mixed
allocation.
Serializes log transactions by waiting for a committing transaction when
someone tries to start a new transaction, to avoid the mixed allocation
problem. We must also wait for running log transactions from another
subvolume, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when other subvolumes already allocated the global log root
tree.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the 1/3 patch to enable tree log on zoned filesystems.
The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.
Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams. As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.
This commit extends the zoned block allocator to use the block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a preparation patch for the next patch. Split alloc_log_tree()
into two parts. The first one allocating the tree structure, remains in
alloc_log_tree() and the second part allocating the tree node, which is
moved into btrfs_alloc_log_tree_node().
Also export the latter part is to be used in the next patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently fallocate() is disabled on a zoned filesystem. Since current
relocation process relies on preallocation to move file data extents, it
must be handled differently.
On a zoned filesystem, we just truncate the inode to the size that we
wanted to pre-allocate. Then, we flush dirty pages on the file before
finishing the relocation process. run_delalloc_zoned() will handle all
the allocations and submit IOs to the underlying layers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is 4/4 patch to implement device-replace on zoned filesystems.
Even after the copying is done, the write pointers of the source device
and the destination device may not be synchronized. For example, when
the last allocated extent is freed before device-replace process, the
extent is not copied, leaving a hole there.
Synchronize the write pointers by writing zeroes to the destination
device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is 3/4 patch to implement device-replace on zoned filesystems.
This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.
The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit 042528f8d840 ("Btrfs: fix
block group remaining RO forever after error during device replace").
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is 2/4 patch to implement device replace for zoned filesystems.
In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).
Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.
This commit also handles cloning of device reset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the 1/4 patch to support device-replace on zoned filesystems.
We have two types of IOs during the device replace process. One is an IO
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other one is an IO to "clone" (by
handle_ops_on_dev_replace()) new incoming write IOs from users to the
source device into the target device.
Cloning incoming IOs can break the sequential write rule in on target
device. When a write is mapped in the middle of a block group, the IO is
directed to the middle of a target device zone, which breaks the
sequential write requirement.
However, the cloning function cannot be disabled since incoming IOs
targeting already copied device extents must be cloned so that the IO is
executed on the target device.
We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.
So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.
Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.
Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.
To preserve write bio ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as
a single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple
zones are being written simultaneously, HDD seek overhead lowers the
achievable maximum bandwidth, resulting again in a per zone checksum
serialization not affecting the performance.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When truncating a file, file buffers which have already been allocated
but not yet written may be truncated. Truncating these buffers could
cause breakage of a sequential write pattern in a block group if the
truncated blocks are for example followed by blocks allocated to another
file. To avoid this problem, always wait for write out of all unwritten
buffers before proceeding with the truncate execution.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.
We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.
Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.
Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.
A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If more than one IO is issued for one file extent, these IO can be
written to separate regions on a device. Since we cannot map one file
extent to such a separate area on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.
The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write a
part of an ordered extent when specified a region to write e.g., when
its called from fdatasync().
Introduce a dedicated (uncompressed buffered) data write path for zoned
filesystems, that will COW the region and write it at once.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Likewise to buffered IO, enable zone append writing for direct IO when
its used on a zoned block device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.
Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.
Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().
And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.
btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
On a zoned filesystem, cache if a block group is on a sequential write
only zone.
On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
writing data, therefore provide btrfs_use_zone_append() to figure out if
IO is targeting a sequential write only zone and we can use
REQ_OP_ZONE_APPEND for data writing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.
Extend the function to match to a specified device. The old functionality
of querying all devices is left intact by specifying NULL as target
device.
A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.
Also export the function for later use.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.
Ensure that constructing bio does not span more than an ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.
Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.
extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.
Several sanity checks need to be done in extract_ordered_extent() e.g.
- We cannot split once end_bio'd ordered extent because we cannot divide
ordered->bytes_left for the split ones
- We do not expect a compressed ordered extent
- We should not have checksum list because we omit the list splitting.
Since the function is called before btrfs_wq_submit_bio() or
btrfs_csum_one_bio(), this should be always ensured.
We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|