Age | Commit message (Collapse) | Author |
|
When trace_clock option is not set and unstable clcok detected,
tracing_set_default_clock() sets trace_clock(ThinkPad A285 is one of
case). In that case, if lockdown is in effect, null pointer
dereference error happens in ring_buffer_set_clock().
Link: http://lkml.kernel.org/r/20200116131236.3866925-1-masami256@gmail.com
Cc: stable@vger.kernel.org
Fixes: 17911ff38aa58 ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1788488
Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
While working on a tool to convert SQL syntex into the histogram language of
the kernel, I discovered the following bug:
# echo 'first u64 start_time u64 end_time pid_t pid u64 delta' >> synthetic_events
# echo 'hist:keys=pid:start=common_timestamp' > events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp-$start,start2=$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
Would not display any histograms in the sched_switch histogram side.
But if I were to swap the location of
"delta=common_timestamp-$start" with "start2=$start"
Such that the last line had:
# echo 'hist:keys=next_pid:start2=$start,delta=common_timestamp-$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
The histogram works as expected.
What I found out is that the expressions clear out the value once it is
resolved. As the variables are resolved in the order listed, when
processing:
delta=common_timestamp-$start
The $start is cleared. When it gets to "start2=$start", it errors out with
"unresolved symbol" (which is silent as this happens at the location of the
trace), and the histogram is dropped.
When processing the histogram for variable references, instead of adding a
new reference for a variable used twice, use the same reference. That way,
not only is it more efficient, but the order will no longer matter in
processing of the variables.
From Tom Zanussi:
"Just to clarify some more about what the problem was is that without
your patch, we would have two separate references to the same variable,
and during resolve_var_refs(), they'd both want to be resolved
separately, so in this case, since the first reference to start wasn't
part of an expression, it wouldn't get the read-once flag set, so would
be read normally, and then the second reference would do the read-once
read and also be read but using read-once. So everything worked and
you didn't see a problem:
from: start2=$start,delta=common_timestamp-$start
In the second case, when you switched them around, the first reference
would be resolved by doing the read-once, and following that the second
reference would try to resolve and see that the variable had already
been read, so failed as unset, which caused it to short-circuit out and
not do the trigger action to generate the synthetic event:
to: delta=common_timestamp-$start,start2=$start
With your patch, we only have the single resolution which happens
correctly the one time it's resolved, so this can't happen."
Link: https://lore.kernel.org/r/20200116154216.58ca08eb@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Tom Zanuss <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs fix from Jan Kara:
"A fixup of a recently merged reiserfs fix which has caused problem
when xattrs were not compiled in"
* tag 'fixes_for_v5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
reiserfs: fix handling of -EOPNOTSUPP in reiserfs_for_each_xattr
|
|
Fix a memory leak reported by kmemleak:
unreferenced object 0xffff000bc6f50e80 (size 128):
comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
hex dump (first 32 bytes):
00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00 ....A...........
00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff ........@Q......
backtrace:
[<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320
[<00000000279143c9>] irq_domain_push_irq+0x7c/0x188
[<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438
[<00000000fd09ec22>] pci_device_probe+0xe4/0x198
[<00000000d43eca75>] really_probe+0xdc/0x320
[<00000000d3ebab09>] driver_probe_device+0x5c/0xf0
[<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0
[<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8
[<0000000079d4db41>] __device_attach+0xe4/0x140
[<00000000883bbda9>] device_initial_probe+0x18/0x20
[<000000003be59ef6>] bus_probe_device+0x98/0xa0
[<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8
[<00000000870934ce>] process_one_work+0x1c8/0x470
[<00000000e3cce570>] worker_thread+0x1f8/0x428
[<000000005d64975e>] kthread+0xfc/0x128
[<00000000f0eaa764>] ret_from_fork+0x10/0x18
Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
|
|
The Interrupt Multiplexer (INTMUX) expands the number of peripherals
that can interrupt the core:
* The INTMUX has 8 channels that are assigned to 8 NVIC interrupt slots.
* Each INTMUX channel can receive up to 32 interrupt sources and has 1
interrupt output.
* The INTMUX routes the interrupt sources to the interrupt outputs.
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200117060653.27485-3-qiangqing.zhang@nxp.com
|
|
multiplexer
This patch adds the DT bindings for the NXP INTMUX interrupt multiplexer
for i.MX8 family SoCs.
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20200117060653.27485-2-qiangqing.zhang@nxp.com
|
|
This patch is written to clean up dependency of ARCH_EXYNOS
Not all exynos device have IRQ_COMBINER, especially aarch64 EXYNOS
but it is built for all exynos devices.
Thus add the config for EXYNOS_IRQ_COMBINER
remove direct dependency between ARCH_EXYNOS and exynos-combiner.c
and only selected on the aarch32 devices
Signed-off-by: Hyunki Koo <hyunki00.koo@samsung.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://lore.kernel.org/r/20191224211108.7128-1-hyunki00.koo@gmail.com
|
|
The meson a1 Socs have some changes compared with previous
chips. For A113L, it contains 62 pins and can be spied on:
- 62:128 undefined
- 61:50 12 pins on bank A
- 49:37 13 pins on bank F
- 36:20 17 pins on bank X
- 19:13 7 pins on bank B
- 12:0 13 pins on bank P
There are five relative registers for gpio interrupt controller,
details are as below:
- PADCTRL_GPIO_IRQ_CTRL0
bit[31]: enable/disable the whole irq lines
bit[16-23]: both edge trigger
bit[8-15]: single edge trigger
bit[0-7]: pol trigger
- PADCTRL_GPIO_IRQ_CTRL[X]
bit[0-6]: 7 bits to choose gpio source for irq line 2*[X] - 2
bit[16-22]: 7 bits to choose gpio source for irq line 2*[X] - 1
where X =1,2,3,4
Signed-off-by: Qianggui Song <qianggui.song@amlogic.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191216123645.10099-4-qianggui.song@amlogic.com
|
|
Since Meson-A1 SoCs register layout of gpio interrupt controller has
difference with previous chips, registers to decide irq line and offset
of trigger method are all changed, the current driver should be modified.
Signed-off-by: Qianggui Song <qianggui.song@amlogic.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191216123645.10099-3-qianggui.song@amlogic.com
|
|
Update dt-binding document for GPIO interrupt controller of Meson-A1 SoCs
Signed-off-by: Qianggui Song <qianggui.song@amlogic.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20191216123645.10099-2-qianggui.song@amlogic.com
|
|
The following crash can be seen for setting
CONFIG_DEBUG_TEST_DRIVER_REMOVE=y for DT FW (which some people still use):
Hisilicon MBIGEN-V2 60080000.interrupt-controller: Failed to create mbi-gen irqdomain
Hisilicon MBIGEN-V2: probe of 60080000.interrupt-controller failed with error -12
[...]
Unable to handle kernel paging request at virtual address 0000000000005008
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=0000041fb9990000
[0000000000005008] pgd=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc6-00002-g3fc42638a506-dirty #1622
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
pstate: 40000085 (nZcv daIf -PAN -UAO)
pc : mbigen_set_type+0x38/0x60
lr : __irq_set_trigger+0x6c/0x188
sp : ffff800014b4b400
x29: ffff800014b4b400 x28: 0000000000000007
x27: 0000000000000000 x26: 0000000000000000
x25: ffff041fd83bd0d4 x24: ffff041fd83bd188
x23: 0000000000000000 x22: ffff80001193ce00
x21: 0000000000000004 x20: 0000000000000000
x19: ffff041fd83bd000 x18: ffffffffffffffff
x17: 0000000000000000 x16: 0000000000000000
x15: ffff8000119098c8 x14: ffff041fb94ec91c
x13: ffff041fb94ec1a1 x12: 0000000000000030
x11: 0101010101010101 x10: 0000000000000040
x9 : 0000000000000000 x8 : ffff041fb98c6680
x7 : ffff800014b4b380 x6 : ffff041fd81636c8
x5 : 0000000000000000 x4 : 000000000000025f
x3 : 0000000000005000 x2 : 0000000000005008
x1 : 0000000000000004 x0 : 0000000080000000
Call trace:
mbigen_set_type+0x38/0x60
__setup_irq+0x744/0x900
request_threaded_irq+0xe0/0x198
pcie_pme_probe+0x98/0x118
pcie_port_probe_service+0x38/0x78
really_probe+0xa0/0x3e0
driver_probe_device+0x58/0x100
__device_attach_driver+0x90/0xb0
bus_for_each_drv+0x64/0xc8
__device_attach+0xd8/0x138
device_initial_probe+0x10/0x18
bus_probe_device+0x90/0x98
device_add+0x4c4/0x770
device_register+0x1c/0x28
pcie_port_device_register+0x1e4/0x4f0
pcie_portdrv_probe+0x34/0xd8
local_pci_probe+0x3c/0xa0
pci_device_probe+0x128/0x1c0
really_probe+0xa0/0x3e0
driver_probe_device+0x58/0x100
__device_attach_driver+0x90/0xb0
bus_for_each_drv+0x64/0xc8
__device_attach+0xd8/0x138
device_attach+0x10/0x18
pci_bus_add_device+0x4c/0xb8
pci_bus_add_devices+0x38/0x88
pci_host_probe+0x3c/0xc0
pci_host_common_probe+0xf0/0x208
hisi_pcie_almost_ecam_probe+0x24/0x30
platform_drv_probe+0x50/0xa0
really_probe+0xa0/0x3e0
driver_probe_device+0x58/0x100
device_driver_attach+0x6c/0x90
__driver_attach+0x84/0xc8
bus_for_each_dev+0x74/0xc8
driver_attach+0x20/0x28
bus_add_driver+0x148/0x1f0
driver_register+0x60/0x110
__platform_driver_register+0x40/0x48
hisi_pcie_almost_ecam_driver_init+0x1c/0x24
The specific problem here is that the mbigen driver real probe has failed
as the mbigen_of_create_domain()->of_platform_device_create() call fails,
the reason for that being that we never destroyed the platform device
created during the remove test dry run and there is some conflict.
Since we generally would never want to unbind this driver, and to save
adding a driver tear down path for that, just set the driver
.suppress_bind_attrs member to avoid this possibility.
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Link: https://lore.kernel.org/r/1579196323-180137-1-git-send-email-john.garry@huawei.com
|
|
The Aspeed SOCs provide some interrupts through the System Control
Unit registers. Add an interrupt controller that provides these
interrupts to the system.
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Andrew Jeffery <andrew@aj.id.au>
Link: https://lore.kernel.org/r/1579123790-6894-3-git-send-email-eajames@linux.ibm.com
|
|
Document the Aspeed SCU interrupt controller and add an include file
for the interrupts it provides.
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Andrew Jeffery <andrew@aj.id.au>
Link: https://lore.kernel.org/r/1579123790-6894-2-git-send-email-eajames@linux.ibm.com
|
|
Adds the GPIO driver for SiFive RISC-V SoCs.
Signed-off-by: Wesley W. Terpstra <wesley@sifive.com>
[Atish: Various fixes and code cleanup]
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/1575976274-13487-6-git-send-email-yash.shah@sifive.com
|
|
Change the exported symbols introduced by commit e9153311491da
("regulator: vctrl-regulator: Avoid deadlock getting and setting the voltage")
from EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL(), like is used for all the core
parts.
Fixes: e9153311491da ("regulator: vctrl-regulator: Avoid deadlock getting and setting the voltage")
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
Link: https://lore.kernel.org/r/20200120123921.1204339-1-enric.balletbo@collabora.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
There's no need for the ubifs_crypt_is_encrypted() function anymore.
Just use IS_ENCRYPTED() instead, like ext4 and f2fs do. IS_ENCRYPTED()
checks the VFS-level flag instead of the UBIFS-specific flag, but it
shouldn't change any behavior since the flags are kept in sync.
Link: https://lore.kernel.org/r/20191209212721.244396-1-ebiggers@kernel.org
Acked-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
The bitmap allocation did not use full unsigned long sizes
when calculating the required size and that was triggered by KASAN
as slab-out-of-bounds read in several places. The patch fixes all
of them.
Reported-by: syzbot+fabca5cbf5e54f3fe2de@syzkaller.appspotmail.com
Reported-by: syzbot+827ced406c9a1d9570ed@syzkaller.appspotmail.com
Reported-by: syzbot+190d63957b22ef673ea5@syzkaller.appspotmail.com
Reported-by: syzbot+dfccdb2bdb4a12ad425e@syzkaller.appspotmail.com
Reported-by: syzbot+df0d0f5895ef1f41a65b@syzkaller.appspotmail.com
Reported-by: syzbot+b08bd19bb37513357fd4@syzkaller.appspotmail.com
Reported-by: syzbot+53cdd0ec0bbabd53370a@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.
For example:
BTRFS info (device sdc): device stats zeroed by btrfs(9223)
Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing. It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.
We unconditionally were setting scanned = 1; the first time we found any
pages in the inode. This isn't actually what we want, we want it to be
set if we've scanned the entire file. For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.
Fix this by not setting scanned = 1 if we find pages. The rules for
setting scanned should be
1) !range_cyclic. In this case we have a specified range to write out.
2) range_cyclic && index == 0. In this case we've started at the
beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
the file without satisfying our nr_to_write.
This patch fixes both of our writepages implementations to make sure
these rules hold true. This fixed our over zealous CG OOMs in
production.
Fixes: d1310b2e0cd9 ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Dan's smatch tool reports
fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
warn: should this be 'count == -1'
which points to the while (count--) loop. With count == 0 the check
itself could decrement it to -1. There's a WARN_ON a few lines below
that has never been seen in practice though.
It turns out that the value of page_bytes_left matches the count (by
sectorsize multiples). The loop never reaches the state where count
would go to -1, because page_bytes_left == 0 is found first and this
breaks out.
For clarity, use only plain check on count (and only for positive
value), decrement safely inside the loop. Any other discrepancy after
the whole bio list processing should be reported by the exising
WARN_ON_ONCE as well.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a leftover from recently removed bio scheduling framework.
Fixes: ba8a9d079543 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
The only difference is btrfs_get_alloc_profile() is visible to other
functions in btrfs while get_alloc_profile() is static and thus only
visible to functions in block-group.c.
Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
get rid of the unnecessary second function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
From Dave's testing described below, it's possible to drive a file
system to have bogus values of discardable_extents and _bytes. As
btrfs_discard_calc_delay() is the only user of discardable_extents, we
can correct here for any negative discardable_extents/discardable_bytes.
The problem is not reliably reproducible. The workload that created it
was based on linux git tree, switching between release tags, then
everytihng deleted followed by a full rebalance. At this state the
values of discardable_bytes was 16K and discardable_extents was -1,
expected values 0 and 0.
Repeating the workload again did not correct the bogus values so the
offset seems to be stable once it happens.
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
However, there are certain cases where we may free the free space cache
via __btrfs_remove_free_space_cache(). This exposes a path where
free_bitmap() is called regardless. This may result in a bad accounting
situation for discardable_bytes and discardable_extents. So, remove the
stats and call btrfs_discard_update_discardable().
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's less than ideal for small extents to eat into our extent budget, so
force extents <= 32KB into the bitmaps save for the first handful.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently, there is no way for the free space cache to recover from
being serviced by purely bitmaps because the extent threshold is set to
0 in recalculate_thresholds() when we surpass the metadata allowance.
This adds a recovery mechanism by keeping large extents out of the
bitmaps and increases the metadata upper bound to 64KB. The recovery
mechanism bypasses this upper bound, thus making it a soft upper bound.
But, with the bypass being 1MB or greater, it shouldn't add unbounded
overhead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Give a brief overview for how async discard is implemented.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As mentioned earlier, discarding data can be done either by issuing an
explicit discard or implicitly by reusing the LBA. Metadata block_groups
see much more frequent reuse due to well it being metadata. So instead
of explicitly discarding metadata block_groups, just leave them be and
let the latter implicit discarding be done for them.
For mixed block_groups, block_groups which contain both metadata and
data, we let them be as higher fragmentation is expected.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.
This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into the appropriate sized discards, we want to be able
to account more accurately the rate at which we are consuming NCQ slots
as well as limit the upper bound of work for a discard.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.
There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.
The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Setup base sysfs directory for discard stats + tunables.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.
This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We probably should call sysfs_remove_group() on debug/.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The prior two patches added discarding via a background workqueue. This
just piggybacked off of the fstrim code to trim the whole block at once.
Well inevitably this is worse performance wise and will aggressively
overtrim. But it was nice to plumb the other infrastructure to keep the
patches easier to review.
This adds the real goal of this series which is discarding slowly (ie. a
slow long running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is two fold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first will let the newly trimmed bitmaps have the highest chance
of coalescing when being readded to the free space cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.
On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.
discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701
discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is a cap in btrfs in the amount of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of if a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap.
With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off with simplicity, we keep track of if discarding a bitmap
is in progress. If we fully scan a bitmap and trim as necessary, the
bitmap is marked clean. This has some caveats as the min block size may
skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.
BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked
BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
of it and the former is still set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().
An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff to for greater coalescing which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.
When reading in the free space cache from disk, if sync is set, mark all
extents as trimmed. The current code ensures at transaction commit that
all free space is trimmed when sync is set, so this reflects that.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Bitmaps are fairly popular for their space efficiency, but we don't have
generic iterators available. Make percpu's bitmap region iterators
available to everyone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[PROBLEM]
There is a user report in the mail list, showing the following corrupted
tree blocks:
item 62 key (486836 DIR_ITEM 2543451757) itemoff 6273 itemsize 74
location key (4065004 INODE_ITEM 1073741824) type FILE
transid 21397 data_len 0 name_len 44
name: FILENAME
Note that location key, its offset should be 0 for all INODE_ITEMS.
This caused failed lookup of the inode.
[CAUSE]
That offending value, 1073741824, is 0x40000000. So this looks like a
memory bit flip.
[FIX]
This patch will enhance tree-checker to check location key of
DIR_INDEX/DIR_ITEM/XATTR_ITEM.
There are several different combinations needs to check:
- item_key.type == DIR_INDEX/DIR_ITEM
* location_key.type == BTRFS_INODE_ITEM_KEY
This location_key should follow the check in inode_item check.
* location_key.type == BTRFS_ROOT_ITEM_KEY
Despite the existing check, DIR_INDEX/DIR_ITEM can only points to
subvolume trees.
* All other keys are not allowed.
- item_key.type == XATTR_ITEM
location_key should be all 0.
Reported-by: Mike Gilbert <floppymaster@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
ROOT_ITEM key check itself is not as simple as single line check, and
will be reused for both ROOT_ITEM and DIR_ITEM/DIR_INDEX location key
check, so refactor such check into check_root_key().
Also since we are here, fix a comment error about ROOT_ITEM offset,
which is transid of snapshot creation, not some "older kernel behavior".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inode key check is not as easy as several lines, and it will be called
in more than one location (INODE_ITEM check and
DIR_ITEM/DIR_INDEX/XATTR_ITEM location key check).
So here refactor such check into check_inode_key(). And add extra
checks for XATTR_ITEM.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The @fs_info parameter can be extracted from extent_buffer structure,
and there are already some wrappers getting rid of the @fs_info
parameter.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|