linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2022-03-14	mtd: rawnand: rockchip: fix platform_get_irq.cocci warning	Yihao Han
	Remove dev_err() messages after platform_get_irq*() failures. platform_get_irq() already prints an error. Generated by: scripts/coccinelle/api/platform_get_irq.cocci Signed-off-by: Yihao Han <hanyihao@vivo.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/20220303123431.3170-1-hanyihao@vivo.com
2022-03-14	platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls	Jorge Lopez
	Several WMI queries leverage hp_wmi_read_int function to read their data. hp_wmi_read_int function was corrected in a previous patch. Now, this function invokes hp_wmi_perform_query with input parameter of size zero and the output buffer of size 4. WMI commands calling hp_wmi_perform_query with input buffer size value of zero are listed below. HPWMI_DISPLAY_QUERY HPWMI_HDDTEMP_QUERY HPWMI_ALS_QUERY HPWMI_HARDWARE_QUERY HPWMI_WIRELESS_QUERY HPWMI_BIOS_QUERY HPWMI_FEATURE_QUERY HPWMI_HOTKEY_QUERY HPWMI_FEATURE2_QUERY HPWMI_WIRELESS2_QUERY HPWMI_POSTCODEERROR_QUERY HPWMI_THERMAL_PROFILE_QUERY HPWMI_FAN_SPEED_MAX_GET_QUERY Invoking those WMI commands with an input buffer size greater than zero will cause error 0x05 to be returned. All WMI commands executed by the driver were reviewed and changes were made to ensure the expected input and output buffer size match the WMI specification. Changes were validated on a HP ZBook Workstation notebook, HP EliteBook x360, and HP EliteBook 850 G8. Additional validation was included in the test process to ensure no other commands were incorrectly handled. Signed-off-by: Jorge Lopez <jorge.lopez2@hp.com> Link: https://lore.kernel.org/r/20220310210853.28367-4-jorge.lopez2@hp.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2022-03-14	platform/x86: hp-wmi: Fix SW_TABLET_MODE detection method	Jorge Lopez
	The purpose of this patch is to introduce a fix and removal of the current hack when determining tablet mode status. Determining the tablet mode status requires reading Byte 0 bit 2 as reported by HPWMI_HARDWARE_QUERY. The investigation identified the failure was rooted in two areas: HPWMI_HARDWARE_QUERY failure (0x05) and reading Byte 0, bit 2 only to determine the table mode status. HPWMI_HARDWARE_QUERY WMI failure also rendered the dock state value invalid. The latest changes use SMBIOS Type 3 (chassis type) and WMI Command 0x40 (device_mode_status) information to determine if the device is in tablet mode or not. hp_wmi_hw_state function was split into two functions; hp_wmi_get_dock_state and hp_wmi_get_tablet_mode. The new functions separate how dock_state and tablet_mode is handled in a cleaner manner. All changes were validated on a HP ZBook Workstation notebook, HP EliteBook x360, and HP EliteBook 850 G8. Signed-off-by: Jorge Lopez <jorge.lopez2@hp.com> Link: https://lore.kernel.org/r/20220310210853.28367-3-jorge.lopez2@hp.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2022-03-14	platform/x86: hp-wmi: Fix hp_wmi_read_int() reporting error (0x05)	Jorge Lopez
	The purpose of this patch is to introduce a fix to hp_wmi_read_int() and eliminate failure error (0x05). Several WMI queries leverage hp_wmi_read_int() to read their data and were failing with error 0x05. HPWMI_DISPLAY_QUERY HPWMI_HDDTEMP_QUERY HPWMI_ALS_QUERY HPWMI_HARDWARE_QUERY HPWMI_WIRELESS_QUERY HPWMI_POSTCODEERROR_QUERY The failure occurs because hp_wmi_read_int() calls hp_wmi_perform_query() with input parameter of size greater than zero. Invoking those WMI commands with an input buffer size greater than zero causes the command to be rejected and error 0x05 be returned. All changes were validated on a HP ZBook Workstation notebook, HP EliteBook x360, and HP EliteBook 850 G8. Signed-off-by: Jorge Lopez <jorge.lopez2@hp.com> Link: https://lore.kernel.org/r/20220310210853.28367-2-jorge.lopez2@hp.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2022-03-14	KVM: s390: selftests: Add error memop tests	Janis Schoetterl-Glausch
	Test that errors occur if key protection disallows access, including tests for storage and fetch protection override. Perform tests for both logical vcpu and absolute vm ioctls. Also extend the existing tests to the vm ioctl. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220308125841.3271721-6-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	KVM: s390: selftests: Add more copy memop tests	Janis Schoetterl-Glausch
	Do not just test the actual copy, but also that success is indicated when using the check only flag. Add copy test with storage key checking enabled, including tests for storage and fetch protection override. These test cover both logical vcpu ioctls as well as absolute vm ioctls. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220308125841.3271721-5-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	KVM: s390: selftests: Add named stages for memop test	Janis Schoetterl-Glausch
	The stages synchronize guest and host execution. This helps the reader and constraits the execution of the test -- if the observed staging differs from the expected the test fails. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220308125841.3271721-4-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	KVM: s390: selftests: Add macro as abstraction for MEM_OP	Janis Schoetterl-Glausch
	In order to achieve good test coverage we need to be able to invoke the MEM_OP ioctl with all possible parametrizations. However, for a given test, we want to be concise and not specify a long list of default values for parameters not relevant for the test, so the readers attention is not needlessly diverted. Add a macro that enables this and convert the existing test to use it. The macro emulates named arguments and hides some of the ioctl's redundancy, e.g. sets the key flag if an access key is specified. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220308125841.3271721-3-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	KVM: s390: selftests: Split memop tests	Janis Schoetterl-Glausch
	Split success case/copy test from error test, making them independent. This means they do not share state and are easier to understand. Also, new test can be added in the same manner without affecting the old ones. In order to make that simpler, introduce functionality for the setup of commonly used variables. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220308125841.3271721-2-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	KVM: s390x: fix SCK locking	Claudio Imbrenda
	When handling the SCK instruction, the kvm lock is taken, even though the vcpu lock is already being held. The normal locking order is kvm lock first and then vcpu lock. This is can (and in some circumstances does) lead to deadlocks. The function kvm_s390_set_tod_clock is called both by the SCK handler and by some IOCTLs to set the clock. The IOCTLs will not hold the vcpu lock, so they can safely take the kvm lock. The SCK handler holds the vcpu lock, but will also somehow need to acquire the kvm lock without relinquishing the vcpu lock. The solution is to factor out the code to set the clock, and provide two wrappers. One is called like the original function and does the locking, the other is called kvm_s390_try_set_tod_clock and uses trylock to try to acquire the kvm lock. This new wrapper is then used in the SCK handler. If locking fails, -EAGAIN is returned, which is eventually propagated to userspace, thus also freeing the vcpu lock and allowing for forward progress. This is not the most efficient or elegant way to solve this issue, but the SCK instruction is deprecated and its performance is not critical. The goal of this patch is just to provide a simple but correct way to fix the bug. Fixes: 6a3f95a6b04c ("KVM: s390: Intercept SCK instruction") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220301143340.111129-1-imbrenda@linux.ibm.com Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-03-14	staging: vchiq_arm: make vchiq_platform_get_arm_state() static	Gaston Gonzalez
	Fix "no previous prototype" W=1 warning by making the function vchiq_platform_get_arm_state() static. While at it, realign the function declaration in one line and reposition the asterisk symbol to fulfill the 'foo *bar' syntax. Signed-off-by: Gaston Gonzalez <gascoar@gmail.com> Link: https://lore.kernel.org/r/216ad30d674b80e0051ecc233ac26ddb1d3e0e75.1646255044.git.gascoar@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	staging: mt7621-dts: fix cpuintc and fixedregulator dtc warnings, fix xhci	Arınç ÜNAL
	Fix the cpuintc and fixedregulator dtc warnings: Warning (unit_address_vs_reg): /cpuintc@0: node has a unit name, but no reg property Warning (unit_address_vs_reg): /fixedregulator@0: node has a unit name, but no reg property Warning (unit_address_vs_reg): /fixedregulator@1: node has a unit name, but no reg property Warning (unique_unit_address): /cpuintc@0: duplicate unit-address (also used in node /fixedregulator@0) Remove the unnecessary status = "okay" property from the xhci node. Reviewed-by: Sergio Paracuellos <sergio.paracuellos@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Link: https://lore.kernel.org/r/20220312091832.6269-1-arinc.unal@arinc9.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	staging: mt7621-dts: fix GB-PC2 devicetree	Arınç ÜNAL
	Fix the GB-PC2 devicetree. Refer to the schematics of the device for more information. GB-PC2 devicetree fixes: - Include mt7621.dtsi instead of gbpc1.dts. Add the missing definitions. - Remove gpio-leds node as the system LED is not wired to anywhere on the board and the power LED is directly wired to GND. - Remove uart3 pin group from gpio-pinmux node as it's not used as GPIO. - Use reg 7 for the external phy to be on par with Documentation/devicetree/bindings/net/dsa/mt7530.txt. - Use the status value "okay". Link: https://github.com/ngiger/GnuBee_Docs/blob/master/GB-PCx/Documents/GB-PC2_V1.1_schematic.pdf Reviewed-by: Sergio Paracuellos <sergio.paracuellos@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Link: https://lore.kernel.org/r/20220311090320.3068-2-arinc.unal@arinc9.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	staging: mt7621-dts: fix LEDs and pinctrl on GB-PC1 devicetree	Arınç ÜNAL
	Fix LED and pinctrl definitions on the GB-PC1 devicetree. Refer to the schematics of the device for more information. LED fixes: - Change GPIO6 LED label from system to power as GPIO6 is connected to PLED. - Add default-on default-trigger to power LED. - Change GPIO8 LED label from status to system as GPIO8 is connected to SYS_LED. - Add disk-activity default-trigger to system LED. - Switch to the color:function naming scheme. - Remove lan1 and lan2 LEDs as they don't exist. Pinctrl fixes: - Claim state_default node under pinctrl node. - Change pinctrl0 node name to state-default. - Change gpio node name to gpio-pinmux to respect Documentation/devicetree/bindings/pinctrl/ralink,rt2880-pinmux.yaml. - Sort pin groups alphabetically. Misc fixes: - Fix formatting. - Use the status value "okay". - Define hexadecimal addresses in lower case. - Make hexadecimal addresses for memory easier to read. Link: https://github.com/ngiger/GnuBee_Docs/blob/master/GB-PCx/Documents/GB-PC1_V1.0_Schematic.pdf Tested-by: Sergio Paracuellos <sergio.paracuellos@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Link: https://lore.kernel.org/r/20220311090320.3068-1-arinc.unal@arinc9.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	staging: rtl8723bs: fix typos in comments	Julia Lawall
	Various spelling mistakes in comments. Detected with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/20220314115354.144023-8-Julia.Lawall@inria.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	MIPS: Fix wrong comments in asm/prom.h	Tiezhu Yang
	In arch/mips/include/asm/prom.h, it should use "!CONFIG_USE_OF" after #else and #endif. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	MIPS: Remove redundant definitions of device_tree_init()	Tiezhu Yang
	There exists many same definitions of device_tree_init() for various platforms, add a weak function in arch/mips/kernel/prom.c to clean up the related code. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	MIPS: Remove redundant check in device_tree_init()	Tiezhu Yang
	In device_tree_init(), unflatten_and_copy_device_tree() checks initial_boot_params, so remove the redundant check. drivers/of/fdt.c void __init unflatten_and_copy_device_tree(void) { int size; void *dt; if (!initial_boot_params) { pr_warn("No valid device tree found, continuing without\n"); return; } ... } Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	Merge 5.17-rc8 into staging-next	Greg Kroah-Hartman
	We need the staging fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14	MIPS: pgalloc: fix memory leak caused by pgd_free()	Yaliang Wang
	pgd page is freed by generic implementation pgd_free() since commit f9cb654cb550 ("asm-generic: pgalloc: provide generic pgd_free()"), however, there are scenarios that the system uses more than one page as the pgd table, in such cases the generic implementation pgd_free() won't be applicable anymore. For example, when PAGE_SIZE_4KB is enabled and MIPS_VA_BITS_48 is not enabled in a 64bit system, the macro "PGD_ORDER" will be set as "1", which will cause allocating two pages as the pgd table. Well, at the same time, the generic implementation pgd_free() just free one pgd page, which will result in the memory leak. The memory leak can be easily detected by executing shell command: "while true; do ls > /dev/null; grep MemFree /proc/meminfo; done" Fixes: f9cb654cb550 ("asm-generic: pgalloc: provide generic pgd_free()") Signed-off-by: Yaliang Wang <Yaliang.Wang@windriver.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	MIPS: RB532: fix return value of __setup handler	Randy Dunlap
	__setup() handlers should return 1 to obsolete_checksetup() in init/main.c to indicate that the boot option has been handled. A return of 0 causes the boot option/value to be listed as an Unknown kernel parameter and added to init's (limited) argument or environment strings. Also, error return codes don't mean anything to obsolete_checksetup() -- only non-zero (usually 1) or zero. So return 1 from setup_kmac(). Fixes: 9e21c7e40b7e ("MIPS: RB532: Replace parse_mac_addr() with mac_pton().") Fixes: 73b4390fb234 ("[MIPS] Routerboard 532: Support for base system") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> From: Igor Zhbanov <i.zhbanov@omprussia.ru> Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: linux-mips@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Phil Sutter <n0-1@freewrt.org> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Daniel Walter <dwalter@google.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	MIPS: Only use current_stack_pointer on GCC	Kees Cook
	Unfortunately, Clang did not have support for "sp" as a global register definition, and was crashing after the addition of current_stack_pointer. This has been fixed in Clang 14, but earlier Clang versions need to avoid this code, so add a versioned test and revert back to the open-coded asm instances. Fixes Clang build error: fatal error: error in backend: Invalid register name global variable Fixes: 200ed341b864 ("mips: Implement "current_stack_pointer"") Reported-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/lkml/YikTQRql+il3HbrK@dev-arch.thelio-3990X Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Marc Zyngier <maz@kernel.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Yanteng Si <siyanteng01@gmail.com> Cc: linux-mips@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	MIPS: boot/compressed: Use array reference for image bounds	Kees Cook
	As done with other image addresses in other architectures, use an explicit flexible array instead of "address of char", which can trip bounds checking done by the compiler. Found when building with -Warray-bounds: In file included from ./include/linux/byteorder/little_endian.h:5, from ./arch/mips/include/uapi/asm/byteorder.h:15, from ./arch/mips/include/asm/bitops.h:21, from ./include/linux/bitops.h:33, from ./include/linux/kernel.h:22, from arch/mips/boot/compressed/decompress.c:13: arch/mips/boot/compressed/decompress.c: In function 'decompress_kernel': ./include/asm-generic/unaligned.h:14:8: warning: array subscript -1 is outside array bounds of 'unsigned char[1]' [-Warray-bounds] 14 \| __pptr->x; \ \| ~~~~~~^~~ ./include/uapi/linux/byteorder/little_endian.h:35:51: note: in definition of macro '__le32_to_cpu' 35 \| #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) \| ^ ./include/asm-generic/unaligned.h:32:21: note: in expansion of macro '__get_unaligned_t' 32 \| return le32_to_cpu(__get_unaligned_t(__le32, p)); \| ^~~~~~~~~~~~~~~~~ arch/mips/boot/compressed/decompress.c:29:37: note: while referencing '__image_end' 29 \| extern unsigned char __image_begin, __image_end; \| ^~~~~~~~~~~ Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: linux-mips@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	mips: cdmm: Fix refcount leak in mips_cdmm_phys_base	Miaoqian Lin
	The of_find_compatible_node() function returns a node pointer with refcount incremented, We should use of_node_put() on it when done Add the missing of_node_put() to release the refcount. Fixes: 2121aa3e2312 ("mips: cdmm: Add mti,mips-cdmm dtb node support") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Acked-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2022-03-14	btrfs: zoned: put block group after final usage	Nikolay Borisov
	It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: don't access possibly stale fs_info data in device_list_add	Dongliang Mu
	Syzbot reported a possible use-after-free in printing information in device_list_add. Very similar with the bug fixed by commit 0697d9a61099 ("btrfs: don't access possibly stale fs_info data for printing duplicate device"), but this time the use occurs in btrfs_info_in_rcu. Call Trace: kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 btrfs_printk+0x395/0x425 fs/btrfs/super.c:244 device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957 btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387 btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by modifying device->fs_info to NULL too. Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: add lockdep_assert_held to need_preemptive_reclaim	Niels Dossche
	In a previous patch ("btrfs: extend locking to all space_info members accesses") the locking for the space_info members was extended in btrfs_preempt_reclaim_metadata_space because not all the member accesses that needed locks were actually locked (bytes_pinned et al). It was then suggested to also add a call to lockdep_assert_held to need_preemptive_reclaim. This function also works with space_info members. As of now, it has only two call sites which both hold the lock. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: verify the tranisd of the to-be-written dirty extent buffer	Qu Wenruo
	[BUG] There is a bug report that a bitflip in the transid part of an extent buffer makes btrfs to reject certain tree blocks: BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22 [CAUSE] Note the failed transid check, hex(262166) = 0x40016, while hex(22) = 0x16. It's an obvious bitflip. Furthermore, the reporter also confirmed the bitflip is from the hardware, so it's a real hardware caused bitflip, and such problem can not be detected by the existing tree-checker framework. As tree-checker can only verify the content inside one tree block, while generation of a tree block can only be verified against its parent. So such problem remain undetected. [FIX] Although tree-checker can not verify it at write-time, we still have a quick (but not the most accurate) way to catch such obvious corruption. Function csum_one_extent_buffer() is called before we submit metadata write. Thus it means, all the extent buffer passed in should be dirty tree blocks, and should be newer than last committed transaction. Using that we can catch the above bitflip. Although it's not a perfect solution, as if the corrupted generation is higher than the correct value, we have no way to catch it at all. Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/ CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@sus,ree.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: unify the error handling of btrfs_read_buffer()	Qu Wenruo
	There is one oddball error handling of btrfs_read_buffer(): ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key); if (!ret) { *eb_ret = tmp; return 0; } free_extent_buffer(tmp); btrfs_release_path(p); return -EIO; While all other call sites check the error first. Unify the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: unify the error handling pattern for read_tree_block()	Qu Wenruo
	We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. / } else if (!extent_buffer_uptodate(eb)) { / * Different error handling here * Normally also ended up with return or goto out; / } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { / * Handling error here / return eb or goto out; } if (!extent_buffer_uptodate(eb)) { / * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: factor out do_free_extent_accounting helper	Josef Bacik
	__btrfs_free_extent() does all of the hard work of updating the extent ref items, and then at the end if we dropped the extent completely it does the cleanup accounting work. We're going to only want to do that work for metadata with extent tree v2, so extract this bit into its own helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: remove last_ref from the extent freeing code	Josef Bacik
	This is a remnant of the work I did for qgroups a long time ago to only run for a block when we had dropped the last ref. We haven't done that for years, but the code remains. Drop this remnant. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: add a alloc_reserved_extent helper	Josef Bacik
	We duplicate this logic for both data and metadata, at this point we've already done our type specific extent root operations, this is just doing the accounting and removing the space from the free space tree. Extract this common logic out into a helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block	Josef Bacik
	Switch this to an ASSERT() and return the error in the normal case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: add and use helper for unlinking inode during log replay	Filipe Manana
	During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: extend locking to all space_info members accesses	Niels Dossche
	bytes_pinned is always accessed under space_info->lock, except in btrfs_preempt_reclaim_metadata_space, however the other members are accessed under that lock. The reserved member of the rsv's are also partially accessed under a lock and partially not. Move all these accesses into the same lock to ensure consistency. This could potentially race and lead to a flush instead of a commit but it's not a big problem as it's only for preemptive flush. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: zoned: mark relocation as writing	Naohiro Aota
	There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	fs: allow cross-vfsmount reflink/dedupe	Josef Bacik
	Currently we disallow reflink and dedupe if the two files aren't on the same vfsmount. However we really only need to disallow it if they're not on the same super block. It is very common for btrfs to have a main subvolume that is mounted and then different subvolumes mounted at different locations. It's allowed to reflink between these volumes, but the vfsmount check disallows this. Instead fix dedupe to check for the same superblock, and simply remove the vfsmount check for reflink as it already does the superblock check. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: remove the cross file system checks from remap	Josef Bacik
	The sb check is already done in do_clone_file_range, and the mnt check (which will hopefully go away in a subsequent patch) is done in ioctl_file_clone(). Remove the check in our code and put an ASSERT() to make sure it doesn't change underneath us. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: pass btrfs_fs_info to btrfs_recover_relocation	Josef Bacik
	We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: pass btrfs_fs_info for deleting snapshots and cleaner	Josef Bacik
	We're passing a root around here, but we only really need the fs_info, so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead, and then fix up all the callers appropriately. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: add filesystems state details to error messages	Sweet Tea Dorminy
	When a filesystem goes read-only due to an error, multiple errors tend to be reported, some of which are knock-on failures. Logging fs_states, in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the first error from subsequent messages which may only exist due to an error state. Under the new format, most initial errors will look like: `BTRFS: error (device loop0) in ...` while subsequent errors will begin with: `error (device loop0: state E) in ...` An initial transaction abort error will look like `error (device loop0: state A) in ...` and subsequent messages will contain `(device loop0: state EA) in ...` In addition to the error states we can also print other states that are temporary, like remounting, device replace, or indicate a global state that may affect functionality. Now implemented: E - filesystem error detected A - transaction aborted L - log tree errors M - remounting in progress R - device replace in progress C - data checksums not verified (mounted with ignoredatacsums) Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: deal with unexpected extent type during reflinking	Filipe Manana
	Smatch complains about a possible dereference of a pointer that was not initialized: CC [M] fs/btrfs/reflink.o CHECK fs/btrfs/reflink.c fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'. This is because we are not dealing with the case where the type of a file extent has an unexpected value (not regular, not prealloc and not inline), in which case the transaction handle pointer is not initialized. Such unexpected type should be impossible, except in case of some memory corruption caused either by bad hardware or some software bug causing something like a buffer overrun. So ASSERT that if the extent type is neither regular nor prealloc, then it must be inline. Bail out with -EUCLEAN and a warning in case it is not. This silences smatch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: fix unexpected error path when reflinking an inline extent	Filipe Manana
	When reflinking an inline extent, we assert that its file offset is 0 and that its uncompressed length is not greater than the sector size. We then return an error if one of those conditions is not satisfied. However we use a return statement, which results in returning from btrfs_clone() without freeing the path and buffer that were allocated before, as well as not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination inode. Fix that by jumping to the 'out' label instead, and also add a WARN_ON() for each condition so that in case assertions are disabled, we get to known which of the unexpected conditions triggered the error. Fixes: a61e1e0df9f321 ("Btrfs: simplify inline extent handling when doing reflinks") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: reset last_reflink_trans after fsyncing inode	Filipe Manana
	When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit 40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree") and commit e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")). We also need to make sure a full fsync will copy all old file extent items it finds in modified leaves, because they might have been copied from some other inode. However once we fsync an inode, we don't need to keep paying the price of that extra special care in future fsyncs done in the same transaction, unless the inode is used for another reflink operation or the full sync flag is set on it (truncate, failure to allocate extent maps for holes, and other exceptional and infrequent cases). So after we fsync an inode reset its last_unlink_trans to zero. In case another reflink happens, we continue to update the last_reflink_trans of the inode, just as before. Also set last_reflink_trans to the generation of the last transaction that modified the inode whenever we need to set the full sync flag on the inode, just like when we need to load an inode from disk after eviction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: voluntarily relinquish cpu when doing a full fsync	Filipe Manana
	Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: hold on to less memory when logging checksums during full fsync	Filipe Manana
	When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: stop copying old file extents when doing a full fsync	Filipe Manana
	When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit ea7036de0d36c4 ("btrfs: fix fsync failure and transaction abort after writes to prealloc extents")). In that particular case, we had large files that had random writes and were often truncated, which made the next fsync be a full sync. $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G # FILE_SIZE=$((512 * 1024 * 1024)) # 512M mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT # Create a file with many extents. Use direct IO to make it faster # to create the file - using buffered IO we would have to fsync # after each write (terribly slow). echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..." xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar # Commit the transaction, so every extent after this is from an # old generation. sync # Now rewrite only a few extents, which are all far spread apart from # each other (e.g. 1G / 32M = 32 extents). # After this only a few extents have a new generation, while all other # ones have an old generation. echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..." for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null done # Fsync, the inode logged in full sync mode since it was never fsynced # before. echo "Fsyncing file..." xfs_io -c "fsync" $MNT/foobar umount $MNT And the following bpftrace program was running when executing the test script: $ cat bpf-script.sh #!/usr/bin/bpftrace k:btrfs_log_inode { @start_log_inode[tid] = nsecs; } kr:btrfs_log_inode /@start_log_inode[tid]/ { @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000; delete(@start_log_inode[tid]); } k:btrfs_sync_log { @start_sync_log[tid] = nsecs; } kr:btrfs_sync_log /@start_sync_log[tid]/ { $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000; printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]); printf("btrfs_sync_log() took %llu us\n", $sync_log_dur); delete(@start_sync_log[tid]); delete(@log_inode_dur[tid]); exit(); } With 512M test file, before this patch: btrfs_log_inode() took 15218 us btrfs_sync_log() took 1328 us Log tree has 17 leaves and 1 node, its total size is 294912 bytes. With 512M test file, after this patch: btrfs_log_inode() took 14760 us btrfs_sync_log() took 588 us Log tree has a single leaf, its total size is 16K. With 1G test file, before this patch: btrfs_log_inode() took 27301 us btrfs_sync_log() took 1767 us Log tree has 33 leaves and 1 node, its total size is 557056 bytes. With 1G test file, after this patch: btrfs_log_inode() took 26166 us btrfs_sync_log() took 593 us Log tree has a single leaf, its total size is 16K With 2G test file, before this patch: btrfs_log_inode() took 50892 us btrfs_sync_log() took 3127 us Log tree has 65 leaves and 1 node, its total size is 1081344 bytes. With 2G test file, after this patch: btrfs_log_inode() took 50126 us btrfs_sync_log() took 586 us Log tree has a single leaf, its total size is 16K. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: do not clean up repair bio if submit fails	Josef Bacik
	The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14	btrfs: do not try to repair bio that has no mirror set	Josef Bacik
	If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>