summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-09-18Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "For this cycle we have the usual pile of cleanups and bug fixes, some performance improvements for online metadata scrubbing, massive speedups in the directory entry creation code, some performance improvement in the file ACL lookup code, a fix for a logging stall during mount, and fixes for concurrency problems. It has survived a couple of weeks of xfstests runs and merges cleanly. Summary: - Remove KM_SLEEP/KM_NOSLEEP. - Ensure that memory buffers for IO are properly sector-aligned to avoid problems that the block layer doesn't check. - Make the bmap scrubber more efficient in its record checking. - Don't crash xfs_db when superblock inode geometry is corrupt. - Fix btree key helper functions. - Remove unneeded error returns for things that can't fail. - Fix buffer logging bugs in repair. - Clean up iterator return values. - Speed up directory entry creation. - Enable allocation of xattr value memory buffer during lookup. - Fix readahead racing with truncate/punch hole. - Other minor cleanups. - Fix one AGI/AGF deadlock with RENAME_WHITEOUT. - More BUG -> WARN whackamole. - Fix various problems with the log failing to advance under certain circumstances, which results in stalls during mount" * tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits) xfs: push the grant head when the log head moves forward xfs: push iclog state cleaning into xlog_state_clean_log xfs: factor iclog state processing out of xlog_state_do_callback() xfs: factor callbacks out of xlog_state_do_callback() xfs: factor debug code out of xlog_state_do_callback() xfs: prevent CIL push holdoff in log recovery xfs: fix missed wakeup on l_flush_wait xfs: push the AIL in xlog_grant_head_wake xfs: Use WARN_ON_ONCE for bailout mount-operation xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT xfs: define a flags field for the AG geometry ioctl structure xfs: add a xfs_valid_startblock helper xfs: remove the unused XFS_ALLOC_USERDATA flag xfs: cleanup xfs_fsb_to_db xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate xfs: Fix stale data exposure when readahead races with hole punch fs: Export generic_fadvise() mm: Handle MADV_WILLNEED through vfs_fadvise() xfs: allocate xattr buffer on demand xfs: consolidate attribute value copying ...
2019-09-18Merge tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull swap access updates from Darrick Wong: "Prohibit writing to active swap files and swap partitions. There's no non-malicious use case for allowing userspace to scribble on storage that the kernel thinks it owns" * tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: don't allow writes to swap files mm: set S_SWAPFILE on blockdev swap devices
2019-09-18Merge tag 'ovl-fixes-5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix a regression in docker introduced by overlayfs changes in 4.19. Also fix a couple of miscellaneous bugs" * tag 'ovl-fixes-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: filter of trusted xattr results in audit ovl: Fix dereferencing possible ERR_PTR() ovl: fix regression caused by overlapping layers detection
2019-09-18Merge tag 'for-5.4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This continues with work on code refactoring, sanity checks and space handling. There are some less user visible changes, nothing that would particularly stand out. User visible changes: - tree checker, more sanity checks of: - ROOT_ITEM (key, size, generation, level, alignment, flags) - EXTENT_ITEM and METADATA_ITEM checks (key, size, offset, alignment, refs) - tree block reference items - EXTENT_DATA_REF (key, hash, offset) - deprecate flag BTRFS_SUBVOL_CREATE_ASYNC for subvolume creation ioctl, scheduled removal in 5.7 - delete stale and unused UAPI definitions BTRFS_DEV_REPLACE_ITEM_STATE_* - improved export of debugging information available via existing sysfs directory structure - try harder to delete relations between qgroups and allow to delete orphan entries - remove unreliable space checks before relocation starts Core: - space handling: - improved ticket reservations and other high level logic in order to remove special cases - factor flushing infrastructure and use it for different contexts, allows to remove some special case handling - reduce metadata reservation when only updating inodes - reduce global block reserve minimum size (affects small filesystems) - improved overcommit logic wrt global block reserve - tests: - fix memory leaks in extent IO tree - catch all TRIM range Fixes: - fix ENOSPC errors, leading to transaction aborts, when cloning extents - several fixes for inode number cache (mount option inode_cache) - fix potential soft lockups during send when traversing large trees - fix unaligned access to space cache pages with SLUB debug on (PowerPC) Other: - refactoring public/private functions, moving to new or more appropriate files - defines converted to enums - error handling improvements - more assertions and comments - old code deletion" * tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (138 commits) btrfs: Relinquish CPUs in btrfs_compare_trees btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic btrfs: create structure to encode checksum type and length btrfs: turn checksum type define into an enum btrfs: add enospc debug messages for ticket failure btrfs: do not account global reserve in can_overcommit btrfs: use btrfs_try_granting_tickets in update_global_rsv btrfs: always reserve our entire size for the global reserve btrfs: change the minimum global reserve size btrfs: rename btrfs_space_info_add_old_bytes btrfs: remove orig_bytes from reserve_ticket btrfs: fix may_commit_transaction to deal with no partial filling btrfs: rework wake_all_tickets btrfs: refactor the ticket wakeup code btrfs: stop partially refilling tickets when releasing space btrfs: add space reservation tracepoint for reserved bytes btrfs: roll tracepoint into btrfs_space_info_update helper btrfs: do not allow reservations if we have pending tickets btrfs: stop clearing EXTENT_DIRTY in inode I/O tree btrfs: treat RWF_{,D}SYNC writes as sync for CRCs ...
2019-09-18Merge tag 'afs-next-20190915' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "Here's a set of patches for AFS. The first three are trivial, deleting unused symbols and rolling out a wrapper function. The fourth and fifth patches make use of the previously added RCU-safe request_key facility to allow afs_permission() and afs_d_revalidate() to attempt to operate without dropping out of RCU-mode pathwalk. Under certain conditions, such as conflict with another client, we still have to drop out anyway, take a lock and consult the server" * tag 'afs-next-20190915' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Support RCU pathwalk afs: Provide an RCU-capable key lookup afs: Use afs_extract_discard() rather than iov_iter_discard() afs: remove unused variable 'afs_zero_fid' afs: remove unused variable 'afs_voltypes'
2019-09-18Merge tag 'fsverity-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fs-verity support from Eric Biggers: "fs-verity is a filesystem feature that provides Merkle tree based hashing (similar to dm-verity) for individual readonly files, mainly for the purpose of efficient authenticity verification. This pull request includes: (a) The fs/verity/ support layer and documentation. (b) fs-verity support for ext4 and f2fs. Compared to the original fs-verity patchset from last year, the UAPI to enable fs-verity on a file has been greatly simplified. Lots of other things were cleaned up too. fs-verity is planned to be used by two different projects on Android; most of the userspace code is in place already. Another userspace tool ("fsverity-utils"), and xfstests, are also available. e2fsprogs and f2fs-tools already have fs-verity support. Other people have shown interest in using fs-verity too. I've tested this on ext4 and f2fs with xfstests, both the existing tests and the new fs-verity tests. This has also been in linux-next since July 30 with no reported issues except a couple minor ones I found myself and folded in fixes for. Ted and I will be co-maintaining fs-verity" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file
2019-09-18Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (27 commits) ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_* definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants ...
2019-09-18Merge tag 'filelock-v5.4-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "Just a couple of minor bugfixes, a revision to a tracepoint to account for some earlier changes to the internals, and a patch to add a pr_warn message when someone tries to mount a filesystem with '-o mand' on a kernel that has that support disabled" * tag 'filelock-v5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: fix a memory leak bug in __break_lease() locks: print a warning when mount fails due to lack of "mand" support locks: Fix procfs output for file leases locks: revise generic_add_lease tracepoint
2019-09-18Merge branch 'work.mount-base' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount API infrastructure updates from Al Viro: "Infrastructure bits of mount API conversions. The rest is more of per-filesystem updates and that will happen in separate pull requests" * 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mtd: Provide fs_context-aware mount_mtd() replacement vfs: Create fs_context-aware mount_bdev() replacement new helper: get_tree_keyed() vfs: set fs_context::user_ns for reconfigure
2019-09-18Merge branch 'work.dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull d_path fix from Al Viro: "Fix d_absolute_path() regression in the last cycle (felt by tomoyo, mostly)" * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [PATCH] fix d_absolute_path() interplay with fsmount()
2019-09-18Merge branch 'work.namei' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs namei updates from Al Viro: "Pathwalk-related stuff" [ Audit-related cleanups, misc simplifications, and easier to follow nd->root refcounts - Linus ] * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: devpts_pty_kill(): don't bother with d_delete() infiniband: don't bother with d_delete() hypfs: don't bother with d_delete() fs/namei.c: keep track of nd->root refcount status fs/namei.c: new helper - legitimize_root() kill the last users of user_{path,lpath,path_dir}() namei.h: get the comments on LOOKUP_... in sync with reality kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h audit_inode(): switch to passing AUDIT_INODE_... filename_mountpoint(): make LOOKUP_NO_EVAL unconditional there filename_lookup(): audit_inode() argument is always 0
2019-09-18Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add the ability to abort a skcipher walk. Algorithms: - Fix XTS to actually do the stealing. - Add library helpers for AES and DES for single-block users. - Add library helpers for SHA256. - Add new DES key verification helper. - Add surrounding bits for ESSIV generator. - Add accelerations for aegis128. - Add test vectors for lzo-rle. Drivers: - Add i.MX8MQ support to caam. - Add gcm/ccm/cfb/ofb aes support in inside-secure. - Add ofb/cfb aes support in media-tek. - Add HiSilicon ZIP accelerator support. Others: - Fix potential race condition in padata. - Use unbound workqueues in padata" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits) crypto: caam - Cast to long first before pointer conversion crypto: ccree - enable CTS support in AES-XTS crypto: inside-secure - Probe transform record cache RAM sizes crypto: inside-secure - Base RD fetchcount on actual RD FIFO size crypto: inside-secure - Base CD fetchcount on actual CD FIFO size crypto: inside-secure - Enable extended algorithms on newer HW crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL crypto: inside-secure - Add EIP97/EIP197 and endianness detection padata: remove cpu_index from the parallel_queue padata: unbind parallel jobs from specific CPUs padata: use separate workqueues for parallel and serial work padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible crypto: pcrypt - remove padata cpumask notifier padata: make padata_do_parallel find alternate callback CPU workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs workqueue: unconfine alloc/apply/free_workqueue_attrs() padata: allocate workqueue internally arm64: dts: imx8mq: Add CAAM node random: Use wait_event_freezable() in add_hwgenerator_randomness() crypto: ux500 - Fix COMPILE_TEST warnings ...
2019-09-18Merge tag 'staging-5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver updates from Greg KH: "Here is the big staging/iio driver update for 5.4-rc1. Lots of churn here, with a few driver/filesystems moving out of staging finally: - erofs moved out of staging - greybus core code moved out of staging Along with that, a new filesytem has been added: - extfat to provide support for those devices requiring that filesystem (i.e. transfer devices to/from windows systems or printers) Other than that, there a number of new IIO drivers, and lots and lots and lots of staging driver cleanups and minor fixes as people continue to dig into those for easy changes. All of these have been in linux-next for a while with no reported issues" * tag 'staging-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (453 commits) Staging: gasket: Use temporaries to reduce line length. Staging: octeon: Avoid several usecases of strcpy staging: vhciq_core: replace snprintf with scnprintf staging: wilc1000: avoid twice IRQ handler execution for each single interrupt staging: wilc1000: remove unused interrupt status handling code staging: fbtft: make several arrays static const, makes object smaller staging: rtl8188eu: make two arrays static const, makes object smaller staging: rtl8723bs: core: Remove Macro "IS_MAC_ADDRESS_BROADCAST" dt-bindings: anybus-controller: move to staging/ tree staging: emxx_udc: remove local TRUE/FALSE definition staging: wilc1000: look for rtc_clk clock staging: dt-bindings: wilc1000: add optional rtc_clk property staging: nvec: make use of devm_platform_ioremap_resource staging: exfat: drop unused function parameter Staging: exfat: Avoid use of strcpy staging: exfat: use integer constants staging: exfat: cleanup spacing for casts staging: exfat: cleanup spacing for operators staging: rtl8723bs: hal: remove redundant variable n staging: pi433: Fix typo in documentation ...
2019-09-18Merge tag 'driver-core-5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg Kroah-Hartman: "Here is the big driver core update for 5.4-rc1. There was a bit of a churn in here, with a number of core and OF platform patches being added to the tree, and then after much discussion and review and a day-long in-person meeting, they were decided to be reverted and a new set of patches is currently being reviewed on the mailing list. Other than that churn, there are two "persistent" branches in here that other trees will be pulling in as well during the merge window. One branch to add support for drivers to have the driver core automatically add sysfs attribute files when a driver is bound to a device so that the driver doesn't have to manually do it (and then clean it up, as it always gets it wrong). There's another branch in here for generic lookup helpers for the driver core that lots of busses are starting to use. That's the majority of the non-driver-core changes in this patch series. There's also some on-going debugfs file creation cleanup that has been slowly happening over the past few releases, with the goal to hopefully get that done sometime next year. All of these have been in linux-next for a while now with no reported issues" [ Note that the above-mentioned generic lookup helpers branch was already brought in by the LED merge (commit 4feaab05dc1e) that had shared it. Also note that that common branch introduced an i2c bug due to a bad conversion, which got fixed here. - Linus ] * tag 'driver-core-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (49 commits) coccinelle: platform_get_irq: Fix parse error driver-core: add include guard to linux/container.h sysfs: add BIN_ATTR_WO() macro driver core: platform: Export platform_get_irq_optional() hwmon: pwm-fan: Use platform_get_irq_optional() driver core: platform: Introduce platform_get_irq_optional() Revert "driver core: Add support for linking devices during device addition" Revert "driver core: Add edit_links() callback for drivers" Revert "of/platform: Add functional dependency link from DT bindings" Revert "driver core: Add sync_state driver/bus callback" Revert "of/platform: Pause/resume sync state during init and of_platform_populate()" Revert "of/platform: Create device links for all child-supplier depencencies" Revert "of/platform: Don't create device links for default busses" Revert "of/platform: Fix fn definitons for of_link_is_valid() and of_link_property()" Revert "of/platform: Fix device_links_supplier_sync_state_resume() warning" Revert "of/platform: Disable generic device linking code for PowerPC" devcoredump: fix typo in comment devcoredump: use memory_read_from_buffer of/platform: Disable generic device linking code for PowerPC device.h: Fix warnings for mismatched parameter names in comments ...
2019-09-17Merge tag 'pm-5.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These include a rework of the main suspend-to-idle code flow (related to the handling of spurious wakeups), a switch over of several users of cpufreq notifiers to QoS-based limits, a new devfreq driver for Tegra20, a new cpuidle driver and governor for virtualized guests, an extension of the wakeup sources framework to expose wakeup sources as device objects in sysfs, and more. Specifics: - Rework the main suspend-to-idle control flow to avoid repeating "noirq" device resume and suspend operations in case of spurious wakeups from the ACPI EC and decouple the ACPI EC wakeups support from the LPS0 _DSM support (Rafael Wysocki). - Extend the wakeup sources framework to expose wakeup sources as device objects in sysfs (Tri Vo, Stephen Boyd). - Expose system suspend statistics in sysfs (Kalesh Singh). - Introduce a new haltpoll cpuidle driver and a new matching governor for virtualized guests wanting to do guest-side polling in the idle loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell). - Fix the menu and teo cpuidle governors to allow the scheduler tick to be stopped if PM QoS is used to limit the CPU idle state exit latency in some cases (Rafael Wysocki). - Increase the resolution of the play_idle() argument to microseconds for more fine-grained injection of CPU idle cycles (Daniel Lezcano). - Switch over some users of cpuidle notifiers to the new QoS-based frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events (Viresh Kumar). - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li). - Add support for MT8183 and MT8516 to the mediatek cpufreq driver (Andrew-sh.Cheng, Fabien Parent). - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson Huang). - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz). - Update the qcom cpufreq driver (among other things, to make it easier to extend and to use kryo cpufreq for other nvmem-based SoCs) and add qcs404 support to it (Niklas Cassel, Douglas RAILLARD, Sibi Sankar, Sricharan R). - Fix assorted issues and make assorted minor improvements in the cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli, Gustavo Silva, Hariprasad Kelam). - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd Bergmann). - Add new Exynos PPMU events to devfreq events and extend that mechanism (Lukasz Luba). - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny). - Improve devfreq documentation and governor code, fix spelling typos in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez, MyungJoo Ham, Gaël PORTAY). - Add regulators enable and disable to the OPP (operating performance points) framework (Kamil Konieczny). - Update the OPP framework to support multiple opp-suspend properties (Anson Huang). - Fix assorted issues and make assorted minor improvements in the OPP code (Niklas Cassel, Viresh Kumar, Yue Hu). - Clean up the generic power domains (genpd) framework (Ulf Hansson). - Clean up assorted pieces of power management code and documentation (Akinobu Mita, Amit Kucheria, Chuhong Yuan). - Update the pm-graph tool to version 5.5 including multiple fixes and improvements (Todd Brandt). - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven, Sébastien Szymanski)" * tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits) cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available cpuidle-haltpoll: do not set an owner to allow modunload cpuidle-haltpoll: return -ENODEV on modinit failure cpuidle-haltpoll: set haltpoll as preferred governor cpuidle: allow governor switch on cpuidle_register_driver() PM: runtime: Documentation: add runtime_status ABI document pm-graph: make setVal unbuffered again for python2 and python3 powercap: idle_inject: Use higher resolution for idle injection cpuidle: play_idle: Increase the resolution to usec cpuidle-haltpoll: vcpu hotplug support cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist cpufreq: qcom: Add support for qcs404 on nvmem driver cpufreq: qcom: Refactor the driver to make it easier to extend cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain Documentation: cpufreq: Update policy notifier documentation cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state() PM / Domains: Simplify genpd_lookup_dev() ...
2019-09-17Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
2019-09-17Merge tag 'for-5.4/io_uring-2019-09-15' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: - Allocate SQ/CQ ring together, more efficient. Expose this through a feature flag as well, so we can reduce the number of mmaps by 1 (Hristo and me) - Fix for sequence logic with SQ thread (Jackie). - Add support for links with drain commands (Jackie). - Improved async merging (me) - Improved buffered async write performance (me) - Support SQ poll wakeup + event get in single io_uring_enter() (me) - Support larger SQ ring size. For epoll conversions, the 4k limit was too small for some prod workloads (Daniel). - put_user_page() usage (John) * tag 'for-5.4/io_uring-2019-09-15' of git://git.kernel.dk/linux-block: io_uring: increase IORING_MAX_ENTRIES to 32K io_uring: make sqpoll wakeup possible with getevents io_uring: extend async work merging io_uring: limit parallelism of buffered writes io_uring: add io_queue_async_work() helper io_uring: optimize submit_and_wait API io_uring: add support for link with drain io_uring: fix wrong sequence setting logic io_uring: expose single mmap capability io_uring: allocate the two rings together fs/io_uring.c: convert put_page() to put_user_page*()
2019-09-17Merge tag 'docs-5.4' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "It's a somewhat calmer cycle for docs this time, as the churn of the mass RST conversion is happily mostly behind us. - A new document on reproducible builds. - We finally got around to zapping the documentation for hardware support that was removed in 2004; one doesn't want to rush these things. - The usual assortment of fixes, typo corrections, etc" * tag 'docs-5.4' of git://git.lwn.net/linux: (67 commits) Documentation: kbuild: Add document about reproducible builds docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi] Documentation: Add "earlycon=sbi" to the admin guide doc:lock: remove reference to clever use of read-write lock devices.txt: improve entry for comedi (char major 98) docs: mtd: Update spi nor reference driver doc: arm64: fix grammar dtb placed in no attributes region Documentation: sysrq: don't recommend 'S' 'U' before 'B' mailmap: Update email address for Quentin Perret docs: ftrace: clarify when tracing is disabled by the trace file docs: process: fix broken link Documentation/arm/samsung-s3c24xx: Remove stray U+FEFF character to fix title Documentation/arm/sa1100/assabet: Fix 'make assabet_defconfig' command Documentation/arm/sa1100: Remove some obsolete documentation docs/zh_CN: update Chinese howto.rst for latexdocs making Documentation: virt: Fix broken reference to virt tree's index docs: Fix typo on pull requests guide kernel-doc: Allow anonymous enum Documentation: sphinx: Don't parse socket() as identifier reference Documentation: sphinx: Add missing comma to list of strings ...
2019-09-17Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-09-17Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep: (29 commits) ACPI: PM: s2idle: Always set up EC GPE for system wakeup ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily PM / wakeup: Unexport wakeup_source_sysfs_{add,remove}() PM / wakeup: Register wakeup class kobj after device is added PM / wakeup: Fix sysfs registration error path PM / wakeup: Show wakeup sources stats in sysfs PM / wakeup: Use wakeup_source_register() in wakelock.c PM / wakeup: Drop wakeup_source_init(), wakeup_source_prepare() PM: sleep: Replace strncmp() with str_has_prefix() PM: suspend: Fix platform_suspend_prepare_noirq() intel-hid: Disable button array during suspend-to-idle intel-hid: intel-vbtn: Avoid leaking wakeup_mode set ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message ACPI: EC: PM: Consolidate some code depending on PM_SLEEP ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events() ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter ACPI: PM: s2idle: Rearrange lps0_device_attach() PM/sleep: Expose suspend stats in sysfs ...
2019-09-15Revert "ext4: make __ext4_get_inode_loc plug"Linus Torvalds
This reverts commit b03755ad6f33b7b8cd7312a3596a2dbf496de6e7. This is sad, and done for all the wrong reasons. Because that commit is good, and does exactly what it says: avoids a lot of small disk requests for the inode table read-ahead. However, it turns out that it causes an entirely unrelated problem: the getrandom() system call was introduced back in 2014 by commit c6e9d6f38894 ("random: introduce getrandom(2) system call"), and people use it as a convenient source of good random numbers. But part of the current semantics for getrandom() is that it waits for the entropy pool to fill at least partially (unlike /dev/urandom). And at least ArchLinux apparently has a systemd that uses getrandom() at boot time, and the improvements in IO patterns means that existing installations suddenly start hanging, waiting for entropy that will never happen. It seems to be an unlucky combination of not _quite_ enough entropy, together with a particular systemd version and configuration. Lennart says that the systemd-random-seed process (which is what does this early access) is supposed to not block any other boot activity, but sadly that doesn't actually seem to be the case (possibly due bogus dependencies on cryptsetup for encrypted swapspace). The correct fix is to fix getrandom() to not block when it's not appropriate, but that fix is going to take a lot more discussion. Do we just make it act like /dev/urandom by default, and add a new flag for "wait for entropy"? Do we add a boot-time option? Or do we just limit the amount of time it will wait for entropy? So in the meantime, we do the revert to give us time to discuss the eventual fix for the fundamental problem, at which point we can re-apply the ext4 inode table access optimization. Reported-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-14io_uring: increase IORING_MAX_ENTRIES to 32KDaniel Xu
Some workloads can require far more than 4K oustanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-13Merge tag 'for-5.3-rc8-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Here are two fixes, one of them urgent fixing a bug introduced in 5.2 and reported by many users. It took time to identify the root cause, catching the 5.3 release is higly desired also to push the fix to 5.2 stable tree. The bug is a mess up of return values after adding proper error handling and honestly the kind of bug that can cause sleeping disorders until it's caught. My appologies to everybody who was affected. Summary of what could happen: 1) either a hang when committing a transaction, if this happens there's no risk of corruption, still the hang is very inconvenient and can't be resolved without a reboot 2) writeback for some btree nodes may never be started and we end up committing a transaction without noticing that, this is really serious and that will lead to the "parent transid verify failed" messages" * tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unwritten extent buffers and hangs on future writeback attempts Btrfs: fix assertion failure during fsync and use of stale transaction
2019-09-12io_uring: make sqpoll wakeup possible with geteventsJens Axboe
The way the logic is setup in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case. Reported-by: Lewis Baker <lbaker@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-12io_uring: extend async work mergingJens Axboe
We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s). Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-12Btrfs: fix unwritten extent buffers and hangs on future writeback attemptsFilipe Manana
The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo #2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka <zsojka@seznam.cz> Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar <drazen.kacar@oradian.com> Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-12Btrfs: fix assertion failure during fsync and use of stale transactionFilipe Manana
Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-11ovl: filter of trusted xattr results in auditMark Salyzyn
When filtering xattr list for reading, presence of trusted xattr results in a security audit log. However, if there is other content no errno will be set, and if there isn't, the errno will be -ENODATA and not -EPERM as is usually associated with a lack of capability. The check does not block the request to list the xattrs present. Switch to ns_capable_noaudit to reflect a more appropriate check. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: linux-security-module@vger.kernel.org Cc: kernel-team@android.com Cc: stable@vger.kernel.org # v3.18+ Fixes: a082c6f680da ("ovl: filter trusted xattr for non-admin") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-11ovl: Fix dereferencing possible ERR_PTR()Ding Xiang
if ovl_encode_real_fh() fails, no memory was allocated and the error in the error-valued pointer should be returned. Fixes: 9b6faee07470 ("ovl: check ERR_PTR() return value from ovl_encode_fh()") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Cc: <stable@vger.kernel.org> # v4.16+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10io_uring: limit parallelism of buffered writesJens Axboe
All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hamming this mutex. For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context. Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10io_uring: add io_queue_async_work() helperJens Axboe
Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-10io_uring: optimize submit_and_wait APIJens Axboe
For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async when we don't need it, as the application will be blocking for the completion event(s) anyway. Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09io_uring: add support for link with drainJackie Liu
To support the link with drain, we need to do two parts. There is an sqes: 0 1 2 3 4 5 6 +-----+-----+-----+-----+-----+-----+-----+ | N | L | L | L+D | N | N | N | +-----+-----+-----+-----+-----+-----+-----+ First, we need to ensure that the io before the link is completed, there is a easy way is set drain flag to the link list's head, so all subsequent io will be inserted into the defer_list. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Second, ensure that the following IO will not be completed first, an easy way is to create a mirror of drain io and insert it into defer_list, in this way, as long as drain io is not processed, the following io in the defer_list will not be actively process. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ ('3) | D | <== This is a shadow of (3) +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09io_uring: fix wrong sequence setting logicJackie Liu
Sqo_thread will get sqring in batches, which will cause ctx->cached_sq_head to be added in batches. if one of these sqes is set with the DRAIN flag, then he will never get a chance to process, and finally sqo_thread will not exit. Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-09btrfs: Relinquish CPUs in btrfs_compare_treesNikolay Borisov
When doing any form of incremental send the parent and the child trees need to be compared via btrfs_compare_trees. This can result in long loop chains without ever relinquishing the CPU. This causes softlockup detector to trigger when comparing trees with a lot of items. Example report: watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153] CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased) Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 40000005 (nZcv daif -PAN -UAO) pc : __ll_sc_arch_atomic_sub_return+0x14/0x20 lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs] sp : ffff00001273b7e0 Call trace: __ll_sc_arch_atomic_sub_return+0x14/0x20 release_extent_buffer+0xdc/0x120 [btrfs] free_extent_buffer.part.0+0xb0/0x118 [btrfs] free_extent_buffer+0x24/0x30 [btrfs] btrfs_release_path+0x4c/0xa0 [btrfs] btrfs_free_path.part.0+0x20/0x40 [btrfs] btrfs_free_path+0x24/0x30 [btrfs] get_inode_info+0xa8/0xf8 [btrfs] finish_inode_if_needed+0xe0/0x6d8 [btrfs] changed_cb+0x9c/0x410 [btrfs] btrfs_compare_trees+0x284/0x648 [btrfs] send_subvol+0x33c/0x520 [btrfs] btrfs_ioctl_send+0x8a0/0xaf0 [btrfs] btrfs_ioctl+0x199c/0x2288 [btrfs] do_vfs_ioctl+0x4b0/0x820 ksys_ioctl+0x84/0xb8 __arm64_sys_ioctl+0x28/0x38 el0_svc_common.constprop.0+0x7c/0x188 el0_svc_handler+0x34/0x90 el0_svc+0x8/0xc Fix this by adding a call to cond_resched at the beginning of the main loop in btrfs_compare_trees. Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: Don't assign retval of ↵Nikolay Borisov
btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic Those function are simple boolean predicates there is no need to assign their return values to interim variables. Use them directly as predicates. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: create structure to encode checksum type and lengthJohannes Thumshirn
Create a structure to encode the type and length for the known on-disk checksums. This makes it easier to add new checksums later. The structure and helpers are moved from ctree.h so they don't occupy space in all headers including ctree.h. This save some space in the final object. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: add enospc debug messages for ticket failureJosef Bacik
When debugging weird enospc problems it's handy to be able to dump the space info when we wake up all tickets, and see what the ticket values are. This helped me figure out cases where we were enospc'ing when we shouldn't have been. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: do not account global reserve in can_overcommitJosef Bacik
We ran into a problem in production where a box with plenty of space was getting wedged doing ENOSPC flushing. These boxes only had 20% of the disk allocated, but their metadata space + global reserve was right at the size of their metadata chunk. In this case can_overcommit should be allowing allocations without problem, but there's logic in can_overcommit that doesn't allow us to overcommit if there's not enough real space to satisfy the global reserve. This is for historical reasons. Before there were only certain places we could allocate chunks. We could go to commit the transaction and not have enough space for our pending delayed refs and such and be unable to allocate a new chunk. This would result in a abort because of ENOSPC. This code was added to solve this problem. However since then we've gained the ability to always be able to allocate a chunk. So we can easily overcommit in these cases without risking a transaction abort because of ENOSPC. Also prior to now the global reserve really would be used because that's the space we relied on for delayed refs. With delayed refs being tracked separately we no longer have to worry about running out of delayed refs space while committing. We are much less likely to exhaust our global reserve space during transaction commit. Fix the can_overcommit code to simply see if our current usage + what we want is less than our current free space plus whatever slack space we have in the disk is. This solves the problem we were seeing in production and keeps us from flushing as aggressively as we approach our actual metadata size usage. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: use btrfs_try_granting_tickets in update_global_rsvJosef Bacik
We have some annoying xfstests tests that will create a very small fs, fill it up, delete it, and repeat to make sure everything works right. This trips btrfs up sometimes because we may commit a transaction to free space, but most of the free metadata space was being reserved by the global reserve. So we commit and update the global reserve, but the space is simply added to bytes_may_use directly, instead of trying to add it to existing tickets. This results in ENOSPC when we really did have space. Fix this by calling btrfs_try_granting_tickets once we add back our excess space to wake any pending tickets. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: always reserve our entire size for the global reserveJosef Bacik
While messing with the overcommit logic I noticed that sometimes we'd ENOSPC out when really we should have run out of space much earlier. It turns out it's because we'll only reserve up to the free amount left in the space info for the global reserve, but that doesn't make sense with overcommit because we could be well above our actual size. This results in the global reserve not carving out it's entire reservation, and thus not putting enough pressure on the rest of the infrastructure to do the right thing and ENOSPC out at a convenient time. Fix this by always taking our full reservation amount for the global reserve. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: change the minimum global reserve sizeJosef Bacik
It made sense to have the global reserve set at 16M in the past, but since it is used less nowadays set the minimum size to the number of items we'll need to update the main trees we update during a transaction commit, plus some slop area so we can do unlinks if we need to. In practice this doesn't affect normal file systems, but for xfstests where we do things like fill up a fs and then rm * it can fall over in weird ways. This enables us for more sane behavior at extremely small file system sizes. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: rename btrfs_space_info_add_old_bytesJosef Bacik
This name doesn't really fit with how the space reservation stuff works now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what the function is doing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: remove orig_bytes from reserve_ticketJosef Bacik
Now that we do not do partial filling of tickets simply remove orig_bytes, it is no longer needed. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: fix may_commit_transaction to deal with no partial fillingJosef Bacik
Now that we aren't partially filling tickets we may have some slack space left in the space_info. We need to account for this in may_commit_transaction, otherwise we may choose to not commit the transaction despite it actually having enough space to satisfy our ticket. Calculate the free space we have in the space_info, if any, and subtract this from the ticket we have and use that amount to determine if we will need to commit to reclaim enough space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: rework wake_all_ticketsJosef Bacik
Now that we no longer partially fill tickets we need to rework wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see if any subsequent tickets are able to be satisfied. If our tickets_id changes we know something happened and we can keep flushing. Also if we find a ticket that is smaller than the first ticket in our queue then we want to retry the flushing loop again in case may_commit_transaction() decides we could satisfy the ticket by committing the transaction. Rename this to maybe_fail_all_tickets() while we're at it, to better reflect what the function is actually doing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: refactor the ticket wakeup codeJosef Bacik
Now that btrfs_space_info_add_old_bytes simply checks if we can make the reservation and updates bytes_may_use, there's no reason to have both helpers in place. Factor out the ticket wakeup logic into it's own helper, make btrfs_space_info_add_old_bytes() update bytes_may_use and then call the wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes() with the wakeup helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: stop partially refilling tickets when releasing spaceJosef Bacik
btrfs_space_info_add_old_bytes is used when adding the extra space from an existing reservation back into the space_info to be used by any waiting tickets. In order to keep us from overcommitting we check to make sure that we can still use this space for our reserve ticket, and if we cannot we'll simply subtract it from space_info->bytes_may_use. However this is problematic, because it assumes that only changes to bytes_may_use would affect our ability to make reservations. Any changes to bytes_reserved would be missed. If we were unable to make a reservation prior because of reserved space, but that reserved space was free'd due to unlink or truncate and we were allowed to immediately reclaim that metadata space we would still ENOSPC. Consider the example where we create a file with a bunch of extents, using up 2MiB of actual space for the new tree blocks. Then we try to make a reservation of 2MiB but we do not have enough space to make this reservation. The iput() occurs in another thread and we remove this space, and since we did not write the blocks we simply do space_info->bytes_reserved -= 2MiB. We would never see this because we do not check our space info used, we just try to re-use the freed reservations. To fix this problem, and to greatly simplify the wakeup code, do away with this partial refilling nonsense. Use btrfs_space_info_add_old_bytes to subtract the reservation from space_info->bytes_may_use, and then check the ticket against the total used of the space_info the same way we do with the initial reservation attempt. This keeps the reservation logic consistent and solves the problem of early ENOSPC in the case that we free up space in places other than bytes_may_use and bytes_pinned. Thanks, Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: add space reservation tracepoint for reserved bytesJosef Bacik
I noticed when folding the trace_btrfs_space_reservation() tracepoint into the btrfs_space_info_update_* helpers that we didn't emit a tracepoint when doing btrfs_add_reserved_bytes(). I know this is because we were swapping bytes_may_use for bytes_reserved, so in my mind there was no reason to have the tracepoint there. But now there is because we always emit the unreserve for the bytes_may_use side, and this would have broken if compression was on anyway. Add a tracepoint to cover the bytes_reserved counter so the math still comes out right. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: roll tracepoint into btrfs_space_info_update helperJosef Bacik
We duplicate this tracepoint everywhere we call these helpers, so update the helper to have the tracepoint as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>