summaryrefslogtreecommitdiff
path: root/include/trace
AgeCommit message (Collapse)Author
2023-04-26Merge tag 'dlm-6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Remove some unused features (related to lock timeouts) that have been previously scheduled for removal - Fix a bug where the pending callback flag would be incorrectly cleared, which could potentially result in missing a completion callback - Use an unbound workqueue for dlm socket handling so that socket operations can be processed with less delay - Fix possible lockspace join connection errors with large clusters (e.g. over 16 nodes) caused by a small socket backlog setting - Use atomic bit ops for internal flags to help avoid mistakes copying flag values from messages - Fix recently introduced bug where memory for lvb data could be unnecessarily allocated for a lock * tag 'dlm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: stop unnecessarily filling zero ms_extra bytes fs: dlm: switch lkb_sbflags to atomic ops fs: dlm: rsb hash table flag value to atomic ops fs: dlm: move internal flags to atomic ops fs: dlm: change dflags to use atomic bits fs: dlm: store lkb distributed flags into own value fs: dlm: remove DLM_IFL_LOCAL_MS flag fs: dlm: rename stub to local message flag fs: dlm: remove deprecated code parts DLM: increase socket backlog to avoid hangs with 16 nodes fs: dlm: add unbound flag to dlm_io workqueue fs: dlm: fix DLM_IFL_CB_PENDING gets overwritten
2023-04-26Merge tag 'for-6.4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Mostly core changes and cleanups, some notable fixes and two performance improvements in directory logging. The IO path cleanups are removing or refactoring old code, scrub main loop has been completely rewritten also refactoring old code. There are some changes to non-btrfs code, mostly trivial, the cgroup punt bio logic is only moved from generic code. Performance improvements: - improve logging changes in a directory during one transaction, avoid iterating over items and reduce lock contention (fsync time 4x lower) - when logging directory entries during one transaction, reduce locking of subvolume trees by checking tree-log instead (improvement in throughput and latency for concurrent access to a subvolume) Notable fixes: - dev-replace: - properly honor read mode when requested to avoid reading from source device - target device won't be used for eventual read repair, this is unreliable for NODATASUM files - when there are unpaired (and unrepairable) metadata during replace, exit early with error and don't try to finish whole operation - scrub ioctl properly rejects unknown flags - fix global block reserve calculations - fix partial direct io write when there's a page fault in the middle, iomap will try to continue with partial request but the btrfs part did not match that, this can lead to zeros written instead of data Core changes: - io path: - continued cleanups and refactoring around bio handling - extent io submit path simplifications and cleanups - flush write path simplifications and cleanups - rework logic of passing sync mode of bio, with further cleanups - rewrite scrub code flow, restructure how the stripes are enumerated and verified in a more unified way - allow to set lower threshold for block group reclaim in debug mode to aid zoned mode testing - remove obsolete time-based delayed ref throttling logic when truncating items - DREW locks are not using percpu variables anymore - more warning fixes (-Wmaybe-uninitialized) - u64 division simplifications - error handling improvements Non-btrfs code changes: - push cgroup punt bio logic to btrfs code (there was no other user of that), the functionality can be now selected separately by BLK_CGROUP_PUNT_BIO - crc32c_impl removed after removing last uses in btrfs code - add btrfs_assertfail() to objtool table" * tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: mark btrfs_assertfail() __noreturn btrfs: fix uninitialized variable warnings btrfs: use log root when iterating over index keys when logging directory btrfs: avoid iterating over all indexes when logging directory btrfs: dev-replace: error out if we have unrepaired metadata error during btrfs: remove pointless loop at btrfs_get_next_valid_item() btrfs: scrub: reject unsupported scrub flags btrfs: reinterpret async discard iops_limit=0 as no delay btrfs: set default discard iops_limit to 1000 btrfs: remove unused raid56 functions which were dedicated for scrub btrfs: scrub: remove scrub_bio structure btrfs: scrub: remove scrub_block and scrub_sector structures btrfs: scrub: remove the old scrub recheck code btrfs: scrub: remove the old writeback infrastructure btrfs: scrub: remove scrub_parity structure btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure btrfs: scrub: introduce helper to queue a stripe for scrub btrfs: scrub: introduce error reporting functionality for scrub_stripe btrfs: scrub: introduce a writeback helper for scrub_stripe ...
2023-04-26Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "There are a number of major cleanups in ext4 this cycle: - The data=journal writepath has been significantly cleaned up and simplified, and reduces a large number of data=journal special cases by Jan Kara. - Ojaswin Muhoo has replaced linked list used to track extents that have been used for inode preallocation with a red-black tree in the multi-block allocator. This improves performance for workloads which do a large number of random allocating writes. - Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the multi-block allocator. - Matthew wilcox has converted the code paths for reading and writing ext4 pages to use folios. - Jason Yan has continued to factor out ext4_fill_super() into smaller functions for improve ease of maintenance and comprehension. - Josh Triplett has created an uapi header for ext4 userspace API's" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits) ext4: Add a uapi header for ext4 userspace APIs ext4: remove useless conditional branch code ext4: remove unneeded check of nr_to_submit ext4: move dax and encrypt checking into ext4_check_feature_compatibility() ext4: factor out ext4_block_group_meta_init() ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry() ext4: rename two functions with 'check' ext4: factor out ext4_flex_groups_free() ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy() ext4: factor out ext4_hash_info_init() Revert "ext4: Fix warnings when freezing filesystem with journaled data" ext4: Update comment in mpage_prepare_extent_to_map() ext4: Simplify handling of journalled data in ext4_bmap() ext4: Drop special handling of journalled data from ext4_quota_on() ext4: Drop special handling of journalled data from ext4_evict_inode() ext4: Fix special handling of journalled data from extent zeroing ext4: Drop special handling of journalled data from extent shifting operations ext4: Drop special handling of journalled data from ext4_sync_file() ext4: Commit transaction before writing back pages in data=journal mode ...
2023-04-26NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()Chuck Lever
There have been several bugs over the years where the NFSD splice actor has attempted to write outside the rq_pages array. This is a "should never happen" condition, but if for some reason the pipe splice actor should attempt to walk past the end of rq_pages, it needs to terminate the READ operation to prevent corruption of the pointer addresses in the fields just beyond the array. A server crash is thus prevented. Since the code is not behaving, the READ operation returns -EIO to the client. None of the READ payload data can be trusted if the splice actor isn't operating as expected. Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2023-04-25Merge tag 'thermal-6.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control updates from Rafael Wysocki: "These mostly continue to prepare the thermal control subsystem for using unified representation of trip points, which includes cleanups, code refactoring and similar and update several drivers (for other reasons), which includes new hardware support. Specifics: - Add a thermal zone 'devdata' accessor and modify several drivers to use it (Daniel Lezcano) - Prevent drivers from using the 'device' internal thermal zone structure field directly (Daniel Lezcano) - Clean up the hwmon thermal driver (Daniel Lezcano) - Add thermal zone id accessor and thermal zone type accessor and prevent drivers from using thermal zone fields directly (Daniel Lezcano) - Clean up the acerhdf and tegra thermal drivers (Daniel Lezcano) - Add lower bound check for sysfs input to the x86_pkg_temp_thermal Intel thermal driver (Zhang Rui) - Add more thermal zone device encapsulation: prevent setting structure field directly, access the sensor device instead the thermal zone's device for trace, relocate the traces in drivers/thermal (Daniel Lezcano) - Use the generic trip point for the i.MX and remove the get_trip_temp ops (Daniel Lezcano) - Use the devm_platform_ioremap_resource() in the Hisilicon driver (Yang Li) - Remove R-Car H3 ES1.* handling as public has only access to the ES2 version and the upstream support for the ES1 has been shutdown (Wolfram Sang) - Add a delay after initializing the bank in order to let the time to the hardware to initialze itself before reading the temperature (Amjad Ouled-Ameur) - Add MT8365 support (Amjad Ouled-Ameur) - Preparational cleanup and DT bindings for RK3588 support (Sebastian Reichel) - Add driver support for RK3588 (Finley Xiao) - Use devm_reset_control_array_get_exclusive() for the Rockchip driver (Ye Xingchen) - Detect power gated thermal zones and return -EAGAIN when reading the temperature (Mikko Perttunen) - Remove thermal_bind_params structure as it is unused (Zhang Rui) - Drop unneeded quotes in DT bindings allowing to run yamllint (Rob Herring) - Update the power allocator documentation according to the thermal trace relocation (Lukas Bulwahn) - Fix sensor 1 interrupt status bitmask for the Mediatek LVTS sensor (Chen-Yu Tsai) - Use the dev_err_probe() helper in the Amlogic driver (Ye Xingchen) - Add AP domain support to LVTS thermal controllers for mt8195 (Balsam CHIHI) - Remove buggy call to thermal_of_zone_unregister() (Daniel Lezcano) - Make thermal_of_zone_[un]register() private to the thermal OF code (Daniel Lezcano) - Create a private copy of the thermal zone device parameters structure when registering a thermal zone (Daniel Lezcano) - Fix a kernel NULL pointer dereference in thermal_hwmon (Zhang Rui) - Revert recent message adjustment in thermal_hwmon (Rafael Wysocki) - Use of_property_present() for testing DT property presence in thermal control code (Rob Herring) - Clean up thermal_list_lock locking in the thermal core (Rafael Wysocki) - Add DLVR support for RFIM control in the int340x Intel thermal driver (Srinivas Pandruvada)" * tag 'thermal-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (55 commits) thermal: intel: int340x: Add DLVR support for RFIM control thermal/core: Alloc-copy-free the thermal zone parameters structure thermal/of: Unexport unused OF functions thermal/drivers/bcm2835: Remove buggy call to thermal_of_zone_unregister thermal/drivers/mediatek/lvts_thermal: Add AP domain for mt8195 dt-bindings: thermal: mediatek: Add AP domain to LVTS thermal controllers for mt8195 thermal: amlogic: Use dev_err_probe() thermal/drivers/mediatek/lvts_thermal: Fix sensor 1 interrupt status bitmask MAINTAINERS: adjust entry in THERMAL/POWER_ALLOCATOR after header movement dt-bindings: thermal: Drop unneeded quotes thermal/core: Remove thermal_bind_params structure thermal/drivers/tegra-bpmp: Handle offline zones thermal/drivers/rockchip: use devm_reset_control_array_get_exclusive() dt-bindings: rockchip-thermal: Support the RK3588 SoC compatible thermal/drivers/rockchip: Support RK3588 SoC in the thermal driver thermal/drivers/rockchip: Support dynamic sized sensor array thermal/drivers/rockchip: Simplify channel id logic thermal/drivers/rockchip: Use dev_err_probe thermal/drivers/rockchip: Simplify clock logic thermal/drivers/rockchip: Simplify getting match data ...
2023-04-25Merge tag 'irq-core-2023-04-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt updates from Thomas Gleixner: "Core: - Add tracepoints for tasklet callbacks which makes it possible to analyze individual tasklet functions instead of guess working from the overall duration of tasklet processing - Ensure that secondary interrupt threads have their affinity adjusted correctly Drivers: - A large rework of the RISC-V IPI management to prepare for a new RISC-V interrupt architecture - Small fixes and enhancements all over the place - Removal of support for various obsolete hardware platforms and the related code" * tag 'irq-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) irqchip/st: Remove stih415/stih416 and stid127 platforms support irqchip/gic-v3: Add Rockchip 3588001 erratum workaround genirq: Update affinity of secondary threads softirq: Add trace points for tasklet entry/exit irqchip/loongson-pch-pic: Fix pch_pic_acpi_init calling irqchip/loongson-pch-pic: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix incorrect use of acpi_get_vec_parent irqchip/loongson-eiointc: Fix returned value on parsing MADT irqchip/riscv-intc: Add empty irq_eoi() for chained irq handlers RISC-V: Use IPIs for remote icache flush when possible RISC-V: Use IPIs for remote TLB flush when possible RISC-V: Allow marking IPIs as suitable for remote FENCEs RISC-V: Treat IPIs as normal Linux IRQs irqchip/riscv-intc: Allow drivers to directly discover INTC hwnode RISC-V: Clear SIP bit only when using SBI IPI operations irqchip/irq-sifive-plic: Add syscore callbacks for hibernation irqchip: Use of_property_read_bool() for boolean properties irqchip/bcm-6345-l1: Request memory region irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 ...
2023-04-24Merge tag 'erofs-for-6.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, sub-page block support for uncompressed files is available. It's mainly used to enable original signing ('golden') 4k-block images on arm64 with 16/64k pages. In addition, end users could also use this feature to build a manifest to directly refer to golden tar data. Besides, long xattr name prefix support is also introduced in this cycle to avoid too many xattrs with the same prefix (e.g. overlayfs xattrs). It's useful for erofs + overlayfs combination (like Composefs model): the image size is reduced by ~14% and runtime performance is also slightly improved. Others are random fixes and cleanups as usual. Summary: - Add sub-page block size support for uncompressed files - Support flattened block device for multi-blob images to be attached into virtual machines (including cloud servers) and bare metals - Support long xattr name prefixes to optimize images with common xattr namespaces (e.g. files with overlayfs xattrs) use cases - Various minor cleanups & fixes" * tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: cleanup i_format-related stuffs erofs: sunset erofs_dbg() erofs: fix potential overflow calculating xattr_isize erofs: get rid of z_erofs_fill_inode() erofs: enable long extended attribute name prefixes erofs: handle long xattr name prefixes properly erofs: add helpers to load long xattr name prefixes erofs: introduce on-disk format for long xattr name prefixes erofs: move packed inode out of the compression part erofs: keep meta inode into erofs_buf erofs: initialize packed inode after root inode is assigned erofs: stop parsing non-compact HEAD index if clusterofs is invalid erofs: don't warn ztailpacking feature anymore erofs: simplify erofs_xattr_generic_get() erofs: rename init_inode_xattrs with erofs_ prefix erofs: move several xattr helpers into xattr.c erofs: tidy up EROFS on-disk naming erofs: support flattened block device for multi-blob images erofs: set block size to the on-disk block size erofs: avoid hardcoded blocksize for subpage block support
2023-04-24Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
2023-04-19net/handshake: Create a NETLINK service for handling handshake requestsChuck Lever
When a kernel consumer needs a transport layer security session, it first needs a handshake to negotiate and establish a session. This negotiation can be done in user space via one of the several existing library implementations, or it can be done in the kernel. No in-kernel handshake implementations yet exist. In their absence, we add a netlink service that can: a. Notify a user space daemon that a handshake is needed. b. Once notified, the daemon calls the kernel back via this netlink service to get the handshake parameters, including an open socket on which to establish the session. c. Once the handshake is complete, the daemon reports the session status and other information via a second netlink operation. This operation marks that it is safe for the kernel to use the open socket and the security session established there. The notification service uses a multicast group. Each handshake mechanism (eg, tlshd) adopts its own group number so that the handshake services are completely independent of one another. The kernel can then tell via netlink_has_listeners() whether a handshake service is active and prepared to handle a handshake request. A new netlink operation, ACCEPT, acts like accept(2) in that it instantiates a file descriptor in the user space daemon's fd table. If this operation is successful, the reply carries the fd number, which can be treated as an open and ready file descriptor. While user space is performing the handshake, the kernel keeps its muddy paws off the open socket. A second new netlink operation, DONE, indicates that the user space daemon is finished with the socket and it is safe for the kernel to use again. The operation also indicates whether a session was established successfully. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18mm/khugepaged: skip shmem with userfaultfdDavid Stevens
Make sure that collapse_file respects any userfaultfds registered with MODE_MISSING. If userspace has any such userfaultfds registered, then for any page which it knows to be missing, it may expect a UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when collapsing a shmem range would result in replacing an empty page with a THP, to avoid breaking userfaultfd. Synchronization when checking for userfaultfds in collapse_file is tricky because the mmap locks can't be used to prevent races with the registration of new userfaultfds. Instead, we provide synchronization by ensuring that userspace cannot observe the fact that pages are missing before we check for userfaultfds. Although this allows registration of a userfaultfd to race with collapse_file, it ensures that userspace cannot observe any pages transition from missing to present after such a race occurs. This makes such a race indistinguishable to the collapse occurring immediately before the userfaultfd registration. The first step to provide this synchronization is to stop filling gaps during the loop iterating over the target range, since the page cache lock can be dropped during that loop. The second step is to fill the gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final time, to avoid races with accesses to the page cache that only take the RCU read lock. The fact that we don't fill holes during the initial iteration means that collapse_file now has to handle faults occurring during the collapse. This is done by re-validating the number of missing pages after acquiring the page cache lock for the final time. This fix is targeted at khugepaged, but the change also applies to MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now return EBUSY if there are any missing pages (instead of succeeding on shmem and returning EINVAL on anonymous memory). There is also now a window during MADV_COLLAPSE where a fault on a missing page will cause the syscall to fail with EAGAIN. The fact that intermediate page cache state can no longer be observed before the rollback of a failed collapse is also technically a userspace-visible change (via at least SEEK_DATA and SEEK_END), but it is exceedingly unlikely that anything relies on being able to observe that transient state. Link: https://lkml.kernel.org/r/20230404120117.2562166-4-stevensd@google.com Signed-off-by: David Stevens <stevensd@chromium.org> Acked-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm/khugepaged: recover from poisoned anonymous memoryJiaqi Yan
Problem ======= Memory DIMMs are subject to multi-bit flips, i.e. memory errors. As memory size and density increase, the chances of and number of memory errors increase. The increasing size and density of server RAM in the data center and cloud have shown increased uncorrectable memory errors. There are already mechanisms in the kernel to recover from uncorrectable memory errors. This series of patches provides the recovery mechanism for the particular kernel agent khugepaged when it collapses memory pages. Impact ====== The main reason we chose to make khugepaged collapsing tolerant of memory failures was its high possibility of accessing poisoned memory while performing functionally optional compaction actions. Standard applications typically don't have strict requirements on the size of its pages. So they are given 4K pages by the kernel. The kernel is able to improve application performance by either 1) giving applications 2M pages to begin with, or 2) collapsing 4K pages into 2M pages when possible. This collapsing operation is done by khugepaged, a kernel agent that is constantly scanning memory. When collapsing 4K pages into a 2M page, it must copy the data from the 4K pages into a physically contiguous 2M page. Therefore, as long as there exists one poisoned cache line in collapsible 4K pages, khugepaged will eventually access it. The current impact to users is a machine check exception triggered kernel panic. However, khugepaged’s compaction operations are not functionally required kernel actions. Therefore making khugepaged tolerant to poisoned memory will greatly improve user experience. This patch series is for cases where khugepaged is the first guy that detects the memory errors on the poisoned pages. IOW, the pages are not known to have memory errors when khugepaged collapsing gets to them. In our observation, this happens frequently when the huge page ratio of the system is relatively low, which is fairly common in virtual machines running on cloud. Solution ======== As stated before, it is less desirable to crash the system only because khugepaged accesses poisoned pages while it is collapsing 4K pages. The high level idea of this patch series is to skip the group of pages (usually 512 4K-size pages) once khugepaged finds one of them is poisoned, as these pages have become ineligible to be collapsed. We are also careful to unwind operations khuagepaged has performed before it detects memory failures. For example, before copying and collapsing a group of anonymous pages into a huge page, the source pages will be isolated and their page table is unlinked from their PMD. These operations need to be undone in order to ensure these pages are not changed/lost from the perspective of other threads (both user and kernel space). As for file backed memory pages, there already exists a rollback case. This patch just extends it so that khugepaged also correctly rolls back when it fails to copy poisoned 4K pages. This patch (of 3): Make __collapse_huge_page_copy return whether copying anonymous pages succeeded, and make collapse_huge_page handle the return status. Break existing PTE scan loop into two for-loops. The first loop copies source pages into target huge page, and can fail gracefully when running into memory errors in source pages. If copying all pages succeeds, the second loop releases and clears up these normal pages. Otherwise, the second loop rolls back the page table and page states by: - re-establishing the original PTEs-to-PMD connection. - releasing source pages back to their LRU list. Tested manually: 0. Enable khugepaged on system under test. 1. Start a two-thread application. Each thread allocates a chunk of non-huge anonymous memory buffer. 2. Pick 4 random buffer locations (2 in each thread) and inject uncorrectable memory errors at corresponding physical addresses. 3. Signal both threads to make their memory buffer collapsible, i.e. calling madvise(MADV_HUGEPAGE). 4. Wait and check kernel log: khugepaged is able to recover from poisoned pages and skips collapsing them. 5. Signal both threads to inspect their buffer contents and make sure no data corruption. Link: https://lkml.kernel.org/r/20230329151121.949896-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20230329151121.949896-2-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Cc: David Stevens <stevensd@chromium.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tong Tiangen <tongtiangen@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()Ivan Orlov
Syzkaller reported the following issue: kernel BUG at mm/khugepaged.c:1823! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023 RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline] RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233 Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89 RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093 RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80 RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1 RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3 R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000 R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000 FS: 00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693 madvise_vma_behavior mm/madvise.c:1086 [inline] madvise_walk_vmas mm/madvise.c:1260 [inline] do_madvise+0x9e5/0x4680 mm/madvise.c:1439 __do_sys_madvise mm/madvise.c:1452 [inline] __se_sys_madvise mm/madvise.c:1450 [inline] __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The xas_store() call during page cache scanning can potentially translate 'xas' into the error state (with the reproducer provided by the syzkaller the error code is -ENOMEM). However, there are no further checks after the 'xas_store', and the next call of 'xas_next' at the start of the scanning cycle doesn't increase the xa_index, and the issue occurs. This patch will add the xarray state error checking after the xas_store() and the corresponding result error code. Tested via syzbot. [akpm@linux-foundation.org: update include/trace/events/huge_memory.h's SCAN_STATUS] Link: https://lkml.kernel.org/r/20230329145330.23191-1-ivan.orlov0322@gmail.com Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Reported-by: syzbot+9578faa5475acb35fa50@syzkaller.appspotmail.com Tested-by: Zach O'Keefe <zokeefe@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Himadri Pandya <himadrispandya@gmail.com> Cc: Ivan Orlov <ivan.orlov0322@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-17btrfs: replace btrfs_io_context::raid_map with a fixed u64 valueQu Wenruo
In btrfs_io_context structure, we have a pointer raid_map, which indicates the logical bytenr for each stripe. But considering we always call sort_parity_stripes(), the result raid_map[] is always sorted, thus raid_map[0] is always the logical bytenr of the full stripe. So why we waste the space and time (for sorting) for raid_map? This patch will replace btrfs_io_context::raid_map with a single u64 number, full_stripe_start, by: - Replace btrfs_io_context::raid_map with full_stripe_start - Replace call sites using raid_map[0] to use full_stripe_start - Replace call sites using raid_map[i] to compare with nr_data_stripes. The benefits are: - Less memory wasted on raid_map It's sizeof(u64) * num_stripes vs sizeof(u64). It'll always save at least one u64, and the benefit grows larger with num_stripes. - No more weird alloc_btrfs_io_context() behavior As there is only one fixed size + one variable length array. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17erofs: avoid hardcoded blocksize for subpage block supportJingbo Xu
As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-04-15softirq: Add trace points for tasklet entry/exitLingutla Chandrasekhar
Tasklets are supposed to finish their work quickly and should not block the current running process, but it is not guaranteed that they do so. Currently softirq_entry/exit can be used to analyse the total tasklets execution time, but that's not helpful to track individual tasklets execution time. That makes it hard to identify tasklet functions, which take more time than expected. Add tasklet_entry/exit trace point support to track individual tasklet execution. Trivial usage example: # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_entry/enable # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_exit/enable # cat /sys/kernel/debug/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:4 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <idle>-0 [003] ..s1. 314.011428: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.011432: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.017369: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.017371: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> Signed-off-by: J. Avila <elavila@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230407230526.1685443-1-jstultz@google.com [elavila: Port to android-mainline] [jstultz: Rebased to upstream, cut unused trace points, added comments for the tracepoints, reworded commit]
2023-04-14Merge back general thermal control changes for 6.4-rc1.Rafael J. Wysocki
2023-04-08notifiers: add tracepoints to the notifiers infrastructureGuilherme G. Piccoli
Currently there is no way to show the callback names for registered, unregistered or executed notifiers. This is very useful for debug purposes, hence add this functionality here in the form of notifiers' tracepoints, one per operation. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20230314200058.1326909-1-gpiccoli@igalia.com Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: Baoquan He <bhe@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com> Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Guilherme G. Piccoli <kernel@gpiccoli.net> Cc: Petr Mladek <pmladek@suse.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/google/gve/gve.h 3ce934558097 ("gve: Secure enough bytes in the first TX desc for all TCP pkts") 75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format") https://lore.kernel.org/all/20230406104927.45d176f5@canb.auug.org.au/ https://lore.kernel.org/all/c5872985-1a95-0bc8-9dcc-b6f23b439e9d@tessares.net/ Adjacent changes: net/can/isotp.c 051737439eae ("can: isotp: fix race between isotp_sendsmg() and isotp_release()") 96d1c81e6a04 ("can: isotp: add module parameter for maximum pdu size") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-05trace: cma: remove unnecessary event class cma_alloc_classWenchao Hao
After commit cb6c33d4dc09 ("cma: tracing: print alloc result in trace_cma_alloc_finish"), cma_alloc_class has only one event which is cma_alloc_busy_retry. So we can remove the cma_alloc_class. Link: https://lkml.kernel.org/r/20230323114136.177677-1-haowenchao2@huawei.com Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Hongxiang Lou <louhongxiang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency checkZqiang
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling RCU expedited grace periods to actually force-enable scheduling-clock interrupts on holdout CPUs. Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS") Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05rcu/trace: use strscpy() to instead of strncpy()Xu Panda
This commit saves a line of code by switching from strncpy() to strscpy() by permitting the later NUL assignment to be removed. While in the area, save another line by taking advantage of 100 characters. Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-04net: qrtr: correct types of trace event parametersSimon Horman
The arguments passed to the trace events are of type unsigned int, however the signature of the events used __le32 parameters. I may be missing the point here, but sparse flagged this and it does seem incorrect to me. net/qrtr/ns.c: note: in included file (through include/trace/trace_events.h, include/trace/define_trace.h, include/trace/events/qrtr.h): ./include/trace/events/qrtr.h:11:1: warning: cast to restricted __le32 ./include/trace/events/qrtr.h:11:1: warning: restricted __le32 degrades to integer ./include/trace/events/qrtr.h:11:1: warning: restricted __le32 degrades to integer ... (a lot more similar warnings) net/qrtr/ns.c:115:47: expected restricted __le32 [usertype] service net/qrtr/ns.c:115:47: got unsigned int service net/qrtr/ns.c:115:61: warning: incorrect type in argument 2 (different base types) ... (a lot more similar warnings) Fixes: dfddb54043f0 ("net: qrtr: Add tracepoint support") Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230402-qrtr-trace-types-v1-1-92ad55008dd3@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-03Merge tag 'thermal-v6.4-rc1-1' of ↵Rafael J. Wysocki
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux Pull thermal control material for 6.4-rc1 from Daniel Lezcano: "- Add more thermal zone device encapsulation: prevent setting structure field directly, access the sensor device instead the thermal zone's device for trace, relocate the traces in drivers/thermal (Daniel Lezcano) - Use the generic trip point for the i.MX and remove the get_trip_temp ops (Daniel Lezcano) - Use the devm_platform_ioremap_resource() in the Hisilicon driver (Yang Li) - Remove R-Car H3 ES1.* handling as public has only access to the ES2 version and the upstream support for the ES1 has been shutdown (Wolfram Sang) - Add a delay after initializing the bank in order to let the time to the hardware to initialze itself before reading the temperature (Amjad Ouled-Ameur) - Add MT8365 support (Amjad Ouled-Ameur)" * tag 'thermal-v6.4-rc1-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux: thermal/drivers/ti: Use fixed update interval thermal/drivers/stm: Don't set no_hwmon to false thermal/drivers/db8500: Use driver dev instead of tz->device thermal/core: Relocate the traces definition in thermal directory thermal/drivers/hisi: Use devm_platform_ioremap_resource() thermal/drivers/imx: Use the thermal framework for the trip point thermal/drivers/imx: Remove get_trip_temp ops thermal/drivers/rcar_gen3_thermal: Remove R-Car H3 ES1.* handling thermal/drivers/mediatek: Add delay after thermal banks initialization thermal/drivers/mediatek: Add support for MT8365 SoC thermal/drivers/mediatek: Control buffer enablement tweaks dt-bindings: thermal: mediatek: Add binding documentation for MT8365 SoC
2023-04-03tracing: Error if a trace event has an array for a __field()Steven Rostedt (Google)
A __field() in the TRACE_EVENT() macro is used to set up the fields of the trace event data. It is for single storage units (word, char, int, pointer, etc) and not for complex structures or arrays. Unfortunately, there's nothing preventing the build from accepting: __field(int, arr[5]); from building. It will turn into a array value. This use to work fine, as the offset and size use to be determined by the macro using the field name, but things have changed and the offset and size are now determined by the type. So the above would only be size 4, and the next field will be located 4 bytes from it (instead of 20). The proper way to declare static arrays is to use the __array() macro. Instead of __field(int, arr[5]) it should be __array(int, arr, 5). Add some macro tricks to the building of a trace event from the TRACE_EVENT() macro such that __field(int, arr[5]) will fail to build. A comment by the failure will explain why the build failed. Link: https://lore.kernel.org/lkml/20230306122549.236561-1-douglas.raillard@arm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230309221302.642e82d9@gandalf.local.home Reported-by: Douglas RAILLARD <douglas.raillard@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-04-03io_uring: rename trace_io_uring_submit_sqe() tracepointJens Axboe
It has nothing to do with the SQE at this point, it's a request submission. While in there, get rid of the 'force_nonblock' argument which is also dead, as we only pass in true. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-01thermal/core: Relocate the traces definition in thermal directoryDaniel Lezcano
The traces are exported but only local to the thermal core code. On the other side, the traces take the thermal zone device structure as argument, thus they have to rely on the exported thermal.h header file. As we want to move the structure to the private thermal core header, first we have to relocate those traces to the same place as many drivers do. Cc: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230307133735.90772-2-daniel.lezcano@linaro.org
2023-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c 3fbe4d8c0e53 ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting") 924531326e2d ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-29Merge tag 'f2fs-fix-6.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fix from Jaegeuk Kim: "This fixes a tracepoint field size in f2fs in preparation for stricter rules for tracing fields" * tag 'f2fs-fix-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: Fix f2fs_truncate_partial_nodes ftrace event
2023-03-29ipv6: Remove in6addr_any alternatives.Kuniyuki Iwashima
Some code defines the IPv6 wildcard address as a local variable and use it with memcmp() or ipv6_addr_equal(). Let's use in6addr_any and ipv6_addr_any() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-28kasan: remove PG_skip_kasan_poison flagPeter Collingbourne
Code inspection reveals that PG_skip_kasan_poison is redundant with kasantag, because the former is intended to be set iff the latter is the match-all tag. It can also be observed that it's basically pointless to poison pages which have kasantag=0, because any pages with this tag would have been pointed to by pointers with match-all tags, so poisoning the pages would have little to no effect in terms of bug detection. Therefore, change the condition in should_skip_kasan_poison() to check kasantag instead, and remove PG_skip_kasan_poison and associated flags. Link: https://lkml.kernel.org/r/20230310042914.3805818-3-pcc@google.com Link: https://linux-review.googlesource.com/id/I57f825f2eaeaf7e8389d6cf4597c8a5821359838 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm, printk: introduce new format %pGt for page_typeHyeonggon Yoo
%pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mmflags.h: use less error prone method to define pageflag_namesHyeonggon Yoo
Patch series "mm, printk: introduce new format for page_type", v4. This series moves PG_slab page flag to page_type, freeing one bit in page->flags and introduces %pGt format that prints human-readable page_type like %pGp for printing page flags. See changelog of patch 2 for more implementation details. Thanks everyone that gave valuable comments. This patch (of 3): Use helper macro to decrease chances of typo when defining pageflag_names. Link: https://lkml.kernel.org/r/20230130042514.2418-1-42.hyeyoo@gmail.com Link: https://lore.kernel.org/lkml/Y6AycLbpjVzXM5I9@smile.fi.intel.com Link: https://lkml.kernel.org/r/20230130042514.2418-2-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: add tracepoints to ksmStefan Roesch
This adds the following tracepoints to ksm: - start / stop scan - ksm enter / exit - merge a page - merge a page with ksm - remove a page - remove a rmap item This patch has been split off from the RFC patch series "mm: process/cgroup ksm support". Link: https://lkml.kernel.org/r/20230210214645.2720847-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28Merge tag 'urgent-rcu.2023.03.28a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This brings the rcu_torture_read event trace into line with the new trace tools by replacing this event trace's __field() with the corresponding __array(). Without this, the new trace tools will fail when presented wtih an rcu_torture_read event trace, which is a regression from the viewpoint of trace tools users" Link: https://lore.kernel.org/all/20230320133650.5388a05e@gandalf.local.home/ * tag 'urgent-rcu.2023.03.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Fix rcu_torture_read ftrace event
2023-03-24trace: Add trace_ipi_send_cpu()Peter Zijlstra
Because copying cpumasks around when targeting a single CPU is a bit daft... Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
2023-03-24trace: Add trace_ipi_send_cpumask()Valentin Schneider
trace_ipi_raise() is unsuitable for generically tracing IPI sources due to its "reason" argument being an uninformative string (on arm64 all you get is "Function call interrupts" for SMP calls). Add a variant of it that exports a target cpumask, a callsite and a callback. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230307143558.294354-2-vschneid@redhat.com
2023-03-23mm: mmap: remove newline at the end of the traceMinwoo Im
We already have newline in TP_printk so remove the redundant newline character at the end of the mmap trace. <...>-345 [006] ..... 95.589290: exit_mmap: mt_mod ... <...>-345 [006] ..... 95.589413: vm_unmapped_area: addr=... <...>-345 [006] ..... 95.589571: vm_unmapped_area: addr=... <...>-345 [006] ..... 95.589606: vm_unmapped_area: addr=... to <...>-336 [006] ..... 44.762506: exit_mmap: mt_mod ... <...>-336 [006] ..... 44.762654: vm_unmapped_area: addr=... <...>-336 [006] ..... 44.762794: vm_unmapped_area: addr=... <...>-336 [006] ..... 44.762835: vm_unmapped_area: addr=... Link: https://lkml.kernel.org/r/ZAu6qDsNPmk82UjV@minwoo-desktop FIxes: df529cabb7a25 ("mm: mmap: add trace point of vm_unmapped_area") Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23ext4: Convert data=journal writeback to use ext4_writepages()Jan Kara
Add support for writeback of journalled data directly into ext4_writepages() instead of offloading it to write_cache_pages(). This actually significantly simplifies the code and reduces code duplication. For checkpointing of committed data we can use ext4_writepages() rightaway the same way as writeback of ordered data uses it on transaction commit. For journalling of dirty mapped pages, we need to add a special case to mpage_prepare_extent_to_map() to add all page buffers to the journal. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-8-tytso@mit.edu
2023-03-22rcu: Fix rcu_torture_read ftrace eventDouglas Raillard
Fix the rcutorturename field so that its size is correctly reported in the text format embedded in trace.dat files. As it stands, it is reported as being of size 1: field:char rcutorturename[8]; offset:8; size:1; signed:0; Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: stable@vger.kernel.org Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()") Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> [ boqun: Add "Cc" and "Fixes" tags per Steven ] Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-03-17inet: preserve const qualifier in inet_sk()Eric Dumazet
We can change inet_sk() to propagate const qualifier of its argument. This should avoid some potential errors caused by accidental (const -> not_const) promotion. Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled in separate patch series. v2: use container_of_const() as advised by Jakub and Linus Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/ Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-16scsi: ufs: core: Add trace event for MCQZiqi Chen
Add MCQ hardware queue ID in the existing trace event ufshcd_command(). Signed-off-by: Ziqi Chen <quic_ziqichen@quicinc.com> Link: https://lore.kernel.org/r/1678866271-49601-1-git-send-email-quic_ziqichen@quicinc.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-11spi: Replace all spi->chip_select and spi->cs_gpiod references with function ↵Amit Kumar Mahapatra via Alsa-devel
call Supporting multi-cs in spi drivers would require the chip_select & cs_gpiod members of struct spi_device to be an array. But changing the type of these members to array would break the spi driver functionality. To make the transition smoother introduced four new APIs to get/set the spi->chip_select & spi->cs_gpiod and replaced all spi->chip_select and spi->cs_gpiod references with get or set API calls. While adding multi-cs support in further patches the chip_select & cs_gpiod members of the spi_device structure would be converted to arrays & the "idx" parameter of the APIs would be used as array index i.e., spi->chip_select[idx] & spi->cs_gpiod[idx] respectively. Signed-off-by: Amit Kumar Mahapatra <amit.kumar-mahapatra@amd.com> Acked-by: Heiko Stuebner <heiko@sntech.de> # Rockchip drivers Reviewed-by: Michal Simek <michal.simek@amd.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> # Aspeed driver Reviewed-by: Dhruva Gole <d-gole@ti.com> # SPI Cadence QSPI Reviewed-by: Patrice Chotard <patrice.chotard@foss.st.com> # spi-stm32-qspi Acked-by: William Zhang <william.zhang@broadcom.com> # bcm63xx-hsspi driver Reviewed-by: Serge Semin <fancer.lancer@gmail.com> # DW SSI part Link: https://lore.kernel.org/r/167847070432.26.15076794204368669839@mailman-core.alsa-project.org Signed-off-by: Mark Brown <broonie@kernel.org>
2023-03-10f2fs: Fix f2fs_truncate_partial_nodes ftrace eventDouglas Raillard
Fix the nid_t field so that its size is correctly reported in the text format embedded in trace.dat files. As it stands, it is reported as being of size 4: field:nid_t nid[3]; offset:24; size:4; signed:0; Instead of 12: field:nid_t nid[3]; offset:24; size:12; signed:0; This also fixes the reported offset of subsequent fields so that they match with the actual struct layout. Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-03-06fs: dlm: change dflags to use atomic bitsAlexander Aring
Currently manipulating lkb_dflags assumes to held the rsb lock assigned to the lkb. This is held by dlm message processing after certain time to lookup the right rsb from the received lkb message id. For user space locks flags, which is currently the only use case for lkb_dflags, flags are also being set during dlm character device handling without holding the rsb lock. To minimize the risk that bit operations are getting corrupted we switch to atomic bit operations. This patch will also introduce helpers to snapshot atomic bit values in an non atomic way. There might be still issues with the flag handling e.g. running in case of manipulating bit ops and snapshot them at the same time, but this patch minimize them and will start to use atomic bit operations. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: store lkb distributed flags into own valueAlexander Aring
This patch stores lkb distributed flags value in an separate value instead of sharing internal and distributed flags in lkb->lkb_flags value. This has the advantage to not mask/write back flag values in receive_flags() functionality. The dlm debug_fs does not provide the distributed flags anymore, those can be added in future. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: remove DLM_IFL_LOCAL_MS flagAlexander Aring
The DLM_IFL_LOCAL_MS flag is an internal non shared flag but used in m_flags of dlm messages. It is not shared because it is only used for local messaging. Instead using DLM_IFL_LOCAL_MS in dlm messages we pass a parameter around to signal local messaging or not. This patch is adding the local parameter to signal local messaging. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-03-06fs: dlm: rename stub to local message flagAlexander Aring
This patch renames DLM_IFL_STUB_MS to DLM_IFL_LOCAL_MS flag. The DLM_IFL_STUB_MS flag is somewhat misnamed, it means the dlm message is used for local message transfer only. It is used by recovery to resolve lock states if a node got fenced. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-02-27Merge tag 'f2fs-for-6.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've got a huge number of patches that improve code readability along with minor bug fixes, while we've mainly fixed some critical issues in recently-added per-block age-based extent_cache, atomic write support, and some folio cases. Enhancements: - add sysfs nodes to set last_age_weight and manage discard_io_aware_gran - show ipu policy in debugfs - reduce stack memory cost by using bitfield in struct f2fs_io_info - introduce trace_f2fs_replace_atomic_write_block - enhance iostat support and adds flush commands Bug fixes: - revert "f2fs: truncate blocks in batch in __complete_revoke_list()" - fix kernel crash on the atomic write abort flow - call clear_page_private_reference in .{release,invalid}_folio - support .migrate_folio for compressed inode - fix cgroup writeback accounting with fs-layer encryption - retry to update the inode page given data corruption - fix kernel crash due to NULL io->bio - fix some bugs in per-block age-based extent_cache: - wrong calculation of block age - update age extent in f2fs_do_zero_range() - update age extent correctly during truncation" * tag 'f2fs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (81 commits) f2fs: drop unnecessary arg for f2fs_ioc_*() f2fs: Revert "f2fs: truncate blocks in batch in __complete_revoke_list()" f2fs: synchronize atomic write aborts f2fs: fix wrong segment count f2fs: replace si->sbi w/ sbi in stat_show() f2fs: export ipu policy in debugfs f2fs: make kobj_type structures constant f2fs: fix to do sanity check on extent cache correctly f2fs: add missing description for ipu_policy node f2fs: fix to set ipu policy f2fs: fix typos in comments f2fs: fix kernel crash due to null io->bio f2fs: use iostat_lat_type directly as a parameter in the iostat_update_and_unbind_ctx() f2fs: add sysfs nodes to set last_age_weight f2fs: fix f2fs_show_options to show nogc_merge mount option f2fs: fix cgroup writeback accounting with fs-layer encryption f2fs: fix wrong calculation of block age f2fs: fix to update age extent in f2fs_do_zero_range() f2fs: fix to update age extent correctly during truncation f2fs: fix to avoid potential memory corruption in __update_iostat_latency() ...
2023-02-27Merge tag 'soc-drivers-6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC driver updates from Arnd Bergmann: "As usual, there are lots of minor driver changes across SoC platforms from NXP, Amlogic, AMD Zynq, Mediatek, Qualcomm, Apple and Samsung. These usually add support for additional chip variations in existing drivers, but also add features or bugfixes. The SCMI firmware subsystem gains a unified raw userspace interface through debugfs, which can be used for validation purposes. Newly added drivers include: - New power management drivers for StarFive JH7110, Allwinner D1 and Renesas RZ/V2M - A driver for Qualcomm battery and power supply status - A SoC device driver for identifying Nuvoton WPCM450 chips - A regulator coupler driver for Mediatek MT81xxv" * tag 'soc-drivers-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (165 commits) power: supply: Introduce Qualcomm PMIC GLINK power supply soc: apple: rtkit: Do not copy the reg state structure to the stack soc: sunxi: SUN20I_PPU should depend on PM memory: renesas-rpc-if: Remove redundant division of dummy soc: qcom: socinfo: Add IDs for IPQ5332 and its variant dt-bindings: arm: qcom,ids: Add IDs for IPQ5332 and its variant dt-bindings: power: qcom,rpmpd: add RPMH_REGULATOR_LEVEL_LOW_SVS_L1 firmware: qcom_scm: Move qcom_scm.h to include/linux/firmware/qcom/ MAINTAINERS: Update qcom CPR maintainer entry dt-bindings: firmware: document Qualcomm SM8550 SCM dt-bindings: firmware: qcom,scm: add qcom,scm-sa8775p compatible soc: qcom: socinfo: Add Soc IDs for IPQ8064 and variants dt-bindings: arm: qcom,ids: Add Soc IDs for IPQ8064 and variants soc: qcom: socinfo: Add support for new field in revision 17 soc: qcom: smd-rpm: Add IPQ9574 compatible soc: qcom: pmic_glink: remove redundant calculation of svid soc: qcom: stats: Populate all subsystem debugfs files dt-bindings: soc: qcom,rpmh-rsc: Update to allow for generic nodes soc: qcom: pmic_glink: add CONFIG_NET/CONFIG_OF dependencies soc: qcom: pmic_glink: Introduce altmode support ...
2023-02-25Merge tag 'cxl-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds
Pull Compute Express Link (CXL) updates from Dan Williams: "To date Linux has been dependent on platform-firmware to map CXL RAM regions and handle events / errors from devices. With this update we can now parse / update the CXL memory layout, and report events / errors from devices. This is a precursor for the CXL subsystem to handle the end-to-end "RAS" flow for CXL memory. i.e. the flow that for DDR-attached-DRAM is handled by the EDAC driver where it maps system physical address events to a field-replaceable-unit (FRU / endpoint device). In general, CXL has the potential to standardize what has historically been a pile of memory-controller-specific error handling logic. Another change of note is the default policy for handling RAM-backed device-dax instances. Previously the default access mode was "device", mmap(2) a device special file to access memory. The new default is "kmem" where the address range is assigned to the core-mm via add_memory_driver_managed(). This saves typical users from wondering why their platform memory is not visible via free(1) and stuck behind a device-file. At the same time it allows expert users to deploy policy to, for example, get dedicated access to high performance memory, or hide low performance memory from general purpose kernel allocations. This affects not only CXL, but also systems with high-bandwidth-memory that platform-firmware tags with the EFI_MEMORY_SP (special purpose) designation. Summary: - CXL RAM region enumeration: instantiate 'struct cxl_region' objects for platform firmware created memory regions - CXL RAM region provisioning: complement the existing PMEM region creation support with RAM region support - "Soft Reservation" policy change: Online (memory hot-add) soft-reserved memory (EFI_MEMORY_SP) by default, but still allow for setting aside such memory for dedicated access via device-dax. - CXL Events and Interrupts: Takeover CXL event handling from platform-firmware (ACPI calls this CXL Memory Error Reporting) and export CXL Events via Linux Trace Events. - Convey CXL _OSC results to drivers: Similar to PCI, let the CXL subsystem interrogate the result of CXL _OSC negotiation. - Emulate CXL DVSEC Range Registers as "decoders": Allow for first-generation devices that pre-date the definition of the CXL HDM Decoder Capability to translate the CXL DVSEC Range Registers into 'struct cxl_decoder' objects. - Set timestamp: Per spec, set the device timestamp in case of hotplug, or if platform-firwmare failed to set it. - General fixups: linux-next build issues, non-urgent fixes for pre-production hardware, unit test fixes, spelling and debug message improvements" * tag 'cxl-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (66 commits) dax/kmem: Fix leak of memory-hotplug resources cxl/mem: Add kdoc param for event log driver state cxl/trace: Add serial number to trace points cxl/trace: Add host output to trace points cxl/trace: Standardize device information output cxl/pci: Remove locked check for dvsec_range_allowed() cxl/hdm: Add emulation when HDM decoders are not committed cxl/hdm: Create emulated cxl_hdm for devices that do not have HDM decoders cxl/hdm: Emulate HDM decoder from DVSEC range registers cxl/pci: Refactor cxl_hdm_decode_init() cxl/port: Export cxl_dvsec_rr_decode() to cxl_port cxl/pci: Break out range register decoding from cxl_hdm_decode_init() cxl: add RAS status unmasking for CXL cxl: remove unnecessary calling of pci_enable_pcie_error_reporting() dax/hmem: build hmem device support as module if possible dax: cxl: add CXL_REGION dependency cxl: avoid returning uninitialized error code cxl/pmem: Fix nvdimm registration races cxl/mem: Fix UAPI command comment cxl/uapi: Tag commands from cxl_query_cmd() ...