summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2023-07-19bpf: Add generic attach/detach/query API for multi-progsDaniel Borkmann
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19xsk: add new netlink attribute dedicated for ZC max fragsMaciej Fijalkowski
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will carry maximum fragments that underlying ZC driver is able to handle on TX side. It is going to be included in netlink response only when driver supports ZC. Any value higher than 1 implies multi-buffer ZC support on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_mapAnton Protopopov
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs. This, in turn, allows users to execute trusted kfuncs which accept `struct bpf_map *` arguments from non-tracing programs. While exporting the btf_bpf_map_id variable, save some bytes by defining it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST (which is u32[64]). Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19sched/fair: Implement an EEVDF-like scheduling policyPeter Zijlstra
Where CFS is currently a WFQ based scheduler with only a single knob, the weight. The addition of a second, latency oriented parameter, makes something like WF2Q or EEVDF based a much better fit. Specifically, EEVDF does EDF like scheduling in the left half of the tree -- those entities that are owed service. Except because this is a virtual time scheduler, the deadlines are in virtual time as well, which is what allows over-subscription. EEVDF has two parameters: - weight, or time-slope: which is mapped to nice just as before - request size, or slice length: which is used to compute the virtual deadline as: vd_i = ve_i + r_i/w_i Basically, by setting a smaller slice, the deadline will be earlier and the task will be more eligible and ran earlier. Tick driven preemption is driven by request/slice completion; while wakeup preemption is driven by the deadline. Because the tree is now effectively an interval tree, and the selection is no longer 'leftmost', over-scheduling is less of a problem. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
2023-07-19rbtree: Add rb_add_augmented_cached() helperPeter Zijlstra
While slightly sub-optimal, updating the augmented data while going down the tree during lookup would be faster -- alas the augment interface does not currently allow for that, provide a generic helper to add a node to an augmented cached tree. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.862983648@infradead.org
2023-07-19sched/fair: Add lag based placementPeter Zijlstra
With the introduction of avg_vruntime, it is possible to approximate lag (the entire purpose of introducing it in fact). Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
2023-07-19Merge tag 'v6.5-rc2' into sched/core, to pick up fixesIngo Molnar
Sync with upstream fixes before applying EEVDF. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-19sched/headers: Rename task_struct::state to task_struct::__state in the ↵Chin Yik Ming
comments too The rename in 2f064a59a11f ("sched: Change task_struct::state") missed the comments. [ mingo: Improved the changelog. ] Signed-off-by: Chin Yik Ming <yikming2222@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20230717064952.2804-1-yikming2222@gmail.com
2023-07-18tcp: get rid of sysctl_tcp_adv_win_scaleEric Dumazet
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path. tcp_win_from_space() implementation is limited to few cases. For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers. For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory. For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes. This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio. This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18bpf: Add 'owner' field to bpf_{list,rb}_nodeDave Marchevsky
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates using this ownership information - in order to prevent races. This patch adds an 'owner' field to struct bpf_list_node and bpf_rb_node to implement such runtime tracking. The owner field is a void * that describes the ownership state of a node. It can have the following values: NULL - the node is not owned by any data structure BPF_PTR_POISON - the node is in the process of being added to a data structure ptr_to_root - the pointee is a data structure 'root' (bpf_rb_root / bpf_list_head) which owns this node The field is initially NULL (set by bpf_obj_init_field default behavior) and transitions states in the following sequence: Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root Removal: ptr_to_root -> NULL Before a node has been successfully inserted, it is not protected by any root's lock, and therefore two programs can attempt to add the same node to different roots simultaneously. For this reason the intermediate BPF_PTR_POISON state is necessary. For removal, the node is protected by some root's lock so this intermediate hop isn't necessary. Note that bpf_list_pop_{front,back} helpers don't need to check owner before removing as the node-to-be-removed is not passed in as input and is instead taken directly from the list. Do the check anyways and WARN_ON_ONCE in this unexpected scenario. Selftest changes in this patch are entirely mechanical: some BTF tests have hardcoded struct sizes for structs that contain bpf_{list,rb}_node fields, those were adjusted to account for the new sizes. Selftest additions to validate the owner field are added in a further patch in the series. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2 Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_nodeDave Marchevsky
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touch their fields - nor does the verifier allow them to do so. Currently these structs are simple wrappers around structs rb_node and list_head and linked_list / rbtree implementation just casts and passes to library functions for those data structures. Later patches in this series, though, will add an "owner" field to bpf_{rb,list}_node, such that they're not just wrapping an underlying node type. Moreover, the bpf linked_list and rbtree implementations will deal with these owner pointers directly in a few different places. To avoid having to do void *owner = (void*)bpf_list_node + sizeof(struct list_head) with opaque UAPI node types, add bpf_{list,rb}_node_kern struct definitions to internal headers and modify linked_list and rbtree to use the internal types where appropriate. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19ASoC: Improve coverage in default KUnit runsMark Brown
Merge series from Mark Brown <broonie@kernel.org>: We have some KUnit tests for ASoC but they're not being run as much as they should be since ASoC isn't enabled in the configs used by default with KUnit and in the case of the topology tests there is no way to enable them without enabling drivers that use them. This series provides a Kconfig option which KUnit can use directly rather than worry about drivers. Further, since KUnit is typically run in UML but ALSA prevents build with UML we need to remove that Kconfig conflict. As far as I can tell the motiviation for this is that many ALSA drivers use iomem APIs which are not available under UML and it's more trouble than it's worth to go through and add per driver dependencies. In order to avoid these issues we also provide stubs for these APIs so there are no build time issues if a driver relies on iomem but does not depend on it. With these stubs I am able to build all the sound drivers available in a UML defconfig (UML allmodconfig appears to have substantial other issues in a quick test). With this series I am able to run the topology KUnit tests as part of a kunit --alltests run.
2023-07-18cgroup/misc: Change counters to be explicit 64bit typesHaitao Huang
So the variables can account for resources of huge quantities even on 32-bit machines. Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-18cgroup/misc: update struct members descriptionsKamalesh Babulal
Update the miscellaneous controller's structure member's description of struct misc_res and struct misc_cg. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-18platform: Provide stubs for !HAS_IOMEM buildsMark Brown
The various _ioremap_resource functions are not built when CONFIG_HAS_IOMEM is disabled but no stubs are provided. Given how widespread IOMEM usage is in drivers and how rare !IOMEM configurations are in practical use let's just provide some stubs so users will build without having to add explicit dependencies on IOMEM. The most likely use case is builds with UML for KUnit testing. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230718-asoc-topology-kunit-enable-v2-2-0ee11e662b92@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2023-07-18driver core: Provide stubs for !IOMEM buildsMark Brown
The various _ioremap_resource functions are not built when CONFIG_HAS_IOMEM is disabled but no stubs are provided. Given how widespread IOMEM usage is in drivers and how rare !IOMEM configurations are in practical use let's just provide some stubs so users will build without having to add explicit dependencies on HAS_IOMEM. The most likely use case is builds with UML for KUnit testing. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: David Gow <davidgow@google.com> Reviewed-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230718-asoc-topology-kunit-enable-v2-1-0ee11e662b92@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2023-07-18regmap: Let users check if a register is cachedMark Brown
The HDA driver has a use case for checking if a register is cached which it bodges in awkwardly and unclearly. Provide an API which allows it to directly do what it's trying to do. Reviewed-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230717-regmap-cache-check-v1-1-73ef688afae3@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2023-07-18PCI: Add Intel Audio DSP devices to pci_ids.hAmadeusz Sławiński
Those IDs are mostly sprinkled between HDA, Skylake, SOF and avs drivers. Almost every use contains additional comments to identify to which platform those IDs refer to. Add those IDs to pci_ids.h header, so that there is one place which defines those names. Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> # for the Intel Tangier ID Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com> Signed-off-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com> Link: https://lore.kernel.org/r/20230717114511.484999-3-amadeuszx.slawinski@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-07-18PCI: Sort Intel PCI IDs by numberAmadeusz Sławiński
Some of the PCI IDs are not sorted correctly, reorder them by growing ID number. Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Mark Brown <broonie@kernel.org> Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com> Signed-off-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com> Link: https://lore.kernel.org/r/20230717114511.484999-2-amadeuszx.slawinski@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-07-18Randomized slab caches for kmalloc()GONG, Ruiqi
When exploiting memory vulnerabilities, "heap spraying" is a common technique targeting those related to dynamic memory allocation (i.e. the "heap"), and it plays an important role in a successful exploitation. Basically, it is to overwrite the memory area of vulnerable object by triggering allocation in other subsystems or modules and therefore getting a reference to the targeted memory location. It's usable on various types of vulnerablity including use after free (UAF), heap out- of-bound write and etc. There are (at least) two reasons why the heap can be sprayed: 1) generic slab caches are shared among different subsystems and modules, and 2) dedicated slab caches could be merged with the generic ones. Currently these two factors cannot be prevented at a low cost: the first one is a widely used memory allocation mechanism, and shutting down slab merging completely via `slub_nomerge` would be overkill. To efficiently prevent heap spraying, we propose the following approach: to create multiple copies of generic slab caches that will never be merged, and random one of them will be used at allocation. The random selection is based on the address of code that calls `kmalloc()`, which means it is static at runtime (rather than dynamically determined at each time of allocation, which could be bypassed by repeatedly spraying in brute force). In other words, the randomness of cache selection will be with respect to the code address rather than time, i.e. allocations in different code paths would most likely pick different caches, although kmalloc() at each place would use the same cache copy whenever it is executed. In this way, the vulnerable object and memory allocated in other subsystems and modules will (most probably) be on different slab caches, which prevents the object from being sprayed. Meanwhile, the static random selection is further enhanced with a per-boot random seed, which prevents the attacker from finding a usable kmalloc that happens to pick the same cache with the vulnerable subsystem/module by analyzing the open source code. In other words, with the per-boot seed, the random selection is static during each time the system starts and runs, but not across different system startups. The overhead of performance has been tested on a 40-core x86 server by comparing the results of `perf bench all` between the kernels with and without this patch based on the latest linux-next kernel, which shows minor difference. A subset of benchmarks are listed below: sched/ sched/ syscall/ mem/ mem/ messaging pipe basic memcpy memset (sec) (sec) (sec) (GB/sec) (GB/sec) control1 0.019 5.459 0.733 15.258789 51.398026 control2 0.019 5.439 0.730 16.009221 48.828125 control3 0.019 5.282 0.735 16.009221 48.828125 control_avg 0.019 5.393 0.733 15.759077 49.684759 experiment1 0.019 5.374 0.741 15.500992 46.502976 experiment2 0.019 5.440 0.746 16.276042 51.398026 experiment3 0.019 5.242 0.752 15.258789 51.398026 experiment_avg 0.019 5.352 0.746 15.678608 49.766343 The overhead of memory usage was measured by executing `free` after boot on a QEMU VM with 1GB total memory, and as expected, it's positively correlated with # of cache copies: control 4 copies 8 copies 16 copies total 969.8M 968.2M 968.2M 968.2M used 20.0M 21.9M 24.1M 26.7M free 936.9M 933.6M 931.4M 928.6M available 932.2M 928.8M 926.6M 923.9M Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> # percpu Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-18net: phylink: remove legacy mac_an_restart() methodRussell King (Oracle)
The mac_an_restart() method is now completely unused, and has been superseded by phylink_pcs support. Remove this method. Since phylink_pcs_mac_an_restart() now only deals with the PCS, rename the function to remove the _mac infix. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-17sched: add a few helpers to wake up tasks on the current cpuAndrei Vagin
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake up tasks on the current CPU. These two helpers are useful when the task needs to make a synchronous context switch to another task. In this context, synchronous means it wakes up the target task and falls asleep right after that. One example of such workloads is seccomp user notifies. This mechanism allows the supervisor process handles system calls on behalf of a target process. While the supervisor is handling an intercepted system call, the target process will be blocked in the kernel, waiting for a response to come back. On-CPU context switches are much faster than regular ones. Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17cpumask: eliminate kernel-doc warningsRandy Dunlap
Update lib/cpumask.c and <linux/cpumask.h> to fix all kernel-doc warnings: include/linux/cpumask.h:185: warning: Function parameter or member 'srcp1' not described in 'cpumask_first_and' include/linux/cpumask.h:185: warning: Function parameter or member 'srcp2' not described in 'cpumask_first_and' include/linux/cpumask.h:185: warning: Excess function parameter 'src1p' description in 'cpumask_first_and' include/linux/cpumask.h:185: warning: Excess function parameter 'src2p' description in 'cpumask_first_and' lib/cpumask.c:59: warning: Function parameter or member 'node' not described in 'alloc_cpumask_var_node' lib/cpumask.c:169: warning: Function parameter or member 'src1p' not described in 'cpumask_any_and_distribute' lib/cpumask.c:169: warning: Function parameter or member 'src2p' not described in 'cpumask_any_and_distribute' Fixes: 7b4967c53204 ("cpumask: Add alloc_cpumask_var_node()") Fixes: 839cad5fa54b ("cpumask: fix function description kernel-doc notation") Fixes: 93ba139ba819 ("cpumask: use find_first_and_bit()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2023-07-17Merge tag 'v6.4' into nextDmitry Torokhov
Sync up with mainline to bring in updates to shared infrastructure.
2023-07-17fs: distinguish between user initiated freeze and kernel initiated freezeDarrick J. Wong
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed. The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command. Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted. I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors. Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-17blk-flush: reuse rq queuelist in flush state machineChengming Zhou
Since we don't need to maintain inflight flush_data requests list anymore, we can reuse rq->queuelist for flush pending list. Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist before reusing it in the state machine when end, since the rq->rq_next also reuse it, may have corrupted rq->queuelist by the driver. This patch decrease the size of struct request by 16 bytes. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17blk-mq: use percpu csd to remote complete instead of per-rq csdChengming Zhou
If request need to be completed remotely, we insert it into percpu llist, and smp_call_function_single_async() if llist is empty previously. We don't need to use per-rq csd, percpu csd is enough. And the size of struct request is decreased by 24 bytes. This way is cleaner, and looks correct, given block softirq is guaranteed to be scheduled to consume the list if one new request is added to this percpu list, either smp_call_function_single_async() returns -EBUSY or 0. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17block: don't allow enabling a cache on devices that don't support itChristoph Hellwig
Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC flag on devices that never claimed the capability. Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by blk_queue_write_cache and guards re-enabling the cache through sysfs. Note that any rescan that calls blk_queue_write_cache will still re-enable the write cache as in the current code. Fixes: 93e9d8e836cb ("block: add ability to flag write back caching on a device") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17Merge tag 'drm-misc-next-2023-07-13' of ↵Daniel Vetter
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v6.6: UAPI Changes: * fbdev: * Make fbdev userspace interfaces optional; only leaves the framebuffer console active * prime: * Support dma-buf self-import for all drivers automatically: improves support for many userspace compositors Cross-subsystem Changes: * backlight: * Fix interaction with fbdev in several drivers * base: Convert struct platform.remove to return void; part of a larger, tree-wide effort * dma-buf: Acquire reservation lock for mmap() in exporters; part of an on-going effort to simplify locking around dma-bufs * fbdev: * Use Linux device instead of fbdev device in many places * Use deferred-I/O helper macros in various drivers * i2c: Convert struct i2c from .probe_new to .probe; part of a larger, tree-wide effort * video: * Avoid including <linux/screen_info.h> Core Changes: * atomic: * Improve logging * prime: * Remove struct drm_driver.gem_prime_mmap plus driver updates: all drivers now implement this callback with drm_gem_prime_mmap() * gem: * Support execution contexts: provides locking over multiple GEM objects * ttm: * Support init_on_free * Swapout fixes Driver Changes: * accel: * ivpu: MMU updates; Support debugfs * ast: * Improve device-model detection * Cleanups * bridge: * dw-hdmi: Improve support for YUV420 bus format * dw-mipi-dsi: Fix enable/disable of DSI controller * lt9611uxc: Use MODULE_FIRMWARE() * ps8640: Remove broken EDID code * samsung-dsim: Fix command transfer * tc358764: Handle HS/VS polarity; Use BIT() macro; Various cleanups * Cleanups * ingenic: * Kconfig REGMAP fixes * loongson: * Support display controller * mgag200: * Minor fixes * mxsfb: * Support disabling overlay planes * nouveau: * Improve VRAM detection * Various fixes and cleanups * panel: * panel-edp: Support AUO B116XAB01.4 * Support Visionox R66451 plus DT bindings * Cleanups * ssd130x: * Support per-controller default resolution plus DT bindings * Reduce memory-allocation overhead * Cleanups * tidss: * Support TI AM625 plus DT bindings * Implement new connector model plus driver updates * vkms * Improve write-back support * Documentation fixes Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230713090830.GA23281@linux-uq9g
2023-07-17net: phy: bcm7xxx: Add EPHY entry for 74165Florian Fainelli
74165 is a 16nm process SoC with a 10/100 integrated Ethernet PHY, utilize the recently defined 16nm EPHY macro to configure that PHY. Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Justin Chen <justin.chen@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-16Merge tag 'sched_urgent_for_v6.5_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Remove a cgroup from under a polling process properly - Fix the idle sibling selection * tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: use kernfs polling functions for PSI trigger polling sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-15remoteproc: core: Export the rproc coredump APIsSiddharth Gupta
The remoteproc coredump APIs are currently only part of the internal remoteproc header. This prevents the remoteproc platform drivers from using these APIs when needed. This change moves the rproc_coredump() and rproc_coredump_cleanup() APIs to the linux header and marks them as exported symbols. Signed-off-by: Siddharth Gupta <sidgup@codeaurora.org> Signed-off-by: Gokul krishna Krishnakumar <quic_gokukris@quicinc.com> Link: https://lore.kernel.org/r/20230224211707.30916-2-quic_gokukris@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2023-07-15rpmsg: core: Add signal API supportDeepak Kumar Singh
Some transports like Glink support the state notifications between clients using flow control signals similar to serial protocol signals. Local glink client drivers can send and receive flow control status to glink clients running on remote processors. Add APIs to support sending and receiving of flow control status by rpmsg clients. Signed-off-by: Deepak Kumar Singh <quic_deesin@quicinc.com> Signed-off-by: Sarannya S <quic_sarannya@quicinc.com> Acked-by: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com> Link: https://lore.kernel.org/r/1688679698-31274-2-git-send-email-quic_sarannya@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2023-07-15clk: qcom: smd-rpm: Move some RPM resources to the common headerKonrad Dybcio
In preparation for handling the bus clocks in the icc driver, carve out some defines and a struct definition to the common rpm header. Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> Acked-by: Georgi Djakov <djakov@kernel.org> Link: https://lore.kernel.org/r/20230526-topic-smd_icc-v7-4-09c78c175546@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2023-07-15soc: qcom: smd-rpm: Use tabs for definesKonrad Dybcio
Use tabs for defines to make things spaced consistently. Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> Acked-by: Georgi Djakov <djakov@kernel.org> Link: https://lore.kernel.org/r/20230526-topic-smd_icc-v7-3-09c78c175546@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2023-07-15soc: qcom: smd-rpm: Add QCOM_SMD_RPM_STATE_NUMKonrad Dybcio
Add a preprocessor define to indicate the number of RPM contexts/states. Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> Acked-by: Georgi Djakov <djakov@kernel.org> Link: https://lore.kernel.org/r/20230526-topic-smd_icc-v7-2-09c78c175546@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2023-07-14Merge tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Don't require quirk to use duplicate namespace identifiers (Christoph, Sagi) - One more BOGUS_NID quirk (Pankaj) - IO timeout and error hanlding fixes for PCI (Keith) - Enhanced metadata format mask fix (Ankit) - Association race condition fix for fibre channel (Michael) - Correct debugfs error checks (Minjie) - Use PAGE_SECTORS_SHIFT where needed (Damien) - Reduce kernel logs for legacy nguid attribute (Keith) - Use correct dma direction when unmapping metadata (Ming) - Fix for a flush handling regression in this release (Christoph) - Fix for batched request time stamping (Chengming) - Fix for a regression in the mq-deadline position calculation (Bart) - Lockdep fix for blk-crypto (Eric) - Fix for a regression in the Amiga partition handling changes (Michael) * tag 'block-6.5-2023-07-14' of git://git.kernel.dk/linux: block: queue data commands from the flush state machine at the head blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq nvme-pci: fix DMA direction of unmapping integrity data nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices block/mq-deadline: Fix a bug in deadline_from_pos() nvme: ensure disabling pairs with unquiesce nvme-fc: fix race between error recovery and creating association nvme-fc: return non-zero status code when fails to create association nvme: fix parameter check in nvme_fault_inject_init() nvme: warn only once for legacy uuid attribute block: remove dead struc request->completion_data field nvme: fix the NVME_ID_NS_NVM_STS_MASK definition nvmet: use PAGE_SECTORS_SHIFT nvme: add BOGUS_NID quirk for Samsung SM953 blk-crypto: use dynamic lock class for blk_crypto_profile::lock block/partition: fix signedness issue for Amiga partitions
2023-07-14rcuscale: Measure grace-period kthread CPU timePaul E. McKenney
This commit adds the ability to output the CPU time consumed by the grace-period kthread for the RCU variant under test. The CPU time is whatever is in the designated task's current->stime field, and thus is controlled by whatever CPU-time accounting scheme is in effect. This output appears in microseconds as follows on the console: rcu_scale: Grace-period kthread CPU time: 42367.037 [ paulmck: Apply feedback from Stephen Rothwell and kernel test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yujie Liu <yujie.liu@intel.com>
2023-07-14jiffies: add kernel-doc for all APIsRandy Dunlap
Add kernel-doc notation in <linux/jiffies.h> for interfaces that don't already have it (i.e. most interfaces). Fix some formatting and typos in other comments. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230704052405.5089-2-rdunlap@infradead.org
2023-07-14Merge tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "There were a bunch of fixes lined up for 2 weeks, so we have quite a few scattered fixes, mostly amdgpu and i915, but ttm has a bunch and nouveau makes an appearance. So a bit busier than usual for rc2, but nothing seems out of the ordinary. fbdev: - dma: Fix documented default preferred_bpp value ttm: - fix warning that we shouldn't mix && and || - never consider pinned BOs for eviction&swap - Don't leak a resource on eviction error - Don't leak a resource on swapout move error - fix bulk_move corruption when adding a entry client: - Send hotplug event after registering a client dma-buf: - keep the signaling time of merged fences v3 - fix an error pointer vs NULL bug sched: - wait for all deps in kill jobs - call set fence parent from scheduled i915: - Don't preserve dpll_hw_state for slave crtc in Bigjoiner - Consider OA buffer boundary when zeroing out reports - Remove dead code from gen8_pte_encode - Fix one wrong caching mode enum usage amdgpu: - SMU i2c locking fix - Fix a possible deadlock in process restoration for ROCm apps - Disable PCIe lane/speed switching on Intel platforms (the platforms don't support it) nouveau: - disp: fix HDMI on gt215+ - disp/g94: enable HDMI - acr: Abort loading ACR if no firmware was found - bring back blit subchannel for pre nv50 GPUs - Fix drm_dp_remove_payload() invocation ivpu: - Fix VPU register access in irq disable - Clear specific interrupt status bits on C0 bridge: - dw_hdmi: fix connector access for scdc - ti-sn65dsi86: Fix auxiliary bus lifetime panel: - simple: Add connector_type for innolux_at043tn24 - simple: Add Powertip PH800480T013 drm_display_mode flags" * tag 'drm-fixes-2023-07-14-1' of git://anongit.freedesktop.org/drm/drm: (32 commits) drm/nouveau: bring back blit subchannel for pre nv50 GPUs drm/nouveau/acr: Abort loading ACR if no firmware was found drm/amd: Align SMU11 SMU_MSG_OverridePcieParameters implementation with SMU13 drm/amd: Move helper for dynamic speed switch check out of smu13 drm/amd/pm: conditionally disable pcie lane/speed switching for SMU13 drm/amd/pm: share the code around SMU13 pcie parameters update drm/amdgpu: avoid restore process run into dead loop. drm/amd/pm: fix smu i2c data read risk drm/nouveau/disp/g94: enable HDMI drm/nouveau/disp: fix HDMI on gt215+ drm/client: Send hotplug event after registering a client drm/i915: Fix one wrong caching mode enum usage drm/i915: Remove dead code from gen8_pte_encode drm/i915/perf: Consider OA buffer boundary when zeroing out reports drm/i915: Don't preserve dpll_hw_state for slave crtc in Bigjoiner drm/ttm: never consider pinned BOs for eviction&swap drm/fbdev-dma: Fix documented default preferred_bpp value dma-buf: fix an error pointer vs NULL bug accel/ivpu: Clear specific interrupt status bits on C0 accel/ivpu: Fix VPU register access in irq disable ...
2023-07-14iommu: Optimise PCI SAC address trickRobin Murphy
Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for PCI SAC address trick") and its subsequent revert, this mechanism no longer serves its original purpose, but now only works around broken hardware/drivers in a way that is unfortunately too impactful to remove. This does not, however, prevent us from solving the performance impact which that workaround has on large-scale systems that don't need it. Once the 32-bit IOVA space fills up and a workload starts allocating and freeing on both sides of the boundary, the opportunistic SAC allocation can then end up spending significant time hunting down scattered fragments of free 32-bit space, or just reestablishing max32_alloc_size. This can easily be exacerbated by a change in allocation pattern, such as by changing the network MTU, which can increase pressure on the 32-bit space by leaving a large quantity of cached IOVAs which are now the wrong size to be recycled, but also won't be freed since the non-opportunistic allocations can still be satisfied from the whole 64-bit space without triggering the reclaim path. However, in the context of a workaround where smaller DMA addresses aren't simply a preference but a necessity, if we get to that point at all then in fact it's already the endgame. The nature of the allocator is currently such that the first IOVA we give to a device after the 32-bit space runs out will be the highest possible address for that device, ever. If that works, then great, we know we can optimise for speed by always allocating from the full range. And if it doesn't, then the worst has already happened and any brokenness is now showing, so there's little point in continuing to try to hide it. To that end, implement a flag to refine the SAC business into a per-device policy that can automatically get itself out of the way if and when it stops being useful. CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/b8502b115b915d2a3fabde367e099e39106686c8.1681392791.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-07-14spi: Use struct_size() helperAndy Shevchenko
The Documentation/process/deprecated.rst suggests to use flexible array members to provide a way to declare having a dynamically sized set of trailing elements in a structure.This makes code robust agains bunch of the issues described in the documentation, main of which is about the correctness of the sizeof() calculation for this data structure. Due to above, prefer struct_size() over open-coded versions. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20230714091748.89681-5-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-07-14platform/x86/intel/tpmi: Read feature control statusSrinivas Pandruvada
Some of the PM features can be locked or disabled. In that case, write interface can be locked. This status is read via a mailbox. There is one TPMI ID which provides base address for interface and data register for mail box operation. The mailbox operations is defined in the TPMI specification. Refer to https://github.com/intel/tpmi_power_management/ for TPMI specifications. An API is exposed to feature drivers to read feature control status. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20230712225950.171326-2-srinivas.pandruvada@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-07-14Merge tag 'ib-pdx86-simatic-v6.6' into review-hansHans de Goede
Immutable branch between pdx86 simatic branch and LED due for the v6.6 merge window v6.5-rc1 + recent pdx86 simatic-ipc patches for merging into the LED subsystem for v6.6.
2023-07-14platform/x86: simatic-ipc: add another modelHenning Schild
This is the panel variant of a device we already did have. All the same, just no LEDs. Signed-off-by: Henning Schild <henning.schild@siemens.com> Link: https://lore.kernel.org/r/20230713144832.26473-2-henning.schild@siemens.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-07-14platform/x86: simatic-ipc: add CMOS battery monitoringHenning Schild
Siemens Simatic Industrial PCs can monitor the voltage of the CMOS battery with two bits that indicate low or empty state. This can be GPIO or PortIO based. Here we model that as a hwmon voltage. The core driver does the PortIO and provides boilerplate for the GPIO versions. Which are split out to model runtime dependencies while allowing fine-grained kernel configuration. Signed-off-by: Henning Schild <henning.schild@siemens.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230706154831.19100-3-henning.schild@siemens.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-07-14media: i2c: add I2C Address Translator (ATR) supportLuca Ceresoli
An ATR is a device that looks similar to an i2c-mux: it has an I2C slave "upstream" port and N master "downstream" ports, and forwards transactions from upstream to the appropriate downstream port. But it is different in that the forwarded transaction has a different slave address. The address used on the upstream bus is called the "alias" and is (potentially) different from the physical slave address of the downstream chip. Add a helper file (just like i2c-mux.c for a mux or switch) to allow implementing ATR features in a device driver. The helper takes care of adapter creation/destruction and translates addresses at each transaction. Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2023-07-14platform/x86: simatic-ipc: add another model BX-21AHenning Schild
This adds support for the Siemens Simatic IPC model BX-21A. Actual drivers for that model will be sent in separate patches. Signed-off-by: Henning Schild <henning.schild@siemens.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20230713115639.16419-2-henning.schild@siemens.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-07-14net: mdio: add unlocked mdiobus and mdiodev bus accessorsRussell King (Oracle)
Add the following unlocked accessors to complete the set: __mdiobus_modify() __mdiodev_read() __mdiodev_write() __mdiodev_modify() __mdiodev_modify_changed() which we will need for Marvell DSA PCS conversion. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-14net: phylink: add support for PCS link change notificationsRussell King (Oracle)
Add a function, phylink_pcs_change() which can be used by PCs drivers to notify phylink about changes to the PCS link state. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>