summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-11mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'Yunsheng Lin
Currently there is one 'struct page_frag' for every 'struct sock' and 'struct task_struct', we are about to replace the 'struct page_frag' with 'struct page_frag_cache' for them. Before begin the replacing, we need to ensure the size of 'struct page_frag_cache' is not bigger than the size of 'struct page_frag', as there may be tens of thousands of 'struct sock' and 'struct task_struct' instances in the system. By or'ing the page order & pfmemalloc with lower bits of 'va' instead of using 'u16' or 'u32' for page size and 'u8' for pfmemalloc, we are able to avoid 3 or 5 bytes space waste. And page address & pfmemalloc & order is unchanged for the same page in the same 'page_frag_cache' instance, it makes sense to fit them together. After this patch, the size of 'struct page_frag_cache' should be the same as the size of 'struct page_frag'. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-7-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11xtensa: remove the get_order() implementationYunsheng Lin
As the get_order() implemented by xtensa supporting 'nsau' instruction seems be the same as the generic implementation in include/asm-generic/getorder.h when size is not a constant value as the generic implementation calling the fls*() is also utilizing the 'nsau' instruction for xtensa. So remove the get_order() implemented by xtensa, as using the generic implementation may enable the compiler to do the computing when size is a constant value instead of runtime computing and enable the using of get_order() in BUILD_BUG_ON() macro in next patch. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-6-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: page_frag: avoid caller accessing 'page_frag_cache' directlyYunsheng Lin
Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: page_frag: use initial zero offset for page_frag_alloc_align()Yunsheng Lin
We are about to use page_frag_alloc_*() API to not just allocate memory for skb->data, but also use them to do the memory allocation for skb frag too. Currently the implementation of page_frag in mm subsystem is running the offset as a countdown rather than count-up value, there may have several advantages to that as mentioned in [1], but it may have some disadvantages, for example, it may disable skb frag coalescing and more correct cache prefetching We have a trade-off to make in order to have a unified implementation and API for page_frag, so use a initial zero offset in this patch, and the following patch will try to make some optimization to avoid the disadvantages as much as possible. 1. https://lore.kernel.org/all/f4abe71b3439b39d17a6fb2d410180f367cadf5c.camel@gmail.com/ CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-4-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: move the page fragment allocator from page_alloc into its own fileYunsheng Lin
Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt_HOA@mail.gmail.com/ CC: David Howells <dhowells@redhat.com> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: page_frag: add a test module for page_fragYunsheng Lin
The testing is done by ensuring that the fragment allocated from a frag_frag_cache instance is pushed into a ptr_ring instance in a kthread binded to a specified cpu, and a kthread binded to a specified cpu will pop the fragment from the ptr_ring and free the fragment. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://patch.msgid.link/20241028115343.3405838-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net: convert to nla_get_*_default()Johannes Berg
Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_.*"; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); | -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); | -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); | -attr ? fn(attr) : def +dfn(attr, def) | -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net: netlink: add nla_get_*_default() accessorsJohannes Berg
There are quite a number of places that use patterns such as if (attr) val = nla_get_u16(attr); else val = DEFAULT; Add nla_get_u16_default() and friends like that to not have to type this out all the time. Acked-by: Toke Høiland-Jørgensen <toke@kernel.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241108114145.acd2aadb03ac.I3df6aac71d38a5baa1c0a03d0c7e82d4395c030e@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11drm/msm/adreno: Setup SMMU aparture for per-process page tableBjorn Andersson
Support for per-process page tables requires the SMMU aparture to be setup such that the GPU can make updates with the SMMU. On some targets this is done statically in firmware, on others it's expected to be requested in runtime by the driver, through a SCM call. One place where configuration is expected to be done dynamically is the QCS6490 rb3gen2. The downstream driver does this unconditioanlly on any A6xx and newer, so follow suite and make the call. Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Link: https://lore.kernel.org/r/20241110-adreno-smmu-aparture-v2-2-9b1fb2ee41d4@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-11-11firmware: qcom: scm: Introduce CP_SMMU_APERTURE_IDBjorn Andersson
The QCOM_SCM_SVC_MP service provides QCOM_SCM_MP_CP_SMMU_APERTURE_ID, which is used to trigger the mapping of register banks into the SMMU context for per-processes page tables to function (in case this isn't statically setup by firmware). This is necessary on e.g. QCS6490 Rb3Gen2, in order to avoid "CP | AHB bus error"-errors from the GPU. Introduce a function to allow the msm driver to invoke this call. Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Link: https://lore.kernel.org/r/20241110-adreno-smmu-aparture-v2-1-9b1fb2ee41d4@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-11-11nvme: check ns's volatile write cache not presentGuixin Liu
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE flag should not be set when registering the block device, regardless of whether the controller supports VWC. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvme: add rotational supportWang Yugui
Rotational devices, such as hard-drives, can be detected using the rotational bit in the namespace independent identify namespace data structure. Make the bit visible to the block layer through the rotational queue setting. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvme: use command set independent id ns if availableMatias Bjørling
The NVMe 2.0 specification adds an independent identify namespace data structure that contains generic attributes that apply to all namespace types. Some attributes carry over from the NVM command set identify namespace data structure, and others are new. Currently, the data structure only considered when CRIMS is enabled or when the namespace type is key-value. However, the independent namespace data structure is mandatory for devices that implement features from the 2.0+ specification. Therefore, we can check this data structure first. If unavailable, retrieve the generic attributes from the NVM command set identify namespace data structure. Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: support for csi identify nsKeith Busch
Implements reporting the I/O Command Set Independent Identify Namespace command. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement rotational media information logKeith Busch
Most of the information is stubbed. Supporting these commands is a requirement for supporting rotational media. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement endurance groupsKeith Busch
Most of the returned information is just stubbed data. The target must support these in order to report rotational media. Since this driver doesn't know any better, each namespace is its own endurance group with the engid value matching the nsid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: declare 2.1 version complianceKeith Busch
The target driver implements all the mandatory logs, identifications, features, and properties up to nvme sepcification 2.1. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement crto propertyKeith Busch
This property is required for nvme 2.1. The target only supports ready with media, so this is just the same value as CAP.TO. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement supported features logKeith Busch
This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement supported log pagesKeith Busch
This log is required for nvme 2.1. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement active command set ns listKeith Busch
This is required for nvme 2.1 for targets that support multiple command sets. We support NVM and ZNS, so are required to support this identification. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement id ns for nvm command setKeith Busch
We don't report anything here, but it's a mandatory identification for nvme 2.1. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: support reservation featureGuixin Liu
This patch implements the reservation feature, including: 1. reservation register(register, unregister and replace). 2. reservation acquire(acquire, preempt, preempt and abort). 3. reservation release(release and clear). 4. reservation report. 5. set feature and get feature of reservation notify mask. 6. get log page of reservation event. Not supported: 1. persistent reservation through power loss. Test cases: Use nvme-cli and fio to test all implemented sub features: 1. use nvme resv-register to register host a registrant or unregister or replace a new key. 2. use nvme resv-acquire to set host to the holder, and use fio to send read and write io in all reservation type. And also test preempt and "preempt and abort". 3. use nvme resv-report to show all registrants and reservation status. 4. use nvme resv-release to release all registrants. 5. use nvme get-log to get events generated by the preceding operations. In addition, make reservation configurable, one can set ns to support reservation before enable ns. The default of resv_enable is false. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Several small bugfixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa/mlx5: Fix error path during device add vp_vdpa: fix id_table array not null terminated error virtio_pci: Fix admin vq cleanup by using correct info pointer vDPA/ifcvf: Fix pci_read_config_byte() return code handling Fix typo in vringh_test.c vdpa: solidrun: Fix UB bug with devres vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans
2024-11-11sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> ↵Tejun Heo
scx_bpf_dsq_move[_vtime]*() In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the third set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. - scx_bpf_dispatch[_vtime]_from_dsq*() were already wrapped in __COMPAT macros as they were introduced during v6.12 cycle. Wrap new API in __COMPAT macros too and trigger build errors on both __COMPAT prefixed and naked usages of the old names. The compat features will be dropped after v6.15. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()Tejun Heo
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the second rename. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Comment and documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()Tejun Heo
In sched_ext API, a repeatedly reported pain point is the overuse of the verb "dispatch" and confusion around "consume": - ops.dispatch() - scx_bpf_dispatch[_vtime]() - scx_bpf_consume() - scx_bpf_dispatch[_vtime]_from_dsq*() This overloading of the term is historical. Originally, there were only built-in DSQs and moving a task into a DSQ always dispatched it for execution. Using the verb "dispatch" for the kfuncs to move tasks into these DSQs made sense. Later, user DSQs were added and scx_bpf_dispatch[_vtime]() updated to be able to insert tasks into any DSQ. The only allowed DSQ to DSQ transfer was from a non-local DSQ to a local DSQ and this operation was named "consume". This was already confusing as a task could be dispatched to a user DSQ from ops.enqueue() and then the DSQ would have to be consumed in ops.dispatch(). Later addition of scx_bpf_dispatch_from_dsq*() made the confusion even worse as "dispatch" in this context meant moving a task to an arbitrary DSQ from a user DSQ. Clean up the API with the following renames: 1. scx_bpf_dispatch[_vtime]() -> scx_bpf_dsq_insert[_vtime]() 2. scx_bpf_consume() -> scx_bpf_dsq_move_to_local() 3. scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() This patch performs the first set of renames. Compatibility is maintained by: - The previous kfunc names are still provided by the kernel so that old binaries can run. Kernel generates a warning when the old names are used. - compat.bpf.h provides wrappers for the new names which automatically fall back to the old names when running on older kernels. They also trigger build error if old names are used for new builds. The compat features will be dropped after v6.15. v2: Documentation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Johannes Bechberger <me@mostlynerdless.de> Acked-by: Giovanni Gherdovich <ggherdovich@suse.com> Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Ming Yang <yougmark94@gmail.com>
2024-11-11nvme-multipath: don't bother clearing max_hw_zone_append_sectorsChristoph Hellwig
The limits stacking now properly zeroes it if at least one of the underlying limits clears it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11block: pre-calculate max_zone_append_sectorsChristoph Hellwig
max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11Merge branch 'refactor-lock-management'Alexei Starovoitov
Kumar Kartikeya Dwivedi says: ==================== Refactor lock management This set refactors lock management in the verifier in preparation for spin locks that can be acquired multiple times. In addition to this, unnecessary code special case reference leak logic for callbacks is also dropped, that is no longer necessary. See patches for details. Changelog: ---------- v5 -> v6 v5: https://lore.kernel.org/bpf/20241109225243.2306756-1-memxor@gmail.com * Move active_locks mutation to {acquire,release}_lock_state (Alexei) v4 -> v5 v4: https://lore.kernel.org/bpf/20241109074347.1434011-1-memxor@gmail.com * Make active_locks part of bpf_func_state (Alexei) * Remove unneeded in_callback_fn logic for references v3 -> v4 v3: https://lore.kernel.org/bpf/20241104151716.2079893-1-memxor@gmail.com * Address comments from Alexei * Drop struct bpf_active_lock definition * Name enum type, expand definition to multiple lines * s/REF_TYPE_BPF_LOCK/REF_TYPE_LOCK/g * Change active_lock type to int * Fix type of 'type' in acquire_lock_state * Filter by taking type explicitly in find_lock_state * WARN for default case in refsafe switch statement v2 -> v3 v2: https://lore.kernel.org/bpf/20241103212252.547071-1-memxor@gmail.com * Rebase on bpf-next to resolve merge conflict v1 -> v2 v1: https://lore.kernel.org/bpf/20241103205856.345580-1-memxor@gmail.com * Fix refsafe state comparison to check callback_ref and ptr separately. ==================== Link: https://lore.kernel.org/r/20241109231430.2475236-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Drop special callback reference handlingKumar Kartikeya Dwivedi
Logic to prevent callbacks from acquiring new references for the program (i.e. leaving acquired references), and releasing caller references (i.e. those acquired in parent frames) was introduced in commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks"). This was necessary because back then, the verifier simulated each callback once (that could potentially be executed N times, where N can be zero). This meant that callbacks that left lingering resources or cleared caller resources could do it more than once, operating on undefined state or leaking memory. With the fixes to callback verification in commit ab5cfac139ab ("bpf: verify callbacks as if they are called unknown number of times"), all of this extra logic is no longer necessary. Hence, drop it as part of this commit. Cc: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Refactor active lock managementKumar Kartikeya Dwivedi
When bpf_spin_lock was introduced originally, there was deliberation on whether to use an array of lock IDs, but since bpf_spin_lock is limited to holding a single lock at any given time, we've been using a single ID to identify the held lock. In preparation for introducing spin locks that can be taken multiple times, introduce support for acquiring multiple lock IDs. For this purpose, reuse the acquired_refs array and store both lock and pointer references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to disambiguate and find the relevant entry. The ptr field is used to track the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be matched with protected fields within the same "allocation", i.e. bpf_obj_new object or map value. The struct active_lock is changed to an int as the state is part of the acquired_refs array, and we only need active_lock as a cheap way of detecting lock presence. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11selftests/bpf: skip the timer_lockup test for single-CPU nodesViktor Malik
The timer_lockup test needs 2 CPUs to work, on single-CPU nodes it fails to set thread affinity to CPU 1 since it doesn't exist: # ./test_progs -t timer_lockup test_timer_lockup:PASS:timer_lockup__open_and_load 0 nsec test_timer_lockup:PASS:pthread_create thread1 0 nsec test_timer_lockup:PASS:pthread_create thread2 0 nsec timer_lockup_thread:PASS:cpu affinity 0 nsec timer_lockup_thread:FAIL:cpu affinity unexpected error: 22 (errno 0) test_timer_lockup:PASS: 0 nsec #406 timer_lockup:FAIL Skip the test if only 1 CPU is available. Signed-off-by: Viktor Malik <vmalik@redhat.com> Fixes: 50bd5a0c658d1 ("selftests/bpf: Add timer lockup selftest") Tested-by: Philo Lu <lulie@linux.alibaba.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241107115231.75200-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11Merge branch 'fix-lockdep-warning-for-htab-of-map'Alexei Starovoitov
Hou Tao says: ==================== The patch set fixes a lockdep warning for htab of map. The warning is found when running test_maps. The warning occurs when htab_put_fd_value() attempts to acquire map_idr_lock to free the map id of the inner map while already holding the bucket lock (raw_spinlock_t). The fix moves the invocation of free_htab_elem() after htab_unlock_bucket() and adds a test case to verify the solution. ==================== Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11selftests/bpf: Test the update operations for htab of mapsHou Tao
Add test cases to verify the following four update operations on htab of maps don't trigger lockdep warning: (1) add then delete (2) add, overwrite, then delete (3) add, then lookup_and_delete (4) add two elements, then lookup_and_delete_batch Test cases are added for pre-allocated and non-preallocated htab of maps respectively. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241106063542.357743-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11selftests/bpf: Move ENOTSUPP from bpf_util.hHou Tao
Moving the definition of ENOTSUPP into bpf_util.h to remove the duplicated definitions in multiple files. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241106063542.357743-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Call free_htab_elem() after htab_unlock_bucket()Hou Tao
For htab of maps, when the map is removed from the htab, it may hold the last reference of the map. bpf_map_fd_put_ptr() will invoke bpf_map_free_id() to free the id of the removed map element. However, bpf_map_fd_put_ptr() is invoked while holding a bucket lock (raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock (spinlock_t), triggering the following lockdep warning: ============================= [ BUG: Invalid wait context ] 6.11.0-rc4+ #49 Not tainted ----------------------------- test_maps/4881 is trying to lock: ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70 other info that might help us debug this: context-{5:5} 2 locks held by test_maps/4881: #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270 #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80 stack backtrace: CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ... Call Trace: <TASK> dump_stack_lvl+0x6e/0xb0 dump_stack+0x10/0x20 __lock_acquire+0x73e/0x36c0 lock_acquire+0x182/0x450 _raw_spin_lock_irqsave+0x43/0x70 bpf_map_free_id.part.0+0x21/0x70 bpf_map_put+0xcf/0x110 bpf_map_fd_put_ptr+0x9a/0xb0 free_htab_elem+0x69/0xe0 htab_map_update_elem+0x50f/0xa80 bpf_fd_htab_map_update_elem+0x131/0x270 htab_map_update_elem+0x50f/0xa80 bpf_fd_htab_map_update_elem+0x131/0x270 bpf_map_update_value+0x266/0x380 __sys_bpf+0x21bb/0x36b0 __x64_sys_bpf+0x45/0x60 x64_sys_call+0x1b2a/0x20d0 do_syscall_64+0x5d/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e One way to fix the lockdep warning is using raw_spinlock_t for map_idr_lock as well. However, bpf_map_alloc_id() invokes idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a similar lockdep warning because the slab's lock (s->cpu_slab->lock) is still a spinlock. Instead of changing map_idr_lock's type, fix the issue by invoking htab_put_fd_value() after htab_unlock_bucket(). However, only deferring the invocation of htab_put_fd_value() is not enough, because the old map pointers in htab of maps can not be saved during batched deletion. Therefore, also defer the invocation of free_htab_elem(), so these to-be-freed elements could be linked together similar to lru map. There are four callers for ->map_fd_put_ptr: (1) alloc_htab_elem() (through htab_put_fd_value()) It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of htab_put_fd_value() can not simply move after htab_unlock_bucket(), because the old element has already been stashed in htab->extra_elems. It may be reused immediately after htab_unlock_bucket() and the invocation of htab_put_fd_value() after htab_unlock_bucket() may release the newly-added element incorrectly. Therefore, saving the map pointer of the old element for htab of maps before unlocking the bucket and releasing the map_ptr after unlock. Beside the map pointer in the old element, should do the same thing for the special fields in the old element as well. (2) free_htab_elem() (through htab_put_fd_value()) Its caller includes __htab_map_lookup_and_delete_elem(), htab_map_delete_elem() and __htab_map_lookup_and_delete_batch(). For htab_map_delete_elem(), simply invoke free_htab_elem() after htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just like lru map, linking the to-be-freed element into node_to_free list and invoking free_htab_elem() for these element after unlock. It is safe to reuse batch_flink as the link for node_to_free, because these elements have been removed from the hash llist. Because htab of maps doesn't support lookup_and_delete operation, __htab_map_lookup_and_delete_elem() doesn't have the problem, so kept it as is. (3) fd_htab_map_free() It invokes ->map_fd_put_ptr without raw_spinlock_t. (4) bpf_fd_htab_map_update_elem() It invokes ->map_fd_put_ptr without raw_spinlock_t. After moving free_htab_elem() outside htab bucket lock scope, using pcpu_freelist_push() instead of __pcpu_freelist_push() to disable the irq before freeing elements, and protecting the invocations of bpf_mem_cache_free() with migrate_{disable|enable} pair. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11Merge branch 'bpf-add-uprobe-session-support'Andrii Nakryiko
Jiri Olsa says: ==================== bpf: Add uprobe session support hi, this patchset is adding support for session uprobe attachment and using it through bpf link for bpf programs. The session means that the uprobe consumer is executed on entry and return of probed function with additional control: - entry callback can control execution of the return callback - entry and return callbacks can share data/cookie Uprobe changes (on top of perf/core [1] are posted in here [2]. This patchset is based on bpf-next/master and will be merged once we pull [2] in bpf-next/master. v9 changes: - rebased on bpf-next/master with perf/core tag merged (thanks Peter!) thanks, jirka [1] git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core [2] https://lore.kernel.org/bpf/20241018202252.693462-1-jolsa@kernel.org/T/#ma43c549c4bf684ca1b17fa638aa5e7cbb46893e9 --- ==================== Link: https://lore.kernel.org/r/20241108134544.480660-1-jolsa@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11selftests/bpf: Add threads to consumer testJiri Olsa
With recent uprobe fix [1] the sync time after unregistering uprobe is much longer and prolongs the consumer test which creates and destroys hundreds of uprobes. This change adds 16 threads (which fits the test logic) and speeds up the test. Before the change: # perf stat --null ./test_progs -t uprobe_multi_test/consumers #421/9 uprobe_multi_test/consumers:OK #421 uprobe_multi_test:OK Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED Performance counter stats for './test_progs -t uprobe_multi_test/consumers': 28.818778973 seconds time elapsed 0.745518000 seconds user 0.919186000 seconds sys After the change: # perf stat --null ./test_progs -t uprobe_multi_test/consumers 2>&1 #421/9 uprobe_multi_test/consumers:OK #421 uprobe_multi_test:OK Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED Performance counter stats for './test_progs -t uprobe_multi_test/consumers': 3.504790814 seconds time elapsed 0.012141000 seconds user 0.751760000 seconds sys [1] commit 87195a1ee332 ("uprobes: switch to RCU Tasks Trace flavor for better performance") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-14-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe sessions to consumer testJiri Olsa
Adding uprobe session consumers to the consumer test, so we get the session into the test mix. In addition scaling down the test to have just 1 uprobe and 1 uretprobe, otherwise the test time grows and is unsuitable for CI even with threads. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-13-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe session single consumer testJiri Olsa
Testing that the session ret_handler bypass works on single uprobe with multiple consumers, each with different session ignore return value. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-12-jolsa@kernel.org
2024-11-11selftests/bpf: Add kprobe session verifier test for return valueJiri Olsa
Making sure kprobe.session program can return only [0,1] values. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-11-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe session verifier test for return valueJiri Olsa
Making sure uprobe.session program can return only [0,1] values. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-10-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe session recursive testJiri Olsa
Adding uprobe session test that verifies the cookie value is stored properly when single uprobe-ed function is executed recursively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-9-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe session cookie testJiri Olsa
Adding uprobe session test that verifies the cookie value get properly propagated from entry to return program. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-8-jolsa@kernel.org
2024-11-11selftests/bpf: Add uprobe session testJiri Olsa
Adding uprobe session test and testing that the entry program return value controls execution of the return probe program. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-7-jolsa@kernel.org
2024-11-11libbpf: Add support for uprobe multi session attachJiri Olsa
Adding support to attach program in uprobe session mode with bpf_program__attach_uprobe_multi function. Adding session bool to bpf_uprobe_multi_opts struct that allows to load and attach the bpf program via uprobe session. the attachment to create uprobe multi session. Also adding new program loader section that allows: SEC("uprobe.session/bpf_fentry_test*") and loads/attaches uprobe program as uprobe session. Adding sleepable hook (uprobe.session.s) as well. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-6-jolsa@kernel.org
2024-11-11bpf: Add support for uprobe multi session contextJiri Olsa
Placing bpf_session_run_ctx layer in between bpf_run_ctx and bpf_uprobe_multi_run_ctx, so the session data can be retrieved from uprobe_multi link. Plus granting session kfuncs access to uprobe session programs. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-5-jolsa@kernel.org
2024-11-11bpf: Add support for uprobe multi session attachJiri Olsa
Adding support to attach BPF program for entry and return probe of the same function. This is common use case which at the moment requires to create two uprobe multi links. Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs kernel to attach single link program to both entry and exit probe. It's possible to control execution of the BPF program on return probe simply by returning zero or non zero from the entry BPF program execution to execute or not the BPF program on return probe respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
2024-11-11bpf: Force uprobe bpf program to always return 0Jiri Olsa
As suggested by Andrii make uprobe multi bpf programs to always return 0, so they can't force uprobe removal. Keeping the int return type for uprobe_prog_run, because it will be used in following session changes. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-3-jolsa@kernel.org